Mustafa Erbay
Career · 12 min read

Major Incident Management: Incident Commander and Runbook Practices

In big outages the largest risk isn't technical, it's coordination. How I drive MTTR down with the IC role, a steady comms cadence, and a practical runbook…

When a major incident hits (SEV-1/SEV-0), everyone wants to focus on the “technical fix”. But in big outages the thing you lose fastest is almost always coordination:

  • Two teams end up doing the same work in parallel
  • Everyone rallies around the wrong hypothesis
  • Customer comms slip
  • Risky changes get pushed into prod “real quick”

In this post I’ll lay out the approach that has worked for me on the ground — the Incident Commander (IC) role, the comms cadence, and runbook practices — to bring MTTR down in outages.

What does an Incident Commander actually do?

The IC’s job is not “be the best troubleshooter in the room”. The IC’s job is to:

  • Maintain a single picture of the situation (single source of truth)
  • Make priorities and decisions explicit
  • Run communications (internal and external)
  • Gate risky changes

The IC can be technical, but they’re far more effective once they hand the “debugger” role off to someone else.

Role split: IC + Tech Lead + Comms

Three roles cover most of what a major incident needs:

  • IC: Coordination, decisions, timeline
  • Tech Lead: Technical direction, hypotheses, fix plan
  • Comms Lead: Internal/external comms, status page, customer messaging

On a small team one person can carry two roles — but the responsibilities of all three should stay distinct.

A 10-minute cadence: the “status update” habit

Time perception goes sideways during an incident. The IC’s strongest tool is cadence.

What I push for is a short update every 10 minutes, covering:

  • Current symptom (customer impact)
  • Strongest hypothesis (and why)
  • Actions taken (last 10 min)
  • Plan for the next 10 min
  • Any risky changes pending?

This rhythm keeps the team “in the same room”.
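To make the cadence concrete, here is a minimal sketch of the update as a structured record. The names `StatusUpdate`, `render`, and `is_overdue` are hypothetical, chosen for illustration; the fields simply mirror the five bullets above, and the output layout is an assumption rather than a prescribed format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CADENCE = timedelta(minutes=10)  # the 10-minute rhythm described above


@dataclass
class StatusUpdate:
    """One cadence update (hypothetical structure, one field per bullet)."""
    at: datetime
    symptom: str               # current customer impact
    hypothesis: str            # strongest hypothesis, and why
    actions_taken: list[str]   # what was done in the last 10 minutes
    next_plan: list[str]       # what is planned for the next 10 minutes
    risky_pending: list[str]   # risky changes waiting on a decision

    def render(self) -> str:
        def bullets(items: list[str]) -> str:
            # An empty section is stated explicitly instead of being left blank.
            return "\n".join(f"  - {item}" for item in items) or "  - none"

        next_at = self.at + CADENCE
        return "\n".join([
            f"[{self.at:%H:%M}] STATUS UPDATE (next at {next_at:%H:%M})",
            f"Symptom: {self.symptom}",
            f"Hypothesis: {self.hypothesis}",
            "Actions taken:",
            bullets(self.actions_taken),
            "Next 10 min:",
            bullets(self.next_plan),
            "Risky changes pending:",
            bullets(self.risky_pending),
        ])


def is_overdue(last_update: datetime, now: datetime) -> bool:
    """True once more than one cadence interval has passed without an update."""
    return now - last_update > CADENCE
```

Forcing every update through the same five fields makes a missing hypothesis or an unmentioned risky change visible immediately, and `is_overdue` gives the IC a simple nudge when the rhythm slips.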

Runbook: not a “command list”, but a “decision tree”

A bad runbook is 200 lines of commands nobody reads. A good runbook is a decision tree:

  • Symptom → likely causes
  • Which metric/log do I confirm with?
  • “Safe” actions (reversible)
  • “Risky” actions (require IC approval)

The runbook’s goal isn’t to make you think like the strongest engineer on the team — it’s to get the average on-call engineer onto the right path.
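One way to keep a runbook in decision-tree shape is to store it as data rather than as a wall of commands. The sketch below is an assumption about how that could look: `RunbookEntry`, `Cause`, and `Action` are hypothetical names, and the sample entry describes an imaginary API service, not a real system.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    reversible: bool  # reversible means "safe"; anything else needs IC approval


@dataclass
class Cause:
    name: str
    confirm_with: str  # the metric or log that confirms this cause
    actions: list[Action] = field(default_factory=list)


@dataclass
class RunbookEntry:
    symptom: str
    likely_causes: list[Cause]

    def _cause(self, name: str) -> Cause:
        for cause in self.likely_causes:
            if cause.name == name:
                return cause
        raise KeyError(f"no cause named {name!r} for symptom {self.symptom!r}")

    def safe_actions(self, cause_name: str) -> list[str]:
        return [a.description for a in self._cause(cause_name).actions if a.reversible]

    def risky_actions(self, cause_name: str) -> list[str]:
        return [a.description for a in self._cause(cause_name).actions if not a.reversible]


# Hypothetical entry for an imaginary API service (illustration only).
entry = RunbookEntry(
    symptom="API 5xx rate above normal",
    likely_causes=[
        Cause(
            name="database connection pool exhausted",
            confirm_with="pool-in-use metric pinned at its maximum",
            actions=[
                Action("temporarily raise the pool size via config flag", reversible=True),
                Action("fail over to the replica database", reversible=False),
            ],
        ),
    ],
)
```

With this shape, the on-call engineer goes from symptom to cause to confirmation signal, and the safe/risky split is a field in the data instead of tribal knowledge.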

Change control: how do you handle “deploy” during an incident?

Sometimes you have to make changes during an incident. Without a control framework, the risk balloons.

The simple rule set I use in the field:

  • Every deploy must be tied to a “hypothesis” (why would this fix it?)
  • No more than one big change at a time
  • No deploy unless rollback is ready
  • No “permanent” change in prod without IC approval
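The four rules above are simple enough to encode as a pre-deploy check. This is a minimal sketch, assuming a hypothetical `ChangeRequest` record and `gate` function; how you count "changes in flight" and how you track IC approval will depend on your own tooling.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    summary: str
    hypothesis: str        # why this change should fix the incident
    rollback_ready: bool   # a rollback path is prepared
    permanent: bool        # the change is meant to stay after the incident
    ic_approved: bool = False


def gate(change: ChangeRequest, changes_in_flight: int) -> list[str]:
    """Return the reasons to block the change; an empty list means go."""
    reasons = []
    if not change.hypothesis.strip():
        reasons.append("no hypothesis attached")
    if changes_in_flight >= 1:
        reasons.append("another big change is already in flight")
    if not change.rollback_ready:
        reasons.append("rollback is not ready")
    if change.permanent and not change.ic_approved:
        reasons.append("permanent change lacks IC approval")
    return reasons


# Hypothetical example: a config change with no rollback prepared.
blocked = gate(
    ChangeRequest(
        summary="raise connection pool size",
        hypothesis="pool exhaustion is causing the 5xx spike",
        rollback_ready=False,
        permanent=False,
    ),
    changes_in_flight=0,
)
```

Returning the list of reasons, rather than a bare yes/no, gives the IC something concrete to read out when a change is held back.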

Communication: write the status page like a PR

A status update is more than saying “down”. A good status message covers:

  • The precise impact (which users, which region)
  • When it started (estimated)
  • Any workaround
  • The time of the next update
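As an illustration, the four elements of a status message can be captured in a small template so that none of them gets forgotten under pressure. `StatusPageMessage` is a hypothetical name, and labelling the times as UTC is an assumption made for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StatusPageMessage:
    affected: str                     # which users, which region
    started_at: datetime              # estimated start of the impact
    next_update_at: datetime          # when readers can expect the next update
    workaround: Optional[str] = None  # None when no workaround is known

    def render(self) -> str:
        # Times are labelled UTC here; that label is an assumption of this sketch.
        return "\n".join([
            f"Impact: {self.affected}",
            f"Started (estimated): {self.started_at:%Y-%m-%d %H:%M} UTC",
            f"Workaround: {self.workaround or 'none at this time'}",
            f"Next update: {self.next_update_at:%H:%M} UTC",
        ])
```

Making `next_update_at` a required field is deliberate: a message cannot be built without committing to when the next one arrives.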

The same discipline applies internally: don’t rely on Slack messages; keep a timeline.
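A timeline does not need special tooling. The sketch below is one possible shape, with hypothetical names (`Timeline`, `TimelineEntry`, `record`): an append-only list of timestamped notes that can be rendered in chronological order, including events that were written down after the fact.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    author: str
    note: str


@dataclass
class Timeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, author: str, note: str, at: Optional[datetime] = None) -> TimelineEntry:
        # Defaults to "now" in UTC; pass `at` (timezone-aware) to backfill an earlier event.
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, note)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        # Sort by event time so backfilled entries land in the right place.
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(f"{e.at:%H:%M:%S} {e.author}: {e.note}" for e in ordered)
```

The rendered output can be pasted straight into the postmortem, which is usually where a missing timeline hurts most.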

Postmortem: not finding a culprit, fixing the system

A postmortem after a major incident should cover:

  • Root cause (technical + process)
  • Detection gap (why did we notice late?)
  • Mitigation gap (why did we resolve slowly?)
  • Action items (owner + date)

The most valuable output here: an updated runbook and alarm set.
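The “owner + date” requirement is easy to check mechanically once action items are stored as data. A minimal sketch, assuming a hypothetical `ActionItem` record and `overdue` helper; the `gap` labels follow the list above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    gap: str      # "root cause", "detection", or "mitigation"
    owner: str    # a named person, not a team alias
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


# Hypothetical items from an imaginary postmortem.
items = [
    ActionItem("add alert on pool saturation", "detection", "ayse", date(2024, 1, 15)),
    ActionItem("update API runbook decision tree", "mitigation", "mehmet", date(2024, 2, 1)),
]
late = overdue(items, today=date(2024, 1, 20))
```

Because `owner` and `due` have no defaults, an action item cannot be created without them, which is the point of the rule.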

Closing: coordination is a technical competency

In a major incident, the “best engineer” doesn’t win — the best coordination wins.

A small set of first steps:

  • Standardize the IC role and the 10-minute cadence
  • For each critical service, produce a one-page decision-tree runbook
  • Set an “IC approval” rule for risky changes

Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have worked across systems architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that holds up in the field.
