Major Incident Management: Incident Commander and Runbook Practices

When a major incident hits (SEV-1/SEV-0), everyone wants to focus on the “technical fix”. But in big outages the thing you lose fastest is almost always coordination:

Two teams end up doing the same work in parallel
Everyone rallies around the wrong hypothesis
Customer comms slip
Risky changes get pushed into prod “real quick”

In this post I’ll lay out what has worked for me on the ground to bring mean time to recovery (MTTR) down in outages: the Incident Commander (IC) role, the comms cadence, and runbook practices.

What does an Incident Commander actually do?

The IC’s job is not “be the best troubleshooter in the room”. The IC’s job is to:

Maintain a single picture of the situation (single source of truth)
Make priorities and decisions explicit
Run communications (internal and external)
Gate risky changes

The IC can be technical, but they’re far more effective once they hand the “debugger” role off to someone else.

Role split: IC + Tech Lead + Comms

Three roles cover most of what a major incident needs:

IC: Coordination, decisions, timeline
Tech Lead: Technical direction, hypotheses, fix plan
Comms Lead: Internal/external comms, status page, customer messaging

On a small team one person can carry two roles — but the responsibilities of all three should stay distinct.

A 10-minute cadence: the “status update” habit

Time perception goes sideways during an incident. The IC’s strongest tool is cadence.

What I push for is a short update every 10 minutes, covering:

Current symptom (customer impact)
Strongest hypothesis (and why)
Actions taken (last 10 min)
Plan for the next 10 min
Any risky changes pending?

This rhythm keeps the team “in the same room”.
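To make the cadence concrete, here is a minimal sketch of the update as a fixed template. The `StatusUpdate` class, its field names, and the output layout are my own assumptions for illustration; the point is only that every update answers the same five questions and announces when the next one is due.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class StatusUpdate:
    """One cadence update; the fields mirror the five items listed above."""
    symptom: str                      # current customer impact
    hypothesis: str                   # strongest hypothesis, including why
    actions_taken: List[str]          # what was done in the last interval
    next_plan: List[str]              # what is planned for the next interval
    risky_changes_pending: List[str] = field(default_factory=list)

    def render(self, now: datetime, interval_minutes: int = 10) -> str:
        """Format the update and state when the next one is due."""
        next_due = now + timedelta(minutes=interval_minutes)
        done = "; ".join(self.actions_taken) or "nothing yet"
        plan = "; ".join(self.next_plan) or "none"
        risky = "; ".join(self.risky_changes_pending) or "none"
        return "\n".join([
            f"[{now:%H:%M} UTC] Status update",
            f"Symptom: {self.symptom}",
            f"Hypothesis: {self.hypothesis}",
            f"Done (last {interval_minutes} min): {done}",
            f"Plan (next {interval_minutes} min): {plan}",
            f"Risky changes pending: {risky}",
            f"Next update: {next_due:%H:%M} UTC",
        ])


if __name__ == "__main__":
    # Example values are invented for illustration.
    update = StatusUpdate(
        symptom="checkout fails for a share of EU users",
        hypothesis="database pool exhausted (connection timeouts in app logs)",
        actions_taken=["scaled the app tier up"],
        next_plan=["confirm pool wait time on the dashboard"],
    )
    print(update.render(datetime.now(timezone.utc)))
```

A fixed template like this is mostly a forcing function: an empty "Hypothesis" or "Plan" line is visible to everyone, which is exactly the signal the IC needs.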

Runbook: not a “command list”, but a “decision tree”

A bad runbook is 200 lines of commands nobody reads. A good runbook is a decision tree:

Symptom → likely causes
Which metric/log do I confirm with?
“Safe” actions (reversible)
“Risky” actions (require IC approval)

The runbook’s goal isn’t to make you think like the strongest engineer on the team — it’s to get the average on-call engineer onto the right path.
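One way to keep a runbook in decision-tree shape is to store it as data rather than prose. The sketch below is an assumption-laden illustration: the class names, the `actions_for` helper, and the example entry (service, metrics, actions) are all hypothetical, and treating "reversible" as the test for "safe" is the simplification taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    description: str
    reversible: bool  # reversible actions are "safe"; the rest need the IC


@dataclass
class Cause:
    name: str
    confirm_with: str  # the metric or log signal that confirms this cause
    actions: List[Action] = field(default_factory=list)


@dataclass
class RunbookEntry:
    symptom: str
    likely_causes: List[Cause]

    def actions_for(self, cause_name: str, ic_approved: bool = False) -> List[str]:
        """Return the actions allowed for a confirmed cause.

        Risky (non-reversible) actions are only included once the IC approved.
        """
        for cause in self.likely_causes:
            if cause.name == cause_name:
                return [a.description for a in cause.actions
                        if a.reversible or ic_approved]
        raise KeyError(f"cause not in runbook: {cause_name}")


# Hypothetical entry; the service, signals, and actions are made up.
CHECKOUT_5XX = RunbookEntry(
    symptom="checkout API returns elevated 5xx",
    likely_causes=[
        Cause(
            name="db_pool_exhausted",
            confirm_with="database pool wait time is climbing",
            actions=[
                Action("scale the app tier up by two instances", reversible=True),
                Action("fail over to the replica database", reversible=False),
            ],
        ),
        Cause(
            name="bad_deploy",
            confirm_with="error rate rose right after the last deploy",
            actions=[Action("roll back to the previous release", reversible=True)],
        ),
    ],
)
```

Calling `CHECKOUT_5XX.actions_for("db_pool_exhausted")` returns only the scale-up step; the failover shows up only when `ic_approved=True` is passed.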

Change control: how do you handle “deploy” during an incident?

Sometimes you have to make changes during an incident. Without a control framework, the risk balloons.

The simple rule set I use in the field:

Every deploy must be tied to a “hypothesis” (why would this fix it?)
No more than one big change at a time
No deploy unless rollback is ready
No “permanent” change in prod without IC approval
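The four rules above are simple enough to write down as a check. This is a sketch, not a real deploy gate: `ChangeRequest`, `gate`, and the field names are hypothetical, and reading "permanent" as "a lasting change rather than a temporary mitigation" is my own interpretation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChangeRequest:
    summary: str
    hypothesis: Optional[str]   # why this change should fix the incident
    rollback_ready: bool        # a way back exists and is ready to run
    permanent: bool             # lasting change rather than a temporary mitigation
    ic_approved: bool = False


def gate(change: ChangeRequest, big_changes_in_flight: int) -> List[str]:
    """Return the rule violations; an empty list means the change may proceed."""
    violations = []
    if not (change.hypothesis and change.hypothesis.strip()):
        violations.append("no hypothesis: state why this change would fix it")
    if big_changes_in_flight > 0:
        violations.append("another big change is already in flight")
    if not change.rollback_ready:
        violations.append("rollback is not ready")
    if change.permanent and not change.ic_approved:
        violations.append("permanent change requires IC approval")
    return violations


if __name__ == "__main__":
    # Example values are invented for illustration.
    request = ChangeRequest(
        summary="raise database pool size",
        hypothesis="pool exhaustion explains the 5xx spike",
        rollback_ready=True,
        permanent=False,
    )
    print(gate(request, big_changes_in_flight=0) or "OK to deploy")
```

Returning the full list of violations, instead of stopping at the first one, lets the IC see everything that is missing in a single pass.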

Communication: write the status page like a PR

A status update is more than saying “down”. A good status message:

States the impact precisely (which users, which region)
Gives the start time (estimated if necessary)
Mentions any workaround
Announces the time of the next update

The same discipline applies internally: don’t rely on scrolling back through Slack messages; keep a timeline.
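A timeline does not need tooling to start with. As a minimal sketch, assuming hypothetical `Timeline` and `TimelineEntry` classes and an entry format of my own choosing, an append-only list of timestamped entries is enough:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    author: str
    kind: str   # for example "observation", "decision", or "action"
    text: str


class Timeline:
    """Append-only incident timeline; entries are never edited or removed."""

    def __init__(self) -> None:
        self._entries: List[TimelineEntry] = []

    def add(self, author: str, kind: str, text: str,
            at: Optional[datetime] = None) -> TimelineEntry:
        """Record an entry, stamped with the current UTC time unless `at` is given."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, kind, text)
        self._entries.append(entry)
        return entry

    def render(self) -> str:
        """Return entries in chronological order, one per line."""
        ordered = sorted(self._entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M}Z [{e.kind}] {e.author}: {e.text}" for e in ordered
        )
```

Sorting at render time means late entries ("I actually saw this at 12:05") can still be added with an explicit `at` value without rewriting history.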

Postmortem: not finding a culprit, fixing the system

A postmortem after a major incident should establish:

Root cause (technical + process)
Detection gap (why did we notice late?)
Mitigation gap (why did we resolve slowly?)
Action items (owner + date)

The most valuable output here: an updated runbook and alarm set.
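The "owner + date" requirement on action items is easy to check mechanically before a postmortem is closed. A small sketch, with `ActionItem` and `incomplete_items` as hypothetical names:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete_items(items: List[ActionItem]) -> List[str]:
    """Return the titles of action items that lack an owner or a due date."""
    return [item.title for item in items if not item.owner or item.due is None]


if __name__ == "__main__":
    # Example values are invented for illustration.
    items = [
        ActionItem("update checkout runbook", owner="ana", due=date(2024, 2, 1)),
        ActionItem("add alarm on pool wait time", owner="ben"),
    ]
    print(incomplete_items(items))
```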

Closing: coordination is a technical competency

In a major incident, the “best engineer” doesn’t win — the best coordination wins.

Three small first steps:

Standardize the IC role and the 10-minute cadence
For each critical service, produce a one-page decision-tree runbook
Set an “IC approval” rule for risky changes
