When a major incident hits (SEV-1/SEV-0), everyone wants to focus on the “technical fix”. But in big outages the thing you lose fastest is almost always coordination:
- Two teams end up doing the same work in parallel
- Everyone rallies around the wrong hypothesis
- Customer comms slip
- Risky changes get pushed into prod “real quick”
In this post I’ll lay out the approach that has worked for me on the ground — the Incident Commander (IC) role, the comms cadence, and runbook practices — to bring MTTR down in outages.
What does an Incident Commander actually do?
The IC’s job is not “be the best troubleshooter in the room”. The IC’s job is to:
- Maintain a single picture of the situation (single source of truth)
- Make priorities and decisions explicit
- Run communications (internal and external)
- Gate risky changes
The IC can be technical, but they’re far more effective once they hand the “debugger” role off to someone else.
Role split: IC + Tech Lead + Comms
Three roles cover most of what a major incident needs:
- IC: Coordination, decisions, timeline
- Tech Lead: Technical direction, hypotheses, fix plan
- Comms Lead: Internal/external comms, status page, customer messaging
On a small team one person can carry two roles — but the responsibilities of all three should stay distinct.
A 10-minute cadence: the “status update” habit
Time perception goes sideways during an incident. The IC’s strongest tool is cadence.
What I push for is a short update every 10 minutes that covers:
- Current symptom (customer impact)
- Strongest hypothesis (and why)
- Actions taken (last 10 min)
- Plan for the next 10 min
- Any risky changes pending?
This rhythm keeps the team “in the same room”.
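One way to make the cadence stick is to give the update a fixed shape, so the IC fills in the same five fields every time. The sketch below is a minimal illustration, not a prescribed tool: the `StatusUpdate` class, its field names, and the `is_overdue` helper are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

CADENCE = timedelta(minutes=10)  # the 10-minute rhythm described above


@dataclass
class StatusUpdate:
    """Hypothetical container for one cadence update."""
    symptom: str               # current customer impact
    hypothesis: str            # strongest hypothesis right now
    evidence: str              # why we believe that hypothesis
    actions_taken: list[str]   # what happened in the last 10 minutes
    next_plan: list[str]       # what happens in the next 10 minutes
    risky_pending: list[str] = field(default_factory=list)
    posted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def render(self) -> str:
        """Format the update as a short text block for the incident channel."""
        risky = ", ".join(self.risky_pending) or "none"
        lines = [
            f"[{self.posted_at:%H:%M} UTC] STATUS UPDATE",
            f"Symptom: {self.symptom}",
            f"Hypothesis: {self.hypothesis} (because {self.evidence})",
            "Done (last 10 min): " + "; ".join(self.actions_taken),
            "Plan (next 10 min): " + "; ".join(self.next_plan),
            f"Risky changes pending: {risky}",
        ]
        return "\n".join(lines)


def is_overdue(last: StatusUpdate, now: datetime) -> bool:
    """True once more than one cadence interval has passed since the last update."""
    return now - last.posted_at > CADENCE
```

A reminder bot or a simple timer could call `is_overdue` and nudge the IC; how the nudge is delivered is left open here.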
Runbook: not a “command list”, but a “decision tree”
A bad runbook is 200 lines of commands nobody reads. A good runbook is a decision tree:
- Symptom → likely causes
- Which metric/log do I confirm with?
- “Safe” actions (reversible)
- “Risky” actions (require IC approval)
The runbook’s goal isn’t to make you think like the strongest engineer on the team — it’s to get the average on-call engineer onto the right path.
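The four elements above map naturally onto a small data structure. The following is a sketch under stated assumptions: the class names (`Action`, `Cause`, `RunbookNode`) are hypothetical, and the service, metric, and actions in the example are made up purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    reversible: bool  # reversible actions count as "safe"

    @property
    def needs_ic_approval(self) -> bool:
        # Assumption for this sketch: anything not reversible is "risky".
        return not self.reversible


@dataclass
class Cause:
    name: str
    confirm_with: str  # the metric or log that confirms this cause
    actions: list[Action] = field(default_factory=list)


@dataclass
class RunbookNode:
    symptom: str
    likely_causes: list[Cause]

    def render(self) -> str:
        """Print the tree as indented text that fits on one page."""
        out = [f"Symptom: {self.symptom}"]
        for cause in self.likely_causes:
            out.append(f"  Likely cause: {cause.name}")
            out.append(f"    Confirm with: {cause.confirm_with}")
            for action in cause.actions:
                tag = "RISKY, needs IC approval" if action.needs_ic_approval else "safe"
                out.append(f"    [{tag}] {action.description}")
        return "\n".join(out)


# Hypothetical example; the service, signals, and actions are illustrative only.
node = RunbookNode(
    symptom="checkout-api returns elevated 5xx",
    likely_causes=[
        Cause(
            name="database connection pool exhausted",
            confirm_with="pool wait-time metric and connection timeout logs",
            actions=[
                Action("restart one instance and watch the error rate", reversible=True),
                Action("raise max connections on the primary database", reversible=False),
            ],
        ),
    ],
)
print(node.render())
```

Keeping the safe/risky split as data rather than prose means the same structure can feed both the rendered page and an approval check.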
Change control: how do you handle “deploy” during an incident?
Sometimes you have to make changes during an incident. Without a control framework, the risk balloons.
The simple rule set I use in the field:
- Every deploy must be tied to a “hypothesis” (why would this fix it?)
- No more than one big change at a time
- No deploy unless rollback is ready
- No “permanent” change in prod without IC approval
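These four rules are simple enough to encode as a pre-deploy check. The sketch below assumes a hypothetical `ChangeRequest` record and `gate` function; neither refers to a real deployment tool. As a simplification, it treats every in-flight change as a "big" one.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical record for a change proposed during an incident."""
    summary: str
    hypothesis: str        # why would this fix it?
    rollback_ready: bool
    permanent: bool        # the change would outlive the incident
    ic_approved: bool = False


def gate(change: ChangeRequest, in_flight: list[ChangeRequest]) -> list[str]:
    """Return the reasons to block the change; an empty list means go."""
    reasons = []
    if not change.hypothesis.strip():
        reasons.append("no hypothesis attached")
    if in_flight:
        # Simplification: any change still in flight blocks the next one.
        reasons.append("another change is already in flight")
    if not change.rollback_ready:
        reasons.append("rollback is not ready")
    if change.permanent and not change.ic_approved:
        reasons.append("permanent change lacks IC approval")
    return reasons
```

Returning a list of reasons instead of a bare boolean gives the IC something concrete to say when a deploy is held back.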
Communication: write the status page like a PR
A status update is more than saying “down”. A good status message states:
- The impact, precisely (which users, which region)
- When it started (estimated)
- Whether there is a workaround
- When the next update will be posted
The same discipline applies internally: don’t rely on Slack messages alone; keep a timeline.
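A timeline does not need special tooling; an append-only list of timestamped entries is enough. This is a minimal sketch, and the `Entry` and `Timeline` classes and the `kind` labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Entry:
    at: datetime
    author: str
    kind: str   # assumed labels: "observation", "decision", "action", "comms"
    text: str


@dataclass
class Timeline:
    entries: list[Entry] = field(default_factory=list)

    def record(self, author: str, kind: str, text: str,
               at: Optional[datetime] = None) -> Entry:
        """Append an entry; the timestamp defaults to the current UTC time."""
        entry = Entry(at or datetime.now(timezone.utc), author, kind, text)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return entries in chronological order, even if recorded late."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M}Z [{e.kind}] {e.author}: {e.text}" for e in ordered
        )
```

Because entries carry their own timestamps, someone can backfill an event a few minutes late and the rendered timeline still reads in order, which is what the postmortem needs.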
Postmortem: not finding a culprit, fixing the system
After a major incident, the postmortem should produce:
- Root cause (technical + process)
- Detection gap (why did we notice late?)
- Mitigation gap (why did we resolve slowly?)
- Action items (owner + date)
The most valuable outputs here are an updated runbook and an updated set of alerts.
Closing: coordination is a technical competency
In a major incident, the “best engineer” doesn’t win — the best coordination wins.
Three small first steps:
- Standardize the IC role and the 10-minute cadence
- For each critical service, produce a one-page decision-tree runbook
- Set an “IC approval” rule for risky changes