In most teams the word “runbook” either does not exist at all, or it means “a long document nobody reads.” Yet the goal of a good runbook is not to store knowledge, but to produce decisions during an incident.
What has worked best for me in the field: strip the runbook down to a “Minimum Viable” level, but always include decision thresholds. Without them, the runbook becomes a chronicle that helps no one during an incident.
1) What is a Minimum Viable Runbook (MVR)?
An MVR is good enough if a responder can find the answers to these questions in it within 3 minutes:
- What does this alarm mean? (impact)
- Which evidence do we collect in the first 5 minutes? (triage)
- Which action do we take at which threshold? (threshold/decision)
- After the response, how do we verify that it worked? (verification)
- If things go wrong, how do we roll back? (rollback)
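The five questions above can double as a completeness check for a runbook draft. The following is a minimal sketch, assuming the runbook is kept as a plain dictionary; the section keys and the sample draft are hypothetical and only mirror the labels in parentheses above.

```python
# Hypothetical section keys, mirroring the five MVR questions.
REQUIRED_SECTIONS = ("impact", "triage", "thresholds", "verification", "rollback")


def missing_sections(runbook: dict) -> list:
    """Return the required sections that are absent or left empty."""
    return [name for name in REQUIRED_SECTIONS if not runbook.get(name)]


# Illustrative draft: thresholds are empty and rollback was never written.
draft = {
    "impact": "Checkout requests fail for some users",
    "triage": ["open the service dashboard", "run the error log query"],
    "thresholds": [],
    "verification": "error rate back to normal for 15 min",
}

print(missing_sections(draft))
```

A check like this could run in review or CI, so that a runbook without thresholds or a rollback path is flagged before an incident rather than during one.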
2) Template: a runbook on a single page
You can copy the template below as is:
Title
- Service / component name
- Alarm name and severity (P1/P2)
- Owning team and escalation channel
Impact statement
- User impact (what breaks?)
- Blast radius (which regions/tenants?)
- SLO/SLI (which metric is violated?)
Triage (0–5 minutes)
- First dashboard links to check
- First log queries to check
- Evidence list (must be collected)
Decision thresholds (the most critical section)
Examples:
- If error rate > 5% and latency p95 > 2s → reduce traffic
- If DB connection wait > 1s → reduce retries and apply a concurrency limit
- If it started after a deploy → consider rollback
Mitigation ladder (low risk → high risk)
- Reduce traffic (rate limit / degrade)
- Reduce pressure with cache/queue
- Rollback / turn off feature flag
- Failover / region isolation
Verification
- Which metric must return to normal before we say “incident over”?
- How long do we observe? (e.g. 15 min)
Rollback
- One command / one PR / one toggle
- Verification after the rollback
Communication
- Status update cadence (e.g. 15 min)
- Who gets informed (internal/external)
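Decision thresholds are easiest to agree on when they are written as data rather than prose. The sketch below encodes the three example thresholds from the template and evaluates them against a metrics snapshot. The metric names (error_rate, latency_p95_s, db_conn_wait_s, started_after_deploy) and the sample values are assumptions for illustration; they do not come from any particular monitoring system.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Mapping


@dataclass(frozen=True)
class Rule:
    description: str
    condition: Callable[[Mapping[str, Any]], bool]
    action: str


# The three example thresholds from the template, expressed as rules.
RULES = [
    Rule(
        "error rate > 5% and latency p95 > 2s",
        lambda m: m["error_rate"] > 0.05 and m["latency_p95_s"] > 2.0,
        "reduce traffic",
    ),
    Rule(
        "DB connection wait > 1s",
        lambda m: m["db_conn_wait_s"] > 1.0,
        "reduce retries and apply a concurrency limit",
    ),
    Rule(
        "started after a deploy",
        lambda m: bool(m["started_after_deploy"]),
        "consider rollback",
    ),
]


def triggered_actions(metrics: Mapping[str, Any]) -> List[str]:
    """Return the actions whose thresholds are crossed, in rule order."""
    return [rule.action for rule in RULES if rule.condition(metrics)]


# Hypothetical snapshot taken during triage.
snapshot = {
    "error_rate": 0.08,
    "latency_p95_s": 2.4,
    "db_conn_wait_s": 0.3,
    "started_after_deploy": True,
}

print(triggered_actions(snapshot))
```

With this snapshot the first and third rules fire, so the result lists "reduce traffic" and "consider rollback". The output is a list of candidate actions for the responder, not an automation: the decision still belongs to a person.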
3) Clarifying decision points: this is what calms teams down
The “leadership” side of a runbook starts here: reducing uncertainty.
Practical decision questions:
- “Are we losing customers right now, or is this only a signal?”
- “If we revert this change, does that create a bigger risk?”
- “If we throttle traffic, which users are impacted, and which are protected?”
The most common mistake:
- Trying everything at the same time.
The best reflex:
- A loop of one hypothesis → one intervention → one verification.
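The loop can be sketched as a walk up the mitigation ladder that applies exactly one step, verifies, and only then moves on. This is a simulation under stated assumptions: the state dictionary, the step functions, and the 5% health check are all hypothetical stand-ins for real interventions and real metrics.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def run_ladder(steps: List[Step], is_healthy: Callable[[], bool]) -> Optional[str]:
    """Apply one step at a time, verifying after each.

    Returns the name of the step after which the system was healthy,
    or None if the whole ladder was exhausted.
    """
    for name, apply_step in steps:
        apply_step()          # one intervention
        if is_healthy():      # one verification
            return name
    return None


# Simulated system state; real code would read a metric instead.
state = {"error_rate": 0.08}


def reduce_traffic() -> None:
    state["error_rate"] = 0.06


def turn_off_feature_flag() -> None:
    state["error_rate"] = 0.01


def failover() -> None:
    state["error_rate"] = 0.0


ladder = [
    ("reduce traffic", reduce_traffic),
    ("turn off feature flag", turn_off_feature_flag),
    ("failover", failover),
]

resolved_by = run_ladder(ladder, lambda: state["error_rate"] < 0.05)
print(resolved_by)
```

In this simulation the loop stops at "turn off feature flag", and the riskier failover step is never applied. In a real incident the verification would be an observation window (for example the 15 minutes from the template) rather than an instant check.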
4) Keep the runbook alive: drill and update cadence
Two simple rules keep the MVR from dying:
- After every P1/P2, spend 10 minutes applying a “patch” to the runbook.
- Once a month, run a short 30-minute drill (practicing triage alone is enough).
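Both rules are easy to forget, so a small reminder check can help. The sketch below assumes three dates are tracked somewhere (last drill, last P1/P2, last runbook update); the function names and the 31-day approximation of “once a month” are assumptions for illustration.

```python
from datetime import date, timedelta

# Assumption: "once a month" is approximated as 31 days.
DRILL_INTERVAL = timedelta(days=31)


def drill_overdue(last_drill: date, today: date) -> bool:
    """True if more than one drill interval has passed since the last drill."""
    return today - last_drill > DRILL_INTERVAL


def patch_pending(last_incident: date, last_update: date) -> bool:
    """True if the runbook has not been updated since the latest P1/P2."""
    return last_update < last_incident


# Hypothetical dates for illustration.
print(drill_overdue(date(2024, 1, 1), date(2024, 2, 15)))
print(patch_pending(date(2024, 3, 10), date(2024, 3, 1)))
```

Both calls return True here: 45 days have passed since the last drill, and the runbook was last touched before the most recent incident.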
5) Closing: the runbook produces operational calm
Organizations are measured not on their best days, but on their worst. The point of the runbook during an incident is not to create heroes; it is to make sure the team speaks the same language, decides on the same thresholds, and recovers faster with less panic.