In most organizations, on-call starts as “let someone hold the phone” and slowly turns into a cycle that produces burnout. A well-designed on-call system does more than respond to incidents: it also reduces the number of incidents and the cost of responding to them. The trick is not the rotation schedule; it is the escalation chain, alarm quality, and runbook discipline.
In this post I share the on-call design principles that work in the field, and a framework you can actually apply.
The goal of on-call: not “always be awake,” but “recover quickly”
You cannot improve on-call without measuring its three outputs:
- MTTA/MTTR: time to acknowledge and time to recover
- Alarm quality: actionability rate (the share of pages that actually required a response)
- Toil: repetitive manual tasks and night-time interventions
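As a rough illustration, the first two outputs can be computed from a simple page log. This is a minimal sketch: the `Page` record and its field names are hypothetical, and it assumes recovery time is measured from the moment the alert fired.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Page:
    # Hypothetical record of one page; field names are illustrative only.
    fired_at: datetime
    acked_at: datetime
    resolved_at: datetime
    actionable: bool  # True if a human actually had to do something


def mtta_minutes(pages):
    """Mean time to acknowledge, in minutes."""
    return mean([(p.acked_at - p.fired_at).total_seconds() / 60 for p in pages])


def mttr_minutes(pages):
    """Mean time to recover, measured from fire to resolution, in minutes."""
    return mean([(p.resolved_at - p.fired_at).total_seconds() / 60 for p in pages])


def actionability_rate(pages):
    """Fraction of pages that required a human response."""
    return sum(p.actionable for p in pages) / len(pages)


t0 = datetime(2024, 1, 1, 2, 0)
pages = [
    Page(t0, t0 + timedelta(minutes=3), t0 + timedelta(minutes=30), True),
    Page(t0, t0 + timedelta(minutes=7), t0 + timedelta(minutes=50), False),
]
print(mtta_minutes(pages), mttr_minutes(pages), actionability_rate(pages))
```

Toil is harder to reduce to one number; counting night-time pages and manual interventions per shift from the same log is one possible starting point.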
Rotation: fairness + sustainability
A practical starting template:
- Primary: starts the response
- Secondary: steps in when needed, backs up knowledge/experience
- Incident Commander (ops lead): manages decisions and communication on P1/P0 (this role is not necessarily part of the on-call rotation)
I keep two rules constant in rotation design:
- Back-to-back on-call shifts are kept to a minimum, because sleep debt accumulates
- The secondary role is not “just standby”; it is also learning and load sharing
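One way to sketch these two rules is a round-robin schedule where the secondary is offset by half the roster, so nobody carries a pager role two weeks in a row. This assumes weekly shifts and at least four engineers; the function names and the roster are hypothetical.

```python
def build_rotation(engineers, weeks):
    """Round-robin schedule with the secondary offset by half the roster."""
    n = len(engineers)
    if n < 4:
        # With fewer than 4 people this offset scheme cannot avoid
        # consecutive weeks in a pager role.
        raise ValueError("need at least 4 engineers to avoid back-to-back shifts")
    offset = n // 2
    return [
        {
            "week": w,
            "primary": engineers[w % n],
            "secondary": engineers[(w + offset) % n],
        }
        for w in range(weeks)
    ]


def has_back_to_back(schedule):
    """True if anyone holds a role (primary or secondary) in consecutive weeks."""
    for prev, cur in zip(schedule, schedule[1:]):
        on_prev = {prev["primary"], prev["secondary"]}
        on_cur = {cur["primary"], cur["secondary"]}
        if on_prev & on_cur:
            return True
    return False


schedule = build_rotation(["ana", "ben", "cem", "dee"], weeks=8)
for shift in schedule:
    print(shift)
print("back-to-back:", has_back_to_back(schedule))
```

A real rotation also has to absorb holidays, swaps, and time zones; the point of the check is only that the fairness rule is testable rather than left to memory.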
Escalation chain: durations and ownership must be clear
The single goal for escalation: “no alarm should be left in limbo.”
Example chain:
- Primary: ack within 5 min
- Secondary: +5 min
- IC / Team Lead: +10 min
- Wide call (war room): +15 min (P1 criterion)
These durations are tuned to the system’s criticality and team size. But whatever the duration, it must be written down and automated.
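Writing the chain down as data is a small step toward automating it. The sketch below assumes the offsets in the example are measured from the moment the alert fired, and that the war room step applies only to P1 incidents; the `Step` type and role names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    role: str
    page_after_min: int  # minutes since the alert fired, while still unacknowledged
    p1_only: bool = False


# Assumed reading of the example chain: offsets are cumulative from fire time.
CHAIN = [
    Step("primary", 0),
    Step("secondary", 5),
    Step("ic_or_team_lead", 10),
    Step("war_room", 15, p1_only=True),
]


def roles_to_page(minutes_unacked, is_p1):
    """Roles that should have been paged by now for an unacknowledged alert."""
    return [
        step.role
        for step in CHAIN
        if minutes_unacked >= step.page_after_min and (is_p1 or not step.p1_only)
    ]


print(roles_to_page(4, is_p1=False))
print(roles_to_page(5, is_p1=False))
print(roles_to_page(16, is_p1=True))
```

Most paging tools let you express the same policy in their own configuration; what matters is that the durations live in one reviewed place instead of in people's heads.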
Alarm quality: inseparable from the runbook
Minimum content standard for an actionable alarm:
- What broke? (SLO/SLI)
- What is the impact? (user, revenue, critical process)
- The first 3 check steps (runbook link)
- Rollback plan / feature flag info (if any)
If the alarm carries no runbook link, the on-call engineer turns into a “search engine.”
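The content standard above can be enforced with a simple lint over alarm definitions. This is a minimal sketch: the field names (`what_broke`, `impact`, `runbook_url`, `rollback`) and the example URL are assumptions, not the schema of any particular alerting tool.

```python
# Hypothetical field names for an alarm definition.
REQUIRED_FIELDS = ("what_broke", "impact", "runbook_url")
OPTIONAL_FIELDS = ("rollback",)  # rollback plan / feature flag info, if any


def missing_fields(alert):
    """Return the required fields that are absent or empty in an alarm definition."""
    return [field for field in REQUIRED_FIELDS if not alert.get(field)]


good_alert = {
    "what_broke": "checkout availability SLO burn rate above threshold",
    "impact": "users cannot complete payment",
    "runbook_url": "https://runbooks.example.com/checkout-availability",
    "rollback": "disable feature flag new_payment_flow",
}
bad_alert = {"what_broke": "CPU > 90%", "impact": ""}

print(missing_fields(good_alert))
print(missing_fields(bad_alert))
```

Running a check like this in review, before an alarm is allowed to page anyone, keeps non-actionable alarms from reaching the rotation in the first place.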
Runbook discipline: short, clear, actionable
Good runbook format:
- Triage: 3–5 quick checks (dashboard/links)
- Mitigation: safe first moves (rate-limit, rollback, failover)
- Escalation: who is called and when
- Evidence: which logs/metrics are saved, what is collected for the postmortem
Runbooks are not a documentation archive; they are living operational tools. If the runbook is not updated after every P1, you will repeat the same mistake.
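The "updated after every P1" rule can be checked mechanically. The sketch below assumes you can obtain, per service, the date of the last runbook edit and the date of the most recent P1; the service names and dates are made up for illustration.

```python
from datetime import date


def stale_runbooks(runbook_updated, last_p1):
    """Services whose runbook is missing or older than their most recent P1.

    A runbook edited on the same day as the P1 is treated as up to date.
    """
    return sorted(
        service
        for service, p1_day in last_p1.items()
        if service not in runbook_updated or runbook_updated[service] < p1_day
    )


# Illustrative data only.
runbook_updated = {"checkout": date(2024, 3, 1), "search": date(2024, 3, 10)}
last_p1 = {
    "checkout": date(2024, 3, 5),  # P1 after the last runbook edit -> stale
    "search": date(2024, 3, 2),    # runbook edited after the P1 -> fine
    "billing": date(2024, 2, 1),   # no runbook at all -> stale
}

print(stale_runbooks(runbook_updated, last_p1))
```

A date comparison cannot tell whether the edit actually captured the lesson from the incident, so this only flags candidates for the postmortem review.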
Reducing pager fatigue: the toil-budget approach
The fastest improvement I have seen in the field is to set a “toil budget”:
- A fixed number of hours each week reserved for alarm tuning and automation
- A “top 5 most-paging alarms” list
- For each alarm: cause, action, fix owner, target date
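The "top 5" list is easy to generate from paging history. This is a minimal sketch that assumes the history is available as a flat list of alarm names; the alarm names and the `review_sheet` helper are hypothetical.

```python
from collections import Counter


def review_sheet(page_log, n=5):
    """Build one row per top-paging alarm, with the fields to fill in at review."""
    return [
        {
            "alarm": name,
            "pages": count,
            "cause": None,        # filled in during the weekly review
            "action": None,
            "fix_owner": None,
            "target_date": None,
        }
        for name, count in Counter(page_log).most_common(n)
    ]


# Illustrative paging history: one entry per page.
page_log = ["disk_full"] * 4 + ["high_latency"] * 2 + ["cert_expiry"]

for row in review_sheet(page_log):
    print(row["alarm"], row["pages"])
```

The empty fields are deliberate: the list only earns its keep when each row leaves the review with an owner and a date.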
Once this discipline takes hold, on-call stops being a “crisis shift” and turns into a feedback mechanism that improves the reliability of the system.
Conclusion
Designed correctly, on-call is not an obligation that wears teams down; it becomes a practice that builds operational maturity. When fairness in rotation, clarity in escalation, alarm quality, and runbook discipline work together, MTTR drops and the team’s capacity to “stay calm” rises. The least visible yet most critical contribution of operational leadership is keeping this whole system sustainable.