No matter how good a change plan looks on paper, things rarely go “as planned” in production. The real problem isn’t that failures show up after a change; it’s how quickly and how correctly the team responds when they do. That’s why on critical changes I value the “pre-mortem” discipline just as much as the “postmortem.”
The aim of a pre-mortem is simple: assume “the change failed,” live through the failure and its impact in your head ahead of time, and make the risks visible.
When should you run a pre-mortem?
Holding a meeting for every deploy isn’t sustainable. Use pre-mortems for changes like these:
- Wide blast radius: shared layers like platform, network, identity, logging, DNS
- Hard to roll back: data schema, stateful service, security policy
- First-time work: a new technology/vendor/operating model
- Access changes: anything that touches production access or permissions
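The criteria above can be turned into a quick triage check. This is a minimal sketch, not a prescribed tool: the `Change` class, its field names, and `premortem_reasons` are hypothetical names invented for illustration, and how each flag gets set is left to whoever files the change.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Change:
    """Hypothetical description of a planned change; field names are illustrative."""
    title: str
    touches_shared_layer: bool = False   # platform, network, identity, logging, DNS
    hard_to_roll_back: bool = False      # data schema, stateful service, security policy
    first_time_work: bool = False        # new technology, vendor, or operating model
    changes_prod_access: bool = False    # production access or permissions


def premortem_reasons(change: Change) -> list[str]:
    """Return the criteria the change meets; an empty list means no pre-mortem is called for."""
    checks = [
        (change.touches_shared_layer, "touches a shared layer"),
        (change.hard_to_roll_back, "hard to roll back"),
        (change.first_time_work, "first-time work"),
        (change.changes_prod_access, "changes production access or permissions"),
    ]
    return [reason for hit, reason in checks if hit]


if __name__ == "__main__":
    change = Change("Switch DNS provider", touches_shared_layer=True, first_time_work=True)
    reasons = premortem_reasons(change)
    print("Pre-mortem needed:" if reasons else "No pre-mortem needed", reasons)
```

Returning the reasons rather than a bare yes/no gives the meeting invite its justification for free.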
A practical 30-minute flow
I try to fit a pre-mortem into 30 minutes:
- Goal (3 min): What business outcome is this change trying to deliver?
- “It failed” scenario (10 min): Pick the 3 worst-case ways this ends badly.
- Early signals (7 min): Which metric/log/alert catches it first?
- Rollback (7 min): Which path applies: 1) automatic, 2) manual, or 3) “stop and isolate”?
- Decision points (3 min): At what threshold is rollback mandatory?
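For a facilitator, the time boxes above translate into a running clock. The sketch below only does the arithmetic: it checks that the boxes fit the 30-minute budget and prints when each step starts and ends. `AGENDA` and `schedule` are illustrative names, not part of any tool.

```python
# Time boxes from the flow above: (step, minutes).
AGENDA = [
    ("Goal", 3),
    ("'It failed' scenario", 10),
    ("Early signals", 7),
    ("Rollback", 7),
    ("Decision points", 3),
]


def schedule(agenda, budget_minutes=30):
    """Return (start_minute, end_minute, step) rows; raise if the boxes overrun the budget."""
    total = sum(minutes for _, minutes in agenda)
    if total > budget_minutes:
        raise ValueError(f"agenda needs {total} min, budget is {budget_minutes} min")
    rows, clock = [], 0
    for step, minutes in agenda:
        rows.append((clock, clock + minutes, step))
        clock += minutes
    return rows


if __name__ == "__main__":
    for start, end, step in schedule(AGENDA):
        print(f"{start:02d}-{end:02d} min  {step}")
```

If a step needs more time, the check forces the extra minutes to come out of another box instead of quietly stretching the meeting.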
Template: the same questions for every change
I keep these questions fixed:
- Blast radius: What’s the worst-case impact area?
- Dependencies: Which service/layer gets quietly affected?
- Observability: Which signal shows success and which shows breakage?
- Authority: Who can roll back? Is there a break-glass?
- Data: Is there a data consistency risk on rollback?
- Timing: Are clock skew / TTL / cache effects in play during the change?
- Communication: Is it clear who gets notified on which channel?
These questions exist to speed up decisions, not to “produce documents.”
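One way to keep the questions fixed without producing paperwork is to store them as data and flag only what is still unanswered. This is a sketch under assumptions: the topic keys and the `unanswered` helper are made up for illustration, and the answers would come from wherever the team already takes notes.

```python
# The seven fixed questions, keyed by topic (wording taken from the list above).
QUESTIONS = {
    "blast_radius": "What's the worst-case impact area?",
    "dependencies": "Which service/layer gets quietly affected?",
    "observability": "Which signal shows success and which shows breakage?",
    "authority": "Who can roll back? Is there a break-glass?",
    "data": "Is there a data consistency risk on rollback?",
    "timing": "Are clock skew / TTL / cache effects in play during the change?",
    "communication": "Is it clear who gets notified on which channel?",
}


def unanswered(answers):
    """Return the topics whose answer is missing or blank, in template order."""
    return [topic for topic in QUESTIONS if not (answers.get(topic) or "").strip()]


if __name__ == "__main__":
    # Hypothetical notes from a pre-mortem that ran out of time.
    notes = {
        "blast_radius": "All services that resolve names through internal DNS",
        "authority": "On-call SRE; break-glass account exists",
        "data": "   ",  # blank answers count as unanswered
    }
    for topic in unanswered(notes):
        print(f"OPEN  {topic}: {QUESTIONS[topic]}")
```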
How should you use the pre-mortem output?
The best outputs:
- A “risk and rollback” section added to the change RFC
- Decision points added to the runbook (threshold + action)
- Observability/alert gaps closed before the deploy
The worst output: holding the meeting and never writing anything down.
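A runbook decision point (threshold + action) is easy to write down in a form a script can evaluate. The following is a minimal sketch, not a monitoring integration: the metric names, threshold values, and the `DecisionPoint` and `required_actions` names are all assumptions chosen for illustration, and the real numbers have to come out of the pre-mortem itself.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionPoint:
    metric: str        # name of the signal being watched
    threshold: float   # value at or above which the action is mandatory
    action: str        # what the operator does, e.g. "roll back"


# Illustrative values only; agree on the real ones before the deploy.
DECISION_POINTS = [
    DecisionPoint("http_5xx_rate", 0.02, "roll back"),
    DecisionPoint("p99_latency_ms", 1500, "stop and isolate"),
]


def required_actions(observed: dict[str, float], points=DECISION_POINTS) -> list[str]:
    """Return one line per decision point whose threshold has been reached."""
    actions = []
    for point in points:
        value = observed.get(point.metric)
        # A metric that is not reported is skipped here; a real runbook
        # should treat a missing signal as a problem in its own right.
        if value is not None and value >= point.threshold:
            actions.append(f"{point.action}: {point.metric}={value} >= {point.threshold}")
    return actions


if __name__ == "__main__":
    print(required_actions({"http_5xx_rate": 0.05, "p99_latency_ms": 900}))
```

Writing the threshold next to the action removes the debate during the incident: once the number is reached, the decision has already been made.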
Leadership angle: a pre-mortem is a trust-building exercise
A pre-mortem isn’t an “I don’t trust the team” message; on the contrary, it’s how you get safe speed without piling pressure on the team. A good leader doesn’t hide risks; they make them visible, because real speed in production is the ability to catch failure early and roll back correctly.
Closing
There’s no perfect plan in production, only a well-prepared rollback. A pre-mortem is a small investment before the change and a big time saver after it. Done with discipline, it leaves the team not “brave” but in control.