In Platform/SRE/Infra roles, the biggest interview risk is that the candidate talks theory and you let the theory convince you. Then the first real production incident hits, and the reflexes for handling uncertainty, reading signals, making risky changes in a controlled way, and communicating just aren’t there.
Because of that, I always run an incident walkthrough in interviews. The candidate walks through a real (or constructed) incident and explains, step by step, how they ran it. Set up correctly, this surfaces the operational muscles that don’t show up on a CV.
What is an incident walkthrough, and what does it measure?
A walkthrough lets you measure:
- Signal reading: separating metric/log/trace/network signals
- Hypothesis building: ranking hypotheses by likelihood and by risk
- Risk management: the balance between fast and safe
- Communication: stakeholder management, incident command, status updates
- Learning: postmortem, durable mitigation, repeat prevention
Structure: a 30-minute ideal flow
The flow I prefer:
- Context (2 min): what’s the system, what’s the critical path, what’s the SLO?
- Symptom (3 min): what did the alarm say, what did the user experience?
- Triage (8 min): what did you do and what did you check in the first 10 minutes of the incident?
- Response (7 min): which change did you make, and how did you manage the risk?
- Communication (5 min): who did you keep informed, and how?
- Closure (5 min): postmortem, action items, measurement
At each step you ask the candidate for “evidence”:
- Which metric?
- Which class of log line?
- Which dashboard?
- Which runbook?
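The flow and the evidence prompts above can be written down as a small agenda, which makes it easy to check that the steps add up to 30 minutes. This is a minimal sketch; the names `AGENDA`, `EVIDENCE_PROMPTS`, and `schedule` are illustrative, not part of any tool.

```python
# Hypothetical encoding of the 30-minute flow (step name, minutes).
AGENDA = [
    ("Context", 2),
    ("Symptom", 3),
    ("Triage", 8),
    ("Response", 7),
    ("Communication", 5),
    ("Closure", 5),
]

# Evidence questions to ask at every step.
EVIDENCE_PROMPTS = [
    "Which metric?",
    "Which class of log line?",
    "Which dashboard?",
    "Which runbook?",
]


def schedule(agenda):
    """Return (step, start_minute, end_minute) tuples for the interview."""
    slots = []
    start = 0
    for step, minutes in agenda:
        slots.append((step, start, start + minutes))
        start += minutes
    return slots


TOTAL_MINUTES = sum(minutes for _, minutes in AGENDA)

if __name__ == "__main__":
    for step, start, end in schedule(AGENDA):
        print(f"{start:02d}-{end:02d} min  {step}")
```

Keeping the start and end minute per step helps the interviewer notice early when triage is eating into the communication and closure slots.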
Strong signals (for me)
1) They speak in “SLO language”
Instead of “the service was slow”:
- p95/p99, error rate
- Which user segment got hit
- Blast radius (is it one region?)
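To make “SLO language” concrete, here is a rough sketch of how the numbers in such an answer could be derived from raw request samples. The input shape `(latency_ms, status_code, region)` and the choice of the nearest-rank percentile method are assumptions for illustration; real monitoring systems usually compute percentiles from histograms.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """Summarize requests given as (latency_ms, status_code, region) tuples (hypothetical shape)."""
    latencies = [latency for latency, _, _ in requests]
    errors = [r for r in requests if r[1] >= 500]  # count 5xx responses as errors
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": len(errors) / len(requests),
        # Blast radius: which regions actually saw errors.
        "regions_with_errors": sorted({region for _, _, region in errors}),
    }
```

A candidate who answers with these four numbers, instead of “the service was slow”, has already told you the impact and the blast radius.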
2) They line up their hypotheses
A strong candidate does this:
- Fast, low-risk checks first
- More invasive steps later
A typical sequence:
- Was there a recent deploy/config change?
- Are dependencies healthy?
- Any saturation signal (CPU, queue, conntrack, pool)?
- Is a controlled degrade or traffic shed possible?
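The ordering above (cheap, read-only checks before invasive ones) can be sketched as a sort on risk first and time-to-answer second. The `Check` type, the risk scale, and the minute estimates are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    risk: int     # 0 = read-only; higher = more invasive (illustrative scale)
    minutes: int  # rough time to get an answer (illustrative)


def triage_order(checks):
    """Order checks so low-risk, fast ones come first."""
    return sorted(checks, key=lambda c: (c.risk, c.minutes))


CHECKS = [
    Check("controlled degrade / traffic shed", risk=2, minutes=10),
    Check("saturation signals (CPU, queue, conntrack, pool)", risk=0, minutes=5),
    Check("recent deploy/config change", risk=0, minutes=2),
    Check("dependency health", risk=0, minutes=3),
]

if __name__ == "__main__":
    for check in triage_order(CHECKS):
        print(f"risk={check.risk} ~{check.minutes} min  {check.name}")
```

The exact numbers matter less than hearing the candidate explain why one check comes before another.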
3) The “rollback” reflex and a decision threshold
A good candidate doesn’t just say “I’d roll back”. They also say:
- Which signal triggers the rollback
- How they verify recovery after the rollback
- What conditions need to hold before they roll forward again
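Those three parts of the answer (trigger, verification, roll-forward conditions) map onto a small decision policy. The sketch below is one possible shape; the thresholds, the idea of “healthy windows”, and the roll-forward preconditions are assumptions chosen for illustration, not a recommended standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackPolicy:
    max_error_rate: float          # trigger: roll back above this error rate
    max_p99_ms: float              # trigger: roll back above this p99 latency
    healthy_windows_required: int  # verification: consecutive healthy windows needed

    def __post_init__(self):
        if self.healthy_windows_required < 1:
            raise ValueError("healthy_windows_required must be at least 1")


def should_roll_back(policy, error_rate, p99_ms):
    """Trigger: either signal crossing its threshold is enough."""
    return error_rate > policy.max_error_rate or p99_ms > policy.max_p99_ms


def rollback_verified(policy, windows):
    """Verification: the most recent N windows (oldest first) must all be healthy.

    windows is a list of (error_rate, p99_ms) tuples.
    """
    n = policy.healthy_windows_required
    recent = windows[-n:]
    return len(recent) == n and all(
        not should_roll_back(policy, error_rate, p99_ms)
        for error_rate, p99_ms in recent
    )


def can_roll_forward(verified, root_cause_identified, fix_canaried):
    """Roll-forward gate: hypothetical preconditions, all of which must hold."""
    return verified and root_cause_identified and fix_canaried
```

In the interview, the interesting part is whether the candidate can name their own equivalents of these thresholds before being asked.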
4) Communication quality
This is where many technical candidates fall short. What I look for:
- Crisp status updates (what we know, what we don’t)
- A single decision owner (incident commander)
- Separating “noise” from “signal”
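A crisp status update has a predictable shape, which the following sketch tries to capture. The field names and the “next update” line are assumptions for illustration; teams typically have their own template.

```python
def format_status_update(commander, known, unknown, next_update_minutes):
    """Render a short status update: owner, what we know, what we don't, next update."""
    lines = [f"Incident commander: {commander}", "What we know:"]
    lines += [f"- {item}" for item in known]
    lines.append("What we don't know yet:")
    lines += [f"- {item}" for item in unknown] or ["- (nothing outstanding)"]
    lines.append(f"Next update in {next_update_minutes} min")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_status_update(
        commander="on-call lead",
        known=["error rate elevated in one region", "started after a config change"],
        unknown=["whether other regions are affected"],
        next_update_minutes=15,
    ))
```

Separating “know” from “don’t know” is the part candidates most often skip, so it is worth asking for explicitly.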
Weak signals (red flags)
- Answering every question with “autoscale”
- Calling it “the network” without looking at metrics
- “We’ll just try it in prod” attitude toward risky changes
- Treating the postmortem as a “report” (no actions, no mitigation)
Simple scoring rubric (practical)
I generally score each dimension on a 5-point scale; the table gives anchors for 1, 3, and 5:
| Dimension | 1 | 3 | 5 |
|---|---|---|---|
| Signal reading | random | basic metrics | right metric set + correlation |
| Hypothesis | stuck on one idea | a few hypotheses | ordered, evidence-backed, iterative |
| Risk management | uncontrolled changes | cautious | canary/rollback/guardrail |
| Communication | none | basic | IC model + regular updates |
| Learning | no postmortem | partial | durable mitigation + measurement |
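The rubric can be tallied with a few lines of code. This is a minimal sketch assuming equal weight per dimension; the dimension keys and the `score_candidate` helper are hypothetical names.

```python
DIMENSIONS = (
    "signal_reading",
    "hypothesis",
    "risk_management",
    "communication",
    "learning",
)


def score_candidate(scores):
    """Validate 1-5 scores per dimension and return a simple summary."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        value = scores[name]
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    total = sum(scores[name] for name in DIMENSIONS)
    return {
        "total": total,
        "max": 5 * len(DIMENSIONS),
        "average": total / len(DIMENSIONS),
        # On ties, the first dimension in DIMENSIONS order is reported.
        "weakest": min(DIMENSIONS, key=lambda name: scores[name]),
    }
```

Reporting the weakest dimension alongside the total keeps a strong average from hiding, for example, a communication score of 1.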
For the candidate: how should you choose “the incident”?
My advice for candidates:
- You don’t have to pick a giant outage; what matters is the decision points.
- “We managed it as a team” reads more honestly than “I solved it.”
- Talk about 1–2 of the postmortem action items.
Wrap-up
In a platform interview, an incident walkthrough is one of the lowest-cost, highest-signal ways to gauge a candidate’s real production reflexes. Set it up right and score it with a rubric, and you lose fewer strong candidates while cutting down on “looks great in theory, struggles in practice” mismatches.