Incident Walkthrough and Operational Signals in a Platform Interview

In Platform/SRE/Infra roles, the biggest interview risk is this: the candidate talks “theory” and you let the “theory” convince you. Then the first real production incident hits, and the reflexes for handling uncertainty, reading signals, making risky changes in a controlled way, and communicating just aren’t there.

Because of that, I always run an incident walkthrough in interviews. The candidate walks through a real (or constructed) incident and explains, step by step, how they ran it. Set up correctly, this surfaces the operational muscles that don’t show up on a CV.

What is an incident walkthrough, and what does it measure?

A walkthrough lets you measure:

Signal reading: separating metric/log/trace/network signals
Hypothesis building: ordering hypotheses, from “most likely” to “most risky”
Risk management: the balance between fast and safe
Communication: stakeholder management, incident command, status updates
Learning: postmortem, durable mitigation, repeat prevention

Structure: a 30-minute ideal flow

The flow I prefer:

Context (2 min): what’s the system, what’s the critical path, what’s the SLO?
Symptom (3 min): what did the alarm say, what did the user experience?
Triage (8 min): in the first 10 minutes of the incident, what did you do and what did you check?
Response (7 min): which change did you make, and how did you manage the risk?
Communication (5 min): who did you keep informed, and how?
Closure (5 min): postmortem, action items, measurement

At each step you ask the candidate for “evidence”:

Which metric?
Which class of log line?
Which dashboard?
Which runbook?

Strong signals (for me)

1) They speak in “SLO language”

Instead of “the service was slow”:

p95/p99, error rate
Which user segment got hit
Blast radius (is it one region?)
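
To make “SLO language” concrete, here is a minimal sketch of the numbers a strong candidate tends to quote: tail latency, error rate, and a per-region breakdown as a rough blast-radius check. The `Request` record, its field names, and the choice to count only 5xx responses as errors are assumptions made for illustration, not something a particular monitoring stack prescribes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    latency_ms: float
    status: int   # HTTP status code
    region: str   # hypothetical label used to estimate blast radius


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """Return p95/p99 latency, overall error rate, and error rate per region."""
    latencies = [r.latency_ms for r in requests]
    per_region = {}  # region -> (total requests, failed requests)
    for r in requests:
        total, failed = per_region.get(r.region, (0, 0))
        # Assumption: only 5xx responses count against the error budget.
        per_region[r.region] = (total + 1, failed + (1 if r.status >= 500 else 0))
    failed_total = sum(failed for _, failed in per_region.values())
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": failed_total / len(requests),
        "error_rate_by_region": {
            region: failed / total for region, (total, failed) in per_region.items()
        },
    }
```

If the per-region error rate is elevated in only one region, that is the “is it one region?” answer expressed as a number rather than an impression.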

2) They line up their hypotheses

A strong candidate does this:

Fast, low-risk checks first
More invasive steps later

A typical sequence:

Was there a recent deploy/config change?
Are dependencies healthy?
Any saturation signal (CPU, queue, conntrack, pool)?
Is a controlled degrade or traffic shed possible?
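
The ordering idea above (fast, low-risk checks first, invasive steps later) can be sketched as a sort key. This is only an illustration: the `Check` fields and any numbers fed into it are hypothetical estimates a candidate might give, not measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    minutes: int       # rough time to get an answer
    risk: int          # 0 = read-only, higher = more invasive
    likelihood: float  # rough prior that this is the cause, 0..1


def triage_order(checks):
    """Least invasive first, then quickest, then most likely."""
    return sorted(checks, key=lambda c: (c.risk, c.minutes, -c.likelihood))


if __name__ == "__main__":
    plan = triage_order([
        Check("controlled degrade / traffic shed", minutes=10, risk=2, likelihood=0.0),
        Check("saturation signals", minutes=5, risk=0, likelihood=0.2),
        Check("recent deploy/config change", minutes=1, risk=0, likelihood=0.5),
        Check("dependency health", minutes=3, risk=0, likelihood=0.3),
    ])
    for step in plan:
        print(step.name)
```

With these illustrative numbers the order matches the typical sequence: read-only checks come first, and the traffic shed, which changes production behavior, comes last.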

3) The “rollback” reflex and a decision threshold

A good candidate doesn’t just say “I’d roll back”. They also say:

“Which signal makes me roll back”
“How I verify after the rollback”
“What conditions need to hold before I roll forward again”
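
A decision threshold like this can be written down before the incident rather than argued during it. The sketch below is one possible shape, assuming the signals arrive as per-window `(error_rate, p99_ms)` pairs. The `Thresholds` fields and the “N consecutive windows” rule are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float   # e.g. 0.02 means 2% of requests failing
    max_p99_ms: float
    sustained_windows: int  # consecutive windows required before deciding


def _breaches(window, t):
    error_rate, p99_ms = window
    return error_rate > t.max_error_rate or p99_ms > t.max_p99_ms


def should_roll_back(windows, t):
    """True if the most recent N windows (oldest first in the list) all breach a threshold."""
    if len(windows) < t.sustained_windows:
        return False
    return all(_breaches(w, t) for w in windows[-t.sustained_windows:])


def rollback_verified(windows, t):
    """After the rollback: True if the most recent N windows are all back inside the thresholds."""
    if len(windows) < t.sustained_windows:
        return False
    return not any(_breaches(w, t) for w in windows[-t.sustained_windows:])
```

Requiring several consecutive windows is a deliberate trade: it delays the rollback slightly but avoids reacting to a single noisy sample. The same verification check can serve as one precondition for rolling forward again, alongside an understood root cause.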

4) Communication quality

This is where many technical candidates fall short. What I look for:

Crisp status updates (what we know, what we don’t)
A single decision owner (incident commander)
Separating “noise” from “signal”
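
As a rough illustration of what a crisp status update contains, here is a small sketch that forces the “what we know / what we don’t” split and names a single decision owner. The `StatusUpdate` class, its fields, and the output layout are hypothetical. Any real team will have its own template.

```python
from dataclasses import dataclass, field


@dataclass
class StatusUpdate:
    incident_commander: str          # the single decision owner
    impact: str                      # user-visible effect, in one sentence
    known: list = field(default_factory=list)
    unknown: list = field(default_factory=list)
    next_update_minutes: int = 30    # assumption: updates go out on a fixed cadence

    def render(self):
        """Return the update as plain text, with knowns and unknowns kept apart."""
        lines = [
            f"IC: {self.incident_commander}",
            f"Impact: {self.impact}",
            "What we know:",
        ]
        lines += [f"  - {item}" for item in self.known]
        lines.append("What we don't know yet:")
        lines += [f"  - {item}" for item in self.unknown]
        lines.append(f"Next update in {self.next_update_minutes} min")
        return "\n".join(lines)
```

In a walkthrough, asking the candidate to reconstruct one such update from memory is a quick way to see whether they separated facts from guesses at the time.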

Weak signals (red flags)

Answering every question with “autoscale”
Calling it “the network” without looking at metrics
“We’ll just try it in prod” attitude toward risky changes
Treating the postmortem as a “report” (no actions, no mitigation)

Simple scoring rubric (practical)

I generally use a 5-point rubric, with anchor descriptions at 1, 3, and 5:

Dimension	1	3	5
Signal reading	random	basic metrics	right metric set + correlation
Hypothesis	stuck on one idea	a few hypotheses	ordered, evidence-backed, iterative
Risk management	uncontrolled changes	cautious	canary/rollback/guardrail
Communication	none	basic	IC model + regular updates
Learning	no postmortem	partial	durable mitigation + measurement
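
To keep scoring consistent across interviewers, the rubric can be captured in a few lines. The sketch below assumes each dimension gets an integer score from 1 to 5 and that all five are weighted equally. Both the equal weighting and the dictionary keys are assumptions of this example, not part of the rubric itself.

```python
DIMENSIONS = (
    "signal_reading",
    "hypothesis",
    "risk_management",
    "communication",
    "learning",
)


def score_walkthrough(scores):
    """Validate per-dimension scores (1..5) and return total, mean, and weakest dimensions."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 1 <= scores[d] <= 5:
            raise ValueError(f"{d} must be between 1 and 5, got {scores[d]}")
    total = sum(scores[d] for d in DIMENSIONS)
    lowest = min(scores[d] for d in DIMENSIONS)
    return {
        "total": total,
        "mean": total / len(DIMENSIONS),
        # Surfacing the weakest dimensions keeps the debrief focused on evidence.
        "weakest": [d for d in DIMENSIONS if scores[d] == lowest],
    }
```

The weakest-dimension list is arguably more useful than the total: a candidate who scores 5 on signal reading and 1 on communication is a different hire from one who scores 3 everywhere.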

For the candidate: how should you choose “the incident”?

My advice for candidates:

You don’t have to pick a giant outage; what matters is the decision points.
“We managed it as a team” reads more honestly than “I solved it.”
Talk about 1–2 of the postmortem action items.

Wrap-up

In a platform interview, an incident walkthrough is one of the lowest-cost, highest-signal ways to gauge a candidate’s real production reflexes. Set it up right and score it with a rubric, and you lose fewer strong candidates while reducing the “looks great in theory, struggles in practice” mismatches.
