Mustafa Erbay

Incident Walkthrough and Operational Signals in a Platform Interview

An incident walkthrough framework and scoring rubric for measuring a candidate's real production reflex in SRE/Platform/Infra interviews.

In Platform/SRE/Infra roles, the biggest risk in an interview is this: the candidate talks “theory” and you let yourself be convinced by “theory”. Then the first real production incident hits, and the reflexes for handling uncertainty, reading signals, making risky changes in a controlled way, and communicating just aren’t there.

Because of that, I always run an incident walkthrough in interviews. The candidate walks through a real (or constructed) incident and explains, step by step, how they ran it. Set up correctly, this surfaces the operational muscles that don’t show up on a CV.

What is an incident walkthrough, and what does it measure?

A walkthrough lets you measure:

  • Signal reading: separating metric/log/trace/network signals
  • Hypothesis building: moving from “most likely” to “most risky”
  • Risk management: the balance between fast and safe
  • Communication: stakeholder management, incident command, status updates
  • Learning: postmortem, durable mitigation, repeat prevention

Structure: a 30-minute ideal flow

The flow I prefer:

  1. Context (2 min): what’s the system, what’s the critical path, what’s the SLO?
  2. Symptom (3 min): what did the alarm say, what did the user experience?
  3. Triage (8 min): in the first 10 minutes, what did you do and what did you check?
  4. Response (7 min): which change did you make, and how did you manage the risk?
  5. Communication (5 min): who did you keep informed, and how?
  6. Closure (5 min): postmortem, action items, measurement

At each step you ask the candidate for “evidence”:

  • Which metric?
  • Which class of log line?
  • Which dashboard?
  • Which runbook?

Strong signals (for me)

1) They speak in “SLO language”

Instead of “the service was slow”:

  • p95/p99, error rate
  • Which user segment got hit
  • Blast radius (is it one region?)
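To make “SLO language” concrete, here is a minimal sketch of the numbers a strong candidate tends to quote. The request tuple shape `(latency_ms, status_code, region)`, the function names, and the choice to count only 5xx responses as errors are all assumptions for illustration, not something taken from a specific monitoring stack.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """requests: list of (latency_ms, status_code, region) tuples (hypothetical shape)."""
    latencies = [latency for latency, _, _ in requests]
    errors = sum(1 for _, status, _ in requests if status >= 500)

    # Per-region totals give a rough view of blast radius.
    by_region = {}
    for _, status, region in requests:
        total, failed = by_region.get(region, (0, 0))
        by_region[region] = (total + 1, failed + (1 if status >= 500 else 0))

    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(requests),
        "error_rate_by_region": {r: f / t for r, (t, f) in by_region.items()},
    }
```

If the per-region error rate is high in one region and near zero elsewhere, the candidate can say “one region, this segment” instead of “the service was slow”.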

2) They line up their hypotheses

A strong candidate does this:

  • Fast, low-risk checks first
  • More invasive steps later

A typical sequence:

  1. Was there a recent deploy/config change?
  2. Are dependencies healthy?
  3. Any saturation signal (CPU, queue, conntrack, pool)?
  4. Is a controlled degrade or traffic shed possible?
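The ordering above can be sketched as a simple sort: lowest risk first, and among equally risky checks, the fastest first. The `Check` type, the numeric risk scale, and the minute estimates are hypothetical values chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    risk: int     # 0 = read-only, higher = more invasive (hypothetical scale)
    minutes: int  # rough time to get an answer (hypothetical estimate)


def order_checks(checks):
    """Lowest risk first; among equal risk, the quickest check first."""
    return sorted(checks, key=lambda c: (c.risk, c.minutes))


CHECKS = [
    Check("controlled degrade or traffic shed", risk=2, minutes=10),
    Check("recent deploy/config change", risk=0, minutes=1),
    Check("saturation (CPU, queue, conntrack, pool)", risk=0, minutes=5),
    Check("dependency health", risk=0, minutes=3),
]

for check in order_checks(CHECKS):
    print(check.name)
```

The point in an interview is not the exact numbers but that the candidate can explain why one check comes before another.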

3) The “rollback” reflex and a decision threshold

A good candidate doesn’t just say “I’d roll back”. They also state:

  • Which signal triggers the rollback
  • How they verify the system after the rollback
  • What conditions need to hold before they roll forward again
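A decision threshold like this can be written down explicitly. The sketch below is one possible shape, assuming the signal is a series of `(error_rate, p99_ms)` measurement windows; the threshold values, the “three consecutive windows” rule, and all names are hypothetical and would be tuned per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackPolicy:
    max_error_rate: float = 0.02  # hypothetical threshold
    max_p99_ms: float = 800.0     # hypothetical threshold
    windows_required: int = 3     # consecutive windows before deciding


def _recent(windows, policy):
    """Return the last N windows, or None if there is not enough data yet."""
    if len(windows) < policy.windows_required:
        return None
    return windows[-policy.windows_required:]


def should_roll_back(windows, policy):
    """True if every one of the last N windows breaches at least one threshold."""
    recent = _recent(windows, policy)
    if recent is None:
        return False
    return all(e > policy.max_error_rate or p > policy.max_p99_ms for e, p in recent)


def rollback_verified(windows, policy):
    """True if every one of the last N windows is back within both thresholds."""
    recent = _recent(windows, policy)
    if recent is None:
        return False
    return all(e <= policy.max_error_rate and p <= policy.max_p99_ms for e, p in recent)
```

The same `rollback_verified` check, held for a longer period, is one way to express the condition for rolling forward again.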

4) Communication quality

This is where many technical candidates fall short. What I look for:

  • Crisp status updates (what we know, what we don’t)
  • A single decision owner (incident commander)
  • Separating “noise” from “signal”
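As an illustration of a crisp status update, the sketch below formats one from a few fields. The field set (including “next action” and “next update in N min”) and the function name are assumptions of this example, not a standard template.

```python
def format_status_update(commander, known, unknown, next_action, next_update_min):
    """Build a short, plain-text status update with a single named decision owner."""
    lines = [f"Incident commander: {commander}", "What we know:"]
    lines += [f"  - {item}" for item in known]
    lines.append("What we don't know yet:")
    lines += [f"  - {item}" for item in unknown]
    lines.append(f"Next action: {next_action}")
    lines.append(f"Next update in {next_update_min} min")
    return "\n".join(lines)


print(format_status_update(
    commander="alice",
    known=["error rate elevated in one region", "started after the last deploy"],
    unknown=["whether the dependency is also degraded"],
    next_action="roll back the last deploy",
    next_update_min=15,
))
```

A candidate who naturally structures their answer this way, separating what is known from what is not, is usually showing the communication habit this section is looking for.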

Weak signals (red flags)

  • Answering every question with “autoscale”
  • Calling it “the network” without looking at metrics
  • “We’ll just try it in prod” attitude toward risky changes
  • Treating the postmortem as a “report” (no actions, no mitigation)

Simple scoring rubric (practical)

I generally use a 5-point rubric:

Dimension       | 1                    | 3                | 5
Signal reading  | random               | basic metrics    | right metric set + correlation
Hypothesis      | stuck on one idea    | a few hypotheses | ordered, evidence-backed, iterative
Risk management | uncontrolled changes | cautious         | canary/rollback/guardrail
Communication   | none                 | basic            | IC model + regular updates
Learning        | no postmortem        | partial          | durable mitigation + measurement
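If you want to record the rubric consistently across interviewers, a small helper like the one below can do it. It assumes an unweighted average across the five dimensions and reports the weakest ones; the dimension keys, the averaging choice, and the function name are assumptions of this sketch rather than part of the rubric itself.

```python
DIMENSIONS = (
    "signal_reading",
    "hypothesis",
    "risk_management",
    "communication",
    "learning",
)


def score_candidate(scores):
    """scores: dict of dimension -> int in 1..5. Returns (average, weakest dimensions)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 1 <= scores[d] <= 5:
            raise ValueError(f"{d} must be between 1 and 5, got {scores[d]}")

    average = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    lowest = min(scores[d] for d in DIMENSIONS)
    weakest = [d for d in DIMENSIONS if scores[d] == lowest]
    return average, weakest
```

Reporting the weakest dimension alongside the average keeps a strong score in one area from hiding a 1 in communication or risk management.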

For the candidate: how should you choose “the incident”?

My advice for candidates:

  • You don’t have to pick a giant outage — what matters is the decision points.
  • “We managed it as a team” reads more honestly than “I solved it.”
  • Talk about 1–2 of the postmortem action items.

Wrap-up

In a platform interview, an incident walkthrough is one of the lowest-cost, highest-signal ways to gauge a candidate’s real production reflex. Set it up right, score it with a rubric, and you avoid losing strong candidates while reducing the “looks great in theory, struggles in practice” mismatches.


Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have been working on system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that holds up in the field.
