Mustafa Erbay
Career · 9 min read

The Decision Log and Handoff Discipline During Incident Rotation

How a decision log, a steady handover rhythm, and a clean handoff flow keep context from getting lost when teams swap during long-running outages.

In long-running incidents, the most overlooked risk isn’t the technical fault itself — it’s context loss. Teams change, fatigue piles on, and people forget which hypothesis got ruled out and why. The new team retries the same experiments, can’t see why a risky decision was made the way it was, and the incident drags on longer than it has to.

That’s why a big incident needs more than status updates: it needs a decision log. The quality of a handoff often affects safety and communication as much as it affects MTTR (mean time to recovery).

1) What is a decision log?

A decision log is part of the incident timeline, but it’s more selective. It carries:

  • Which hypothesis got considered?
  • What evidence turned up?
  • What action was taken?
  • Why was it taken?
  • What risk was accepted?

With this record, the incoming team sees why we went down a given path, not just a list of “things we did.”
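
One way to make those five questions concrete is to give every entry the same five fields. The sketch below is a minimal illustration in Python, not a prescribed format: the names `DecisionEntry` and `render_entry` are hypothetical, and the sample incident details are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DecisionEntry:
    """One decision-log entry; the field names are illustrative assumptions."""
    at: datetime         # when the decision was made
    hypothesis: str      # which hypothesis was considered
    evidence: str        # what evidence turned up
    action: str          # what action was taken
    rationale: str       # why it was taken
    accepted_risk: str   # what risk was accepted


def render_entry(entry: DecisionEntry) -> str:
    """Format an entry as a short plain-text block for a wiki or shared doc."""
    return "\n".join([
        f"[{entry.at:%H:%M} UTC] Hypothesis: {entry.hypothesis}",
        f"  Evidence: {entry.evidence}",
        f"  Action: {entry.action}",
        f"  Why: {entry.rationale}",
        f"  Accepted risk: {entry.accepted_risk}",
    ])


# Hypothetical example entry; none of these details come from a real incident.
example = DecisionEntry(
    at=datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc),
    hypothesis="Connection pool exhaustion on the primary database",
    evidence="Pool wait time rose before error rate did",
    action="Raised pool size instead of rolling back the release",
    rationale="Rollback would not touch the pool configuration",
    accepted_risk="Higher memory use on the database host",
)
print(render_entry(example))
```

Keeping "why" and "accepted risk" as required fields is the point of the structure: an entry that only has an action is a timeline item, not a decision.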

2) When do you need a handoff?

A formal handoff is mandatory in scenarios like these:

  • The incident is rolling into a shift change
  • The IC or tech lead is hitting their fatigue threshold
  • A team from a different specialty is taking over
  • The incident has moved into a “stable but not resolved” state

A common mistake here is letting the handoff drift along freely in Slack. Random messages show you the past but don’t carry the decision logic.

3) A solid handoff template

I’ve found these five sections to be really useful:

  1. Current impact: which users, which services?
  2. Working hypothesis: what’s the strongest explanation right now?
  3. Evidence: signals that support that hypothesis
  4. Things tried: changes that were attempted and what came of them
  5. Boundaries: risky actions that should not be taken

These five points take the incoming team straight to the decisions that matter, without making them reconstruct the history first.
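
The five sections above can also be treated as a checklist that is verified before the handoff happens. The following is a rough sketch under that assumption; the `Handoff` class, its field names, and `missing_sections` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Handoff:
    """The five handoff sections; the field names are illustrative assumptions."""
    current_impact: str = ""
    working_hypothesis: str = ""
    evidence: list[str] = field(default_factory=list)
    things_tried: list[str] = field(default_factory=list)
    boundaries: list[str] = field(default_factory=list)


def missing_sections(handoff: Handoff) -> list[str]:
    """Return the names of sections that are still empty, in template order."""
    return [f.name for f in fields(handoff) if not getattr(handoff, f.name)]


# Hypothetical draft written by the outgoing team.
draft = Handoff(
    current_impact="Checkout failing for EU users",
    working_hypothesis="Stale DNS cache on the edge proxies",
    things_tried=["Restarted one proxy: errors dropped on that node only"],
)
print(missing_sections(draft))  # ['evidence', 'boundaries']
```

Treating an empty section as a blocker is a design choice for this sketch. If there really are no risky actions to avoid, writing "none known" under Boundaries is more informative than leaving it blank.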

4) What do I specifically write into the decision log?

  • If we didn’t roll back, why?
  • What message went out to customers?
  • Which data source did we decide we couldn’t trust?
  • Which actions can only be taken with IC approval?

I write these down because, as an incident drags on, what teams lose first is not the technical detail but the decision context.

5) Classic mistakes during handover

  • Using vague phrases like “we looked at the logs”
  • Writing only the command instead of the result of the experiment
  • Leaving a risky change as a verbal note
  • Not making it clear who’s in charge of decisions after the handoff

At the end of every handoff, there has to be a single, explicit ownership statement: “From this point on, coordination is on X, and the technical recommendation sits with Y.”
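
Two of these mistakes, vague phrases and a missing ownership statement, are mechanical enough to check before a note is posted. The sketch below assumes a convention that is not in the original text: the note names owners on lines starting with "Coordination:" and "Technical recommendation:". The phrase list and the function name `review_handoff_note` are likewise hypothetical.

```python
import re

# Hypothetical starter list of phrases that report activity without a result.
VAGUE_PHRASES = ("we looked at the logs", "we checked", "seems fine")

# Assumed convention: owners are named on their own lines, for example
# "Coordination: X" and "Technical recommendation: Y".
OWNER_PATTERN = re.compile(
    r"^(coordination|technical recommendation):\s*\S+",
    re.IGNORECASE | re.MULTILINE,
)
REQUIRED_ROLES = ("coordination", "technical recommendation")


def review_handoff_note(note: str) -> list[str]:
    """Return a list of problems found in a handoff note (empty if none)."""
    problems = []
    lowered = note.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            problems.append(f"vague phrase: {phrase!r}")
    named_roles = {m.group(1).lower() for m in OWNER_PATTERN.finditer(note)}
    for role in REQUIRED_ROLES:
        if role not in named_roles:
            problems.append(f"missing owner: {role}")
    return problems


weak_note = "We looked at the logs.\nCoordination: X\n"
clear_note = (
    "Restarted proxy-1: 5xx rate dropped on that node only.\n"
    "Coordination: X\n"
    "Technical recommendation: Y\n"
)
print(review_handoff_note(weak_note))
print(review_handoff_note(clear_note))
```

A check like this only catches the obvious cases. It cannot tell whether the named owner has actually acknowledged the handoff, so the verbal confirmation still matters.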

6) It’s a habit, not a tool

You can keep the decision log in a wiki, in your incident tool, or in a shared doc. The tool is secondary. What actually moves the needle is the habit:

  • A short note after every important decision
  • A status summary every 10–15 minutes
  • A final risk list before every handoff

That cadence keeps the system calm when an incident drags on. Instead of relying on human memory, it makes the decisions visible.
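
The cadence itself can be checked after the fact, for example in a post-incident review. This is a small sketch that takes the upper end of the 10–15 minute range as the threshold; the function name `find_cadence_gaps` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def find_cadence_gaps(update_times, max_gap=timedelta(minutes=15)):
    """Return (previous, next) pairs of updates that are more than max_gap apart."""
    ordered = sorted(update_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


# Hypothetical timestamps of status summaries during one incident.
updates = [
    datetime(2024, 1, 1, 14, 0),
    datetime(2024, 1, 1, 14, 12),
    datetime(2024, 1, 1, 14, 40),
    datetime(2024, 1, 1, 14, 50),
]
for prev, nxt in find_cadence_gaps(updates):
    print(f"No status summary between {prev:%H:%M} and {nxt:%H:%M}")
```

With the sample data this reports the 28-minute silence between 14:12 and 14:40, which is the kind of gap where context tends to go missing.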

Conclusion

In long-running incidents, what determines quality isn’t just good engineering — it’s good handover discipline. With a decision log and a clean handoff flow, the new team doesn’t keep slamming into the same wall, the risky actions stay visible, and coordination doesn’t fall apart. Professionalism in incident management often comes from “cleaner context handover,” not “more people on the call.”

Frequently Asked Questions

Common questions readers have about this article.

How do I actually start a decision log without adding more toil during a high-pressure incident?
I’ve seen teams try to use complex spreadsheets or heavy templates, but that always fails when things get heated. In my experience, the best way to start is by using what you already have—usually a dedicated thread in Slack or a pinned section in a shared document. I personally prefer a simple 'Decision Log' header in a live document like Notion or Google Docs. The key is brevity; I tell my teams to focus strictly on the 'why.' If we decided to reboot a database instead of scaling it, that reasoning needs to be captured in one sentence. It’s not about writing a novel; it’s about leaving a breadcrumb trail so the next person doesn't waste two hours re-running a failed hypothesis. Start small, but stay consistent.
Isn't stopping to document decisions in a log just slowing down the actual recovery time (MTTR)?
This is a classic trap I’ve fallen into myself. You feel like every second spent typing is a second the system is down. However, I’ve learned that the 'speed' you gain by skipping documentation is an illusion. Without a log, the incoming shift inevitably spends 45 minutes asking the same questions you already answered. I view the decision log as an investment in the incident's 'future self.' By spending 30 seconds to log a pivot, I’m preventing a 30-minute regression later. In long-running outages, the bottleneck isn't usually typing speed; it's the cognitive load of keeping the entire state in your head. A log offloads that burden so you can actually think faster and make better choices under pressure.
What should I do when a handoff happens but the incoming team still feels lost or starts repeating work?
I’ve been on both sides of a 'bad' handoff where the context just didn't click. Usually, this happens because the handoff was a 'data dump' rather than a 'context transfer.' If I see the new team repeating experiments, I immediately pause the technical work for five minutes to re-sync. I’ve found that a verbal walkthrough of the decision log is crucial: you can't just drop a link and walk away. I make it a rule that the outgoing lead must stay online for a 15-minute 'overlap' period. If the incoming team still feels lost, it’s a sign our log lacks a 'rejected hypotheses' section. Knowing what didn't work and why is often more valuable for the new team than knowing what we're currently trying.
Can't we just rely on the Slack history as our decision log since everything is already recorded there?
I hear this all the time, and frankly, it’s a dangerous myth. Slack is a stream of consciousness, not a record of intent. I’ve tried scrolling through 500 messages during a shift change to find out why a specific configuration was changed, and it’s a nightmare. You get buried in 'noise'—automated alerts, side discussions, and typos. A decision log is an intentional, curated summary that sits above the chat. I believe that if you can't see the strategic pivots of the last four hours in a single screen, you don't have a log; you just have an archive. Real discipline means extracting the signal from the Slack noise so the next team can hit the ground running without needing a history degree.

Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have worked on systems architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience grounded in real field work.
