There are two extremes during an incident: either there is no data at all, or you drown the system in logs and pcap by trying to “collect everything.” Both end the same way: in the postmortem, the question “what exactly happened?” gets answered with interpretation rather than evidence.
What works best for me in the field: an Evidence Collection Kit + role assignment. The two together don’t just put out the fire; they also make the next incident cheaper.
What are we aiming for?
Not killing the evidence while resolving the incident
After the event, making improvements based on evidence, not luck
Keeping the initial data ready in case a security (forensics) need arises
Role assignment: three people, three responsibilities
Even in small teams, this separation pays off:
Incident Commander (IC): makes decisions, sets priorities, manages communication
Scribe: keeps the timeline, records the actions taken and their outcomes
Evidence Owner: runs the evidence checklist (logs/snapshots/configs)
When all three roles are merged into one person, the evidence side is usually the part that gets lost.
Evidence Collection Kit: “minimal but reproducible”
I split the kit into 6 pieces. For each piece, the answer to “why does this exist?” is clear.
1) Time standard
NTP/Chrony health check on all systems
During the incident, all records collected in UTC or one single time standard
Application: feature flag state, env/secret version (the version, not the value)
Infra: LB pool members, weights, health-check state
How does it plug into the incident flow?
The most practical model:
The IC makes the containment decision in the first 5 minutes
The Scribe writes every action as “timestamp + command + result”
The Evidence Owner finishes the kit at the first calm moment after containment
This way, “putting out the fire” and “producing evidence” run at the same time.
Conclusion
The evidence collection kit does not slow down an incident; when set up correctly, it speeds it up. Because in the panic moment, the team does not stop to ask “what should we collect?” — the standard runs on its own. Operational leadership starts here: designing not only the system, but the system that manages the event itself.
Paylaş:
Bu yazı faydalı oldu mu?
Yükleniyor...
Geri bildiriminiz için teşekkürler!
Bu yazı nasıldı?
Frequently Asked Questions
Common questions readers have about this article.
How do I realistically implement these three roles in a small team of only two or three engineers?
I often get asked if this structure is overkill for tiny teams. In my experience, it is actually more critical there because the risk of 'tunnel vision' is much higher. You do not necessarily need three separate physical people; you need three distinct mindsets. When I’m in a small group, I designate one person as the Incident Commander who also handles the Scribe duties, while the engineer actually touching the systems acts as the Evidence Owner. The goal is to stop the 'cowboy' fix where someone changes a config, restarts a service, and then realizes they have no idea what the original state was. By assigning these responsibilities explicitly, I ensure that even in a rush, someone is responsible for hitting 'copy' on that log file or taking a snapshot before the service is wiped.
Doesn't the process of collecting evidence significantly delay the 'Return to Service' during a critical outage?
There’s a common fear that evidence collection delays recovery, but I’ve seen the opposite play out many times. Someone skips a snapshot to save five minutes, the fix fails, and now they’ve lost the original state and spend five hours rebuilding from scratch. I treat evidence collection as a core part of the recovery process, not a side quest. My rule of thumb is 'collect while you investigate.' If you are already looking at a log, pipe it to a persistent file immediately. If you are about to change a configuration, back it up. This 'minimal but correct' approach actually speeds up recovery because it provides a solid ground to retreat to if your initial hypothesis is wrong. It turns a guessing game into a controlled operation.
What is the most common mistake you see regarding time standards during an active incident?
The biggest trap I see in the field is ignoring the time standard until the postmortem starts. I have spent countless frustrating hours trying to correlate logs where the application server was in UTC, the database was in local time, and the load balancer was skewed by forty seconds. I always start my incident response by verifying NTP or Chrony health across the fleet. If you cannot trust your timestamps, your evidence is just a pile of disconnected stories that don't line up. I insist on UTC across the board for all records. It is not just a technical preference; it is a survival mechanism. When you are under extreme pressure, you do not want to be doing mental math to figure out if a database lock happened before or after a specific API call.
Is it really necessary to use a full Evidence Collection Kit for minor performance glitches?
People often tell me, 'This is just a minor performance glitch, we don't need a forensic kit.' That is a dangerous myth. I have seen many 'performance issues' that were actually signs of a data exfiltration attempt or a misconfigured scraper hitting a sensitive endpoint. If you treat every incident as a potential security event, you build a professional muscle for data integrity. My Evidence Collection Kit is designed to be lightweight enough for a routine memory leak but robust enough for a breach investigation. By the time you realize an incident actually *is* a security event, it is usually too late to start collecting the data you need. Start early, keep it minimal, and you will never be caught empty-handed during a postmortem, regardless of the incident's root cause.
ME
Mustafa Erbay
Sistem Mimarisi · Network Uzmanı · Altyapı, Güvenlik ve Yazılım
2006'dan bu yana sistem mimarisi, network, sunucu altyapıları, büyük yapıların kurulumu, yazılım ve sistem güvenliği
ekseninde çalışıyorum. Bu blogda sahada karşılığı olan teknik deneyimlerimi paylaşıyorum.
Kişisel Notlar
Bu notlar sadece sizde saklanır. Tarayıcınızda yerel olarak tutulur.
Hazır0 karakter
Comments
Server-side AI Moderation
Comments are AI-moderated server-side and stored permanently.
?
0/2000
Server-side AI moderation
No comments yet. Be the first!
✉️Free · No spam · Unsubscribe anytime
Curated digest, hand-picked by me — not the AI
Once a week: the most important post of the week, behind-the-scenes notes, and a "what I actually used this week" section. Less noise, more signal.
📌
Best of the weekSingle most-worth-reading post
🔧
Toolbox notesReal tools I used this week
🧠
Behind-the-scenesNotes that don't make it to blog
We don't spam. Unsubscribe anytime. · Tracked only by Umami (self-hosted, no Google).