Career · April 13, 2026 · 6 min read

Evidence Collection Kit and Roles During an Incident

An evidence set, time standard, role assignment, and practical checklist to break the panic-driven 'SSH into one server' reflex.

#operations #security #incident #leadership #runbook #observability

There are two extremes during an incident: either there is no data at all, or you drown the system in logs and packet captures by trying to “collect everything.” Both end the same way: in the postmortem, the question “what exactly happened?” gets answered with interpretation rather than evidence.

What works best for me in the field is an Evidence Collection Kit combined with explicit role assignment. Together they don’t just put out the fire; they also make the next incident cheaper.

What are we aiming for?

Preserving evidence while resolving the incident
Basing post-incident improvements on evidence, not luck
Keeping the initial data ready in case a security (forensics) need arises

Role assignment: three people, three responsibilities

Even in small teams, this separation pays off:

Incident Commander (IC): makes decisions, sets priorities, manages communication
Scribe: keeps the timeline, records the actions taken and their outcomes
Evidence Owner: runs the evidence checklist (logs/snapshots/configs)

When all three roles are merged into one person, the evidence side is usually the part that gets lost.

Evidence Collection Kit: “minimal but reproducible”

I split the kit into 6 pieces. For each piece, the answer to “why does this exist?” is clear.

1) Time standard

NTP/Chrony health check on all systems
Collect all records during the incident in UTC, or in one agreed time standard

Quick check:

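# Show local clock, time zone, and NTP synchronization state
timedatectl status
# Show chrony offset and stratum; skipped quietly where chrony is not installed
chronyc tracking 2>/dev/null || true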
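
To turn the quick check into a pass/fail signal, one option is to parse the "System time" line that chronyc tracking prints. The sketch below is a minimal example, not a standard tool: clock_offset_ok is a hypothetical helper name, and the 0.1 second default threshold is an assumption to tune for your own fleet.

```shell
# Hypothetical helper: reads `chronyc tracking` output on stdin.
# Exit 0 if the absolute clock offset is at or below the threshold (seconds),
# exit 1 if it is larger, exit 2 if no "System time" line was seen.
clock_offset_ok() {
  local max_offset="${1:-0.1}"   # assumed default threshold, not a standard
  awk -v max="$max_offset" '
    /^System time/ {
      # Line format: "System time     : 0.000012345 seconds fast of NTP time"
      offset = $4 + 0
      found = 1
    }
    END {
      if (!found) exit 2
      exit ((offset <= max) ? 0 : 1)
    }'
}
```

For example, chronyc tracking | clock_offset_ok 0.05 || echo "check clock on $(hostname)" flags a host whose offset exceeds 50 ms, and also a host where chrony returned no tracking data.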

2) Change evidence

The most common root cause: “something just changed.”

Collect:

Deploy/CI/CD log (commit SHA, pipeline id)
Config management changes
Firewall/LB/policy changes
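
If your deployments are driven from a Git repository, a short helper can capture what changed in the incident window. This is a sketch under that assumption: recent_commits, the repository path, and the two-hour default window are illustrative. Pipeline ids live in your CI system and are not covered here.

```shell
# Hypothetical helper: list commits in a repository since a given time.
# Output is tab-separated: full SHA, committer date (strict ISO 8601), subject.
# Usage: recent_commits <repo_path> [since]
recent_commits() {
  local repo="$1" since="${2:-2 hours ago}"
  git -C "$repo" log --since="$since" --format='%H%x09%cI%x09%s'
}
```

A call such as recent_commits /srv/deploy-repo "3 hours ago" > changes.tsv (the path is hypothetical) gives the Evidence Owner a file to attach to the timeline. Note that the committer date carries the committer's own UTC offset, so normalize it before correlating with other records.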

3) Access evidence

Collect:

Bastion/SSO audit
Privileged command records (from auditd, not shell history)
Break-glass usage (who, when, why)

4) Service evidence

Collect:

Error rate, latency, saturation (SLO/SLI)
The most critical dashboard screenshot or export
If there is an alarm storm, a note of “which are symptoms, which are causes”

5) Host evidence

Collect:

Baseline CPU/memory/disk/net metrics
Kernel logs from the last 30–60 minutes
Single-line critical signals such as OOM kills, a full conntrack table, or filesystem errors

Practical:

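# Load averages and time since boot
uptime
# Memory and swap usage in MiB
free -m
# Disk usage per filesystem
df -h
# Last 200 kernel messages with human-readable timestamps
dmesg -T | tail -n 200
# Last 200 journal lines from the past 60 minutes
journalctl --since "-60m" --no-pager | tail -n 200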
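
Output that only scrolls past in a terminal is not evidence yet. A minimal sketch of persisting it follows, assuming a writable evidence directory and the coreutils sha256sum binary; capture and seal_evidence are hypothetical helper names, not part of any standard tool.

```shell
# Hypothetical helper: run one command and store its output as evidence.
# Usage: capture <evidence_dir> <name> <command> [args...]
capture() {
  local dir="$1" name="$2" status=0
  shift 2
  mkdir -p "$dir"
  {
    # Header: collection time in UTC plus the exact command line
    printf '# collected_utc=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf '# command=%s\n' "$*"
    "$@" 2>&1 || status=$?
    printf '# exit_status=%s\n' "$status"
  } > "$dir/$name.txt"
}

# Hypothetical helper: record SHA-256 checksums of the collected files
# so that later modifications are detectable.
seal_evidence() {
  local dir="$1"
  ( cd "$dir" && sha256sum ./*.txt > SHA256SUMS )
}
```

Usage could look like capture "$dir" uptime uptime followed by capture "$dir" dmesg dmesg -T and, at the end, seal_evidence "$dir". Commands that need a pipe can be wrapped, for example capture "$dir" journal sh -c 'journalctl --since "-60m" --no-pager | tail -n 200'.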

6) Config snapshot (config evidence)

Goal: freeze “the current state.”

Edge: routing table / policy snapshot
Application: feature flag state, env/secret version (the version, not the value)
Infra: LB pool members, weights, health-check state

How does it plug into the incident flow?

The most practical model:

The IC makes the containment decision in the first 5 minutes
The Scribe writes every action as “timestamp + command + result”
The Evidence Owner finishes the kit at the first calm moment after containment

This way, “putting out the fire” and “producing evidence” run at the same time.
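
The Scribe's “timestamp + command + result” record can be partly automated for commands run from a shell. The sketch below is one possible shape, assuming a shared timeline file; log_action is a hypothetical helper name, and it records only the exit status, so the human-readable outcome still has to be written by the Scribe.

```shell
# Hypothetical helper: run a command and append one tab-separated record
# (UTC start time, command line, exit status) to the timeline file.
# Usage: log_action <timeline_file> <command> [args...]
log_action() {
  local timeline="$1" status=0 started
  shift
  started="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  "$@" || status=$?
  printf '%s\t%s\texit=%s\n' "$started" "$*" "$status" >> "$timeline"
  return "$status"
}
```

For example, log_action timeline.tsv systemctl restart myapp (the service name is illustrative) leaves a line in timeline.tsv whether the restart succeeds or fails, and passes the original exit status through to the caller.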

Conclusion

The evidence collection kit does not slow down an incident; when set up correctly, it speeds it up. In the moment of panic, the team does not stop to ask “what should we collect?” because the standard runs on its own. Operational leadership starts here: designing not only the system, but also the system that manages the event itself.

Frequently Asked Questions

Common questions readers have about this article.

How do I realistically implement these three roles in a small team of only two or three engineers?

I often get asked if this structure is overkill for tiny teams. In my experience, it is actually more critical there because the risk of 'tunnel vision' is much higher. You do not necessarily need three separate physical people; you need three distinct mindsets. When I’m in a small group, I designate one person as the Incident Commander who also handles the Scribe duties, while the engineer actually touching the systems acts as the Evidence Owner. The goal is to stop the 'cowboy' fix where someone changes a config, restarts a service, and then realizes they have no idea what the original state was. By assigning these responsibilities explicitly, I ensure that even in a rush, someone is responsible for hitting 'copy' on that log file or taking a snapshot before the service is wiped.

Doesn't the process of collecting evidence significantly delay the 'Return to Service' during a critical outage?

There’s a common fear that evidence collection delays recovery, but I’ve seen the opposite play out many times. Someone skips a snapshot to save five minutes, the fix fails, and now they’ve lost the original state and spend five hours rebuilding from scratch. I treat evidence collection as a core part of the recovery process, not a side quest. My rule of thumb is 'collect while you investigate.' If you are already looking at a log, pipe it to a persistent file immediately. If you are about to change a configuration, back it up. This 'minimal but reproducible' approach actually speeds up recovery because it provides a solid ground to retreat to if your initial hypothesis is wrong. It turns a guessing game into a controlled operation.
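
The "back it up before you change it" habit can be reduced to a single command. The following is a minimal sketch, assuming configuration lives in regular files; snapshot_config and the evidence directory layout are hypothetical choices, not an established convention.

```shell
# Hypothetical helper: copy a config file into the evidence directory
# under a UTC-stamped name before it is edited, and print the copy's path.
# Usage: snapshot_config <evidence_dir> <config_file>
snapshot_config() {
  local dir="$1" src="$2" stamp dest
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  dest="$dir/$(basename "$src").$stamp"
  mkdir -p "$dir" || return 1
  cp -p "$src" "$dest" || return 1   # -p keeps mode and modification time
  printf '%s\n' "$dest"
}
```

Running snapshot_config ./evidence /etc/myapp/app.conf (an illustrative path) before the edit gives you the original state to diff against or restore if the first hypothesis turns out to be wrong.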

What is the most common mistake you see regarding time standards during an active incident?

The biggest trap I see in the field is ignoring the time standard until the postmortem starts. I have spent countless frustrating hours trying to correlate logs where the application server was in UTC, the database was in local time, and the load balancer was skewed by forty seconds. I always start my incident response by verifying NTP or Chrony health across the fleet. If you cannot trust your timestamps, your evidence is just a pile of disconnected stories that don't line up. I insist on UTC across the board for all records. It is not just a technical preference; it is a survival mechanism. When you are under extreme pressure, you do not want to be doing mental math to figure out if a database lock happened before or after a specific API call.

Is it really necessary to use a full Evidence Collection Kit for minor performance glitches?

People often tell me, 'This is just a minor performance glitch, we don't need a forensic kit.' That is a dangerous myth. I have seen many 'performance issues' that were actually signs of a data exfiltration attempt or a misconfigured scraper hitting a sensitive endpoint. If you treat every incident as a potential security event, you build a professional muscle for data integrity. My Evidence Collection Kit is designed to be lightweight enough for a routine memory leak but robust enough for a breach investigation. By the time you realize an incident actually is a security event, it is usually too late to start collecting the data you need. Start early, keep it minimal, and you will never be caught empty-handed during a postmortem, regardless of the incident's root cause.

Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have been working across system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that holds up in the field.
