Linux kdump: Kernel Panic Crash Dump and Triage Runbook

When a kernel panic hits in production, I usually see one of two extremes: “the box rebooted itself, moving on” or “we spent hours trying to find the cause.” Especially with storage drivers, NIC offload, kernel upgrades, or hardware-induced issues, the most valuable artefact you can leave behind after the panic is the vmcore.

At panic time, kdump uses kexec to boot a second, pre-loaded capture kernel, which saves the crashed kernel's memory as a vmcore. This post is not a “just install it” guide; my goal is to wire up dump capture, retention, regular tests, and the incident workflow as one coherent system.

Prerequisites: make kdump sustainable in production

Accept three facts about kdump before anything else:

The dump can be huge (depends on RAM size)
Keeping the dump only on local disk is not always safe
Panics happen “rarely,” so if you don’t test kdump, it may simply not work when you need it

So your initial goals should be:

decide on the dump destination (local + central copy)
set a disk budget and retention policy
write down a regular test plan
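To put a first number on the disk budget, a rough worst-case check is to compare total RAM with the free space on the dump target. The sketch below assumes /var/crash as the local target and treats an unfiltered, uncompressed dump as roughly RAM-sized; dump_budget is a name made up for this example, not part of any kdump tooling.

```shell
# Sketch: compare total RAM with free space on the dump target.
# Assumptions: /var/crash is the local target; worst case is a RAM-sized dump.
dump_budget() {
    crash_dir="${1:-/var/crash}"
    meminfo="${2:-/proc/meminfo}"

    # MemTotal is reported in KiB; df -Pk reports available space in KiB too.
    ram_kib=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
    free_kib=$(df -Pk "$crash_dir" | awk 'NR==2 {print $4}')
    if [ -z "$ram_kib" ] || [ -z "$free_kib" ]; then
        echo "could not read RAM size or free space" >&2
        return 2
    fi

    echo "RAM: $((ram_kib / 1024)) MiB, free under $crash_dir: $((free_kib / 1024)) MiB"
    if [ "$free_kib" -gt "$ram_kib" ]; then
        echo "OK: an unfiltered dump would fit"
    else
        echo "WARN: an unfiltered dump may not fit; rely on filtering/compression or a larger target"
        return 1
    fi
}
```

Calling dump_budget with no arguments checks the running host. Filtered and compressed dumps are usually much smaller than RAM, so a WARN here is a prompt to look at the collector settings, not proof that capture will fail.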

1) Crash kernel reservation (crashkernel)

For kdump to work, a slice of memory has to be reserved for the “crash kernel.” This is usually done via a kernel boot parameter.

Approach (varies by distro):

Set the crashkernel= parameter in the bootloader
After reboot, verify the reservation actually happened

To verify:

cat /proc/cmdline
dmesg | grep -iE "crashkernel|reserved"
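The kernel also exposes the size of the reservation in /sys/kernel/kexec_crash_size (in bytes), which is easier to check from a script than dmesg output. A minimal sketch, with crashkernel_report as a made-up helper name and both file paths overridable for testing:

```shell
# Sketch: report whether crashkernel= was requested and actually reserved.
crashkernel_report() {
    cmdline_file="${1:-/proc/cmdline}"
    size_file="${2:-/sys/kernel/kexec_crash_size}"

    # Pull the crashkernel=... token out of the kernel command line, if any.
    requested=$(tr ' ' '\n' < "$cmdline_file" | grep '^crashkernel=' | tail -n 1 || true)
    if [ -z "$requested" ]; then
        echo "crashkernel: not set on the kernel command line"
        return 1
    fi

    # kexec_crash_size holds the reserved size in bytes; 0 means nothing reserved.
    reserved_bytes=$(cat "$size_file" 2>/dev/null || echo 0)
    if [ "$reserved_bytes" -gt 0 ] 2>/dev/null; then
        echo "$requested -> reserved $((reserved_bytes / 1024 / 1024)) MiB"
    else
        echo "$requested -> requested but nothing reserved"
        return 1
    fi
}
```

A parameter that is present but yields a zero reservation usually deserves a look at the boot log, since the kernel may have been unable to find a suitable memory region.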

2) Installing and enabling the kdump service

Package names vary across distros, but the logic is the same:

Install kdump/kexec tooling
Enable the kdump service
Configure the dump destination (local disk / NFS / SSH / object storage gateway)

Status check:

systemctl status kdump || true
kdumpctl status || true
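Service status depends on the distro's tooling (kdumpctl, for instance, is not present everywhere). A distro-independent cross-check is /sys/kernel/kexec_crash_loaded, which reads 1 once a capture kernel has been loaded. A small sketch, with kdump_armed as a hypothetical helper name:

```shell
# Sketch: check whether a capture kernel is currently loaded.
kdump_armed() {
    loaded_file="${1:-/sys/kernel/kexec_crash_loaded}"

    # The file contains 1 when a crash kernel is loaded, 0 otherwise.
    if [ "$(cat "$loaded_file" 2>/dev/null)" = "1" ]; then
        echo "crash kernel loaded: kdump is armed"
    else
        echo "crash kernel NOT loaded: a panic would not produce a dump"
        return 1
    fi
}
```

Because it returns non-zero when nothing is loaded, this kind of check fits naturally into a monitoring probe or a post-boot health check.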

3) Dump destination: not “the only place I can write to” but “a place I can recover from”

The two-tier model I have found safe in the field:

Primary: local disk (fast write, ready right after reboot)
Secondary: central store (NFS/SSH), reachable for incident analysis

If you do use a central destination:

Restrict the network path (only dump traffic)
Harden credential / key management
Assume the dump may contain PII (access controls and retention)
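One way to get the second tier is to let kdump write locally and copy new dumps to the central store after the reboot. The sketch below assumes the central store is already mounted as a directory (for example an NFS export) and that each crash lands in its own subdirectory of /var/crash; ship_dumps and the .shipped marker file are inventions of this example.

```shell
# Sketch: copy not-yet-shipped dump directories to a mounted central store.
ship_dumps() {
    src="${1:-/var/crash}"
    dest="${2:-}"   # hypothetical: mount point of the central store
    rc=0

    if [ ! -d "$src" ] || [ -z "$dest" ] || [ ! -d "$dest" ]; then
        echo "ship_dumps: source or destination directory missing" >&2
        return 1
    fi

    for dir in "$src"/*/; do
        [ -d "$dir" ] || continue            # glob matched nothing
        [ -e "$dir/.shipped" ] && continue   # copied on an earlier run
        name=$(basename "$dir")
        target="$dest/$(uname -n)/$name"

        # Copy, then strip group/other access because dumps may contain PII.
        mkdir -p "$target" \
            && cp -a "$dir/." "$target/" \
            && chmod -R go-rwx "$target" \
            && touch "$dir/.shipped" \
            && echo "shipped $name" \
            || { echo "failed to ship $name" >&2; rc=1; }
    done
    return "$rc"
}
```

Run from a boot-time unit or a timer, for example as ship_dumps /var/crash /mnt/crash-archive, where the mount point is again a placeholder. Shipping over SSH instead would replace the cp step, but the marker-file idea stays the same.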

4) Test: trigger a controlled panic and verify

You don’t want the “first panic” in production to also be your first kdump experience. Run a controlled test in a pilot environment:

echo 1 | sudo tee /proc/sys/kernel/sysrq
echo c | sudo tee /proc/sysrq-trigger

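The system panics immediately, with no clean shutdown and no filesystem sync. If kdump is armed, the capture kernel saves the dump and then reboots. Afterwards, check whether the dump file was actually produced: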

ls -lah /var/crash || true
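A directory listing is easy to misread when old dumps are still lying around, so a test plan benefits from a check that only accepts a fresh, non-empty dump. The sketch below assumes the dump file is named vmcore or dump.* (naming differs between distros) and uses recent_dumps as a made-up helper name.

```shell
# Sketch: succeed only if a non-empty dump newer than N minutes exists.
# Assumption: dump files are named "vmcore" or "dump.*"; adjust per distro.
recent_dumps() {
    crash_dir="${1:-/var/crash}"
    max_age_min="${2:-60}"

    found=$(find "$crash_dir" -type f \( -name 'vmcore' -o -name 'dump.*' \) \
                 -size +0 -mmin "-$max_age_min" 2>/dev/null || true)
    if [ -z "$found" ]; then
        echo "no dump newer than $max_age_min minutes under $crash_dir"
        return 1
    fi
    printf '%s\n' "$found"
}
```

After a controlled panic, recent_dumps /var/crash 30 either prints the path of the new dump or fails, which makes the result of the test easy to record.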

5) Triage runbook: the first 30 minutes with vmcore

After a panic, the first goal is not to “solve the root cause right now”; it is to preserve the evidence and classify the event.

My 30-minute flow:

Confirm the dump file exists (locally and/or centrally)
Note the kernel version / build ID / uptime
Check recent changes: kernel update, NIC/driver, firmware, workload shift
If it has happened before: same ring? same hardware?
Prepare the “crash” environment for offline vmcore analysis

Initial evidence bundle:

vmcore
kernel log from the crashed boot (journalctl -k -b -1; plain dmesg after the reboot only shows the new boot)
uname -a
change record (change/ticket)
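Collecting the bundle by hand under incident pressure tends to miss pieces, so it helps to script it. The following is a sketch under a few assumptions: the journal is persistent (otherwise the previous boot's log is not available), the host rebooted into the same kernel that crashed, and collect_evidence plus the output file names are made up for this example.

```shell
# Sketch: gather the non-vmcore parts of the evidence bundle into one directory.
collect_evidence() {
    out_dir="$1"
    vmcore_path="${2:-}"

    mkdir -p "$out_dir" || return 1

    # Kernel identity of the running system (assumed to match the crashed kernel).
    uname -a > "$out_dir/uname.txt"
    cat /proc/cmdline > "$out_dir/cmdline.txt" 2>/dev/null || true

    # Kernel messages from the previous boot; needs a persistent journal.
    journalctl -k -b -1 --no-pager > "$out_dir/kernel-log-previous-boot.txt" 2>/dev/null \
        || echo "previous-boot journal not available" > "$out_dir/kernel-log-previous-boot.txt"

    # Record size and checksum of the vmcore rather than copying it around.
    if [ -n "$vmcore_path" ] && [ -f "$vmcore_path" ]; then
        ls -l "$vmcore_path" > "$out_dir/vmcore-info.txt"
        if command -v sha256sum >/dev/null 2>&1; then
            sha256sum "$vmcore_path" >> "$out_dir/vmcore-info.txt"
        fi
    fi
    echo "evidence written to $out_dir"
}
```

The change record still has to be attached by a human or by the ticketing integration. For the analysis step itself, the crash utility needs a vmlinux with debug symbols that matches the crashed kernel, and where that file lives depends on the distro's debuginfo packaging.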

6) Production discipline: retention and cost

Dumps can be large. A policy I recommend:

Keep only the latest dump on each host
7-30 days retention in the central store
For critical systems, automatically attach the dump to the incident “evidence bundle”
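The per-host part of this policy can be enforced with a small pruning job. The sketch below assumes one subdirectory per crash under /var/crash and well-behaved directory names (it sorts with ls, which is fragile for names containing newlines); prune_dumps is a hypothetical helper, and it defaults to a dry run so that nothing is deleted by accident.

```shell
# Sketch: keep the newest N dump directories, remove (or list) the rest.
prune_dumps() {
    crash_dir="${1:-/var/crash}"
    keep="${2:-1}"
    mode="${3:-dry-run}"   # pass "delete" to actually remove directories

    # ls -t sorts newest first; tail skips the first $keep entries.
    ls -1dt "$crash_dir"/*/ 2>/dev/null | tail -n "+$((keep + 1))" |
    while IFS= read -r old; do
        if [ "$mode" = "delete" ]; then
            rm -rf -- "$old" && echo "removed $old"
        else
            echo "would remove $old"
        fi
    done
}
```

Running prune_dumps /var/crash 1 prints what would go; adding delete as the third argument removes it. Pruning only after the dump has reached the central store keeps the two tiers from losing the same evidence at once.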

Wrap-up

kdump turns a kernel panic from an “invisible reboot” into an analyzable event. The real value comes when crashkernel sizing, dump destination, test plan, and the triage runbook are designed together. If you want continuity in production, you have to build infrastructure that can leave evidence behind even for the worst class of bugs.
