When a kernel panic hits in production, I usually see one of two extremes: “the box rebooted itself, moving on” or “we spent hours trying to find the cause.” Especially with storage drivers, NIC offload, kernel upgrades, or hardware-induced issues, the most valuable artefact you can leave behind after the panic is the vmcore.
kdump boots a second kernel at panic time, captures a memory dump, and leaves “evidence.” This post is not a “just install it” guide; my goal is wiring up dump capture, retention, regular tests, and the incident workflow as one coherent thing.
Prerequisites: make kdump sustainable in production
Accept three facts about kdump before anything else:
- The dump can be huge (depends on RAM size)
- Keeping the dump only on local disk is not always safe
- Panics happen “rarely,” so if you don’t test kdump, it may simply not work when you need it
So your initial goals should be:
- decide on the dump destination (local + central copy)
- set a disk budget and retention policy
- write down a regular test plan
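The disk budget can be sanity-checked against RAM size before anything else is configured. This is a minimal sketch, assuming the worst case is an unfiltered, uncompressed dump roughly as large as RAM (filtered or compressed dumps are usually much smaller) and assuming /var/crash is the local destination. The helper names dump_fits, ram_kb, and free_kb are hypothetical.

```shell
#!/bin/sh
# dump_fits RAM_KB FREE_KB: succeed if FREE_KB can hold a worst-case dump.
# Assumption: the worst case is an uncompressed dump about as large as RAM.
dump_fits() {
    [ "$2" -ge "$1" ]
}

# Total RAM in KiB, read from /proc/meminfo.
ram_kb() {
    awk '/^MemTotal:/ {print $2}' /proc/meminfo
}

# Free space in KiB on the filesystem that holds the given directory.
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Usage (the path is an assumption):
#   dump_fits "$(ram_kb)" "$(free_kb /var/crash)" || echo "no room for a worst-case dump"
```

If the check fails on a large-memory host, that is usually the cue to plan for dump filtering or a remote destination rather than more local disk.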
1) Crash kernel reservation (crashkernel)
For kdump to work, a slice of memory has to be reserved for the “crash kernel.” This is usually done via a kernel boot parameter.
Approach (varies by distro):
- Set the crashkernel= parameter in the bootloader
- After reboot, verify the reservation actually happened
To verify:
cat /proc/cmdline
dmesg | rg -i "crashkernel|reserved"
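The same verification can be scripted for fleet-wide checks. This is a sketch with two hypothetical helpers: crashkernel_param extracts the crashkernel= value from a command-line string, and reserved_bytes reads /sys/kernel/kexec_crash_size, which the kernel exposes as the number of bytes reserved for the crash kernel (0 when nothing was reserved).

```shell
#!/bin/sh
# crashkernel_param CMDLINE: print the value of crashkernel= from a kernel
# command-line string. Prints nothing if the parameter is absent; if it is
# repeated, the last occurrence is printed.
crashkernel_param() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^crashkernel=//p' | tail -n 1
}

# reserved_bytes [FILE]: print the bytes reserved for the crash kernel.
# FILE defaults to the sysfs entry and is overridable only for testing.
reserved_bytes() {
    cat "${1:-/sys/kernel/kexec_crash_size}"
}

# Usage:
#   crashkernel_param "$(cat /proc/cmdline)"
#   [ "$(reserved_bytes)" -gt 0 ] || echo "crashkernel memory was not reserved"
```

A parameter that is present on the command line but paired with a reservation of 0 bytes is the case worth alerting on, since the setting looks right but kdump cannot work.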
2) Installing and enabling the kdump service
Package names vary across distros, but the logic is the same:
- Install kdump/kexec tooling
- Enable the kdump service
- Configure the dump destination (local disk / NFS / SSH / object storage gateway)
Status check:
systemctl status kdump || true
kdumpctl status || true
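Service status varies by distro, so it helps to also ask the kernel directly. The sketch below reads /sys/kernel/kexec_crash_loaded, which reports 1 when a crash kernel is loaded and ready. The function name crash_kernel_loaded is hypothetical.

```shell
#!/bin/sh
# crash_kernel_loaded [FILE]: succeed if the kernel reports a loaded crash kernel.
# FILE defaults to the sysfs entry and is overridable only for testing.
crash_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}" 2>/dev/null)" = "1" ]
}

# Usage:
#   crash_kernel_loaded || echo "kdump is not armed on $(hostname)"
```

This check is distro-independent, which makes it a reasonable candidate for a monitoring probe.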
3) Dump destination: not “the only place I can write to” but “a place I can recover from”
The two-tier model I have found safe in the field:
- Primary: local disk (fast write, ready right after reboot)
- Secondary: central store (NFS/SSH), reachable for incident analysis
If you do use a central destination:
- Restrict the network path (only dump traffic)
- Harden credential / key management
- Assume the dump may contain PII (access controls and retention)
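One way to get the secondary copy is to ship the newest local dump after the host comes back up. This is a sketch, not a statement of how kdump itself transfers dumps. It assumes each dump lives in its own subdirectory of the local crash directory and that rsync over SSH is available. The target kdump@dumps.example.internal:/srv/vmcore/ is a made-up placeholder, and both function names are hypothetical.

```shell
#!/bin/sh
# newest_dump_dir DIR: print the most recently modified subdirectory of DIR,
# without a trailing slash. Prints an empty line if there is none.
newest_dump_dir() {
    d=$(ls -1dt "$1"/*/ 2>/dev/null | head -n 1)
    printf '%s\n' "${d%/}"
}

# ship_dump SRC_DIR DEST: copy one dump directory to the central store.
# DEST is any rsync target; --partial lets an interrupted transfer resume.
ship_dump() {
    rsync -a --partial "$1" "$2"
}

# Usage (host and paths are placeholders):
#   ship_dump "$(newest_dump_dir /var/crash)" kdump@dumps.example.internal:/srv/vmcore/
```

A dedicated SSH key restricted to this single transfer on the receiving side fits the hardening points above.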
4) Test: trigger a controlled panic and verify
You don’t want the “first panic” in production to also be your first kdump experience. Run a controlled test in a pilot environment:
echo 1 | sudo tee /proc/sys/kernel/sysrq
echo c | sudo tee /proc/sysrq-trigger
The system will panic and reboot. Afterwards, check whether the dump file was actually produced:
ls -lah /var/crash || true
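A directory listing is easy to misread, so the test can end with an explicit pass/fail check. This sketch assumes the dump file is named vmcore, as on RHEL-family systems; other distros may use a different file name, in which case the pattern needs adjusting. The function name dump_present is hypothetical.

```shell
#!/bin/sh
# dump_present DIR: succeed if DIR contains at least one non-empty file
# named "vmcore" anywhere below it.
dump_present() {
    [ -n "$(find "$1" -type f -name vmcore -size +0c 2>/dev/null | head -n 1)" ]
}

# Usage:
#   dump_present /var/crash && echo "kdump test passed" || echo "no dump found"
```

The exact name match is deliberate: a companion text log or an incomplete dump file left in the same directory does not count as a pass.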
5) Triage runbook: the first 30 minutes with vmcore
After a panic, the first goal is not “solve the root cause right now”; it is to preserve the evidence and classify the event.
My 30-minute flow:
- Confirm the dump file exists (locally and/or centrally)
- Note the kernel version / build ID / uptime
- Check recent changes: kernel update, NIC/driver, firmware, workload shift
- If it has happened before: same ring? same hardware?
- Prepare the “crash” environment for offline vmcore analysis
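For the last step, the crash utility needs a debug vmlinux that matches the kernel that panicked. This sketch assumes the RHEL-family debuginfo layout under /usr/lib/debug; other distros place the file elsewhere. The helper name debug_vmlinux is hypothetical, and the dump directory in the usage line is a placeholder.

```shell
#!/bin/sh
# debug_vmlinux RELEASE: print the path where RHEL-family kernel debuginfo
# packages install the debug vmlinux for the given kernel release.
# The layout is an assumption; verify it on your distro.
debug_vmlinux() {
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}

# Usage (RELEASE must be the kernel that panicked, not necessarily the
# one running now):
#   crash "$(debug_vmlinux "$(uname -r)")" /var/crash/<dump-dir>/vmcore
```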
Initial evidence bundle:
- vmcore
- dmesg / journalctl -b -1
- uname -a
- change record (change/ticket)
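Most of this bundle can be collected by a small script right after reboot. The sketch below writes uname output and the previous boot's journal into a directory, and records a checksum of the vmcore instead of copying it. It assumes a persistent journal, since journalctl -b -1 returns nothing without one. The function name and output file names are hypothetical, and the change record still has to be attached by hand.

```shell
#!/bin/sh
# collect_evidence OUT_DIR [VMCORE]: gather the initial evidence bundle.
collect_evidence() {
    out=$1
    mkdir -p "$out" || return 1

    # Kernel version and build string of the currently running kernel.
    uname -a > "$out/uname.txt"

    # Logs from the previous boot, i.e. the one that panicked.
    # Skipped when journalctl is not installed; errors are ignored.
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -b -1 --no-pager > "$out/journal-prev-boot.txt" 2>/dev/null || true
    fi

    # Record a checksum rather than copying a potentially huge file.
    if [ -n "${2:-}" ] && [ -f "$2" ]; then
        sha256sum "$2" > "$out/vmcore.sha256"
    fi
}

# Usage (paths are placeholders):
#   collect_evidence /var/tmp/incident-bundle /var/crash/<dump-dir>/vmcore
```

Note that uname -a describes the kernel running now; if the host rebooted into a different kernel, the panicking kernel's version has to come from the dump or the previous boot's logs.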
6) Production discipline: retention and cost
Dumps can be large. A policy I recommend:
- Keep only the most recent dump on each host
- 7-30 days retention in the central store
- For critical systems, automatically attach the dump to the incident “evidence bundle”
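The first two rules are simple enough to enforce from a cron job or systemd timer. This is a sketch under two assumptions: every dump sits in its own subdirectory, and the directory's modification time approximates the dump time. Both function names are hypothetical, and -mindepth/-maxdepth rely on GNU or BSD find.

```shell
#!/bin/sh
# prune_local DIR KEEP: delete all but the KEEP newest dump directories in DIR.
prune_local() {
    # Refuse to run on an empty or missing path so the glob can never hit "/".
    [ -n "$1" ] && [ -d "$1" ] || return 1
    ls -1dt "$1"/*/ 2>/dev/null | tail -n +"$(( $2 + 1 ))" | while IFS= read -r d; do
        rm -rf -- "${d%/}"
    done
}

# prune_central DIR DAYS: delete dump directories directly under DIR that
# were last modified more than DAYS days ago.
prune_central() {
    [ -n "$1" ] && [ -d "$1" ] || return 1
    find "$1" -mindepth 1 -maxdepth 1 -type d -mtime +"$2" -exec rm -rf -- {} +
}

# Usage (paths and values are placeholders):
#   prune_local /var/crash 1
#   prune_central /srv/vmcore/myhost 30
```

Dumps attached to an open incident should be moved out of the pruned directory first, otherwise the retention job will remove evidence that is still in use.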
Wrap-up
kdump turns a kernel panic from an “invisible reboot” into an analyzable event. The real value comes when crashkernel sizing, dump destination, test plan, and the triage runbook are designed together. If you want continuity in production, you have to build infrastructure that can leave evidence behind even for the worst class of bugs.