Mustafa Erbay

Linux kdump: Kernel Panic Crash Dump and Triage Runbook

Walks through kdump installation, validation and a sustainable production dump retention flow so you can capture vmcore and triage quickly when a kernel panics.


When a kernel panic hits in production, I usually see one of two extremes: “the box rebooted itself, moving on” or “we spent hours trying to find the cause.” When the cause lies in a storage driver, NIC offload, a kernel upgrade, or the hardware itself, the most valuable artefact you can leave behind after the panic is the vmcore.

At panic time, kdump uses kexec to boot a second kernel from memory reserved at boot, captures a memory dump, and leaves evidence behind. This post is not a “just install it” guide; my goal is to wire up dump capture, retention, regular tests, and the incident workflow as one coherent whole.

Prerequisites: make kdump sustainable in production

Accept three facts about kdump before anything else:

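  • The dump can be huge: an unfiltered vmcore is roughly the size of RAM, although a filtering collector such as makedumpfile can shrink it considerably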
  • Keeping the dump only on local disk is not always safe
  • Panics happen “rarely,” so if you don’t test kdump, it may simply not work when you need it

So your initial goals should be:

  • decide on the dump destination (local + central copy)
  • set a disk budget and retention policy
  • write down a regular test plan
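To make the disk budget concrete, here is a minimal sketch that compares total RAM with the free space under the dump directory. It assumes the worst case (an unfiltered dump as large as RAM) and uses /var/crash as an assumed default path; the function names are hypothetical.

```shell
# Hypothetical helper: succeed if the free space (KiB, second argument) can
# hold a worst-case dump, assumed here to be as large as total RAM (KiB).
dump_fits() {
    [ "$2" -ge "$1" ]
}

# Compare MemTotal from /proc/meminfo with the free space under a dump path.
check_dump_budget() {
    local path mem_kib avail_kib
    path=${1:-/var/crash}   # assumed dump directory
    mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    avail_kib=$(df -Pk "$path" | awk 'NR==2 {print $4}')
    if dump_fits "$mem_kib" "$avail_kib"; then
        echo "OK: ${avail_kib} KiB free for up to ${mem_kib} KiB of RAM"
    else
        echo "WARN: ${avail_kib} KiB free is less than ${mem_kib} KiB of RAM"
        return 1
    fi
}
```

Calling `check_dump_budget /var/crash` from a monitoring job is one way to catch a shrinking budget early. The check is deliberately conservative, since filtered and compressed dumps are usually much smaller than RAM.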

1) Crash kernel reservation (crashkernel)

For kdump to work, a slice of memory has to be reserved for the “crash kernel.” This is usually done via a kernel boot parameter.

Approach (varies by distro):

  • Set the crashkernel= parameter in the bootloader
  • After reboot, verify the reservation actually happened
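Besides a fixed size such as crashkernel=256M, the kernel's kdump documentation describes a range form, crashkernel=<range1>:<size1>[,<range2>:<size2>,...], where each range start is inclusive and its end is exclusive. The sketch below resolves such a spec for a given amount of RAM so you can see which size a host would get. The helper names and the example spec are illustrative, not a sizing recommendation, and the optional @offset suffix is not handled.

```shell
# Convert a size with an M or G suffix to MiB (other suffixes are not handled).
to_mib() {
    case "$1" in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  return 1 ;;
    esac
}

# resolve_crashkernel SPEC RAM_MIB
# Print the size chosen by the first range that contains RAM_MIB.
# Only the range form (start-[end]:size) is supported in this sketch.
resolve_crashkernel() {
    local spec ram entry range size start end
    spec=$1; ram=$2
    for entry in $(printf '%s' "$spec" | tr ',' ' '); do
        range=${entry%%:*}
        size=${entry#*:}
        start=$(to_mib "${range%%-*}") || return 1
        end=${range#*-}
        if [ -n "$end" ]; then
            # Bounded range: the end is exclusive.
            end=$(to_mib "$end") || return 1
            [ "$ram" -lt "$end" ] || continue
        fi
        if [ "$ram" -ge "$start" ]; then
            echo "$size"
            return 0
        fi
    done
    return 1
}
```

For example, `resolve_crashkernel '1G-4G:192M,4G-64G:256M,64G-:512M' 8192` selects the 4G-64G range. The kernel evaluates ranges against the system RAM it detects, so treat the result near a boundary as approximate.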

To verify:

cat /proc/cmdline
dmesg | grep -iE "crashkernel|reserved"
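Reading the output by eye works on one host; for a fleet, a scripted check is easier to repeat. This sketch looks for the parameter on the command line and then reads /sys/kernel/kexec_crash_size, which reports the reserved size in bytes. The function names are hypothetical.

```shell
# Print the crashkernel= value from a kernel command line, or fail if absent.
crashkernel_param() {
    local word
    for word in $1; do
        case "$word" in
            crashkernel=*) echo "${word#crashkernel=}"; return 0 ;;
        esac
    done
    return 1
}

# Check both the boot parameter and the actual reservation.
check_reservation() {
    local size
    if ! crashkernel_param "$(cat /proc/cmdline)" >/dev/null; then
        echo "crashkernel= missing from /proc/cmdline"
        return 1
    fi
    size=$(cat /sys/kernel/kexec_crash_size) || return 1
    if [ "$size" -eq 0 ]; then
        echo "crashkernel= is set but no memory was reserved"
        return 1
    fi
    echo "reserved $(( size / 1024 / 1024 )) MiB for the crash kernel"
}
```

The second check matters because a parameter can be present and still yield a zero reservation, for example when the requested size cannot be satisfied.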

2) Installing and enabling the kdump service

Package names vary across distros, but the logic is the same:

  1. Install kdump/kexec tooling
  2. Enable the kdump service
  3. Configure the dump destination (local disk / NFS / SSH / object storage gateway)
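As a rough sketch of the first two steps, the function below maps a distro ID from /etc/os-release to a package and service name and then installs and enables it. The names are the ones I know from the RHEL 7-9 family and from Debian/Ubuntu; treat them as assumptions and check your release, since packaging changes between versions.

```shell
# Print "<package> <service>" for a given os-release ID.
# Assumed names; verify against your release before relying on them.
kdump_names() {
    case "$1" in
        rhel|centos|rocky|almalinux) echo "kexec-tools kdump" ;;
        debian|ubuntu)               echo "kdump-tools kdump-tools" ;;
        *) return 1 ;;
    esac
}

# Install the tooling and enable the service on the current host.
install_kdump() {
    local id names pkg svc
    id=$(. /etc/os-release && echo "$ID")
    names=$(kdump_names "$id") || { echo "unknown distro: $id" >&2; return 1; }
    pkg=${names% *}
    svc=${names#* }
    case "$id" in
        debian|ubuntu) sudo apt-get install -y "$pkg" ;;
        *)             sudo dnf install -y "$pkg" ;;
    esac
    sudo systemctl enable --now "$svc"
}
```

Run `install_kdump` once per host, or fold the mapping into your configuration management instead of calling it by hand.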

Status check:

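systemctl status kdump || true        # RHEL family
kdumpctl status || true               # RHEL family
systemctl status kdump-tools || true  # Debian/Ubuntu
kdump-config show || true             # Debian/Ubuntu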
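A distro-independent signal is /sys/kernel/kexec_crash_loaded, which reads 1 once a crash kernel has been loaded. The sketch below turns it into a check with an exit code, which suits a monitoring probe; the function names are hypothetical.

```shell
# Translate the content of /sys/kernel/kexec_crash_loaded into a message
# and an exit status (0 = loaded, 1 = not loaded, 2 = unexpected value).
describe_crash_loaded() {
    case "$1" in
        1) echo "crash kernel loaded" ;;
        0) echo "crash kernel NOT loaded"; return 1 ;;
        *) echo "unexpected value: $1"; return 2 ;;
    esac
}

# Check the running host.
kdump_ready() {
    describe_crash_loaded "$(cat /sys/kernel/kexec_crash_loaded)"
}
```

Alerting on a non-zero exit from `kdump_ready` catches the common case where the service silently failed to load after a kernel update.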

3) Dump destination: not “the only place I can write to” but “a place I can recover from”

The two-tier model I have found safe in the field:

  • Primary: local disk (fast write, ready right after reboot)
  • Secondary: central store (NFS/SSH), reachable for incident analysis

If you do use a central destination:

  • Restrict the network path (only dump traffic)
  • Harden credential / key management
  • Assume the dump may contain PII (access controls and retention)
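One way to get the second tier is to copy local dumps to the central store after the host is back up. The sketch below is an assumption-heavy illustration: the ".shipped" marker file is a convention invented here, the destination is a hypothetical rsync target, and it looks for files named vmcore (Debian/Ubuntu name their dump files differently).

```shell
# List vmcore files under DIR that have no ".shipped" marker next to them.
# The marker file is a convention invented for this sketch.
pending_dumps() {
    find "$1" -type f -name 'vmcore' | while IFS= read -r core; do
        [ -e "$core.shipped" ] || printf '%s\n' "$core"
    done
}

# Copy each pending dump directory to DEST (an rsync target such as
# user@host:/path, hypothetical here) and mark it as shipped on success.
ship_dumps() {
    local dir dest
    dir=$1; dest=$2
    pending_dumps "$dir" | while IFS= read -r core; do
        rsync -a --partial "$(dirname "$core")" "$dest/" && : > "$core.shipped"
    done
}
```

Running `ship_dumps /var/crash "$DEST"` from a boot-time unit or a timer keeps the local copy for quick access while the central copy serves analysis. Use a dedicated key and a restricted account on the receiving side, in line with the points above.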

4) Test: trigger a controlled panic and verify

You don’t want the “first panic” in production to also be your first kdump experience. Run a controlled test in a pilot environment:

echo 1 | sudo tee /proc/sys/kernel/sysrq
echo c | sudo tee /proc/sysrq-trigger

The system will panic and reboot. Afterwards, check whether the dump file was actually produced:

ls -lah /var/crash || true
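For a repeatable test, the check after reboot can be scripted. This sketch succeeds only if a non-empty file named vmcore was written within the last N minutes; the function name, the default window, and the file name are assumptions (adjust the name for distros that use a different one).

```shell
# recent_dump DIR [MINUTES]
# Print the first non-empty vmcore under DIR modified within the last
# MINUTES minutes (default 30); fail if there is none.
recent_dump() {
    local dir minutes found
    dir=$1; minutes=${2:-30}
    found=$(find "$dir" -type f -name 'vmcore' -size +0 -mmin "-$minutes" | head -n 1)
    [ -n "$found" ] && printf '%s\n' "$found"
}
```

After a controlled panic, `recent_dump /var/crash 30` gives a pass or fail answer you can record in the test plan instead of eyeballing a directory listing.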

5) Triage runbook: the first 30 minutes with vmcore

After a panic, the first goal is not “solve the root cause right now”; it is to preserve the evidence and classify the event.

My 30-minute flow:

  1. Confirm the dump file exists (locally and/or centrally)
  2. Note the kernel version / build ID / uptime
  3. Check recent changes: kernel update, NIC/driver, firmware, workload shift
  4. If it has happened before: same ring? same hardware?
  5. Prepare the “crash” environment for offline vmcore analysis
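For step 5, the crash utility needs a debug vmlinux that matches the kernel that panicked. The sketch below looks in two common debug-symbol locations (packaging conventions I am assuming, not guarantees) and runs a few read-only crash commands non-interactively, using crash's -s (silent) and -i (read commands from a file) options. The function names are hypothetical.

```shell
# Print candidate debug vmlinux paths for a kernel release.
# Both locations are common packaging conventions, assumed rather than guaranteed.
vmlinux_candidates() {
    printf '%s\n' \
        "/usr/lib/debug/lib/modules/$1/vmlinux" \
        "/usr/lib/debug/boot/vmlinux-$1"
}

# first_look RELEASE VMCORE [OUTFILE]
# Run sys, bt and log against the dump and save the output.
first_look() {
    local release vmcore out vmlinux p cmds rc
    release=$1; vmcore=$2; out=${3:-crash-first-look.txt}
    vmlinux=
    for p in $(vmlinux_candidates "$release"); do
        [ -r "$p" ] && { vmlinux=$p; break; }
    done
    [ -n "$vmlinux" ] || { echo "no debug vmlinux for $release" >&2; return 1; }
    cmds=$(mktemp) || return 1
    printf '%s\n' sys bt log exit > "$cmds"
    crash -s -i "$cmds" "$vmlinux" "$vmcore" > "$out"
    rc=$?
    rm -f "$cmds"
    return "$rc"
}
```

Pass the release of the kernel that crashed, which is not necessarily what `uname -r` reports after the reboot if a kernel update was pending.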

Initial evidence bundle:

  • vmcore
  • dmesg / journalctl -b -1
  • uname -a
  • change record (change/ticket)
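Collecting the bundle can be scripted so it looks the same for every incident. This is a minimal sketch with a hypothetical function name and file layout: it records a checksum and listing of the vmcore rather than copying it, and it only grabs the previous boot's kernel log if journalctl is present and the journal is persistent.

```shell
# collect_evidence VMCORE OUTDIR
# Write host details, the previous boot's kernel log and the dump's
# checksum into OUTDIR.
collect_evidence() {
    local vmcore out
    vmcore=$1; out=$2
    mkdir -p "$out" || return 1
    uname -a > "$out/uname.txt"
    # Kernel log of the previous boot, if a persistent journal is available.
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -k -b -1 --no-pager > "$out/journal-prev-boot.txt" 2>/dev/null || true
    fi
    # Record the dump's location, size and checksum instead of copying it.
    ls -l "$vmcore" > "$out/vmcore.info" || return 1
    sha256sum "$vmcore" > "$out/vmcore.sha256"
}
```

Note that `uname -a` describes the kernel running now; if the host rebooted into a different kernel, note the crashed kernel's version separately. The change record still has to be attached by hand or by your ticketing integration.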

6) Production discipline: retention and cost

Dumps can be large. A policy I recommend:

  • Keep only the most recent dump on each host
  • 7-30 days retention in the central store
  • For critical systems, automatically attach the dump to the incident “evidence bundle”
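The first two rules can be sketched as a small pruning script. It assumes one directory per dump, relies on GNU find's -printf, and only lists candidates unless --delete is passed; the function names are hypothetical.

```shell
# prune_dumps DIR KEEP [--delete]
# List (or delete) all dump directories under DIR except the KEEP newest.
prune_dumps() {
    local dir keep mode
    dir=$1; keep=${2:-1}; mode=${3:-}
    # Newest first by modification time; assumes names without newlines.
    find "$dir" -mindepth 1 -maxdepth 1 -type d -printf '%T@ %p\n' |
        sort -rn | tail -n "+$((keep + 1))" | cut -d' ' -f2- |
        while IFS= read -r old; do
            if [ "$mode" = "--delete" ]; then
                rm -rf -- "$old"
            else
                printf '%s\n' "$old"
            fi
        done
}

# expire_central DIR DAYS
# Print dump directories in the central store last modified more than DAYS days ago.
expire_central() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -mtime "+$2"
}
```

Run `prune_dumps /var/crash 1` first to review what would go, then add --delete once the list looks right. `expire_central` only prints candidates on purpose, so deletion in the central store stays an explicit step.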

Wrap-up

kdump turns a kernel panic from an “invisible reboot” into an analyzable event. The real value comes when crashkernel sizing, dump destination, test plan, and the triage runbook are designed together. If you want continuity in production, you have to build infrastructure that can leave evidence behind even for the worst class of bugs.

Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security and Software

Since 2006 I have worked across system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that has held up in the field.
