In production, “OOM happened, the process died” describes an outcome; the real problem is usually the memory pressure that built up before the OOM. The wrong reflex (e.g. “enable swap, run drop_caches”) saves the day, but the same incident comes back in a different form.
This article presents a field-focused runbook that catches memory pressure early and evicts in a controlled way using the Cgroup v2 + PSI (Pressure Stall Information) + systemd-oomd trio on Linux.
What are we trying to solve?
The goal is to reduce these three problems at the same time:
- Randomness of the kernel OOM killer: the most critical process being chosen at the worst possible moment.
- Cascading collapse: memory pressure → latency increase → retry storm → more memory.
- Operational blindness: answering “why did it happen?” with intuition rather than evidence.
Prerequisites (checklist)
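- The kernel and distribution must be running cgroup v2 (the unified hierarchy).
- The systemd version must include systemd-oomd (systemd 247 or later; some distributions ship it as a separate package).
- PSI metrics must be readable (kernel 4.20 or later with PSI enabled; some kernels need psi=1 on the kernel command line).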
Quick verification:
# Is this cgroup v2?
stat -fc %T /sys/fs/cgroup
# Do the PSI files exist?
ls /proc/pressure/
# Is oomd running?
systemctl status systemd-oomd
Expected:
- stat prints cgroup2fs
- /proc/pressure/memory exists
- systemd-oomd is active
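The three checks above can be folded into one function that prints a PASS or FAIL line per prerequisite. This is a minimal sketch: the function name check_prereqs and its two optional path arguments are assumptions made for illustration, and the paths default to the standard locations.

```shell
#!/bin/sh
# check_prereqs [CGROUP_ROOT] [PSI_DIR]  (hypothetical helper, not a systemd tool)
# Prints one PASS/FAIL line per prerequisite; succeeds only if all three pass.
check_prereqs() {
  pre_cgroot=${1:-/sys/fs/cgroup}
  pre_psidir=${2:-/proc/pressure}
  pre_rc=0

  # A cgroup v2 mount reports the filesystem type "cgroup2fs".
  if [ "$(stat -fc %T "$pre_cgroot" 2>/dev/null)" = "cgroup2fs" ]; then
    echo "PASS cgroup v2"
  else
    echo "FAIL cgroup v2"; pre_rc=1
  fi

  # PSI is usable when the memory pressure file is readable.
  if [ -r "$pre_psidir/memory" ]; then
    echo "PASS psi"
  else
    echo "FAIL psi"; pre_rc=1
  fi

  # is-active --quiet reports the state only through the exit status.
  if systemctl is-active --quiet systemd-oomd 2>/dev/null; then
    echo "PASS systemd-oomd"
  else
    echo "FAIL systemd-oomd"; pre_rc=1
  fi

  [ "$pre_rc" -eq 0 ]
}

check_prereqs || echo "Fix the FAIL lines before relying on this runbook."
```

Keeping the paths as arguments makes it possible to dry-run the check against a scratch directory.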
What does PSI (Pressure) tell you?
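PSI reports the share of wall-clock time in which tasks were stalled waiting for a resource, in this case memory (reclaim, refaults, swap-in). It is not a CPU-time metric. Because stalls accumulate while the kernel is still trying to reclaim, PSI usually produces a signal well before the OOM killer fires.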
Example reading:
cat /proc/pressure/memory
Field interpretation:
- If “some” is rising: at least one task is stalled waiting on memory → latency starts to rise.
- If “full” is rising: all non-idle tasks are stalled at the same time → no useful work is getting done and the incident is now visible.
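Each line of /proc/pressure/memory carries three moving averages (avg10, avg60, avg300, in percent over the last 10, 60 and 300 seconds) plus a cumulative total in microseconds. The sketch below pulls out a single field and compares it to a threshold. The helper names psi_field and psi_over, and the 5% threshold in the usage lines, are assumptions chosen for illustration rather than recommended values.

```shell
#!/bin/sh
# psi_field KIND FIELD [FILE]
# KIND is "some" or "full"; FIELD is avg10, avg60, avg300 or total.
psi_field() {
  awk -v kind="$1" -v field="$2" '
    $1 == kind {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == field) { print kv[2]; exit }
      }
    }' "${3:-/proc/pressure/memory}"
}

# psi_over KIND FIELD THRESHOLD [FILE]
# Succeeds when the field is strictly greater than THRESHOLD.
psi_over() {
  psi_value=$(psi_field "$1" "$2" "${4:-/proc/pressure/memory}")
  [ -n "$psi_value" ] &&
    awk -v v="$psi_value" -v t="$3" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Usage on a host with PSI enabled (threshold is illustrative only).
if [ -r /proc/pressure/memory ]; then
  echo "some avg10: $(psi_field some avg10)"
  if psi_over full avg10 5; then
    echo "full memory pressure above 5% over the last 10 seconds"
  fi
fi
```

The floating-point comparison is delegated to awk because POSIX test only compares integers.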
Design: first ask “which service goes?”
Which process gets killed at OOM time is an architectural decision, so make it before the incident, not during it.
A practical classification:
- Tier-0 (critical): control plane, identity, data layer (don’t die if at all possible)
- Tier-1: API/application workers (die but come back)
- Tier-2: batch, reports, cache warmer (the first to go)
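One way to write the tiers down is as per-service drop-ins. The sketch below stages them in a scratch directory for review instead of touching /etc directly. The helper stage_tier_dropin, the file name 50-oom-tier.conf, the tier-to-setting mapping and the numeric OOMScoreAdjust values are all assumptions made for illustration; ManagedOOMPreference is only available on systemd versions that support it and, per the systemd documentation, is only honored under certain cgroup ownership conditions.

```shell
#!/bin/sh
# stage_tier_dropin UNIT TIER [DIR]  (hypothetical helper)
# Writes DIR/UNIT.d/50-oom-tier.conf for a .service unit.
stage_tier_dropin() {
  tier_unit=$1
  tier_dir=${3:-./oomd-dropins}
  case "$2" in
    0) tier_pref=avoid; tier_adj=-900 ;;  # critical: last resort for both killers
    1) tier_pref=none;  tier_adj=0 ;;     # workers: default behavior
    2) tier_pref=none;  tier_adj=500 ;;   # batch: preferred kernel OOM victim
    *) echo "unknown tier: $2" >&2; return 1 ;;
  esac
  mkdir -p "$tier_dir/$tier_unit.d" || return 1
  {
    echo "[Service]"
    echo "# systemd-oomd candidate preference: none, avoid or omit"
    echo "ManagedOOMPreference=$tier_pref"
    echo "# Kernel OOM killer bias, -1000..1000; lower means less likely to be killed"
    echo "OOMScoreAdjust=$tier_adj"
  } > "$tier_dir/$tier_unit.d/50-oom-tier.conf"
}

# Example (unit names are placeholders):
#   stage_tier_dropin mydb.service 0
#   stage_tier_dropin nightly-report.service 2
# After review, copy the directories to /etc/systemd/system/ and run
#   sudo systemctl daemon-reload
```

The two settings address different killers: ManagedOOMPreference influences systemd-oomd, while OOMScoreAdjust influences the kernel OOM killer.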
Implementation: controlled eviction with systemd-oomd
Because systemd groups services into slices, you can give systemd-oomd a per-slice policy along the lines of “if memory pressure in this group stays high, kill something in it”.
Example approach (not service-based, but slice-based management):
- Group application workloads under a separate slice (e.g. apps.slice)
- Place batch jobs into a separate, lower-priority slice (e.g. batch.slice)
- Enable the OOM policy on the slice
Example override for a slice:
sudo systemctl edit apps.slice
[Slice]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=60%
Similarly, give batch.slice a more aggressive (lower) limit so that batch jobs are evicted before application workers.
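A drop-in for the batch slice could be generated as follows. This is a sketch under assumptions: the helper render_slice_policy, the 40% figure and the drop-in file name in the comments are illustrative, not tuned values.

```shell
#!/bin/sh
# render_slice_policy LIMIT  (hypothetical helper)
# Prints a [Slice] drop-in that lets systemd-oomd kill on memory pressure.
render_slice_policy() {
  case "$1" in
    [0-9]*%) ;;
    *) echo "limit must be a percentage such as 40%" >&2; return 1 ;;
  esac
  printf '[Slice]\nManagedOOMMemoryPressure=kill\nManagedOOMMemoryPressureLimit=%s\n' "$1"
}

# Preview the drop-in for batch.slice (40% is an example value).
render_slice_policy 40%

# To install it after review:
#   sudo mkdir -p /etc/systemd/system/batch.slice.d
#   render_slice_policy 40% | sudo tee /etc/systemd/system/batch.slice.d/50-oomd.conf
#   sudo systemctl daemon-reload
#   oomctl    # shows the cgroups systemd-oomd is currently monitoring
```

Generating the text first and installing it in a separate step keeps the change reviewable.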
Runbook: step by step during an incident
1) Triage (5 minutes)
uptime
free -m
vmstat 1 5
cat /proc/pressure/memory
journalctl -u systemd-oomd --since "-30m" --no-pager
dmesg -T | tail -n 80
How to read it:
- PSI rising + heavy reclaim → there’s “pressure”
- OOM kill logs → it’s already too late, move to root-cause and containment
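To tell whether the kernel OOM killer has already acted inside a slice, the cgroup v2 memory.events file is a compact source: its oom_kill counter increases each time the kernel kills a process in that cgroup or its descendants. The sketch below reads that counter; the helper name oom_kill_count is an assumption. Kills performed by systemd-oomd do not appear in this counter and have to be read from the systemd-oomd journal.

```shell
#!/bin/sh
# oom_kill_count CGROUP_DIR  (hypothetical helper)
# Prints the kernel OOM kill counter from CGROUP_DIR/memory.events, or 0 if absent.
oom_kill_count() {
  awk '$1 == "oom_kill" { print $2; found = 1 }
       END { if (!found) print 0 }' "$1/memory.events"
}

# One line per top-level slice; memory.events does not exist on the root cgroup.
for cg in /sys/fs/cgroup/*.slice; do
  [ -r "$cg/memory.events" ] || continue
  echo "$(basename "$cg") kernel_oom_kills=$(oom_kill_count "$cg")"
done
```

A non-zero count on a slice that was supposed to be protected is a strong hint that the policy needs revisiting.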
2) Containment (clear space in a controlled way)
Safe first moves:
- Stop/scale down the batch jobs that consume the most memory
- Cut “nice-to-have” processes like cache warmup/reports
- Reduce app workers in a controlled way (watch the traffic + retry effect)
Quick visibility:
# RSS is reported in KiB; cmd goes last so long command lines are not truncated
ps -eo pid,ppid,rss,cmd --sort=-rss | head -n 20
systemd-cgtop -m
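Memory usage alone does not show which group is actually stalling. With PSI enabled, each cgroup v2 directory also exposes a memory.pressure file in the same format as /proc/pressure/memory, so slices can be ranked by pressure. A minimal sketch follows; the helper name slice_pressure is an assumption.

```shell
#!/bin/sh
# slice_pressure [CGROUP_ROOT]  (hypothetical helper)
# Prints "AVG10 SLICE" for every top-level slice, highest pressure first.
slice_pressure() {
  sp_root=${1:-/sys/fs/cgroup}
  for sp_file in "$sp_root"/*.slice/memory.pressure; do
    [ -r "$sp_file" ] || continue
    sp_name=$(basename "$(dirname "$sp_file")")
    # Take avg10 from the "some" line and strip the key.
    awk -v name="$sp_name" \
      '$1 == "some" { sub(/^avg10=/, "", $2); print $2, name }' "$sp_file"
  done | LC_ALL=C sort -rn
}

slice_pressure
```

The slice at the top of this list is the first candidate for the containment moves listed above.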
3) Verification (10 minutes)
Is PSI dropping?
watch -n 2 'cat /proc/pressure/memory; echo; free -m'
If PSI isn’t dropping and memory usage keeps rising, suspect:
- Possible memory leak
- Retry storm (missing queue/backpressure)
- Kernel slab / page cache pressure
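A quick way to judge direction without staring at watch output is to compare the short and long averages: when avg10 is below avg60, the last ten seconds were calmer than the last minute. This is a heuristic, not a guarantee, and the helper name psi_easing is an assumption. The final grep prints the slab counters from /proc/meminfo to help check the kernel slab hypothesis.

```shell
#!/bin/sh
# psi_easing [FILE]  (hypothetical helper)
# Succeeds when "some" avg10 is lower than "some" avg60.
psi_easing() {
  awk '$1 == "some" {
         split($2, short, "=")   # avg10=...
         split($3, long, "=")    # avg60=...
         easing = (long[2] + 0 > short[2] + 0)
       }
       END { exit !easing }' "${1:-/proc/pressure/memory}"
}

if [ -r /proc/pressure/memory ]; then
  if psi_easing; then
    echo "memory pressure is easing"
  else
    echo "memory pressure is flat or rising"
  fi
  # Slab totals; a large, growing SUnreclaim points at kernel-side memory.
  grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo || true
fi
```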
4) Recovery standard
After things stabilize:
- Roll back the temporary scale-downs
- Add the systemd-oomd kill logs to the incident evidence set
- Build a metric/trace/log correlation for “why did it happen?”
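Evidence is easiest to keep when it is captured the same way every time. The sketch below copies the current PSI and memory state plus recent systemd-oomd and kernel logs into one directory. The helper name collect_evidence, the file names and the two-hour window are assumptions; adjust them to your own incident process.

```shell
#!/bin/sh
# collect_evidence [DIR]  (hypothetical helper)
# Saves a best-effort snapshot into DIR and prints the directory name.
collect_evidence() {
  ev_dir=${1:-./incident-$(date +%Y%m%dT%H%M%S)}
  mkdir -p "$ev_dir" || return 1
  # Every step is best-effort: a missing tool or file must not stop the rest.
  cat /proc/pressure/memory > "$ev_dir/psi-memory.txt" 2>/dev/null || true
  cat /proc/meminfo > "$ev_dir/meminfo.txt" 2>/dev/null || true
  journalctl -u systemd-oomd --since "-2h" --no-pager \
    > "$ev_dir/oomd.log" 2>/dev/null || true
  journalctl -k --since "-2h" --no-pager \
    > "$ev_dir/kernel.log" 2>/dev/null || true
  echo "$ev_dir"
}

# Example:
#   collect_evidence /var/tmp/incident-memory
```

Reading the kernel log usually requires root or membership in a journal-reading group, so run the collection with the same privileges as the triage commands.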
Testing (before going to production)
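A simple pressure test for a lab or staging host. Launched from an interactive shell, the stressor runs in your user session's cgroup, so start it inside the slice whose policy you want to exercise:
sudo apt-get install -y stress-ng || true
sudo systemd-run --scope --slice=batch.slice stress-ng --vm 2 --vm-bytes 80% --timeout 60s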
Expected:
- PSI rises
- systemd-oomd applies a controlled kill within the target slice
- Critical services (tier-0) are protected
Postmortem: a permanent improvement list
- Limits: per-service memory limit/requests, cache size
- Observation: PSI alarms, reclaim/pgfault indicators, oomd decision logs
- Resilience: queue/backpressure, retry budget, circuit breaker
- Operations: a written standard for the decision “which service goes first?”
Conclusion
systemd-oomd reduces the randomness of OOM and turns memory pressure into a controlled eviction. The value comes less from installing the tool and more from the combined discipline of service priority, cgroup limits, and PSI-based early warning.