In production, certain network incidents look like “the service broke” but the actual root cause is saturation in the kernel’s packet processing path: SoftIRQ climbs, ksoftirqd eats CPU, packets start dropping, latency rises. The picture is misleading — “CPU looks idle but the network is dead.”
This runbook gives you a practical path for nodes carrying high traffic — gateways, NAT boxes, edge proxies, IDS/IPS, or anything with heavy east-west traffic — so you can quickly triage softirq saturation and apply safe tuning.
When do I get suspicious?
I start thinking softirq when these symptoms show up together:
- p95/p99 latency rises, error rate climbs
- sar -n EDEV shows growing rxdrop/txdrop
- top/htop shows ksoftirqd and %si (softirq) trending up
- NIC stats show counters like rx_missed_errors, rx_no_buffer, or ring overflow growing
0) Safety: what do I do in the first 5 minutes?
- Touching kernel / NIC tuning in panic mode just turns a misconfiguration into a bigger outage.
- Measure first, take small steps next, and verify the impact at every step.
1) Quick triage: a short command checklist
1.1 CPU and softirq distribution
mpstat -P ALL 1 5
Look for: are some cores hot on %si? Is everything piling onto a single core?
1.2 SoftIRQ counters
cat /proc/softirqs
Look for: are NET_RX and NET_TX climbing fast? Stuck on a single CPU?
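The raw counters in /proc/softirqs are cumulative since boot, so "climbing fast" only shows up when you compare two snapshots. A minimal sketch of that comparison follows; the function name softirq_delta and the snapshot paths are my own placeholders, and it assumes both snapshots come from the same host with the same CPU columns.

```shell
# softirq_delta ROW BEFORE AFTER
# Prints the per-CPU growth of one /proc/softirqs row (e.g. NET_RX)
# between two snapshot files. Hypothetical helper, not a standard tool.
softirq_delta() {
  awk -v row="$1:" '
    # Header line: column i of the header labels field i+1 of each data row.
    FNR == 1  { for (i = 1; i <= NF; i++) name[i + 1] = $i; next }
    $1 != row { next }
    # First file: remember the old counters.
    FNR == NR { for (i = 2; i <= NF; i++) old[i] = $i; next }
    # Second file: print the difference per CPU.
    { for (i = 2; i <= NF; i++) printf "%s %d\n", name[i], $i - old[i] }
  ' "$2" "$3"
}

# Usage (paths are placeholders):
#   cat /proc/softirqs > /tmp/softirqs.a; sleep 5; cat /proc/softirqs > /tmp/softirqs.b
#   softirq_delta NET_RX /tmp/softirqs.a /tmp/softirqs.b | sort -k2,2nr | head
```

If one CPU accounts for most of the growth, that points toward the "pile-up on a single CPU" class.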
1.3 NIC drop and error counters
ip -s link show dev eth0
ethtool -S eth0 | grep -Ei 'drop|miss|error|no_buffer|timeout' | head -n 50
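A non-zero drop counter may just be history from last month, so what I care about is which counters are growing right now. Here is a rough sketch that compares two saved `ethtool -S` outputs; nic_counter_growth is a hypothetical helper name, and it assumes the usual "name: value" layout, which some drivers deviate from.

```shell
# nic_counter_growth BEFORE AFTER
# Lists counters whose value increased between two saved `ethtool -S` outputs.
# Assumes the common "     counter_name: 123" layout.
nic_counter_growth() {
  awk -F': *' '
    $2 == "" { next }                    # skip the "NIC statistics:" header
    { gsub(/^[ \t]+/, "", $1) }          # strip leading indentation from the name
    FNR == NR { old[$1] = $2; next }     # first file: remember old values
    ($1 in old) && ($2 + 0) > (old[$1] + 0) {
      printf "%s +%d\n", $1, $2 - old[$1]
    }
  ' "$1" "$2"
}

# Usage (paths are placeholders):
#   ethtool -S eth0 > /tmp/nic.a; sleep 10; ethtool -S eth0 > /tmp/nic.b
#   nic_counter_growth /tmp/nic.a /tmp/nic.b | grep -Ei 'drop|miss|error|no_buffer|timeout'
```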
1.4 IRQ distribution
grep -Ei 'eth0|mlx|ixgbe|i40e|ena|virtio' /proc/interrupts | head -n 20
Look for: are IRQs piling on a single CPU?
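With many queues the /proc/interrupts table gets hard to eyeball, so it can help to sum the NIC's rows per CPU. The sketch below does that; irq_per_cpu is a made-up helper name, and the pattern you pass in (for example eth0) depends on how your driver names its vectors.

```shell
# irq_per_cpu PATTERN [FILE]
# Sums the per-CPU interrupt counts of all /proc/interrupts rows matching
# PATTERN (an awk regex, e.g. the NIC name). FILE defaults to /proc/interrupts.
irq_per_cpu() {
  awk -v pat="$1" '
    # Header row: one column per CPU.
    NR == 1  { ncpu = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    # Data rows: field 1 is the IRQ number, fields 2..ncpu+1 are the counts.
    $0 ~ pat { for (i = 1; i <= ncpu; i++) sum[i] += $(i + 1) }
    END      { for (i = 1; i <= ncpu; i++) printf "%s %d\n", name[i], sum[i] }
  ' "${2:-/proc/interrupts}"
}

# Usage:
#   irq_per_cpu eth0 | sort -k2,2nr | head
```

These are totals since boot; for a live view, take two readings a few seconds apart and compare them.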
1.5 Socket / TCP state (queues)
ss -s
netstat -s | head -n 80
1.6 Application side
- Is there a CPU-heavy job running on this node at the same time?
- Did IRQ affinity change? (recent config/deploy)
2) Root cause classes
I usually bin softirq incidents into these classes:
- Pile-up on a single CPU (bad IRQ affinity, low queue count, irqbalance gone wrong)
- NIC ring/queue undersized (overflow during burst)
- Packet processing cost (conntrack/NAT, iptables/nftables, encapsulation, VXLAN/GRE)
- Driver/firmware (bug, offload incompatibility)
- Noisy neighbour (CPU steal on a virtualized host)
Identifying the class lets you pick the right tuning knob.
3) Intervention steps (controlled)
3.1 Fix IRQ affinity (the most common win)
Goal: stop NIC interrupts from piling onto a single core.
irqbalance works well sometimes and badly other times. On critical production nodes, “deliberate pinning” tends to be safer.
Check:
systemctl status irqbalance
Application:
- Pick CPU cores based on the number of NIC queues
- Map IRQ -> core
- After changes, verify the distribution via /proc/interrupts
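The mapping step above could be sketched as a dry run that only prints the writes it would make, so you can review them before touching anything. plan_irq_pinning is a hypothetical helper; the round-robin policy and the core list are my assumptions, not a recommendation for your topology (NUMA locality usually matters). The target file /proc/irq/N/smp_affinity_list is the kernel's per-IRQ affinity interface.

```shell
# plan_irq_pinning PATTERN "CORES" [FILE]
# Prints (does not execute) one affinity write per matching IRQ, assigning
# the given cores round-robin. FILE defaults to /proc/interrupts.
plan_irq_pinning() {
  awk -v pat="$1" -v cores="$2" '
    BEGIN { n = split(cores, core, " ") }
    # Only numbered IRQ rows that match the NIC pattern.
    n > 0 && $0 ~ pat && $1 ~ /^[0-9]+:$/ {
      irq = $1; sub(/:$/, "", irq)
      printf "echo %s > /proc/irq/%s/smp_affinity_list\n", core[(k++ % n) + 1], irq
    }
  ' "${3:-/proc/interrupts}"
}

# Usage: review the plan, then apply it as root.
#   plan_irq_pinning 'eth0-TxRx' '2 3 4 5'
#   plan_irq_pinning 'eth0-TxRx' '2 3 4 5' | sudo sh
```

If irqbalance is still running it may rewrite these affinities later, so decide up front whether to stop it or exclude these IRQs from it.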
3.2 Increase RX/TX queue count
The number of queues a NIC supports defines parallel packet processing capacity.
ethtool -l eth0
sudo ethtool -L eth0 combined 8
Note: the right value depends on the NIC/driver. Pushing it too high also adds overhead.
3.3 Grow the ring buffer (breathing room during bursts)
ethtool -g eth0
sudo ethtool -G eth0 rx 4096 tx 4096
This can reduce drops during microbursts. A larger ring can also add some latency, so grow it in steps and measure after each one.
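Before picking a value, it is worth checking how much headroom the hardware actually offers, because 4096 may exceed the pre-set maximum on a given NIC. A small sketch that condenses `ethtool -g` output into current versus maximum; ring_headroom is a hypothetical name, and the parsing assumes the usual two-section layout ("Pre-set maximums" followed by "Current hardware settings").

```shell
# ring_headroom
# Reads `ethtool -g <dev>` output on stdin and prints current vs. maximum
# ring sizes for RX and TX.
ring_headroom() {
  awk -F':[ \t]*' '
    /^Pre-set maximums/          { section = "max"; next }
    /^Current hardware settings/ { section = "cur"; next }
    # Keep only the plain RX and TX rows (not "RX Mini" or "RX Jumbo").
    section != "" && ($1 == "RX" || $1 == "TX") { val[section, $1] = $2 }
    END {
      n = split("RX TX", dir, " ")
      for (k = 1; k <= n; k++)
        printf "%s current=%s max=%s\n", dir[k], val["cur", dir[k]], val["max", dir[k]]
    }
  '
}

# Usage:
#   ethtool -g eth0 | ring_headroom
```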
3.4 RPS/RFS (distribute work after the interrupt)
If IRQ affinity alone isn’t enough, consider RPS to spread receive processing across more CPUs after the interrupt has fired.
These knobs are powerful but increase CPU consumption when misused. So:
- Try IRQ affinity + queues first
- Then roll out a limited RPS canary
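For the canary, RPS is configured per receive queue by writing a hexadecimal CPU bitmask to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus. Getting the mask wrong is an easy mistake, so here is a small sketch that builds it from a CPU list. cpus_to_mask is a hypothetical helper, and it deliberately handles only CPUs 0 to 31; larger systems need the comma-separated multi-word form, which this sketch does not produce.

```shell
# cpus_to_mask CPU...
# Prints the hex bitmask with one bit set per listed CPU (CPUs 0-31 only).
# Runs in a subshell so its variables do not leak into the caller.
cpus_to_mask() (
  mask=0
  for cpu in "$@"; do
    if ! { [ "$cpu" -ge 0 ] && [ "$cpu" -lt 32 ]; } 2>/dev/null; then
      echo "cpu '$cpu' is not in the supported range 0-31" >&2
      exit 1
    fi
    mask=$(( mask | (1 << cpu) ))
  done
  printf '%x\n' "$mask"
)

# Usage (as root, one queue first as the canary; device and queue are placeholders):
#   cpus_to_mask 2 3 > /sys/class/net/eth0/queues/rx-0/rps_cpus
# Rollback: writing 0 disables RPS for that queue again.
#   echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
```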
3.5 Offload settings (GRO/LRO/TSO) — careful
Offloads can reduce CPU usage, but for some traffic types they affect latency or make packet captures harder to interpret.
ethtool -k eth0 | head -n 40
Make changes incrementally; watch p95/p99 + drops at each step.
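Because `head -n 40` can cut off the features you care about, a narrower filter makes before/after comparisons easier. A minimal sketch, assuming the long feature names that `ethtool -k` prints on current versions; offload_state is a hypothetical helper name.

```shell
# offload_state
# Reads `ethtool -k <dev>` output on stdin and keeps only the GRO, LRO and
# TSO lines, including any "[fixed]" marker that says the driver will not
# let you change the setting.
offload_state() {
  grep -E '^(generic-receive-offload|large-receive-offload|tcp-segmentation-offload):'
}

# Usage: record the state, change one feature, record again, compare.
#   ethtool -k eth0 | offload_state > /tmp/offload.before
#   sudo ethtool -K eth0 gro off
#   ethtool -k eth0 | offload_state > /tmp/offload.after
#   diff /tmp/offload.before /tmp/offload.after
# Rollback:
#   sudo ethtool -K eth0 gro on
```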
4) Verification: the evidence for “we fixed it”
These are the signals I look for:
- rxdrop/txdrop stabilized or dropped
- %si is now distributed (no single-core saturation)
- ksoftirqd CPU usage decreased
- App p95/p99 returned to the SLO band
5) Permanent prevention: alarms and capacity
These incidents come back. To make the fix stick:
- Alarm on softirq ratio (host-level)
- Alarm on NIC drop/error counters
- Capacity baseline for “peak PPS” and “CPU per packet”
- If you use NAT/conntrack, give it its own budget and limits
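If your monitoring agent does not already export these, two of the signals above can be derived from procfs. The sketch below is one possible approach: softirq_pct computes the host-level softirq share from two snapshots of the aggregate "cpu" line in /proc/stat, and conntrack_pct turns the conntrack count and limit into a utilization figure. Both function names are hypothetical, and any alert threshold you attach to them is something to calibrate against your own baseline.

```shell
# softirq_pct BEFORE AFTER
# Host-level softirq share (percent of CPU time) between two /proc/stat
# snapshots. Fields on the "cpu" line: user nice system idle iowait irq
# softirq steal ...; softirq is field 8 when counting the "cpu" label.
softirq_pct() {
  awk '
    $1 != "cpu" { next }
    { total = 0; for (i = 2; i <= 9 && i <= NF; i++) total += $i }
    FNR == NR { t0 = total; s0 = $8; next }
    { dt = total - t0
      if (dt > 0) printf "%.1f\n", 100 * ($8 - s0) / dt
      else print "0.0" }
  ' "$1" "$2"
}

# conntrack_pct COUNT MAX
# Integer utilization of the conntrack table.
conntrack_pct() {
  echo $(( 100 * $1 / $2 ))
}

# Usage (paths are placeholders; the conntrack files exist only when the
# nf_conntrack module is loaded):
#   cat /proc/stat > /tmp/stat.a; sleep 10; cat /proc/stat > /tmp/stat.b
#   softirq_pct /tmp/stat.a /tmp/stat.b
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

A host-level average can hide a single saturated core, so it is best paired with a per-core view rather than used alone.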
Wrap-up
SoftIRQ saturation is a manageable incident class with the right runbook. Measure first, identify the root-cause class, then proceed with small and verifiable steps: IRQ affinity, queues, rings, and RPS if needed. That way vague complaints like “the network goes away sometimes” turn into a controllable and repeatable operation.