Linux SoftIRQ Saturation and IRQ Affinity Runbook

In production, some network incidents look like “the service broke” when the actual root cause is saturation in the kernel’s packet processing path: softirq time climbs, ksoftirqd eats CPU, packets start dropping, and latency rises. The picture is misleading: averaged across all cores the CPU looks idle, yet the network is effectively dead, often because the packet work is stuck on one or two cores.

This runbook gives you a practical path for nodes carrying high traffic — gateways, NAT boxes, edge proxies, IDS/IPS, or anything with heavy east-west traffic — so you can quickly triage softirq saturation and apply safe tuning.

When do I get suspicious?

I start thinking softirq when these symptoms show up together:

p95/p99 latency rises, error rate climbs
sar -n EDEV shows growing rxdrop/s and txdrop/s (plain sar -n DEV reports throughput only, not drops)
top/htop shows ksoftirqd and %si (softirq) trending up
NIC stats show counters like rx_missed_errors, rx_no_buffer, ring overflow growing

0) Safety: what do I do in the first 5 minutes?

Touching kernel / NIC tuning in panic mode just turns a misconfiguration into a bigger outage.
Measure first, take small steps next, and verify the impact at every step.

1) Quick triage: a 10-command checklist

1.1 CPU and softirq distribution

mpstat -P ALL 1 5

Look for: are some cores hot on %si? Is everything piling onto a single core?

1.2 SoftIRQ counters

cat /proc/softirqs

Look for: are NET_RX and NET_TX climbing fast? Stuck on a single CPU?
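
The counters in /proc/softirqs are cumulative since boot, so a single read says little about what is happening right now. A minimal sketch that turns two snapshots into a per-CPU delta follows; the function name softirq_delta and the snapshot paths are my own placeholders, not part of any tool.

```shell
# Per-CPU delta of one softirq row between two /proc/softirqs snapshots.
# $1 = row name (e.g. NET_RX), $2 = earlier snapshot, $3 = later snapshot.
softirq_delta() {
    awk -v row="$1:" '
        # First line of each file is the CPU header; remember it once.
        FNR == 1 { if (NR == 1) { for (i = 1; i <= NF; i++) cpu[i] = $i }; next }
        $1 != row { next }
        # Row found in the first file: store the baseline.
        NR == FNR { for (i = 2; i <= NF; i++) before[i] = $i; next }
        # Row found in the second file: print the growth per CPU.
        { for (i = 2; i <= NF; i++) print cpu[i - 1], $i - before[i] }
    ' "$2" "$3"
}

# Usage (paths are examples):
#   cat /proc/softirqs > /tmp/si.a; sleep 5; cat /proc/softirqs > /tmp/si.b
#   softirq_delta NET_RX /tmp/si.a /tmp/si.b
```

If one CPU accounts for nearly all of the growth while the others stay close to zero, that points at the single-core pile-up case rather than a general capacity problem.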

1.3 NIC drop and error counters

ip -s link show dev eth0
ethtool -S eth0 | grep -Ei 'drop|miss|error|no_buffer|timeout' | head -n 50

1.4 IRQ distribution

grep -Ei 'eth0|mlx|ixgbe|i40e|ena|virtio' /proc/interrupts | head -n 20

Look for: are IRQs piling on a single CPU?
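
With many queues the raw /proc/interrupts table is hard to read by eye. A small sketch that sums the matching IRQ lines per CPU column can help; the function name nic_irq_per_cpu is hypothetical and the pattern is whatever identifies your NIC in that file.

```shell
# Sum interrupt counts per CPU for all lines matching a pattern.
# $1 = regex for the NIC's IRQ lines (e.g. eth0),
# $2 = file in /proc/interrupts format (defaults to /proc/interrupts).
nic_irq_per_cpu() {
    awk -v pat="$1" '
        # Header line: one column label per CPU.
        NR == 1 { ncpu = NF; for (i = 1; i <= NF; i++) cpu[i] = $i; next }
        # Data line: field 1 is "<irq>:", fields 2..ncpu+1 are the counts.
        $0 ~ pat { for (i = 1; i <= ncpu; i++) sum[i] += $(i + 1) }
        END { for (i = 1; i <= ncpu; i++) print cpu[i], sum[i] + 0 }
    ' "${2:-/proc/interrupts}"
}

# Usage: nic_irq_per_cpu eth0
```

These are also cumulative counters, so compare two runs a few seconds apart before drawing conclusions.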

1.5 Socket / TCP state (queues)

ss -s
netstat -s | head -n 80

1.6 Application side

Is there a CPU-heavy job running on this node at the same time?
Did IRQ affinity change? (recent config/deploy)

2) Root cause classes

I usually bin softirq incidents into these classes:

Pile-up on a single CPU (bad IRQ affinity, low queue count, irqbalance gone wrong)
NIC ring/queue undersized (overflow during burst)
Packet processing cost (conntrack/NAT, iptables/nftables, encapsulation, VXLAN/GRE)
Driver/firmware (bug, offload incompatibility)
Noisy neighbour (CPU steal on virtualized hosts)

Identifying the class lets you pick the right tuning knob.

3) Intervention steps (controlled)

3.1 Fix IRQ affinity (the most common win)

Goal: stop NIC interrupts from piling onto a single core.

irqbalance works well sometimes and badly other times. On critical production nodes, “deliberate pinning” tends to be safer.

Check:

systemctl status irqbalance

Application:

Pick CPU cores based on the number of NIC queues
Map IRQ -> core
After changes, verify the distribution via /proc/interrupts
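
The mapping step above can be sketched as a dry-run planner that only prints the writes it would make, so the plan can be reviewed before anything touches the node. The kernel interface is /proc/irq/<n>/smp_affinity_list; the function name plan_irq_affinity and the IRQ and core numbers in the usage comment are made up for illustration.

```shell
# Print (do not execute) one affinity write per IRQ, assigning cores round-robin.
# $1 = space-separated IRQ numbers, $2 = space-separated CPU ids.
plan_irq_affinity() {
    awk -v irqs="$1" -v cpus="$2" 'BEGIN {
        ni = split(irqs, irq, " ")
        nc = split(cpus, cpu, " ")
        if (nc == 0) { print "no CPUs given" > "/dev/stderr"; exit 1 }
        for (i = 1; i <= ni; i++)
            printf "echo %s > /proc/irq/%s/smp_affinity_list\n", cpu[(i - 1) % nc + 1], irq[i]
    }'
}

# Usage (IRQ numbers and cores are examples only):
#   plan_irq_affinity "24 25 26 27" "2 3 4 5"
#   plan_irq_affinity "24 25 26 27" "2 3 4 5" | sudo sh    # apply after review
```

Two caveats worth checking on your own system: a running irqbalance can overwrite manual pins unless it is stopped or told to leave those IRQs alone, and some drivers manage affinity themselves and reject the write.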

3.2 Increase RX/TX queue count

The number of RX/TX queues (channels) a NIC exposes caps how many cores can process its packets in parallel.

ethtool -l eth0
sudo ethtool -L eth0 combined 8

Note: the right value depends on the NIC/driver. Pushing it too high also adds overhead.

3.3 Grow the ring buffer (breathing room during bursts)

ethtool -g eth0
sudo ethtool -G eth0 rx 4096 tx 4096

This can reduce drops during microbursts. A larger ring can also add some latency, because packets may wait longer in the queue before being processed, so proceed with measurements.
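
The 4096 above is only an example; the hardware maximum differs per NIC and driver. A minimal sketch that reads the "Pre-set maximums" and "Current hardware settings" sections of ethtool -g output and clamps the requested size to the maximum is shown here. The helper names ring_max_and_current and clamp_ring are hypothetical, and the parsing assumes the usual two-section layout of ethtool -g.

```shell
# Read `ethtool -g <dev>` output on stdin; print "<max> <current>" for one key.
# $1 = key exactly as ethtool prints it, e.g. RX or TX.
ring_max_and_current() {
    awk -F ':[ \t]*' -v key="$1" '
        /^Pre-set maximums/          { section = "max"; next }
        /^Current hardware settings/ { section = "cur"; next }
        $1 == key                    { val[section] = $2 }
        END {
            if (("max" in val) && ("cur" in val)) print val["max"], val["cur"]
            else exit 1
        }
    '
}

# Never ask for more than the hardware maximum.
# $1 = wanted size, $2 = hardware maximum (both integers).
clamp_ring() {
    if [ "$1" -gt "$2" ]; then echo "$2"; else echo "$1"; fi
}

# Usage:
#   set -- $(ethtool -g eth0 | ring_max_and_current RX)   # $1 = max, $2 = current
#   sudo ethtool -G eth0 rx "$(clamp_ring 4096 "$1")"
```

If the driver reports n/a instead of a number, the comparison fails loudly, which is the behaviour I would want here rather than a silent guess.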

3.4 RPS/RFS (distribute work after the interrupt)

If IRQ affinity alone isn’t enough, RPS can be considered to distribute the receive path.

These knobs are powerful but increase CPU consumption when misused. So:

Try IRQ affinity + queues first
Then roll out a limited RPS canary
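
For the canary, RPS is configured per receive queue by writing a hexadecimal CPU bitmask to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus. A small sketch for building that mask from a list of CPU ids follows; the function name cpus_to_rps_mask is my own, and it deliberately handles only CPUs 0 to 31, since larger systems need a comma-separated multi-word mask that this sketch does not produce.

```shell
# Convert a list of CPU ids (0..31) into the hex bitmask expected by rps_cpus.
cpus_to_rps_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        case "$cpu" in
            ''|*[!0-9]*) echo "bad CPU id: $cpu" >&2; return 1 ;;
        esac
        if [ "$cpu" -ge 32 ]; then
            echo "CPU $cpu needs a multi-word mask, not handled here" >&2
            return 1
        fi
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Usage (device, queue and cores are examples):
#   cpus_to_rps_mask 2 3 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
#   echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus    # roll back: RPS off
```

Starting with a single queue on a single node keeps the blast radius small and makes the extra CPU cost easy to see in mpstat.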

3.5 Offload settings (GRO/LRO/TSO) — careful

Offloads can reduce CPU but for some traffic types they affect latency or debug-ability.

ethtool -k eth0 | head -n 40

Make changes incrementally; watch p95/p99 + drops at each step.

4) Verification: the evidence for “we fixed it”

These are the signals I look for:

rxdrop/txdrop stabilized or dropped
%si is now distributed (no single-core saturation)
ksoftirqd CPU usage decreased
App p95/p99 returned to the SLO band
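
To back the first signal with numbers, one option is to save ethtool -S output before and after the change window and list only the drop-like counters that grew. This is a sketch under the assumption that the driver uses names containing drop, miss or no_buffer; counter names vary by driver, and grown_drop_counters is a hypothetical helper name.

```shell
# Compare two saved `ethtool -S <dev>` outputs and print drop-like counters
# that increased, as "<name> +<delta>".
# $1 = earlier snapshot file, $2 = later snapshot file.
grown_drop_counters() {
    awk -F ':[ \t]*' '
        { name = $1; sub(/^[ \t]+/, "", name) }          # strip indentation
        tolower(name) !~ /drop|miss|no_buffer/ { next }  # keep drop-like names only
        NR == FNR { before[name] = $2; next }            # first file: baseline
        (name in before) && ($2 + 0 > before[name] + 0) {
            print name, "+" ($2 - before[name])
        }
    ' "$1" "$2"
}

# Usage (paths are examples):
#   ethtool -S eth0 > /tmp/nic.a; sleep 60; ethtool -S eth0 > /tmp/nic.b
#   grown_drop_counters /tmp/nic.a /tmp/nic.b
```

Empty output over a window that includes peak traffic is the result I would want to see before calling the fix done.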

5) Permanent prevention: alarms and capacity

These incidents come back. To make the fix stick:

Alarm on softirq ratio (host-level)
Alarm on NIC drop/error counters
Capacity baseline for “peak PPS” and “CPU per packet”
If you use NAT/conntrack, give it its own budget and limits
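
For the softirq ratio alarm, the raw data is in /proc/stat, where the softirq column is the seventh value after the cpu label. A minimal sketch that computes the per-CPU softirq share between two snapshots is below; the function name softirq_pct is hypothetical, and any alert threshold you put on top of it is something to derive from your own baseline.

```shell
# Softirq share of elapsed CPU time, per CPU line, between two /proc/stat snapshots.
# $1 = earlier snapshot, $2 = later snapshot. Prints "<cpu> <percent>".
softirq_pct() {
    awk '
        $1 !~ /^cpu/ { next }
        {
            # Fields 2..9: user nice system idle iowait irq softirq steal.
            # Guest time is already counted inside user, so it is left out.
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
        }
        NR == FNR { t0[$1] = total; s0[$1] = $8; next }
        ($1 in t0) {
            dt = total - t0[$1]
            if (dt > 0) printf "%s %.1f\n", $1, 100 * ($8 - s0[$1]) / dt
        }
    ' "$1" "$2"
}

# Usage (paths are examples):
#   cat /proc/stat > /tmp/stat.a; sleep 10; cat /proc/stat > /tmp/stat.b
#   softirq_pct /tmp/stat.a /tmp/stat.b
```

Alerting on the per-CPU lines rather than only the aggregate cpu line matters here, because a single saturated core can hide behind a low host-wide average.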

Wrap-up

SoftIRQ saturation is a manageable incident class with the right runbook. Measure first, identify the root-cause class, then proceed with small and verifiable steps: IRQ affinity, queues, rings, and RPS if needed. That way vague complaints like “the network goes away sometimes” turn into a controllable and repeatable operation.
