Linux Conntrack Capacity Planning and Alerting Runbook

Most teams experience a conntrack problem as “everything broke at once”. In reality, the signal is there well before the outage: table utilization rises, new connections start to drop, the application retries, and the retries push the incident further.

This article gives the full capacity planning + alerting + incident runbook for Linux conntrack. The goal is to set up an operationally manageable model before the “table is full” moment arrives.

Why is conntrack critical?

Wherever a stateful firewall/NAT is used, conntrack becomes an invisible “single resource”:

L4 load balancer / reverse proxy
NAT gateway / egress node
Kubernetes nodes (overlay + service NAT)
Edge servers (nftables/iptables)

Typical symptoms when the table fills up:

New TCP connections cannot be established
UDP packets are dropped “silently”
The app falls into the “service is up but not working” pattern

Measure: table utilization and drop signals

The two most basic metrics:

cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

Quick ratio:

python3 - <<'PY'
import pathlib
count=int(pathlib.Path("/proc/sys/net/netfilter/nf_conntrack_count").read_text().strip())
mx=int(pathlib.Path("/proc/sys/net/netfilter/nf_conntrack_max").read_text().strip())
print(f"conntrack_utilization={(count/mx)*100:.1f}% ({count}/{mx})")
PY

Kernel log signal (critical):

dmesg -T | grep -nE "nf_conntrack: table full|conntrack" || true

If you see “table full”, you’ve already started dropping.
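
Drops can also be read from counters instead of logs. The following is a minimal sketch, assuming the kernel exposes per-CPU conntrack statistics at /proc/net/stat/nf_conntrack as a header row followed by one row of hexadecimal values per CPU (the exact column set varies by kernel version). The function name sum_conntrack_stats is illustrative and not part of any tool.

```python
from pathlib import Path

# Assumed location; normally present once the nf_conntrack module is loaded.
STAT_PATH = Path("/proc/net/stat/nf_conntrack")


def sum_conntrack_stats(text: str) -> dict[str, int]:
    """Sum the per-CPU rows of the stat file into one total per column."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {}
    header, rows = lines[0], lines[1:]
    totals = dict.fromkeys(header, 0)
    for row in rows:
        for name, value in zip(header, row):
            totals[name] += int(value, 16)  # values are hexadecimal
    return totals


if __name__ == "__main__":
    try:
        totals = sum_conntrack_stats(STAT_PATH.read_text())
    except OSError:
        print(f"cannot read {STAT_PATH} (is nf_conntrack loaded?)")
    else:
        # Only the event counters are printed; these typically move under table pressure.
        for key in ("drop", "early_drop", "insert_failed"):
            print(f"{key}={totals.get(key, 0)}")
```

These are cumulative counters, so an alert would normally watch the difference between two readings rather than the absolute value.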

Alert design (field thresholds)

In the field I use three thresholds:

70%: trend alert (is it rising?)
85%: action alert (prepare mitigation)
95%: critical (incident command + traffic reduction)
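
The three thresholds can be expressed as a small classifier. This is a sketch only: the level names and the choice to treat a value exactly on a boundary as belonging to the higher level are assumptions, not something the thresholds themselves dictate.

```python
# Thresholds from the text: 70% trend, 85% action, 95% critical.
# Ordered from highest to lowest so the first match wins.
THRESHOLDS = [(95.0, "critical"), (85.0, "action"), (70.0, "trend")]


def classify_utilization(count: int, max_entries: int) -> str:
    """Return the alert level for the current table fill ratio."""
    if max_entries <= 0:
        raise ValueError("max_entries must be positive")
    pct = 100.0 * count / max_entries
    for floor, level in THRESHOLDS:
        if pct >= floor:
            return level
    return "ok"


print(classify_utilization(230000, 262144))  # about 87.7% full -> action
```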

Capacity planning: “how many connections is normal?”

Conntrack capacity isn’t only about “is there RAM?”; it’s about the connection profile:

Average connection lifetime (keep-alive, long polling)
UDP timeouts (DNS, syslog, VoIP)
Number of clients behind NAT (burst)
DDoS / scan behavior (malicious)
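
One way to turn a connection profile into a number is Little's law: the steady-state entry count is roughly the rate of new flows multiplied by how long each entry stays in the table, including the timeout that runs after the last packet. The sketch below uses an invented profile; the traffic classes, rates, and lifetimes are hypothetical and only illustrate the arithmetic.

```python
def steady_state_entries(profile: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Estimate entries per traffic class as new_flows_per_sec * entry_lifetime_sec."""
    return {name: rate * lifetime for name, (rate, lifetime) in profile.items()}


# Hypothetical profile: (new flows per second, seconds an entry stays in the table).
EXAMPLE_PROFILE = {
    "dns_udp": (2000.0, 30.0),
    "http_keepalive": (300.0, 120.0),
    "long_polling": (20.0, 600.0),
}

estimate = steady_state_entries(EXAMPLE_PROFILE)
for name, entries in estimate.items():
    print(f"{name}: ~{entries:.0f} entries")
print(f"total: ~{sum(estimate.values()):.0f} entries")
```

In this made-up profile the short-lived DNS flows hold the most entries (about 60,000 of roughly 108,000) purely because of their rate, which is why a short UDP timeout can matter more than a larger table.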

Practical measurement (counters plus a sample of entries):

sudo conntrack -S 2>/dev/null || true
sudo conntrack -L 2>/dev/null | head -n 5 || true
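
The commands above show counters and the first few entries. Ranking sources by entry count takes a little parsing. The sketch below assumes the common text format of conntrack -L, where each line carries key=value fields and the first src= field is the original-direction source. The function name top_talkers and the sample lines are invented for illustration.

```python
import re
from collections import Counter

SRC_RE = re.compile(r"\bsrc=(\S+)")


def top_talkers(lines, limit: int = 20) -> list[tuple[str, int]]:
    """Count entries per original-direction source (first src= on each line)."""
    counts = Counter()
    for line in lines:
        match = SRC_RE.search(line)  # search() stops at the first src= field
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(limit)


# Invented sample lines in the assumed format.
SAMPLE_LINES = [
    "tcp      6 431999 ESTABLISHED src=10.0.0.5 dst=192.0.2.10 sport=40001 dport=443 "
    "src=192.0.2.10 dst=10.0.0.5 sport=443 dport=40001 [ASSURED] mark=0 use=1",
    "udp      17 25 src=10.0.0.5 dst=192.0.2.53 sport=53001 dport=53 "
    "src=192.0.2.53 dst=10.0.0.5 sport=53 dport=53001 mark=0 use=1",
    "tcp      6 110 TIME_WAIT src=10.0.0.9 dst=192.0.2.10 sport=40002 dport=443 "
    "src=192.0.2.10 dst=10.0.0.9 sport=443 dport=40002 [ASSURED] mark=0 use=1",
]

print(top_talkers(SAMPLE_LINES))  # [('10.0.0.5', 2), ('10.0.0.9', 1)]
```

On a live node the same function can be fed the lines of sudo conntrack -L read from standard input instead of the sample list.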

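If the conntrack tool is missing, socket counts are a rough proxy (they cover only connections that terminate on this host, not forwarded or NATed flows):

ss -s
ss -Htn state established | wc -l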

Safe tuning: nf_conntrack_max and timeouts

First rule: just raising max usually only hides the problem. Still, doing it correctly relieves pressure.

Example (temporary):

sudo sysctl -w net.netfilter.nf_conntrack_max=524288
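
Before picking a value, it helps to estimate what a full table would cost in memory. The sketch below is a rough upper bound under a loud assumption: the per-entry size of 320 bytes is a placeholder, not a kernel constant. The object size of the nf_conntrack slab cache on the target kernel (for example from /proc/slabinfo) is the number to plug in, and hash table overhead is ignored here.

```python
def conntrack_memory_mib(max_entries: int, bytes_per_entry: int = 320) -> float:
    """Rough slab memory in MiB if the table fills to max_entries.

    bytes_per_entry is a placeholder assumption; measure it on the target kernel.
    """
    return max_entries * bytes_per_entry / (1024 * 1024)


print(f"{conntrack_memory_mib(524288):.0f} MiB")  # 160 MiB with the placeholder size
```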

To make it persistent:

Write the setting into /etc/sysctl.d/99-conntrack.conf and load it with sudo sysctl --system
Track the change in a change record

Tuning timeouts is more impactful (especially UDP):

sudo sysctl -a | grep -E "nf_conntrack_(tcp|udp)_" | head
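
To see which timeouts keep entries alive the longest, the same values can be read from /proc/sys/net/netfilter and sorted. This is a sketch under the assumption that the timeout sysctls appear there as one file per setting holding an integer number of seconds. The function name read_timeouts is illustrative.

```python
from pathlib import Path

NETFILTER_DIR = Path("/proc/sys/net/netfilter")


def read_timeouts(directory: Path = NETFILTER_DIR) -> dict[str, int]:
    """Return {sysctl_name: seconds} for nf_conntrack TCP/UDP timeout files."""
    timeouts = {}
    for pattern in ("nf_conntrack_tcp_timeout_*", "nf_conntrack_udp_timeout*"):
        for path in sorted(directory.glob(pattern)):
            try:
                timeouts[path.name] = int(path.read_text().strip())
            except (OSError, ValueError):
                continue  # unreadable or non-numeric entry; skip it
    return timeouts


if __name__ == "__main__":
    # Longest timeouts first: these entries occupy table slots the longest.
    for name, seconds in sorted(read_timeouts().items(), key=lambda kv: -kv[1]):
        print(f"{seconds:>8}s  {name}")
```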

Incident Runbook: if the table is heading toward full

1) Triage (5 min)

date
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
dmesg -T | tail -n 120
ss -s

Ask:

Is this node doing NAT? (egress / LB / gateway)
Is the traffic “legitimate or anomalous”?
Did a deploy/feature produce a new connection pattern?

2) Containment: first safe moves

Suspected malicious/scan: rate-limit / synproxy / upstream filtering at the edge
Legitimate traffic: reduce components creating new connections (e.g. worker count), review keep-alive settings
Per node: shift traffic away (lower LB weight / drain)

3) Temporary relief (controlled)

Raise nf_conntrack_max (check available RAM first)
Tune UDP timeouts in small steps

Verification after the change:

watch -n 2 'echo -n "count="; cat /proc/sys/net/netfilter/nf_conntrack_count; echo -n "max="; cat /proc/sys/net/netfilter/nf_conntrack_max'
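
Two readings of the count taken a known interval apart are enough for a rough time-to-full estimate, which helps decide how much time containment has. The sketch below uses simple linear extrapolation. That is an assumption: real growth during an incident is rarely linear, so treat the result as an order of magnitude. The function name and the sample numbers are hypothetical.

```python
def seconds_until_full(count_then: int, count_now: int,
                       interval_sec: float, max_entries: int):
    """Linear extrapolation from two samples; None if the table is not growing."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    growth_per_sec = (count_now - count_then) / interval_sec
    if growth_per_sec <= 0:
        return None
    return max(0.0, (max_entries - count_now) / growth_per_sec)


# Hypothetical readings 60 seconds apart on a table with max=262144.
eta = seconds_until_full(200000, 206000, 60, 262144)
print(f"~{eta / 60:.1f} minutes until full at the current rate")
```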

4) Recovery standard

The most common mistake in conntrack incidents: leaving the temporary increase in place permanently.

Runbook standard:

Root cause and permanent action within 24–48 hours
An “expire” note for temporary sysctls
A “connection profile” graph in the retrospective
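
The “expire” note can be made machine-checkable. The sketch below is one possible convention, not an established one: the "# expires: YYYY-MM-DD" comment format and both function names are hypothetical. It relies only on the fact that sysctl.d files treat lines starting with # as comments.

```python
import re
from datetime import date

# Hypothetical convention: a comment line of the form "# expires: YYYY-MM-DD".
EXPIRES_RE = re.compile(r"^#\s*expires:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def render_temporary_sysctl(key: str, value: int, expires: date, reason: str) -> str:
    """Build the text of a sysctl.d fragment that carries its own expiry note."""
    return (
        f"# temporary: {reason}\n"
        f"# expires: {expires.isoformat()}\n"
        f"{key} = {value}\n"
    )


def is_expired(fragment: str, today: date) -> bool:
    """True if the fragment has an expiry note dated before today."""
    match = EXPIRES_RE.search(fragment)
    return bool(match) and date.fromisoformat(match.group(1)) < today


fragment = render_temporary_sysctl(
    "net.netfilter.nf_conntrack_max", 524288, date(2024, 5, 3), "incident relief"
)
print(fragment, end="")
print(is_expired(fragment, date(2024, 5, 4)))  # True
```

A periodic job could run a check like is_expired over the files in /etc/sysctl.d and raise a ticket for anything past its date, so the temporary increase does not quietly become permanent.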

Postmortem checklist (permanent fix)

Application: keep-alive, connection pool, retry budget, timeouts
Edge: SYN flood resilience, rate-limit, WAF/IDS signal
Platform: drain/evacuation flow, per-node conntrack alerts
Security: scan/abuse detection, automatic blocklist / upstream coordination

Conclusion

Conntrack is the network’s invisible capacity limit. The manageable model is the trio of measurement + alerting + controlled intervention. Seeing the “table full” log actually means you missed the alarm; the goal is to be able to intervene before that log is generated.
