Most teams experience a conntrack problem as “everything broke at once”. In reality, the signal is usually there beforehand: table utilization rises, new connections start to drop, the application retries, and the incident grows.
This article gives a complete capacity planning, alerting, and incident runbook for Linux conntrack. The goal is to have an operationally manageable model in place before the “table is full” moment arrives.
Why is conntrack critical?
Wherever a stateful firewall or NAT is in use, conntrack becomes an invisible shared resource that every tracked flow on the node competes for:
- L4 load balancer / reverse proxy
- NAT gateway / egress node
- Kubernetes nodes (overlay + service NAT)
- Edge servers (nftables/iptables)
Typical symptoms when the table fills up:
- New TCP connections cannot be established
- UDP packets disappear silently
- The app falls into the “service is up but not working” pattern
Measure: table utilization and drop signals
The two most basic metrics:
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
Quick ratio:
python3 - <<'PY'
# Print conntrack table utilization as a percentage of nf_conntrack_max.
import pathlib
base = pathlib.Path("/proc/sys/net/netfilter")
count = int((base / "nf_conntrack_count").read_text().strip())
mx = int((base / "nf_conntrack_max").read_text().strip())
print(f"conntrack_utilization={(count/mx)*100:.1f}% ({count}/{mx})")
PY
Kernel log signal (critical):
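sudo dmesg -T | grep -nE "nf_conntrack: table full" || true
If you see “table full, dropping packet”, you’ve already started dropping. The kernel rate-limits this message, so the number of log lines understates the number of dropped packets.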
Alert design (field thresholds)
In the field I use three thresholds:
- 70%: trend alert (is it rising?)
- 85%: action alert (prepare mitigation)
- 95%: critical (incident command + traffic reduction)
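The three thresholds can be expressed as a small classification function. This is a minimal sketch: the function name and the level labels are illustrative, and only the 70/85/95 percentages come from the list above.

```python
from typing import Optional

# Thresholds from the list above, as fractions of nf_conntrack_max,
# ordered from most to least severe.
THRESHOLDS = [(0.95, "critical"), (0.85, "action"), (0.70, "trend")]


def classify_utilization(count: int, maximum: int) -> Optional[str]:
    """Return the highest alert level reached, or None below 70%."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    ratio = count / maximum
    for floor, level in THRESHOLDS:
        if ratio >= floor:
            return level
    return None


# 230000 of 262144 entries is about 87.7%, so this prints "action".
print(classify_utilization(230_000, 262_144))
```

In a real alerting setup the same comparison would normally live in the monitoring system's rule language; the point here is only that the levels are ordered and that the most severe match wins.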
Capacity planning: “how many connections is normal?”
Conntrack capacity isn’t only about “is there RAM?”; it’s about the connection profile:
- Average connection lifetime (keep-alive, long polling)
- UDP timeouts (DNS, syslog, VoIP)
- Number of clients behind NAT (burst)
- DDoS / scan behavior (malicious)
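One rough way to turn a connection profile into a number is Little's law: the steady-state number of entries is approximately the rate of new flows multiplied by how long each entry stays in the table (connection lifetime plus the conntrack timeout after the last packet). The flow classes, rates, and hold times below are invented for illustration; substitute values measured on your own nodes.

```python
import math
from typing import Dict, Tuple


def estimate_entries(flows: Dict[str, Tuple[float, float]]) -> float:
    """Sum (new flows per second * seconds held in the table) per flow class."""
    return sum(rate * hold for rate, hold in flows.values())


def suggest_max(entries: float, target_utilization: float = 0.70) -> int:
    """Size the table so the estimated steady state sits at the target utilization."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(entries / target_utilization)


# Hypothetical profile: (new flows per second, seconds each entry is held).
profile = {
    "http_keepalive": (200.0, 120.0),
    "dns_udp": (1500.0, 30.0),
    "short_tcp": (500.0, 60.0),
}

steady_state = estimate_entries(profile)
print(f"estimated entries: {steady_state:.0f}")
print(f"suggested nf_conntrack_max: {suggest_max(steady_state)}")
```

The estimate only covers steady state. Bursts from clients behind NAT and scan traffic need extra headroom on top of it, which is why the sketch sizes for the 70% trend threshold rather than for 100%.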
Practical measurement (per-CPU counters and a sample of entries, the raw input for a top-talkers breakdown):
sudo conntrack -S 2>/dev/null || true
sudo conntrack -L 2>/dev/null | head -n 5 || true
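To go from raw entries to a top-talkers list, the `conntrack -L` output can be aggregated by source address. The sketch below assumes the usual line layout, where the first `src=` field on each line is the original direction of the flow; the sample lines are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# The first src= field on a conntrack line is the original (client) direction.
SRC_RE = re.compile(r"\bsrc=(\S+)")


def top_talkers(lines: Iterable[str], n: int = 20) -> List[Tuple[str, int]]:
    """Count conntrack entries per original source address, largest first."""
    counts: Counter = Counter()
    for line in lines:
        match = SRC_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)


# Illustrative sample in the shape of `conntrack -L` output.
sample = [
    "tcp      6 431999 ESTABLISHED src=10.0.0.5 dst=10.0.1.9 sport=51000 dport=443 "
    "src=10.0.1.9 dst=10.0.0.5 sport=443 dport=51000 [ASSURED] mark=0 use=1",
    "udp      17 25 src=10.0.0.5 dst=10.0.0.2 sport=40000 dport=53 "
    "src=10.0.0.2 dst=10.0.0.5 sport=53 dport=40000 mark=0 use=1",
    "tcp      6 110 TIME_WAIT src=10.0.0.7 dst=10.0.1.9 sport=51001 dport=443 "
    "src=10.0.1.9 dst=10.0.0.7 sport=443 dport=51001 [ASSURED] mark=0 use=1",
]

print(top_talkers(sample, n=2))  # [('10.0.0.5', 2), ('10.0.0.7', 1)]
```

On a live node the same function can be fed `sys.stdin` with the output of `sudo conntrack -L` piped into the script.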
If the conntrack tool is missing, ss gives a rough picture of local sockets only (it does not see forwarded or NATed flows):
ss -s
ss -Hant state established | wc -l
Safe tuning: nf_conntrack_max and timeouts
First rule: just raising max usually only hides the problem. Still, doing it correctly relieves pressure.
Example (temporary):
sudo sysctl -w net.netfilter.nf_conntrack_max=524288
To make it persistent:
- Write the setting into /etc/sysctl.d/99-conntrack.conf
- Manage the change as a tracked change record
Tuning timeouts is more impactful (especially UDP):
sudo sysctl -a 2>/dev/null | grep -E "nf_conntrack_(tcp|udp)_" | head
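Before changing any timeout, it helps to see which ones hold entries the longest. The sketch below reads the TCP and UDP timeout files from `/proc/sys/net/netfilter` and sorts them by value; the function name is illustrative, and which files exist depends on the kernel and the loaded modules.

```python
from pathlib import Path
from typing import List, Tuple


def read_timeouts(base: str = "/proc/sys/net/netfilter") -> List[Tuple[str, int]]:
    """Return (name, seconds) for TCP/UDP conntrack timeouts, longest first."""
    found = []
    for path in Path(base).glob("nf_conntrack_*_timeout*"):
        # Keep only the TCP and UDP timeouts, as in the sysctl filter above.
        if not path.name.startswith(("nf_conntrack_tcp_", "nf_conntrack_udp_")):
            continue
        try:
            found.append((path.name, int(path.read_text().strip())))
        except (OSError, ValueError):
            continue  # unreadable or non-numeric entry: skip it
    return sorted(found, key=lambda item: item[1], reverse=True)


# Example use on a Linux host:
#   for name, seconds in read_timeouts():
#       print(f"{seconds:>8}s  {name}")
```

On many systems the established-TCP timeout tops this list at several days, which is why idle long-lived connections can hold entries far longer than the traffic itself suggests.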
Incident Runbook: if the table is heading toward full
1) Triage (5 min)
date
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
dmesg -T | tail -n 120
ss -s
Ask:
- Is this node doing NAT? (egress / LB / gateway)
- Is the traffic “legitimate or anomalous”?
- Did a deploy/feature produce a new connection pattern?
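During triage it also helps to know how fast the table is filling. A minimal sketch, assuming two readings of nf_conntrack_count taken a known number of seconds apart and a roughly linear trend (real traffic is rarely linear, so treat the result as an order of magnitude):

```python
from typing import Optional


def seconds_until_full(
    count_then: int, count_now: int, interval_s: float, maximum: int
) -> Optional[float]:
    """Linearly extrapolate two samples; None if the table is not growing."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rate = (count_now - count_then) / interval_s  # entries gained per second
    if rate <= 0:
        return None
    return max(0.0, (maximum - count_now) / rate)


# Two hypothetical samples taken 60 s apart on a node with max=262144.
eta = seconds_until_full(180_000, 186_000, 60.0, 262_144)
print(f"about {eta / 60:.0f} minutes until full")
```

A result of a few minutes points toward containment first; a result of hours leaves room to answer the questions above before acting.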
2) Containment: first safe moves
- Suspected malicious/scan: rate-limit / synproxy / upstream filtering at the edge
- Legitimate traffic: reduce components creating new connections (e.g. worker count), review keep-alive settings
- Per node: shift traffic away (lower LB weight / drain)
3) Temporary relief (controlled)
- Raise nf_conntrack_max (while measuring RAM)
- Tune UDP timeouts in small steps
Verification after the change:
watch -n 2 'echo -n "count="; cat /proc/sys/net/netfilter/nf_conntrack_count; echo -n "max="; cat /proc/sys/net/netfilter/nf_conntrack_max'
4) Recovery standard
The most common mistake in conntrack incidents: leaving the temporary increase in place permanently.
Runbook standard:
- Root cause and permanent action within 24–48 hours
- An “expire” note for temporary sysctls
- A “connection profile” graph in the retrospective
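One way to keep the “expire” note attached to a temporary sysctl is to put it in the sysctl.d fragment itself as a comment. The helper below is a hypothetical sketch that only renders the file text; the two-day default mirrors the 24–48 hour window above, and writing the file and reverting it are left to your change process.

```python
from datetime import date, timedelta


def render_temporary_sysctl(
    key: str, value: int, reason: str, today: date, days_valid: int = 2
) -> str:
    """Render a sysctl.d fragment whose comments carry the reason and expiry date."""
    expires = today + timedelta(days=days_valid)
    return (
        f"# TEMPORARY: {reason}\n"
        f"# Set on {today.isoformat()}, expires {expires.isoformat()}: "
        "revert or make permanent by then.\n"
        f"{key} = {value}\n"
    )


print(
    render_temporary_sysctl(
        "net.netfilter.nf_conntrack_max",
        524288,
        "incident relief, root cause pending",
        date(2024, 1, 15),
    )
)
```

Keeping the expiry next to the value means anyone who later reads the file sees that the setting was never meant to be permanent.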
Postmortem checklist (permanent fix)
- Application: keep-alive, connection pool, retry budget, timeouts
- Edge: SYN flood resilience, rate-limit, WAF/IDS signal
- Platform: drain/evacuation flow, per-node conntrack alerts
- Security: scan/abuse detection, automatic blocklist / upstream coordination
Conclusion
Conntrack is the network’s invisible capacity limit. The manageable model is the trio of measurement + alerting + controlled intervention. Seeing the “table full” log actually means you missed the alarm; the goal is to be able to intervene before that log is generated.