Mustafa Erbay
Tutorials · 8 min read

Linux Conntrack Capacity Planning and Alerting Runbook

A practical guide for generating signals before the nf_conntrack table fills up, applying safe sysctl tuning, and recovering in a controlled way during an…

Most teams experience a conntrack problem as “everything broke at once”. In reality, the signals usually arrive in sequence: table utilization rises, new connections start to drop, the application retries, and the incident grows.

This article covers capacity planning, alerting, and an incident runbook for Linux conntrack. The goal is to have an operationally manageable model in place before the “table is full” moment arrives.

Why is conntrack critical?

Wherever a stateful firewall or NAT sits in the path, the conntrack table becomes an invisible shared resource with a hard limit:

  • L4 load balancer / reverse proxy
  • NAT gateway / egress node
  • Kubernetes nodes (overlay + service NAT)
  • Edge servers (nftables/iptables)

Typical symptoms when the table fills up:

  • New TCP connections cannot be established
  • UDP “silently” disappears
  • The app falls into the “service is up but not working” pattern

Measure: table utilization and drop signals

The two most basic metrics:

cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

Quick ratio:

python3 - <<'PY'
# Read the current entry count and the configured ceiling from procfs.
import pathlib
base = pathlib.Path("/proc/sys/net/netfilter")
count = int((base / "nf_conntrack_count").read_text().strip())
mx = int((base / "nf_conntrack_max").read_text().strip())
print(f"conntrack_utilization={(count/mx)*100:.1f}% ({count}/{mx})")
PY

Kernel log signal (critical):

dmesg -T | grep -n -i "conntrack" || true

The line to look for is “nf_conntrack: table full, dropping packet”. If you see it, the kernel is already dropping packets for new connections.
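Besides the kernel log, the per-CPU counters printed by `conntrack -S` are a second drop signal. The sketch below sums them across CPUs. It assumes the usual `key=value` line shape; the exact field set differs between conntrack-tools versions, and the sample text is made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample in the shape printed by `conntrack -S` (one line per CPU).
# The set of fields varies between conntrack-tools versions.
SAMPLE = """\
cpu=0   found=0 invalid=12 insert=0 insert_failed=3 drop=5 early_drop=1 error=0 search_restart=7
cpu=1   found=0 invalid=4 insert=0 insert_failed=0 drop=2 early_drop=0 error=0 search_restart=1
"""


def sum_conntrack_stats(text: str) -> Counter:
    """Sum every key=value counter across CPUs, skipping the cpu index itself."""
    totals: Counter = Counter()
    for key, value in re.findall(r"(\w+)=(\d+)", text):
        if key != "cpu":
            totals[key] += int(value)
    return totals


totals = sum_conntrack_stats(SAMPLE)
# These three counters are the ones most directly tied to a full table.
for name in ("drop", "early_drop", "insert_failed"):
    print(f"{name}={totals[name]}")
```

The counters are cumulative, so an alert would normally fire on the increase between two readings rather than on the absolute value.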

Alert design (field thresholds)

In the field I use three thresholds:

  • 70%: trend alert (is it rising?)
  • 85%: action alert (prepare mitigation)
  • 95%: critical (incident command + traffic reduction)
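The three thresholds above can be expressed as a small classifier. This is a minimal sketch: the function name and the level labels are illustrative, not taken from any monitoring product.

```python
def alert_level(count: int, maximum: int) -> str:
    """Map conntrack utilization to the 70/85/95 percent thresholds."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    # Integer comparison avoids floating-point edge cases at the boundaries.
    if count * 100 >= 95 * maximum:
        return "critical"  # incident command + traffic reduction
    if count * 100 >= 85 * maximum:
        return "action"    # prepare mitigation
    if count * 100 >= 70 * maximum:
        return "trend"     # is it rising?
    return "ok"


for sample_count in (100_000, 380_000, 450_000, 500_000):
    print(sample_count, alert_level(sample_count, 524_288))
```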

Capacity planning: “how many connections is normal?”

Conntrack capacity isn’t only about “is there RAM?”; it’s about the connection profile:

  • Average connection lifetime (keep-alive, long polling)
  • UDP timeouts (DNS, syslog, VoIP)
  • Number of clients behind NAT (burst)
  • DDoS / scan behavior (malicious)
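One way to turn a connection profile into a number is Little's law: steady-state entries ≈ new flows per second × how long each entry stays in the table (active time plus the timeout tail). This is a back-of-the-envelope technique, not something the kernel computes, and every figure in the profile below is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TrafficClass:
    name: str
    new_flows_per_sec: float
    entry_lifetime_sec: float  # active time plus conntrack timeout tail


def steady_state_entries(classes) -> float:
    """Little's law: entries = arrival rate x time in table, summed per class."""
    return sum(c.new_flows_per_sec * c.entry_lifetime_sec for c in classes)


def required_max(entries: float, target_utilization: float = 0.70) -> int:
    """Table size that keeps the estimated load at or below the target ratio."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(entries / target_utilization)


# Hypothetical profile; replace with measured rates and lifetimes.
profile = [
    TrafficClass("short HTTP requests", 500, 125),
    TrafficClass("DNS over UDP", 300, 30),
    TrafficClass("long-lived keep-alive", 5, 3600),
]

entries = steady_state_entries(profile)
print(f"estimated entries: {entries:.0f}")
print(f"nf_conntrack_max for 70% utilization: {required_max(entries)}")
```

The estimate is most sensitive to entry lifetime, which is why timeouts matter as much as raw connection rate.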

Practical measurement (global counters and a sample of entries):

sudo conntrack -S 2>/dev/null || true
sudo conntrack -L 2>/dev/null | head -n 5 || true
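To get from the raw listing to a top-talkers view, the entries can be grouped by original source address. The sketch below assumes the usual `conntrack -L` line layout, where the first `src=` field belongs to the original direction. The sample lines are invented, using private and documentation address ranges.

```python
import re
from collections import Counter

# Hypothetical lines in the shape printed by `conntrack -L`.
SAMPLE = """\
tcp      6 431999 ESTABLISHED src=10.0.0.5 dst=192.0.2.10 sport=51234 dport=443 src=192.0.2.10 dst=10.0.0.5 sport=443 dport=51234 [ASSURED] mark=0 use=1
tcp      6 118 TIME_WAIT src=10.0.0.5 dst=192.0.2.11 sport=51240 dport=443 src=192.0.2.11 dst=10.0.0.5 sport=443 dport=51240 [ASSURED] mark=0 use=1
udp      17 25 src=10.0.0.7 dst=10.0.0.2 sport=40000 dport=53 src=10.0.0.2 dst=10.0.0.7 sport=53 dport=40000 mark=0 use=1
"""


def top_talkers(listing: str, limit: int = 20) -> list[tuple[str, int]]:
    """Count entries per original-direction source address."""
    sources: Counter = Counter()
    for line in listing.splitlines():
        # re.search returns the first match, i.e. the original-direction src.
        match = re.search(r"\bsrc=(\S+)", line)
        if match:
            sources[match.group(1)] += 1
    return sources.most_common(limit)


for address, entry_count in top_talkers(SAMPLE):
    print(f"{entry_count:6d} {address}")
```

On a live node, `SAMPLE` would be replaced by the captured output of `sudo conntrack -L`, for example read from standard input.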

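If the conntrack tool is missing, ss gives a rough picture. It only sees sockets that terminate on this host, not forwarded or NATed flows:

ss -s
ss -Htn state established | wc -l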

Safe tuning: nf_conntrack_max and timeouts

First rule: just raising max usually only hides the problem. Still, doing it correctly relieves pressure.

Example (temporary):

sudo sysctl -w net.netfilter.nf_conntrack_max=524288

To make it persistent:

  • Write into /etc/sysctl.d/99-conntrack.conf
  • Manage the change as a tracked change record
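Before persisting a larger nf_conntrack_max, it helps to estimate what a full table would cost in memory. The sketch below uses an assumed 320 bytes per entry. That figure is only a placeholder: the real object size depends on the kernel build and is better read from the nf_conntrack row of /proc/slabinfo on the target host.

```python
# ASSUMPTION: per-entry size varies by kernel version and configuration.
ASSUMED_BYTES_PER_ENTRY = 320


def conntrack_memory_bytes(max_entries: int,
                           bytes_per_entry: int = ASSUMED_BYTES_PER_ENTRY) -> int:
    """Rough upper bound on memory used by entries when the table is full."""
    if max_entries < 0 or bytes_per_entry <= 0:
        raise ValueError("max_entries must be >= 0 and bytes_per_entry > 0")
    return max_entries * bytes_per_entry


estimate = conntrack_memory_bytes(524_288)
print(f"~{estimate / 2**20:.0f} MiB for 524288 entries "
      f"at {ASSUMED_BYTES_PER_ENTRY} bytes each")
```

The estimate covers the entries only; the hash table itself adds a smaller, separate allocation.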

Tuning timeouts is often more impactful than raising the maximum (especially for UDP). List the current values first:

sudo sysctl -a 2>/dev/null | grep -E "nf_conntrack_(tcp|udp)_timeout"
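The same timeout values can be collected programmatically, for example to export them as metrics or to diff them before and after a change. This is a minimal sketch that reads the files under /proc/sys/net/netfilter. The function name is illustrative, and it returns an empty result where the conntrack module is not loaded.

```python
from pathlib import Path

NETFILTER_DIR = Path("/proc/sys/net/netfilter")


def read_timeouts(base: Path = NETFILTER_DIR) -> dict[str, int]:
    """Return {sysctl file name: seconds} for TCP and UDP conntrack timeouts."""
    timeouts: dict[str, int] = {}
    if not base.is_dir():
        return timeouts  # conntrack module not loaded, or not a Linux host
    for path in sorted(base.glob("nf_conntrack_*_timeout*")):
        if not path.name.startswith(("nf_conntrack_tcp_", "nf_conntrack_udp_")):
            continue  # skip icmp, generic and other protocols
        try:
            timeouts[path.name] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric entry; skip it
    return timeouts


for name, seconds in read_timeouts().items():
    print(f"{name} = {seconds}s")
```

Saving this output before and after a tuning step gives the change record a concrete before/after snapshot.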

Incident Runbook: if the table is heading toward full

1) Triage (5 min)

date
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
dmesg -T | tail -n 120
ss -s

Ask:

  • Is this node doing NAT? (egress / LB / gateway)
  • Is the traffic legitimate or anomalous?
  • Did a deploy/feature produce a new connection pattern?
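During triage it also helps to know how much time is left. The sketch below projects time until the table is full from two or more (timestamp, count) samples, using a straight line between the first and last sample. That is a deliberately crude model, and the sample values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_full(samples: Sequence[Tuple[float, int]],
                       maximum: int) -> Optional[float]:
    """Linear projection from the first and last (timestamp_sec, count) sample.

    Returns None when the table is flat or shrinking.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    rate = (c1 - c0) / (t1 - t0)  # entries per second
    if rate <= 0:
        return None
    return max(0.0, (maximum - c1) / rate)


# Hypothetical readings of nf_conntrack_count taken 60 seconds apart.
readings = [(0.0, 400_000), (60.0, 412_000)]
remaining = seconds_until_full(readings, 524_288)
print("not growing" if remaining is None else f"full in ~{remaining / 60:.1f} min")
```

A short projected time argues for containment first and root-cause analysis second.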

2) Containment: first safe moves

  • Suspected malicious/scan: rate-limit / synproxy / upstream filtering at the edge
  • Legitimate traffic: reduce components creating new connections (e.g. worker count), review keep-alive settings
  • Per node: shift traffic away (lower LB weight / drain)

3) Temporary relief (controlled)

  • Raise nf_conntrack_max while watching memory usage
  • Tune UDP timeouts in small steps

Verification after the change:

watch -n 2 'echo -n "count="; cat /proc/sys/net/netfilter/nf_conntrack_count; echo -n "max="; cat /proc/sys/net/netfilter/nf_conntrack_max'

4) Recovery standard

The most common mistake in conntrack incidents: leaving the temporary increase in place permanently.

Runbook standard:

  • Root cause identified and permanent action agreed within 24–48 hours
  • An expiry note on every temporary sysctl change
  • A “connection profile” graph in the retrospective

Postmortem checklist (permanent fix)

  • Application: keep-alive, connection pool, retry budget, timeouts
  • Edge: SYN flood resilience, rate-limit, WAF/IDS signal
  • Platform: drain/evacuation flow, per-node conntrack alerts
  • Security: scan/abuse detection, automatic blocklist / upstream coordination

Conclusion

Conntrack is the network’s invisible capacity limit. The manageable model is the trio of measurement + alerting + controlled intervention. Seeing the “table full” log actually means you missed the alarm; the goal is to be able to intervene before that log is generated.


Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security and Software

Since 2006 I have been working on system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that has proven itself in the field.
