Linux TCP Backlog and SYN Flood Resilience Runbook

Some incidents that look like “the API isn’t responding” in production are rooted not in the application but in the Linux TCP queues. Under sudden traffic spikes, faulty health-check behavior, or an actual SYN flood, a connection attempt can be dropped by the kernel before it ever reaches the application.

This article is a runbook that makes SYN backlog and accept queue behavior operationally legible. It is meant to be usable both for fast intervention and for lasting resilience.

1) Mental model: two distinct queues

A TCP listener has two separate “waiting areas”:

SYN backlog (half-open connections): SYN arrived → SYN/ACK sent → waiting for ACK.
Accept queue (completed connections): 3-way handshake done → the application will pick it up via accept().

The two queues have different limits and different failure modes, so judging capacity only by the listen(backlog) value is often misleading.

2) Triage: a 5-minute checklist

2.1 Symptom verification

Typical client-side symptoms:

connect timeout, i/o timeout, upstream connect error
5xx increase (at the proxy/LB layer)
A rising failed-connection ratio, rather than rising latency

Quick check on the server:

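# Listening TCP sockets with queue columns and the owning process
ss -ltnp

# Same view without the process lookup
ss -ltn

For a listening socket, Recv-Q is the current accept queue length and Send-Q is its configured limit. If Recv-Q keeps rising toward Send-Q, the app may not be keeping up with accept().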
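To make the Recv-Q check repeatable, here is a minimal sketch that filters `ss -ltn` output for listeners whose accept queue is close to its limit. The function name `accept_queue_pressure` and the default threshold of 80 percent are assumptions for illustration, not recommended values.

```shell
# Flag listeners whose accept queue fill is at or above a threshold.
# Reads `ss -ltn` output on stdin. For LISTEN sockets, column 2 (Recv-Q) is
# the current accept queue length and column 3 (Send-Q) is its limit.
# The 80% default is a hypothetical starting point.
accept_queue_pressure() {
  threshold="${1:-80}"
  awk -v t="$threshold" '
    NR > 1 && $1 == "LISTEN" && $3 > 0 {
      pct = ($2 * 100) / $3
      if (pct >= t)
        printf "%s recv_q=%d limit=%d fill=%d%%\n", $4, $2, $3, pct
    }'
}

# On a live host: ss -ltn | accept_queue_pressure 80
```

An empty result means no listener crossed the threshold at the moment of sampling. Bursts are short, so sampling several times is more informative than a single run.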

2.2 Are SYN cookies engaged?

sysctl net.ipv4.tcp_syncookies

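0: off (the listener collapses earlier under SYN load)
1: on (cookies are used only when the SYN backlog fills)
2: always on (cookies are sent unconditionally; mainly useful for testing)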
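A small helper can turn the raw value into a readable status line for incident notes. This is a sketch; the function name `describe_syncookies` is hypothetical, and the meaning of value 2 follows the kernel's ip-sysctl documentation rather than the list above.

```shell
# Translate the net.ipv4.tcp_syncookies value into a short description.
describe_syncookies() {
  case "$1" in
    0) echo "off: no SYN cookie fallback" ;;
    1) echo "on: cookies sent only when the SYN backlog overflows" ;;
    2) echo "always: cookies sent unconditionally" ;;
    *) echo "unknown value: $1" ;;
  esac
}

# On a live host:
#   describe_syncookies "$(sysctl -n net.ipv4.tcp_syncookies)"
```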

2.3 Look at kernel counters (evidence)

netstat -s | grep -Ei 'listen|syn|cookie|overflow|drop|retrans'

Field interpretation:

Counters like “times the listen queue of a socket overflowed”: the accept queue is filling
“SYNs to LISTEN sockets dropped” / “SYN cookies sent”: connection requests are being dropped, typically under SYN backlog pressure

Note: counter names may vary by kernel version; the goal isn’t to memorize the text but to capture the overflow/drop/cookie triad.
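Because the human-readable text varies, one option is to read the raw counters from /proc/net/netstat, where each group is a header line of names followed by a line of values. The sketch below picks out the overflow/drop/cookie fields; the function name `tcpext_counters` is hypothetical, and it assumes the TcpExt field names ListenOverflows, ListenDrops, and Syncookies* are present on your kernel.

```shell
# Print selected TcpExt counters as "Name value" pairs.
# Input format (as in /proc/net/netstat): a "TcpExt:" header line listing
# field names, followed by a "TcpExt:" line with the matching values.
tcpext_counters() {
  awk '
    $1 == "TcpExt:" && !have_header {
      for (i = 2; i <= NF; i++) name[i] = $i
      have_header = 1
      next
    }
    $1 == "TcpExt:" && have_header {
      for (i = 2; i <= NF; i++)
        if (name[i] ~ /^(ListenOverflows|ListenDrops|SyncookiesSent|SyncookiesRecv|SyncookiesFailed)$/)
          print name[i], $i
      have_header = 0
    }'
}

# On a live host: tcpext_counters < /proc/net/netstat
```

The values are cumulative since boot, so a single reading only shows that an event has happened at some point. Comparing two readings taken a known interval apart shows whether it is happening now.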

3) Quick mitigation (during an incident)

Priority: stabilize connectivity, then isolate the root cause.

3.1 Enable SYN cookies (controlled)

sudo sysctl -w net.ipv4.tcp_syncookies=1

To persist:

echo 'net.ipv4.tcp_syncookies = 1' | sudo tee /etc/sysctl.d/99-syncookies.conf
sudo sysctl --system

3.2 Raise backlog limits

Even if the app passes a high listen(backlog), the kernel caps it at its own ceilings, so the effect stays limited when those ceilings are tight.

sudo sysctl -w net.core.somaxconn=4096
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=8192

Field notes:

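somaxconn: ceiling for the accept queue; a larger listen(backlog) is truncated to this value. The limit is applied when listen() is called, so an already-running listener typically needs a restart or reload to pick up a higher value
tcp_max_syn_backlog: ceiling for the SYN backlog (matters most under heavy SYN load)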
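The capping rule for the accept queue is simply the smaller of the two numbers. A minimal sketch, with the hypothetical function name `effective_accept_backlog`:

```shell
# Effective accept queue limit = min(listen(backlog), net.core.somaxconn).
effective_accept_backlog() {
  app_backlog="$1"
  somaxconn="$2"
  if [ "$app_backlog" -lt "$somaxconn" ]; then
    echo "$app_backlog"
  else
    echo "$somaxconn"
  fi
}

# On a live host, with 1024 as an example application backlog:
#   effective_accept_backlog 1024 "$(sysctl -n net.core.somaxconn)"
```

For example, an application that asks for 1024 on a host where somaxconn is 128 ends up with a limit of 128, which should match the Send-Q value that ss reports for that listener.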

3.3 Fix LB/edge behavior (biggest win is here)

The most expensive mistake is letting health checks add load to a host that is already struggling.

Operational checklist:

Is the health-check frequency too aggressive?
Are multiple layers retrying at the same time? (LB + gateway + client)
Is traffic “sticking” to a single problematic pod/host?

When possible during the incident:

Reduce retry counts / apply a retry budget
Raise the health-check threshold (reduce flap)
Distribute traffic gradually (weight/priority)
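To see why stacked retries matter, consider the worst case where every layer independently retries each failure from the layer below: the attempts multiply. The sketch below computes that upper bound; the function name `retry_amplification` and the example retry counts are hypothetical.

```shell
# Worst-case connection attempts per original request when each layer
# retries independently: product of (retries + 1) across layers.
# Arguments: the retry count configured at each layer.
retry_amplification() {
  total=1
  for retries in "$@"; do
    total=$(( total * (retries + 1) ))
  done
  echo "$total"
}

# Example: client retries 2x, gateway 2x, LB 3x
#   retry_amplification 2 2 3
```

With retry counts of 2, 2, and 3, a single failing request can become up to 36 connection attempts against a backend that is already dropping SYNs. Real behavior depends on timeouts and on which errors each layer treats as retryable, so read this as an upper bound.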

4) Lasting resilience: design choices

4.1 “Growing the backlog” is not a strategy on its own

The backlog is just a buffer. The real goals:

Cut SYN pressure at the edge (rate limit / SYN proxy / WAF)
Increase accept rate (event loop, accept() flow, threading model)
Anticipate burst patterns (campaigns, batch, cron)

4.2 Observability: which metrics should alert?

Recommended signals:

SYN cookie usage ratio / counter
Listen/accept overflow counters
SYN-RECV state ratio (sudden rise)
LB “connect error” rate (upstream connect failure)
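As one way to produce the SYN-RECV signal, the sketch below counts half-open sockets in `ss -tan` output. The function name `syn_recv_share` is hypothetical, and it assumes the first column carries the state name, as in current iproute2 output.

```shell
# Count sockets in SYN-RECV versus all listed TCP sockets.
# Reads `ss -tan` output on stdin; the first line is the column header.
syn_recv_share() {
  awk '
    NR > 1 { total++; if ($1 == "SYN-RECV") half_open++ }
    END { printf "syn_recv=%d total=%d\n", half_open, total }'
}

# On a live host: ss -tan | syn_recv_share
```

A sudden jump in syn_recv relative to total is the pattern worth alerting on. The absolute number depends on traffic volume, so any threshold should be calibrated against a normal baseline.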

5) Runbook closing: verification and recovery

Post-mitigation verification:

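Is the connect error rate itself dropping, and not just the 4xx/5xx rate?
Have the SYN cookie counters stopped growing, or at least slowed? (They are cumulative, so watch the rate rather than the absolute value.)
Has Recv-Q returned to normal in ss output?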
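Since the kernel counters only ever increase, verification means comparing two samples. A minimal sketch of the arithmetic, with the hypothetical function name `counter_rate` and integer division for simplicity:

```shell
# Average events per second between two readings of a cumulative counter.
# Arguments: earlier value, later value, seconds between the readings.
counter_rate() {
  before="$1"
  after="$2"
  seconds="$3"
  if [ "$seconds" -le 0 ]; then
    echo "interval must be positive" >&2
    return 1
  fi
  echo $(( (after - before) / seconds ))
}

# Example: the counter went from 1000 to 1600 over 60 seconds
#   counter_rate 1000 1600 60
```

A rate that falls to zero after mitigation is the evidence to record. A rate that stays flat and non-zero suggests the pressure is still present and is only being absorbed.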

Recovery:

Before promoting temporary sysctls to permanent settings, agree on the capacity increase and the edge policy they are meant to support.
Write a rollback plan: previous values, time of change, and the rollback command.
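One way to capture the previous values is to record them before any change and turn them into ready-to-run commands. The sketch below converts `sysctl` output of the form `key = value` into rollback lines; the function name `rollback_commands` and the output file name are hypothetical.

```shell
# Turn "key = value" lines (as printed by `sysctl <key>...`) into
# `sysctl -w` commands that restore those values.
rollback_commands() {
  awk -F ' = ' 'NF == 2 { printf "sudo sysctl -w %s=%s\n", $1, $2 }'
}

# Before changing anything on a live host:
#   sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog \
#          net.ipv4.tcp_syncookies | rollback_commands > rollback.sh
```

Keeping the generated file alongside the change timestamp covers the three items of the rollback plan in one artifact.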

Final note: this incident class is often reported as “the app is slow”, but most of the time the cause is network and kernel behavior. Operational leadership is about establishing the right reflex at the right layer: evidence → mitigation → permanent design.
