Some incidents that look like “the API isn’t responding” in production are rooted not in the application but in the Linux TCP listen queues. Especially under sudden traffic spikes, faulty health-check behavior, or an actual SYN flood, a connection attempt can be dropped in the kernel before it ever reaches the application, and the client’s connect() call times out.
This article is a runbook that makes SYN backlog and accept queue behavior operationally legible. It is meant to be usable both for fast intervention during an incident and for building lasting resilience.
1) Mental model: two distinct queues
A TCP listener has two separate “waiting areas”:
- SYN backlog (half-open connections): SYN arrived → SYN/ACK sent → waiting for ACK.
- Accept queue (completed connections): 3-way handshake done → the application will pick it up via accept().
The two have different limits and different failure modes. So deciding only by the listen(backlog) value is often misleading.
2) Triage: a 5-minute checklist
2.1 Symptom verification
Typical client-side symptoms:
- connect timeout, i/o timeout, upstream connect error
- 5xx increase (at the proxy/LB layer)
- The failed-connection ratio rises, not latency
Quick check on the server:
# Listening sockets with owning process (Recv-Q and Send-Q are the backlog indicators)
ss -ltnp
# Same queue summary without process details
ss -ltn
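For a listening socket, Recv-Q is the current accept queue depth and Send-Q is its configured limit. If Recv-Q keeps rising toward Send-Q, the app is likely not keeping up with accept().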
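The Recv-Q check above can be scripted. This is a minimal sketch, assuming an iproute2 `ss` that supports `-H` (no header) and reports, for LISTEN sockets, the current accept queue depth in Recv-Q and its limit in Send-Q. The function name `accept_queue_pressure` and the 80% default threshold are illustrative choices, not part of the original runbook.

```shell
# Hypothetical helper: reads `ss -Hltn` output on stdin and prints listeners
# whose accept queue (Recv-Q) is at or above a percentage of its limit (Send-Q).
accept_queue_pressure() {
  # $1 = threshold in percent (illustrative default: 80)
  awk -v t="${1:-80}" '
    # Columns without header: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    $1 == "LISTEN" && $3 > 0 && ($2 * 100) >= ($3 * t) {
      printf "%s recvq=%d limit=%d\n", $4, $2, $3
    }'
}

# Usage on a live host:
#   ss -Hltn | accept_queue_pressure 80
```

An empty result means no listener is near its limit at that instant; since the queue depth is a point-in-time value, sampling it a few times in a row is more informative than a single run.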
2.2 Are SYN cookies engaged?
sysctl net.ipv4.tcp_syncookies
- 0: off (earlier collapse under load)
- 1: on (engages when SYN backlog fills)
2.3 Look at kernel counters (evidence)
netstat -s | rg -n -i "listen|syn|cookie|overflow|drop|retrans"
Field interpretation:
- Counters like “listen queue overflow”: accept queue is filling
- “SYNs to LISTEN sockets dropped” / “SYN cookies sent”: SYN backlog pressure
Note: counter names may vary by kernel version; the goal isn’t to memorize the text but to capture the overflow/drop/cookie triad.
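Because the human-readable text varies, one option is to read the counters by name from /proc/net/netstat instead. The sketch below assumes the `TcpExt` group exposes `ListenOverflows`, `ListenDrops`, and `SyncookiesSent`, as current Linux kernels do; the function name `tcpext_counters` is hypothetical.

```shell
# Hypothetical helper: print selected TcpExt counters by name.
# /proc/net/netstat stores each group as two lines with the same prefix:
# first a line of counter names, then a line of values.
tcpext_counters() {
  # $1 = file to read (defaults to the live kernel counters)
  awk '
    $1 == "TcpExt:" && !seen {
      for (i = 2; i <= NF; i++) name[i] = $i   # remember column names
      seen = 1
      next
    }
    $1 == "TcpExt:" {
      for (i = 2; i <= NF; i++)
        if (name[i] ~ /^(ListenOverflows|ListenDrops|SyncookiesSent)$/)
          print name[i] "=" $i
    }' "${1:-/proc/net/netstat}"
}

# Usage on a live host:
#   tcpext_counters; sleep 10; tcpext_counters
```

These counters are cumulative since boot, so the evidence is in the difference between two samples, not in the absolute values.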
3) Quick mitigation (during an incident)
Priority: stabilize connectivity, then isolate the root cause.
3.1 Enable SYN cookies (controlled)
sudo sysctl -w net.ipv4.tcp_syncookies=1
To persist:
echo 'net.ipv4.tcp_syncookies = 1' | sudo tee /etc/sysctl.d/99-syncookies.conf
sudo sysctl --system
3.2 Raise backlog limits
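Even if the app’s listen(backlog) is high, the effect stays limited if kernel ceilings are tight: the kernel caps the requested backlog at net.core.somaxconn. The cap is applied when listen() is called, so an already-running listener keeps its old limit until it is restarted or calls listen() again.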
sudo sysctl -w net.core.somaxconn=4096
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=8192
Field notes:
- somaxconn: ceiling for the accept queue
- tcp_max_syn_backlog: ceiling for the SYN backlog (especially under heavy SYN load)
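As a quick illustration of the ceiling, the effective accept queue limit can be treated as the smaller of the application's listen() backlog and net.core.somaxconn. The helper below is a simplified sketch with a hypothetical name; it ignores off-by-one details of how the kernel compares the queue length against the limit.

```shell
# Hypothetical helper: effective accept-queue limit for a listener,
# taken as the smaller of the listen() backlog and net.core.somaxconn.
effective_accept_limit() {
  app_backlog="$1"   # value the application passes to listen()
  somaxconn="$2"     # current net.core.somaxconn
  if [ "$app_backlog" -lt "$somaxconn" ]; then
    echo "$app_backlog"
  else
    echo "$somaxconn"
  fi
}

# Usage on a live host (1024 is an example application backlog):
#   effective_accept_limit 1024 "$(sysctl -n net.core.somaxconn)"
```

With an application backlog of 1024 and somaxconn at 128, the result is 128, which is why raising only the application value has no visible effect.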
3.3 Fix LB/edge behavior (biggest win is here)
The most expensive mistake is a health check that increases traffic while the problem is still ongoing.
Operational checklist:
- Is the health-check frequency too aggressive?
- Are multiple layers retrying at the same time? (LB + gateway + client)
- Is traffic “sticking” to a single problematic pod/host?
When possible during the incident:
- Reduce retry counts / apply a retry budget
- Raise the health-check threshold (reduce flap)
- Distribute traffic gradually (weight/priority)
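To see why simultaneous retries across layers are worth checking, a rough worst-case calculation helps. Assuming every layer retries independently on every failure and each makes up to 1 + retries attempts, the backend can see up to (1 + retries)^layers connection attempts per original request. The function name `retry_amplification` is hypothetical, and real systems with backoff or budgets will sit below this bound.

```shell
# Hypothetical helper: worst-case connection attempts reaching the backend
# per original request when every layer retries independently.
retry_amplification() {
  retries="$1"   # retries configured at each layer
  layers="$2"    # number of layers that retry (e.g. LB + gateway + client = 3)
  total=1
  i=0
  while [ "$i" -lt "$layers" ]; do
    total=$(( total * (1 + retries) ))
    i=$(( i + 1 ))
  done
  echo "$total"
}

# Example: retry_amplification 2 3   prints 27
```

Three layers with two retries each can turn one failing request into 27 connection attempts, which is exactly the kind of extra SYN load a saturated listener cannot absorb.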
4) Lasting resilience: design choices
4.1 “Growing the backlog” is not a strategy on its own
The backlog is just a buffer. The real goals:
- Cut SYN pressure at the edge (rate limit / SYN proxy / WAF)
- Increase accept rate (event loop, accept() flow, threading model)
- Anticipate burst patterns (campaigns, batch, cron)
4.2 Observability: which metrics should alert?
Recommended signals:
- SYN cookie usage ratio / counter
- Listen/accept overflow counters
- SYN-RECV state ratio (sudden rise)
- LB “connect error” rate (upstream connect failure)
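The SYN-RECV signal can be sampled directly from `ss`. The following is a sketch, assuming `ss -Htan` prints the state in the first column and the local address:port in the fourth; the function name `syn_recv_by_port` is hypothetical, and how the result is shipped to a metrics system is left open.

```shell
# Hypothetical helper: count SYN-RECV (half-open) sockets per local port
# from `ss -Htan` output on stdin, highest count first.
syn_recv_by_port() {
  awk '
    $1 == "SYN-RECV" {
      n = split($4, parts, ":")   # local address:port; the port is the last piece
      count[parts[n]]++
    }
    END { for (p in count) print p, count[p] }' | sort -k2,2nr
}

# Usage on a live host:
#   ss -Htan | syn_recv_by_port
```

A sudden jump for one port compared with its usual baseline is the pattern to alert on; the absolute number depends heavily on normal traffic.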
5) Runbook closing: verification and recovery
Post-mitigation verification:
- Is the connect error rate itself dropping, rather than just the 401/5xx rate?
- Are SYN cookie counters declining?
- Has Recv-Q returned to normal in ss output?
Recovery:
- Before promoting temporary sysctls to permanent, settle the capacity increase and the edge policy.
- Write a rollback plan: previous values, change time, rollback command.
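One way to capture the previous values and the rollback command in a single step is to generate the rollback script before changing anything. This is a sketch, assuming `sysctl` prints "key = value" lines when given key names; the function name `sysctl_rollback_commands` and the output file name are hypothetical.

```shell
# Hypothetical helper: turn `sysctl key1 key2 ...` output ("key = value" lines)
# into the commands that would restore those values.
sysctl_rollback_commands() {
  sed -n 's/^\([^ =]*\) *= *\(.*\)$/sysctl -w \1="\2"/p'
}

# Usage before applying the mitigation (the timestamp records the change time):
#   sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_syncookies \
#     | sysctl_rollback_commands > "rollback-$(date +%Y%m%dT%H%M%S).sh"
```

The generated file lists one `sysctl -w` line per key with the value that was in effect before the change, so rolling back is a matter of reviewing it and running it with sudo.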
Final note: This incident class often starts as “the app is slow” but most of the time turns out to be network and kernel behavior. Operational leadership is about establishing the right reflex at the right layer: evidence → mitigation → permanent design.