Mustafa Erbay
Tutorials · 8 min read

Linux TCP Backlog and SYN Flood Resilience Runbook

A runbook to triage the connect timeout crisis when the SYN backlog/accept queue fills up, apply rapid mitigation, and design lasting resilience.

Some production incidents that look like “the API isn’t responding” are rooted not in the application but in the Linux TCP queues. Under sudden traffic spikes, faulty health-check behavior, or an actual SYN flood, a connection attempt can be dropped in the kernel before the application ever sees it, so the client's connect() simply times out.

This article is a runbook that makes SYN backlog and accept queue behavior operationally legible. It is meant to be usable both for fast intervention and for lasting resilience.

1) Mental model: two distinct queues

A TCP listener has two separate “waiting areas”:

  • SYN backlog (half-open connections): SYN arrived → SYN/ACK sent → waiting for ACK.
  • Accept queue (completed connections): 3-way handshake done → the application will pick it up via accept().

The two have different limits and different failure modes, so sizing or diagnosing by the listen(backlog) value alone is often misleading.
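A minimal sketch for looking at the half-open side on a live host. The function name `syn_recv_by_port` is hypothetical. It assumes its input is the output of `ss -tn state syn-recv` (one header line, local address in the second-to-last column) and counts SYN-RECV sockets per local port.

```shell
# Hypothetical helper: count half-open (SYN-RECV) sockets per local port.
# Input: output of `ss -tn state syn-recv` on stdin.
syn_recv_by_port() {
  awk 'NR > 1 {
         # Local address is the second-to-last column; the port follows the last ":".
         n = split($(NF - 1), a, ":")
         count[a[n]]++
       }
       END { for (p in count) print p, count[p] }' | sort -k2,2nr
}
```

Usage, assuming `ss` from iproute2 is installed: `ss -tn state syn-recv | syn_recv_by_port`. While SYN cookies are being sent the kernel keeps no state for those connections, so this count can understate the real pressure.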

2) Triage: a 5-minute checklist

2.1 Symptom verification

Typical client-side symptoms:

  • connect timeout, i/o timeout, upstream connect error
  • 5xx increase (at the proxy/LB layer)
  • The failed-connection ratio rises, not the latency

Quick check on the server:

# Listening sockets and backlog indicators
ss -ltnp

# Queue summary for listening sockets (Recv-Q / Send-Q columns)
ss -ltn

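For a LISTEN socket, Recv-Q is the current accept queue length and Send-Q is the configured backlog limit. If Recv-Q keeps rising toward Send-Q, the app is not keeping up with accept().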
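The Recv-Q check can be turned into a quick filter. This is a sketch, not a monitoring tool: `accept_queue_fill` is a hypothetical name, the 80% default threshold is an arbitrary assumption, and the input is assumed to be plain `ss -ltn` output.

```shell
# Hypothetical helper: print listeners whose accept queue is at least
# PCT percent full (default 80). Input: `ss -ltn` output on stdin.
accept_queue_fill() {
  awk -v pct="${1:-80}" '
    # Columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    $1 == "LISTEN" && $3 > 0 {
      fill = 100 * $2 / $3
      if (fill >= pct)
        printf "%s recvq=%d backlog=%d fill=%d%%\n", $4, $2, $3, fill
    }'
}
```

Usage: `ss -ltn | accept_queue_fill 80`. An empty result means no listener is currently near its backlog limit. It is a point-in-time sample, so short bursts can be missed.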

2.2 Are SYN cookies engaged?

sysctl net.ipv4.tcp_syncookies
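  • 0: off (the listener starts dropping SYNs as soon as the SYN backlog fills)
  • 1: on (cookies are sent only when the SYN backlog fills)
  • 2: always on (cookies are sent unconditionally; mainly useful for testing)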

2.3 Look at kernel counters (evidence)

netstat -s | grep -Ei "listen|syn|cookie|overflow|drop|retrans"

Field interpretation:

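  • Counters like “listen queue overflow”: the accept queue is filling
  • “SYNs to LISTEN sockets dropped” / “SYN cookies sent”: SYN backlog pressure (the drop counter also rises on accept queue overflow, so read it alongside the overflow counter)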

Note: counter names may vary by kernel version; the goal isn’t to memorize the text but to capture the overflow/drop/cookie triad.
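Where `netstat` is not installed, the same evidence can be read from the kernel's raw counters. The sketch below assumes the usual two-line `TcpExt:` layout of `/proc/net/netstat` (a header line of names followed by a line of values). `tcpext_counters` is a hypothetical name, and some of the listed counters may be absent on older kernels.

```shell
# Hypothetical helper: print selected TcpExt counters as "name value" lines.
# Reads /proc/net/netstat by default; a different file can be passed as $1.
tcpext_counters() {
  awk '
    # First TcpExt line holds the counter names.
    $1 == "TcpExt:" && !seen { for (i = 2; i <= NF; i++) name[i] = $i; seen = 1; next }
    # Second TcpExt line holds the values, in the same column order.
    $1 == "TcpExt:" {
      for (i = 2; i <= NF; i++)
        if (name[i] ~ /^(SyncookiesSent|SyncookiesRecv|SyncookiesFailed|ListenOverflows|ListenDrops|TCPReqQFullDrop|TCPReqQFullDoCookies)$/)
          print name[i], $i
    }' "${1:-/proc/net/netstat}"
}
```

Usage: `tcpext_counters`. The values are cumulative since boot, so a single reading only shows that an event has happened at some point, not that it is happening now.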

3) Quick mitigation (during an incident)

Priority: stabilize connectivity, then isolate the root cause.

3.1 Enable SYN cookies (controlled)

sudo sysctl -w net.ipv4.tcp_syncookies=1

To persist:

echo 'net.ipv4.tcp_syncookies = 1' | sudo tee /etc/sysctl.d/99-syncookies.conf
sudo sysctl --system

3.2 Raise backlog limits

Even if the app’s listen(backlog) is high, the effect stays limited if kernel ceilings are tight.

sudo sysctl -w net.core.somaxconn=4096
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=8192

Field notes:

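  • somaxconn: ceiling for the accept queue; it is applied when the app calls listen(), so already-listening sockets pick up a new value only after the app listens again (typically a restart or reload)
  • tcp_max_syn_backlog: ceiling for the SYN backlog (matters most under heavy SYN load)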
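The interaction between the app's listen(backlog) and the kernel ceiling is a simple minimum. A small sketch of that arithmetic, with `effective_backlog` as a hypothetical helper name:

```shell
# Hypothetical helper: the accept queue limit a listener actually gets,
# i.e. min(listen() backlog requested by the app, net.core.somaxconn).
effective_backlog() {
  if [ "$1" -gt "$2" ]; then
    echo "$2"
  else
    echo "$1"
  fi
}
```

Usage: `effective_backlog 1024 "$(sysctl -n net.core.somaxconn)"`. With an app asking for 1024 and somaxconn at 128, the listener still gets only 128, which is why raising the app-side value alone changes nothing.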

3.3 Fix LB/edge behavior (biggest win is here)

The most expensive mistake is letting health checks add load while the listener is already struggling.

Operational checklist:

  • Is the health-check frequency too aggressive?
  • Are multiple layers retrying at the same time? (LB + gateway + client)
  • Is traffic “sticking” to a single problematic pod/host?

When possible during the incident:

  • Reduce retry counts / apply a retry budget
  • Raise the health-check threshold (reduce flap)
  • Distribute traffic gradually (weight/priority)
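To see why stacked retries matter, it helps to compute the worst case. If every attempt fails and each layer retries independently, the number of connection attempts per original request is the product of (1 + retries) across the layers. The sketch below uses a hypothetical helper name and made-up retry counts.

```shell
# Hypothetical helper: worst-case connection attempts per original request.
# Arguments: the retry count configured at each layer (LB, gateway, client, ...).
retry_amplification() {
  total=1
  for r in "$@"; do
    total=$((total * (1 + r)))
  done
  echo "$total"
}
```

Usage: `retry_amplification 2 2 2` prints 27, meaning three layers with two retries each can turn one failing request into 27 SYNs against an already full queue. Cutting any single layer to zero retries brings that down to 9.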

4) Lasting resilience: design choices

4.1 “Growing the backlog” is not a strategy on its own

The backlog is just a buffer. The real goals:

  • Cut SYN pressure at the edge (rate limit / SYN proxy / WAF)
  • Increase accept rate (event loop, accept() flow, threading model)
  • Anticipate burst patterns (campaigns, batch, cron)

4.2 Observability: which metrics should alert?

Recommended signals:

  • SYN cookie usage ratio / counter
  • Listen/accept overflow counters
  • SYN-RECV state ratio (sudden rise)
  • LB “connect error” rate (upstream connect failure)

5) Runbook closing: verification and recovery

Post-mitigation verification:

  • Is it specifically the connect error rate that is dropping, not just HTTP-level errors (4xx/5xx)?
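  • Has the SYN cookie counter stopped climbing? (Kernel counters are cumulative, so watch the rate of increase rather than the absolute value.)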
  • Has Recv-Q returned to normal in ss output?

Recovery:

  • Before promoting temporary sysctls to permanent, settle the capacity increase and the edge policy they are meant to support.
  • Write a rollback plan: previous values, change time, rollback command.
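One way to make the rollback plan concrete is to capture the current values as ready-to-run commands before changing anything. The sketch below assumes the default `key = value` output format of `sysctl` and single-valued keys. `rollback_commands` is a hypothetical helper name.

```shell
# Hypothetical helper: turn `sysctl key ...` output ("key = value" lines)
# into the commands that would restore those values.
rollback_commands() {
  awk -F' = ' 'NF == 2 { printf "sysctl -w %s=%s\n", $1, $2 }'
}
```

Usage, run before the change: `sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_syncookies | rollback_commands > rollback.sh`. Note the change time alongside the file so that the previous values, the timestamp, and the rollback command end up in one place.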

Final note: This incident class often starts as “the app is slow” but most of the time turns out to be network and kernel behavior. Operational leadership is about establishing the right reflex at the right layer: evidence → mitigation → permanent design.
