Mustafa Erbay
Technology · 11 min read
Health Check Blindness in L4 Pools: Failover and Blackholes

When pool members appear 'UP' but traffic vanishes: combine active checks with passive signals to design failover that actually reflects reality.

On the L4 load balancer (VIP) side, the most dangerous failure mode is this: the health check stays green while user traffic falls off a cliff. Operations sees “pool up,” the application team insists “service down,” and everyone stares at one another. The shorthand for this scenario in the field: health check blindness.

Why does it happen? (the six root causes I see most often)

  1. The check endpoint is not “real work”: /healthz returns 200 even when the DB, queue, or downstream is broken.
  2. Partial degradation: only certain paths, tenants, or regions are affected.
  3. State / NAT / conntrack pressure: connections through the VIP die while fresh TCP probes look “fine.”
  4. Broken return path (DSR / asymmetric routing): the probe leaves through the LB but the response takes a different path and disappears.
  5. Wrong timeout thresholds: the probe tolerates a 2s response while the user-facing SLO is 200ms, so a member can be “UP” and still far too slow to be “healthy.”
  6. It is not the LB, it is backend behavior: thread pool full, accept queue saturated, GC pressure. A lightweight probe rarely catches any of these.

Design principle: pair active checks with passive signals

In my L4 VIP designs, I split health checking into two layers:

  • Active check: cheap and fast; “is the service up?” (even an L7 check should stay minimal)
  • Passive signal: derived from real traffic; “is the service well?” (errors, latency, connection success)

This approach reduces blindness because checks are not limited to synthetic probe traffic.

Examples of passive signals

  • 5xx ratio / application-level error codes
  • TCP handshake failures (SYN timeouts, RST surges)
  • p95 / p99 latency (measured behind the VIP)
  • Backend connection pool saturation (especially when there is a proxy / L7 layer)
  • “Outlier detection” (specific nodes consistently performing worse)
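A minimal sketch of how such passive signals could be collected per backend, assuming something behind the VIP (a proxy, a sidecar, or log processing) can report each real request. The names `Sample` and `PassiveWindow` and the window size are illustrative assumptions, not part of any specific load balancer.

```python
from collections import deque
from dataclasses import dataclass
import math


@dataclass
class Sample:
    latency_ms: float
    status: int       # status code seen on real traffic (0 if no response)
    connected: bool   # False if the TCP handshake to the backend failed


class PassiveWindow:
    """Keeps the most recent `size` real-traffic samples for one backend."""

    def __init__(self, size: int = 1000):
        self.samples = deque(maxlen=size)

    def record(self, sample: Sample) -> None:
        self.samples.append(sample)

    def error_ratio(self) -> float:
        """Share of samples that ended in a 5xx response."""
        if not self.samples:
            return 0.0
        errors = sum(1 for s in self.samples if s.status >= 500)
        return errors / len(self.samples)

    def connect_failure_ratio(self) -> float:
        """Share of samples where the handshake never completed."""
        if not self.samples:
            return 0.0
        failed = sum(1 for s in self.samples if not s.connected)
        return failed / len(self.samples)

    def percentile(self, q: float) -> float:
        """Nearest-rank latency percentile over connected samples."""
        latencies = sorted(s.latency_ms for s in self.samples if s.connected)
        if not latencies:
            return 0.0
        rank = max(1, math.ceil(q / 100 * len(latencies)))
        return latencies[rank - 1]
```

A count-based window keeps the sketch short; a time-based window would track “within the last minute” more faithfully at low traffic volumes.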

Avoiding blackholes: three safe ways to pull traffic away

1) Pool member disable (the classic)

The pool member disable mechanism on the LB is the best-known method. The catch: that disable decision usually rests on a single signal.

A better approach: disable the member when either of two conditions holds:

  • Active check fails, or
  • A passive signal crosses a threshold (for example, p95 tripling within a minute, or 5xx exceeding some percentage)
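The gate above can be sketched as a small decision function. This is an illustration under assumptions: the field names, the 3x growth factor, and the 5% error ceiling are placeholders to tune, and the function only decides; applying the disable is left to whatever API or CLI your LB exposes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MemberSignals:
    active_check_ok: bool
    p95_ms_now: float
    p95_ms_one_minute_ago: float
    error_ratio_5xx: float   # 0.0 to 1.0, measured on real traffic


def should_disable(
    sig: MemberSignals,
    p95_growth_limit: float = 3.0,   # assumed: "p95 tripling within a minute"
    max_error_ratio: float = 0.05,   # assumed: 5% of responses are 5xx
) -> Tuple[bool, str]:
    """Return (disable?, reason). Either signal alone is enough to disable."""
    if not sig.active_check_ok:
        return True, "active check failed"
    baseline = sig.p95_ms_one_minute_ago
    if baseline > 0 and sig.p95_ms_now >= p95_growth_limit * baseline:
        return True, "p95 latency grew past the limit within a minute"
    if sig.error_ratio_5xx > max_error_ratio:
        return True, "5xx ratio above threshold"
    return False, "healthy"
```

Returning the reason alongside the decision makes it easy to log which signal pulled the member, which helps later when someone asks why traffic moved.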

2) Route withdrawal (very clean in BGP / ECMP environments)

If you announce the VIP or service prefix via BGP, withdrawing the route is an extremely effective way to drain traffic. This pattern is a lifesaver, especially for anycast VIPs.

A key rule: drive withdrawal from a measurable signal (e.g. local proxy error rate combined with connection success), not from an ad hoc “script.”
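A sketch of a signal-driven withdrawal decision. Two things here are my assumptions rather than part of the rule above: the thresholds, and the hysteresis (several consecutive bad windows before withdrawing, several good ones before re-announcing), which I add to avoid route flapping. How the route is actually withdrawn depends on your BGP daemon and is deliberately not shown.

```python
class AnnounceDecider:
    """Decides whether the VIP prefix should stay announced."""

    def __init__(
        self,
        max_error_ratio: float = 0.05,
        min_connect_success: float = 0.99,
        bad_windows_to_withdraw: int = 3,
        good_windows_to_announce: int = 5,
    ):
        self.max_error_ratio = max_error_ratio
        self.min_connect_success = min_connect_success
        self.bad_windows_to_withdraw = bad_windows_to_withdraw
        self.good_windows_to_announce = good_windows_to_announce
        self.announced = True
        self._bad = 0
        self._good = 0

    def observe(self, error_ratio: float, connect_success: float) -> bool:
        """Feed one measurement window; returns True if the route should be announced."""
        healthy = (
            error_ratio <= self.max_error_ratio
            and connect_success >= self.min_connect_success
        )
        if healthy:
            self._good += 1
            self._bad = 0
        else:
            self._bad += 1
            self._good = 0

        if self.announced and self._bad >= self.bad_windows_to_withdraw:
            self.announced = False
        elif not self.announced and self._good >= self.good_windows_to_announce:
            self.announced = True
        return self.announced
```

Requiring more good windows to re-announce than bad windows to withdraw is a deliberate asymmetry: pulling traffic late costs users, while returning it a little late usually costs nothing.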

3) L7 outlier ejection (when you have a proxy layer)

If you run an L7 layer such as Envoy, HAProxy, or Nginx, outlier ejection lets you eject “bad nodes” automatically. Because the proxy sees status codes and per-request latency, it produces richer signals than an L4 LB can on its own.
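To make the idea concrete, here is a simplified sketch of outlier selection. It is not the algorithm of Envoy, HAProxy, or Nginx; it only illustrates the principle of comparing each node against the pool. The factor, the minimum ratio, and the ejection cap are assumed values.

```python
from statistics import median
from typing import Dict, List


def pick_outliers(
    error_ratio_by_node: Dict[str, float],
    factor: float = 3.0,              # assumed: "much worse than the pool"
    min_ratio: float = 0.02,          # assumed: ignore noise below 2% errors
    max_eject_fraction: float = 0.5,  # assumed: never eject more than half
) -> List[str]:
    """Return the nodes whose error ratio stands out from the pool median."""
    if len(error_ratio_by_node) < 2:
        return []  # nothing to compare against

    pool_median = median(error_ratio_by_node.values())
    threshold = max(min_ratio, factor * pool_median)

    # Worst nodes first, so the cap keeps the most broken ones ejected.
    candidates = sorted(
        (node for node, ratio in error_ratio_by_node.items() if ratio > threshold),
        key=lambda node: error_ratio_by_node[node],
        reverse=True,
    )
    limit = int(len(error_ratio_by_node) * max_eject_fraction)
    return candidates[:limit]
```

When every node is equally bad, nothing is ejected: the median rises with the pool, which is the intended behavior because a pool-wide failure is not an outlier problem.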

What does a healthy health check endpoint look like?

My practical guideline for endpoint design:

  • Readiness: 200 when critical dependencies are fine; 503 when they aren’t (so the LB can pull traffic)
  • Liveness: is the process alive? (a separate concern, mostly for orchestrators like Kubernetes)

Readiness should exercise at least one real dependency:

  • A DB connection (a simple SELECT)
  • Queue publish/consume (with a short timeout)
  • Cache access (when applicable)

But run these checks without slowing the probe: short timeouts, fail fast, and instrument them.
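A minimal sketch of that readiness logic, assuming each dependency check is a plain callable that raises on failure. The function name `readiness`, the 200ms budget, and the worker count are illustrative assumptions; wire the returned status code into whatever HTTP framework serves your probe endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple

# Shared pool so a probe never pays thread start-up cost.
_POOL = ThreadPoolExecutor(max_workers=8)


def readiness(
    checks: Dict[str, Callable[[], None]],
    timeout_s: float = 0.2,   # assumed total budget for the whole probe
) -> Tuple[int, Dict[str, str]]:
    """Run all dependency checks in parallel under one shared deadline.

    Returns (200, details) if every check passed in time, else (503, details).
    """
    deadline = time.monotonic() + timeout_s
    futures = {name: _POOL.submit(fn) for name, fn in checks.items()}

    results: Dict[str, str] = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            future.result(timeout=remaining)
            results[name] = "ok"
        except FutureTimeout:
            # The probe stops waiting; the worker thread itself is not killed,
            # so each check should also carry its own client-side timeout.
            results[name] = "timeout"
        except Exception as exc:
            results[name] = f"error: {exc}"

    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

The per-check details are worth exporting as metrics or logs, so that a 503 from the probe immediately tells you which dependency failed.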

Operations: a quick triage flow during an incident

When the health check is green but traffic has dropped, here is the order I follow:

  1. Are TCP handshakes from the VIP to the backend succeeding? (SYN/SYN-ACK ratio, RST surges)
  2. Is the backend’s accept queue / thread pool saturated?
  3. If asymmetric routing or DSR is in play, check the return path (policy routes, firewall state)
  4. On the LB device or service, look at conntrack capacity and “drops” counters
  5. Passive signals: are specific backend nodes misbehaving, or all of them?
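The counter-based steps of this flow can be sketched as a function that turns one snapshot into ordered findings. Everything about the snapshot is an assumption: the field names are hypothetical, and you would fill them from your own LB and backend counters. Step 3 (the return path) is not covered because it needs a manual look at routes and firewall state.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Snapshot:
    syn_sent: int                 # SYNs from the LB to the backends
    syn_ack_received: int         # SYN-ACKs that came back
    accept_queue_overflows: int   # growth of the backend listen-queue overflow counter
    conntrack_count: int          # current entries in the LB state table
    conntrack_max: int            # capacity of the LB state table
    error_ratio_by_node: Dict[str, float] = field(default_factory=dict)


def triage(
    s: Snapshot,
    min_handshake_ratio: float = 0.99,   # assumed threshold
    conntrack_limit: float = 0.9,        # assumed threshold
    max_error_ratio: float = 0.05,       # assumed threshold
) -> List[str]:
    """Return findings in the same order as the triage steps."""
    findings: List[str] = []

    if s.syn_sent > 0 and s.syn_ack_received / s.syn_sent < min_handshake_ratio:
        findings.append("handshake: SYN-ACK ratio below threshold")
    if s.accept_queue_overflows > 0:
        findings.append("backend: accept queue overflowing")
    if s.conntrack_max > 0 and s.conntrack_count / s.conntrack_max >= conntrack_limit:
        findings.append("lb: conntrack table near capacity")

    bad = sorted(n for n, r in s.error_ratio_by_node.items() if r > max_error_ratio)
    if bad and len(bad) == len(s.error_ratio_by_node):
        findings.append("passive: all nodes misbehaving")
    elif bad:
        findings.append("passive: specific nodes misbehaving: " + ", ".join(bad))
    return findings
```

An empty result does not prove health; it only says none of these particular counters crossed its threshold.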

This sequence turns the “is it the LB or the app?” debate into evidence within 5 to 10 minutes.

Conclusion: failover is a product behavior, not a checkbox

Good failover does not just mean “traffic moves when a node goes down.” The real value is rescuing traffic when a node looks “up” but is actually broken. Once you complement active checks with passive signals, health check blindness shrinks; incidents close faster, and you trade blame for evidence.


Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have worked on systems architecture, networking, server infrastructure, large-scale deployments, and software and system security. On this blog I share technical experience that has proven itself in the field.
