Health Check Blindness in L4 Pools: Failover and Blackholes

On the L4 load balancer (VIP) side, the most dangerous failure mode is this: the health check stays green while user traffic falls off a cliff. Operations sees “pool up,” the application team insists “service down,” and everyone stares at one another. In the field, the shorthand for this scenario is health check blindness.

Why does it happen? (the six root causes I see most often)

The check endpoint is not “real work”: /healthz returns 200 even when the DB, queue, or downstream is broken.
Partial degradation: only certain paths, tenants, or regions are affected.
State / NAT / conntrack pressure: connections through the VIP die while fresh TCP probes look “fine.”
Broken return path (DSR / asymmetric routing): the probe leaves through the LB but the response takes a different path and disappears.
Wrong timeout thresholds: the probe tolerates a 2s response while the user-facing SLO is 200ms, so “UP” does not mean “healthy.”
It is not the LB, it is backend behavior: thread pool full, accept queue saturated, GC pressure. A lightweight probe rarely catches any of these.

Design principle: pair active checks with passive signals

In my L4 VIP designs, I split health checking into two layers:

Active check: cheap and fast; “is the service up?” (even an L7 check should stay minimal)
Passive signal: derived from real traffic; “is the service well?” (errors, latency, connection success)

This approach reduces blindness because checks are not limited to synthetic probe traffic.

Examples of passive signals

5xx ratio / application-level error codes
TCP handshake failures (SYN timeouts, RST surges)
p95 / p99 latency (measured behind the VIP)
Backend connection pool saturation (especially when there is a proxy / L7 layer)
“Outlier detection” (specific nodes consistently performing worse)
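
A minimal sketch of how the first few signals above could be derived from real traffic, assuming each request is recorded with a timestamp, a latency, and an error flag. The names `Sample` and `PassiveWindow` and the 60-second window are illustrative assumptions, not part of any particular LB or proxy.

```python
from collections import deque
from dataclasses import dataclass
import math


@dataclass
class Sample:
    ts: float          # arrival time in seconds
    latency_ms: float  # latency observed on a real request
    is_error: bool     # True for a 5xx or a failed connection


class PassiveWindow:
    """Sliding window over real traffic for one backend (hypothetical helper)."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.samples = deque()

    def record(self, sample: Sample) -> None:
        """Add a sample and drop everything older than the window."""
        self.samples.append(sample)
        while self.samples and sample.ts - self.samples[0].ts > self.window_s:
            self.samples.popleft()

    def error_ratio(self) -> float:
        """Share of failed requests in the window; 0.0 when the window is empty."""
        if not self.samples:
            return 0.0
        return sum(s.is_error for s in self.samples) / len(self.samples)

    def percentile(self, q: float) -> float:
        """Nearest-rank latency percentile; 0.0 when the window is empty."""
        if not self.samples:
            return 0.0
        ordered = sorted(s.latency_ms for s in self.samples)
        rank = max(1, math.ceil(q * len(ordered) / 100))
        return ordered[rank - 1]
```

Sorting on every call is acceptable for a sketch; a production path would more likely use histograms or a streaming estimator.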

Avoiding blackholes: three safe ways to pull traffic away

1) Pool member disable (the classic)

The pool member disable mechanism on the LB is the best-known method. The catch: that disable decision usually rests on a single signal.

A better approach: disable the member when either of two conditions holds:

Active check fails, or
A passive signal crosses a threshold (for example, p95 tripling within a minute, or 5xx exceeding some percentage)
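
The gating rule above can be sketched as a small decision function. The factor of 3 follows the "p95 tripling" example; the 5% error ceiling and the names `MemberSignals` and `should_disable` are assumptions made for illustration and would need tuning per service.

```python
from dataclasses import dataclass


@dataclass
class MemberSignals:
    active_check_ok: bool   # result of the latest synthetic probe
    p95_ms: float           # current p95 measured on real traffic
    baseline_p95_ms: float  # p95 from a known-good period
    error_ratio: float      # share of 5xx in the window, 0.0 to 1.0


def should_disable(sig: MemberSignals,
                   p95_factor: float = 3.0,
                   max_error_ratio: float = 0.05) -> tuple[bool, str]:
    """Return (disable?, reason). Either condition alone is enough."""
    if not sig.active_check_ok:
        return True, "active check failed"
    # A zero baseline means we have no reference yet, so skip the latency test.
    if sig.baseline_p95_ms > 0 and sig.p95_ms >= p95_factor * sig.baseline_p95_ms:
        return True, "p95 latency above threshold"
    if sig.error_ratio > max_error_ratio:
        return True, "error ratio above threshold"
    return False, "healthy"
```

Returning the reason alongside the decision makes it easier to see afterwards which signal actually pulled the member.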

2) Route withdrawal (very clean in BGP / ECMP environments)

If you announce the VIP or service prefix via BGP, withdrawing the route is an extremely effective way to drain traffic. This pattern is a lifesaver, especially for anycast VIPs.

A key rule: do not trigger withdrawal from an ad hoc “script”; drive it from a measurable signal (e.g. local proxy error rate combined with connection success).
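
One way to turn those two signals into a withdraw/announce decision is sketched below. The thresholds, the class name `RouteGate`, and the hysteresis (several consecutive bad or good intervals before acting, added here to avoid route flapping) are all assumptions. The sketch only returns a decision; how that decision reaches the BGP daemon is left out on purpose.

```python
class RouteGate:
    """Decides when to withdraw or re-announce a VIP route (hypothetical helper)."""

    def __init__(self, max_error_rate: float = 0.2,
                 min_connect_success: float = 0.9,
                 bad_needed: int = 3, good_needed: int = 5):
        self.max_error_rate = max_error_rate
        self.min_connect_success = min_connect_success
        self.bad_needed = bad_needed    # consecutive bad intervals before withdraw
        self.good_needed = good_needed  # consecutive good intervals before announce
        self.announced = True
        self._bad = 0
        self._good = 0

    def observe(self, error_rate: float, connect_success: float) -> str:
        """Feed one measurement interval; returns 'withdraw', 'announce', or 'hold'."""
        unhealthy = (error_rate > self.max_error_rate
                     or connect_success < self.min_connect_success)
        if unhealthy:
            self._bad, self._good = self._bad + 1, 0
        else:
            self._good, self._bad = self._good + 1, 0

        if self.announced and self._bad >= self.bad_needed:
            self.announced = False
            return "withdraw"
        if not self.announced and self._good >= self.good_needed:
            self.announced = True
            return "announce"
        return "hold"
```

Requiring more good intervals than bad ones biases the gate toward staying withdrawn, which is usually the safer side for an anycast VIP.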

3) L7 outlier ejection (when you have a proxy layer)

If you run an L7 layer such as Envoy, HAProxy, or Nginx, outlier ejection lets you eject “bad nodes” automatically. It produces richer signals than an L4 LB on its own.
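
To make the idea concrete, here is a simplified sketch of consecutive-error ejection with a cap on how much of the pool may be ejected. It is not the algorithm of any specific proxy; `OutlierEjector`, its defaults, and the lack of timed re-admission are simplifying assumptions.

```python
class OutlierEjector:
    """Ejects nodes after N consecutive failures, within a pool-wide cap."""

    def __init__(self, nodes, consecutive_errors: int = 5,
                 max_ejected_fraction: float = 0.5):
        self.streak = {node: 0 for node in nodes}  # current failure streak per node
        self.ejected = set()
        self.consecutive_errors = consecutive_errors
        self.max_ejected_fraction = max_ejected_fraction

    def record(self, node: str, ok: bool) -> bool:
        """Record one response; returns True if this call ejected the node."""
        if ok:
            self.streak[node] = 0
            return False
        self.streak[node] += 1
        if node in self.ejected or self.streak[node] < self.consecutive_errors:
            return False
        # Never eject so many nodes that the pool itself becomes the blackhole.
        if (len(self.ejected) + 1) / len(self.streak) > self.max_ejected_fraction:
            return False
        self.ejected.add(node)
        return True

    def readmit(self, node: str) -> None:
        """Put a node back in rotation, for example after a cool-down period."""
        self.ejected.discard(node)
        self.streak[node] = 0
```

The cap matters as much as the ejection itself: when every node is failing, the cause is probably shared, and ejecting them all only removes the remaining capacity.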

What does a healthy health check endpoint look like?

My practical guideline for endpoint design:

Readiness: 200 when critical dependencies are fine; 503 when they aren’t (so the LB can pull traffic)
Liveness: is the process alive? (a separate concern, mostly for orchestrators like Kubernetes)

Readiness should exercise at least one real dependency:

A DB connection (a simple SELECT)
Queue publish/consume (with a short timeout)
Cache access (when applicable)

But run these checks without slowing the probe: short timeouts, fail fast, and instrument them.
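
A minimal sketch of a readiness evaluation that follows these rules: every dependency check runs in parallel under one shared deadline, and any failure or timeout yields a 503. The function name `readiness`, the 200ms budget, and the shape of the `checks` mapping are assumptions; the check callables stand in for a real `SELECT 1`, queue round trip, or cache read.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time


def readiness(checks, timeout_s: float = 0.2):
    """Run dependency checks in parallel; return (http_status, per-check detail).

    `checks` maps a name to a zero-argument callable that raises on failure.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    deadline = time.monotonic() + timeout_s
    detail = {}
    for name, future in futures.items():
        # All checks share one deadline, so the probe never waits longer than timeout_s.
        remaining = max(0.0, deadline - time.monotonic())
        try:
            future.result(timeout=remaining)
            detail[name] = "ok"
        except FutureTimeout:
            detail[name] = "timeout"
        except Exception as exc:
            detail[name] = f"error: {exc}"
    # Do not block the probe on stragglers; a stuck check keeps running in its thread.
    pool.shutdown(wait=False, cancel_futures=True)
    status = 200 if all(v == "ok" for v in detail.values()) else 503
    return status, detail
```

The per-check detail is worth exposing in the response body or in metrics, since it is the instrumentation that tells you which dependency turned the probe red.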

Operations: a quick triage flow during an incident

When the health check is green but traffic has dropped, here is the order I follow:

Are TCP handshakes from the VIP to the backend succeeding? (SYN/SYN-ACK ratio, RST surges)
Is the backend’s accept queue / thread pool saturated?
If asymmetric routing or DSR is in play, check the return path (policy routes, firewall state)
On the LB device or service, look at conntrack capacity and “drops” counters
Passive signals: are specific backend nodes misbehaving, or all of them?

This sequence turns the “is it the LB or the app?” debate into evidence within 5 to 10 minutes.
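
For the accept queue step on a Linux backend, a small helper like the one below can turn two snapshots of `/proc/net/netstat` into evidence. This is a sketch that assumes the usual layout of that file (a header line followed by a value line for each section) and uses the `ListenOverflows` and `ListenDrops` counters from the `TcpExt` section; the function names are hypothetical.

```python
def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat style text into {section: {counter: value}}."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    sections = {}
    # Lines come in pairs: "TcpExt: NameA NameB ..." then "TcpExt: 1 2 ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        name = header[0].rstrip(":")
        sections[name] = dict(zip(header[1:], (int(v) for v in values[1:])))
    return sections


def listen_pressure(before: dict, after: dict) -> dict:
    """Growth of accept-queue counters between two parsed snapshots."""
    keys = ("ListenOverflows", "ListenDrops")
    return {
        k: after.get("TcpExt", {}).get(k, 0) - before.get("TcpExt", {}).get(k, 0)
        for k in keys
    }
```

Taking two snapshots a few seconds apart and finding a non-zero delta is a strong hint that the accept queue is overflowing even though a fresh probe connection may still succeed.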

Conclusion: failover is a product behavior, not a checkbox

Good failover does not just mean “traffic moves when a node goes down.” The real value is rescuing traffic when a node looks “up” but is actually broken. Once you complement active checks with passive signals, health check blindness shrinks; incidents close faster, and you trade blame for evidence.
