Load Balancer, Keepalive, and Retry Budgets for gRPC/HTTP2 Traffic

gRPC looks like “RPC over HTTP,” but in production its behavior is shaped largely by long-lived HTTP/2 connections. That one detail matters enough that a mismatched load balancer choice, idle timeout, keepalive setting, or retry policy can cost you your SLO “without changing any code.”

This article describes the gRPC/HTTP2 production problems I see most often and how I tie them to an operationally safe design: connection draining, keepalive, outlier detection, and, most importantly, a retry budget.

Why is the “connection” critical in gRPC/HTTP2?

In the HTTP/1.1 world, each request is relatively independent. In gRPC, the client typically opens a small number of HTTP/2 connections to a target and carries many streams over those connections.

That means:

A single connection reset can affect dozens of streams at once, not a single request.
The load balancer’s “rebalancing” behavior can distort traffic distribution in ways you don’t expect.
Timers such as the idle timeout, NAT timeout, and firewall state timeout can “silently” kill the connection and start a retry storm on the client side.

Common symptoms and root causes

Production gRPC issues usually arrive with these signals:

A rise in UNAVAILABLE and DEADLINE_EXCEEDED statuses, and in RST_STREAM frames, on the client
Logs at the L7 proxy like upstream_reset_before_response_started / stream reset
Sudden p95/p99 latency jumps even though CPU/DB look normal
An error explosion in a particular AZ/POP (network/NAT/idle timeout)

The root causes are typically:

Idle timeout mismatch (LB/ingress/NAT/firewall on different timers)
Wrong keepalive (too aggressive ping = noise / too passive = silent disconnect)
Deploy without draining (pod/VM dies, connection reset)
Uncontrolled retries (thundering herd + amplification)
An LB algorithm that doesn’t fit gRPC (sticky or connection-count balancing where stream-aware balancing is needed, etc.)

The idle timeout chain: the weakest link sets the rule

If your HTTP/2 connection passes through any stateful device (LB/NAT/firewall), the “lowest idle timeout” in the chain practically determines how long an idle connection survives.

Sample chain:

Layer	Typical risk
Client → NAT	Silent drop if NAT state idle is short
Edge LB	Reset if HTTP/2 idle timeout is short
Service mesh / sidecar	RST when there’s no drain
Backend	Pod restart = stream reset
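
To make the "weakest link" rule concrete, here is a minimal sketch that picks the shortest idle timeout out of a chain. The hop names and the timeout values are hypothetical placeholders, not measurements from any real environment; substitute the timers you actually find on your path.

```python
# Hypothetical idle timeouts (seconds) for each stateful hop on the path.
# These numbers are illustrative assumptions, not vendor defaults.
IDLE_TIMEOUTS_S = {
    "client_nat": 350,
    "edge_lb": 60,
    "sidecar": 3600,
    "backend": 300,
}


def weakest_link(timeouts):
    """Return (hop, seconds) for the hop with the shortest idle timeout."""
    hop = min(timeouts, key=timeouts.get)
    return hop, timeouts[hop]


hop, effective_idle_s = weakest_link(IDLE_TIMEOUTS_S)
print(f"effective idle lifetime: {effective_idle_s}s (set by {hop})")
```

Keeping a table like this in version control next to the infrastructure config is one way to make the chain visible when someone changes a single timer.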

Keepalive: Solving silent disconnects without turning them into “noise”

Keepalive solves two problems:

Stops stateful devices like NAT/firewall from aging out the idle state
Speeds up the “is the other side alive?” signal

But aggressive keepalive creates a different problem: thousands of clients send pings, burning unnecessary packets and CPU on LB/ingress.

A practical approach:

Don’t enable keepalive “everywhere”; enable it on risky segments.
The ping interval should be shorter than the shortest idle timeout in the chain, but not so short that the pings themselves become load.
Pings can mask “bad networks,” so put metrics on connection drops.
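
The interval rule above can be sketched as a small helper that derives gRPC keepalive channel options from the shortest idle timeout. The function name, the 0.5 safety factor, and the 10-second floor are my own illustrative assumptions; the three option keys are standard gRPC channel arguments.

```python
def keepalive_options(shortest_idle_s, safety_factor=0.5, floor_s=10,
                      ping_timeout_s=10):
    """Build gRPC channel options with a ping interval below the idle timeout.

    safety_factor and floor_s are assumptions for illustration: ping at half
    the shortest idle timeout, but never more often than every floor_s seconds.
    """
    interval_s = max(floor_s, int(shortest_idle_s * safety_factor))
    if interval_s >= shortest_idle_s:
        # A ping this slow would not keep the idle state alive.
        raise ValueError("idle timeout too short for a sane keepalive interval")
    return [
        ("grpc.keepalive_time_ms", interval_s * 1000),
        ("grpc.keepalive_timeout_ms", ping_timeout_s * 1000),
        # 1 = also ping when there are no active calls (idle connections).
        ("grpc.keepalive_permit_without_calls", 1),
    ]


options = keepalive_options(shortest_idle_s=60)
# With grpcio installed, these could be passed to a channel, for example:
#   grpc.insecure_channel("backend.example:50051", options=options)
```

The server side has to tolerate the chosen interval as well: a gRPC server that considers the pings too frequent can close the connection with a GOAWAY (too_many_pings), which turns keepalive into exactly the kind of noise it was meant to avoid.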

Retry: Resilience or amplification?

Done right, retry hides transient failures from the user. Done wrong, it kills the system at the most critical moment.

The most typical mistakes:

Retrying on every error (especially DEADLINE_EXCEEDED and connection resets)
No backoff, no jitter
No cap on concurrent retries
A delayed “failure” signal (client, LB, and backend blame each other)
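
A minimal sketch of the opposite behavior: retry only on a short allow-list of codes, cap the attempts, and spread the retries with exponential backoff plus full jitter. The allow-list, the base delay, and the cap are assumptions chosen for illustration, not recommendations for every service.

```python
import random

# Assumption: only UNAVAILABLE is treated as retryable in this sketch.
# DEADLINE_EXCEEDED is excluded because the caller's time budget is spent.
RETRYABLE_CODES = {"UNAVAILABLE"}


def backoff_delay(attempt, base_s=0.1, cap_s=5.0, rng=random.random):
    """Full-jitter backoff: uniform in [0, min(cap_s, base_s * 2**attempt))."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng() * ceiling


def should_retry(code, attempt, max_attempts=3):
    """Retry only retryable codes, and only while attempts remain."""
    return code in RETRYABLE_CODES and attempt < max_attempts
```

The jitter is what prevents thousands of clients that failed at the same moment from retrying at the same moment; backoff alone only delays the herd.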

Retry budget (a practical limit fixed by experience)

A retry budget answers the question: “what percentage of total traffic can I devote to retries?”

The frame that’s worked for me in the field:

One retry budget per service (e.g. 5%)
If that budget is exceeded: cut retries, reduce timeout, degrade
Retry budget metrics: requests_total, retries_total, retries_ratio
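
The budget described above can be sketched as a small counter object. The class and method names are hypothetical, and the counters are cumulative for simplicity; a production version would more likely use a sliding time window so that old traffic does not pay for new retries.

```python
class RetryBudget:
    """Allow a retry only while retries_total / requests_total stays within ratio."""

    def __init__(self, ratio=0.05):
        self.ratio = ratio          # e.g. 5% of traffic may be retries
        self.requests_total = 0
        self.retries_total = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests_total += 1

    @property
    def retries_ratio(self):
        if self.requests_total == 0:
            return 0.0
        return self.retries_total / self.requests_total

    def try_acquire(self):
        """Count and allow the retry if it fits the budget, else refuse it."""
        if self.retries_total + 1 <= self.ratio * self.requests_total:
            self.retries_total += 1
            return True
        return False


budget = RetryBudget(ratio=0.05)
for _ in range(100):
    budget.record_request()
granted = sum(budget.try_acquire() for _ in range(10))
print(granted, budget.retries_ratio)
```

When `try_acquire()` returns False, the caller fails fast instead of retrying, which is the "cut retries" step in the list above.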

Connection draining: Managing “connection death” during deploys

Even a “rolling” deploy can hit hard in gRPC, because:

A pod goes down → TCP FIN/RST → streams cut
If the client opens its new connection late, errors burst out

Two axes for draining:

Server-side: graceful shutdown + drain duration
LB/proxy-side: connection draining + health state transition

The operational goal: “stop accepting new streams before shutdown, finish the existing streams, then exit.”
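
That goal can be sketched as a tiny state machine. `DrainingServer` is a hypothetical stand-in for a real server, written only to show the ordering: refuse new streams first, then wait a bounded time for in-flight streams to finish.

```python
import threading


class DrainingServer:
    """Toy model of graceful shutdown: stop admitting, finish in-flight, exit."""

    def __init__(self):
        self._lock = threading.Lock()
        self._draining = False
        self._active = 0
        self._idle = threading.Event()
        self._idle.set()  # no streams yet, so the server starts idle

    def start_stream(self):
        """Admit a new stream unless draining has begun."""
        with self._lock:
            if self._draining:
                return False  # a real server would send GOAWAY here
            self._active += 1
            self._idle.clear()
            return True

    def finish_stream(self):
        """Mark one in-flight stream as complete."""
        with self._lock:
            self._active -= 1
            if self._active == 0:
                self._idle.set()

    def drain(self, timeout_s):
        """Stop admitting streams; return True if all finished within timeout_s."""
        with self._lock:
            self._draining = True
        return self._idle.wait(timeout_s)
```

A False return from `drain()` means the drain duration was too short for the longest streams, which is the case that shows up as resets during a rollout. In the Python gRPC library, `server.stop(grace)` plays a comparable role on the server side.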

Observability: Minimum metric set for gRPC

I suggest managing gRPC traffic with at least these metrics:

grpc_server_handled_total{code=...} (code distribution)
Latency: p50/p95/p99 (per method)
active_connections and active_streams
Retry ratio (client side, when possible)
Reset/GOAWAY counts (proxy/LB side)
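
As a rough illustration of the first two metrics, the sketch below derives a status-code distribution and p50/p95/p99 from raw samples. In practice the metrics backend would compute these from counters and histograms; the function names and sample data here are hypothetical.

```python
import statistics
from collections import Counter


def code_distribution(codes):
    """Return the share of each gRPC status code in a list of observed codes."""
    total = len(codes)
    return {code: count / total for code, count in Counter(codes).items()}


def latency_percentiles(samples_ms):
    """Return p50/p95/p99 from raw latency samples (needs at least 2 samples)."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    # quantiles() returns 99 cut points; index i holds the (i + 1)th percentile.
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


print(code_distribution(["OK", "OK", "OK", "UNAVAILABLE"]))
print(latency_percentiles(list(range(1, 102))))
```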

In the logs:

The reason for an upstream reset (timeout, drain, or overload?)
The deadline (is the client deadline too short?)

Incident runbook: What do I do when “UNAVAILABLE explodes”?

Regional? (single AZ/POP?)
Timeout chain: did the LB/proxy idle timeout change? Is the NAT/firewall policy current?
Deploy effect: was there a rollout in the last 30 min? Is drain working?
Retry amplification: did retries_ratio rise?
Isolate: canary client → single backend → single path

Quick “damage reduction” moves (in order of risk):

Cut retries (with a budget), require backoff+jitter
Align deadlines with the service SLO (very short deadline = constant retry)
Fix draining (graceful shutdown)
Align LB idle timeouts

Conclusion

Resilience in gRPC/HTTP2 doesn’t come from “more retries” or “bigger autoscale”; it comes from managing the connection lifecycle properly. Make the idle timeout chain visible, use keepalive in a controlled way, standardize draining, and always cap retries with a budget.

What makes the difference in production isn’t “gRPC is working”; it’s “gRPC incidents are manageable.”
