Mustafa Erbay
Technology · 12 min read

Load Balancer, Keepalive, and Retry Budgets for gRPC/HTTP2 Traffic

A practical architecture and operations guide for handling long-lived HTTP/2 connections, idle timeouts, and retry storms without losing your SLO.

gRPC looks like “RPC over HTTP,” but in production its behavior is shaped mostly by long-lived HTTP/2 connections. That detail matters: if your load balancer choice, idle timeouts, keepalive settings, or retry policies are off, you can lose your SLO “without changing any code.”

In this article I describe the gRPC/HTTP2 production problems I see most often and how I tie them to an operationally safe design: connection draining, keepalive, outlier detection, and, most importantly, a retry budget.

Why is the “connection” critical in gRPC/HTTP2?

In the HTTP/1.1 world, each request is relatively independent. In gRPC, the client typically opens a small number of HTTP/2 connections to a target and carries many streams over those connections.

That means:

  • A single connection reset can affect dozens of streams at once, not a single request.
  • The load balancer’s “rebalancing” behavior can distort traffic distribution in ways you don’t expect.
  • Values like idle timeout / NAT timeout / firewall state timeout can “silently” kill the connection and start a retry storm on the client side.
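
To make the rebalancing point concrete, here is a minimal sketch in Python. It assumes a balancer that picks a backend only when a connection is opened (L4-style behavior); the backend names and request counts are made up for illustration.

```python
from collections import Counter
from itertools import cycle

def assign_connections(num_connections, backends):
    """Pin each new connection to a backend round-robin, at connect time only."""
    picker = cycle(backends)
    return [next(picker) for _ in range(num_connections)]

def requests_per_backend(connection_owners, requests_per_connection, backends):
    """Count requests per backend when every stream rides its pinned connection."""
    counts = Counter({b: 0 for b in backends})
    for owner in connection_owners:
        counts[owner] += requests_per_connection
    return dict(counts)

# Six long-lived connections are opened while only three backends exist.
backends = ["b1", "b2", "b3"]
owners = assign_connections(6, backends)

# A fourth backend is added later, but no existing connection moves to it.
backends_after_scale_out = backends + ["b4"]
load = requests_per_backend(owners, 1000, backends_after_scale_out)
print(load)
```

In this toy model the new backend stays idle until some client reconnects, which is the kind of skew that per-connection balancing can produce with long-lived HTTP/2 connections.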

Common symptoms and root causes

Production gRPC issues usually arrive with these signals:

  • A rise in UNAVAILABLE and DEADLINE_EXCEEDED statuses, plus RST_STREAM frames, on the client
  • Logs at the L7 proxy like upstream_reset_before_response_started / stream reset
  • Sudden p95/p99 latency jumps even though CPU/DB look normal
  • An error explosion in a particular AZ/POP (network/NAT/idle timeout)

The root causes are typically:

  1. Idle timeout mismatch (LB/ingress/NAT/firewall on different timers)
  2. Wrong keepalive (too aggressive ping = noise / too passive = silent disconnect)
  3. Deploy without draining (pod/VM dies, connection reset)
  4. Uncontrolled retries (thundering herd + amplification)
  5. An LB algorithm that doesn’t fit gRPC (for example, sticky or connection-count balancing where stream-aware balancing is needed)

If your HTTP/2 connection passes through any stateful device (LB/NAT/firewall), the shortest idle timeout in the chain effectively determines the connection’s lifetime.

Sample chain:

Layer                    Typical risk
Client → NAT             Silent drop if NAT state idle is short
Edge LB                  Reset if HTTP/2 idle timeout is short
Service mesh / sidecar   RST when there’s no drain
Backend                  Pod restart = stream reset
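
The “shortest timer wins” rule can be sketched in a few lines of Python. The timer values, the 0.5 safety factor, and the 10-second floor below are all assumptions for illustration, not recommendations from any vendor.

```python
def effective_idle_lifetime(timeouts_s):
    """Return (layer, seconds) for the hop with the shortest idle timeout."""
    layer = min(timeouts_s, key=timeouts_s.get)
    return layer, timeouts_s[layer]

def suggested_keepalive_interval(shortest_idle_s, safety_factor=0.5, floor_s=10):
    """Ping well inside the shortest idle window, but never below a floor."""
    interval = max(floor_s, int(shortest_idle_s * safety_factor))
    if interval >= shortest_idle_s:
        # Keepalive cannot help here; the timer itself has to be raised.
        raise ValueError("idle timeout is too short to cover with keepalive")
    return interval

# Hypothetical timers for each stateful hop, in seconds.
chain = {
    "nat": 350,
    "edge_lb": 60,
    "sidecar": 3600,
    "firewall": 900,
}

layer, lifetime = effective_idle_lifetime(chain)
interval = suggested_keepalive_interval(lifetime)
print(f"{layer} caps idle lifetime at {lifetime}s; ping every {interval}s")
```

Keeping a table like `chain` in version control is one way to make the timeout chain visible before an incident forces the question.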

Keepalive: Solving silent disconnects without turning them into “noise”

Keepalive solves two problems:

  1. Stops stateful devices like NAT/firewall from aging out the idle state
  2. Speeds up the “is the other side alive?” signal

But aggressive keepalive creates a different problem: thousands of clients send pings, burning unnecessary packets and CPU on LB/ingress.

A practical approach:

  • Don’t enable keepalive “everywhere”; enable it on risky segments.
  • The ping interval should be shorter than the shortest idle timeout in the path, but not so short that servers or proxies treat the pings as abuse.
  • Pings can mask a “bad network,” so also track connection drops as a metric.
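
As a rough illustration of the points above, the sketch below builds client-side keepalive settings from the shortest idle timeout. The three option names are gRPC core channel arguments; the helper function, the halving rule, the 10-second minimum, and the target address in the comment are assumptions of this sketch.

```python
def keepalive_channel_options(shortest_idle_s, ping_timeout_s=10):
    """Build gRPC channel arguments that ping inside the shortest idle window."""
    ping_interval_s = shortest_idle_s // 2
    if ping_interval_s < 10:
        raise ValueError("idle timeout too short for keepalive; fix the timer instead")
    return [
        # Interval between keepalive pings on the transport.
        ("grpc.keepalive_time_ms", ping_interval_s * 1000),
        # Consider the connection dead if the ping is not acknowledged in time.
        ("grpc.keepalive_timeout_ms", ping_timeout_s * 1000),
        # Keep pinging even when no RPC is in flight, so idle state is refreshed.
        ("grpc.keepalive_permit_without_calls", 1),
    ]

options = keepalive_channel_options(shortest_idle_s=60)
for name, value in options:
    print(name, value)

# With grpcio installed, the list would be passed as:
#   grpc.insecure_channel("backend.example:50051", options=options)
```

The server side has to permit the chosen interval as well: gRPC servers enforce a minimum ping interval and can answer pings that arrive too often with a GOAWAY (too_many_pings), so client and server settings should be changed together.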

Retry: Resilience or amplification?

Done right, retry hides transient failures from the user. Done wrong, it kills the system at the most critical moment.

The most typical mistakes:

  • Retrying on every error (especially DEADLINE_EXCEEDED and connection resets)
  • No backoff, no jitter
  • No cap on concurrent retries
  • A delayed or ambiguous failure signal, so the client, LB, and backend each attribute the error to another layer
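
A minimal sketch of the opposite of these mistakes: a narrow set of retryable codes, capped attempts, and exponential backoff with full jitter. Treating only UNAVAILABLE as retryable is an assumption of this sketch; the function names and the fake backend are hypothetical.

```python
import random

# Assumption for this sketch: only UNAVAILABLE is treated as safely retryable.
RETRYABLE_CODES = {"UNAVAILABLE"}

def backoff_delay(attempt, base_s=0.1, cap_s=5.0, rng=random):
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

def call_with_retry(call, max_attempts=3, sleep=lambda s: None, rng=random):
    """Run call() until it succeeds, fails non-retryably, or attempts run out.

    call() returns (status_code, value). sleep is injectable so the sketch
    runs instantly here; pass time.sleep in real use.
    """
    for attempt in range(max_attempts):
        code, value = call()
        if code == "OK":
            return code, value, attempt + 1
        if code not in RETRYABLE_CODES or attempt == max_attempts - 1:
            return code, None, attempt + 1
        sleep(backoff_delay(attempt, rng=rng))

# Hypothetical backend that fails twice, then succeeds.
responses = iter([("UNAVAILABLE", None), ("UNAVAILABLE", None), ("OK", "pong")])
result = call_with_retry(lambda: next(responses))
print(result)
```

A DEADLINE_EXCEEDED result returns after one attempt here, because retrying a call whose deadline has already passed only adds load.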

Retry budget (a practical limit, tuned by experience)

Retry budget answers the question: “what percentage of total traffic can I devote to retries?”

The framework that has worked for me in the field:

  • One retry budget per service (e.g. 5%)
  • If that budget is exceeded: stop retrying, shorten timeouts, and degrade gracefully
  • Retry budget metrics: requests_total, retries_total, retries_ratio
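
One way to express such a budget in code is a counter pair with a ratio check. This is a simplified sketch: it uses cumulative counters, whereas a production version would more likely use a sliding time window. The class name and the `min_requests` floor are assumptions; the counter names mirror the metrics listed above.

```python
class RetryBudget:
    """Allow retries only while retries_total / requests_total stays under a ratio."""

    def __init__(self, max_ratio=0.05, min_requests=100):
        self.max_ratio = max_ratio
        self.min_requests = min_requests  # avoids starving retries on tiny samples
        self.requests_total = 0
        self.retries_total = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests_total += 1

    def try_acquire_retry(self):
        """Return True and count the retry if the budget still has room."""
        allowed = self.max_ratio * max(self.requests_total, self.min_requests)
        if self.retries_total + 1 > allowed:
            return False
        self.retries_total += 1
        return True

    @property
    def retries_ratio(self):
        if self.requests_total == 0:
            return 0.0
        return self.retries_total / self.requests_total

budget = RetryBudget(max_ratio=0.05)
for _ in range(200):
    budget.record_request()

# Every one of those requests fails and asks for a retry at the same time.
granted = sum(budget.try_acquire_retry() for _ in range(200))
print(granted, budget.retries_ratio)
```

With a 5% budget, only 10 of the 200 simultaneous retry attempts are let through; the rest fail fast instead of amplifying the outage.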

Connection draining: Managing “connection death” during deploys

Even a “rolling” deploy can hit hard in gRPC, because:

  • A pod goes down → TCP FIN/RST → streams cut
  • If the client opens its new connection late, errors burst out

Two axes for draining:

  1. Server-side: graceful shutdown + drain duration
  2. LB/proxy-side: connection draining + health state transition

The operational goal: “stop accepting new streams before shutdown, finish the existing streams, then exit.”
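
That goal can be modeled with a small gate around stream admission. The sketch below uses only the standard library and a hypothetical `DrainGate` class; it illustrates the ordering, not how any particular gRPC server implements shutdown.

```python
import threading

class DrainGate:
    """Track in-flight streams; refuse new ones once draining has begun."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._active = 0
        self._draining = False

    def try_begin_stream(self):
        with self._lock:
            if self._draining:
                return False  # caller should reject so the client moves on
            self._active += 1
            return True

    def end_stream(self):
        with self._idle:
            self._active -= 1
            if self._active == 0:
                self._idle.notify_all()

    def drain(self, timeout_s):
        """Stop admitting streams, then wait for in-flight ones. True if clean."""
        with self._idle:
            self._draining = True
            return self._idle.wait_for(lambda: self._active == 0, timeout=timeout_s)

gate = DrainGate()
admitted = gate.try_begin_stream()               # one stream is in flight
worker = threading.Timer(0.05, gate.end_stream)  # it finishes shortly after
worker.start()

clean = gate.drain(timeout_s=2.0)                # shutdown signal arrives
worker.join()
print(admitted, clean, gate.try_begin_stream())
```

In grpcio, `server.stop(grace)` plays a similar role: it stops accepting new RPCs and gives in-flight ones up to `grace` seconds to finish. The drain duration still has to fit inside the orchestrator's termination grace period.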

Observability: Minimum metric set for gRPC

I suggest managing gRPC traffic with at least these metrics:

  • grpc_server_handled_total{code=...} (code distribution)
  • Latency: p50/p95/p99 (per method)
  • active_connections and active_streams
  • Retry ratio (client side, when possible)
  • Reset/GOAWAY counts (proxy/LB side)

In the logs:

  • The reason for an upstream reset (timeout, drain, or overload?)
  • The deadline (is the client deadline too short?)
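
For the first two metrics in the list, a minimal sketch of the calculations behind them, using only the standard library. The sample data is synthetic, and in practice these numbers normally come from your metrics system rather than hand-rolled code.

```python
from collections import Counter
from statistics import quantiles

def code_distribution(codes):
    """Share of each status code among all handled calls."""
    total = len(codes)
    return {code: count / total for code, count in Counter(codes).items()}

def latency_percentiles(latencies_ms):
    """p50/p95/p99 from raw latency samples (at least two are needed)."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Synthetic samples for one method.
codes = ["OK"] * 90 + ["UNAVAILABLE"] * 8 + ["DEADLINE_EXCEEDED"] * 2
latencies = list(range(1, 102))  # 1 ms .. 101 ms

dist = code_distribution(codes)
pcts = latency_percentiles(latencies)
print(dist)
print(pcts)
```

Computing these per method, rather than per service, keeps one slow or failing method from hiding behind healthy ones.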

Incident runbook: What do I do when “UNAVAILABLE explodes”?

  1. Is it regional (a single AZ/POP)?
  2. Timeout chain: did the LB/proxy idle timeout change? Is the NAT/firewall policy current?
  3. Deploy effect: was there a rollout in the last 30 min? Is drain working?
  4. Retry amplification: did retries_ratio rise?
  5. Isolate: canary client → single backend → single path
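
Step 1 of the runbook can be answered quickly if errors are broken down by zone. The sketch below assumes events are available as `(zone, status_code)` pairs; the zone names, the sample counts, and the 3x-median threshold are made up for illustration.

```python
from collections import Counter
from statistics import median

def unavailable_rate_by_zone(events):
    """events: iterable of (zone, status_code). Returns UNAVAILABLE rate per zone."""
    totals, errors = Counter(), Counter()
    for zone, code in events:
        totals[zone] += 1
        if code == "UNAVAILABLE":
            errors[zone] += 1
    return {zone: errors[zone] / totals[zone] for zone in totals}

def looks_regional(rates, factor=3.0):
    """True if the worst zone's rate is at least `factor` times the median zone."""
    worst = max(rates.values())
    return worst > 0 and worst >= factor * median(rates.values())

# Hypothetical sample: az-b is clearly worse than its neighbors.
events = (
    [("az-a", "OK")] * 95 + [("az-a", "UNAVAILABLE")] * 5
    + [("az-b", "OK")] * 60 + [("az-b", "UNAVAILABLE")] * 40
    + [("az-c", "OK")] * 97 + [("az-c", "UNAVAILABLE")] * 3
)

rates = unavailable_rate_by_zone(events)
worst_zone = max(rates, key=rates.get)
print(worst_zone, rates[worst_zone], looks_regional(rates))
```

If one zone stands out like this, the timeout chain and NAT/firewall policy for that zone are the first things to check.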

Quick “damage reduction” moves (in order of risk):

  • Cut retries (with a budget), require backoff+jitter
  • Align deadlines with the service SLO (very short deadline = constant retry)
  • Fix draining (graceful shutdown)
  • Align LB idle timeouts

Conclusion

Resilience in gRPC/HTTP2 doesn’t come from “more retries” or “bigger autoscale”; it comes from managing the connection lifecycle properly. Make the idle timeout chain visible, use keepalive in a controlled way, standardize draining, and always cap retries with a budget.

What makes the difference in production isn’t “gRPC is working”; it’s “gRPC incidents are manageable.”


Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have worked on systems architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience grounded in real-world practice.
