Mustafa Erbay
Technology · 9 min read

Retry Storms: Timeout Budget and Latency Amplification

In distributed systems, badly designed retries make outages worse. An approach to limiting damage with timeout budgets, retry budgets, and backpressure.

In production, “a small bit of latency” rarely causes an outage by itself. What turns it into an outage is usually retry behavior. Retry sounds like “resilience” on paper, but applied in the wrong place with the wrong budget, it adds load to a system that is already struggling.

This piece clarifies three concepts:

  • Timeout budget: the total time budget for a request
  • Retry budget: how many extra attempts the system is allowed to spend on retries
  • Latency amplification: how a small delay grows across the entire system

1) Why does retry make the outage worse?

Simple example:

  • Service B’s normal p95 is 80ms
  • Something goes wrong and p95 climbs to 400ms
  • Service A is configured with a 300ms timeout + 2 retries

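In this case, a significant share of A's calls exceed the 300ms timeout, and each of them can be attempted up to three times (1 original + 2 retries). A sends more traffic to B, and B, already slow, gets even slower. This is a vicious cycle: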

  1. Latency rises
  2. Timeouts/retries fire
  3. Traffic rises
  4. Latency rises further
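
The size of the extra load can be estimated with a little arithmetic. The sketch below is a simplified model, not a measurement: it assumes every attempt times out independently with the same probability and that every timeout is retried. The function name `expected_attempts` is made up for this illustration.

```python
def expected_attempts(p_timeout: float, max_retries: int) -> float:
    """Expected attempts per request under the simplified model:
    each attempt times out independently with probability p_timeout,
    and every timeout is retried until max_retries is used up."""
    return sum(p_timeout ** k for k in range(max_retries + 1))


# Service A from the example: 2 retries on top of the original attempt.
for p in (0.05, 0.5, 1.0):
    print(f"timeout rate {p:.0%}: {expected_attempts(p, 2):.2f}x load on B")
```

While only a few calls time out, the extra load is barely visible. Once most calls exceed the timeout, the factor approaches 3x, and it does so exactly when B can least afford it.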

2) Timeout budget: design the chain end-to-end

Treating the timeout as “a single number” is a mistake. In a distributed request, the total budget is split across layers, each one slightly smaller than the layer above it:

  • Client total budget (e.g. 800ms)
  • Gateway/edge budget (e.g. 700ms)
  • App budget (e.g. 600ms)
  • Downstream calls (e.g. 2x 250ms)

Practical rules:

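  • The upper layer’s timeout must be larger than the lower layer’s timeout; otherwise the caller gives up while the lower layer is still working.
  • Every layer needs “deadline propagation”: carry the remaining time downstream, so lower layers stop working on requests the caller has already abandoned.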
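
Deadline propagation can be sketched as a small object that travels with the request. This is a minimal illustration, not a specific library's API: the `Deadline` class, its method names, and the 250ms per-call cap are assumptions. In a real service the remaining time would also be sent to the downstream service, for example in a request header or RPC metadata of your choosing.

```python
import time


class Deadline:
    """Absolute deadline shared by every hop of one request (hypothetical helper)."""

    def __init__(self, budget_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, per_call_cap_s: float) -> float:
        """Timeout for the next downstream call: never more than what is left."""
        left = self.remaining()
        if left <= 0.0:
            raise TimeoutError("deadline exceeded before the call started")
        return min(per_call_cap_s, left)


# App layer from the example: 600ms total, downstream calls capped at 250ms.
deadline = Deadline(0.600)
first_call_timeout = deadline.timeout_for(0.250)
```

The point of `timeout_for` is that a late downstream call gets only the time that is left, and a call that cannot finish in time is never started.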

3) Retry budget: not “how many” but “in which case”?

In production, retrying is only safe when all of these conditions hold:

  • The request is idempotent (like GET) or protected by an idempotency key
  • The error type is transient (e.g. connection reset)
  • Backoff + jitter is in place
  • The system is not saturated (the retry budget tightens dynamically)
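
One way to make the last condition concrete is a retry budget that is earned by normal traffic. The sketch below is an assumption about how such a budget could look, not the only design: `RetryBudget`, the 1-retry-per-10-requests ratio, and the credit cap are illustrative. It is not thread-safe as written.

```python
class RetryBudget:
    """Allows roughly one retry per `requests_per_retry` first attempts.

    Every first attempt earns one credit; every retry spends
    `requests_per_retry` credits. When failures spike, credits run out
    and further retries are refused instead of piling onto the backend.
    """

    def __init__(self, requests_per_retry: int = 10, max_credits: int = 100):
        self.cost = requests_per_retry
        self.max_credits = max_credits  # caps how many retries can be saved up
        self.credits = 0

    def record_request(self) -> None:
        """Call once for every first (non-retry) attempt."""
        self.credits = min(self.max_credits, self.credits + 1)

    def try_acquire_retry(self) -> bool:
        """Return True and spend credits if a retry is allowed right now."""
        if self.credits >= self.cost:
            self.credits -= self.cost
            return True
        return False
```

With these numbers, retries can add at most about 10% on top of regular traffic in steady state, no matter how many individual calls fail.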

Just saying “2 retries” is not enough. What matters is:

  • Retry on which error codes?
  • Retry on which endpoints?
  • Retry for which client segment?
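
These three questions can be answered in one explicit policy function instead of a bare retry count. Everything specific below is a hypothetical choice made for illustration: the status codes, the `/payments` prefix, and the `batch` tier. In this sketch 429 and 503 are deliberately not retried, on the assumption that they signal load shedding.

```python
from dataclasses import dataclass
from typing import Optional

# Assumptions for this sketch, not universal rules.
RETRYABLE_STATUS = frozenset({502, 504})
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS"})
NO_RETRY_PATH_PREFIXES = ("/payments",)   # hypothetical sensitive endpoints
NO_RETRY_TIERS = frozenset({"batch"})     # hypothetical client segment


@dataclass(frozen=True)
class Attempt:
    method: str
    path: str
    status: Optional[int]  # None means a connection-level failure (e.g. reset)
    has_idempotency_key: bool = False
    client_tier: str = "interactive"


def should_retry(a: Attempt) -> bool:
    """Decide per error code, endpoint, and client segment."""
    if a.client_tier in NO_RETRY_TIERS:
        return False
    if a.path.startswith(NO_RETRY_PATH_PREFIXES):
        return False
    if not (a.method in IDEMPOTENT_METHODS or a.has_idempotency_key):
        return False
    return a.status is None or a.status in RETRYABLE_STATUS
```

For example, `should_retry(Attempt("POST", "/orders", 504))` is refused, while the same call with `has_idempotency_key=True` is allowed.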

4) Guardrails that limit latency amplification

The guardrail set that helps the most in the field:

  • Backoff + jitter: spreads retries out
  • Concurrency limit: caps how much work is in flight
  • Queue + drop policy: prevents unbounded queue growth
  • Circuit breaker: gives the system breathing room via fast-fail
  • Load shedding: rejects low-priority work early (429/503)

Without these guardrails, retry collapses into “everyone retries at the same time.”
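
As an illustration of the first guardrail, here is a minimal sketch of exponential backoff with "full jitter", where each delay is drawn uniformly between zero and an exponentially growing ceiling. The base of 100ms and the cap of 2s are assumed values, and `backoff_delay` is a name chosen for this example.

```python
import random


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 2.0,
                  rng=random.random) -> float:
    """Delay before retry number `attempt` (0-based), in seconds.

    Full jitter: uniform in [0, min(cap_s, base_s * 2**attempt)).
    `rng` must return a float in [0, 1); it is injectable for testing.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng() * ceiling


# Clients that failed at the same moment now pick different delays.
delays = [backoff_delay(attempt) for attempt in range(5)]
```

The randomness is what breaks the synchronization: pure exponential backoff without jitter still makes all clients come back at the same instants.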

5) Practical response during an incident

If you suspect a retry storm:

  1. Reduce or disable retry (especially for non-idempotent operations)
  2. Before “shortening” timeouts, check deadline propagation first
  3. Lower the concurrency limit and bring the queue under control
  4. Watch the 429/503 ratio: failing early can reduce total damage
  5. Verify exponential backoff + jitter on the client side
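
Step 3 assumes the concurrency limit can be changed while the service is running. The sketch below shows one possible shape for that, under assumed names (`ConcurrencyLimiter`, `handle`): requests over the limit are rejected immediately with a 503 instead of being queued.

```python
import threading


class ConcurrencyLimiter:
    """Caps in-flight work; the limit can be lowered during an incident."""

    def __init__(self, limit: int):
        self._lock = threading.Lock()
        self._limit = limit
        self._in_flight = 0

    def set_limit(self, limit: int) -> None:
        with self._lock:
            self._limit = limit

    def try_acquire(self) -> bool:
        """Take a slot if one is free; never blocks."""
        with self._lock:
            if self._in_flight >= self._limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)


def handle(limiter: ConcurrencyLimiter, work):
    """Run `work` if a slot is free, otherwise fail fast with 503."""
    if not limiter.try_acquire():
        return 503, None
    try:
        return 200, work()
    finally:
        limiter.release()
```

Lowering the limit does not interrupt work that is already running; it only stops new work from starting until the in-flight count drops below the new value.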

The goal here is not to “force more requests through”. It is to protect the overall health of the system and bring it back to a stable state.

When retry is set up correctly, it produces resilience. When it is set up badly, it grows the outage. Success in production comes from designing the timeout budget, retry budget, and backpressure together.

Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security, and Software

Since 2006 I have worked on systems architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that has held up in the field.