In production, a small increase in latency rarely causes an outage by itself. What usually turns it into one is retry behavior. Retries sound like resilience on paper, but applied in the wrong place with the wrong budget, they pile even more load onto a system that is already struggling.
This piece clarifies three concepts:
- Timeout budget: the total time budget for a request
- Retry budget: the cap on how many extra attempts retries may add on top of the original requests
- Latency amplification: how a small delay grows across the entire system
1) Why does retry make the outage worse?
Simple example:
- Service B’s normal p95 is 80ms
- Something goes wrong and p95 climbs to 400ms
- Service A is configured with a 300ms timeout + 2 retries
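In this case, every call to B that takes longer than 300ms times out, and each timeout can trigger up to 2 more requests. With p95 at 400ms, at least 5% of calls (and likely far more) cross that line, so in the worst case A sends B up to 3x its normal traffic. Because B is already slow, the extra load makes it even slower. This is a vicious cycle: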
- Latency rises
- Timeouts/retries fire
- Traffic rises
- Latency rises further
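To put a rough number on that cycle, here is a minimal sketch of the load multiplier A can impose on B. It assumes each attempt times out independently with the same probability and that A retries until it succeeds or runs out of retries; both are simplifications, and the function name is made up for illustration.

```python
def expected_attempts(p_timeout: float, max_retries: int) -> float:
    """Expected requests sent to B per original request from A.

    Assumption: every attempt times out independently with probability
    p_timeout. Real outages are correlated, so treat this as a rough guide.
    """
    # Attempt k (0-based) is only sent if all k earlier attempts timed out.
    return sum(p_timeout ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    # 2 retries, as in the example above, at several timeout rates.
    for p in (0.05, 0.5, 0.9, 1.0):
        print(f"timeout rate {p:.0%}: {expected_attempts(p, 2):.3f}x load on B")
```

The multiplier stays close to 1 while timeouts are rare and approaches 3 (1 original + 2 retries) as more calls time out, which is exactly when B can least afford the extra traffic.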
2) Timeout budget: design the chain end-to-end
Treating the timeout as “a single number” is a mistake. In distributed requests, the timeout budget gets carved up:
- Client total budget (e.g. 800ms)
- Gateway/edge budget (e.g. 700ms)
- App budget (e.g. 600ms)
- Downstream calls (e.g. 2x 250ms)
Practical rules:
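- The upper layer’s timeout must be larger than the lower layer’s timeout. Otherwise the upper layer gives up while the lower layer is still working on a request nobody is waiting for.
- Every layer needs “deadline propagation”: carry the remaining time downstream so that lower layers can stop once the caller’s budget is spent.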
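A minimal sketch of deadline propagation inside one process, assuming the app layer receives a total budget and derives each downstream timeout from what is left. The `Deadline` class and its parameter names are hypothetical; across a network hop, the remaining time would have to travel with the request (for example in a header).

```python
import time


class Deadline:
    """Absolute deadline shared by every step that serves one request."""

    def __init__(self, budget_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, per_call_cap_s: float, reserve_s: float = 0.0) -> float:
        """Timeout for one downstream call.

        Returns the per-call cap, shrunk to whatever is left after keeping
        reserve_s for local work. Raises TimeoutError if nothing is left,
        so the caller fails fast instead of starting doomed work.
        """
        left = self.remaining() - reserve_s
        if left <= 0:
            raise TimeoutError("deadline exceeded before downstream call")
        return min(per_call_cap_s, left)


if __name__ == "__main__":
    deadline = Deadline(budget_s=0.600)            # app budget from the example
    first = deadline.timeout_for_call(0.250)       # full 250ms is available
    time.sleep(0.4)                                # pretend earlier work was slow
    second = deadline.timeout_for_call(0.250)      # shrinks to the time remaining
    print(f"first={first:.3f}s second={second:.3f}s")
```

The design choice is to pass around an absolute deadline rather than a duration: each step computes its own timeout from the same clock, so slow early steps automatically shrink the time given to later ones.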
3) Retry budget: not “how many?” but “in which cases?”
In production, a retry is only safe when all of these conditions hold:
- The request is idempotent (like GET) or protected by an idempotency key
- The error type is transient (e.g. connection reset)
- Backoff + jitter is in place
- The system is not saturated (the retry budget tightens dynamically)
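One way to make the last condition concrete is a retry budget tied to the volume of original traffic. The sketch below is an assumption about how such a budget could look, not a specific library's API: every original request earns a little credit, every retry spends a full unit, and when failures are widespread the credit runs out and further retries are refused.

```python
class RetryBudget:
    """Allows retries only up to a percentage of original requests."""

    COST_PER_RETRY = 100  # credit is tracked in hundredths of a retry

    def __init__(self, retry_percent: int = 10, max_banked_retries: int = 10):
        self._deposit = retry_percent  # credit earned per original request
        # Cap the balance so a long quiet period cannot bank unlimited retries.
        self._cap = max_banked_retries * self.COST_PER_RETRY
        self._credit = 0

    def record_request(self) -> None:
        """Call once per original (non-retry) request."""
        self._credit = min(self._cap, self._credit + self._deposit)

    def try_acquire_retry(self) -> bool:
        """Spend credit for one retry; False means give up instead of retrying."""
        if self._credit >= self.COST_PER_RETRY:
            self._credit -= self.COST_PER_RETRY
            return True
        return False


if __name__ == "__main__":
    budget = RetryBudget(retry_percent=10)
    for _ in range(50):
        budget.record_request()
    granted = sum(budget.try_acquire_retry() for _ in range(20))
    print(f"retries granted out of 20 asked: {granted}")
```

With a 10% budget, retries can add at most about 10% extra load over time, no matter how many individual calls fail. Integer credit is used instead of floats to avoid rounding drift in the balance.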
Just saying “2 retries” is not enough. What matters is:
- Retry on which error codes?
- Retry on which endpoints?
- Retry for which client segment?
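The three questions above can be answered in one explicit policy function rather than a bare retry count. This is a sketch under assumptions: the status codes, method set, endpoint names, and client tiers are hypothetical values chosen for illustration and would need to be tuned per service.

```python
from dataclasses import dataclass
from typing import AbstractSet, Optional

# Hypothetical policy values, not a recommendation for every service.
RETRYABLE_STATUS = frozenset({502, 503, 504})
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS"})


@dataclass(frozen=True)
class Attempt:
    method: str
    endpoint: str
    status: Optional[int]              # None = connection-level failure (e.g. reset)
    has_idempotency_key: bool = False
    client_tier: str = "interactive"   # hypothetical client segment label


def should_retry(attempt: Attempt,
                 retryable_endpoints: AbstractSet[str],
                 retryable_tiers: AbstractSet[str]) -> bool:
    """Retry only if the call is safe to repeat, the failure looks transient,
    and both the endpoint and the client segment are opted in."""
    safe = attempt.method in IDEMPOTENT_METHODS or attempt.has_idempotency_key
    transient = attempt.status is None or attempt.status in RETRYABLE_STATUS
    return (safe and transient
            and attempt.endpoint in retryable_endpoints
            and attempt.client_tier in retryable_tiers)


if __name__ == "__main__":
    endpoints = {"/items", "/orders"}
    tiers = {"interactive"}
    print(should_retry(Attempt("GET", "/items", 503), endpoints, tiers))
    print(should_retry(Attempt("POST", "/orders", None), endpoints, tiers))
```

A POST without an idempotency key is never retried here, even on a connection reset, because the client cannot know whether the first attempt was applied.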
4) Guardrails that limit latency amplification
The guardrail set that helps the most in the field:
- Backoff + jitter: spreads retries out
- Concurrency limit: caps how much work is in flight
- Queue + drop policy: prevents unbounded queue growth
- Circuit breaker: gives the system breathing room via fast-fail
- Load shedding: rejects low-priority work early (429/503)
Without these guardrails, retry collapses into “everyone retries at the same time.”
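As an illustration of the first guardrail, here is a minimal sketch of exponential backoff with “full jitter”: the delay is drawn uniformly between zero and an exponentially growing, capped ceiling. The base and cap values are assumptions picked for the example.

```python
import random


def backoff_delay(attempt: int, base_s: float = 0.05, cap_s: float = 1.0,
                  rng=random) -> float:
    """Delay in seconds before retry number `attempt` (1-based).

    The ceiling doubles with each retry up to cap_s; the actual delay is a
    uniform random value in [0, ceiling], so clients that failed at the same
    moment do not all come back at the same moment.
    """
    ceiling = min(cap_s, base_s * (2 ** (attempt - 1)))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(1, 6):
        print(f"retry {attempt}: sleep {backoff_delay(attempt):.3f}s")
```

Without the random component, every client that timed out together would retry together after the same fixed delay, recreating the spike one backoff interval later.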
5) Practical response during an incident
If you suspect a retry storm:
- Reduce or disable retries (especially for non-idempotent operations)
- Before “shortening” timeouts, check that deadline propagation actually works
- Lower the concurrency limit and bring the queue under control
- Watch the 429/503 ratio: failing early can reduce total damage
- Verify exponential backoff + jitter on the client side
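The concurrency-limit and fail-early items can be sketched as a limiter that rejects excess work immediately instead of queueing it, with a cap that can be lowered at runtime. The class below is a hypothetical illustration; how the limit gets changed during an incident (config push, admin endpoint) is left open.

```python
import threading


class ConcurrencyLimiter:
    """Caps in-flight work and rejects the excess instead of queueing it."""

    def __init__(self, limit: int):
        self._lock = threading.Lock()
        self._limit = limit
        self._in_flight = 0

    def set_limit(self, limit: int) -> None:
        """Lower or raise the cap at runtime, e.g. during an incident."""
        with self._lock:
            self._limit = limit

    def try_acquire(self) -> bool:
        """Reserve a slot; False means the caller should answer 429/503 now."""
        with self._lock:
            if self._in_flight >= self._limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free a slot; call exactly once per successful try_acquire."""
        with self._lock:
            self._in_flight -= 1


if __name__ == "__main__":
    limiter = ConcurrencyLimiter(limit=2)
    results = [limiter.try_acquire() for _ in range(3)]
    print(results)  # the third caller is shed because both slots are taken
```

Lowering the limit does not cancel work that is already running; it only stops new work from being admitted until in-flight requests drain below the new cap.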
The goal here is not to “force more requests through”; it is to protect the overall health of the system and bring it back to a stable state.
When retry is set up correctly, it produces resilience. When it is set up badly, it grows the outage. Success in production comes from designing the timeout budget, retry budget, and backpressure together.