The most expensive failure mode in production is not “everything is down”; it is uncontrolled collapse. Traffic spikes, a dependency slows down, the thread pool fills up, the queue swells… and suddenly the system starts losing everything at once. In this piece I want to look at how to turn that scenario, at the design level, into “controlled loss”: SLO-based degrade modes and load shedding.
What does “degrade mode” mean?
A degrade mode is a predefined way for the system, when under pressure, to give up on the promise of “everything at the same quality” and deliberately scale back some features.
Some example degrade goals:
- Hold p95 latency by switching off something expensive, such as “recommendations”
- Protect the payment flow; delay reporting and search results
- Throttle admin/back-office endpoints, protect customer-facing endpoints
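One way to make these degrade goals concrete is a playbook that maps a degrade level to the features it switches off, with a cheap fallback where a feature disappears. The sketch below assumes three levels; `DegradeLevel`, `PLAYBOOK`, `feature_enabled` and `product_page` are hypothetical names for illustration, and the feature names are examples rather than a recommendation.

```python
from enum import IntEnum


class DegradeLevel(IntEnum):
    NORMAL = 0
    LIGHT = 1  # shed nice-to-have features
    HEAVY = 2  # keep only the critical flows


# Hypothetical playbook: which features are switched off at each level.
PLAYBOOK = {
    DegradeLevel.NORMAL: set(),
    DegradeLevel.LIGHT: {"recommendations"},
    DegradeLevel.HEAVY: {"recommendations", "search", "reporting", "admin"},
}


def feature_enabled(feature: str, level: DegradeLevel) -> bool:
    """A feature is on unless the playbook switches it off at this level."""
    return feature not in PLAYBOOK[level]


def product_page(level: DegradeLevel) -> dict:
    """Hypothetical page handler: checkout is never in the playbook, so it stays on."""
    page = {"product": "details", "checkout": "enabled"}
    if feature_enabled("recommendations", level):
        page["recommendations"] = "personalised list"  # expensive path
    else:
        page["recommendations"] = "static best-sellers"  # cheap fallback
    return page
```

Keeping the playbook as data rather than scattered `if` statements makes it reviewable before an incident, which is the whole point of a predefined degrade mode.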
Load shedding: what do I “refuse”?
Load shedding answers two questions:
- Which requests am I refusing?
- On which signal do I start (or stop) refusing them?
The order I usually prefer:
- Low-priority batch: background jobs (recompute, refresh, export)
- Best-effort API: “nice-to-have” endpoints
- Anonymous traffic: unauthenticated / un-rate-limited entry points
- Misbehaving clients: clients generating a retry storm (for example, retrying without backoff)
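The shedding order above can be expressed as a small classifier plus a pressure level that decides how far down the list to refuse. This is a minimal sketch under assumptions: the request is a plain dict with hypothetical keys (`background`, `best_effort`, `authenticated`, `retry_storm`), and `pressure` is an integer set by whatever trigger engages the degrade mode.

```python
# Shedding order from the list above: categories earlier in the list are refused first.
SHED_ORDER = ["batch", "best_effort", "anonymous", "misbehaving"]


def classify(req: dict) -> str:
    """Hypothetical classifier; a real one would look at route, auth and client metadata.
    Checks run in SHED_ORDER order, so a request gets the earliest category it matches."""
    if req.get("background"):
        return "batch"
    if req.get("best_effort"):
        return "best_effort"
    if not req.get("authenticated"):
        return "anonymous"
    if req.get("retry_storm"):
        return "misbehaving"
    return "critical"  # not in SHED_ORDER, so never shed by should_shed


def should_shed(req: dict, pressure: int) -> bool:
    """pressure 0..4: the first `pressure` categories of SHED_ORDER are refused."""
    return classify(req) in SHED_ORDER[:pressure]
```

Refused requests should fail fast (for example with HTTP 429 or 503) so that refusing them costs almost nothing, while critical authenticated traffic keeps flowing at every pressure level.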
SLO signal: which metric do I trigger on?
A degrade mode should not be a “panic button”; it should be automated. I split the trigger signals into two groups:
- System signals: p95/p99 latency, error rate, queue depth, conn pool saturation, thread pool utilization
- Business signals: checkout success rate, login success, order placement, critical workflow completion
Combining the two groups is meant to reduce false positives and make the degrade mode engage at the right moment.
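To keep the trigger automatic without flapping, one option is hysteresis: engage only after several consecutive bad windows, release only after several good ones, and use a lower release threshold than the engage threshold. The sketch assumes one system signal (p95 latency) and one business signal (checkout success rate), evaluated once per window; the class name and every threshold are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DegradeTrigger:
    """Hysteresis trigger fed once per evaluation window.
    All thresholds are hypothetical examples, not recommendations."""
    p95_engage_ms: float = 800.0
    p95_release_ms: float = 500.0
    min_checkout_success: float = 0.97
    engage_after: int = 3   # consecutive bad windows before engaging
    release_after: int = 5  # consecutive good windows before releasing
    active: bool = field(default=False, init=False)
    _streak: int = field(default=0, init=False)

    def observe(self, p95_ms: float, checkout_success: float) -> bool:
        business_ok = checkout_success >= self.min_checkout_success
        if not self.active:
            # Engage only when the system signal AND the business signal agree.
            bad = p95_ms > self.p95_engage_ms and not business_ok
            self._streak = self._streak + 1 if bad else 0
            if self._streak >= self.engage_after:
                self.active, self._streak = True, 0
        else:
            # Release only once latency is clearly back under the lower threshold.
            good = p95_ms < self.p95_release_ms and business_ok
            self._streak = self._streak + 1 if good else 0
            if self._streak >= self.release_after:
                self.active, self._streak = False, 0
        return self.active
```

Requiring both signals to agree before engaging is one way to keep a latency blip that customers do not feel from tripping the mode; whether to combine them with AND or OR is a judgement call per flow.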
Control surface: where do I drive the degrade mode from?
The model that holds up in production combines these three pieces:
- Traffic shaping: rate limit + priority on ingress (LB / API gateway)
- Feature flags: turn the expensive feature off / fall back to cache
- Queue policy: priority queue + TTL + drop strategy
Trusting a single layer (gateway only, or flags only) is not enough. Real systems are layered.
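The queue-policy piece (priority queue + TTL + drop strategy) can be sketched in a few lines. Assumptions: priority 0 is the most important, items older than `ttl_s` are dropped on dequeue, and when the queue is full the least important item is evicted (the new item itself if nothing queued is less important). `SheddingQueue` is a hypothetical name, and the `dropped` list stands in for a DLQ.

```python
import heapq
import itertools
import time


class SheddingQueue:
    """Hypothetical bounded priority queue with per-item TTL and a drop policy."""

    def __init__(self, capacity: int, ttl_s: float, clock=time.monotonic):
        self.capacity = capacity
        self.ttl_s = ttl_s
        self.clock = clock
        self._heap = []  # entries: (priority, seq, enqueued_at, item)
        self._seq = itertools.count()  # unique tie-breaker, so items are never compared
        self.dropped = []  # stand-in for a dead-letter queue

    def put(self, item, priority: int) -> bool:
        entry = (priority, next(self._seq), self.clock(), item)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        # Full: the worst entry has the highest priority number (newest on ties).
        worst = max(self._heap)
        if entry < worst:  # the new item is more important, so evict the worst one
            self._heap.remove(worst)
            heapq.heapify(self._heap)
            heapq.heappush(self._heap, entry)
            self.dropped.append(worst[3])
            return True
        self.dropped.append(item)  # the new item is the least important: drop it
        return False

    def get(self):
        """Return the most important item still within its TTL; expired items are dropped."""
        while self._heap:
            _, _, enqueued_at, item = heapq.heappop(self._heap)
            if self.clock() - enqueued_at <= self.ttl_s:
                return item
            self.dropped.append(item)
        return None
```

Injecting `clock` keeps the TTL logic testable; the linear `max` and `remove` are fine for a sketch but would be replaced by a double-ended structure at high volume.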
Decision matrix: “when do I throttle what?”
Putting the matrix below into the runbook removes a lot of debate during an incident:
- Latency rising, errors low → make caching more aggressive, rate-limit the expensive endpoint
- Errors rising, dependency timing out → throttle outbound calls to the downstream, harden the retry policy
- Queue growing → lower the TTL, drop low-priority jobs
- Conn pool saturated → lower the concurrency limit, redirect reads to a read-only replica
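The decision matrix above can also live next to the code as a lookup, so an alert or dashboard can print the recommended actions. This is a hypothetical encoding: the `Signals` fields are booleans you would derive from your own thresholds, and the action strings simply mirror the runbook rows.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical booleans, each derived from your own thresholds."""
    latency_high: bool = False
    errors_high: bool = False
    dependency_timeouts: bool = False
    queue_growing: bool = False
    conn_pool_saturated: bool = False


def recommended_actions(s: Signals) -> list[str]:
    """Mirror of the runbook rows; several rows can match at once."""
    actions = []
    if s.latency_high and not s.errors_high:
        actions += ["make caching more aggressive", "rate-limit the expensive endpoint"]
    if s.errors_high and s.dependency_timeouts:
        actions += ["throttle outbound calls to the downstream", "harden the retry policy"]
    if s.queue_growing:
        actions += ["lower the queue TTL", "drop low-priority jobs"]
    if s.conn_pool_saturated:
        actions += ["lower the concurrency limit", "redirect reads to a read-only replica"]
    return actions
```

A combination that matches no row (for example high latency with high errors but no dependency timeouts) returns an empty list, which is itself a useful signal that the runbook is missing a row.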
A starter pack you can actually ship
You do not need to wait for “the big transformation”; as a first step the following is enough:
- One degrade playbook per critical flow (a list of features to switch off)
- Priority + rate limit at the API gateway (at minimum an anonymous-vs-authenticated split)
- A concurrency limiter at one or two critical points in the application
- For queues: TTL + DLQ + drop policy
- An SLO burn-rate panel and a “degrade active” panel in observability
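For the concurrency-limiter item, a minimal in-process version rejects work immediately once a fixed number of calls is in flight, instead of letting them queue. Assumptions: a threaded server, a fixed limit, and the hypothetical names `ConcurrencyLimiter`, `Overloaded` and `handle`.

```python
import threading
from contextlib import contextmanager


class Overloaded(Exception):
    """Hypothetical error type; the HTTP layer would map it to a fast 503 or 429."""


class ConcurrencyLimiter:
    """Allows at most `limit` in-flight calls; extra calls are rejected at once, not queued."""

    def __init__(self, limit: int):
        self._sem = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.rejected = 0

    @contextmanager
    def slot(self):
        if not self._sem.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise Overloaded("concurrency limit reached")
        try:
            yield
        finally:
            self._sem.release()  # always free the slot, even if the work raised


def handle(limiter: ConcurrencyLimiter, work):
    """Hypothetical request handler wrapped in the limiter."""
    try:
        with limiter.slot():
            return work()
    except Overloaded:
        return "503 Service Unavailable"
```

Rejecting instead of queueing is the design choice here: a queued request still holds memory and keeps a client waiting on a timeout, while a fast 503 lets the client back off.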
Final word: controlled loss is a sign of operational maturity
A degrade mode is not “lowering the quality”; it is a deliberate choice to keep the whole system upright. Once this discipline is in place, the tone of incidents changes: instead of panic you get manageable decisions, predictable impact and a shorter MTTR.