Mustafa Erbay
Technology · 11 min read

SLO-Based Degrade Modes and Load Shedding

Producing controlled loss instead of a random collapse when a system is under pressure: rate limits, queues, feature flags and prioritization.

The most expensive failure mode in production is not “everything is down”; it is uncontrolled collapse. Traffic spikes, a dependency slows down, the thread pool fills up, the queue swells… and suddenly the system starts losing everything at once. In this piece I want to talk about flipping that scenario, at the design level, into “controlled loss”: SLO-based degrade modes and load shedding.

What does “degrade mode” mean?

In a degrade mode, a system under pressure gives up the claim of “everything at the same quality” and accepts, in a predefined way, that some features will be cut back.

Some example degrade goals:

  • Hold p95 latency by switching off something expensive, such as recommendations
  • Protect the payment flow, delay reporting and search results
  • Throttle admin/back-office endpoints, protect customer-facing endpoints
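Goals like these can be written down as data before an incident ever happens. Below is a minimal sketch, assuming three hypothetical degrade levels and made-up feature names (`recommendation`, `reporting`, `search`, `admin_api`); the table, not the code, is where the real decisions live.

```python
from enum import IntEnum


class DegradeLevel(IntEnum):
    NORMAL = 0
    REDUCED = 1
    ESSENTIAL_ONLY = 2


# Hypothetical features, each mapped to the lowest level at which it is switched off.
DISABLED_FROM = {
    "recommendation": DegradeLevel.REDUCED,
    "reporting": DegradeLevel.REDUCED,
    "search": DegradeLevel.ESSENTIAL_ONLY,
    "admin_api": DegradeLevel.ESSENTIAL_ONLY,
}


def is_enabled(feature: str, level: DegradeLevel) -> bool:
    """Features missing from the table (e.g. payment) are never switched off."""
    cutoff = DISABLED_FROM.get(feature)
    return cutoff is None or level < cutoff
```

Leaving the payment flow out of the table is deliberate in this sketch: anything not listed is treated as protected.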

Load shedding: what do I “refuse”?

Load shedding answers two questions:

  1. Which requests am I refusing?
  2. On which signal do I start (or stop) refusing them?

The order I usually prefer:

  1. Low-priority batch: background jobs (recompute, refresh, export)
  2. Best-effort API: “nice-to-have” endpoints
  3. Anonymous traffic: unauthenticated / un-rate-limited entry points
  4. Misbehaving clients: clients whose faulty retry logic produces a retry storm
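The order above can be expressed as a single comparison. This is a sketch under assumptions: the `TrafficClass` names, the `CRITICAL` class and the integer `shed_level` are illustrative, and how a request gets classified is left out.

```python
from enum import IntEnum


class TrafficClass(IntEnum):
    # Lower value = refused earlier, mirroring the order of the list above.
    BATCH = 1
    BEST_EFFORT = 2
    ANONYMOUS = 3
    MISBEHAVING = 4
    CRITICAL = 5  # assumption: critical flows are never shed


def should_shed(request_class: TrafficClass, shed_level: int) -> bool:
    """shed_level 0 refuses nothing; level n refuses the first n classes."""
    if request_class is TrafficClass.CRITICAL:
        return False
    return request_class <= shed_level
```

Raising `shed_level` one step at a time widens the set of refused requests without touching the critical path.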

SLO signal: which metric do I trigger on?

A degrade mode should not be a “panic button” — it should be automation. I split the trigger signals into two groups:

  • System signals: p95/p99 latency, error rate, queue depth, connection pool saturation, thread pool utilization
  • Business signals: checkout success rate, login success, order placement, critical workflow completion

The goal of watching both groups is to lower false positives and to engage at the right moment.
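One way to automate the trigger is a small state machine with hysteresis. The sketch below is one possible policy, not the only one: it assumes the mode engages only when a system signal (p95) and a business signal (checkout success) are both bad for several consecutive checks, and disengages only after a stricter recovery threshold holds equally long. All thresholds are placeholder values.

```python
from dataclasses import dataclass, field


@dataclass
class DegradeTrigger:
    enter_p95_ms: float = 800.0          # assumed entry threshold
    exit_p95_ms: float = 500.0           # stricter exit threshold (hysteresis)
    min_checkout_success: float = 0.97   # assumed business threshold
    required_checks: int = 3             # consecutive checks before flipping
    active: bool = False
    _streak: int = field(default=0, init=False)

    def observe(self, p95_ms: float, checkout_success: float) -> bool:
        """Feed one measurement; returns whether degrade mode is active."""
        if self.active:
            # While active, count consecutive healthy checks.
            counts = (p95_ms < self.exit_p95_ms
                      and checkout_success >= self.min_checkout_success)
        else:
            # While inactive, count consecutive checks where both signals are bad.
            counts = (p95_ms > self.enter_p95_ms
                      and checkout_success < self.min_checkout_success)
        self._streak = self._streak + 1 if counts else 0
        if self._streak >= self.required_checks:
            self.active = not self.active
            self._streak = 0
        return self.active
```

A single latency spike resets nothing permanently and flips nothing; only a sustained breach engages the mode, and the gap between the entry and exit thresholds keeps it from flapping.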

Control surface: where do I drive the degrade mode from?

The model that holds up in production combines these three pieces:

  • Traffic shaping: rate limit + priority on ingress (LB / API gateway)
  • Feature flags: turn the expensive feature off / fall back to cache
  • Queue policy: priority queue + TTL + drop strategy

Trusting a single layer (gateway only, or flags only) is not enough. Real systems are layered.
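To make the queue-policy piece concrete, here is a minimal in-memory sketch of “priority queue + TTL + drop strategy”. The class name `SheddingQueue` and its parameters are hypothetical, and a real broker would enforce these rules itself; the point is only to show the three rules working together.

```python
import heapq
import itertools
import time
from typing import Any, Callable, Optional


class SheddingQueue:
    """Bounded priority queue; a lower priority number is served first."""

    def __init__(self, max_size: int, clock: Callable[[], float] = time.monotonic):
        self._heap: list = []
        self._seq = itertools.count()  # tie-breaker, keeps FIFO order per priority
        self._max_size = max_size
        self._clock = clock

    def put(self, item: Any, priority: int, ttl_s: float) -> bool:
        """Returns False when the item is dropped instead of queued."""
        entry = (priority, next(self._seq), self._clock() + ttl_s, item)
        if len(self._heap) < self._max_size:
            heapq.heappush(self._heap, entry)
            return True
        worst = max(self._heap)  # O(n) scan, acceptable for a sketch
        if entry > worst:
            return False  # the new item is the least important: drop it
        # Otherwise evict the least important queued item to make room.
        self._heap.remove(worst)
        heapq.heapify(self._heap)
        heapq.heappush(self._heap, entry)
        return True

    def get(self) -> Optional[Any]:
        """Returns the most important non-expired item, or None if empty."""
        now = self._clock()
        while self._heap:
            _, _, deadline, item = heapq.heappop(self._heap)
            if deadline > now:
                return item
            # Expired while waiting: silently discard and keep looking.
        return None
```

Dropping expired items at dequeue time means the system never spends capacity on work whose caller has most likely already given up.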

Decision matrix: “when do I throttle what?”

Putting the matrix below into the runbook removes a lot of debate during an incident:

  1. Latency rising, errors low → make caching more aggressive, rate-limit the expensive endpoint
  2. Errors rising, dependency timing out → throttle outbound calls to the downstream, tighten the retry policy (fewer attempts, longer backoff)
  3. Queue growing → lower the TTL, drop low-priority jobs
  4. Connection pool saturated → lower the concurrency limit, redirect reads to a read-only replica
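The matrix is small enough to encode directly, which also makes it testable. In this sketch the `Signals` fields and the action names are illustrative labels, not identifiers from any real tool; deciding when a signal counts as “high” is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    latency_high: bool = False
    errors_high: bool = False
    dependency_timeouts: bool = False
    queue_growing: bool = False
    conn_pool_saturated: bool = False


def recommended_actions(s: Signals) -> list[str]:
    """Maps the four rows of the matrix to action labels, in row order."""
    actions: list[str] = []
    if s.latency_high and not s.errors_high:
        actions += ["cache_aggressively", "rate_limit_expensive_endpoint"]
    if s.errors_high and s.dependency_timeouts:
        actions += ["throttle_downstream_calls", "tighten_retry_policy"]
    if s.queue_growing:
        actions += ["lower_queue_ttl", "drop_low_priority_jobs"]
    if s.conn_pool_saturated:
        actions += ["lower_concurrency_limit", "route_reads_to_replica"]
    return actions
```

Rows are not mutually exclusive, so several can fire at once; a growing queue together with a saturated pool yields both sets of actions.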

A starter pack you can actually ship

You do not need to wait for “the big transformation”; as a first step the following is enough:

  • One degrade playbook per critical flow (a list of features to switch off)
  • Priority + rate limit at the API gateway (at minimum an anonymous-vs-authenticated split)
  • A concurrency limiter at one or two critical points in the application
  • For queues: TTL + DLQ + drop policy
  • An SLO burn-rate panel and a “degrade active” panel in observability
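Of the items above, the concurrency limiter is the quickest to add in application code. A minimal sketch for a threaded service, assuming a hypothetical `Overloaded` exception that the caller turns into a fast rejection (for example HTTP 503):

```python
import threading
from contextlib import contextmanager


class Overloaded(Exception):
    """Raised when the limiter refuses a request instead of queueing it."""


class ConcurrencyLimiter:
    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: fail fast rather than pile up waiting threads.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("concurrency limit reached")
        try:
            yield
        finally:
            self._slots.release()
```

Usage is `with limiter.slot(): handle(request)`, where `handle` stands in for the protected call. Refusing immediately, rather than waiting for a slot, is the design choice that keeps the thread pool from filling up behind a slow dependency.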

Final word: controlled loss is a sign of operational maturity

A degrade mode is not “lowering the quality”; it is a deliberate choice to keep the whole system upright. Once this discipline is in place, the tone of incidents changes: instead of panic you get manageable decisions, predictable impact and a shorter MTTR.

Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security and Software

Since 2006 I have been working on systems architecture, networking, server infrastructure, large-scale deployments, software and systems security. On this blog I share technical experience that has held up in the field.