Mustafa Erbay

Operational Runbook for JWKS Key Rotation

A runbook to triage the wave of 401s (kid mismatch / stale JWKS cache) that can occur during JWT key rotation, and to set up a safe overlap and caching strategy.

One of the most expensive incidents at the identity layer is this: “Login works, but every service returns 401.” The cause is most often not application code but key rotation: the JWT signing key changes and the JWKS is updated, but clients and gateways keep serving the old JWKS from their cache.

This runbook helps you quickly diagnose “kid mismatch” class errors and make rotation safe at enterprise scale.

1) Symptoms: when should you suspect JWKS?

Typical signals:

  • 401/403 rate suddenly spikes
  • The problem shows up in all services at the same time (if the gateway/edge does verification)
  • The errors start within minutes of a deploy (IdP/gateway change)
  • Logs show “unknown kid”, “no matching key”, “signature verification failed”

2) Triage: produce evidence in 10 minutes

2.1 Which kid is failing?

Example log search (adjust for your environment):

rg -n "kid|jwks|signature|unknown key|no matching" /var/log -S | head

Goal: capture the kid value from the error message.
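
If the logs do not print the kid but you can capture a failing token, the kid can be read straight from the token header. The sketch below is one way to do that with the Python standard library; the function names are illustrative, and it deliberately does not verify the signature, so use it for triage only.

```python
import base64
import json


def jwt_header(token: str) -> dict:
    """Decode the JOSE header of a compact JWT without verifying the signature."""
    header_b64 = token.split(".")[0]
    # base64url encoding drops '=' padding; restore it before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def jwt_kid(token: str):
    """Return the kid from the token header, or None if the header has no kid."""
    return jwt_header(token).get("kid")
```

Calling jwt_kid on a failing token gives you the value to look for in the JWKS kid list in the next step.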

2.2 Is the JWKS endpoint really publishing the new key?

On the operations side, the useful test is not just “is the endpoint reachable?” but “which kids does it publish?”

curl -fsS https://<idp-or-gateway>/.well-known/jwks.json | jq -r '.keys[].kid'

If the failing token’s kid is not in the kid list:

  • The rotation may have replaced the old key outright (a single key, no overlap)
  • The wrong environment may have been deployed
  • The CDN/LB layer may be carrying the old JWKS response
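
To tell these cases apart, it can help to compare the JWKS fetched directly from the IdP (the origin) with the JWKS served through the CDN/LB (the edge). The following is a minimal sketch that assumes you have already fetched and parsed both documents; the function names and the returned labels are hypothetical.

```python
def published_kids(jwks: dict) -> set:
    """Collect the kid values from a parsed JWKS document."""
    return {key["kid"] for key in jwks.get("keys", []) if "kid" in key}


def locate_stale_layer(failing_kid: str, origin_jwks: dict, edge_jwks: dict) -> str:
    """Classify where the failing kid is missing (labels are illustrative)."""
    in_origin = failing_kid in published_kids(origin_jwks)
    in_edge = failing_kid in published_kids(edge_jwks)
    if in_origin and in_edge:
        # JWKS is correct everywhere; a gateway or in-app cache is likely stale.
        return "downstream-cache"
    if in_origin:
        # The origin publishes the key but the CDN/LB still serves an old copy.
        return "stale-edge"
    if in_edge:
        # Only the cached copy has the key: the origin dropped it (no overlap).
        return "origin-dropped-key"
    # Nobody publishes the key: no-overlap rotation or wrong environment.
    return "not-published"
```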

2.3 Identify the cache layer (the most critical step)

JWKS is most often cached in these layers:

  • API gateway / reverse proxy (Envoy, NGINX, Kong, etc.)
  • JWKS cache inside the application (SDK)
  • CDN (with the wrong Cache-Control)
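
One quick signal for the CDN layer is the Cache-Control and Age headers on the JWKS response. The sketch below estimates how long a cached copy may still be served; it assumes you pass in the response headers as a dict, only looks at max-age (s-maxage, which shared caches prefer, is ignored for brevity), and uses hypothetical function names.

```python
from typing import Optional


def max_age_seconds(cache_control: str) -> Optional[int]:
    """Extract max-age from a Cache-Control header value, if present."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip('"')
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def remaining_freshness(headers: dict) -> Optional[int]:
    """Seconds a cache may keep serving this response, or None if unknown."""
    lowered = {name.lower(): value for name, value in headers.items()}
    max_age = max_age_seconds(lowered.get("cache-control", ""))
    if max_age is None:
        return None
    age = lowered.get("age", "0").strip()
    age_seconds = int(age) if age.isdigit() else 0
    return max(max_age - age_seconds, 0)
```

A large remaining value on the JWKS path suggests the edge will keep serving the old key set for that long unless it is purged.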

3) Quick mitigation: stop the 401 wave

Priority: bring production back. Then do the rotation “correctly.”

3.1 Re-publish the old key (fastest rollback)

If possible, publish the old and the new key together in JWKS (overlap). This way:

  • Even when caches pull the new JWKS, old tokens still validate
  • You can do a phased transition
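
As a rough illustration of the overlap, the sketch below merges an old and a new JWKS document into one. It assumes both are already parsed into dicts; merge_jwks is a hypothetical helper, not part of any IdP's tooling, and how you actually publish the result depends on your IdP.

```python
def merge_jwks(old: dict, new: dict) -> dict:
    """Combine two JWKS documents, listing the new keys first."""
    merged = {}
    for jwk in new.get("keys", []) + old.get("keys", []):
        kid = jwk.get("kid")
        if kid in merged:
            if merged[kid] != jwk:
                # Same kid, different key material: refuse to publish this.
                raise ValueError(f"kid {kid!r} maps to two different keys")
            continue  # identical duplicate, keep a single copy
        merged[kid] = jwk
    return {"keys": list(merged.values())}
```

The merge refuses to continue when one kid points to two different keys, because publishing that would make cached verifiers disagree about which key the kid means.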

3.2 Temporarily lower JWKS cache duration

Temporary policy:

  • Cache-Control: max-age=60 (or lower)
  • Shorten the JWKS refresh period on gateways
  • Bypass/purge the JWKS path on the CDN
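
To see why the cache duration matters, here is a minimal sketch of an in-process JWKS cache with a fixed TTL. It is not how any particular gateway or SDK implements caching; JwksCache, the injected fetch callable, and the injected clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class JwksCache:
    """Tiny JWKS cache that refetches only after the TTL has elapsed."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # returns a parsed JWKS document
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys = {}
        self._expires_at = None      # None means nothing is cached yet

    def get_key(self, kid: str) -> Optional[dict]:
        """Return the JWK for kid, refreshing the cache if it has expired."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            jwks = self._fetch()
            self._keys = {k["kid"]: k for k in jwks.get("keys", []) if "kid" in k}
            self._expires_at = now + self._ttl
        return self._keys.get(kid)
```

With a TTL of 3600, a verifier like this would keep rejecting a new kid for up to an hour after rotation; with a TTL of 60, for up to a minute.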

3.3 Fix the “kid” generation strategy

Wrong practice: publishing different keys with the same kid. Combined with caching, verifiers can no longer tell which key a kid refers to, so validation fails unpredictably.

Right practice:

  • Every new key gets a new kid, so a change in kid always means the key has changed.
  • The old kid stays in JWKS for a while longer (grace period).
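
One way to guarantee that a kid never points to two different keys is to derive it from the key material itself. The sketch below follows the JWK Thumbprint construction from RFC 7638 (SHA-256 over the required members, base64url without padding). Using thumbprints as kids is an assumption here, not something this runbook requires, and thumbprint_kid is a hypothetical helper.

```python
import base64
import hashlib
import json

# Required members per key type (RFC 7638 for RSA/EC, RFC 8037 for OKP).
_REQUIRED_MEMBERS = {
    "RSA": ("e", "kty", "n"),
    "EC": ("crv", "kty", "x", "y"),
    "OKP": ("crv", "kty", "x"),
}


def thumbprint_kid(jwk: dict) -> str:
    """Derive a kid from the public key material of a JWK."""
    members = _REQUIRED_MEMBERS[jwk["kty"]]
    # Canonical form: required members only, sorted, no whitespace.
    canonical = json.dumps({m: jwk[m] for m in members},
                           separators=(",", ":"), sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Because optional members such as use or alg are excluded, the same key always yields the same kid, and a different key yields a different one.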

4) Permanent solution: safe rotation design

4.1 Define an overlap (dual-key) window

Suggested operational rule:

  • “The old key stays in JWKS for at least 2× the maximum token TTL.”

Example:

  • Token TTL: 30 min
  • Overlap: ≥ 60 min
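
The rule is simple arithmetic, but encoding it avoids removing the old key too early. This is a minimal sketch; the function names are hypothetical, and the factor of 2 is the rule stated above.

```python
from datetime import datetime, timedelta, timezone


def earliest_removal(rotated_at: datetime, max_token_ttl: timedelta,
                     factor: int = 2) -> datetime:
    """Earliest time the old key may leave JWKS: rotation + factor x TTL."""
    return rotated_at + factor * max_token_ttl


def safe_to_remove(now: datetime, rotated_at: datetime,
                   max_token_ttl: timedelta) -> bool:
    """True once the overlap window has fully elapsed."""
    return now >= earliest_removal(rotated_at, max_token_ttl)
```

For a rotation at 12:00 UTC with a 30-minute token TTL, earliest_removal returns 13:00 UTC, matching the example above.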

4.2 Rotation checklist (pre-deploy)

  • New key generated, new kid ready
  • Verified that old + new are published together in JWKS
  • JWKS Cache-Control is set deliberately, not left at a default
  • Gateway/JWKS cache refresh interval is known
  • Alert and dashboard ready for 401 rate

4.3 Alerts and validation

Recommended metrics:

  • 401 rate (gateway + per service)
  • “unknown kid” log rate
  • JWKS fetch error rate / latency
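
As a rough sketch of what an alert on the 401 rate could check, the function below compares the current rate against a baseline. The multiplier, the floor, and the minimum request count are placeholder assumptions to tune for your traffic, and in practice this logic would usually live in your monitoring system rather than in application code.

```python
def is_401_spike(unauthorized: int, total: int, baseline_rate: float,
                 multiplier: float = 5.0, floor_rate: float = 0.01,
                 min_requests: int = 100) -> bool:
    """Flag a window whose 401 rate is far above the usual baseline."""
    if total < min_requests:
        # Too little traffic in the window to judge reliably.
        return False
    current_rate = unauthorized / total
    # The floor keeps a near-zero baseline from firing on a handful of 401s.
    threshold = max(baseline_rate * multiplier, floor_rate)
    return current_rate > threshold
```

With a baseline of 1%, 400 unauthorized responses out of 1000 requests would trip the check, while 12 out of 1000 would not.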

5) Runbook close-out: post-rotation cleanup

After rotation has stabilized:

  • Verify that every token signed with the old key has expired (maximum token TTL elapsed) before removing the old key from JWKS
  • Bring cache settings back to normal (very frequent fetches are unnecessary load too)
  • Add the question “which cache layer prolonged it?” to the postmortem

Key rotation is not just a security task; it’s an operational continuity task. Organizations that rotate safely treat rotation not as a “do once and forget” task, but as a recurring change that is rehearsed.
