Operational Runbook for JWKS Key Rotation

One of the most expensive incidents at the identity layer is this: “Login works, but every service returns 401.” The cause is most often not application code but key rotation: the JWT signing key changes and the JWKS is updated, but clients and gateways keep serving the old JWKS from their cache.

This runbook helps you quickly diagnose “kid mismatch” class errors and make rotation safe at enterprise scale.

1) Symptoms: when should you suspect JWKS?

Typical signals:

401/403 rate suddenly spikes
The problem shows up in all services at the same time (if the gateway/edge does verification)
The errors start within minutes of a deploy (IdP or gateway change)
Logs show “unknown kid”, “no matching key”, “signature verification failed”

2) Triage: produce evidence in 10 minutes

2.1 Which `kid` is failing?

Example log search (adjust for your environment):

rg -n "kid|jwks|signature|unknown key|no matching" /var/log -S | head

Goal: capture the kid value from the error message.
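
If the log line does not include the kid but you have a failing token, the kid can be read from the token header. The following is a minimal sketch, not a verifier: `jwt_kid` is a hypothetical helper name, it assumes GNU `base64 -d` and a flat JSON header, and it does not check the signature.

```shell
# Print the kid from a JWT header without verifying the signature.
# jwt_kid is a hypothetical helper; the token is passed as the first argument.
jwt_kid() {
  local header pad
  # The header is the first dot-separated segment, base64url-encoded.
  header=${1%%.*}
  # Convert the base64url alphabet to standard base64.
  header=$(printf '%s' "$header" | tr '_-' '/+')
  # Restore the padding that base64url strips.
  pad=$(( (4 - ${#header} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do
    header="${header}="
    pad=$((pad - 1))
  done
  # Decode and extract the "kid" field (assumes a flat JSON header).
  printf '%s' "$header" | base64 -d |
    sed -n 's/.*"kid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (TOKEN is a placeholder for a token captured from a failing request):
#   jwt_kid "$TOKEN"
```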

2.2 Is the JWKS endpoint really publishing the new key?

Operationally, the best test is not just “is the endpoint reachable?” but “which kids does it list?”

curl -fsS https://<idp-or-gateway>/.well-known/jwks.json | jq -r '.keys[].kid'

If the failing token’s kid is not in the kid list, the likely causes are:

The rotation may have been a “single key” swap (no overlap)
The wrong environment may have been deployed
The CDN/LB layer may be carrying the old JWKS response
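
The membership check itself can be scripted so that it gives a yes/no answer during an incident. This is a small sketch; `kid_published` is a hypothetical helper that reads the kid list produced by the `curl | jq` command above on stdin.

```shell
# Succeed if the given kid appears in a newline-separated kid list on stdin.
# kid_published is a hypothetical helper name.
kid_published() {
  # -F: fixed string, -x: whole-line match, -q: exit status only.
  grep -Fxq -- "$1"
}

# Usage (JWKS_URL and FAILING_KID are placeholders you set yourself):
#   curl -fsS "$JWKS_URL" | jq -r '.keys[].kid' | kid_published "$FAILING_KID" \
#     && echo "kid is published" || echo "kid is NOT published"
```

Whole-line matching matters here: a kid such as `key-1` should not count as present just because `key-10` is listed.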

2.3 Identify the cache layer (the most critical step)

JWKS is most often cached in these layers:

API gateway / reverse proxy (Envoy, NGINX, Kong, etc.)
JWKS cache inside the application (SDK)
CDN (with the wrong Cache-Control)
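
One way to narrow down the layer is to look at the caching headers each hop returns for the JWKS path. The sketch below only parses headers; `cache_max_age` is a hypothetical helper, and the URLs you feed it (edge versus origin) are your own.

```shell
# Print the max-age value (seconds) from HTTP response headers on stdin.
# Prints nothing if there is no Cache-Control max-age directive.
# cache_max_age is a hypothetical helper name.
cache_max_age() {
  tr -d '\r' |
    sed -n 's/^[Cc]ache-[Cc]ontrol:.*max-age=\([0-9][0-9]*\).*/\1/p' |
    head -n 1
}

# Usage (JWKS_URL is a placeholder):
#   curl -fsSI "$JWKS_URL" | cache_max_age
```

Comparing the value at the public hostname with the value at the origin shows whether a CDN or load balancer is rewriting the policy. A non-zero `Age` header in the same response usually indicates that a shared cache answered instead of the origin.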

3) Quick mitigation: stop the 401 wave

Priority: restore production first, then redo the rotation correctly.

3.1 Re-publish the old key (fastest rollback)

If possible, publish the old and the new key together in the JWKS (overlap). That way:

Tokens signed with the old key still validate even after caches pull the new JWKS
You can do a phased transition

3.2 Temporarily lower JWKS cache duration

Temporary policy:

Cache-Control: max-age=60 (or lower)
Shorten the JWKS refresh period on gateways
Bypass/purge the JWKS path on the CDN

3.3 Fix the “kid” generation strategy

Wrong practice: publishing different keys with the same kid. Combined with caching, this turns into verification chaos.

Right practice:

Every new key gets a new kid, so a change in kid always means the key has changed.
The old kid stays in JWKS for a while longer (grace period).
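
One way to guarantee that a kid is never reused for a different key is to derive it from the key material. The source does not prescribe a scheme, so treat this as an illustrative assumption: `kid_from_key_file` is a hypothetical helper, it relies on `sha256sum` from GNU coreutils, and truncating to 16 hex characters is an arbitrary choice. It is similar in spirit to JWK thumbprints but is not the RFC 7638 computation.

```shell
# Derive a kid from the bytes of a public key file.
# Assumption: the first 16 hex characters of the SHA-256 digest are unique
# enough for your key set. kid_from_key_file is a hypothetical helper name.
kid_from_key_file() {
  sha256sum "$1" | cut -c1-16
}

# Usage (the path is a placeholder):
#   kid_from_key_file ./new-signing-key.pub.pem
```

Because the kid is a function of the key bytes, the same key always yields the same kid and a new key yields a new one.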

4) Permanent solution: safe rotation design

4.1 Define an overlap (dual-key) window

Suggested operational rule:

“The old key stays in JWKS for at least 2× the maximum token TTL.”

Example:

Token TTL: 30 min
Overlap: ≥ 60 min
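
The rule reduces to simple arithmetic, sketched below. Both function names are hypothetical, and all times are in seconds (epoch seconds for timestamps).

```shell
# Minimum overlap in seconds: 2x the maximum token TTL, per the rule above.
min_overlap_seconds() {
  echo $(( 2 * $1 ))
}

# Succeed once the overlap window has fully elapsed.
# Arguments: rotation time (epoch s), max token TTL (s), current time (epoch s).
overlap_elapsed() {
  local rotated_at=$1 ttl=$2 now=$3
  [ "$now" -ge $(( rotated_at + $(min_overlap_seconds "$ttl") )) ]
}

# Usage (ROTATED_AT is a placeholder for the recorded rotation time):
#   overlap_elapsed "$ROTATED_AT" 1800 "$(date +%s)" && echo "old key may be removed"
```

With a 30 minute TTL (1800 seconds), `min_overlap_seconds 1800` gives 3600, which matches the 60 minute overlap in the example.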

4.2 Rotation checklist (pre-deploy)

New key generated, new kid ready
Verified that old + new are published together in JWKS
JWKS Cache-Control set deliberately
Gateway/JWKS cache refresh interval is known
Alert and dashboard ready for 401 rate
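
The “old + new published together” item can be turned into a gate in a deploy pipeline. A minimal sketch follows; `missing_kids` is a hypothetical helper that reads the published kid list on stdin and takes the expected kids as arguments.

```shell
# Print every expected kid (arguments) that is absent from the
# newline-separated kid list on stdin. Returns non-zero if any is missing.
# missing_kids is a hypothetical helper name.
missing_kids() {
  local published kid status=0
  # Read the whole list once so it can be checked against each argument.
  published=$(cat)
  for kid in "$@"; do
    if ! printf '%s\n' "$published" | grep -Fxq -- "$kid"; then
      printf '%s\n' "$kid"
      status=1
    fi
  done
  return "$status"
}

# Usage (JWKS_URL, OLD_KID and NEW_KID are placeholders):
#   curl -fsS "$JWKS_URL" | jq -r '.keys[].kid' | missing_kids "$OLD_KID" "$NEW_KID" \
#     || echo "abort: JWKS does not contain both keys"
```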

4.3 Alerts and validation

Recommended metrics:

401 rate (gateway + per service)
“unknown kid” log rate
JWKS fetch error rate / latency
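
If no metrics pipeline is in place yet, the “unknown kid” rate can be approximated straight from logs. This is a rough sketch: `count_kid_errors` is a hypothetical helper, and the two message patterns are taken from the symptoms section, so adjust them to what your gateway actually logs.

```shell
# Count log lines on stdin that look like kid-mismatch errors.
# count_kid_errors is a hypothetical helper name.
count_kid_errors() {
  # -c: count matching lines, -i: ignore case, -E: extended regex.
  # grep exits non-zero when the count is 0, so normalize the exit status.
  grep -ciE 'unknown kid|no matching key' || true
}

# Usage (the log path is a placeholder):
#   tail -n 10000 /var/log/gateway/access.log | count_kid_errors
```

Sampling this count at a fixed interval gives a crude rate that can drive an alert until a proper metric exists.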

5) Runbook close-out: post-rotation cleanup

After rotation has stabilized:

Verify that every token signed with the old key has expired before removing the old key
Restore normal cache settings (overly frequent fetches add unnecessary load)
Add the question “which cache layer prolonged the outage?” to the postmortem

Key rotation is not just a security task; it’s an operational continuity task. Mature organizations treat rotation not as a “do once and forget” task but as a recurring change that is rehearsed.
