One of the most expensive incidents at the identity layer looks like this: “Login works, but every service returns 401.” The cause is usually not application code but key rotation: the JWT signing key changes and JWKS is updated, but clients and gateways keep using the old JWKS from their caches.
This runbook helps you quickly diagnose “kid mismatch” class errors and make rotation safe at enterprise scale.
1) Symptoms: when should you suspect JWKS?
Typical signals:
- 401/403 rate suddenly spikes
- The problem shows up in all services at the same time (if the gateway/edge does verification)
- The problem starts within minutes of a deploy (IdP/gateway change)
- Logs show “unknown kid”, “no matching key”, “signature verification failed”
2) Triage: produce evidence in 10 minutes
2.1 Which kid is failing?
Example log search (adjust for your environment):
rg -n "kid|jwks|signature|unknown key|no matching" /var/log -S | head
Goal: capture the kid value from the error message.
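If the logs do not print the kid, it can be read directly from a failing token, because the JWT header is only base64url-encoded JSON. The sketch below is a minimal illustration; the function names `jwt_header` and `jwt_kid` are hypothetical, and the code deliberately does not verify the signature, so it is for diagnosis only.

```python
import base64
import json


def jwt_header(token: str) -> dict:
    """Decode the header segment of a compact JWT WITHOUT verifying it."""
    header_b64 = token.split(".")[0]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def jwt_kid(token: str):
    """Return the kid from the token header, or None if the header has none."""
    return jwt_header(token).get("kid")
```

Run it against a token captured from a failing request and compare the result with the kid reported in the error logs.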
2.2 Is the JWKS endpoint really publishing the new key?
On the operations side, the best test is not just whether the endpoint responds, but which kid values it actually publishes.
curl -fsS https://<idp-or-gateway>/.well-known/jwks.json | jq -r '.keys[].kid'
If the failing token’s kid is not in the kid list:
- The rotation might have been done as a “single key” (no overlap)
- The wrong environment may have been deployed
- The CDN/LB layer may be carrying the old JWKS response
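The comparison between the failing kid and the published list can also be scripted. This is a minimal sketch that assumes the JWKS document has already been fetched and parsed into a dict (for example from the curl output above); `published_kids` and `kid_is_published` are hypothetical helper names.

```python
def published_kids(jwks: dict) -> set:
    """Collect every kid present in a parsed JWKS document."""
    return {key["kid"] for key in jwks.get("keys", []) if "kid" in key}


def kid_is_published(jwks: dict, kid: str) -> bool:
    """True if the given kid appears in the JWKS key list."""
    return kid in published_kids(jwks)
```

If `kid_is_published` returns False for the kid of a token that was issued recently, the issuer and the JWKS endpoint disagree, which points at one of the causes listed above.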
2.3 Identify the cache layer (the most critical step)
JWKS is most often cached in these layers:
- API gateway / reverse proxy (Envoy, NGINX, Kong, etc.)
- JWKS cache inside the application (SDK)
- CDN (with the wrong Cache-Control)
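Response headers from the JWKS endpoint often hint at which layer served a stale copy. The sketch below inspects a few headers; `Age` and `Via` are standard HTTP headers, while `X-Cache` is a non-standard header that many CDNs and proxies emit, so its presence and format are an assumption about your environment. The function name `cache_hints` is hypothetical.

```python
def cache_hints(headers: dict) -> list:
    """Return human-readable hints that a response was served by a cache."""
    h = {name.lower(): value.strip() for name, value in headers.items()}
    hints = []
    age = h.get("age", "")
    if age.isdigit() and int(age) > 0:
        # A positive Age means a cache served this response, not the origin.
        hints.append(f"Age={age}s: served from a cache")
    if "via" in h:
        hints.append(f"Via={h['via']}: an intermediary handled the response")
    if "hit" in h.get("x-cache", "").lower():
        # X-Cache is non-standard; treat it as a hint only.
        hints.append("X-Cache reports a cache hit")
    return hints
```

An empty list does not prove that no cache is involved: gateway and in-application JWKS caches leave no trace in these headers and must be checked in their own configuration.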
3) Quick mitigation: stop the 401 wave
Priority: bring production back. Then do the rotation “correctly.”
3.1 Re-publish the old key (fastest rollback)
If possible, publish the old and the new key together in JWKS (overlap). This way:
- Even when caches pull the new JWKS, old tokens still validate
- You can do a phased transition
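Building the overlapping document amounts to merging the two key lists. The following is a minimal sketch, assuming both the old and the new JWKS are available as parsed dicts; `merge_jwks` is a hypothetical helper, and how the merged document gets published depends on your IdP or gateway.

```python
def merge_jwks(old_jwks: dict, new_jwks: dict) -> dict:
    """Combine two JWKS documents so old and new keys are served together."""
    merged = {}
    for jwks in (old_jwks, new_jwks):
        for key in jwks.get("keys", []):
            kid = key["kid"]
            # Refuse to publish two different keys under the same kid.
            if kid in merged and merged[kid] != key:
                raise ValueError(f"kid {kid!r} maps to two different keys")
            merged[kid] = key
    return {"keys": list(merged.values())}
```

The merge fails loudly when one kid maps to different key material, since publishing that combination would make verification results depend on which copy a cache happens to hold.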
3.2 Temporarily lower JWKS cache duration
Temporary policy:
- Cache-Control: max-age=60 (or lower)
- Shorten the JWKS refresh period on gateways
- Bypass/purge the JWKS path on the CDN
3.3 Fix the “kid” generation strategy
Wrong practice: publishing different keys with the same kid. Combined with caching, this turns into verification chaos.
Right practice:
- A change in kid means the key has changed.
- The old kid stays in JWKS for a while longer (grace period).
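One way to guarantee that a kid changes exactly when the key changes is to derive the kid from the key material itself. The sketch below follows the JWK Thumbprint construction from RFC 7638 (SHA-256 over the required members in lexicographic order, base64url without padding). This is one possible strategy offered as an illustration, not necessarily the scheme your IdP uses; `thumbprint_kid` is a hypothetical name.

```python
import base64
import hashlib
import json

# Required members per key type (RFC 7638; OKP per RFC 8037).
_REQUIRED_MEMBERS = {
    "RSA": ("e", "kty", "n"),
    "EC": ("crv", "kty", "x", "y"),
    "oct": ("k", "kty"),
    "OKP": ("crv", "kty", "x"),
}


def thumbprint_kid(jwk: dict) -> str:
    """Derive a kid from the key material using a SHA-256 JWK thumbprint."""
    members = _REQUIRED_MEMBERS[jwk["kty"]]
    # Canonical form: required members only, sorted names, no whitespace.
    canonical = json.dumps(
        {name: jwk[name] for name in members},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Because optional members such as `use` or `alg` are excluded, the same key always yields the same kid, and two different keys cannot end up sharing one by accident.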
4) Permanent solution: safe rotation design
4.1 Define an overlap (dual-key) window
Suggested operational rule:
- “The old key stays in JWKS for at least 2× the maximum token TTL.”
Example:
- Token TTL: 30 min
- Overlap: ≥ 60 min
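The rule is simple arithmetic and can be encoded so that cleanup automation does not remove the old key too early. This is a minimal sketch; `earliest_old_key_removal` is a hypothetical name, and `rotated_at` is assumed to be the moment the issuer stopped signing with the old key.

```python
from datetime import datetime, timedelta, timezone


def earliest_old_key_removal(
    rotated_at: datetime, max_token_ttl: timedelta, factor: int = 2
) -> datetime:
    """Earliest time the old key may leave JWKS under the 2x TTL rule."""
    return rotated_at + factor * max_token_ttl


# Example from the text: 30-minute tokens give a 60-minute overlap.
rotated = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
removal = earliest_old_key_removal(rotated, timedelta(minutes=30))
```

With the values above, `removal` is 13:00 UTC on the same day, one hour after the rotation.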
4.2 Rotation checklist (pre-deploy)
- New key generated, new kid ready
- Verified that old + new are published together in JWKS
- JWKS Cache-Control set deliberately
- Gateway/JWKS cache refresh interval is known
- Alert and dashboard ready for 401 rate
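Parts of this checklist can be automated as a pre-deploy gate. The sketch below covers only the JWKS and Cache-Control items and assumes the parsed JWKS and the raw Cache-Control header value are passed in; `predeploy_problems`, `max_age_seconds`, and the `max_allowed_age` threshold are hypothetical names chosen for illustration.

```python
import re


def max_age_seconds(cache_control: str):
    """Extract max-age (in seconds) from a Cache-Control value, or None."""
    match = re.search(r"\bmax-age=(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


def predeploy_problems(jwks, old_kid, new_kid, cache_control, max_allowed_age):
    """Return a list of checklist violations; an empty list means go."""
    problems = []
    kids = {key.get("kid") for key in jwks.get("keys", [])}
    if old_kid == new_kid:
        problems.append("new kid must differ from the old kid")
    for label, kid in (("old", old_kid), ("new", new_kid)):
        if kid not in kids:
            problems.append(f"{label} kid {kid!r} is not published in JWKS")
    age = max_age_seconds(cache_control)
    if age is None:
        problems.append("Cache-Control has no max-age")
    elif age > max_allowed_age:
        problems.append(f"max-age={age}s exceeds {max_allowed_age}s")
    return problems
```

The gateway refresh interval and the alerting items still need a manual check, since they live outside the JWKS response.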
4.3 Alerts and validation
Recommended metrics:
- 401 rate (gateway + per service)
- “unknown kid” log rate
- JWKS fetch error rate / latency
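As an illustration of the first metric, a 401-rate alert can compare the current rate against a baseline. The sketch below is deliberately simple; the function names, the 5x factor, and the 1% floor are assumptions to be tuned for your traffic, not recommended values.

```python
def unauthorized_rate(status_counts: dict) -> float:
    """Fraction of responses that were 401, given counts per status code."""
    total = sum(status_counts.values())
    return status_counts.get(401, 0) / total if total else 0.0


def should_alert(current: float, baseline: float,
                 factor: float = 5.0, floor: float = 0.01) -> bool:
    """Alert when the 401 rate is above a floor and well above baseline."""
    return current >= floor and current >= factor * baseline
```

The floor keeps low-traffic noise from paging anyone, while the baseline comparison catches the sudden jump that a failed rotation produces.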
5) Runbook close-out: post-rotation cleanup
After rotation has stabilized:
- Verify that every token signed with the old key has expired (the maximum token TTL has elapsed since the last one was issued) before removing the old key
- Bring cache settings back to normal (very frequent fetches are unnecessary load too)
- Add the question “which cache layer prolonged it?” to the postmortem
Key rotation is not just a security task; it’s an operational continuity task. Safe organizations manage rotation not as a “do once and forget” task, but as a recurring change that is rehearsed.