Mustafa Erbay

Kubernetes Admission Webhook Timeouts: A Runbook for Frozen Deploys

Field runbook to rapidly triage hung deploys caused by Validating/Mutating webhook latency and apply a risk-controlled mitigation.

Kubernetes admission webhooks (Validating/Mutating) are very valuable for security and governance, but the production failure mode I encounter most often is this: a single webhook slows down and the cluster effectively becomes “non-deployable.” The reason is that the API Server calls every matching webhook synchronously inside the CREATE/UPDATE request path and waits for each response, up to that webhook's timeoutSeconds.

The goal of this runbook is simple: shrink “why is my deploy hanging?” down to minutes and apply a risk-controlled mitigation.

Symptoms (rapid diagnosis during an incident)

  • kubectl apply / helm upgrade runs unusually long or times out
  • Error from server (InternalError): ... calling webhook ... context deadline exceeded
  • API Server latency climbs and a wave of 5xx responses appears
  • Deploys succeed in some namespaces but fail in others (the webhook's rules and namespaceSelector/objectSelector match only some of them)

0) Safety note: “mitigation” = accepting risk

Bypassing webhooks (e.g. failurePolicy: Ignore) often “saves the service,” but it carries security/regulatory consequences.

1) Rapid triage: which webhook is the bottleneck?

A) Pull the webhook name from the error message

Timeout errors typically include ... calling webhook "<name>" .... That name corresponds to a webhook entry inside a ValidatingWebhookConfiguration or MutatingWebhookConfiguration.

kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
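
As a rough sketch, the two steps above can be chained: extract the name from the error text, then list every configuration together with its webhook names and filter for it. The helper names and the sample message are hypothetical; only kubectl, sed, and grep are assumed to be installed.

```shell
# Extract the quoted webhook name from an admission error message.
extract_webhook_name() {
  printf '%s\n' "$1" | sed -n 's/.*calling webhook "\([^"]*\)".*/\1/p'
}

# Print "<kind>/<config> <webhook names>" rows that mention the given name.
# Needs cluster access, so it is only defined here, not called.
find_webhook_config() {
  for kind in validatingwebhookconfigurations mutatingwebhookconfigurations; do
    kubectl get "$kind" --no-headers \
      -o custom-columns='CONFIG:.metadata.name,WEBHOOKS:.webhooks[*].name' \
      | grep -F -- "$1" | sed "s|^|$kind/|"
  done
}

# Hypothetical error text copied from a failed deploy:
msg='Error from server (InternalError): Internal error occurred: failed calling webhook "policy.example.com": context deadline exceeded'
extract_webhook_name "$msg"

# Against a real cluster:
#   find_webhook_config "$(extract_webhook_name "$msg")"
```

The output of find_webhook_config tells you which configuration object to open in the next step.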

B) Which service endpoint? (DNS/TLS/Network suspicion)

Open the webhook config and inspect the clientConfig.service field:

kubectl get validatingwebhookconfiguration <cfg> -o yaml | rg -n "clientConfig|service|url|caBundle|timeoutSeconds|failurePolicy"

Checklist:

  • Is the service name correct?
  • Is the namespace correct?
  • Is the caBundle current? (this is often broken after certificate rotation)
  • Is timeoutSeconds set sensibly? (the API accepts 1–30 seconds and defaults to 10 in admissionregistration.k8s.io/v1; very low values cause spurious timeouts, values near the maximum hold API requests open for longer)
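
A minimal sketch of the checklist as shell helpers, for cases where ripgrep is not at hand. The function names are hypothetical, and the first two helpers assume the webhook targets an in-cluster Service (clientConfig.service) rather than a url.

```shell
# Print one row per webhook entry: name, target service, timeout, failure policy.
# $1 is validatingwebhookconfiguration or mutatingwebhookconfiguration, $2 the config name.
webhook_fields() {
  kubectl get "$1" "$2" -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.clientConfig.service.namespace}{"/"}{.clientConfig.service.name}{"\t"}{.timeoutSeconds}{"\t"}{.failurePolicy}{"\n"}{end}'
}

# Confirm the target Service has ready endpoints behind it.
# $1 is the namespace, $2 the service name taken from the row above.
webhook_endpoints() {
  kubectl -n "$1" get endpoints "$2"
}

# timeoutSeconds must be an integer from 1 to 30.
timeout_in_range() {
  [ "$1" -ge 1 ] 2>/dev/null && [ "$1" -le 30 ]
}

# Against a real cluster (placeholders are yours to fill in):
#   webhook_fields validatingwebhookconfiguration <cfg>
#   webhook_endpoints <namespace> <service>
```

An empty ENDPOINTS column for the Service usually points at root cause 1 below; a populated one shifts suspicion to DNS, TLS, or network policy.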

2) Five most common root causes of “deploy lockup”

  1. Webhook pods are down/evicted (no PDB, node death, OOM)
  2. DNS issues (kube-dns/CoreDNS latency, upstream timeouts)
  3. TLS/CA bundle drift (cert was renewed, webhook config wasn’t updated)
  4. NetworkPolicy/egress rules cut off access to the webhook
  5. The webhook implementation itself is slow (CPU saturation, cold start, downstream dependency)
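
The tail of the error message often hints at which of these causes is in play. The following is a heuristic sketch only: the function name and category labels are hypothetical, and the patterns are common Go/Kubernetes error fragments rather than an exhaustive list.

```shell
# Map the error text to the most likely root-cause bucket from the list above.
classify_webhook_error() {
  case "$1" in
    *x509*)
      echo "tls-ca-drift" ;;        # cause 3: caBundle no longer matches the serving cert
    *"no such host"*)
      echo "dns" ;;                 # cause 2: service name did not resolve
    *"no endpoints available"*|*"connection refused"*)
      echo "pods-down" ;;           # cause 1: nothing is listening behind the Service
    *"context deadline exceeded"*|*"i/o timeout"*)
      echo "slow-or-blocked" ;;     # cause 4 or 5: dropped packets or a slow webhook
    *)
      echo "unknown" ;;
  esac
}

classify_webhook_error 'failed calling webhook "policy.example.com": x509: certificate signed by unknown authority'
```

"slow-or-blocked" is deliberately ambiguous: a NetworkPolicy that silently drops traffic and a saturated webhook both surface as a timeout, so that bucket still needs a look at pod metrics and policies.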

3) Mitigation options (lowest to highest risk)

Option 1: lower the timeout, fail fast (not the safest, but the most deterministic)

Goal: stop the API Server from blocking for minutes.

  • Set timeoutSeconds to a sensible value (e.g. 2–5s)
  • With failurePolicy: Fail, a slow webhook then rejects deploys within seconds instead of holding each request open, and the control plane gets to breathe.

This approach does not fix the service, but it makes the incident behavior predictable.
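
One way to apply this is a JSON patch against the webhook entry. This is a sketch under assumptions: timeout_patch is a hypothetical helper, the index 0 assumes the slow webhook is the first entry in the configuration, and 3 seconds is only an example value.

```shell
# Build a JSON patch that sets timeoutSeconds on one webhook entry.
# $1 is the zero-based index of the entry inside .webhooks, $2 the new timeout.
timeout_patch() {
  printf '[{"op":"replace","path":"/webhooks/%s/timeoutSeconds","value":%s}]\n' "$1" "$2"
}

timeout_patch 0 3

# Against a real cluster:
#   kubectl patch validatingwebhookconfiguration <cfg> --type=json -p "$(timeout_patch 0 3)"
```

If the configuration is managed by Helm or an operator, the change may be reverted on the next reconcile, so it is worth making the same change at the source.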

Option 2: exclude only specific namespaces/objects (precise, and usually ideal)

If the webhook can be scoped with namespaceSelector, objectSelector, or (on newer Kubernetes versions) matchConditions:

  • Disable only the problematic check
  • For instance, exclude kube-system or namespaces labeled team=platform

This narrows the blast radius instead of bypassing everything.
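
A sketch of a namespace exclusion, assuming the automatic kubernetes.io/metadata.name namespace label is available on your cluster. The helper name is hypothetical, and note that the patch overwrites whatever namespaceSelector the entry already has, so merge by hand if one exists.

```shell
# Build a JSON patch that excludes the given namespaces from one webhook entry.
# $1 is the zero-based index inside .webhooks; remaining arguments are namespace names.
exclude_namespaces_patch() {
  index=$1
  shift
  values=""
  for ns in "$@"; do
    values="${values:+$values,}\"$ns\""
  done
  printf '[{"op":"add","path":"/webhooks/%s/namespaceSelector","value":{"matchExpressions":[{"key":"kubernetes.io/metadata.name","operator":"NotIn","values":[%s]}]}}]\n' "$index" "$values"
}

exclude_namespaces_patch 0 kube-system

# Against a real cluster:
#   kubectl patch validatingwebhookconfiguration <cfg> --type=json \
#     -p "$(exclude_namespaces_patch 0 kube-system)"
```

To exclude by a team label instead, the same shape works with a different key, for example team with operator NotIn and values ["platform"].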

Option 3: temporary failurePolicy: Ignore (highest operational speed, highest risk)

This option unblocks deploys but removes the guardrail. Before applying it:

  • For how long? (e.g. 30–60 min)
  • Who approved it? (IC + security owner)
  • What gets monitored? (audit/alert)
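
A sketch of a time-boxed bypass that keeps the rollback symmetric: the same helper produces both the Ignore patch and the Fail patch that undoes it. The helper name and the example.com/... annotation keys are hypothetical; use whatever convention your team already has for recording approvals.

```shell
# Build a JSON patch that sets failurePolicy on one webhook entry.
# $1 is the zero-based index inside .webhooks, $2 must be Ignore or Fail.
failure_policy_patch() {
  case "$2" in
    Ignore|Fail) ;;
    *) echo "failurePolicy must be Ignore or Fail" >&2; return 1 ;;
  esac
  printf '[{"op":"replace","path":"/webhooks/%s/failurePolicy","value":"%s"}]\n' "$1" "$2"
}

failure_policy_patch 0 Ignore

# Against a real cluster:
#   kubectl patch validatingwebhookconfiguration <cfg> --type=json -p "$(failure_policy_patch 0 Ignore)"
#   kubectl annotate validatingwebhookconfiguration <cfg> --overwrite \
#     example.com/bypass-approved-by="<IC + security owner>" \
#     example.com/bypass-until="<timestamp>"
# Rollback once the webhook is healthy again:
#   kubectl patch validatingwebhookconfiguration <cfg> --type=json -p "$(failure_policy_patch 0 Fail)"
```

Recording the approver and the expiry on the object itself means the answers to the three questions above travel with the change rather than living only in chat.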

4) Remediation: make the webhook production-grade (the lasting fix)

A) Availability

  • At least 2 replicas, plus PDB and HPA
  • Pod anti-affinity or topology spread constraints (don’t pile the replicas onto one node)
  • Correct liveness/readiness probes (the Service should not route API Server calls to an “unready” pod)
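
A minimal sketch of the availability basics with plain kubectl. The function names are hypothetical, and the namespace, deployment name, and label selector you pass in are specific to your webhook.

```shell
# Scale to two replicas and add a PodDisruptionBudget that keeps one pod up.
# $1 namespace, $2 deployment name, $3 label selector (e.g. app=my-webhook).
ensure_webhook_availability() {
  kubectl -n "$1" scale deployment "$2" --replicas=2
  kubectl -n "$1" create poddisruptionbudget "$2-pdb" --selector="$3" --min-available=1
}

# Print the node name of every webhook pod, space-separated.
webhook_pod_nodes() {
  kubectl -n "$1" get pods -l "$2" -o jsonpath='{.items[*].spec.nodeName}'
}

# Count distinct node names read from stdin.
distinct_nodes() {
  tr -s ' ' '\n' | sort -u | grep -c .
}

# Against a real cluster:
#   ensure_webhook_availability <namespace> <deployment> <selector>
#   webhook_pod_nodes <namespace> <selector> | distinct_nodes
```

A result of 1 from distinct_nodes with two or more replicas means a single node failure still takes the whole webhook down.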

B) Performance

  • Reduce the webhook’s downstream dependencies (cache/timeout any DB or remote calls)
  • Measure cold start (especially for big Java/Go images)

C) TLS/CA management

  • Automate certificate rotation
  • Don’t let caBundle updates drift
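
A sketch for spotting drift by hand, assuming openssl and base64 are installed. The helper names are hypothetical, and the Secret name and the ca.crt key in the usage comment are assumptions about how your serving certificate is stored.

```shell
# Print subject and expiry of the first certificate in a webhook's caBundle.
# $1 kind, $2 config name, $3 zero-based index inside .webhooks (default 0).
ca_bundle_expiry() {
  kubectl get "$1" "$2" -o jsonpath="{.webhooks[${3:-0}].clientConfig.caBundle}" \
    | base64 --decode | openssl x509 -noout -subject -enddate
}

# Compare two base64-encoded bundles, ignoring whitespace and line wrapping.
# Returns success only if both are non-empty and identical.
same_bundle() {
  a=$(printf '%s' "$1" | tr -d '[:space:]')
  b=$(printf '%s' "$2" | tr -d '[:space:]')
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Against a real cluster:
#   cfg_ca=$(kubectl get validatingwebhookconfiguration <cfg> -o jsonpath='{.webhooks[0].clientConfig.caBundle}')
#   svc_ca=$(kubectl -n <namespace> get secret <tls-secret> -o jsonpath='{.data.ca\.crt}')
#   same_bundle "$cfg_ca" "$svc_ca" || echo "caBundle drift detected"
```

If cert-manager issues the certificate, its CA injector can keep caBundle in sync through the cert-manager.io/inject-ca-from annotation, which removes this manual check.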

D) Observability

  • Webhook latency histogram
  • Error rate + timeout count
  • Admission latency on the API Server side (trace it if possible)
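
For a quick look without a dashboard, the API Server's own metrics endpoint can be summarized. This sketch assumes your cluster exposes the apiserver_admission_webhook_admission_duration_seconds histogram with a name label, and that you are allowed to read /metrics; the helper name and the sample lines in the usage note are hypothetical.

```shell
# Read Prometheus text format on stdin and print "<webhook> <average seconds>"
# computed from the histogram's _sum and _count series.
webhook_avg_latency() {
  awk '
    /^apiserver_admission_webhook_admission_duration_seconds_(sum|count)[{]/ {
      if (match($0, /name="[^"]*"/) == 0) next
      name = substr($0, RSTART + 6, RLENGTH - 7)
      if ($0 ~ /_seconds_sum[{]/) sum[name] += $NF
      else cnt[name] += $NF
    }
    END {
      for (n in cnt) if (cnt[n] > 0) printf "%s %.3f\n", n, sum[n] / cnt[n]
    }'
}

# Against a real cluster:
#   kubectl get --raw /metrics | webhook_avg_latency
```

The figure is an average since the API Server started, so it smooths over spikes; for incident work a percentile over a recent window from your metrics backend is the better signal.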

5) “Deploy lockup” runbook flow (summary)

  1. Pull the webhook name from the error message
  2. Validate endpoint/DNS/TLS via clientConfig
  3. Pick the lowest-risk mitigation (timeout/selector → Ignore as the last resort)
  4. Fix the root cause in webhook availability, performance, or TLS
  5. Roll back the mitigation and put guardrails in place to prevent a repeat (PDB/HPA/alarms)

Conclusion

Admission webhooks are a security gate, but they are also part of delivery. That means you can’t run them with a “set it and forget it” mindset; treat them as a service with its own SLO. A solid runbook removes ambiguity during an incident: it states who owns which decision, which mitigation carries which risk, and what the rollback path looks like.


Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security and Software

I have been working since 2006 on systems architecture, networking, server infrastructure, large-scale deployments, software, and systems security. On this blog I share technical experience grounded in real field work.
