Kubernetes admission webhooks (Validating/Mutating) are very valuable for security and governance, but in production the failure mode I encounter most often is this: a single webhook slows down and the cluster effectively becomes “non-deployable.” The reason is that the API Server waits for a webhook response inside the CREATE/UPDATE flow.
The goal of this runbook is simple: shrink “why is my deploy hanging?” down to minutes and apply a risk-controlled mitigation.
Symptoms (rapid diagnosis during an incident)
- kubectl apply / helm upgrade runs unusually long or times out
- Error from server (InternalError): ... calling webhook ... context deadline exceeded
- API Server latency climbs and a wave of 5xx responses appears
- Deploys succeed in some namespaces but fail in others (match/selectors)
0) Safety note: “mitigation” = accepting risk
Bypassing webhooks (e.g. failurePolicy: Ignore) often “saves the service,” but it carries security/regulatory consequences.
1) Rapid triage: which webhook is the bottleneck?
A) Pull the webhook name from the error message
Timeout errors typically include ... calling webhook "<name>" .... That name corresponds to a webhook entry inside a ValidatingWebhookConfiguration or MutatingWebhookConfiguration.
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
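To avoid eyeballing two long lists mid-incident, the lookup can be scripted. The following is a minimal sketch, assuming the error text contains the usual calling webhook "<name>" fragment and that you feed it the parsed output of kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o json. The helper names and the sample webhook name are hypothetical.

```python
import re

# Matches the quoted name in: ... calling webhook "<name>" ...
WEBHOOK_NAME_RE = re.compile(r'calling webhook "([^"]+)"')


def webhook_name_from_error(message):
    """Return the webhook name quoted in an admission error, or None."""
    match = WEBHOOK_NAME_RE.search(message)
    return match.group(1) if match else None


def find_webhook(config_list, name):
    """Return (kind, config name, index) for every webhook entry called `name`.

    `config_list` is the parsed JSON List object printed by kubectl.
    """
    hits = []
    for cfg in config_list.get("items", []):
        for index, hook in enumerate(cfg.get("webhooks", [])):
            if hook.get("name") == name:
                hits.append((cfg.get("kind"), cfg["metadata"]["name"], index))
    return hits
```

The returned index is the position of the entry inside the configuration's webhooks list, which is what a later kubectl patch needs.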
B) Which service endpoint? (DNS/TLS/Network suspicion)
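Open the webhook config and inspect the clientConfig.service field (swap in mutatingwebhookconfiguration for a mutating webhook; grep -nE works if rg is not installed):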
kubectl get validatingwebhookconfiguration <cfg> -o yaml | rg -n "clientConfig|service|url|caBundle|timeoutSeconds|failurePolicy"
Checklist:
- Is the service name correct?
- Is the namespace correct?
- Is the caBundle current? (this is often broken after certificate rotation)
- Is timeoutSeconds set sensibly? (extremely low values cause false positives, very high values lock up the control plane)
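The checklist can be turned into a quick automated pass over one webhook entry. This is a sketch, not a complete linter: the low_s and high_s thresholds are my own assumptions, and the function only sees the configuration, so it cannot tell whether the referenced service actually exists.

```python
def audit_webhook(hook, low_s=1, high_s=15):
    """Return human-readable findings for one entry of a webhooks list."""
    findings = []
    client = hook.get("clientConfig", {})
    service = client.get("service")

    if service is None and "url" not in client:
        findings.append("clientConfig has neither service nor url")
    if service is not None:
        for key in ("name", "namespace"):
            if not service.get(key):
                findings.append(f"clientConfig.service.{key} is empty")

    if not client.get("caBundle"):
        # Without a caBundle the API server uses its system trust roots.
        findings.append("caBundle is empty")

    timeout = hook.get("timeoutSeconds", 10)  # v1 default is 10 seconds
    if timeout <= low_s:
        findings.append(f"timeoutSeconds={timeout} is very low; expect spurious timeouts")
    elif timeout >= high_s:
        findings.append(f"timeoutSeconds={timeout} is high; slow webhooks hold requests open")
    return findings
```

An empty list means none of these specific checks fired; it does not prove the webhook is healthy.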
2) Five most common root causes of “deploy lockup”
- Webhook pods are down/evicted (no PDB, node death, OOM)
- DNS issues (kube-dns/CoreDNS latency, upstream timeouts)
- TLS/CA bundle drift (cert was renewed, webhook config wasn’t updated)
- NetworkPolicy/egress rules cut off access to the webhook
- The webhook implementation itself is slow (CPU saturation, cold start, downstream dependency)
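The error text often hints at which of these causes to check first. The mapping below is a rough heuristic I use as a starting point, not an authoritative classification: the substrings are common Go and API server error fragments, and a plain context deadline exceeded is compatible with several causes at once.

```python
# Ordered from most specific to least specific; first match wins.
ROOT_CAUSE_HINTS = [
    ("no endpoints available for service", "webhook pods down or unready"),
    ("connection refused", "webhook pods down or unready"),
    ("no such host", "DNS resolution failure"),
    ("x509:", "TLS/CA bundle drift"),
    ("i/o timeout", "network path blocked or pods unreachable"),
    ("context deadline exceeded", "timeout: slow webhook, DNS latency, or blocked network path"),
]


def guess_root_cause(message):
    """Return a first-guess root cause for an admission error message."""
    lowered = message.lower()
    for needle, cause in ROOT_CAUSE_HINTS:
        if needle in lowered:
            return cause
    return "unknown"
```

Treat the result as the first thing to verify, then confirm against pod status, DNS, and the webhook's own logs.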
3) Mitigation options (lowest to highest risk)
Option 1 — drop the timeout, fail fast (not the safest, but the most deterministic)
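Goal: stop requests from blocking on a slow webhook. Each webhook call can hold a request for up to timeoutSeconds (default 10s, maximum 30s in admissionregistration.k8s.io/v1), and client retries stack on top of that.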
- timeoutSeconds: a sensible value (e.g. 2–5s)
- When the webhook is slow, fast failure rejects deploys quickly and the control plane gets to breathe.
This approach does not fix the service, but it makes the incident behavior predictable.
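One way to apply the lower timeout without hand-editing YAML is a JSON Patch. The sketch below builds that patch from the parsed configuration; the function name is hypothetical. It includes a test operation so the patch is rejected if the webhook at that index is no longer the one you meant.

```python
import json


def timeout_patch(config, webhook_name, seconds):
    """Build a JSON Patch that sets timeoutSeconds on one named webhook."""
    if not 1 <= seconds <= 30:
        raise ValueError("timeoutSeconds must be between 1 and 30")
    for index, hook in enumerate(config.get("webhooks", [])):
        if hook.get("name") == webhook_name:
            base = f"/webhooks/{index}"
            return [
                # Guard: fail the patch if the entry at this index changed.
                {"op": "test", "path": f"{base}/name", "value": webhook_name},
                # "add" on an existing object member replaces its value.
                {"op": "add", "path": f"{base}/timeoutSeconds", "value": seconds},
            ]
    raise KeyError(webhook_name)


def as_kubectl_argument(patch):
    """Serialize the patch for: kubectl patch <kind> <cfg> --type=json -p '<arg>'."""
    return json.dumps(patch)
```

Save the current value before patching so the change can be reverted once the webhook is healthy again.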
Option 2 — exclude only specific namespaces/objects (delicate but ideal)
If the webhook configuration can be scoped with namespaceSelector, objectSelector, or (on clusters that support it) matchConditions:
- Disable only the problematic check
- For instance, exclude kube-system or specific team=platform namespaces
This model narrows the blast radius instead of “bypass everything.”
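As an illustration of a narrow exclusion, the sketch below extends a webhook's namespaceSelector so that named namespaces are skipped. It assumes the namespaces carry the kubernetes.io/metadata.name label that recent Kubernetes versions set automatically; for a label-based exclusion such as team=platform you would use that key and value instead. The function name is hypothetical.

```python
import copy

NAME_LABEL = "kubernetes.io/metadata.name"


def exclude_namespaces(hook, namespaces):
    """Return a new namespaceSelector that also skips `namespaces`.

    The input webhook entry is left unmodified.
    """
    selector = copy.deepcopy(hook.get("namespaceSelector") or {})
    expressions = selector.setdefault("matchExpressions", [])
    for expr in expressions:
        # Reuse an existing name-based NotIn rule if there is one.
        if expr.get("key") == NAME_LABEL and expr.get("operator") == "NotIn":
            expr["values"] = sorted(set(expr.get("values", [])) | set(namespaces))
            return selector
    expressions.append(
        {"key": NAME_LABEL, "operator": "NotIn", "values": sorted(set(namespaces))}
    )
    return selector
```

Existing matchLabels and other expressions are preserved, so the exclusion adds to the current scoping rather than replacing it.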
Option 3 — temporary failurePolicy: Ignore (highest operational speed, highest risk)
This option unblocks deploys but removes the guardrail. Before applying it:
- For how long? (e.g. 30–60 min)
- Who approved it? (IC + security owner)
- What gets monitored? (audit/alert)
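The three questions above are easier to enforce when the override and its rollback are written down together. The following sketch produces both patches plus an expiry timestamp; the record layout and function name are my own assumptions, not a standard format.

```python
from datetime import datetime, timedelta, timezone


def ignore_override(config, webhook_name, approved_by, minutes, now=None):
    """Plan a time-boxed failurePolicy: Ignore override and its rollback."""
    now = now or datetime.now(timezone.utc)
    for index, hook in enumerate(config.get("webhooks", [])):
        if hook.get("name") == webhook_name:
            base = f"/webhooks/{index}"
            guard = {"op": "test", "path": f"{base}/name", "value": webhook_name}
            original = hook.get("failurePolicy", "Fail")  # v1 default is Fail
            return {
                "apply": [guard, {"op": "add", "path": f"{base}/failurePolicy", "value": "Ignore"}],
                "rollback": [guard, {"op": "add", "path": f"{base}/failurePolicy", "value": original}],
                "approved_by": list(approved_by),
                "expires_at": (now + timedelta(minutes=minutes)).isoformat(),
            }
    raise KeyError(webhook_name)
```

Attaching this record to the incident ticket gives the rollback step a concrete patch and a deadline instead of a vague reminder.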
4) Remediation: make the webhook production-grade (the lasting fix)
A) Availability
- At least 2 replicas, plus PDB and HPA
- Pod anti-affinity, plus node affinity where needed (don’t pile them onto one node)
- Correct liveness/readiness (the API Server shouldn’t reach an “unready” endpoint)
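Some of these availability points can be checked mechanically against the webhook's Deployment. This sketch inspects the parsed output of kubectl get deployment <name> -o json; it does not look at PDB or HPA objects, which live in separate resources, and the function name is hypothetical.

```python
def availability_findings(deployment):
    """Return availability gaps visible in a Deployment manifest."""
    findings = []
    spec = deployment.get("spec", {})
    if spec.get("replicas", 1) < 2:  # Deployments default to 1 replica
        findings.append("fewer than 2 replicas")

    pod_spec = spec.get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        if "readinessProbe" not in container:
            findings.append(f"container {container.get('name')} has no readinessProbe")

    if "affinity" not in pod_spec and "topologySpreadConstraints" not in pod_spec:
        findings.append("no affinity or topologySpreadConstraints; replicas may share a node")
    return findings
```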
B) Performance
- Reduce the webhook’s downstream dependencies (cache/timeout any DB or remote calls)
- Measure cold start (especially for big Java/Go images)
C) TLS/CA management
- Automate certificate rotation
- Don’t let caBundle updates drift
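Drift can be detected by comparing the caBundle in the webhook configuration with the CA stored next to the serving certificate. The sketch below assumes the CA lives in a Secret under the key ca.crt, which is common but not universal, so treat both the key and the function name as assumptions. Both values are base64-encoded in kubectl's JSON output.

```python
import base64


def _decode_pem(b64_text):
    """Decode base64 and normalize line endings and surrounding whitespace."""
    return base64.b64decode(b64_text).replace(b"\r\n", b"\n").strip()


def ca_bundle_matches(hook, secret, key="ca.crt"):
    """Return True if the webhook's caBundle equals the CA in the Secret."""
    bundle = hook.get("clientConfig", {}).get("caBundle", "")
    secret_ca = secret.get("data", {}).get(key, "")
    if not bundle or not secret_ca:
        return False
    return _decode_pem(bundle) == _decode_pem(secret_ca)
```

This is a strict equality check: a bundle that deliberately contains several CAs during a rotation window will report a mismatch and needs a closer look.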
D) Observability
- Webhook latency histogram
- Error rate + timeout count
- Admission latency on the API Server side (trace it if possible)
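A latency histogram only helps if a percentile can be read off it and compared with timeoutSeconds. The sketch below estimates a quantile from cumulative buckets using linear interpolation inside the matching bucket, in the spirit of Prometheus' histogram_quantile. The input shape is my own assumption; on the API server side, the metric I would expect to feed it is apiserver_admission_webhook_admission_duration_seconds.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) sorted by
    bound, ending with a float("inf") bucket that holds the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # No upper bound to interpolate to; report the last finite one.
                return prev_bound
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound
```

If the estimated p99 sits close to the configured timeoutSeconds, timeouts under load are expected rather than surprising.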
5) “Deploy lockup” runbook flow (summary)
- Pull the webhook name from the error message
- Validate endpoint/DNS/TLS via clientConfig
- Pick the lowest-risk mitigation (timeout/selector → Ignore as the last resort)
- Fix the root cause around webhook availability/perf/tls
- Roll back the mitigation and put guardrails in place to prevent a repeat (PDB/HPA/alarms)
Conclusion
Admission webhooks are a security gate, but they are also part of delivery. That means you can’t run them with a “set it and forget it” mindset; treat them as a service with its own SLO. A solid runbook removes ambiguity during an incident: it states who owns which decision, which mitigation carries which risk, and what the rollback path looks like.