When it comes to “node OS patching” on Kubernetes, teams usually swing to one of two extremes — either delaying it indefinitely or rolling it out aggressively and triggering an incident. Both are wrong. The robust path is to break the patch into maintenance waves, contain the blast radius, and keep rollback simple.
Pre-requisites: before any patch starts
Before I touch a patch process, I insist on these three items:
- Are PodDisruptionBudgets (PDBs) defined for the critical services?
- Are readiness/liveness probes meaningful? (not just a “/healthz 200”)
- Is there enough capacity headroom? (the cluster must keep serving with at least one wave’s worth of nodes out of service)
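A rough pre-flight sketch for the first and third items. The function names `blocking_pdbs` and `has_headroom` are my own, not kubectl features, and the node counts passed to `has_headroom` are numbers you supply from your own capacity planning. The first function lists PDBs that currently allow zero disruptions (those will stall a drain); the second is plain arithmetic on whether the fleet can lose a wave.

```shell
# List PDBs that currently allow zero voluntary disruptions.
# A drain that needs to evict a pod covered by one of these will block.
blocking_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.disruptionsAllowed}{"\n"}{end}' \
    | awk '$2 == 0 { print $1 }'
}

# Succeeds if the fleet still has at least `needed` nodes after losing a wave.
# All three numbers are inputs you provide; nothing is queried from the cluster.
has_headroom() {
  local total="$1" wave_size="$2" needed="$3"
  [ "$(( total - wave_size ))" -ge "$needed" ]
}
```

For example, `has_headroom 20 2 15` succeeds, while `has_headroom 20 6 15` fails because only 14 nodes would remain. Any output from `blocking_pdbs` is worth resolving before the first drain.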
Design: what is a wave?
A wave is a group of nodes that go into patching together. Good wave design:
- balances by AZ/zone
- avoids hitting all pods of the same service at the same time
- makes “which wave were we in?” easy to answer during a rollback
A practical example:
- Wave 0 (canary): 1–2 nodes
- Wave 1: 10% of the fleet
- Wave 2: 25% of the fleet
- Wave 3: the rest
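To turn those percentages into node counts, a small arithmetic sketch can help. `wave_sizes` is a hypothetical helper, the canary size defaults to 2 as an assumption, and integer division rounds down, so check the numbers for small fleets by hand.

```shell
# Print the size of each wave for a fleet of `total` nodes:
# canary, 10% of the fleet, 25% of the fleet, then the remainder.
wave_sizes() {
  local total="$1" canary="${2:-2}"
  local w1="$(( total * 10 / 100 ))"
  local w2="$(( total * 25 / 100 ))"
  local rest="$(( total - canary - w1 - w2 ))"
  # A negative remainder means the fleet is too small for this split.
  [ "$rest" -ge 0 ] || return 1
  echo "$canary $w1 $w2 $rest"
}
```

Calling `wave_sizes 100` prints `2 10 25 63`, and `wave_sizes 40 1` prints `1 4 10 25`.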
Implementation: tag nodes by wave
A simple label:
kubectl label node worker-01 maintenance.wave=0
kubectl label node worker-02 maintenance.wave=1
kubectl label node worker-03 maintenance.wave=1
Driving the labels through GitOps is even better, but doing it manually is fine as a first step.
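If the wave assignment lives in a plain text file, the manual step can be scripted. This is a sketch under my own assumptions: the `node wave` line format and the function names `label_waves` and `nodes_in_wave` are illustrative, not an established convention.

```shell
# Read "node wave" pairs from stdin and apply the maintenance.wave label.
# --overwrite lets a node be moved to a different wave later.
label_waves() {
  while read -r node wave; do
    # Skip blank lines and comment lines.
    case "$node" in ''|'#'*) continue ;; esac
    # Redirect stdin so kubectl cannot consume the rest of the list.
    kubectl label node "$node" "maintenance.wave=$wave" --overwrite </dev/null
  done
}

# List the nodes that belong to a given wave.
nodes_in_wave() {
  kubectl get nodes -l "maintenance.wave=$1" -o name
}
```

Usage would look like `label_waves < waves.txt`, followed by `nodes_in_wave 1` to confirm the assignment.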
Operational flow: drain → patch → uncordon
The minimum flow per node:
- cordon
- drain (with safe parameters)
- OS patch / reboot
- uncordon once the node is ready
- Validate the service-level signals
A drain example:
kubectl cordon worker-01
kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
Then patch/reboot, and:
kubectl uncordon worker-01
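The whole per-node flow can be wrapped in one function so every node gets the same treatment. This is a minimal sketch: `patch_node` and `run_os_patch` are hypothetical names, `run_os_patch` is a placeholder you must replace with your own patch/reboot mechanism, and the 15-minute wait is an assumed value.

```shell
# Placeholder hook: replace with your real patch/reboot step.
# It fails by default so an unpatched node is never uncordoned by accident.
run_os_patch() {
  echo "run_os_patch: not implemented for $1" >&2
  return 1
}

# cordon -> drain -> patch -> wait for Ready -> uncordon.
# Any failing step stops the flow and leaves the node cordoned for inspection.
patch_node() {
  local node="$1"
  kubectl cordon "$node" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --grace-period=60 --timeout=10m || return 1
  run_os_patch "$node" || return 1
  kubectl wait --for=condition=Ready "node/$node" --timeout=15m || return 1
  kubectl uncordon "$node"
}
```

One caveat: right after a reboot is triggered, the node can still report Ready until its status catches up, so `run_os_patch` should only return once the reboot has actually happened.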
Parallelism: how many nodes at once?
My rule of thumb is short:
- 1 node at a time in the canary wave
- After that: at most 1 node per AZ at a time (or fewer, depending on how critical the service is)
The mistake is deciding from a single metric like “20% of total nodes.” Zone distribution and per-service replica count carry far more weight.
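The “1 per AZ” rule can be sketched as a filter that picks the first node from each zone. I am assuming nodes carry the standard `topology.kubernetes.io/zone` label; `wave_nodes_with_zone` and `next_batch` are hypothetical helper names.

```shell
# Print "node zone" for every node in the given wave.
wave_nodes_with_zone() {
  kubectl get nodes -l "maintenance.wave=$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}'
}

# From "node zone" lines on stdin, keep only the first node seen per zone.
# Lines without a zone are skipped, so unlabeled nodes need separate handling.
next_batch() {
  awk 'NF == 2 && !seen[$2]++ { print $1 }'
}
```

Running `wave_nodes_with_zone 1 | next_batch` gives one batch; nodes that are already patched would have to be filtered out before the next call.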
Rollback plan
What makes patching safe is not “the absence of problems”; it is knowing exactly what to do if a problem hits:
- If p95/p99 latency or the 5xx rate degrades after Wave 0, do not advance to Wave 1
- For kernel/driver issues, halt the affected wave and revert the image quickly
- Have a ready-to-run step for restoring nodes to a “known-good” AMI/image
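The “do not advance” rule is easier to follow under pressure when it is an explicit gate. Below is a sketch that compares a post-wave 5xx rate with the pre-patch baseline. `should_advance` is a hypothetical helper, the default tolerance of 0.5 percentage points is an arbitrary assumption, and how you obtain the two rates from your monitoring system is left out.

```shell
# Succeeds only if the current 5xx rate (in percent) is within `tolerance`
# percentage points of the baseline measured before patching.
should_advance() {
  local baseline="$1" current="$2" tolerance="${3:-0.5}"
  awk -v b="$baseline" -v c="$current" -v t="$tolerance" \
    'BEGIN { exit !(c <= b + t) }'
}
```

With a baseline of 0.2%, `should_advance 0.2 0.3` succeeds and `should_advance 0.2 1.5` fails, which is the signal to stop before Wave 1. The same shape works for p95/p99 latency with a different tolerance.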
Minimum runbook checklist
- Waves and parallelism are defined
- PDBs and probes are validated
- Canary wave ready, on-call ready
- Drain commands standardized
- Rollback step is explicit
- Post-patch validation (SLO + business metric)
With this discipline, node patching stops being “risky overnight work” and turns into a planned, measurable, repeatable operation.