Designing Maintenance Waves for Kubernetes Node OS Patching

Teams tend to handle node OS patching on Kubernetes in one of two ways: they postpone it indefinitely, or they roll it out aggressively and trigger an incident. Both are wrong. The robust path is to split the patch into maintenance waves, contain the blast radius, and keep rollback simple.

Pre-requisites: before any patch starts

Before I touch a patch process, I insist on these three items:

Are PodDisruptionBudgets (PDBs) defined for the critical services?
Are readiness/liveness probes meaningful? (not just a /healthz endpoint that always returns 200)
Is there enough capacity headroom? (the cluster can lose at least one wave’s worth of nodes and still run the workload)
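
Once PDBs exist, it helps to see which ones currently allow zero disruptions, because a drain will wait on those. Here is a minimal sketch: the function names `list_pdb_budgets` and `blocked_pdbs` are my own, and it assumes `kubectl` access and PDBs that expose `status.disruptionsAllowed` (the `policy/v1` field).

```shell
# Reads "namespace/name allowed" lines on stdin and prints the PDBs
# whose allowed-disruptions count is 0.
blocked_pdbs() {
  awk '$2 == 0 { print $1 }'
}

# Hypothetical helper: emits one "namespace/name allowed" line per PDB.
# Needs a live cluster, so it is only defined here, not called.
list_pdb_budgets() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.disruptionsAllowed}{"\n"}{end}'
}

# Usage against a cluster:
#   list_pdb_budgets | blocked_pdbs
```

An empty result means no PDB is blocking right now; it does not prove that every critical service has a PDB, so that part still needs a manual review.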

Design: what is a wave?

A wave is a group of nodes that go into patching together. Good wave design:

balances by AZ/zone
avoids hitting all pods of the same service at the same time
makes “which wave were we in?” easy to answer during a rollback
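
One way to sanity-check zone balance before touching anything is to write the plan down and count nodes per wave and zone. This is a sketch under my own assumptions: the plan is a plain text file with one `node zone wave` triple per line, and the function name and zone names are made up for illustration.

```shell
# Reads "node zone wave" lines on stdin and prints "wave zone count",
# sorted, so a lopsided wave is easy to spot.
wave_zone_counts() {
  awk 'NF >= 3 { n[$3 " " $2]++ } END { for (k in n) print k, n[k] }' | sort
}

# Usage with a hypothetical plan file:
#   wave_zone_counts < waves.txt
```

If a wave shows all of its nodes in one zone, I would rebalance the plan before labeling anything.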

A practical example:

Wave 0 (canary): 1–2 nodes
Wave 1: 10% of the fleet
Wave 2: 25% of the fleet
Wave 3: the rest
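
To turn those percentages into node counts, a small helper is enough. This is only a sketch: `wave_sizes` is a hypothetical name, the canary defaults to 2 nodes, percentages are taken of the whole fleet and rounded up, and each wave is capped at whatever is left. Adjust those choices to your own fleet.

```shell
# Usage: wave_sizes <total_nodes> [canary_nodes]
# Prints four numbers: canary, wave 1 (10%), wave 2 (25%), wave 3 (rest).
# Runs in a subshell so its variables do not leak into the caller.
wave_sizes() (
  total=$1
  canary=${2:-2}
  if [ "$canary" -gt "$total" ]; then canary=$total; fi
  remaining=$((total - canary))

  w1=$(( (total * 10 + 99) / 100 ))   # 10% of the fleet, rounded up
  if [ "$w1" -gt "$remaining" ]; then w1=$remaining; fi
  remaining=$((remaining - w1))

  w2=$(( (total * 25 + 99) / 100 ))   # 25% of the fleet, rounded up
  if [ "$w2" -gt "$remaining" ]; then w2=$remaining; fi
  remaining=$((remaining - w2))

  echo "$canary $w1 $w2 $remaining"
)
```

For a 40-node fleet, `wave_sizes 40` prints `2 4 10 24`.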

Implementation: tag nodes by wave

A simple label:

kubectl label node worker-01 maintenance.wave=0
kubectl label node worker-02 maintenance.wave=1
kubectl label node worker-03 maintenance.wave=1

Driving the labels through GitOps is even better, but doing it manually is fine as a first step.
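
If the wave plan lives in a file, the label commands can be generated instead of typed. A minimal sketch, assuming a hypothetical plan format of one `node zone wave` triple per line; `label_commands` is my own name, and it only prints the commands so they can be reviewed before running.

```shell
# Reads "node zone wave" lines on stdin and prints one kubectl command
# per node. --overwrite lets the plan be re-applied when a node moves
# to a different wave.
label_commands() {
  awk 'NF >= 3 { printf "kubectl label node %s maintenance.wave=%s --overwrite\n", $1, $3 }'
}

# Review, then apply:
#   label_commands < waves.txt
#   label_commands < waves.txt | sh
# Afterwards, list a wave by its label:
#   kubectl get nodes -l maintenance.wave=0
```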

Operational flow: drain → patch → uncordon

The minimum flow per node:

cordon
drain (with safe parameters)
OS patch / reboot
uncordon once the node is ready
validate the service-level signals

A drain example:

kubectl cordon worker-01
kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m

Then patch/reboot, and:

kubectl uncordon worker-01
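
Put together, the per-node flow could be wrapped in one function that stops at the first failing step. Treat this as a sketch: `roll_node` and `patch_node` are hypothetical names, `patch_node` is a placeholder for your own patch-and-reboot mechanism (assumed to return only after the reboot has finished), and the 15-minute readiness timeout is my assumption.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to dry-run the flow.
KUBECTL=${KUBECTL:-kubectl}

# Placeholder: replace with the real patch-and-reboot step for your OS.
patch_node() {
  echo "patch-and-reboot $1 (placeholder)"
}

# Usage: roll_node <node-name>
# Each step runs only if the previous one succeeded, so a failed drain
# never leads to a reboot and a node that never becomes Ready stays cordoned.
roll_node() {
  node=$1
  $KUBECTL cordon "$node" &&
    $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data \
      --grace-period=60 --timeout=10m &&
    patch_node "$node" &&
    $KUBECTL wait --for=condition=Ready "node/$node" --timeout=15m &&
    $KUBECTL uncordon "$node"
}
```

Validating the service-level signals is deliberately left outside the function, since it depends on your monitoring stack.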

Parallelism: how many nodes at once?

My rule of thumb is short:

1 in the canary wave
After that: 1 per AZ (or fewer, depending on how critical the service is)

The mistake is to set parallelism from a single number such as “20% of total nodes.” Zone distribution and per-service replica counts matter far more.
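
The “1 per AZ” rule can be expressed as a filter that picks the next batch from a wave. A sketch, assuming nodes carry the well-known `topology.kubernetes.io/zone` label and the `maintenance.wave` label from earlier; the function names are mine, and it does not track which nodes are already patched.

```shell
# Reads "node zone" lines on stdin and keeps the first node seen in
# each zone, giving a batch with at most one node per zone.
one_per_zone() {
  awk 'NF >= 2 && !seen[$2]++ { print $1 }'
}

# Hypothetical live feed: the nodes of one wave with their zone.
# Needs a live cluster, so it is only defined here, not called.
wave_nodes_with_zone() {
  kubectl get nodes -l "maintenance.wave=$1" --no-headers \
    -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
}

# Usage against a cluster:
#   wave_nodes_with_zone 1 | one_per_zone
```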

Rollback plan

What makes patching safe is not the absence of problems; it is knowing exactly what to do when one hits:

If p95/p99 latency or the 5xx rate degrades after Wave 0, do not advance to Wave 1
For kernel/driver issues, halt that same wave and revert the image fast
Have a ready-to-run step for restoring nodes to a “known-good” AMI/image
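
The “do not advance” rule is easier to follow under pressure when it is a command rather than a judgment call. This is a sketch only: `wave_gate` is a hypothetical name, the numbers come from whatever monitoring you already have, and the default tolerance of 1.2 times the baseline is an assumption to tune per service.

```shell
# Succeeds when observed <= baseline * factor. awk handles the decimals.
within_budget() {
  awk -v obs="$1" -v base="$2" -v factor="$3" \
    'BEGIN { exit !(obs <= base * factor) }'
}

# Usage: wave_gate <p99_ms> <baseline_p99_ms> <5xx_pct> <baseline_5xx_pct> [factor]
# Prints "advance" only if both signals stay within the tolerance.
wave_gate() {
  factor=${5:-1.2}   # assumed tolerance, tune per service
  if within_budget "$1" "$2" "$factor" && within_budget "$3" "$4" "$factor"; then
    echo advance
  else
    echo halt
  fi
}
```

For example, `wave_gate 210 200 0.4 0.5` prints `advance`, while `wave_gate 300 200 0.4 0.5` prints `halt` because p99 exceeds 240 ms.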

Minimum runbook checklist

Waves and parallelism are defined
PDBs and probes are validated
Canary wave ready, on-call ready
Drain commands standardized
Rollback step is explicit
Post-patch validation (SLO + business metric)

With this discipline, node patching stops being “risky overnight work” and turns into a planned, measurable, repeatable operation.
