Mustafa Erbay

Designing Maintenance Waves for Kubernetes Node OS Patching

Roll out node patches in maintenance waves rather than all-at-once: drain, PDB, parallelism, and a safe rollback path.

When it comes to node OS patching on Kubernetes, teams usually swing to one of two extremes: they delay it indefinitely, or they roll it out aggressively and trigger an incident. Both are wrong. The robust path is to break the patch into maintenance waves, contain the blast radius, and keep rollback simple.

Prerequisites: before any patch starts

Before I touch a patch process, I insist on these three items:

  • Are PodDisruptionBudgets (PDBs) defined for the critical services?
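  • Are readiness/liveness probes meaningful? (They should reflect whether the pod can actually serve traffic, not just return “/healthz 200”.)
  • Is there enough capacity headroom? (The cluster must be able to lose at least one wave’s worth of nodes and still schedule the evicted pods.)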
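
The PDB check is easy to script. Here is a minimal sketch, assuming a hypothetical service labeled app=checkout in a namespace called shop; the function names are mine, while the kubectl subcommands and flags are the standard ones:

```shell
# Pre-flight PDB helpers. The names "shop", "checkout-pdb" and the label
# app=checkout used in the usage note are hypothetical placeholders.

# List every PDB in the cluster, including its allowed disruptions.
list_pdbs() {
  kubectl get poddisruptionbudgets --all-namespaces
}

# Create a PDB that keeps at least <min-available> pods of a service running.
# Usage: ensure_pdb <namespace> <pdb-name> <label-selector> <min-available>
ensure_pdb() {
  local ns=$1 name=$2 selector=$3 min_available=$4
  kubectl create poddisruptionbudget "$name" \
    --namespace="$ns" \
    --selector="$selector" \
    --min-available="$min_available"
}
```

Usage would look like `ensure_pdb shop checkout-pdb app=checkout 2`. A PDB that allows zero disruptions will block the drain entirely, so it is worth reading the `list_pdbs` output before the first wave rather than during it.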

Design: what is a wave?

A wave is a group of nodes that go into patching together. Good wave design:

  • balances by AZ/zone
  • avoids hitting all pods of the same service at the same time
  • makes “which wave were we in?” easy to answer during a rollback

A practical example:

  • Wave 0 (canary): 1–2 nodes
  • Wave 1: 10% of the fleet
  • Wave 2: 25% of the fleet
  • Wave 3: the rest
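
To turn the percentages above into node counts, a small helper is enough. This is a sketch under my own assumptions: the canary is fixed at 2 nodes, percentages round up, and no wave can take more nodes than are left.

```shell
# Print the node count of each wave for a fleet of <total> nodes.
# Assumptions: canary = 2 nodes, 10% and 25% are rounded up, and each
# wave is capped by the nodes still unassigned.
wave_sizes() {
  local total=$1
  local remaining=$total
  local w0=2 w1 w2
  if [ "$w0" -gt "$remaining" ]; then w0=$remaining; fi
  remaining=$(( remaining - w0 ))
  w1=$(( (total * 10 + 99) / 100 ))
  if [ "$w1" -gt "$remaining" ]; then w1=$remaining; fi
  remaining=$(( remaining - w1 ))
  w2=$(( (total * 25 + 99) / 100 ))
  if [ "$w2" -gt "$remaining" ]; then w2=$remaining; fi
  remaining=$(( remaining - w2 ))
  printf 'wave0=%d wave1=%d wave2=%d wave3=%d\n' "$w0" "$w1" "$w2" "$remaining"
}
```

For a 40-node fleet, `wave_sizes 40` prints `wave0=2 wave1=4 wave2=10 wave3=24`.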

Implementation: tag nodes by wave

A simple label:

kubectl label node worker-01 maintenance.wave=0
kubectl label node worker-02 maintenance.wave=1
kubectl label node worker-03 maintenance.wave=1

Driving the labels through GitOps is even better, but doing it manually is fine as a first step.
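
Until the labels live in Git, a plain text file can serve as the source of truth. The sketch below assumes a file with one "<node> <wave>" pair per line; the file name in the usage note is hypothetical, and `topology.kubernetes.io/zone` is the standard zone label most cloud providers set.

```shell
# Apply wave labels from "<node> <wave>" lines read on stdin.
label_waves() {
  while read -r node wave; do
    [ -n "$node" ] || continue   # skip blank lines
    kubectl label node "$node" "maintenance.wave=$wave" --overwrite
  done
}

# Show the nodes of one wave with their zone as an extra column,
# which makes an unbalanced wave easy to spot.
show_wave() {
  kubectl get nodes -l "maintenance.wave=$1" -L topology.kubernetes.io/zone
}
```

Usage: `label_waves < waves.txt`, then `show_wave 1` to confirm that Wave 1 is spread across zones before patching starts.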

Operational flow: drain → patch → uncordon

The minimum flow per node:

  1. cordon
  2. drain (with safe parameters)
  3. OS patch / reboot
  4. uncordon once the node is ready
  5. Validate the service-level signals

A drain example:

kubectl cordon worker-01
kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m

Then patch/reboot, and:

kubectl uncordon worker-01
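
The whole flow can be wrapped into one function per node. This is a sketch, not a finished tool: `patch_and_reboot` is a hypothetical hook that you must replace with your own mechanism (SSH, a cloud agent, an image swap), and the 15-minute readiness ceiling is an assumption.

```shell
# Hypothetical hook: apply OS updates and reboot the machine. It is expected
# to return only after the machine is back up. Replace with your own logic.
patch_and_reboot() {
  echo "patch_and_reboot is not implemented for $1" >&2
  return 1
}

# Run cordon -> drain -> patch -> wait -> uncordon for one node.
# Any failing step stops the flow and leaves the node cordoned for inspection.
patch_node() {
  local node=$1
  kubectl cordon "$node" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --grace-period=60 --timeout=10m || return 1
  patch_and_reboot "$node" || return 1
  # Wait until the node reports Ready again (15m is an assumed ceiling).
  kubectl wait --for=condition=Ready "node/$node" --timeout=15m || return 1
  kubectl uncordon "$node"
}
```

Stopping on the first failure is deliberate: a node that failed to drain or never came back Ready should stay out of scheduling until someone looks at it. Step 5 (validating service-level signals) is intentionally left outside the function, since it belongs to the wave rather than to a single node.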

Parallelism: how many nodes at once?

My rule of thumb is short:

  • 1 in the canary wave
  • After that: 1 per AZ (or fewer, depending on how critical the service is)

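The mistake is deciding from a single number such as “20% of total nodes.” Zone distribution and per-service replica counts carry far more weight: draining 20% of a fleet is harmless when those nodes are spread across zones and services, and an outage when they happen to host every replica of one service.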
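
The "1 per AZ" rule can be turned into a batch plan mechanically. The sketch below reads "<node> <zone>" lines and groups them so that no batch contains two nodes from the same zone; the function names and the zone names in the usage note are illustrative, and it does not look at per-service replica placement, which still needs a human check.

```shell
# Print "<node> <zone>" for every node in wave <n>.
wave_nodes_with_zone() {
  kubectl get nodes -l "maintenance.wave=$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}'
}

# Read "<node> <zone>" lines on stdin and print batches with at most one
# node per zone: the Nth node seen in a zone lands in batch N.
plan_batches() {
  awk '
    NF >= 2 {
      n = ++seen[$2]
      batch[n] = (batch[n] == "" ? "" : batch[n] " ") $1
      if (n > max) max = n
    }
    END { for (i = 1; i <= max; i++) print "batch " i ": " batch[i] }
  '
}
```

Usage: `wave_nodes_with_zone 1 | plan_batches`. With two nodes in eu-a, two in eu-b and one in eu-c, the plan has two batches: the first holds one node from each of the three zones, the second holds the remaining eu-a and eu-b nodes.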

Rollback plan

What makes patching safe is not “the absence of problems,” it is knowing exactly what to do if a problem hits:

  • If p95/p99 latency or the 5xx rate degrades after Wave 0, do not advance to Wave 1
  • For kernel/driver issues, halt the wave in progress and revert the image quickly
  • Have a ready-to-run step for restoring nodes to a “known-good” AMI/image
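
The "do not advance" rule works best as an explicit gate rather than a judgment call at 3 a.m. Here is a minimal sketch: the metric values are passed in by hand or by whatever queries your monitoring, and the default thresholds (500 ms p99, 10 errors per 10,000 requests) are placeholders I made up, not recommendations.

```shell
# Decide whether the next wave may start.
# Usage: wave_gate <p99-latency-ms> <5xx-per-10k-requests>
# Thresholds are hypothetical defaults; override them from your SLOs via
# MAX_P99_MS and MAX_ERRORS_PER_10K.
wave_gate() {
  local p99_ms=$1 errors_per_10k=$2
  local max_p99_ms=${MAX_P99_MS:-500}
  local max_errors=${MAX_ERRORS_PER_10K:-10}
  if [ "$p99_ms" -gt "$max_p99_ms" ] || [ "$errors_per_10k" -gt "$max_errors" ]; then
    echo "HALT"
    return 1
  fi
  echo "ADVANCE"
}
```

`wave_gate 320 4` prints ADVANCE and exits 0, while `wave_gate 900 4` prints HALT and exits non-zero, so the function can guard the next wave in a script: `wave_gate "$p99" "$errors" && start_next_wave`, where `start_next_wave` stands for your own step.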

Minimum runbook checklist

  • Waves and parallelism are defined
  • PDBs and probes are validated
  • Canary wave ready, on-call ready
  • Drain commands standardized
  • Rollback step is explicit
  • Post-patch validation (SLO + business metric)

With this discipline, node patching stops being “risky overnight work” and turns into a planned, measurable, repeatable operation.
