Mustafa Erbay

vSphere/ESXi Host Patch: Maintenance Wave and Rollback Runbook

Manage the ESXi host patch process with ring-based maintenance waves, control capacity/HA risk, and establish safe remediation and rollback discipline.


One of the riskiest jobs in virtualization platform operations is the “host patch.” That’s because you’re touching the compute layer while also handing the fate of most workloads to DRS/HA decisions. Remediation done without a solid runbook behaves less like a “patch” and more like a “platform change.”

In this post, I describe how to manage the vSphere/ESXi host patch process with maintenance waves (ring rollout) and clear rollback conditions.

Goal: not “the entire cluster at once,” but ring-based safe progress

The ring approach I use in the field:

  • Ring 0 (canary): 1 host (lowest-risk workloads)
  • Ring 1: 10–20% of the cluster
  • Ring 2: the remaining hosts

Between consecutive rings there is a “health check” and a “rollback window.”
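As a rough illustration, the ring split above can be computed from an ordered host list. This is only a sketch: the function name `plan_rings`, the `ring1_percent` parameter, and the host names are hypothetical, and the list is assumed to be sorted so that the host with the lowest-risk workloads comes first.

```python
def plan_rings(hosts, ring1_percent=15):
    """Split hosts into ring0 (canary), ring1, and ring2.

    Assumption: `hosts` is ordered from lowest-risk to highest-risk workloads.
    `ring1_percent` is a percentage of the whole cluster (the post suggests 10-20).
    """
    if not 0 < ring1_percent <= 100:
        raise ValueError("ring1_percent must be between 1 and 100")
    hosts = list(hosts)
    canary, rest = hosts[:1], hosts[1:]
    # Integer ceiling division avoids floating-point rounding surprises.
    ring1_size = -(-len(hosts) * ring1_percent // 100)
    # Ring 1 can never be larger than what is left after the canary.
    ring1_size = min(ring1_size, len(rest))
    return {
        "ring0": canary,
        "ring1": rest[:ring1_size],
        "ring2": rest[ring1_size:],
    }


if __name__ == "__main__":
    cluster = [f"esx{i:02d}" for i in range(1, 11)]  # hypothetical host names
    for ring, members in plan_rings(cluster).items():
        print(ring, members)
```

Keeping the plan as plain data makes it easy to review in the change request before the window opens.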

Pre-check list (before maintenance starts)

Before the maintenance window opens, run these checks:

  • Does the cluster have N+1 capacity? (it must absorb the loss of at least 1 host)
  • Is DRS enabled? Is vMotion healthy?
  • Are datastore capacity and latency normal?
  • Have HA admission control and the failover slot model been reviewed?
  • Have hardware/driver compatibility (HCL) and firmware dependencies been validated?
  • Has a vCenter backup/snapshot (per the org’s standard) been taken?
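The N+1 question in the list above comes down to simple arithmetic, sketched below. The `HostLoad` fields, the function name `has_n_plus_1`, and the 20% headroom are assumptions for illustration; the numbers would come from whatever inventory or monitoring export your organization uses, and this check does not replace HA admission control.

```python
from dataclasses import dataclass


@dataclass
class HostLoad:
    # Hypothetical per-host figures taken from a monitoring export.
    name: str
    cpu_mhz_capacity: int
    cpu_mhz_used: int
    mem_mb_capacity: int
    mem_mb_used: int


def has_n_plus_1(hosts, headroom_percent=20):
    """Return True if the cluster can carry its current load with any one host removed.

    `headroom_percent` is reserved on the surviving hosts (an assumed safety margin).
    """
    if len(hosts) < 2:
        return False
    cpu_used = sum(h.cpu_mhz_used for h in hosts)
    mem_used = sum(h.mem_mb_used for h in hosts)
    for down in hosts:
        survivors = [h for h in hosts if h is not down]
        cpu_cap = sum(h.cpu_mhz_capacity for h in survivors)
        mem_cap = sum(h.mem_mb_capacity for h in survivors)
        usable_cpu = cpu_cap * (100 - headroom_percent) // 100
        usable_mem = mem_cap * (100 - headroom_percent) // 100
        if cpu_used > usable_cpu or mem_used > usable_mem:
            return False
    return True
```

Checking every host as the one that goes down matters in clusters with mixed hardware, where losing the largest host is the worst case.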

Scope of change: patch or firmware?

Field reality:

  • Even an ESXi-only patch can affect NIC/storage driver behavior.
  • Firmware upgrades carry higher risk and call for a separate runbook.

Recommendation:

  • Don’t combine “ESXi patch + firmware” in the same maintenance wave. Split them.

Runbook: remediation flow on a single host

  1. Evacuate the host
  • Move VMs automatically with DRS (if possible)
  • For VMs that can’t be moved, document why: affinity rule, pinned device, datastore, vMotion disabled
  2. Maintenance Mode
  • The host enters maintenance mode
  • If a “quick exit” might be needed, the plan should be explicit (back out, halt the ring)
  3. Remediate / Patch
  • Remediate via Lifecycle Manager (or your org’s standard)
  • During the operation, monitor host connectivity, datastore paths, and NIC link state
  4. Reboot + health check

Minimum health check:

  • Did the host reconnect?
  • Are the NIC uplinks and VLANs correct?
  • Are the storage paths (multipathing) normal?
  • Did cluster alarms increase?
  • Does a basic “smoke test” VM run?
  5. Exit Maintenance Mode
  • The host returns to the cluster’s resource pool
  • DRS rebalance check (throttle if there’s excessive churn)
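The flow above can be sketched as an ordered list of steps that stops at the first failure and signals that the ring should halt. Everything here is hypothetical: the step names, `run_host_runbook`, and the `actions` mapping are placeholders, and the callables would wrap whatever tooling your organization actually uses (no vSphere API calls are shown).

```python
STEPS = [
    "evacuate",
    "enter_maintenance",
    "remediate",
    "reboot_and_health_check",
    "exit_maintenance",
]


def run_host_runbook(host, actions):
    """Run each step in order; stop at the first one that reports failure.

    `actions` maps a step name to a callable taking the host name and
    returning True on success (hypothetical wrappers around real tooling).
    """
    completed = []
    for step in STEPS:
        if not actions[step](host):
            # A failed step means: do not continue, and do not start the next host.
            return {
                "host": host,
                "completed": completed,
                "failed_step": step,
                "halt_ring": True,
            }
        completed.append(step)
    return {
        "host": host,
        "completed": completed,
        "failed_step": None,
        "halt_ring": False,
    }


if __name__ == "__main__":
    # Stub actions that always succeed, for a dry run of the flow.
    stubs = {step: (lambda host: True) for step in STEPS}
    print(run_host_runbook("esx01", stubs))
```

Returning the list of completed steps gives the operator an exact record of where the host stopped, which is what the rollback decision needs.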

Ring gate: criteria for moving to the next wave

The post-Ring-0 “go/no-go” criteria:

  • No new alarms/incidents within 30–60 minutes
  • vMotion and HA events haven’t risen abnormally
  • Storage/NIC error counts haven’t increased
  • Application teams (for critical services) have confirmed that the service is stable
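To make the gate measurable, the criteria above can be written as a function that returns the reasons for a “no-go.” This is a sketch under assumptions: the `GateObservation` fields are hypothetical, and “abnormal” is defined here as more than `abnormal_factor` times a baseline you supply, which is a choice of this example and not something the criteria prescribe.

```python
from dataclasses import dataclass


@dataclass
class GateObservation:
    # Hypothetical figures collected during the soak period after a ring.
    minutes_observed: int
    new_alarms: int
    vmotion_events: int
    vmotion_baseline: int
    ha_events: int
    ha_baseline: int
    storage_nic_errors_before: int
    storage_nic_errors_after: int
    app_teams_signed_off: bool


def evaluate_gate(obs, min_soak_minutes=30, abnormal_factor=2):
    """Return a list of blockers; an empty list means 'go'."""
    blockers = []
    if obs.minutes_observed < min_soak_minutes:
        blockers.append("soak period not finished")
    if obs.new_alarms > 0:
        blockers.append("new alarms or incidents")
    if obs.vmotion_events > obs.vmotion_baseline * abnormal_factor:
        blockers.append("vMotion events abnormally high")
    if obs.ha_events > obs.ha_baseline * abnormal_factor:
        blockers.append("HA events abnormally high")
    if obs.storage_nic_errors_after > obs.storage_nic_errors_before:
        blockers.append("storage/NIC error count increased")
    if not obs.app_teams_signed_off:
        blockers.append("application sign-off missing")
    return blockers
```

Returning reasons instead of a bare boolean keeps the go/no-go decision auditable in the change record.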

Rollback plan

The rollback plan must “actually work in practice,” not just “exist in theory”:

  • If the host is unstable post-patch: put the host back in maintenance mode, halt the ring
  • The procedure for reverting to the previous image/baseline must be documented per the org’s standard
  • If needed, an emergency plan should be ready to relocate the workload to another cluster/site

Wrap-up

The vSphere/ESXi host patch process, when not managed properly, is a critical change capable of breaking the platform. With ring-based maintenance waves, a clear pre-check list, measurable gates, and a realistic rollback plan, remediation goes from a stressful overnight operation to a manageable routine. In large-scale infrastructure, sustainability isn’t about “doing the patch” but about “being able to repeat the patch safely.”

