One of the riskiest jobs in virtualization platform operations is the “host patch.” That’s because you’re touching the compute layer while also handing the fate of most workloads to DRS/HA decisions. Remediation done without a solid runbook behaves less like a “patch” and more like a “platform change.”
In this post, I describe how to manage the vSphere/ESXi host patch process with maintenance waves (ring rollout) and clear rollback conditions.
Goal: safe, ring-based progress, not “the entire cluster at once”
The ring approach I use in the field:
- Ring 0 (canary): 1 host (lowest-risk workloads)
- Ring 1: 10–20% of the cluster
- Ring 2: the remaining hosts
Between each ring there is a “health check” and a “rollback window.”
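The ring split above can be sketched in a few lines of Python. This is a minimal illustration, not a tool from any VMware product: the function name plan_rings, the ring1_percent parameter, and the host names are all hypothetical, and the host list is assumed to be sorted already with the lowest-risk host first.

```python
def plan_rings(hosts, ring1_percent=20):
    """Split an ordered host list into Ring 0 (canary), Ring 1, and Ring 2.

    Assumption: `hosts` is sorted with the lowest-risk host first.
    """
    if not 0 < ring1_percent <= 100:
        raise ValueError("ring1_percent must be in (0, 100]")
    if not hosts:
        return []
    canary, rest = hosts[:1], hosts[1:]
    # Ceiling of len(hosts) * percent / 100, using integer arithmetic only.
    ring1_size = (len(hosts) * ring1_percent + 99) // 100
    rings = [canary, rest[:ring1_size], rest[ring1_size:]]
    # Drop empty rings so very small clusters still get a valid plan.
    return [ring for ring in rings if ring]


# Hypothetical 10-host cluster.
hosts = [f"esx{i:02d}" for i in range(1, 11)]
for number, ring in enumerate(plan_rings(hosts)):
    print(f"Ring {number}: {ring}")
```

For the 10-host example with the default 20%, this yields 1 canary host, 2 hosts in Ring 1, and 7 hosts in Ring 2.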
Pre-check list (before maintenance starts)
Before the maintenance window opens, run these checks:
- Does the cluster have N+1 capacity? (it must absorb the loss of at least 1 host)
- Is DRS enabled? Is vMotion healthy?
- Are datastore capacity and latency normal?
- Have HA admission control and the failover slot model been reviewed?
- Have hardware/driver compatibility (HCL) and firmware dependencies been validated?
- Has a vCenter backup/snapshot (per the org’s standard) been taken?
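The N+1 item on this list reduces to simple arithmetic. The sketch below is a deliberately simplified illustration that looks only at memory; a real check would also cover CPU and HA admission control settings. The function name, parameters, and numbers are hypothetical, and the inputs are assumed to come from whatever inventory or monitoring source you already use.

```python
def has_n_plus_one(host_memory_gb, used_memory_gb, reserve_fraction=0.0):
    """Return True if current demand still fits after losing the largest host.

    Assumption: memory is the limiting resource; CPU is ignored here.
    """
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    if len(host_memory_gb) < 2:
        # A single-host cluster cannot absorb the loss of a host.
        return False
    # Worst case: the host taken out is the largest one.
    surviving_gb = sum(host_memory_gb) - max(host_memory_gb)
    return used_memory_gb <= surviving_gb * (1 - reserve_fraction)


# Hypothetical cluster: four hosts with 512 GB each, 1400 GB in use.
print(has_n_plus_one([512, 512, 512, 512], 1400))
print(has_n_plus_one([512, 512, 512, 512], 1400, reserve_fraction=0.1))
```

The second call keeps 10% of the surviving capacity in reserve, which is one way to avoid starting a wave on a cluster that only barely passes.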
Scope of change: patch or firmware?
Field reality:
- Even an ESXi-only patch can affect NIC/storage driver behavior.
- Firmware upgrades carry higher risk and call for a separate runbook.
Recommendation:
- Don’t combine “ESXi patch + firmware” in the same maintenance wave. Split them.
Runbook: remediation flow on a single host
- Evacuate the host
  - Move VMs automatically with DRS (if possible)
  - For VMs that can’t be moved, document why: affinity rule, pinned device, datastore, vMotion disabled
- Maintenance Mode
  - The host enters maintenance mode
  - If a “quick exit” might be needed, the plan should be explicit (back out, halt the ring)
- Remediate / Patch
  - Remediate via Lifecycle Manager (or your org’s standard)
  - During the operation, monitor host connectivity, datastore paths, and NIC link state
- Reboot + health check
  - Minimum health check:
    - Did the host reconnect?
    - Are the NIC uplinks and VLANs correct?
    - Are the storage paths (MPIO) normal?
    - Did cluster alarms increase?
    - Does a basic “smoke test” VM run?
- Exit Maintenance Mode
  - The host returns to the resource pool
  - DRS rebalance check (throttle if there’s excessive churn)
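The single-host flow above is an ordered sequence where any failed step should stop the run. A minimal sketch of that control flow follows. It makes no vSphere calls: each step is a placeholder callable that you would replace with your own automation, and the names remediate_host and demo_steps are hypothetical.

```python
def remediate_host(host, steps):
    """Run the steps in order and stop at the first one that fails.

    `steps` is a list of (name, action) pairs; each action takes the host
    name and returns True on success. Returns a tuple of
    (success, completed step names, name of the failed step or None).
    """
    completed = []
    for name, action in steps:
        if not action(host):
            # Stop here: the caller decides whether to back out or halt the ring.
            return False, completed, name
        completed.append(name)
    return True, completed, None


# Placeholder actions; the health check is forced to fail for illustration.
demo_steps = [
    ("evacuate", lambda host: True),
    ("enter_maintenance_mode", lambda host: True),
    ("remediate", lambda host: True),
    ("reboot_and_health_check", lambda host: False),
    ("exit_maintenance_mode", lambda host: True),
]

ok, completed, failed_step = remediate_host("esx01", demo_steps)
print(ok, completed, failed_step)
```

Because the run stops at the failed health check, the host never reaches the exit-maintenance-mode step, so it stays out of service until someone has looked at it.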
Ring gate: criteria for moving to the next wave
The post-Ring-0 “go/no-go” criteria:
- No new alarms/incidents within 30–60 minutes
- vMotion and HA events haven’t risen abnormally
- Storage/NIC error counts haven’t increased
- Application teams (for critical services) have confirmed that the service is stable
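These criteria are easier to enforce when the gate returns an explicit decision together with its reasons. The sketch below is one possible encoding, not a standard. The GateObservation fields, the 30-minute minimum, and the 1.5x event ratio used to define “abnormal” are all assumptions to adjust for your environment.

```python
from dataclasses import dataclass


@dataclass
class GateObservation:
    minutes_observed: int
    new_alarms: int
    ha_vmotion_events: int
    baseline_ha_vmotion_events: int
    new_storage_nic_errors: int
    app_teams_signed_off: bool


def evaluate_gate(obs, min_minutes=30, event_ratio_limit=1.5):
    """Return (go, reasons); go is True only when every criterion passes."""
    reasons = []
    if obs.minutes_observed < min_minutes:
        reasons.append("observation window too short")
    if obs.new_alarms > 0:
        reasons.append("new alarms or incidents")
    # Assumed definition of "abnormal": more than 1.5x the pre-patch baseline.
    if obs.ha_vmotion_events > obs.baseline_ha_vmotion_events * event_ratio_limit:
        reasons.append("abnormal rise in vMotion/HA events")
    if obs.new_storage_nic_errors > 0:
        reasons.append("storage/NIC error counts increased")
    if not obs.app_teams_signed_off:
        reasons.append("application sign-off missing")
    return len(reasons) == 0, reasons


# Hypothetical observation taken 45 minutes after Ring 0 completed.
go, reasons = evaluate_gate(GateObservation(45, 0, 12, 10, 0, True))
print("GO" if go else "NO-GO", reasons)
```

Returning the list of reasons rather than a bare boolean makes a no-go decision easy to record in the change ticket.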
Rollback plan
The rollback plan must “actually work in practice,” not just “exist in theory”:
- If the host is unstable post-patch: put the host back in maintenance mode, halt the ring
- The procedure for reverting to the previous image/baseline must be documented per the org’s standard
- If needed, an emergency plan should be ready to relocate the workload to another cluster/site
Wrap-up
The vSphere/ESXi host patch process, when not managed properly, is a critical change capable of breaking the platform. With ring-based maintenance waves, a clear pre-check list, measurable gates, and a realistic rollback plan, remediation goes from a stressful overnight operation to a manageable routine. In large-scale infrastructure, sustainability isn’t about “doing the patch” but about “being able to repeat the patch safely.”