In Kubernetes, the API server is the visible front door, but the heart of the control plane is ETCD. When quorum is lost, the cluster can look “not entirely down”: nodes still run, pods still serve traffic, yet new deploys, autoscaling, and admission/CRD operations all start failing.
This post is a runbook designed for discipline, not panic, when ETCD loses quorum.
1) Symptoms: when does ETCD become the prime suspect?
- kubectl timing out / Unable to connect to the server
- A wave of API 5xx responses (especially etcdserver: request timed out)
- Erratic behavior from the scheduler/controller-manager
- A spike in disk latency on the control plane nodes
2) The first 10 minutes: contain the damage
- Stop changes: if any automation is running (GitOps/CI), put it on “pause.”
- Inspect resource health on control plane nodes: disk fullness, IOPS, CPU steal.
- Determine whether the ETCD members are actually alive (logs + endpoint health).
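The resource checks above can be sketched as two small helpers. This is a minimal sketch, not a full health check: the data directory /var/lib/etcd is an assumed kubeadm-style default, the helper names are made up for illustration, and the steal figure is cumulative since boot on Linux rather than a live reading.

```shell
#!/bin/sh
# Assumption: etcd's data dir is /var/lib/etcd; override with ETCD_DATA_DIR.
ETCD_DATA_DIR="${ETCD_DATA_DIR:-/var/lib/etcd}"

# Print the used-space percentage (integer, no % sign) of the
# filesystem that holds the given path, using POSIX df output.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Print CPU steal as a percentage of all CPU time since boot (Linux only).
# In /proc/stat the aggregate "cpu" line lists user, nice, system, idle,
# iowait, irq, softirq, steal in fields 2 to 9.
cpu_steal_pct() {
  awk '/^cpu / {
    total = 0
    for (i = 2; i <= 9; i++) total += $i
    printf "%.1f\n", ($9 / total) * 100
  }' /proc/stat
}

# Example (run on a control plane node):
#   echo "etcd disk used: $(disk_used_pct "$ETCD_DATA_DIR")%"
#   echo "cpu steal since boot: $(cpu_steal_pct)%"
```

A nearly full disk or a noticeable steal share points at the node rather than at ETCD itself; per-device I/O latency still needs a tool such as iostat, which is not assumed to be installed here.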
3) Triage: ETCD health checks (sample commands)
The TLS parameters vary by environment, but the principle stays the same.
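As one illustration, etcdctl reads its connection settings from ETCDCTL_-prefixed environment variables, so the TLS details can be set once per shell. The sketch below assumes a kubeadm-style layout; the endpoint and certificate paths are assumptions and will differ with other installers.

```shell
# Assumed kubeadm-style endpoint and certificate paths; adjust to your environment.
export ETCDCTL_ENDPOINTS="https://127.0.0.1:2379"
export ETCDCTL_CACERT="/etc/kubernetes/pki/etcd/ca.crt"
export ETCDCTL_CERT="/etc/kubernetes/pki/etcd/server.crt"
export ETCDCTL_KEY="/etc/kubernetes/pki/etcd/server.key"
```

With these exported, the health commands can be run without repeating the TLS flags on every call.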
export ETCDCTL_API=3
# Example: run on a control plane node
etcdctl endpoint status --write-out=table
etcdctl endpoint health --write-out=table
etcdctl member list --write-out=table
Reading the output:
- A subset of members is “unhealthy” → quorum at risk
- No leader / constant leader churn → network or disk problem
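Whether "a subset is unhealthy" means quorum is merely at risk or already gone is simple arithmetic: a cluster of n members needs floor(n/2) + 1 healthy members. A minimal sketch of that check follows; the function names are made up for illustration, and the healthy count is something you read off the health output yourself.

```shell
#!/bin/sh
# Print the number of members required for quorum in a cluster of $1 members.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeed (exit 0) if $1 healthy members out of $2 total still form a quorum.
has_quorum() {
  healthy=$1
  total=$2
  [ "$healthy" -ge "$(quorum_size "$total")" ]
}

# Example:
#   if has_quorum 2 3; then echo "quorum intact"; else echo "quorum lost"; fi
```

For a 3-member cluster the threshold is 2, so two healthy members keep quorum and one does not. A 4-member cluster needs 3, which is why an even member count adds no extra failure tolerance.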
4) Decision tree: do we still have quorum?
4.1 Quorum is intact (e.g. 2 of 3 members healthy)
Goal: stabilize the cluster.
- Isolate the bad member (don’t drain the node; isolate ETCD’s connectivity)
- Bring disk latency down (noisy neighbor, storage problems)
- Plan ETCD compaction/defrag if needed (not now; only after stabilization)
4.2 Quorum is lost (e.g. only 1 of 3 members left)
Goal: restore the control plane through a consistent restore.
At this point you have two options:
- Restore from the last solid snapshot (recommended)
- Risky paths such as restarting the surviving member with etcd’s --force-new-cluster flag (last resort)
5) Recovery: the snapshot restore approach (recommended)
Prerequisites:
- Snapshots must be taken regularly (and copied off to another node/store)
- The integrity of the snapshot must be verifiable
High-level steps:
- Stop every ETCD instance (terminate uncontrolled writes)
- Pick the snapshot (most recent and intact)
- Restore into a fresh ETCD data dir
- Bring members back up and confirm a leader is elected
- Start the API servers afterward
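The restore step itself might look like the sketch below, run once per member after every ETCD instance has been stopped. It assumes etcd 3.5 or later, where etcdutl provides snapshot status and snapshot restore. The snapshot path, data directory, member names, peer URLs, and cluster token are all hypothetical placeholders. The script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of a per-member restore. Every value below is a placeholder
# (assumption); set them for your cluster. DRY_RUN=1 only prints commands.
DRY_RUN="${DRY_RUN:-1}"
SNAPSHOT="${SNAPSHOT:-/var/backups/etcd/snapshot.db}"
NEW_DATA_DIR="${NEW_DATA_DIR:-/var/lib/etcd-restore}"
MEMBER_NAME="${MEMBER_NAME:-cp-1}"
PEER_URL="${PEER_URL:-https://10.0.0.11:2380}"
INITIAL_CLUSTER="${INITIAL_CLUSTER:-cp-1=https://10.0.0.11:2380,cp-2=https://10.0.0.12:2380,cp-3=https://10.0.0.13:2380}"
CLUSTER_TOKEN="${CLUSTER_TOKEN:-restored-cluster-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

restore_member() {
  # Never restore on top of an existing directory: old and new data must not mix.
  if [ -e "$NEW_DATA_DIR" ]; then
    echo "refusing to restore: $NEW_DATA_DIR already exists" >&2
    return 1
  fi
  # Inspect the snapshot (hash, revision, key count) before trusting it.
  run etcdutl snapshot status "$SNAPSHOT" --write-out=table || return 1
  # Restore into a fresh data dir with a new cluster identity.
  run etcdutl snapshot restore "$SNAPSHOT" \
    --name "$MEMBER_NAME" \
    --data-dir "$NEW_DATA_DIR" \
    --initial-cluster "$INITIAL_CLUSTER" \
    --initial-cluster-token "$CLUSTER_TOKEN" \
    --initial-advertise-peer-urls "$PEER_URL"
}

# Example: DRY_RUN=1 restore_member
```

Every member should be restored from the same snapshot file with the same initial cluster list and token; only the member name and peer URL change per node. Pointing ETCD at the new data directory and starting it again depends on how it is deployed (static pod, systemd unit), so that part is left out.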
6) Avoiding split-brain
The biggest danger during quorum loss is two divergent realities forming.
Avoidance principles:
- Don’t run multiple “recovery attempts” in parallel
- Don’t mix the old data dir with a new restore
- Don’t force a cluster while there’s an active network partition
- Don’t re-enable automation against a freshly restored cluster until it’s stable
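The first principle can be backed by a small guard in whatever script drives the recovery. The sketch below uses an atomic mkdir as a lock. The lock path and function names are hypothetical, and the lock only protects a single host, so coordination between operators on different nodes still has to happen out of band.

```shell
#!/bin/sh
# Hypothetical lock location; it must be on a local filesystem.
LOCK_DIR="${LOCK_DIR:-/var/run/etcd-recovery.lock}"

# Take the lock. mkdir is atomic, so only one caller can create the directory.
acquire_recovery_lock() {
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    # Record who holds the lock so a second operator can see it.
    echo "owner=$(id -un) host=$(uname -n) started=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      > "$LOCK_DIR/owner"
    return 0
  fi
  echo "recovery already in progress: $(cat "$LOCK_DIR/owner" 2>/dev/null)" >&2
  return 1
}

# Release the lock once the attempt has finished or been abandoned.
release_recovery_lock() {
  rm -f "$LOCK_DIR/owner"
  rmdir "$LOCK_DIR"
}

# Example:
#   acquire_recovery_lock || exit 1
#   ... recovery steps ...
#   release_recovery_lock
```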
7) Preventive controls (post-incident action list)
- Are the ETCD members in a single failure domain? (same storage / same rack / same AZ)
- Are there disk latency alarms? (fsync, WAL writes)
- What’s the snapshot frequency, and are off-host copies in place?
- Quorum: design with 3 or 5 members (avoid even counts: 4 members need 3 for quorum and still tolerate only one failure, the same as 3)
- Is compaction/defrag maintenance scheduled?
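For the snapshot item, a scheduled job could be as small as the sketch below: take a snapshot, inspect it, and copy it off the host. The directory, the remote destination, and the function names are hypothetical, and scp stands in for whatever transfer tool you use. Like any operational script, it is safer to start in dry-run mode.

```shell
#!/bin/sh
# Placeholders (assumptions): local snapshot dir and off-host destination.
DRY_RUN="${DRY_RUN:-1}"
SNAP_DIR="${SNAP_DIR:-/var/backups/etcd}"
REMOTE_DEST="${REMOTE_DEST:-backup-host:/srv/etcd-snapshots/}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

take_snapshot() {
  # Timestamped file name so older snapshots are never overwritten.
  snap="$SNAP_DIR/etcd-$(date -u +%Y%m%dT%H%M%SZ).db"
  # Save a snapshot from the endpoint configured for etcdctl.
  run etcdctl snapshot save "$snap" || return 1
  # Read the file back (hash, revision, key count) as a basic integrity check.
  run etcdutl snapshot status "$snap" --write-out=table || return 1
  # Copy it off the node so a node loss does not take the backup with it.
  run scp "$snap" "$REMOTE_DEST"
}

# Example: DRY_RUN=1 take_snapshot
```

A backup that has never been restored is only a hope, so pair a job like this with a periodic restore rehearsal on a scratch machine.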
8) Closing thoughts
ETCD quorum loss usually isn’t “Kubernetes is broken”; it’s a breakdown of disk, network, or maintenance discipline. The runbook’s purpose isn’t fast heroics — it’s bringing production back while preserving consistency. Take snapshots seriously, rehearse restores, and keep your automation’s restart loop under control.