Kubernetes ETCD Quorum Loss: Triage and Recovery Runbook

In Kubernetes, the API server is the visible front door, but the heart of the control plane is ETCD. When quorum is lost, the cluster can look “not entirely down”: nodes still run, pods still serve traffic, yet new deploys, autoscaling, and admission/CRD operations all start failing.

This post is a runbook designed for discipline, not panic, when ETCD loses quorum.

1) Symptoms: when does ETCD become the prime suspect?

kubectl timing out / Unable to connect to the server
A wave of API 5xx responses (especially etcdserver: request timed out)
Erratic behavior from the scheduler/controller-manager (for example, failed leader-election lease renewals followed by restarts)
A spike in disk latency on the control plane nodes
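To separate an ETCD problem from a generic API outage, the API server's own readiness report can help while it still answers. The sketch below assumes the verbose /readyz output marks a failing check with a "[-]" prefix (as in "[-]etcd failed: reason withheld"); the function name etcd_check_failed is hypothetical, and if kubectl cannot connect at all this check tells you nothing.

```shell
# Returns 0 (true) when the readiness report on stdin shows a failing etcd check.
etcd_check_failed() {
  grep -q -F -- '[-]etcd'
}

# Usage sketch (not executed here). stderr is merged because kubectl may print
# the report inside an error message when the endpoint returns a non-200 status:
#   kubectl get --raw='/readyz?verbose' 2>&1 | etcd_check_failed \
#     && echo "API server reports its etcd check as failing"
```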

2) The first 10 minutes: contain the damage

Stop changes: if any automation is running (GitOps/CI), put it on “pause.”
Inspect resource health on control plane nodes: disk fullness, IOPS, CPU steal.
Determine whether the ETCD members are actually alive (logs + endpoint health).
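The resource checks above can be scripted so they are repeatable under pressure. This is a minimal sketch for Linux hosts: the helper names are hypothetical, and /var/lib/etcd is an assumed data directory that may differ in your environment.

```shell
# Print the usage percentage (digits only) from `df -P` output on stdin.
usage_pct() {
  awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Print cumulative CPU steal ticks from a /proc/stat-style file (Linux only).
# Sample it twice a few seconds apart; a growing delta suggests a noisy hypervisor.
cpu_steal_ticks() {
  awk '$1 == "cpu" { print $9 }' "${1:-/proc/stat}"
}

# Usage sketch (not executed here):
#   df -P /var/lib/etcd | usage_pct
#   cpu_steal_ticks; sleep 5; cpu_steal_ticks
```

For IOPS and latency, a tool such as iostat gives a fuller picture if it is installed; the helpers above only cover fullness and steal.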

3) Triage: ETCD health checks (sample commands)

The TLS parameters vary by environment, but the principle stays the same.

export ETCDCTL_API=3

# Example: run this on a control plane node
etcdctl endpoint status --write-out=table
etcdctl endpoint health --write-out=table
etcdctl member list --write-out=table
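On TLS-enabled clusters the commands above usually need endpoint and certificate settings. One way to supply them is through etcdctl's environment variables; the paths below are an assumption (a kubeadm-style layout) and should be replaced with whatever your environment uses.

```shell
# Assumed kubeadm-style certificate locations; replace with your own paths.
export ETCDCTL_ENDPOINTS="https://127.0.0.1:2379"
export ETCDCTL_CACERT="/etc/kubernetes/pki/etcd/ca.crt"
export ETCDCTL_CERT="/etc/kubernetes/pki/etcd/server.crt"
export ETCDCTL_KEY="/etc/kubernetes/pki/etcd/server.key"

# With these set, the health commands need no extra flags, for example:
#   etcdctl endpoint status --cluster --write-out=table
```

Keeping the settings in the environment means every later command in the session hits the same endpoints with the same credentials, which reduces copy-paste mistakes during an incident.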

Reading the output:

A subset of members is “unhealthy” → quorum at risk
No leader / constant leader churn → usually a network or disk latency problem
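Reading the health output can be reduced to one question: do the healthy members still form a majority? The sketch below assumes the plain (non-table) output of etcdctl endpoint health, where each line contains "is healthy" or "is unhealthy"; both function names are hypothetical. Note that a member which cannot reach a leader will typically report unhealthy too, so a count of zero does not prove every process is dead.

```shell
# Count endpoints reported healthy in plain `etcdctl endpoint health` output (stdin).
count_healthy() {
  grep -c ' is healthy' || true
}

# Succeed when $1 healthy members form a strict majority of $2 total members.
has_quorum() {
  [ "$1" -ge $(( $2 / 2 + 1 )) ]
}

# Usage sketch (not executed here), for a three-member cluster:
#   healthy=$(etcdctl endpoint health --cluster 2>&1 | count_healthy)
#   has_quorum "$healthy" 3 && echo "quorum intact" || echo "quorum lost or at risk"
```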

4) Decision tree: do we still have quorum?

4.1 Quorum is intact (e.g. 2 of 3 members healthy)

Goal: stabilize the cluster.

Isolate the bad member (don’t drain the node; isolate ETCD’s connectivity)
Bring disk latency down (noisy neighbor, storage problems)
Plan ETCD compaction/defrag if needed (not now; only after stabilization)
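For the later maintenance window, compaction needs a target revision. A common approach is to read the current revision from the JSON status output and compact up to it; the sketch below assumes the JSON contains a "revision":N field without whitespace, and the helper name current_revision is hypothetical.

```shell
# Extract the first "revision" value from `etcdctl endpoint status --write-out=json` (stdin).
current_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2
}

# Usage sketch for a planned maintenance window (not during the incident):
#   rev=$(etcdctl endpoint status --write-out=json | current_revision)
#   etcdctl compact "$rev"
#   etcdctl defrag        # run against one member at a time
```

Defragmentation blocks the member while it runs, which is one reason to postpone it until the cluster is stable again.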

4.2 Quorum is lost (e.g. only 1 of 3 members left)

Goal: bring the control plane back from a single, consistent copy of the data.

At this point you have two options:

Restore from the last solid snapshot (recommended)
Risky paths such as restarting the surviving member with --force-new-cluster (last resort)

5) Recovery: the snapshot restore approach (recommended)

Prerequisites:

Snapshots must be taken regularly (and copied off to another node/store)
The integrity of the snapshot must be verifiable
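One simple way to make a snapshot verifiable after it has been copied off-host is to store a checksum next to it. This is a minimal sketch using sha256sum; the function names and the /backup path are hypothetical, and it complements rather than replaces etcd's own snapshot status check.

```shell
# Write a SHA-256 checksum file next to the snapshot ($1 = snapshot path).
record_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Succeed only if the snapshot still matches its recorded checksum.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}

# Usage sketch (not executed here):
#   etcdctl snapshot save /backup/etcd-snap.db && record_checksum /backup/etcd-snap.db
#   verify_checksum /backup/etcd-snap.db \
#     && etcdutl snapshot status /backup/etcd-snap.db --write-out=table
```

The usage lines assume etcdutl is available (it ships with etcd 3.5 and later); on older versions the equivalent is etcdctl snapshot status.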

High-level steps:

Stop every ETCD instance (terminate uncontrolled writes)
Pick the snapshot (most recent and intact)
Restore the same snapshot into a fresh ETCD data dir on each member
Bring members back up and confirm a leader is elected
Start the API servers afterward
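The restore step can be sketched as a helper that prints the command for each member instead of running it, so the plan can be reviewed before anything touches disk. The member names, IP addresses, paths, and cluster token below are hypothetical; the flags are those of etcdutl snapshot restore (etcd 3.5 and later), and every member should be restored from the same snapshot file.

```shell
# Print (do not run) the restore command for one member.
# Args: snapshot file, member name, peer URL, initial-cluster string, new data dir.
restore_cmd() {
  printf 'etcdutl snapshot restore %s --name %s --initial-advertise-peer-urls %s --initial-cluster %s --initial-cluster-token %s --data-dir %s\n' \
    "$1" "$2" "$3" "$4" "${RESTORE_TOKEN:-etcd-restore-1}" "$5"
}

# Dry run for the first member of a hypothetical three-member cluster.
restore_cmd /backup/etcd-snap.db m1 https://10.0.0.1:2380 \
  'm1=https://10.0.0.1:2380,m2=https://10.0.0.2:2380,m3=https://10.0.0.3:2380' \
  /var/lib/etcd-restored
```

Using a new cluster token for the restore helps keep restored members from accidentally talking to leftovers of the old cluster.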

6) Avoiding split-brain

The biggest danger during quorum loss is two divergent realities forming.

Avoidance principles:

Don’t run multiple “recovery attempts” in parallel
Don’t mix the old data dir with a new restore
Don’t force a cluster while there’s an active network partition
Don’t re-enable automation against a freshly restored cluster until it’s stable
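Two of these principles can be enforced with small guards in whatever script drives the recovery. This is a sketch with hypothetical function names and paths; the lock is local to one host, so it only prevents parallel attempts on that machine and does not replace coordination between operators.

```shell
# Take an exclusive recovery lock ($1 = lock dir); mkdir is atomic,
# so a second attempt fails while the lock directory exists.
acquire_recovery_lock() {
  mkdir "$1" 2>/dev/null
}

# Succeed only if the target data dir does not exist yet or is an empty directory.
is_fresh_dir() {
  [ ! -e "$1" ] || { [ -d "$1" ] && [ -z "$(ls -A "$1")" ]; }
}

# Usage sketch (not executed here):
#   acquire_recovery_lock /var/run/etcd-recovery.lock || { echo "recovery already in progress"; exit 1; }
#   is_fresh_dir /var/lib/etcd-restored || { echo "refusing to restore into existing data"; exit 1; }
```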

7) Preventive controls (post-incident action list)

Are the ETCD members in a single failure domain? (same storage / same rack / same AZ)
Are there disk latency alarms? (fsync, wal writes)
What’s the snapshot frequency, and are off-host copies in place?
Quorum: design with 3 or 5 members (avoid even counts)
Is compaction/defrag maintenance scheduled?
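The advice to avoid even member counts follows from majority arithmetic: quorum is floor(n/2) + 1, so the number of failures a cluster survives is floor((n-1)/2). A short sketch (the function name is hypothetical) makes the comparison explicit.

```shell
# Members that can fail while a strict majority of $1 members remains.
fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 3 4 5 6; do
  echo "members=$n quorum=$(( n / 2 + 1 )) tolerated_failures=$(fault_tolerance "$n")"
done
```

A four-member cluster tolerates one failure, the same as three members, while requiring a larger quorum; the extra member adds replication cost without adding resilience.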

8) Closing thoughts

ETCD quorum loss usually isn’t “Kubernetes is broken”; it’s a breakdown of disk, network, or maintenance discipline. The runbook’s purpose isn’t fast heroics — it’s bringing production back while preserving consistency. Take snapshots seriously, rehearse restores, and keep your automation’s restart loop under control.
