Mustafa Erbay

Kubernetes ETCD Quorum Loss: Triage and Recovery Runbook

A runbook for quickly diagnosing ETCD quorum during API 5xx/timeout storms and walking through safe recovery steps via snapshot restore.

In Kubernetes, the API server is the visible front door, but the heart of the control plane is ETCD. When quorum is lost, the cluster can look “not entirely down”: nodes still run, pods still serve traffic, yet new deploys, autoscaling, and admission/CRD operations all start failing.

This post is a runbook designed for discipline, not panic, when ETCD loses quorum.

1) Symptoms: when does ETCD become the prime suspect?

  • kubectl timing out / Unable to connect to the server
  • A wave of API 5xx responses (especially etcdserver: request timed out)
  • Erratic behavior from the scheduler/controller-manager
  • A spike in disk latency on the control plane nodes
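
One quick way to separate an ETCD-side failure from a plain API server or network problem is to look at the error text itself. The sketch below assumes you have captured the message from kubectl or from the API server log. The helper name `is_etcd_error` is hypothetical; it only checks for the `etcdserver:` prefix that etcd puts on its own errors.

```shell
# Hypothetical helper: succeeds when an error message carries the
# "etcdserver:" prefix (for example "etcdserver: request timed out").
is_etcd_error() {
  case "$1" in
    *"etcdserver:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: capture the output of any cheap kubectl call and classify it.
# msg=$(kubectl get nodes 2>&1)
# is_etcd_error "$msg" && echo "ETCD is the prime suspect"
```

A plain connection error does not match, which points you toward the load balancer or the API server process instead.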

2) The first 10 minutes: contain the damage

  1. Stop changes: if any automation is running (GitOps/CI), put it on “pause.”
  2. Inspect resource health on control plane nodes: disk fullness, IOPS, CPU steal.
  3. Determine whether the ETCD members are actually alive (logs + endpoint health).
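
For the disk fullness check in step 2, a small helper keeps the number easy to read under pressure. This is a minimal sketch: the function name `fs_used_pct` is made up for illustration, and `/var/lib/etcd` is an assumption (the kubeadm default data dir), so substitute your own path.

```shell
# Print the used-space percentage (integer, no % sign) of the filesystem
# that holds the given directory. `df -P` keeps the column layout stable.
fs_used_pct() {
  df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

# Usage on a control plane node (path is an assumption):
# pct=$(fs_used_pct /var/lib/etcd)
# [ "$pct" -ge 80 ] && echo "WARNING: ETCD data volume is ${pct}% full"
#
# For IOPS and CPU steal, `iostat -x 1 3` and the `st` column of
# `vmstat 1 5` are common starting points where those tools are installed.
```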

3) Triage: ETCD health checks (sample commands)

The TLS parameters vary by environment, but the principle stays the same.

export ETCDCTL_API=3

# Example: run on a control plane node
etcdctl endpoint status --write-out=table
etcdctl endpoint health --write-out=table
etcdctl member list --write-out=table
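
Since the TLS parameters differ per environment, it can help to set them once instead of repeating flags on every command. etcdctl reads its flags from `ETCDCTL_`-prefixed environment variables. The sketch below assumes kubeadm's default certificate paths; the function name `etcd_env` and the example IP addresses are placeholders, not values from this runbook.

```shell
# Export connection settings once so every etcdctl call picks them up.
# All paths and the default endpoint are assumptions (kubeadm defaults).
etcd_env() {
  export ETCDCTL_API=3
  export ETCDCTL_ENDPOINTS="${1:-https://127.0.0.1:2379}"
  export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
  export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
  export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
}

# Usage: list every member endpoint explicitly so each one is probed.
# etcd_env "https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379"
# etcdctl endpoint health --write-out=table
```

Listing all endpoints explicitly matters during an incident: a check against only the local member says nothing about the other two.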

Reading the output:

  • A subset of members is “unhealthy” → quorum is at risk, and it is lost once fewer than a majority (2 of 3, 3 of 5) remain healthy
  • No leader / constant leader churn → usually a network or disk problem
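
The quorum arithmetic behind that reading is simple enough to script. A minimal sketch, with hypothetical function names; you supply the member and healthy counts yourself from the tables above.

```shell
# Majority needed for a cluster of N voting members: floor(N / 2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds when the healthy member count still forms a majority.
# $1 = total members, $2 = healthy members
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

# Usage:
# has_quorum 3 2 && echo "quorum intact"
# has_quorum 3 1 || echo "quorum lost"
```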

4) Decision tree: do we still have quorum?

4.1 Quorum is intact (e.g. 2 of 3 members healthy)

Goal: stabilize the cluster.

  • Isolate the bad member (don’t drain the node; isolate ETCD’s connectivity)
  • Bring disk latency down (noisy neighbor, storage problems)
  • Plan ETCD compaction/defrag if needed (not now; only after stabilization)
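
For the compaction/defrag step, once things are calm, the usual pattern is to read the current revision, compact up to it, and then defragment. The helper below is a sketch that pulls the revision out of etcdctl's JSON status output; the name `latest_revision` is hypothetical, and the `"revision"` field name is what etcd 3.x emits as far as I know, so verify it against your version.

```shell
# Read `etcdctl endpoint status --write-out=json` on stdin and print the
# highest store revision found in it.
latest_revision() {
  grep -o '"revision":[0-9]*' | cut -d: -f2 | sort -n | tail -n 1
}

# Usage, only once the cluster is stable again:
# rev=$(etcdctl endpoint status --write-out=json | latest_revision)
# etcdctl compact "$rev"   # discard key history older than $rev
# etcdctl defrag           # release the freed space; do one member at a time
# etcdctl alarm list       # check for a lingering NOSPACE alarm
```

Defragmentation blocks the member while it runs, which is why it belongs after stabilization and not in the middle of the incident.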

4.2 Quorum is lost (e.g. only 1 of 3 members left)

Goal: restore the control plane through a consistent restore.

At this point you have two options:

  • Restore from the last solid snapshot (recommended)
  • Risky paths like starting the surviving member with --force-new-cluster (last resort)

Prerequisites:

  • Snapshots must be taken regularly (and copied off to another node/store)
  • The integrity of the snapshot must be verifiable
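
Both prerequisites can be covered with a checksum written next to each snapshot. This is a sketch under assumptions: `sha256sum` from GNU coreutils is available, the helper names are hypothetical, and the backup path is a placeholder. `etcdutl snapshot status` ships with etcd 3.5 and later; older releases expose the same subcommand under etcdctl.

```shell
# Write a SHA-256 checksum next to a snapshot file so copies can be verified.
write_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Succeeds only when the file still matches its recorded checksum.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}

# Usage (paths are placeholders):
# etcdctl snapshot save /var/backups/etcd/snap.db
# etcdutl snapshot status /var/backups/etcd/snap.db --write-out=table
# write_checksum /var/backups/etcd/snap.db
# ...copy snap.db and snap.db.sha256 off the node, then at the destination:
# verify_checksum /path/to/snap.db || echo "snapshot copy is corrupt"
```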

High-level steps:

  1. Stop the API servers and every ETCD instance (terminate uncontrolled writes)
  2. Pick the snapshot (most recent and intact)
  3. Restore into a fresh ETCD data dir
  4. Bring members back up and confirm a leader is elected
  5. Start the API servers afterward
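
Step 3 can be sketched as a thin wrapper around `etcdutl snapshot restore` that refuses to touch an existing directory. The wrapper name `restore_snapshot` is hypothetical, and every member name, URL, and path in the usage is a placeholder. On releases before etcd 3.5, `etcdctl snapshot restore` takes the same flags.

```shell
# Restore a snapshot into a data dir that must not exist yet, so a fresh
# restore can never be layered on top of old member data.
# $1 = snapshot file, $2 = new data dir; remaining args pass through to etcdutl.
restore_snapshot() {
  snap=$1; dir=$2; shift 2
  [ -f "$snap" ] || { echo "snapshot not found: $snap" >&2; return 1; }
  [ ! -e "$dir" ] || { echo "refusing to reuse existing dir: $dir" >&2; return 1; }
  etcdutl snapshot restore "$snap" --data-dir "$dir" "$@"
}

# Usage on each member, with that member's own name and peer URL:
# restore_snapshot /var/backups/etcd/snap.db /var/lib/etcd-restored \
#   --name cp1 \
#   --initial-cluster cp1=https://10.0.0.1:2380,cp2=https://10.0.0.2:2380,cp3=https://10.0.0.3:2380 \
#   --initial-cluster-token restored-cluster \
#   --initial-advertise-peer-urls https://10.0.0.1:2380
```

Every member should be restored from the same snapshot file, and a new cluster token keeps restored members from talking to any leftover instance of the old cluster.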

6) Avoiding split-brain

The biggest danger during quorum loss is two divergent realities forming.

Avoidance principles:

  • Don’t run multiple “recovery attempts” in parallel
  • Don’t mix the old data dir with a new restore
  • Don’t force a cluster while there’s an active network partition
  • Don’t re-enable automation against a freshly restored cluster until it’s stable
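
Before re-enabling anything, one cheap consistency check is to confirm that every endpoint reports the same cluster ID and the same leader. The sketch below counts distinct values in etcdctl's JSON status output. The helper name `distinct_count` is hypothetical, and the field names `cluster_id` and `leader` are assumptions based on etcd 3.x output, so check them against your version first.

```shell
# Read `etcdctl endpoint status --write-out=json` on stdin and print how many
# distinct values the given numeric field takes across endpoints.
distinct_count() {
  grep -o "\"$1\":[0-9]*" | sort -u | wc -l | tr -d ' '
}

# Usage: a consistent cluster reports exactly one cluster ID and one leader.
# status=$(etcdctl endpoint status --write-out=json)
# [ "$(printf '%s' "$status" | distinct_count cluster_id)" = "1" ] \
#   || echo "members disagree on cluster ID"
# [ "$(printf '%s' "$status" | distinct_count leader)" = "1" ] \
#   || echo "members disagree on leader"
```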

7) Preventive controls (post-incident action list)

  • Are the ETCD members in a single failure domain? (same storage / same rack / same AZ)
  • Are there disk latency alarms? (fsync, wal writes)
  • What’s the snapshot frequency, and are off-host copies in place?
  • Quorum: design with 3 or 5 members (avoid even counts)
  • Is compaction/defrag maintenance scheduled?
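
The advice to avoid even member counts falls out of the majority rule: an extra member raises the quorum size without raising the number of failures you can survive. A small sketch (the function name is hypothetical) makes the trade-off visible.

```shell
# Number of member failures a cluster of N can survive while keeping quorum:
# N minus the majority, which works out to floor((N - 1) / 2).
fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

# Print the trade-off for common cluster sizes.
for n in 3 4 5 6; do
  echo "members=$n quorum=$(( n / 2 + 1 )) tolerates=$(fault_tolerance "$n")"
done
```

Four members tolerate one failure, the same as three, while adding one more node that can fail.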

8) Closing thoughts

ETCD quorum loss usually isn’t “Kubernetes is broken”; it’s a breakdown of disk, network, or maintenance discipline. The runbook’s purpose isn’t fast heroics — it’s bringing production back while preserving consistency. Take snapshots seriously, rehearse restores, and keep your automation’s restart loop under control.
