Mustafa Erbay

PostgreSQL HA: Failover Runbook with Patroni

Walks through quorum, replication lag, switchover/failover testing and recovery steps when running PostgreSQL high availability with Patroni, in runbook form.

Conversations about PostgreSQL high availability (HA) usually shrink down to “is replication enabled?” But in production, the real question is: when the primary disappears, who becomes the leader and how fast does the application recover?

Patroni answers that question with automatic leader election and controlled failover/switchover. But Patroni has to be operated from a runbook, not treated as “set up and forget”; otherwise, during a crisis, you risk split-brain, data loss, or an extended outage.

1) Prerequisites: HA design red lines

This post is vendor/environment-agnostic, but a few facts hold:

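  • Patroni elects leaders through a DCS (Distributed Configuration Store) such as etcd, Consul, or ZooKeeper
  • If DCS quorum breaks, HA stops being “automatic”: no new leader can be elected, and by default a primary that cannot reach the DCS demotes itself
  • Replica lag directly affects the failover decision: a replica that lags beyond Patroni's maximum_lag_on_failover setting is not eligible for promotion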
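The quorum point is easier to reason about with the arithmetic written out. This is a minimal sketch, assuming a majority-based DCS; the function names are illustrative and not part of Patroni or any DCS client.

```python
def quorum_size(dcs_members: int) -> int:
    """Smallest majority of a DCS cluster with the given member count."""
    if dcs_members < 1:
        raise ValueError("a DCS cluster needs at least one member")
    return dcs_members // 2 + 1


def tolerated_failures(dcs_members: int) -> int:
    """How many DCS members can be lost while a majority still exists."""
    return dcs_members - quorum_size(dcs_members)


# Print member count, quorum size, and tolerated failures for common sizes.
for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that two DCS members tolerate no more failures than one, which is why odd member counts (3 or 5) are the usual choice.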

2) Minimum observation set

The minimum signals to operate Patroni in production:

  • DCS health: quorum, leader, latency
  • Patroni cluster state: primary/replica role distribution
  • Replication lag (byte/time)
  • DB health: connection count, locks, WAL, disk
  • Application side: reconnect behaviour and timeouts

3) Daily check: “is everything OK?”

3.1 Cluster view

At the very least, set this rhythm:

  • Look at patronictl list output once a day
  • Make sure “which node is the leader?” is always crystal clear
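The “who is the leader?” question can also be checked by a script instead of by eye. The sketch below assumes the cluster state has already been fetched as JSON (for example from Patroni's REST API `/cluster` endpoint) and that each member carries `name` and `role` fields; the sample data and function names are hypothetical, and older Patroni versions may report a different role name for the primary.

```python
from __future__ import annotations


def find_leaders(cluster: dict) -> list[str]:
    """Return the names of all members that report a leader role."""
    leader_roles = {"leader", "standby_leader"}  # assumed role names
    return [
        m["name"]
        for m in cluster.get("members", [])
        if m.get("role") in leader_roles
    ]


def leader_is_unambiguous(cluster: dict) -> bool:
    """True only when exactly one member claims leadership."""
    return len(find_leaders(cluster)) == 1


# Hypothetical cluster snapshot, shaped like the assumed JSON.
sample = {
    "members": [
        {"name": "pg1", "role": "leader", "state": "running"},
        {"name": "pg2", "role": "replica", "state": "streaming", "lag": 0},
    ]
}
print(find_leaders(sample), leader_is_unambiguous(sample))
```

Zero leaders and two leaders are both alert conditions, which is why the check is “exactly one” rather than “at least one”.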

3.2 Replica lag

The most critical risk during a failover is “a replica that fell behind becoming the leader.” So:

  • Set an SLO for lag (e.g. p95 < 1s or < 16MB)
  • When lag rises, treat it as “failover capability has degraded”
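A lag SLO like the one above needs a concrete evaluation rule. The following is a minimal sketch, assuming lag samples in bytes are already being collected by your monitoring; it uses the nearest-rank percentile method, and the 16MB default mirrors the example threshold rather than any Patroni default.

```python
from __future__ import annotations


def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile of integer samples (pct is 1..100)."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def lag_slo_met(lag_bytes: list[int], max_bytes: int = 16 * 1024 * 1024) -> bool:
    """True when the p95 of the lag samples is below the threshold."""
    return percentile(lag_bytes, 95) < max_bytes


MB = 1024 * 1024
healthy = [1 * MB] * 19 + [64 * MB]   # one outlier out of 20 samples
degraded = [1 * MB] * 18 + [64 * MB] * 2
print(lag_slo_met(healthy), lag_slo_met(degraded))
```

When this check fails, the useful alert text is “failover capability has degraded”, not just “lag is high”.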

4) Planned switch: switchover runbook

Switchover is a controlled primary change. It is the test that produces the most value in production because:

  • It runs under real user traffic
  • The risk is more manageable

Minimum flow:

  1. DCS quorum is healthy
  2. Replica lag is low
  3. Application connection pool retry strategy verified
  4. Start the switchover (patronictl switchover)
  5. Verify new leader: read/write test
  6. Confirm the old leader has become a replica
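The first three steps of this flow are preconditions, so they can be expressed as a gate that must pass before anyone starts the switchover. This is a sketch under stated assumptions: the `Preflight` fields are hypothetical inputs you would fill from your own monitoring, and the lag threshold is an example value.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Preflight:
    dcs_quorum_healthy: bool   # step 1
    candidate_lag_bytes: int   # step 2
    pool_retry_verified: bool  # step 3


def switchover_blockers(p: Preflight, max_lag_bytes: int = 16 * 1024 * 1024) -> list[str]:
    """Return the reasons a switchover should not start; empty means go."""
    blockers = []
    if not p.dcs_quorum_healthy:
        blockers.append("DCS quorum is not healthy")
    if p.candidate_lag_bytes > max_lag_bytes:
        blockers.append("candidate replica lag exceeds the threshold")
    if not p.pool_retry_verified:
        blockers.append("application pool retry strategy not verified")
    return blockers


ready = Preflight(True, 0, True)
not_ready = Preflight(True, 64 * 1024 * 1024, False)
print(switchover_blockers(ready))
print(switchover_blockers(not_ready))
```

Returning a list of reasons instead of a single boolean keeps the drill log readable: the operator sees exactly which precondition blocked the switch.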

5) Unplanned loss: failover triage runbook

5.1 First 5 minutes checklist

  • Is the primary truly down? Or is it a network/DNS issue?
  • Is DCS quorum present? (without quorum, automatic failover cannot proceed)
  • Is replica lag abnormal? (high lag -> data loss risk)
  • Is the application stuck in “read-only” mode or hitting a retry storm?
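The retry storm in the last item is usually tamed on the client side. As a minimal sketch, assuming the application controls its own reconnect loop, capped exponential backoff with full jitter spreads reconnect attempts out instead of letting every client retry at the same instant; the function name and default values are illustrative.

```python
from __future__ import annotations

import random


def backoff_delays(
    attempts: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: random.Random | None = None,
) -> list[float]:
    """Delays in seconds for each retry: uniform in [0, min(cap, base * 2**i)]."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]


# A seeded generator makes the schedule reproducible for tests and drills.
delays = backoff_delays(8, rng=random.Random(0))
print([round(d, 2) for d in delays])
```

The cap matters during a failover: without it, clients that started retrying early can end up waiting minutes after the new leader is already accepting connections.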

5.2 The most common mistake: treating a network partition as “DB down”

If the primary is alive but the team can’t reach it:

  • Patroni may promote another node to leader
  • The primary may keep believing it is the leader

This is the classic ground for split-brain. So if a partition is suspected:

  • First validate the network/DCS side
  • If needed, pause automatic failover (patronictl pause) and take manual control
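One way to avoid mistaking a partition for a dead database is to probe the primary from more than one vantage point before acting. The sketch below is a heuristic under stated assumptions, not Patroni behaviour: the probe results, vantage-point names, and verdict labels are all hypothetical.

```python
from __future__ import annotations


def classify_primary_loss(probes: dict[str, bool], dcs_has_quorum: bool) -> str:
    """Classify an incident from reachability probes taken at several locations."""
    if not probes:
        raise ValueError("at least one probe result is required")
    if not dcs_has_quorum:
        # Without quorum, validate the DCS side before trusting anything else.
        return "dcs-degraded"
    reachable = sum(1 for ok in probes.values() if ok)
    if reachable == len(probes):
        return "primary-reachable"
    if reachable == 0:
        return "primary-unreachable-everywhere"
    # Some locations see the primary and others do not.
    return "partition-suspected"


probes = {"app-subnet": False, "replica-host": True, "bastion": False}
print(classify_primary_loss(probes, dcs_has_quorum=True))
```

A “partition-suspected” verdict is the cue to validate the network and DCS first, rather than to push a failover through.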

6) Recovery: stabilization after failover

A successful failover does not mean the job is done:

  • The old primary’s rejoin to the cluster must be controlled
  • Application pools must be properly attached to the new leader
  • Replica count and distribution must be brought back to “expected”

Minimum checklist:

  • New primary is taking writes
  • Replicas are streaming
  • Lag is back to its normal range
  • No connection pool saturation
  • Backup jobs run against the new primary
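Part of this checklist can be evaluated mechanically from the cluster state. This is a rough sketch, assuming a JSON snapshot with `role`, `state`, and `lag` (bytes) per member, similar to what Patroni's REST API `/cluster` endpoint returns; the accepted state names, thresholds, and sample data are assumptions. Pool saturation and backup targets still need their own checks.

```python
from __future__ import annotations


def stabilization_issues(cluster: dict, expected_replicas: int, max_lag_bytes: int) -> list[str]:
    """Return open issues after a failover; an empty list means stable."""
    issues = []
    members = cluster.get("members", [])
    leaders = [m for m in members if m.get("role") == "leader"]
    if len(leaders) != 1:
        issues.append(f"expected 1 leader, found {len(leaders)}")
    replicas = [m for m in members if m.get("role") in ("replica", "sync_standby")]
    if len(replicas) != expected_replicas:
        issues.append(f"expected {expected_replicas} replicas, found {len(replicas)}")
    for r in replicas:
        if r.get("state") not in ("running", "streaming"):
            issues.append(f"{r['name']} is not streaming")
        lag = r.get("lag")
        if not isinstance(lag, int) or lag > max_lag_bytes:
            issues.append(f"{r['name']} lag is unknown or above threshold")
    return issues


after_failover = {
    "members": [
        {"name": "pg2", "role": "leader", "state": "running"},
        {"name": "pg1", "role": "replica", "state": "streaming", "lag": 0},
    ]
}
print(stabilization_issues(after_failover, expected_replicas=2, max_lag_bytes=16 * 1024 * 1024))
```

In the sample, the only open issue is the replica count: the cluster works, but it has not yet been brought back to its expected shape.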

PostgreSQL HA is not just “replication is on.” With Patroni, the real win is leader election, automated decisions, and runbook discipline. When you treat quorum as a first-class design concern and make switchover drills routine, HA actually holds up in production.

Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security and Software

Since 2006 I have worked on system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that holds up in the field.
