Tutorials · April 17, 2026 · 13 min read


PostgreSQL HA: Failover Runbook with Patroni

Walks through quorum, replication lag, switchover/failover testing and recovery steps when running PostgreSQL high availability with Patroni, in runbook form.

#database #postgresql #patroni #ha #failover #runbook #operations


Conversations about PostgreSQL high availability (HA) usually shrink down to “is replication enabled?” But in production, the real question is: when the primary disappears, who becomes the leader and how fast does the application recover?

Patroni solves this with automatic leader election and controlled failover/switchover. But you need to operate Patroni with a runbook, not as “set up and forget”; otherwise a crisis can turn into split-brain, data loss, or an extended outage.

1) Prerequisites: HA design red lines

This post is vendor/environment-agnostic, but a few facts hold:

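Patroni elects the leader through a DCS (Distributed Configuration Store) such as etcd, Consul, or ZooKeeper
If DCS quorum breaks, HA stops being “automatic”: no new leader can be elected, and by default a primary that cannot renew its leader key demotes itself
Replica lag directly affects the failover decision: during automatic failover Patroni skips replicas that lag more than maximum_lag_on_failover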
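
To make the quorum point concrete, here is a minimal sketch of the majority arithmetic that consensus-based stores such as etcd rely on. The function names are illustrative and not part of Patroni or any DCS client.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a DCS cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a DCS cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still exists."""
    return members - quorum_size(members)


if __name__ == "__main__":
    # Columns: members, quorum size, tolerated failures
    for n in (1, 2, 3, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

Note that a two-member DCS tolerates no failures at all, which is why three members is the usual practical minimum.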

2) Minimum observation set

The minimum signals to operate Patroni in production:

DCS health: quorum, leader, latency
Patroni cluster state: primary/replica role distribution
Replication lag (byte/time)
DB health: connection count, locks, WAL, disk
Application side: reconnect behaviour and timeouts

3) Daily check: “is everything OK?”

3.1 Cluster view

At the very least, set this rhythm:

Look at patronictl list output once a day
Make sure “which node is the leader?” is always crystal clear
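
The “exactly one leader” check can also be automated. The sketch below assumes the JSON shape returned by Patroni's REST API at GET /cluster (a "members" list whose entries carry "name" and "role"); role names can differ between Patroni versions, so treat LEADER_ROLES as an assumption to verify against your own cluster. The node names are hypothetical.

```python
from typing import Any

# Assumption: roles that count as "the leader" in the /cluster payload.
LEADER_ROLES = {"leader", "standby_leader"}


def current_leader(cluster: dict[str, Any]) -> str:
    """Return the single leader's name, or raise if the answer is not crystal clear."""
    leaders = [
        member["name"]
        for member in cluster.get("members", [])
        if member.get("role") in LEADER_ROLES
    ]
    if len(leaders) != 1:
        raise RuntimeError(f"expected exactly one leader, found {leaders}")
    return leaders[0]


if __name__ == "__main__":
    # Hand-written sample payload; in practice it would be fetched from
    # a Patroni node's REST API (port 8008 by default) and parsed with json.
    sample = {
        "members": [
            {"name": "pg1", "role": "leader", "state": "running"},
            {"name": "pg2", "role": "replica", "state": "running"},
        ]
    }
    print(current_leader(sample))
```

Raising on zero or multiple leaders is deliberate: both cases deserve a human look rather than a silent default.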

3.2 Replica lag

The most critical risk during a failover is “a replica that fell behind becoming the leader.” So:

Set an SLO for lag (e.g. p95 < 1s or < 16MB)
When lag rises, treat it as “failover capability has degraded”
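
As a rough illustration of a byte-based lag SLO, the sketch below converts PostgreSQL LSNs (the X/Y hexadecimal form) into byte positions, which is the same difference pg_wal_lsn_diff computes in SQL, and evaluates a nearest-rank p95 against the 16MB example threshold. The helper names and the choice of nearest-rank percentile are assumptions made for illustration.

```python
LAG_SLO_BYTES = 16 * 1024 * 1024  # the 16MB example threshold from the text


def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has not replayed yet (never negative)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn))


def p95(samples: list[int]) -> int:
    """Nearest-rank 95th percentile, computed with integer arithmetic."""
    if not samples:
        raise ValueError("no lag samples collected")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def lag_within_slo(samples: list[int]) -> bool:
    """True while the p95 of the sampled lag stays under the SLO."""
    return p95(samples) < LAG_SLO_BYTES
```

When lag_within_slo turns false, the alert should be worded as “failover capability has degraded”, not merely “replica is slow”.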

4) Planned switch: switchover runbook

Switchover is a controlled primary change. It is the test that produces the most value in production because:

It runs under real user traffic
The risk is more manageable

Minimum flow:

Confirm DCS quorum is healthy
Confirm replica lag is low
Verify the application's connection pool retry strategy
Start the switchover (patronictl switchover)
Verify the new leader with a read/write test
Confirm the old leader has rejoined as a replica
5) Unplanned loss: failover triage runbook

5.1 First 5 minutes checklist

Is the primary truly down? Or is it a network/DNS issue?
Is DCS quorum present? (without quorum, automatic failover cannot proceed)
Is replica lag abnormal? (high lag -> data loss risk)
Is the application stuck in “read-only” mode or hitting a retry storm?
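
The first three questions of this checklist can be sketched as a small classification function. It assumes you probe the primary from more than one vantage point (for example an application subnet and a database subnet); the finding codes and parameter names are illustrative only.

```python
def triage(
    reachable_from: dict[str, bool],
    dcs_has_quorum: bool,
    worst_replica_lag_bytes: int,
    lag_limit_bytes: int,
) -> list[str]:
    """Turn first-five-minutes observations into a list of finding codes."""
    if not reachable_from:
        raise ValueError("probe the primary from at least one vantage point")

    findings = []
    outcomes = set(reachable_from.values())
    if outcomes == {True}:
        findings.append("PRIMARY_UP")  # look at the application side instead
    elif outcomes == {False}:
        findings.append("PRIMARY_DOWN")  # only meaningful with several probes
    else:
        findings.append("PARTITION_SUSPECTED")  # mixed results: network or DNS

    if not dcs_has_quorum:
        findings.append("NO_DCS_QUORUM")  # automatic failover cannot proceed
    if worst_replica_lag_bytes > lag_limit_bytes:
        findings.append("HIGH_LAG")  # promoting now risks losing data
    return findings
```

A mixed reachability result is the signal to slow down: it points at the network, not at PostgreSQL.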

5.2 The most common mistake: treating a network partition as “DB down”

If the primary is alive but cannot be reached over the network:

Patroni may promote another node to leader
The old primary may keep believing it is the leader

This is the classic ground for split-brain. So if a partition is suspected:

First validate the network/DCS side
Pause automatic failover if needed (patronictl pause) and take manual control
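
To see why validating the DCS side comes first, it helps to model the leader key as a lease with a time-to-live. The sketch below is a deliberately simplified model of lease-based leadership, not Patroni's actual code; the 30-second default mirrors Patroni's default ttl, and everything else (names, the single shared clock) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LeaderLease:
    holder: str        # node that currently owns the leader key
    renewed_at: float  # time of the last successful renewal, in seconds
    ttl: float = 30.0  # lease lifetime

    def expired(self, now: float) -> bool:
        return now - self.renewed_at >= self.ttl


def may_accept_writes(node: str, lease: LeaderLease, now: float, dcs_reachable: bool) -> bool:
    """A node should act as primary only while it can prove it holds a live lease."""
    return dcs_reachable and lease.holder == node and not lease.expired(now)


def may_promote(lease: LeaderLease, now: float, dcs_reachable: bool) -> bool:
    """A replica may race for leadership only once the old lease has expired."""
    return dcs_reachable and lease.expired(now)
```

In this model a partitioned primary loses the right to accept writes as soon as it cannot reach the DCS, while replicas must wait out the ttl. Split-brain becomes possible when that self-demotion does not actually happen, for example because the agent on the old primary is hung.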

6) Recovery: stabilization after failover

A successful failover does not mean the job is done:

The old primary must rejoin the cluster in a controlled way (it may need pg_rewind or a reinitialization before it can follow the new leader)
Application pools must be properly attached to the new leader
Replica count and distribution must be brought back to “expected”

Minimum checklist:

New primary is taking writes
Replicas are streaming
Lag is back to its normal range
No connection pool saturation
Backup jobs run against the new primary
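
Part of this checklist can be evaluated from two standard PostgreSQL sources: the result of pg_is_in_recovery() on the new primary and the rows of pg_stat_replication. The sketch below takes those as plain Python values so it stays independent of any database driver; the function name and the expected_replicas parameter are hypothetical.

```python
def stabilization_issues(
    in_recovery: bool,
    replication_rows: list[dict],
    expected_replicas: int,
) -> list[str]:
    """Check the new primary's role and how many replicas are streaming from it."""
    issues = []
    if in_recovery:
        # pg_is_in_recovery() returning true means this node is still a standby.
        issues.append("node is in recovery, so it cannot take writes")

    # pg_stat_replication reports state = 'streaming' for a caught-up WAL sender.
    streaming = [row for row in replication_rows if row.get("state") == "streaming"]
    if len(streaming) < expected_replicas:
        issues.append(
            f"only {len(streaming)} of {expected_replicas} replicas are streaming"
        )
    return issues
```

A false result from pg_is_in_recovery() only shows the node is not a standby, so a real write test against the new primary is still worth running.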

PostgreSQL HA is not just “replication is on.” With Patroni, the real win is leader election, automated decisions, and runbook discipline. When you design quorum as first-class and turn switchover drills into a routine, HA actually becomes HA in production.

Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security and Software

Since 2006 I have worked on systems architecture, networking, server infrastructure, large-scale deployments, and software and system security. On this blog I share technical experience that holds up in the field.
