Conversations about PostgreSQL high availability (HA) usually get reduced to “is replication enabled?” But in production, the real question is: when the primary disappears, who becomes the leader, and how fast does the application recover?
Patroni solves this with automatic leader election and controlled failover/switchover. But you need to operate Patroni with a runbook, not as “set up and forget”; otherwise, in a crisis, you end up with split-brain, data loss, or an extended outage.
1) Prerequisites: HA design red lines
This post is vendor/environment-agnostic, but a few facts hold:
- Patroni elects leaders through a DCS (Distributed Configuration Store) (etcd/Consul/ZooKeeper, etc.)
- If DCS quorum breaks, HA stops being “automatic”
- Replica lag directly affects the failover decision
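To make the quorum point concrete, here is a minimal sketch of the arithmetic, assuming a majority-based DCS (which is how etcd, Consul, and ZooKeeper clusters reach agreement). The function names are illustrative and not part of Patroni.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a DCS cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still exists."""
    return members - quorum_size(members)


# 2 members -> quorum 2, survives 0 losses
# 3 members -> quorum 2, survives 1 loss
# 5 members -> quorum 3, survives 2 losses
for n in (2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

The practical takeaway is that a two-node DCS tolerates no failures at all, so it gives no more protection than a single node.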
2) Minimum observation set
The minimum signals to operate Patroni in production:
- DCS health: quorum, leader, latency
- Patroni cluster state: primary/replica role distribution
- Replication lag (byte/time)
- DB health: connection count, locks, WAL, disk
- Application side: reconnect behaviour and timeouts
3) Daily check: “is everything OK?”
3.1 Cluster view
At the very least, set this rhythm:
- Look at the patronictl list output once a day
- Make sure “which node is the leader?” is always crystal clear
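The same daily check can be scripted. The sketch below assumes the Patroni REST API is reachable and that its /cluster endpoint returns a JSON document with a "members" list whose entries carry "name" and "role" fields. The helper names and the sample data are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_cluster(base_url: str, timeout: float = 3.0) -> dict:
    """GET <base_url>/cluster from a Patroni REST API and decode the JSON."""
    with urlopen(f"{base_url}/cluster", timeout=timeout) as resp:
        return json.load(resp)


def summarize_cluster(cluster: dict) -> dict:
    """Group member names by role and flag a missing or ambiguous leader."""
    by_role = {}
    for member in cluster.get("members", []):
        role = member.get("role", "unknown")
        by_role.setdefault(role, []).append(member.get("name", "?"))
    # "leader" is the regular primary; "standby_leader" leads a standby cluster.
    leaders = by_role.get("leader", []) + by_role.get("standby_leader", [])
    return {
        "by_role": by_role,
        "leaders": leaders,
        "leader_is_clear": len(leaders) == 1,
    }


# Made-up sample in the assumed shape of a /cluster response.
sample = {
    "members": [
        {"name": "pg1", "role": "leader", "state": "running"},
        {"name": "pg2", "role": "replica", "state": "streaming", "lag": 0},
    ]
}
print(summarize_cluster(sample))
```

Against a live cluster, the call would look like summarize_cluster(fetch_cluster("http://pg1.example.internal:8008")), where the host name is hypothetical and 8008 is Patroni's default REST API port.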
3.2 Replica lag
The most critical risk during a failover is “a replica that fell behind becoming the leader.” So:
- Set an SLO for lag (e.g. p95 < 1s or < 16MB)
- When lag rises, treat it as “failover capability has degraded”
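A lag SLO like this can be evaluated with a few lines of code. This is a sketch only: it assumes you already collect lag samples in bytes from your monitoring, uses the nearest-rank definition of a percentile, and takes the 16 MB threshold from the example above.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


LAG_SLO_BYTES = 16 * 1024 * 1024  # the "< 16MB" example from the text


def failover_capable(lag_samples_bytes: list[float],
                     slo_bytes: float = LAG_SLO_BYTES) -> bool:
    """True while p95 replica lag stays under the SLO."""
    return percentile(lag_samples_bytes, 95) < slo_bytes
```

A single spike does not trip the check, but a sustained rise does, which matches the idea of treating rising lag as degraded failover capability rather than as a one-off alert.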
4) Planned switch: switchover runbook
Switchover is a controlled primary change. It is the test that produces the most value in production because:
- It runs under real user traffic
- The risk is more manageable than in an unplanned failover
Minimum flow:
- DCS quorum is healthy
- Replica lag is low
- Application connection pool retry strategy verified
- Start the switchover (patronictl switchover)
- Verify new leader: read/write test
- Confirm the old leader has become a replica
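The first three items of this flow are preconditions, and they can be gated in code before anyone types the switchover command. The sketch below is one possible shape. The Preflight fields and the 16 MB lag limit are assumptions for illustration, and collecting the inputs is left to your own tooling.

```python
from dataclasses import dataclass


@dataclass
class Preflight:
    dcs_quorum_healthy: bool
    max_replica_lag_bytes: int
    pool_retry_verified: bool


def switchover_blockers(p: Preflight,
                        lag_limit_bytes: int = 16 * 1024 * 1024) -> list[str]:
    """Return the reasons not to start; an empty list means go."""
    blockers = []
    if not p.dcs_quorum_healthy:
        blockers.append("DCS quorum is not healthy")
    if p.max_replica_lag_bytes >= lag_limit_bytes:
        blockers.append("replica lag is above the limit")
    if not p.pool_retry_verified:
        blockers.append("connection pool retry strategy not verified")
    return blockers


print(switchover_blockers(Preflight(True, 0, False)))
```

Only when the list comes back empty would the operator run the switchover and continue with the read/write test and the check on the old leader.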
5) Unplanned loss: failover triage runbook
5.1 First 5 minutes checklist
- Is the primary truly down? Or is it a network/DNS issue?
- Is DCS quorum present? (without quorum, automatic failover can stop)
- Is replica lag abnormal? (high lag -> data loss risk)
- Is the application stuck in “read-only” mode or hitting a retry storm?
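The retry storm in the last item is easier to rule out when clients back off with jitter instead of reconnecting in lockstep. Here is a minimal sketch of exponential backoff with full jitter. The connect callable, the parameter defaults, and the choice of ConnectionError are assumptions; a real driver or pool has its own retry settings and exception types.

```python
import random
import time


def backoff_delays(attempts: int, base: float = 0.2, cap: float = 10.0,
                   rng=random.random):
    """Yield one delay per attempt, uniform in [0, min(cap, base * 2**attempt))."""
    for attempt in range(attempts):
        yield rng() * min(cap, base * 2 ** attempt)


def connect_with_retry(connect, attempts: int = 8, sleep=time.sleep):
    """Call connect() up to `attempts` times, sleeping between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for delay in backoff_delays(attempts - 1):
        try:
            return connect()
        except ConnectionError:
            sleep(delay)
    return connect()  # final try: let the error propagate to the caller
```

The randomisation spreads reconnects out over time, so a freshly promoted leader is not hit by every client at the same instant.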
5.2 The most common mistake: treating a network partition as “DB down”
If the primary is alive but the team can’t reach it:
- Patroni may promote another node to leader
- The primary may keep believing it is the leader
This is the classic ground for split-brain. So if a partition is suspected:
- First validate the network/DCS side
- Pause automatic failover if needed (patronictl pause) and take manual control
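One way to validate a suspected partition is to ask every node what it believes about itself instead of trusting a single vantage point. The sketch below assumes you have already collected each node's self-reported role, for example from the "role" field of its own Patroni REST API status, with None for nodes you could not reach. The set of role strings treated as "leader" is an assumption that covers both older ("master") and newer ("primary") naming.

```python
# Role strings treated as "this node thinks it leads" (assumed values).
LEADER_ROLES = {"primary", "master", "standby_leader"}


def self_declared_leaders(roles_by_node: dict) -> list:
    """Names of nodes whose own status claims a leader role."""
    return sorted(n for n, role in roles_by_node.items() if role in LEADER_ROLES)


def partition_verdict(roles_by_node: dict) -> str:
    """Classify the collected views into a short triage message."""
    leaders = self_declared_leaders(roles_by_node)
    unreachable = sorted(n for n, role in roles_by_node.items() if role is None)
    if len(leaders) > 1:
        return "possible split-brain: " + ", ".join(leaders)
    if not leaders:
        return "no node claims leadership"
    if unreachable:
        return f"leader {leaders[0]}, but unreachable: " + ", ".join(unreachable)
    return f"single leader: {leaders[0]}"


print(partition_verdict({"pg1": "primary", "pg2": "primary", "pg3": "replica"}))
```

An unreachable node is reported separately on purpose: from the operator's seat it may look "down" while it is still up and accepting writes on the other side of the partition.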
6) Recovery: stabilization after failover
A successful failover does not mean the job is done:
- The old primary’s rejoin to the cluster must be controlled
- Application pools must be properly attached to the new leader
- Replica count and distribution must be brought back to “expected”
Minimum checklist:
- New primary is taking writes
- Replicas are streaming
- Lag is back to its normal range
- No connection pool saturation
- Backup jobs run against the new primary
PostgreSQL HA is not just “replication is on.” With Patroni, the real win is leader election, automated decisions, and runbook discipline. When you treat quorum as a first-class design concern and turn switchover drills into a routine, HA actually becomes HA in production.