PostgreSQL WAL Archiving and a Point-in-Time Recovery Drill

Taking a backup and being able to come back from one are not the same thing. In PostgreSQL, that gap shows most clearly in Point-in-Time Recovery (PITR). There is a snapshot, but you can’t roll back to the exact minute you wanted; there is WAL, but the chain has gaps; there is a restore, but the application team can’t tell when to reopen traffic.

In this post I’m covering the PITR flow that actually works in production, built on WAL archiving + base backup + drill discipline. The goal isn’t being able to say “we have backups,” it’s being able to say “we can confidently get back to that point in time.”

1) Start with the goal: how far back, in how much time?

PITR design doesn’t start with technical knobs, it starts with two operational questions:

RPO: How many minutes of data loss is acceptable?
RTO: How quickly does the service have to be back up?

Those two targets directly determine archive frequency, storage class, and restore infrastructure. If RPO is 5 minutes, “one dump per day” is already off the table. If RTO is 20 minutes, restoring a 2 TB database from a single-spindle disk isn’t realistic either.
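As a rough feasibility check on the RTO side, restore time can be estimated from database size and sustained throughput. This is a back-of-the-envelope sketch: the function names, the throughput figures, and the WAL replay allowance are illustrative assumptions, not measurements from a real system.

```python
def estimated_restore_minutes(db_size_gb: float,
                              throughput_mb_s: float,
                              wal_replay_minutes: float = 0.0) -> float:
    """Rough restore time: copy the base backup, then add a WAL replay allowance."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    copy_seconds = (db_size_gb * 1024) / throughput_mb_s
    return copy_seconds / 60 + wal_replay_minutes


def meets_rto(db_size_gb: float,
              throughput_mb_s: float,
              rto_minutes: float,
              wal_replay_minutes: float = 0.0) -> bool:
    """True when the estimated restore fits inside the RTO."""
    estimate = estimated_restore_minutes(db_size_gb, throughput_mb_s, wal_replay_minutes)
    return estimate <= rto_minutes
```

With these assumed numbers, 2 TB at 150 MB/s comes out to roughly 233 minutes, far outside a 20-minute RTO, while 2000 MB/s lands around 17 minutes before any WAL replay. The real figures only come from measuring a restore.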

2) The solid model: base backup + continuous WAL archive

The most practical structure:

Regular base backups
A continuous and validated WAL archive
A restore rehearsal pipeline in a separate environment

If any one of these three is missing, PITR exists only on paper. The base backup gives you the anchor point, the WAL fills in the gap, and the rehearsal pipeline proves the chain actually works.
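One way to check that the WAL really fills the gap is to look for missing segments in the archive listing. The sketch below assumes the archive stores segments under their plain 24-character names and uses the default 16 MB segment size; compressed or renamed files would need the pattern adjusted. It only checks contiguity within each timeline, and the function name is hypothetical.

```python
import re

WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def find_wal_gaps(filenames, wal_segment_size=16 * 1024 * 1024):
    """Return names of segments missing between the first and last one seen, per timeline."""
    segs_per_log = 0x100000000 // wal_segment_size  # 256 for 16 MB segments
    by_timeline = {}
    for name in filenames:
        if not WAL_NAME.match(name):
            continue  # skip .history, .backup, .partial and anything else
        timeline = int(name[0:8], 16)
        log = int(name[8:16], 16)
        seg = int(name[16:24], 16)
        by_timeline.setdefault(timeline, set()).add(log * segs_per_log + seg)

    missing = []
    for timeline in sorted(by_timeline):
        present = by_timeline[timeline]
        for segno in range(min(present), max(present) + 1):
            if segno not in present:
                log, seg = divmod(segno, segs_per_log)
                missing.append(f"{timeline:08X}{log:08X}{seg:08X}")
    return missing
```

Run against a directory listing or an object-store prefix listing, an empty result means the chain is contiguous between the oldest and newest segment; it says nothing about whether each file is intact.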

What do I always validate?

Is archive_mode=on (with wal_level at replica or higher)?
Does archive_command return a non-zero exit status on every failure?
Is the archive target in a separate failure domain? (writing to the same disk/cluster is risky)
Is there corruption checking on WAL files before they’re restored?

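A simple but high-value rule: silent failure is not acceptable. PostgreSQL treats exit status 0 from archive_command as proof that the segment is safely archived, and is then free to recycle or remove it. If your archive command returns 0 while quietly leaving a file missing, you don’t actually have a backup.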
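A minimal sketch of an archive script that fails loudly could look like the following. It assumes a locally mounted archive directory; the script name, the paths, and the exit codes other than 0 are hypothetical choices. It refuses to overwrite a segment with different content, verifies a checksum after the copy, and reports success only when the file is in place.

```python
import hashlib
import os
import shutil
import sys


def sha256_of(path):
    """Hash a file in chunks so large segments are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_wal(src_path, archive_dir):
    """Copy one WAL segment into archive_dir; raise on any problem."""
    name = os.path.basename(src_path)
    final_path = os.path.join(archive_dir, name)
    tmp_path = final_path + ".tmp"

    if os.path.exists(final_path):
        # A retry of an identical segment is fine; different content is an error.
        if sha256_of(final_path) == sha256_of(src_path):
            return final_path
        raise FileExistsError(f"{final_path} already exists with different content")

    with open(src_path, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())
    if sha256_of(tmp_path) != sha256_of(src_path):
        os.remove(tmp_path)
        raise IOError(f"checksum mismatch after copying {name}")
    os.rename(tmp_path, final_path)  # atomic when both paths are on one filesystem
    return final_path


def main(argv):
    """Return the exit status PostgreSQL should see: 0 only on verified success."""
    if len(argv) != 3:
        print("usage: archive_wal.py <wal-path> <archive-dir>", file=sys.stderr)
        return 2
    try:
        archive_wal(argv[1], argv[2])
    except Exception as exc:
        print(f"archive failed: {exc}", file=sys.stderr)
        return 1
    return 0

# When saved as a script, finish with: raise SystemExit(main(sys.argv))
```

A hypothetical wiring would be archive_command = 'python3 /usr/local/bin/archive_wal.py %p /wal-archive'. Any non-zero status makes PostgreSQL keep the segment and retry, which is exactly the behavior you want when the target is unavailable.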

3) The archive target: cheap storage isn’t enough, you need verifiable storage

Two failures show up in the field over and over:

WAL is written into the same infrastructure, and when the primary system goes, the archive goes with it
There’s object storage, but its integrity and access model are undefined

The right approach:

Write to a different failure domain
Run a checksum/integrity check on the object you wrote
Define a lifecycle: how many days hot, how many days archive?
Measure access speed during a restore

If you’re on object storage, just saying “it’s sitting there” isn’t enough. A wrong lifecycle policy on the relevant prefix, replication lag, or an access key issue will show up exactly on the day you need PITR.
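A lifecycle rule that expires WAL sooner than base backups silently shrinks the recovery window. The sketch below computes the earliest point in time that is still reachable, assuming you can list each retained base backup's start and end time and the timestamp of the oldest WAL segment still in the archive. The function name and input shape are hypothetical.

```python
from datetime import datetime, timezone


def earliest_recoverable_time(backup_windows, oldest_wal_time):
    """Earliest PITR target still reachable, or None if no backup is usable.

    backup_windows: iterable of (start, end) datetimes for retained base backups.
    oldest_wal_time: datetime of the oldest WAL still present in the archive.
    A backup is usable only if WAL is retained from its start onward, and
    recovery cannot stop before the backup's end.
    """
    usable_ends = [end for start, end in backup_windows if start >= oldest_wal_time]
    return min(usable_ends) if usable_ends else None
```

If this returns a time later than your retention promise, or None, the lifecycle policy on the WAL prefix is the first place to look.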

4) Restore runbook: what do I do, and at what time?

Your PITR runbook has to make these decisions explicit:

How will the target timestamp be chosen?
Which base backup will be used?
Will the restored copy be brought up on an isolated network?
When will the application be moved to read-only or maintenance mode?
When will traffic be cut over to the new copy?

Sample flow:

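# Base backup, taken ahead of time from the running primary
pg_basebackup --pgdata /restore/pgdata
# On the restored copy (PostgreSQL 12+): set these in postgresql.conf and create an empty recovery.signal in the data directory
restore_command = 'cp /wal-archive/%f %p'
recovery_target_time = '2026-04-29 10:37:00+03'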

The command line on its own isn’t enough. The most critical decision is who signs off on the target timestamp. Roll back too early and you lose more data; roll back too late and you haven’t actually undone the bad changes.
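Two of the runbook decisions can be made mechanical once the target timestamp is signed off: which base backup to use, and what the recovery settings look like. The sketch below assumes a simple catalog of backups with start and end times; the BaseBackup type, the function names, and the choice of recovery_target_action = 'pause' are illustrative, not a prescribed setup.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BaseBackup:
    label: str
    start: datetime
    end: datetime


def pick_base_backup(backups, target):
    """Latest base backup that finished at or before the target time, else None."""
    candidates = [b for b in backups if b.end <= target]
    return max(candidates, key=lambda b: b.end) if candidates else None


def recovery_settings(target, archive_dir="/wal-archive"):
    """Render postgresql.conf lines for a time-based recovery target."""
    return "\n".join([
        f"restore_command = 'cp {archive_dir}/%f %p'",
        f"recovery_target_time = '{target.isoformat(sep=' ')}'",
        # Pausing at the target leaves room to inspect the data before promoting.
        "recovery_target_action = 'pause'",
    ])
```

Picking the newest eligible backup keeps WAL replay short; a None result means the target predates every retained backup and the request cannot be served.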

5) Drills: the success criterion isn’t “the restore came up”

A good drill answers questions like these:

Did we actually return to the target time?
Did the application go through critical flows like login/checkout/order?
Have replication and the new backup chain been re-engaged?
How long did the whole thing take?

My recommendation is to wrap up the drill outcome under these headings:

Technical result: did the chain hold together?
Duration: did it meet RTO?
Decision gap: where did the team hesitate?
Action: what gets fixed before the next drill?
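If drill outcomes are recorded in a consistent shape, they can be compared from one drill to the next. A minimal sketch of such a record, with hypothetical field names that mirror the four headings above:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DrillResult:
    target_time_reached: bool     # did recovery stop at the requested time?
    critical_flows_passed: bool   # login/checkout/order smoke tests
    backup_chain_restarted: bool  # replication and a fresh backup chain are running
    duration_minutes: float
    rto_minutes: float
    hesitation_points: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)

    @property
    def chain_held(self) -> bool:
        return (self.target_time_reached
                and self.critical_flows_passed
                and self.backup_chain_restarted)

    @property
    def met_rto(self) -> bool:
        return self.duration_minutes <= self.rto_minutes


def summarize(result: DrillResult) -> str:
    """Render the drill outcome under the four headings."""
    lines = [
        f"Technical result: {'chain held' if result.chain_held else 'chain broke'}",
        f"Duration: {result.duration_minutes:.0f} min against an RTO of "
        f"{result.rto_minutes:.0f} min ({'met' if result.met_rto else 'missed'})",
        "Decision gap: " + ("; ".join(result.hesitation_points) or "none recorded"),
        "Action: " + ("; ".join(result.actions) or "none recorded"),
    ]
    return "\n".join(lines)
```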

6) The most expensive mistakes

Believing the archive target “works” without ever doing a restore from it
Never rehearsing a version mismatch between the base backup and the WAL
Letting only the DBA know the PITR runbook
Forgetting to start a new backup chain after recovery

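The last one is especially critical. After a restore the system comes back up, but recovery ends on a new timeline. If you don’t take a fresh base backup and confirm that WAL from the new timeline is reaching the archive, you’ll be blind again the next time something fails.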

Conclusion

PostgreSQL PITR tests the discipline of coming back, not the comfort of saying “we take backups.” Run base backups, the WAL archive, and regular drills together as a trio and you can actually manage the data-loss window. What makes a difference in production isn’t memorizing commands; it’s nailing down the target time, the access model, and the backup chain you restart after recovery.
