Mustafa Erbay

PostgreSQL WAL Archiving and a Point-in-Time Recovery Drill

A guide to practicing PostgreSQL PITR with production discipline: WAL archiving, recovery time targets, and safe restoration steps.

Taking a backup and being able to come back from one are not the same thing. On the PostgreSQL side that gap shows itself most clearly in Point-in-Time Recovery (PITR) scenarios. There is a snapshot, but you can’t roll back to the exact minute you wanted; there is WAL, but the chain has gaps; there is a restore, but the application team can’t tell when to reopen traffic.

In this post I’m covering the PITR flow that actually works in production, built on WAL archiving + base backup + drill discipline. The goal isn’t being able to say “we have backups,” it’s being able to say “we can confidently get back to that point in time.”

1) Start with the goal: how far back, in how much time?

PITR design doesn’t start with technical knobs, it starts with two operational questions:

  • RPO: How many minutes of data loss is acceptable?
  • RTO: How quickly does the service have to be back up?

Those two targets directly determine archive frequency, storage class, and restore infrastructure. If RPO is 5 minutes, “one dump per day” is already off the table. If RTO is 20 minutes, restoring a 2 TB database from a single-spindle disk isn’t realistic either.
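
To make that trade-off concrete, here is a rough sizing sketch in bash. It only covers the time to copy the data files back and ignores WAL replay; the 150 MiB/s figure is my assumption for a single spinning disk, not a measurement, and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Rough sizing helpers; every number passed in is your own estimate.

# Minimum sustained restore throughput (MiB/s) needed to copy a database
# of SIZE_GIB within RTO_MINUTES, ignoring WAL replay time.
required_mib_per_s() {
  local size_gib=$1 rto_minutes=$2
  # Round up so the result is never optimistic.
  echo $(( (size_gib * 1024 + rto_minutes * 60 - 1) / (rto_minutes * 60) ))
}

# Minutes needed to copy SIZE_GIB at MIB_PER_S, rounded up.
copy_minutes() {
  local size_gib=$1 mib_per_s=$2
  echo $(( (size_gib * 1024 + mib_per_s * 60 - 1) / (mib_per_s * 60) ))
}

required_mib_per_s 2048 20   # 2 TiB in 20 minutes: prints 1748
copy_minutes 2048 150        # the same 2 TiB at 150 MiB/s: prints 234
```

Roughly 1.7 GiB/s is needed to meet the 20-minute target, while the assumed single disk takes close to four hours. On the RPO side, archive_timeout is the setting that bounds how long a partially filled WAL segment can sit unarchived on a quiet system.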

2) The solid model: base backup + continuous WAL archive

The most practical structure:

  • Regular base backups
  • A continuous and validated WAL archive
  • A restore rehearsal pipeline in a separate environment

If any one of these three is missing, PITR exists only on paper. The base backup gives you the anchor point, the WAL fills in the gap, and the rehearsal pipeline proves the chain actually works.
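
One cheap way to validate the chain is to check the archived segment names for holes. The sketch below assumes the default 16 MB wal_segment_size (256 segments per log id) and a single timeline; find_wal_gaps is a name I made up, and /wal-archive is the example path used later in this post.

```shell
#!/usr/bin/env bash
# Report missing segments in a list of WAL file names read from stdin.
# Assumes the default 16 MB wal_segment_size and a single timeline;
# other files (.history, .backup, .partial) are ignored.
find_wal_gaps() {
  LC_ALL=C sort -u | {
    local name prev_name="" prev=-1 cur
    while IFS= read -r name; do
      [[ $name =~ ^[0-9A-F]{24}$ ]] || continue
      # Name layout: 8 hex digits timeline, 8 log id, 8 segment number.
      cur=$(( 16#${name:8:8} * 256 + 16#${name:16:8} ))
      if (( prev >= 0 && cur != prev + 1 )); then
        echo "gap: $(( cur - prev - 1 )) segment(s) missing between $prev_name and $name"
      fi
      prev=$cur
      prev_name=$name
    done
  }
}

# Usage against a directory-style archive:
#   ls /wal-archive | find_wal_gaps
```

Empty output means the names are contiguous. It says nothing about whether the contents are intact, so it complements a restore rehearsal rather than replacing one.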

What do I always validate?

  • archive_mode=on, with wal_level set to replica or higher
  • Is archive_command actually checking errors?
  • Is the archive target in a separate failure domain? (writing to the same disk/cluster is risky)
  • Is there corruption checking on WAL files before they’re restored?

A simple but high-value rule: silent failure is not acceptable. If your archive command returns 0 while quietly leaving a file missing, you don’t actually have a backup.
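
As an illustration of an archive step that fails loudly, here is a minimal bash sketch. It assumes the archive is a mounted directory (ARCHIVE_DIR is a placeholder) and archive_wal is a name I chose; an object-storage target would need its own upload and verification calls.

```shell
#!/usr/bin/env bash
# Sketch of a fail-loud archive step. ARCHIVE_DIR is a hypothetical mount
# that should live in a different failure domain than the primary.
ARCHIVE_DIR=${ARCHIVE_DIR:-/wal-archive}

# archive_wal SRC_PATH FILE_NAME  (PostgreSQL passes these as %p and %f)
archive_wal() {
  local src=$1 name=$2
  local dest="$ARCHIVE_DIR/$name" tmp="$ARCHIVE_DIR/.$name.tmp"

  if [[ -e $dest ]]; then
    # Identical content means a retried archive; anything else is an error.
    cmp -s "$src" "$dest" && return 0
    echo "archive_wal: $dest exists with different content" >&2
    return 1
  fi

  # Copy under a temporary name, verify, then publish with a rename.
  cp "$src" "$tmp" || { echo "archive_wal: copy of $name failed" >&2; return 1; }
  if ! cmp -s "$src" "$tmp"; then
    echo "archive_wal: verification of $name failed" >&2
    rm -f "$tmp"
    return 1
  fi
  sync "$tmp" 2>/dev/null || sync    # flush before reporting success
  mv "$tmp" "$dest" || { echo "archive_wal: rename of $name failed" >&2; return 1; }
}

# When installed as a script: archive_command = '/path/to/this/script %p %f'
if [[ $# -eq 2 ]]; then
  archive_wal "$1" "$2"
  exit $?
fi
```

Every failure path returns non-zero, so PostgreSQL keeps the segment and retries instead of recycling it. The error messages land in the server log, which is where alerting can pick them up.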

3) The archive target: cheap storage isn’t enough, you need verifiable storage

Two failures show up in the field over and over:

  • WAL is written into the same infrastructure, and when the primary system goes, the archive goes with it
  • There’s object storage, but its integrity and access model are undefined

The right approach:

  • Write to a different failure domain
  • Run a checksum/integrity check on the object you wrote
  • Define a lifecycle: how many days hot, how many days archive?
  • Measure access speed during a restore

If you’re on object storage, just saying “it’s sitting there” isn’t enough. A wrong lifecycle policy on the relevant prefix, replication lag, or an access key issue will show up exactly on the day you need PITR.
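
A simple form of the integrity check is a sidecar checksum written at archive time and verified again before a restore. This sketch assumes GNU coreutils (sha256sum) and an archive reachable as a directory; the function names and the .sha256 layout are my own convention, not something PostgreSQL provides.

```shell
#!/usr/bin/env bash
# Sketch: sidecar SHA-256 files next to each archived object, checked
# again before a restore. Assumes GNU coreutils; the layout is hypothetical.

# Record a checksum right after a file has been written to the archive.
write_checksum() {
  local file=$1
  ( cd "$(dirname "$file")" &&
    sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

# Verify every file in DIR that has a sidecar; fail if any check fails.
verify_archive() {
  local dir=$1 sidecar failed=0
  for sidecar in "$dir"/*.sha256; do
    [[ -e $sidecar ]] || continue
    ( cd "$dir" && sha256sum --check --quiet "$(basename "$sidecar")" ) >&2 || failed=1
  done
  return "$failed"
}
```

Wrapping verify_archive in the shell's time keyword during a drill also gives a first read on how fast the archive can actually be read back.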

4) Restore runbook: what do I do, and at what time?

Your PITR runbook has to make these decisions explicit:

  1. How will the target timestamp be chosen?
  2. Which base backup will be used?
  3. Will the restored copy be brought up on an isolated network?
  4. When will the application be moved to read-only or maintenance mode?
  5. When will traffic be cut over to the new copy?

Sample flow:

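# Base backup, taken on schedule before the incident (it has to predate the target time)
pg_basebackup --pgdata /restore/pgdata
# At restore time, in postgresql.conf of the restored copy:
restore_command='cp /wal-archive/%f %p'
recovery_target_time='2026-04-29 10:37:00+03'
# PostgreSQL 12+: create /restore/pgdata/recovery.signal, then start the server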

The command line on its own isn’t enough. The most critical decision is who signs off on the target timestamp. Roll back too early and you lose more data; roll back too late and you haven’t actually undone the bad changes.
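
One way to bake that sign-off into the tooling is a small wrapper that refuses to write recovery settings without an unambiguous timestamp and a named approver. This is a sketch assuming PostgreSQL 12 or later and a base backup already unpacked into the data directory; prepare_pitr and the approver argument are hypothetical runbook conventions, not PostgreSQL features.

```shell
#!/usr/bin/env bash
# Sketch: write PITR settings into a restored data directory.
# Assumes PostgreSQL 12+ (recovery.signal) and an unpacked base backup.
prepare_pitr() {
  local pgdata=$1 target_time=$2 approver=${3:-}
  # Require an explicit UTC offset so the target is never ambiguous.
  local re='^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}[+-][0-9]{2}(:?[0-9]{2})?$'
  if [[ ! $target_time =~ $re ]]; then
    echo "prepare_pitr: target time needs the form 'YYYY-MM-DD HH:MM:SS+TZ'" >&2
    return 1
  fi
  if [[ -z $approver ]]; then
    echo "prepare_pitr: refusing to continue without a named approver" >&2
    return 1
  fi
  cat >> "$pgdata/postgresql.conf" <<EOF
# PITR target approved by $approver
restore_command = 'cp /wal-archive/%f %p'
recovery_target_time = '$target_time'
# Stop at the target and wait, so the state can be inspected first.
recovery_target_action = 'pause'
EOF
  touch "$pgdata/recovery.signal"
}

# Example: prepare_pitr /restore/pgdata '2026-04-29 10:37:00+03' 'incident-lead'
```

With recovery_target_action set to pause (which is also the default), the server stops replaying at the target and stays read-only until someone ends recovery, which leaves room to check the data before committing to that point.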

5) Drills: the success criterion isn’t “the restore came up”

A good drill answers questions like these:

  • Did we actually return to the target time?
  • Did critical application flows such as login, checkout, and order complete successfully?
  • Have replication and the new backup chain been re-engaged?
  • How long did the whole thing take?

My recommendation is to wrap up the drill outcome under these headings:

  • Technical result: did the chain hold together?
  • Duration: did it meet RTO?
  • Decision gap: where did the team hesitate?
  • Action: what gets fixed before the next drill?
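
The technical result and duration headings are easy to automate. Below is a minimal bash sketch that times whatever command performs the drill and compares it with the RTO; run_timed_drill is a name I made up, and the restore script in the usage line is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: time a drill step and compare it with the RTO.
# run_timed_drill RTO_SECONDS COMMAND [ARGS...]
run_timed_drill() {
  local rto=$1
  shift
  local start=$SECONDS status=0
  "$@" || status=$?
  local elapsed=$(( SECONDS - start ))
  if (( status != 0 )); then
    echo "technical result: FAILED (exit $status) after ${elapsed}s"
    return 1
  fi
  if (( elapsed > rto )); then
    echo "duration: ${elapsed}s, RTO of ${rto}s MISSED"
    return 2
  fi
  echo "duration: ${elapsed}s, RTO of ${rto}s met"
}

# Usage with a hypothetical drill script and a 20-minute RTO:
#   run_timed_drill 1200 ./restore_and_smoke_test.sh
```

For the "did we actually return to the target time" question, one option is to have the drill script query pg_last_xact_replay_timestamp() on the restored server and compare the value with the approved target.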

6) The most expensive mistakes

  • Believing the archive target “works” without ever doing a restore from it
  • Never rehearsing a version mismatch between the base backup and the WAL
  • Letting only the DBA know the PITR runbook
  • Forgetting to start a new backup chain after recovery

The last one is especially critical. After a restore, the system comes back up, but if a fresh backup chain isn’t started you’ll be blind again the next time something fails.
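
As a sketch of what "start a fresh chain" can look like, the function below takes a new base backup right after the recovered server has been promoted. BACKUP_ROOT and the post-pitr naming are my assumptions; connection settings are expected to come from the usual PG* environment variables, and DRY_RUN=1 only prints the command.

```shell
#!/usr/bin/env bash
# Sketch: start a new backup chain after recovery and promotion.
# BACKUP_ROOT is a hypothetical location for base backups.
BACKUP_ROOT=${BACKUP_ROOT:-/backups/base}

start_new_backup_chain() {
  local stamp dir
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  dir="$BACKUP_ROOT/post-pitr-$stamp"
  local cmd=(pg_basebackup --pgdata "$dir" --format=tar --gzip
             --checkpoint=fast --label "post-pitr-$stamp")
  if [[ ${DRY_RUN:-0} == 1 ]]; then
    printf '%s ' "${cmd[@]}"    # show the command without running it
    echo
    return 0
  fi
  "${cmd[@]}"
}

# Example: DRY_RUN=1 start_new_backup_chain
```

Recovery ends on a new timeline, so it is also worth confirming that WAL archiving is running again on the promoted server before calling the incident closed.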

Conclusion

PostgreSQL PITR tests the discipline of coming back, not the comfort of saying “we take backups.” Run base backups, the WAL archive, and regular drills together as a trio and you can actually manage the data-loss window. What makes a difference in production isn’t memorizing commands; it’s nailing down the target time, the access model, and the backup chain you restart after recovery.

Mustafa Erbay

Systems Architecture · Network Specialist · Infrastructure, Security, and Software

I have been working on systems architecture, networking, server infrastructure, large-scale deployments, software, and system security since 2006. On this blog I share technical experience that has held up in the field.
