Online Schema Migration: Expand/Contract, Backfill, and Dual Write

Database schema changes (adding/removing a column, indexes, constraints, type conversions) are often viewed by teams as “run the DDL and move on.” At enterprise scale, however, a schema change turns into an incident the moment it isn’t considered together with deploy, traffic, replication, lock behavior, and rollback.

This post describes a framework that has worked many times in production for teams that say “I don’t want downtime but I want to make a safe change”: expand/contract, backfill, and where needed, dual write.

The goal: not “zero downtime” but “controlled risk”

My realistic objective is this:

No user-visible outage (or at most, a very brief and planned one)
The change is reversible
Signals and guardrails are defined

What is expand/contract?

The expand/contract approach makes a schema change in two (sometimes three) phases instead of a single step.

Expand: add the new field, don’t break the old one
Migrate: move the data, switch the application to the new field
Contract: remove the old field / tighten the constraint

The strength of this approach is here:

The application and the DB don’t make a “sharp pivot” at the same time.
Rollback becomes possible: you can move the application back to the old field.

1) Expand phase: add the new field, don’t break

Typical steps:

Add the new column (one that accepts NULL)
Create the new index using a “non-blocking” method (depends on the DB engine)
Add the new table/relation (without breaking the old path)

The rule of this phase: the previous application version must keep working.

Risk points:

DDLs like “DEFAULT + NOT NULL” can take long locks on some databases
Creating an index on a large table can push replication behind

2) Backfill: moving data isn’t an ETL

Backfill is often the real cost. I’ve seen these mistakes a lot in the field:

Locking the table with a “single-shot update”
Blowing up replication lag during backfill
Not monitoring the backfill (how much is left, what rate, what error rate?)

My practical recommendation:

Do backfill in batches (e.g. 1k–10k rows)
Short pause between batches (let the DB breathe)
Persist progress idempotently (checkpoint)
Be able to resume on failure (resume)

Sample “batch backfill” logic (pseudocode):

-- son_islenen_id'yi bir yerde tut
UPDATE my_table
SET new_col = transform(old_col)
WHERE id > :last_id
ORDER BY id
LIMIT :batch_size;

3) Dual write: a last resort, but sometimes necessary

Dual write is when the application writes both to the old and new field for a period. It’s powerful but risky:

Writing to two places = two different failure modes
Partial write (wrote to one, couldn’t write to the other)
Inconsistencies via ordering and retry

If you’re going to use dual write:

Define the “source of truth” (which one is authoritative?)
Produce an inconsistency-detection signal (reconciliation job)
Keep the dual-write window short (it shouldn’t last weeks)

Contract phase: produce “evidence” before removing the old

Before moving to the contract step, I look for these pieces of evidence:

Usage rate of the new column/table is 99%+ (read path)
The old field is no longer read (verified with logs/metrics)
The reconciliation job reports 0 mismatches
The rollback plan was reviewed one last time

Examples of contract:

Drop the old column
Add the NOT NULL constraint
Tighten the unique constraint

Observability: what to monitor during the migration

My minimum monitoring list:

DB: CPU, I/O, lock waits, slow queries
Replication lag (if applicable)
Application: error rate, latency (especially the write path)
Backfill job: batch/s, remaining rows, failures

Alerting principle:

Temporarily set alert thresholds for metrics that are “normally quiet” during the migration
Define a trigger threshold for rollback

Rollback plan: not on paper, step by step

A rollback plan should include:

Application: returning to the previous version (feature flag / config)
DB: the new column can stay (rollback in the expand phase usually isn’t “drop the schema”)
Data: will the backfill be reverted, or left in place?

The approach I use often:

In the expand phase, rollback = revert the application, leave the schema
Before the contract phase, rollback = don’t do the contract, keep living with the debt

Conclusion

Online schema migration isn’t the sum of DDL commands; it’s an operational process. Reduce risk with expand/contract, manage backfill with a capacity budget, use dual write briefly and under control, and close the debt with contract. Most importantly: measure every phase, prove it, and make rollback “actually executable.”

Online Schema Migration: Expand/Contract, Backfill, and Dual Write

The goal: not “zero downtime” but “controlled risk”

What is expand/contract?

1) Expand phase: add the new field, don’t break

2) Backfill: moving data isn’t an ETL

3) Dual write: a last resort, but sometimes necessary

Contract phase: produce “evidence” before removing the old

Observability: what to monitor during the migration

Rollback plan: not on paper, step by step

Conclusion

Comments

Curated digest, hand-picked by me — not the AI

Your Reading Stats

Related Posts

Feature Flags and Configuration Governance: Parameter Store and Audit

Retry Storms: Timeout Budget and Latency Amplification

Sticky Sessions and Load Balancer Decisions for Stateful Traffic

The goal: not “zero downtime” but “controlled risk”

What is expand/contract?

1) Expand phase: add the new field, don’t break

2) Backfill: moving data isn’t an ETL

3) Dual write: a last resort, but sometimes necessary

Contract phase: produce “evidence” before removing the old

Observability: what to monitor during the migration

Rollback plan: not on paper, step by step

Conclusion

Comments

Curated digest, hand-picked by me — not the AI

Your Reading Stats

Related Posts

Feature Flags and Configuration Governance: Parameter Store and Audit

Retry Storms: Timeout Budget and Latency Amplification

Sticky Sessions and Load Balancer Decisions for Stateful Traffic

Klavye Kısayolları