Database schema changes (adding/removing a column, indexes, constraints, type conversions) are often viewed by teams as “run the DDL and move on.” At enterprise scale, however, a schema change turns into an incident the moment it isn’t considered together with deploy, traffic, replication, lock behavior, and rollback.
This post describes a framework that has worked many times in production for teams that say “I don’t want downtime but I want to make a safe change”: expand/contract, backfill, and where needed, dual write.
The goal: not “zero downtime” but “controlled risk”
My realistic objective is this:
- No user-visible outage (or at most, a very brief and planned one)
- The change is reversible
- Signals and guardrails are defined
What is expand/contract?
The expand/contract approach makes a schema change in two (sometimes three) phases instead of a single step.
- Expand: add the new field, don’t break the old one
- Migrate: move the data, switch the application to the new field
- Contract: remove the old field / tighten the constraint
The strength of this approach is here:
- The application and the DB don’t make a “sharp pivot” at the same time.
- Rollback becomes possible: you can move the application back to the old field.
1) Expand phase: add the new field, don’t break
Typical steps:
- Add the new column (one that accepts
NULL) - Create the new index using a “non-blocking” method (depends on the DB engine)
- Add the new table/relation (without breaking the old path)
The rule of this phase: the previous application version must keep working.
Risk points:
- DDLs like “DEFAULT + NOT NULL” can take long locks on some databases
- Creating an index on a large table can push replication behind
2) Backfill: moving data isn’t an ETL
Backfill is often the real cost. I’ve seen these mistakes a lot in the field:
- Locking the table with a “single-shot update”
- Blowing up replication lag during backfill
- Not monitoring the backfill (how much is left, what rate, what error rate?)
My practical recommendation:
- Do backfill in batches (e.g. 1k–10k rows)
- Short pause between batches (let the DB breathe)
- Persist progress idempotently (checkpoint)
- Be able to resume on failure (resume)
Sample “batch backfill” logic (pseudocode):
-- son_islenen_id'yi bir yerde tut
UPDATE my_table
SET new_col = transform(old_col)
WHERE id > :last_id
ORDER BY id
LIMIT :batch_size;
3) Dual write: a last resort, but sometimes necessary
Dual write is when the application writes both to the old and new field for a period. It’s powerful but risky:
- Writing to two places = two different failure modes
- Partial write (wrote to one, couldn’t write to the other)
- Inconsistencies via ordering and retry
If you’re going to use dual write:
- Define the “source of truth” (which one is authoritative?)
- Produce an inconsistency-detection signal (reconciliation job)
- Keep the dual-write window short (it shouldn’t last weeks)
Contract phase: produce “evidence” before removing the old
Before moving to the contract step, I look for these pieces of evidence:
- Usage rate of the new column/table is 99%+ (read path)
- The old field is no longer read (verified with logs/metrics)
- The reconciliation job reports 0 mismatches
- The rollback plan was reviewed one last time
Examples of contract:
- Drop the old column
- Add the
NOT NULLconstraint - Tighten the unique constraint
Observability: what to monitor during the migration
My minimum monitoring list:
- DB: CPU, I/O, lock waits, slow queries
- Replication lag (if applicable)
- Application: error rate, latency (especially the write path)
- Backfill job: batch/s, remaining rows, failures
Alerting principle:
- Temporarily set alert thresholds for metrics that are “normally quiet” during the migration
- Define a trigger threshold for rollback
Rollback plan: not on paper, step by step
A rollback plan should include:
- Application: returning to the previous version (feature flag / config)
- DB: the new column can stay (rollback in the expand phase usually isn’t “drop the schema”)
- Data: will the backfill be reverted, or left in place?
The approach I use often:
- In the expand phase, rollback = revert the application, leave the schema
- Before the contract phase, rollback = don’t do the contract, keep living with the debt
Conclusion
Online schema migration isn’t the sum of DDL commands; it’s an operational process. Reduce risk with expand/contract, manage backfill with a capacity budget, use dual write briefly and under control, and close the debt with contract. Most importantly: measure every phase, prove it, and make rollback “actually executable.”