Schema Migrations
Updated June 8, 2026Imagine you're managing a busy library. All the books are sorted by the author's last name. One day, the head librarian decides that sorting by genre, and then by author, makes more sense.
You can't just close the library for a week to reorganize all the shelves. People need books today. You have to figure out a way to completely reorganize the physical layout of thousands of books while patrons are actively walking around, checking books out, and returning them.
In software, reorganizing the structure of your data while the application is live is called a Schema Migration.
The Problem with Databases
Stateless web servers are easy to upgrade. If you have a bug, you kill the server, deploy new code, and spin up a new one.
Databases, however, have state. They hold terabytes of critical user data. If your new code requires a column to be named full_name instead of first_name and last_name, you can't just deploy the code. The moment the new code hits the database looking for full_name, the database will throw an error and crash your app.
You must migrate the schema. Doing this without taking the application offline (zero-downtime migration) is one of the trickiest maneuvers in backend engineering.
Expand: add full_name column, old app keeps working. | Dual Write + Backfill: app writes all 3 columns, background script fills historical rows.
The "Expand and Contract" Pattern
The golden rule of zero-downtime schema migrations is that you can never make a breaking database change and a code change at the exact same time. You must decouple them.
The industry-standard way to do this is the Expand and Contract pattern (also known as Parallel Change). It happens in four distinct phases.
Let's use the example of combining first_name and last_name into a single full_name column.
Phase 1: Expand (Database Change)
First, you alter the database to add the new full_name column. You do not delete the old columns. The database now has first_name, last_name, and an empty full_name.
Result: The live app keeps working normally, reading and writing to the old columns.
Phase 2: Dual Write (Code Change)
Next, you deploy an update to your application code. The code now writes to both the old columns and the new column simultaneously. When a user updates their profile, the app saves first_name, last_name, and also calculates and saves full_name.
Result: All new data is kept in sync across both structures.
Phase 3: Backfill (Background Script)
What about the millions of old users who haven't updated their profiles? You run a background script to iterate through the database, reading the old names, combining them, and filling in the full_name column for all historical records.
Phase 4: Read from New (Code Change)
Now that the new column is fully populated and staying up to date, you deploy another code update. The app stops reading from the old columns and exclusively reads from full_name.
Result: The app is now fully using the new schema.
Phase 5: Contract (Database Change)
Finally, after ensuring everything is stable for a few days, you run one last database migration to drop the first_name and last_name columns. The migration is complete.
Real-World Examples
GitHub's Massive Migrations
GitHub uses MySQL extensively. When they need to run an ALTER TABLE on a table with billions of rows, they can't just run the standard SQL command. Standard migrations lock the table, meaning GitHub would go down for hours. Instead, they use tools like gh-ost (GitHub's Online Schema Migrations for MySQL). These tools create a hidden duplicate table, migrate the schema there, sync the live data slowly in the background, and then do a split-second swap to the new table.
Stripe's API Versioning
Stripe is famous for never breaking their API for users. When they change their internal database schemas, they write complex compatibility layers. An API request from an old client will hit middleware that maps the old request format to the new database schema on the fly, ensuring smooth transitions without forcing merchants to upgrade their code instantly.
The Risks of Locking
The biggest danger in schema migrations is table locking.
In some relational databases (like older versions of PostgreSQL or MySQL), running a command like ALTER TABLE ADD COLUMN might require an exclusive lock on the entire table. If that table holds your core Users data, the database will block all reads and writes until the command finishes.
If the table is 500GB, that command might take 45 minutes. Congratulations, you just caused a 45-minute total system outage.
[!CAUTION] Always know how your specific database engine handles locks! Modern versions of Postgres can add a nullable column without a heavy lock, but adding a column with a
DEFAULTvalue or creating a new index can still cause catastrophic locking if not done concurrently.
Summary
Schema migrations are the delicate surgery of system design. Because state is persistent, you must orchestrate a careful sequence between your application code and your database structure. By using the Expand and Contract pattern and online schema change tools, you can completely reorganize the foundation of your data layer without your users noticing a hiccup. It takes patience and multiple deployments, but zero-downtime is always worth the effort.
How helpful was this content?
Comments
Sign in to join the discussion
Saved on this device only
Sign in to sync progress across devices