Change Data Capture (CDC)

Updated June 6, 2026

Magic Magnets Team

9 min read

Every insert, update, and delete that ever touched your database is a fact. A thing that happened. Most databases quietly write these facts to an internal log before they even modify the actual data — it's how they guarantee durability and support replication.

Change Data Capture (CDC) is the practice of tapping into that log to stream those facts outward — to search engines, caches, message queues, analytics systems, or anywhere else that needs to know what changed and when.

What CDC Actually Is

The core idea: instead of your application explicitly notifying every downstream system when it writes to the database, you let the database's own change log do it automatically.

Your application writes to the database as normal. A CDC tool watches the database's internal change log and publishes those changes as events. Downstream consumers subscribe to those events and react.


CDC taps into the database's own internal change log — the Write-Ahead Log in PostgreSQL, the binlog in MySQL. These logs already exist for crash recovery and replication. Debezium connects as a logical replication client and reads every row-level change as a structured event. It publishes these events to Kafka, organized by table name. Downstream consumers subscribe to the topics they care about: an Elasticsearch connector updates the search index, a cache invalidation service clears stale Redis keys, and a warehouse connector syncs to BigQuery or Snowflake. Critically, the application code has no knowledge of any of this. It just writes to the database. The database log does the rest — making the database the authoritative, single source of truth for all downstream synchronization.


CDC architecture — Debezium reads the WAL and publishes change events to Kafka for downstream consumers

This is powerful because the database is the single source of truth. Application-level hooks that try to keep every downstream system in sync are easy to miss, easy to get wrong, and hard to test. CDC sidesteps them by treating the database log as the authoritative record of what changed.

Quiz Time

What database mechanism does Debezium use to capture changes from PostgreSQL?

How It Works: Database Logs

PostgreSQL: Write-Ahead Log (WAL)

In Postgres, every write is recorded in the WAL before it's applied to the actual data files. The WAL is what lets Postgres survive crashes (it can replay from the log) and what powers replication (replicas replay the WAL from the primary).

CDC tools connect to Postgres as a logical replication client. Postgres decodes the WAL into a logical stream of row-level changes — INSERT, UPDATE, DELETE — and streams them out. Debezium uses this mechanism to capture changes from Postgres.
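To make "row-level change" concrete, here is a minimal sketch of reading one change event. It assumes the event arrives as JSON shaped like Debezium's envelope (before and after row images, an op code, source metadata) with the schema wrapper disabled. The sample values and the describe_change helper are hypothetical.

```python
import json

# Hypothetical change event, shaped like the envelope Debezium emits
# (before/after row images, an op code, and source metadata).
RAW_EVENT = json.dumps({
    "before": {"id": 42, "name": "Widget", "price": 10},
    "after": {"id": 42, "name": "Widget", "price": 12},
    "source": {"table": "products", "lsn": 1001},
    "op": "u",
    "ts_ms": 1700000000000,
})

# Debezium op codes: c = create, u = update, d = delete, r = snapshot read.
OP_NAMES = {"c": "INSERT", "u": "UPDATE", "d": "DELETE", "r": "SNAPSHOT"}


def describe_change(raw: str) -> str:
    """Summarize one change event as a human-readable line."""
    event = json.loads(raw)
    op = OP_NAMES[event["op"]]
    table = event["source"]["table"]
    # Deletes carry the old row in "before"; other ops carry the new row in "after".
    row = event["after"] if event["after"] is not None else event["before"]
    return f"{op} on {table} (id={row['id']})"


print(describe_change(RAW_EVENT))
```

Every consumer in a CDC pipeline starts from roughly this step: decode the envelope, look at the op code, then act on the before or after image.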

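You need to set wal_level = logical in the Postgres configuration (a change that requires a server restart) and give the CDC tool a logical replication slot, which tracks how far it has read. Monitor that slot: if the consumer stops reading, Postgres retains WAL for it and disk usage grows.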

MySQL: Binary Log (binlog)

MySQL's equivalent is the binary log (binlog). It records every statement or row change (depending on the binlog format). MySQL replication works by having replicas consume the primary's binlog — CDC tools do the same thing.

Debezium connects to MySQL as a replica, reads the binlog, and converts changes into structured events. You need to enable binary logging (log_bin) and use row-based format (binlog_format = ROW).

SQL Server: CDC Feature

SQL Server has CDC built in as a first-class feature. You enable it for the database and then per table; a capture job reads the transaction log and writes every modification to a corresponding change table. CDC tools query these change tables.

MongoDB: Change Streams

MongoDB exposes a native change stream API on top of its oplog (operations log). Applications can subscribe to change streams directly or via a CDC connector. Change streams require a replica set or sharded cluster, because a standalone server has no oplog.

Use Cases

Syncing to Elasticsearch

Search is a classic CDC use case. Your source of truth is Postgres. Elasticsearch is your search layer. How do you keep the search index in sync?

The naive approach: every time your application updates a product in the database, also update Elasticsearch. This is fragile — what if the Elasticsearch update fails? What if a developer forgets to add the sync call? What if you need to backfill?

With CDC: Debezium watches the Postgres WAL for changes to the products table. Every change is published to Kafka. An Elasticsearch connector reads from Kafka and updates the index. The application code has zero knowledge of Elasticsearch — it just writes to the database.
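As a rough sketch of what such a connector does internally, the function below turns one change event into the newline-delimited action lines that Elasticsearch's bulk API accepts. It assumes Debezium-style events with an id primary key; the index name and the to_bulk_lines helper are hypothetical, and a real connector would also handle batching, retries, and mapping.

```python
import json

INDEX = "products"  # hypothetical index name


def to_bulk_lines(event: dict) -> list[str]:
    """Translate one change event into Elasticsearch bulk-API lines."""
    if event["op"] == "d":
        # A delete only needs the document id, taken from the old row image.
        doc_id = str(event["before"]["id"])
        return [json.dumps({"delete": {"_index": INDEX, "_id": doc_id}})]
    # Inserts, updates, and snapshot reads all become a full re-index of the row.
    row = event["after"]
    action = {"index": {"_index": INDEX, "_id": str(row["id"])}}
    return [json.dumps(action), json.dumps(row)]


update = {"op": "u", "before": None, "after": {"id": 1, "name": "Widget"}}
print("\n".join(to_bulk_lines(update)))
```

Indexing the whole row by primary key makes the operation naturally repeatable: replaying the same event leaves the index in the same state.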

Cache Invalidation

Cache invalidation is famously hard. CDC makes it tractable. When a row changes in the database, a CDC event fires, your cache invalidation service picks it up and removes or refreshes the corresponding cache key — automatically, without the application needing to know which caches exist.

This is the approach used in Facebook's Wormhole system: changes from their database are streamed via CDC and used to invalidate their distributed cache layer.
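A cache invalidation consumer can be very small. The sketch below uses a plain dict as a stand-in for Redis and assumes Debezium-style events plus a table:id key convention. Both the convention and the invalidate helper are illustrative assumptions, not part of any particular tool.

```python
def cache_key(table: str, row_id) -> str:
    """Hypothetical key convention: '<table>:<primary key>'."""
    return f"{table}:{row_id}"


def invalidate(cache: dict, event: dict) -> bool:
    """Drop the cached entry for the changed row. Returns True if one was removed."""
    if event["op"] == "r":
        # Snapshot reads describe existing state, so nothing has gone stale.
        return False
    # Deletes only carry "before"; inserts and updates carry "after".
    row = event.get("after") or event.get("before")
    key = cache_key(event["source"]["table"], row["id"])
    if key in cache:
        del cache[key]
        return True
    return False


cache = {"products:42": {"id": 42, "price": 10}}
event = {
    "op": "u",
    "before": None,
    "after": {"id": 42, "price": 12},
    "source": {"table": "products"},
}
print(invalidate(cache, event), cache)
```

Deleting the key rather than writing the new value keeps the consumer simple: the next read repopulates the cache from the database.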

Building Event Streams from Existing Systems

Sometimes you want to add event-driven architecture to an existing system without rewriting everything. CDC lets you treat the database as an event source. Every insert to the orders table becomes an "order created" event. Every update to payment_status becomes a "payment updated" event. These events flow into Kafka and feed downstream services — without changing a line of application code.

This is how companies gradually migrate from monoliths to event-driven microservices: CDC provides the event stream from the existing database while new services are built to consume it.
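One way to picture the translation from row changes to domain events is a small mapping function. The sketch below assumes Debezium-style events for an orders table with id and payment_status columns, and assumes the before image is available on updates (in Postgres that depends on the table's replica identity setting). The event type names are hypothetical.

```python
from typing import Optional


def to_domain_event(event: dict) -> Optional[dict]:
    """Map a raw row change on the orders table to a business-level event."""
    if event["source"]["table"] != "orders":
        return None
    before, after = event.get("before"), event.get("after")
    if event["op"] == "c":
        return {"type": "order_created", "order_id": after["id"]}
    if event["op"] == "u" and before is not None:
        # Only emit when the column of interest actually changed.
        if before["payment_status"] != after["payment_status"]:
            return {
                "type": "payment_updated",
                "order_id": after["id"],
                "status": after["payment_status"],
            }
    return None


insert = {
    "op": "c",
    "before": None,
    "after": {"id": 5, "payment_status": "pending"},
    "source": {"table": "orders"},
}
print(to_domain_event(insert))
```

Keeping this mapping in a dedicated consumer means downstream services see stable business events instead of depending directly on the table layout.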

Data Warehouse / Analytics Sync

Moving data from operational databases (OLTP) to analytics warehouses (BigQuery, Snowflake, Redshift) is traditionally done with batch ETL jobs that run hourly or nightly. CDC enables near-real-time sync: changes flow into the warehouse as they happen, enabling up-to-the-minute analytics.
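Warehouse loaders often apply changes in small batches rather than row by row. A minimal sketch of that idea, assuming Debezium-style events that arrive in log order and using a dict as a stand-in for the warehouse table: collapse each batch to the latest change per primary key, then upsert or delete. The compact and apply_batch helpers are hypothetical.

```python
def compact(events: list[dict]) -> dict:
    """Keep only the last change per primary key (events must be in log order)."""
    latest = {}
    for event in events:
        row = event["before"] if event["op"] == "d" else event["after"]
        latest[row["id"]] = event
    return latest


def apply_batch(table: dict, events: list[dict]) -> None:
    """Merge a batch of changes into the target table (dict keyed by id)."""
    for row_id, event in compact(events).items():
        if event["op"] == "d":
            table.pop(row_id, None)
        else:
            table[row_id] = event["after"]


warehouse = {}
batch = [
    {"op": "c", "before": None, "after": {"id": 1, "total": 20}},
    {"op": "u", "before": None, "after": {"id": 1, "total": 25}},
    {"op": "c", "before": None, "after": {"id": 2, "total": 5}},
    {"op": "d", "before": {"id": 2, "total": 5}, "after": None},
]
apply_batch(warehouse, batch)
print(warehouse)
```

Compacting first means the warehouse sees one write per changed row per batch, which is usually cheaper than replaying every intermediate state.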

Audit Logs

Every change to sensitive data (user records, financial transactions, permissions) is captured automatically, without trusting application code to log it correctly. CDC gives you a history of every database change from the moment capture starts.
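An audit consumer typically stores a column-level diff rather than the raw event. The sketch below assumes Debezium-style events that include full before and after images (in Postgres, a full before image requires REPLICA IDENTITY FULL on the table). The audit_record helper and its output shape are hypothetical.

```python
def audit_record(event: dict) -> dict:
    """Build an audit entry listing each column whose value changed."""
    before = event.get("before") or {}
    after = event.get("after") or {}
    changed = {
        col: {"old": before.get(col), "new": after.get(col)}
        for col in sorted(set(before) | set(after))
        if before.get(col) != after.get(col)
    }
    return {
        "table": event["source"]["table"],
        "op": event["op"],
        "ts_ms": event["ts_ms"],
        "changed": changed,
    }


event = {
    "op": "u",
    "ts_ms": 1700000000000,
    "source": {"table": "users"},
    "before": {"id": 3, "email": "a@example.com", "role": "viewer"},
    "after": {"id": 3, "email": "a@example.com", "role": "admin"},
}
print(audit_record(event))
```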

Quiz Time

Why is CDC a better approach to cache invalidation than having application code explicitly invalidate cache entries?

Real Tools

Debezium

The open-source CDC platform, built on top of Kafka Connect. Supports PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, and more. Debezium connectors run as Kafka Connect plugins, reading from database logs and writing change events to Kafka topics.
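Registering a connector means sending a JSON document to the Kafka Connect REST API. The sketch below builds such a document for the Debezium PostgreSQL connector, using property names from Debezium 2.x (older releases use database.server.name instead of topic.prefix). The host, credentials, slot name, and the postgres_connector_payload helper are placeholders, not recommended values.

```python
import json


def postgres_connector_payload(name: str, host: str, dbname: str, tables: list[str]) -> dict:
    """Build the JSON body for registering a Debezium Postgres connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",        # Postgres' built-in logical decoding plugin
            "database.hostname": host,
            "database.port": "5432",
            "database.user": "cdc_user",      # placeholder credentials
            "database.password": "change-me",
            "database.dbname": dbname,
            "topic.prefix": name,             # topics become <prefix>.<schema>.<table>
            "slot.name": "cdc_slot",          # placeholder replication slot name
            "table.include.list": ",".join(tables),
        },
    }


payload = postgres_connector_payload(
    "inventory", "db.internal", "shop", ["public.products", "public.orders"]
)
# This body would be sent as POST /connectors to a Kafka Connect worker.
print(json.dumps(payload, indent=2))
```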

Debezium handles the complexity of database log formats, schema changes, and snapshotting (doing an initial full-table sync before streaming incremental changes). It's the most widely used CDC tool in open-source ecosystems.

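One gotcha: Debezium is usually deployed on Kafka Connect, so the standard setup requires Kafka. If you're not already running Kafka, that's significant infrastructure to add. Debezium Server and the embedded Debezium Engine can deliver events to other destinations (such as Amazon Kinesis or Google Cloud Pub/Sub) without a Kafka cluster, but you give up the Kafka Connect ecosystem of sink connectors.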

AWS Database Migration Service (DMS)

AWS's managed CDC service. Supports continuous replication from source databases (RDS, Aurora, on-premises databases) to target systems (S3, Kinesis, Redshift, other databases). No infrastructure to manage.

Useful for AWS-native setups where you want to stream database changes to Kinesis (for downstream consumers) or to Redshift (for analytics). Less flexible than Debezium but operationally much simpler.

Kafka Connect with JDBC Connectors

Kafka Connect's JDBC source connector is a simpler (but less powerful) alternative. Instead of reading from the database log, it polls tables for rows that have changed since the last poll (using a timestamp or incrementing ID column). This is CDC-lite: it works without special database permissions or log access, but it misses deletes, collapses multiple updates between polls into the latest row state, has higher latency, and requires schema design to support it.
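The blind spot for deletes is easy to reproduce. The sketch below imitates timestamp-based polling against an in-memory SQLite table. It is not the JDBC connector itself; the table layout, the integer updated_at column, and the poll helper are assumptions made for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Widget", 100), (2, "Gadget", 100)],
)


def poll(conn, last_seen: int):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM products "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_seen,),
    ).fetchall()
    new_mark = max((r[2] for r in rows), default=last_seen)
    return rows, new_mark


first, watermark = poll(conn, 0)  # picks up both existing rows

conn.execute("UPDATE products SET name = 'Widget v2', updated_at = 200 WHERE id = 1")
conn.execute("DELETE FROM products WHERE id = 2")

second, watermark = poll(conn, watermark)
print(second)  # only the updated row; the delete of id 2 leaves no trace
```

A deleted row simply stops appearing in query results, so the poller has nothing to report. Log-based CDC sees the delete because the log records it explicitly.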

Airbyte and Fivetran

Data integration platforms that support CDC as a sync mode for moving data into data warehouses. Good for analytics use cases, less useful for operational CDC (syncing caches, building event streams).

The Hard Parts

Schema Changes

When you add a column to a database table, what happens to the CDC stream? Tools like Debezium handle this, but you need to think about schema evolution — consumers of the event stream may not be ready for new fields. Schema registries (like Confluent Schema Registry) help manage this.
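On the consumer side, a common defensive pattern is to ignore fields you do not know and supply defaults for fields that older events lack. A minimal sketch, assuming the row image arrives as a plain dict; the Product type and the parse_product helper are hypothetical, and this complements rather than replaces a schema registry.

```python
from dataclasses import dataclass, fields


@dataclass
class Product:
    id: int
    name: str
    price: float = 0.0  # default covers events produced before the column existed


def parse_product(row: dict) -> Product:
    """Build a Product from a row image, tolerating added and missing columns."""
    known = {f.name for f in fields(Product)}
    # Drop columns this consumer does not understand yet.
    return Product(**{k: v for k, v in row.items() if k in known})


# A newer producer added "color"; an older event lacks "price".
print(parse_product({"id": 1, "name": "Widget", "price": 9.5, "color": "red"}))
print(parse_product({"id": 2, "name": "Gadget"}))
```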

Initial Snapshot

When you first set up CDC, you need to sync the current state of the database before streaming incremental changes. This initial snapshot can take hours on large tables, and you need to handle the transition from snapshot to streaming without losing or duplicating events.
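The handoff can be reasoned about with a log position. The sketch below is a simplified model, not how any specific tool implements it: it assumes the snapshot was taken at a known log sequence number (LSN) and that every streamed event carries its own LSN, so anything at or below the snapshot position is already reflected in the snapshot and can be skipped. The bootstrap helper is hypothetical.

```python
def bootstrap(snapshot_rows: list[dict], snapshot_lsn: int, stream: list[dict]) -> dict:
    """Load a snapshot, then apply only the changes that happened after it."""
    state = {row["id"]: row for row in snapshot_rows}
    for event in stream:
        if event["source"]["lsn"] <= snapshot_lsn:
            continue  # the snapshot already includes this change
        if event["op"] == "d":
            state.pop(event["before"]["id"], None)
        else:
            state[event["after"]["id"]] = event["after"]
    return state


snapshot = [{"id": 1, "name": "Widget"}]
stream = [
    # Happened before the snapshot position: must not be re-applied.
    {"op": "u", "before": None, "after": {"id": 1, "name": "Old"}, "source": {"lsn": 90}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Gadget"}, "source": {"lsn": 110}},
    {"op": "d", "before": {"id": 1}, "after": None, "source": {"lsn": 120}},
]
print(bootstrap(snapshot, 100, stream))
```

Without the position check, the stale update at LSN 90 would overwrite the fresher snapshot row, which is exactly the kind of corruption the transition needs to avoid.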

At-Least-Once Delivery

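CDC events are typically delivered at-least-once. If Debezium crashes after publishing events but before recording how far it has read, it re-delivers those events on restart. Consumers need to be idempotent: processing the same event twice must leave the same result as processing it once.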
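One simple way to get idempotence is to remember the highest log position already applied and skip anything at or below it. The sketch below assumes events from a single ordered stream (for example, one Kafka partition) that each carry a monotonically increasing LSN. The IdempotentApplier class is hypothetical, and a real consumer would persist last_lsn together with the data.

```python
class IdempotentApplier:
    """Applies change events at most once by tracking the last applied LSN."""

    def __init__(self) -> None:
        self.rows: dict = {}
        self.last_lsn = -1
        self.applied = 0

    def handle(self, event: dict) -> bool:
        """Apply the event unless it was already seen. Returns True if applied."""
        lsn = event["source"]["lsn"]
        if lsn <= self.last_lsn:
            return False  # duplicate delivery after a restart: ignore
        if event["op"] == "d":
            self.rows.pop(event["before"]["id"], None)
        else:
            self.rows[event["after"]["id"]] = event["after"]
        self.last_lsn = lsn
        self.applied += 1
        return True


e1 = {"op": "c", "before": None, "after": {"id": 1, "qty": 1}, "source": {"lsn": 1}}
e2 = {"op": "u", "before": None, "after": {"id": 1, "qty": 2}, "source": {"lsn": 2}}

applier = IdempotentApplier()
for event in [e1, e2, e2, e1]:  # e2 and e1 are re-delivered
    applier.handle(event)
print(applier.rows, applier.applied)
```

The re-delivered copy of e1 is the interesting case: applying it again would roll the row back to qty 1, so skipping by position protects against more than just double counting.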

Quiz Time

What is the initial snapshot problem in CDC, and why does it matter?

Summary

Change Data Capture treats your database's internal change log as a first-class event stream. By tapping into the WAL (Postgres), binlog (MySQL), or equivalent, CDC tools like Debezium capture every insert, update, and delete as a structured event and stream it to Kafka, search engines, caches, or data warehouses. The key insight: the database is already recording these changes. CDC just makes those changes available to the rest of your system. This enables real-time search index sync, automatic cache invalidation, gradual migration to event-driven architecture, and up-to-the-minute analytics — all without changing a line of application code.

