LSM Trees

Updated June 3, 2026

Magic Magnets Team

11 min read

If you just read about B-Trees, you might think the database storage problem is perfectly solved. B-Trees are amazing for fast lookups. But here's the thing: B-Trees have a massive weakness. They are terrible at handling a firehose of incoming data.

Imagine you are building a system for Uber to track driver locations, or Amazon tracking millions of clickstream events per second. You are doing constant, high-volume writes.

When you write to a B-Tree, it has to find the exact right spot in the tree on disk, update that block, and potentially rebalance the tree. This involves lots of random, tiny disk writes. Random disk writes are painfully slow. Doing them millions of times a second will choke your database.

We need a completely different approach for write-heavy systems. Enter the Log-Structured Merge-Tree (LSM Tree).

The Core Concept: Append Only

If random writes are slow, what's fast? Sequential writes.

Think of keeping a physical ledger book. If you have to insert a transaction alphabetically on page 42, then page 18, then page 99, it takes forever. But if you just start at the end of the book and write every new transaction sequentially as it happens. The speed is dramatic.

LSM Trees are built on this "append-only" philosophy. They never update data in place.

Quiz Time

Why do B-Trees struggle under high write throughput compared to LSM Trees?

How an LSM Tree Works: The Two-Part System

An LSM Tree splits the data into two main locations: memory (RAM) and disk.

1. The MemTable (In-Memory)

When a write request comes in, the database doesn't go to the disk immediately. Instead, it writes the data into a structure in memory (RAM) called a MemTable (often implemented as an in-memory balanced tree like a Red-Black tree).

Because writing to RAM is practically instantaneous, the database can ingest data at terrifying speeds. It sorts the data in memory as it arrives.

Quiz Time

LSM Trees update data in place on disk, similar to B-Trees.

2. Flushing to Disk (SSTables)

Eventually, that MemTable in RAM fills up. When it does, the database freezes it, creates a new blank MemTable for new writes, and flushes the full one onto the hard disk.

Because the data in the MemTable was already sorted, the database can just stream it to disk in one fast, sequential write. The resulting file on disk is called a Sorted String Table (SSTable). Once an SSTable is written, it is immutable. It is never changed.

Quiz Time

What is an SSTable in an LSM Tree?

The Catch: How Do We Read?

Writes are lightning fast, but reading gets complicated.

If a user requests the data for "Alice", the database has to check:

Is "Alice" in the current MemTable in RAM?
If not, it has to look through the SSTable files on disk. Since we create a new file every time RAM fills up, there could be hundreds of SSTables on disk!

Searching hundreds of files for every read would be a disaster. To mitigate this, LSM Trees use a few tricks:

Bloom Filters: A clever mathematical structure in memory that can quickly tell you "Alice is definitely NOT in this file" or "Alice MIGHT be in this file." This prevents the database from reading disk files unnecessarily.
Compaction: In the background, the database constantly runs a "compaction" process. It takes multiple smaller SSTables, merges them together, removes overwritten or deleted data, and writes out a single, larger, sorted SSTable. Think of it like a background janitor cleaning up and consolidating files while the main system runs.

Quiz Time

What role do Bloom Filters play during reads in an LSM Tree?

Quiz Time

Compaction in an LSM Tree runs in the foreground, pausing all incoming writes until it completes.

Deletes in an Append-Only World: The Tombstone

Wait, if files on disk are immutable, how do you delete a record?

You write a new record. If you want to delete "Alice", you just insert a new entry into the MemTable that says "Alice: [DELETED]". This special marker is called a Tombstone.

When a read request comes in, the system finds the Tombstone first (since it's newer) and knows "Alice" is gone, even if her old data still exists in an older SSTable on disk. Eventually, the background compaction process will see the Tombstone, match it with the old data, and permanently erase both from the new consolidated file.

Quiz Time

How does an LSM Tree handle a delete operation?

Real-World Examples

LSM Trees are the backbone of modern NoSQL and massive-scale databases designed for high write throughput.

Cassandra & ScyllaDB: Used by companies like Discord and Netflix to handle billions of messages and metrics per second.
RocksDB: A wildly popular embeddable key-value store built by Facebook, optimized for fast storage.
DynamoDB: Amazon's massively scalable database under the hood uses LSM-like structures to handle extreme traffic spikes.

Summary

B-Trees are optimized for fast reads but suffer under heavy write loads due to random disk I/O.
LSM Trees are optimized for blazing-fast writes by using sequential append-only operations.
Data is first written to a MemTable in RAM. When full, it is sequentially flushed to disk as an immutable SSTable.
Reads check memory first, then disk. To speed up reads, LSM Trees use Bloom Filters and background Compaction.
Deletes are handled by writing a special Tombstone marker, which hides old data until background processes physically remove it.

How Databases Guarantee Durability

How helpful was this content?

Comments

0/2000

Saved on this device only