Sharding

Updated June 3, 2026
Magic Magnets Team

At some point, a single database server runs out of road. You've added indexes, tuned your queries, thrown more RAM at it, upgraded to the biggest instance your cloud provider offers, and you're still hitting limits. Writes are queuing. Reads are slow. Your database is the bottleneck for everything.

This is when engineers reach for sharding: splitting your data horizontally across multiple database instances, each holding a slice of the total dataset.

What Sharding Actually Means

Sharding means taking one logical database and distributing its rows across multiple physical databases, called shards. Each shard holds a subset of the data and handles only the queries for that subset.

If you have 1 billion users and 10 shards, each shard holds roughly 100 million users. A write to user 428,000,102 goes to one specific shard. A read for that user goes to the same shard. The other 9 shards aren't involved.

Contrast this with replication, where every replica holds all the data. Sharding is about partitioning — each shard has data the others don't. This is what enables sharding to scale writes beyond what any single machine can handle.

Sharding Strategies

The critical decision in sharding is: how do you decide which shard a piece of data belongs to? This is called the shard key or partition key, and getting it wrong is expensive.

Range-Based Sharding

Split the keyspace into contiguous ranges and assign each range to a shard.

Shard 0: user_id 0 - 99,999,999
Shard 1: user_id 100,000,000 - 199,999,999
Shard 2: user_id 200,000,000 - 299,999,999
...

The good: Range queries are efficient. If IDs are assigned in creation order, all users created in January 2024 occupy one contiguous ID range that lives on one or two shards. Easy to reason about.

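The bad: Range-based sharding creates hotspots when the key grows monotonically. New users always get the highest IDs, so they always land on the last shard. That shard takes all the new-user writes while older shards sit comparatively idle. Range sharding works well for time-series data where historical data is immutable and it is acceptable for the newest range to absorb the writes, but it struggles with auto-incrementing primary keys under heavy insert load.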
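
As a rough illustration of the lookup described above, here is a minimal sketch in Python. The range boundaries mirror the example layout, while the names RANGE_STARTS, shard_for_range, and shards_for_span are hypothetical and the last range is assumed to be open-ended.

```python
from bisect import bisect_right

# Hypothetical layout: each shard owns 100 million consecutive user IDs.
# RANGE_STARTS[i] is the first user_id owned by shard i; the last range is open-ended.
RANGE_STARTS = [0, 100_000_000, 200_000_000, 300_000_000]


def shard_for_range(user_id: int) -> int:
    """Return the index of the shard whose range contains user_id."""
    if user_id < RANGE_STARTS[0]:
        raise ValueError("user_id is below the first range")
    # bisect_right counts the range starts that are <= user_id;
    # subtracting 1 gives the index of the owning range.
    return bisect_right(RANGE_STARTS, user_id) - 1


def shards_for_span(low: int, high: int) -> range:
    """Shards that a range query over [low, high] has to touch."""
    return range(shard_for_range(low), shard_for_range(high) + 1)
```

A query over a narrow ID span touches only one or two shards, which is the upside. The downside is visible too: every ID above the last boundary maps to the final shard, so all new-user inserts land there.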

Hash-Based Sharding

Apply a hash function to the shard key and take the modulo with the number of shards:

shard = hash(user_id) % num_shards

This spreads keys close to uniformly across shards, so no key range becomes a write hotspot and every shard stores roughly the same amount of data. Many production systems use hash-based sharding for this reason.
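
A minimal sketch of that formula in Python, assuming SHA-256 as the hash and 10 shards (both are illustrative choices, and shard_for_hash is a hypothetical name). A stable digest is used because Python's built-in hash() maps small integers to themselves and is salted per process for strings.

```python
import hashlib

NUM_SHARDS = 10  # illustrative value, not taken from any real deployment


def shard_for_hash(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user_id to a shard using a stable hash and a modulo."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Every service that routes queries has to use the same hash function and the same num_shards, otherwise two callers will disagree about where a user lives.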

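The bad: Range queries become cross-shard, because adjacent keys hash to unrelated shards. The same applies to any query that doesn't filter on the shard key: if you want all users in a city, you have to query all shards and merge results. Also, changing num_shards changes the modulo result for almost every key, so adding or removing shards means moving most of your data, a painful migration known as resharding.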

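Consistent hashing (used by Cassandra and described in Amazon's Dynamo paper) reduces the resharding problem by placing shards on a ring, so that adding or removing a shard moves only the keys next to it on the ring, roughly 1/N of the data for N shards. Redis Cluster takes a related but different approach: keys map to a fixed set of 16,384 hash slots, and slots are reassigned between nodes. Neither approach eliminates resharding, since the affected keys still have to be migrated.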
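
To make the ring idea concrete, here is a small sketch of a consistent-hash ring in Python. It is a teaching example under assumptions, not how any particular database implements it: the HashRing class, the shard names, and the choice of 100 virtual nodes per shard are all hypothetical.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Stable position on the ring for a string."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, shards, vnodes=100):
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> shard name
        self._vnodes = vnodes
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard):
        # Each shard gets several positions (virtual nodes) to even out load.
        for i in range(self._vnodes):
            p = _point(f"{shard}#{i}")
            if p in self._owner:
                continue  # extremely unlikely collision; keep the first owner
            self._owner[p] = shard
            bisect.insort(self._points, p)

    def shard_for(self, key):
        if not self._points:
            raise ValueError("ring has no shards")
        # Walk clockwise to the first position after the key, wrapping around.
        idx = bisect.bisect_right(self._points, _point(str(key))) % len(self._points)
        return self._owner[self._points[idx]]
```

When a shard is added to this ring, the only keys that change owner are the ones the new shard takes over; every other key stays where it was. Those keys still have to be copied to the new shard, which is the part consistent hashing does not remove.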

Directory-Based Sharding

Maintain a lookup table (the shard directory) that maps each entity to its shard:

user 1,000,042 → shard 3
user 9,382,001 → shard 7
...

The good: Maximum flexibility. You can move specific users between shards, handle large accounts differently, and rebalance without touching most data.

The bad: The directory itself becomes a single point of failure and a bottleneck. Every query requires a directory lookup before it can even reach the right shard. You need to cache the directory aggressively and replicate it carefully.
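
A minimal sketch of a directory with a local cache, in Python. The ShardDirectory class is hypothetical, and a plain dict stands in for the replicated directory table; a real deployment would also need to invalidate caches held by other processes.

```python
class ShardDirectory:
    """In-memory stand-in for a shard directory table plus a local cache."""

    def __init__(self, mapping):
        self._store = dict(mapping)  # stand-in for the replicated directory table
        self._cache = {}             # per-process cache of recent lookups
        self.store_lookups = 0       # counts trips to the directory itself

    def shard_for(self, user_id):
        if user_id in self._cache:
            return self._cache[user_id]
        self.store_lookups += 1
        shard = self._store[user_id]  # raises KeyError for unknown users
        self._cache[user_id] = shard
        return shard

    def move(self, user_id, new_shard):
        """Reassign one user; only the local cache is invalidated here."""
        self._store[user_id] = new_shard
        self._cache.pop(user_id, None)
```

Moving a single user is one directory update, which is the flexibility this strategy buys. The cost shows up in store_lookups: every cache miss is an extra round trip before the real query can start.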

Real-World Examples

How Instagram Sharded PostgreSQL

Instagram's early backend was a single PostgreSQL instance. As they grew toward 100 million users, they pre-sharded their database into thousands of logical shards before they needed physical sharding. Here's the clever part: those thousands of logical shards were distributed across just a handful of physical PostgreSQL servers.

When they needed more capacity, they moved logical shards to new physical servers — no rehashing required, just moving slices. This decoupling of logical from physical sharding is now a common pattern. It buys you the flexibility of resharding without rewriting your shard key logic.

They used user_id as the shard key, since almost all Instagram data (photos, followers, likes) is scoped to a user. A photo's shard is determined by the user_id of its owner, not the photo's own ID.
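
The logical-versus-physical split can be sketched in a few lines of Python. This is an illustration of the pattern, not Instagram's actual code: the shard count of 2,000, the modulo mapping, the server names, and the function names are all assumptions.

```python
NUM_LOGICAL_SHARDS = 2000  # assumed; the count is fixed once and never changes


def build_placement(servers):
    """Spread logical shards across physical servers round-robin."""
    return {ls: servers[ls % len(servers)] for ls in range(NUM_LOGICAL_SHARDS)}


def logical_shard(user_id: int) -> int:
    # The shard key logic depends only on the logical shard count.
    return user_id % NUM_LOGICAL_SHARDS


def server_for(user_id: int, placement) -> str:
    return placement[logical_shard(user_id)]


def move_logical_shard(placement, shard_id: int, new_server: str) -> None:
    """Adding capacity means updating the placement map, not rehashing keys."""
    placement[shard_id] = new_server
```

Because logical_shard never changes, a user's data keeps the same logical home for life. Growing the cluster only rewrites entries in the placement map after the corresponding slice of data has been copied to the new server.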

Cassandra's Native Sharding

Cassandra was designed for sharding from day one. Every row has a partition key that determines which node(s) hold it. Cassandra uses consistent hashing internally — add a new node, and Cassandra automatically migrates the relevant token ranges to it.

This is why Cassandra is popular for write-heavy workloads like time-series telemetry, activity feeds, and IoT sensor data. You can scale writes linearly by adding nodes, with no application-level resharding logic required.

The tradeoff: Cassandra is eventually consistent by default (consistency levels are tunable per query, with stronger levels costing latency and availability), and it makes cross-partition queries difficult by design.

The Hard Problems

Sharding introduces a class of problems that don't exist in a single-database world.

Cross-Shard Queries

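In a sharded system, a JOIN between two tables might need data from different shards. Unless you run a distributed SQL layer that plans cross-shard queries for you, the database can't do this natively: your application has to query multiple shards, then join the results in memory. A query that fans out to every shard and merges the responses is called a scatter-gather query, and it gets slower as shard count grows because it has to wait for the slowest shard.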

This is why shard key selection matters so much. Pick a key that co-locates data that's often queried together. Instagram putting all a user's data on the same shard means most queries touch exactly one shard.
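
For a sense of what scatter-gather looks like in application code, here is a hedged Python sketch. Each shard is represented by an in-memory list of dicts rather than a real database connection, and scatter_gather is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(shards, predicate, sort_key, limit):
    """Run the same filter on every shard, then merge and trim in the application."""

    def query_one(rows):
        # Stand-in for "SELECT ... WHERE ... ORDER BY ... LIMIT n" on one shard.
        return sorted((r for r in rows if predicate(r)), key=sort_key)[:limit]

    # Scatter: query all shards in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(query_one, shards))

    # Gather: merge the per-shard results and apply the limit again.
    merged = [row for part in partials for row in part]
    return sorted(merged, key=sort_key)[:limit]
```

The call returns only after the slowest shard has answered, and the application holds up to limit rows per shard in memory before trimming. Both costs grow with the number of shards.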

Hotspots

Even with hash-based sharding, hotspots emerge from skewed access patterns. If you're sharding tweets by user_id and Barack Obama has 130 million followers, every read of his tweets hits the same shard. The shard key distributes data evenly, but it does not distribute access evenly.

Mitigation strategies include: caching aggressively for hot entities, splitting hot accounts into sub-shards, or adding read replicas to high-traffic shards.
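
One way to picture the sub-shard idea is to give known hot accounts several routing keys instead of one. The Python sketch below is an assumption-heavy illustration: HOT_USERS, HOT_FANOUT, the bucket scheme, and the function names are all hypothetical.

```python
import hashlib

HOT_FANOUT = 4        # number of sub-shards per hot account (assumed)
HOT_USERS = {813286}  # hypothetical IDs flagged as hot by monitoring


def _stable_shard(key: str, num_shards: int) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_for_write(user_id: int, item_id: int, num_shards: int) -> int:
    """Hot accounts spread their items over several buckets; others use one shard."""
    if user_id in HOT_USERS:
        bucket = item_id % HOT_FANOUT
        return _stable_shard(f"{user_id}:{bucket}", num_shards)
    return _stable_shard(str(user_id), num_shards)


def shards_for_read(user_id: int, num_shards: int) -> list:
    """Shards a reader must consult to see all of a user's items."""
    if user_id in HOT_USERS:
        return sorted({_stable_shard(f"{user_id}:{b}", num_shards) for b in range(HOT_FANOUT)})
    return [_stable_shard(str(user_id), num_shards)]
```

The tradeoff is that reading everything for a hot account becomes a small scatter-gather across its buckets, so this is usually reserved for the handful of accounts that need it.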

Resharding

Adding new shards after you've already launched is one of the most painful operations in distributed systems. With naive hash sharding (hash modulo shard count), adding one shard changes the assignment of most keys: going from N to N+1 shards moves roughly N/(N+1) of your data, and even doubling the shard count moves half. Consistent hashing reduces this to roughly 1/(N+1), but it's still an operational challenge: you need to migrate data while the system is live, maintain consistency during the migration, and update routing logic.
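
How much data a modulo-based scheme has to move depends on the old and new shard counts, and it is easy to check by counting keys whose assignment changes. In this Python sketch, consecutive integers stand in for uniformly distributed hash values, and moved_fraction is a hypothetical helper.

```python
def moved_fraction(old_shards: int, new_shards: int, num_keys: int = 110_000) -> float:
    """Fraction of keys whose shard changes when the modulus changes.

    Consecutive integers stand in for uniformly distributed hash values.
    """
    moved = sum(1 for k in range(num_keys) if k % old_shards != k % new_shards)
    return moved / num_keys
```

With these defaults, moved_fraction(10, 20) is 0.5, while moved_fraction(10, 11) is about 0.909: adding a single shard to a ten-shard modulo scheme relocates roughly ten out of every eleven keys.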

This is why teams often over-shard at launch — start with more shards than you currently need — so you can add physical capacity by simply assigning existing logical shards to new machines.

No Cross-Shard Transactions

ACID transactions that span multiple shards don't exist without distributed transaction protocols like two-phase commit (2PC). 2PC adds latency and failure complexity most teams don't want. In practice, sharded systems relax cross-shard consistency — they design the data model so that operations that must be atomic always touch a single shard.
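
The single-shard design can be sketched with SQLite standing in for each shard. This is an illustrative Python example, not a production pattern: the two-table schema, the modulo routing, and the function names are assumptions, and the sketch assumes the user's balance row already exists.

```python
import sqlite3

NUM_SHARDS = 2  # illustrative


def open_shards():
    """Create in-memory SQLite databases that stand in for separate shards."""
    shards = []
    for _ in range(NUM_SHARDS):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE balances (user_id INTEGER PRIMARY KEY, "
            "credits INTEGER NOT NULL CHECK (credits >= 0))"
        )
        conn.execute("CREATE TABLE purchases (user_id INTEGER NOT NULL, item TEXT NOT NULL)")
        conn.commit()
        shards.append(conn)
    return shards


def shard_of(shards, user_id: int):
    # Both tables are keyed by user_id, so one user's rows share a shard.
    return shards[user_id % len(shards)]


def purchase(shards, user_id: int, item: str, price: int) -> bool:
    """Record a purchase and debit the balance atomically on one shard."""
    conn = shard_of(shards, user_id)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO purchases (user_id, item) VALUES (?, ?)", (user_id, item))
            conn.execute(
                "UPDATE balances SET credits = credits - ? WHERE user_id = ?",
                (price, user_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the insert was rolled back.
        return False
```

Because the purchase row and the balance row are both routed by user_id, an ordinary local transaction covers them. An operation that touched two different users' balances could land on two shards and would lose this guarantee.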

Is Sharding the Right Answer?

Before sharding, exhaust simpler options:

  1. Vertical scaling: bigger machine, more RAM, faster SSDs
  2. Read replicas: offload read traffic to replicas
  3. Caching: Redis or Memcached in front of the database
  4. Vertical partitioning: move blobs and cold data out
  5. Archiving: move old data to cheaper storage, keep the hot table small

Sharding significantly increases operational complexity. Many companies grow to very large user bases on a single well-tuned Postgres or MySQL primary with read replicas, and never need to shard.

Summary

Sharding splits data horizontally across multiple database instances, each owning a subset of rows. The three main strategies are range-based (good for range queries, bad for hotspots), hash-based (good for distribution, bad for range queries and resharding), and directory-based (maximum flexibility, adds a lookup tier). The critical decision is the shard key — it determines data locality, query patterns, and hotspot risk. Real systems like Instagram and Cassandra show both the power of sharding done right and the years of engineering required to get there. Reach for sharding only after simpler scaling strategies are exhausted, and design your data model around your shard key from the start.
