Data Compression
Updated June 3, 2026

Have you ever tried to stuff an oversized sleeping bag into a tiny stuff sack? It takes effort, but once it's in, it's much easier to carry around. That's data compression in a nutshell.
In system design, moving data around the network is expensive, both in terms of time and actual dollars. Storage is cheaper than it used to be, but it's still not free. That's why we compress data.
The Core Concept: Why Compress?
Think of it this way: if you have a message that says "AAAAABBBBB", why send those 10 characters when you could just say "5A5B"? You've just reduced the size of your message by 60%.
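The "5A5B" trick is known as run-length encoding. Here is a minimal Python sketch of it; the function names `rle_encode` and `rle_decode` are made up for this illustration, and the decoder assumes the original text contains no digits.

```python
from itertools import groupby


def rle_encode(text: str) -> str:
    """Replace each run of a repeated character with <count><char>."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Inverse of rle_encode. Assumes the original text contained no digits."""
    out = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch  # counts may span several digits, e.g. "12A"
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)


message = "AAAAABBBBB"
encoded = rle_encode(message)
saving = 1 - len(encoded) / len(message)
print(encoded, f"{saving:.0%} smaller")
```

Run-length encoding only pays off when the input has long runs. A string like "ABC" encodes to "1A1B1C", which is twice as long as the original.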
When we talk about data compression in large-scale systems, we're usually making a trade-off. We are trading CPU cycles (the compute power needed to compress and decompress) for network bandwidth and storage space.
[!NOTE] CPU is often much faster than network I/O. Taking a few milliseconds of CPU time to compress a payload can save hundreds of milliseconds of network transfer time.
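To make the trade-off concrete, here is a rough sketch that times compression with Python's standard `zlib` module and compares it against an idealized transfer time. The payload and the 10 Mbit/s link speed are assumptions picked for illustration, so treat the printed numbers as a back-of-the-envelope estimate rather than a benchmark.

```python
import time
import zlib

# Hypothetical payload and link speed, chosen only for illustration.
payload = b'{"user_id": 12345, "event": "page_view", "path": "/home"}\n' * 20_000
LINK_BYTES_PER_SEC = 10_000_000 / 8  # assumed 10 Mbit/s link


def transfer_seconds(num_bytes: int) -> float:
    """Idealized transfer time: ignores latency, protocol overhead, and congestion."""
    return num_bytes / LINK_BYTES_PER_SEC


start = time.perf_counter()
compressed = zlib.compress(payload, level=6)
cpu_seconds = time.perf_counter() - start

raw_total = transfer_seconds(len(payload))
compressed_total = cpu_seconds + transfer_seconds(len(compressed))

print(f"raw:        {len(payload):>9} bytes, ~{raw_total:.3f} s on the wire")
print(f"compressed: {len(compressed):>9} bytes, ~{compressed_total:.3f} s (CPU + wire)")
```

The sketch leaves out decompression time on the receiving side, and this payload is unusually repetitive. Real data will usually compress less well.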
Two Main Flavors: Lossless vs. Lossy
1. Lossless Compression
With lossless compression, when you decompress the data, you get the exact original data back. Not a single bit is out of place.
- Where it's used: Text files, database records, configuration files, executables.
- Real-world examples:
  - Google uses Brotli and Gzip to compress web assets (HTML, CSS, JS) before sending them to your browser.
  - Databases like Cassandra and PostgreSQL can compress data on disk with fast algorithms such as LZ4 (Cassandra also supports Snappy).
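The "not a single bit out of place" property is easy to check yourself. The sketch below uses Python's standard `gzip` module on a made-up configuration snippet and confirms that the round trip returns the original bytes.

```python
import gzip

# Hypothetical configuration text, repeated to give the compressor something to work with.
original = b"host=db-01 port=5432 max_connections=200\n" * 500

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# Lossless means byte-for-byte equality after the round trip.
assert restored == original
print(f"{len(original)} bytes -> {len(compressed)} bytes -> {len(restored)} bytes")
```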
2. Lossy Compression
With lossy compression, the decompressed data is a close approximation of the original, but some of the less important details are thrown away forever.
- Where it's used: Images, video, audio.
- Real-world examples:
  - Netflix uses highly advanced lossy video compression. When you watch a movie, you aren't seeing every single original pixel from the master copy. The compression algorithm throws away data that the human eye barely notices.
  - Spotify compresses audio streams so you can listen seamlessly on a mobile connection.
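Real video and audio codecs are far more sophisticated than anything that fits here, but the core idea of throwing detail away for good can be sketched with simple quantization. The example below is a toy, not how Netflix or Spotify actually encode media. It reduces made-up 16-bit audio samples to 8 bits and shows that the restored values are close to the originals without matching them. The helper names are hypothetical.

```python
import math

# One cycle of a hypothetical sine wave, stored as signed 16-bit samples.
samples = [round(32767 * math.sin(2 * math.pi * i / 64)) for i in range(64)]


def quantize(sample: int, dropped_bits: int = 8) -> int:
    """Keep only the top bits of a sample (16 bits -> 8 bits by default)."""
    return sample >> dropped_bits


def dequantize(code: int, dropped_bits: int = 8) -> int:
    """Approximate the original sample; the dropped low bits cannot be recovered."""
    return (code << dropped_bits) + (1 << (dropped_bits - 1))


codes = [quantize(s) for s in samples]       # half the storage per sample
restored = [dequantize(c) for c in codes]    # close to, but not equal to, samples
max_error = max(abs(a - b) for a, b in zip(samples, restored))

print(f"exact match: {restored == samples}, max error: {max_error} out of 32767")
```

Each sample now needs half the bits, and the error stays within 128 steps on a scale of 32767. That is the lossy bargain: a bounded loss of detail in exchange for a smaller size.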
Common Compression Algorithms You Should Know
If you're building a system, which algorithm should you choose? It depends on what you value most.
| Algorithm | Speed | Compression Ratio | Best For |
|---|---|---|---|
| Gzip | Medium | Good | Web traffic, general purpose file compression |
| LZ4 | Very Fast | Low | High-throughput systems, databases, real-time logging |
| Snappy | Very Fast | Low | Developed by Google; used in MapReduce, BigTable, Cassandra |
| Zstandard (Zstd) | Fast | Very Good | Developed by Facebook; arguably the best modern general-purpose compressor |
| Brotli | Slow (Encode) / Fast (Decode) | Excellent | Static web assets where you encode once and decode millions of times |
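The speed versus ratio axis in the table shows up even inside a single algorithm. The sketch below uses `zlib` from Python's standard library (DEFLATE, the algorithm behind Gzip) at three compression levels as a stand-in. LZ4, Snappy, and Brotli generally need third-party packages, so they are left out here. The log lines are invented sample data, and timings will vary by machine.

```python
import time
import zlib

# Hypothetical log lines used as sample input.
data = b"".join(
    f"2026-06-03T12:00:00Z INFO request_id={i} status=200 path=/api/items\n".encode()
    for i in range(50_000)
)

sizes = {}
for level in (1, 6, 9):  # 1 = fastest, 9 = smallest output
    start = time.perf_counter()
    compressed = zlib.compress(data, level=level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    sizes[level] = len(compressed)
    ratio = len(data) / len(compressed)
    print(f"level {level}: {len(compressed):>8} bytes, ratio {ratio:4.1f}x, {elapsed_ms:6.1f} ms")
```

Higher levels usually spend more CPU time to produce smaller output. Measuring on your own data is the only reliable way to choose a setting.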
Summary
- Compression trades CPU cycles for storage space and network bandwidth.
- Lossless preserves every bit. Lossy sacrifices detail for massive space savings.
- Algorithms like LZ4 and Snappy prioritize speed, making them great for databases.
- Don't bother compressing data that's already compressed (like media files).
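The last point can be demonstrated with a short sketch. Compressed media looks statistically close to random bytes, so the example uses `os.urandom` as a stand-in for an already-compressed file (an assumption made for illustration) and compares it with repetitive text.

```python
import os
import zlib

text = b"the quick brown fox jumps over the lazy dog\n" * 2_000
media_like = os.urandom(100_000)  # stand-in for already-compressed media

packed_text = zlib.compress(text)
packed_media = zlib.compress(media_like)

print(f"text:       {len(text):>7} -> {len(packed_text):>7} bytes")
print(f"media-like: {len(media_like):>7} -> {len(packed_media):>7} bytes")
```

The text shrinks dramatically, while the high-entropy input stays about the same size or grows slightly because of format overhead. The CPU time spent on it is wasted.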