Checksums
Updated June 6, 2026
The Problem: How Do You Know Data Arrived Intact?
Imagine downloading a 4GB Ubuntu ISO. Hours later, the download finishes. But how do you know every single bit arrived correctly? Maybe a cosmic ray flipped a bit somewhere. Maybe the network dropped a packet and something got corrupted. Maybe, less charitably, the server was compromised and someone slipped malware into the file.
You can't read 4GB of binary data and spot an error with your eyes. You need a way to verify data integrity automatically. That's exactly what checksums are for.
A checksum is a short, fixed-length value derived from a piece of data. Run the same algorithm on the same data and you always get the same checksum. Change even a single bit of the data, and a good algorithm produces a different checksum; with a cryptographic hash, roughly half of the output bits flip, so the result looks completely unrelated. This makes checksums a powerful tool for detecting corruption.
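To make the "same input, same output; one flipped bit, different output" behavior concrete, here is a minimal sketch using SHA-256 from Python's standard library. The helper names (`sha256_hex`, `flip_bit`, `differing_bits`) and the sample text are illustrative choices, not part of any standard API.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count how many bits differ between two equal-length hex digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

original = b"The quick brown fox jumps over the lazy dog"
corrupted = flip_bit(original, 0)  # invert the lowest bit of the first byte

print(sha256_hex(original))
print(sha256_hex(corrupted))
print(differing_bits(sha256_hex(original), sha256_hex(corrupted)), "of 256 bits differ")
```

Running it a few times with different bit positions should show the count hovering around half of the 256 output bits, which is why the two digests look unrelated.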
How Checksums Work
The core idea is simple: take your data, feed it through an algorithm, and get a fingerprint out.
[Figure: Sender computing and attaching a checksum fingerprint]
    checksum = algorithm(data)
When you share data, you share the checksum alongside it. The recipient runs the same algorithm on the received data and compares the result:
- Checksums match: Data arrived intact
- Checksums differ: Data was corrupted or tampered with
[Figure: Receiver recomputing and verifying checksum matches]
The checksum itself is tiny, usually 32 to 256 bits, regardless of how large the original data is. That's the point. You can quickly verify gigabytes of data by comparing a handful of characters.
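The sender/receiver exchange described above can be sketched in a few lines. This is only an illustration: `make_message` and `verify_message` are hypothetical helper names, and SHA-256 stands in for whichever algorithm the two sides agree on.

```python
import hashlib

def make_message(payload: bytes) -> tuple[bytes, str]:
    """Sender side: return the payload together with its SHA-256 checksum."""
    return payload, hashlib.sha256(payload).hexdigest()

def verify_message(payload: bytes, claimed_checksum: str) -> bool:
    """Receiver side: recompute the checksum and compare it to the one sent."""
    return hashlib.sha256(payload).hexdigest() == claimed_checksum

payload, checksum = make_message(b"transfer 100 credits to account 42")

print(verify_message(payload, checksum))         # intact copy: checksums match
print(verify_message(payload + b"0", checksum))  # altered copy: checksums differ
```

Note that the checksum stays 64 hex characters whether the payload is a few bytes or several gigabytes.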
Common Checksum Algorithms
CRC32 (Cyclic Redundancy Check)
A fast, 32-bit algorithm designed specifically for detecting errors in network transmissions and storage. It's not a cryptographic hash; it's not designed to be tamper-proof. But it's extremely fast and excellent at detecting accidental corruption.
Where you'll find it: Ethernet frames, ZIP files, PNG images, and gzip streams. CRCs are the workhorse of link-layer and storage error detection. (TCP and UDP headers use a different, simpler 16-bit ones' complement checksum rather than a CRC.)
Example output: e3d388a3 (8 hex characters = 32 bits)
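CRC32 is not a good choice for verifying file downloads from an untrusted source: it only guards against accidental corruption, and an attacker can easily craft a modified file with the same CRC32.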
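As a quick illustration of CRC32 in practice, the sketch below uses `zlib.crc32` from Python's standard library, which implements the same CRC-32 used by ZIP and PNG. The `crc32_hex` wrapper and the sample frame contents are made up for the example.

```python
import zlib

def crc32_hex(data: bytes) -> str:
    """Return the CRC-32 of data as 8 lowercase hex characters."""
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"

# "123456789" is the conventional test input for CRC implementations.
print(crc32_hex(b"123456789"))  # cbf43926

frame = bytearray(b"payload bytes on the wire")
good = crc32_hex(bytes(frame))

frame[3] ^= 0x01  # simulate a single-bit transmission error
print(good != crc32_hex(bytes(frame)))  # True: the corruption is detected
```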
MD5 (Message Digest 5)
A 128-bit hash algorithm that was once widely used for security. It's cryptographically broken: attacks exist that can generate two different files with the same MD5 hash (a collision). Don't use MD5 for security purposes.
However, MD5 is still commonly used for non-security integrity checks, such as verifying file downloads when you trust the source or checking whether files changed in distributed systems. It's fast and widely supported.
Example output: d8e8fca2dc0f896fd7cb4cb0031ba249 (32 hex characters = 128 bits)
SHA-256 (Secure Hash Algorithm)
A 256-bit cryptographic hash. Currently considered secure; no practical collision attacks are known. This is the gold standard for both integrity verification and security.
Where you'll find it: HTTPS certificates, Bitcoin proof-of-work, file verification. When you download software and the vendor provides a "SHA-256 checksum," this is what they mean.
Example output: 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824 (64 hex characters = 256 bits)
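The example digest above happens to be the SHA-256 of the five ASCII bytes `hello`, so you can reproduce it yourself. The short sketch below does that with Python's `hashlib`, and also shows that feeding the input in pieces gives the same digest as hashing it in one call.

```python
import hashlib

digest = hashlib.sha256(b"hello").hexdigest()
print(digest)
print(len(digest) * 4, "bits")  # each hex character encodes 4 bits

# Feeding the data in pieces gives the same result as hashing it in one call,
# which is what makes it possible to hash large files chunk by chunk.
h = hashlib.sha256()
h.update(b"hel")
h.update(b"lo")
print(h.hexdigest() == digest)
```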
SHA-1
A 160-bit hash. Once widely used (TLS, Git, SVN). Now considered weak: collision attacks have been demonstrated (Google's SHAttered attack in 2017 produced two different PDFs with the same SHA-1 hash). Git still uses SHA-1 internally but is migrating to SHA-256.
| Algorithm | Output Size | Speed | Cryptographically Secure? | Use Case |
|---|---|---|---|---|
| CRC32 | 32 bits | Very fast | No | Network/storage error detection |
| MD5 | 128 bits | Fast | No (broken) | Non-security integrity checks |
| SHA-1 | 160 bits | Fast | Weak | Legacy systems (avoid for new systems) |
| SHA-256 | 256 bits | Moderate | Yes | Security, file verification |
| SHA-512 | 512 bits | Moderate | Yes | High-security applications |
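If you want to check the output sizes in the table yourself, a small sketch like the following prints each algorithm's output width and its digest of the same input. `output_bits` is a hypothetical helper; CRC32 comes from `zlib`, while the others are looked up by name through `hashlib.new`.

```python
import hashlib
import zlib

def output_bits(name: str) -> int:
    """Return the output width in bits for one of the algorithms in the table."""
    if name == "crc32":
        return 32  # zlib.crc32 returns a 32-bit unsigned integer
    return hashlib.new(name).digest_size * 8

data = b"hello"

print(f"{'crc32':8} {output_bits('crc32'):4} bits  {zlib.crc32(data):08x}")
for name in ("md5", "sha1", "sha256", "sha512"):
    print(f"{name:8} {output_bits(name):4} bits  {hashlib.new(name, data).hexdigest()}")
```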
Use Cases in the Real World
Network Packet Validation
Every Ethernet frame carries a CRC32 frame check sequence, and every TCP segment carries a 16-bit checksum. When a packet arrives, the receiver recomputes the value and compares it. If it doesn't match, the packet is silently discarded. TCP recovers because the sender never receives an acknowledgment and retransmits the data; UDP has no such mechanism, so the datagram is simply lost. This happens billions of times per second across the internet.
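TCP and UDP use a 16-bit ones' complement sum, often called the Internet checksum and described in RFC 1071. The function below is a simplified sketch of that arithmetic only: a real TCP or UDP checksum also covers a pseudo-header with the IP addresses, which is left out here. The name `internet_checksum` is just a label for this example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF

segment = bytes.fromhex("0001f203f4f5f6f7")  # example words from RFC 1071
checksum = internet_checksum(segment)
print(f"{checksum:04x}")  # 220d

# Receiver check: running the same sum over data plus checksum yields 0.
received = segment + checksum.to_bytes(2, "big")
print(internet_checksum(received) == 0)  # True
```

Because it is a plain sum, this checksum is cheap to compute but misses some errors a CRC would catch, such as two 16-bit words being swapped.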
File Downloads and Distribution
Software releases almost always publish SHA-256 checksums alongside download links. After downloading, you verify:
    sha256sum ubuntu-24.04-desktop-amd64.iso
    # Compare output to the published checksum
If they match, the file is intact. If not, your download was corrupted (or the site was compromised).
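If you'd rather script the comparison than eyeball it, the same check can be sketched in Python. The function names `sha256_file` and `matches_published` are hypothetical, and the demo hashes a small temporary file rather than a real ISO; the file is read in chunks so that a multi-gigabyte download never has to fit in memory.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published(path: str, published_hex: str) -> bool:
    """Compare the file's digest with a published one, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_file(path), published_hex.strip().lower())

# Demo with a tiny stand-in file; a real check would point at the downloaded file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")

published = "2CF24DBA5FB0A30E26E83B2AC5B9E29E1B161E5C1FA7425E73043362938B9824"
print(matches_published(tmp.name, published))
os.remove(tmp.name)
```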
Distributed Storage Systems
[Figure: Checksum mismatch flagging corrupted data during transmission]
Systems like AWS S3, HDFS, and Cassandra use checksums to detect silent data corruption (also called bit rot). When a file is stored, its checksum is stored alongside it. Periodic background processes re-read files and verify their checksums, flagging corruption for repair.
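The store-then-scrub pattern can be illustrated with a toy in-memory store. This is not how S3, HDFS, or Cassandra are actually implemented; `ChecksummedStore` and its methods are invented for this sketch, and the "bit rot" is simulated by flipping a bit directly in the stored bytes.

```python
import zlib

class ChecksummedStore:
    """Toy in-memory block store that keeps a CRC32 next to each block."""

    def __init__(self) -> None:
        self._blocks: dict[str, tuple[bytearray, int]] = {}

    def put(self, key: str, data: bytes) -> None:
        """Store the data together with the checksum computed at write time."""
        self._blocks[key] = (bytearray(data), zlib.crc32(data))

    def scrub(self) -> list[str]:
        """Re-read every block and return keys whose checksum no longer matches."""
        return [
            key
            for key, (data, stored_crc) in self._blocks.items()
            if zlib.crc32(bytes(data)) != stored_crc
        ]

store = ChecksummedStore()
store.put("block-1", b"customer records")
store.put("block-2", b"order history")

store._blocks["block-2"][0][0] ^= 0x04  # simulate bit rot in the stored bytes
print(store.scrub())  # ['block-2']
```

In a real system the scrubber would then repair the flagged block from a healthy replica.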
Git and Version Control
Every commit, tree, and blob in Git is identified by a SHA-1 hash (moving to SHA-256). The hash is computed from the content, which means:
- The same content always has the same hash
- Any change to content produces a completely different hash
- You can verify the integrity of the entire repository
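To see how content becomes an identifier, here is a sketch of how Git derives a blob ID in its default SHA-1 object format: it hashes a short header (`blob`, the content length, and a NUL byte) followed by the content. `git_blob_id` is a hypothetical helper name; the results should match what `git hash-object` prints for the same bytes.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Trees and commits are hashed the same way with their own headers, and because each one embeds the IDs of the objects it points to, a change anywhere ripples up to the commit hash.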
Database Replication
When replicating data across nodes, databases sometimes use checksums to verify that replicas are in sync with the primary. A checksum mismatch signals that replication has drifted.
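One simple way to picture a replica check is to hash each side's rows in a canonical order and compare the results. The sketch below assumes rows are small tuples that can be sorted and that `repr` is a stable serialization for them; `table_checksum` and the sample rows are invented for illustration, and real databases use their own tooling for this.

```python
import hashlib

def table_checksum(rows: list[tuple]) -> str:
    """Hash rows in a canonical (sorted) order so row order does not matter."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode("utf-8"))
        h.update(b"\n")  # separator so row boundaries are unambiguous
    return h.hexdigest()

primary = [(1, "alice", 120), (2, "bob", 75), (3, "carol", 300)]
replica_ok = [(3, "carol", 300), (1, "alice", 120), (2, "bob", 75)]
replica_drifted = [(1, "alice", 120), (2, "bob", 70), (3, "carol", 300)]

print(table_checksum(primary) == table_checksum(replica_ok))       # True
print(table_checksum(primary) == table_checksum(replica_drifted))  # False
```

Comparing one short digest per table (or per key range) is far cheaper than shipping every row across the network to compare it.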
Checksums vs. Hashing vs. Encryption
These three terms are often confused:
Checksums are designed for error detection. They're fast, small, and optimized for detecting accidental corruption. Not necessarily designed to be tamper-proof.
Cryptographic hashes (SHA-256, etc.) are a subset of checksums designed to be collision-resistant and one-way. You can't reverse a hash to get the original data, and you can't easily find two inputs that produce the same hash. Used for security purposes: password storage, digital signatures, or certificate verification.
Encryption is fundamentally different: it is reversible. You encrypt data with a key, and the recipient decrypts it with a key. Encryption is about confidentiality (keeping data secret). Hashing is about integrity (detecting changes). HTTPS uses both: TLS encrypts your data, and digital certificates carry signatures computed over a cryptographic hash, which lets the client verify the server's identity.
A common mistake: using MD5 for password storage. MD5 is a checksum; it's fast (bad for passwords, which you want to be slow to brute-force) and broken. Use bcrypt, Argon2, or scrypt for passwords. These are key derivation functions designed specifically for this use case.
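For contrast with a fast hash, here is a minimal sketch of password hashing with scrypt via `hashlib.scrypt` from Python's standard library. The cost parameters (`n=2**14`, `r=8`, `p=1`) and the 16-byte salt are illustrative assumptions, not a tuned recommendation, and the helper names are hypothetical; a production system would typically use a maintained password-hashing library.

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived_key) for storage; a fresh random salt per password."""
    salt = os.urandom(16)
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, n=2**14, r=8, p=1, dklen=32)
    return salt, key

def check_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Re-derive the key with the stored salt and compare in constant time."""
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, n=2**14, r=8, p=1, dklen=32)
    return hmac.compare_digest(key, expected_key)

salt, stored_key = hash_password("correct horse battery staple")
print(check_password("correct horse battery staple", salt, stored_key))
print(check_password("wrong guess", salt, stored_key))
```

Each call is deliberately slow and memory-hungry compared with MD5, which is what makes brute-force guessing expensive.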
Checksums in System Design
When designing distributed systems, you'll encounter checksums in a few recurring patterns:
- Content-addressable storage: store data by its hash. If two users upload identical files, you only store one copy. Git, Git LFS, and many deduplicating backup systems work this way.
- ETags in HTTP: a server can return an ETag (often a hash of the content) for a resource. The browser caches it and sends If-None-Match: <etag> on subsequent requests. If the content hasn't changed, the server returns 304 Not Modified instead of resending everything.
- Idempotency keys: sometimes you hash request parameters to generate a unique key, ensuring duplicate requests aren't processed twice.
- Merkle trees: data structures that use hashes to efficiently verify large datasets. Bitcoin, Git, and Cassandra all use Merkle trees.
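Two of the patterns above, content-addressable storage and Merkle trees, are small enough to sketch. The code below is a toy under stated assumptions: `ContentStore` and `merkle_root` are invented names, SHA-256 is used throughout, and an odd node is paired with a copy of itself. Real systems differ in such details (Bitcoin, for example, applies SHA-256 twice).

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: the key is the hash of the value."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(key, data)  # identical uploads share one entry
        return key

    def __len__(self) -> int:
        return len(self._objects)

def merkle_root(leaves: list[bytes]) -> str:
    """Hash leaves pairwise up to a single root; an odd node is paired with itself."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()

store = ContentStore()
first = store.put(b"same file")
second = store.put(b"same file")
print(first == second, len(store))  # one stored copy for two uploads

chunks = [b"chunk-0", b"chunk-1", b"chunk-2"]
tampered = [b"chunk-0", b"chunk-X", b"chunk-2"]
print(merkle_root(chunks) == merkle_root(tampered))  # any changed chunk changes the root
```

Comparing two roots tells you whether two large datasets are identical; when they differ, walking down the tree narrows the mismatch to specific chunks without rehashing everything.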
Summary
Checksums are fixed-length values derived from data using an algorithm; run the same algorithm on the same data and you always get the same result. They're the fundamental mechanism for detecting data corruption and verifying integrity. CRC32 is fast and great for network error detection. MD5 is fast but cryptographically broken; it is fine for non-security integrity checks. SHA-256 is the secure standard for anything security-sensitive. Checksums differ from encryption: encryption is reversible and ensures confidentiality, while checksums are one-way and ensure integrity. In distributed systems, you'll encounter checksums in storage verification, data deduplication, HTTP caching, and replication consistency checks.