Latency vs Throughput vs Bandwidth
Updated June 3, 2026
Three terms come up constantly in system design conversations: latency, throughput, and bandwidth. They're related but distinct, and confusing them leads to the wrong solutions. Let's make each one crystal clear.
The Highway Analogy
Think of a highway:
- Bandwidth is the number of lanes, representing the maximum capacity of the road
- Throughput is the actual number of cars passing through per hour, representing the actual traffic being handled
- Latency is how long it takes a single car to travel from point A to point B
A 10-lane highway has high bandwidth. At 2am with no traffic, throughput is low but latency is also low, meaning your car gets there fast. At rush hour, throughput approaches the bandwidth limit, and latency goes up because everyone is stuck in traffic.
This analogy holds surprisingly well for computer systems.
Bandwidth
Bandwidth is the maximum data transfer rate of a network link, typically measured in Mbps (megabits per second) or Gbps. It's a physical constraint of the medium.
A fiber connection might give you 10 Gbps of bandwidth. That doesn't mean you're always pushing 10 Gbps; instead, it means you can't push more than 10 Gbps no matter what you do.
In system design, bandwidth shows up as a constraint when you're:
- Designing replication between database nodes (will the replica keep up with the primary?)
- Estimating CDN costs (how much data will we serve per month?)
- Planning for video streaming (what bitrate can we sustain per user?)
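The replication and transfer questions above reduce to simple arithmetic on bits and bytes. Here is a minimal sketch; the function names, the 50 GB snapshot, and the 200 MB/s write rate are illustrative assumptions, and the results are best-case figures that ignore protocol overhead.

```python
GBPS = 1_000_000_000  # bits per second in one gigabit per second


def transfer_seconds(size_bytes: float, link_bps: float) -> float:
    """Best-case time to move size_bytes over a link rated at link_bps (bits/s)."""
    if link_bps <= 0:
        raise ValueError("link_bps must be positive")
    return size_bytes * 8 / link_bps  # bytes -> bits, then divide by the rate


def replica_keeps_up(write_bytes_per_sec: float, link_bps: float) -> bool:
    """True if the primary's write rate fits within the link's bandwidth."""
    return write_bytes_per_sec * 8 <= link_bps


# Hypothetical 50 GB snapshot over a 10 Gbps link: 40 seconds at best.
print(transfer_seconds(50e9, 10 * GBPS))

# Hypothetical 200 MB/s of writes is 1.6 Gbps, which fits in 10 Gbps.
print(replica_keeps_up(200e6, 10 * GBPS))
```

Note the factor of 8: link speeds are quoted in bits per second, while file and write sizes are usually quoted in bytes.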
Throughput
Throughput is the actual amount of work done per unit of time, such as requests per second, transactions per second, or bytes per second. Unlike bandwidth (which is the ceiling), throughput is what you actually observe.
Throughput is limited by the weakest link in your system. You can have 10 Gbps of network bandwidth, but if your database can only handle 1,000 queries per second, that's your throughput ceiling for read-heavy workloads.
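The weakest-link idea can be sketched as taking the minimum over the stages a request passes through. The stage names and capacities below are hypothetical numbers chosen for illustration, and the model assumes every request touches every stage.

```python
def throughput_ceiling(stage_capacity_rps: dict[str, float]) -> tuple[str, float]:
    """Return (stage name, capacity) for the slowest stage in a serial pipeline."""
    name = min(stage_capacity_rps, key=stage_capacity_rps.get)
    return name, stage_capacity_rps[name]


# Hypothetical capacities, in requests per second.
stages = {
    "network": 125_000.0,     # 10 Gbps divided by an assumed 10 KB response
    "app_servers": 20_000.0,  # assumed
    "database": 1_000.0,      # the 1,000 queries/second from the text
}

print(throughput_ceiling(stages))  # ('database', 1000.0)
```

Adding capacity to any stage other than the one returned here leaves the ceiling unchanged.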
Improving throughput usually means:
- Horizontal scaling (more workers handling requests in parallel)
- Caching (serve more requests without hitting the bottleneck)
- Query optimization (each unit of work takes less time)
- Batching (amortize fixed overhead across multiple operations)
Load Balanced Throughput: parallel server instances tripling request throughput under load
Latency
Latency is the time it takes to complete a single operation, from request to response. It's measured in milliseconds (ms) or microseconds (µs).
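Latency has a hard floor: the speed of light. Light in fiber travels at roughly two-thirds of its vacuum speed, so the roughly 5,600 km between New York and London cannot be covered in a round trip faster than about 56ms, and real routes typically land near 70ms. That's just physics. You can't engineer your way past it. Everything else (processing time, queueing delays, serialization) adds on top.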
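The physical floor can be estimated from distance and signal speed. This is a rough sketch: the 5,570 km great-circle distance and the two-thirds fiber factor are approximations assumed here, and real cable routes are longer than the great circle, so measured round trips sit above the result.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBER_FACTOR = 2 / 3               # assumption: light in fiber moves at about 2/3 c


def min_rtt_ms(distance_km: float, factor: float = FIBER_FACTOR) -> float:
    """Lower bound on round-trip time in milliseconds for a given distance."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_S * factor)
    return 2 * one_way_seconds * 1000


NY_LONDON_KM = 5_570  # approximate great-circle distance (assumption)

print(round(min_rtt_ms(NY_LONDON_KM), 1))              # in fiber: roughly 56 ms
print(round(min_rtt_ms(NY_LONDON_KM, factor=1.0), 1))  # in a vacuum: roughly 37 ms
```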
Sources of latency:
- Network latency: physical distance + routing hops
- Processing latency: CPU time to actually do the work
- Queueing latency: time spent waiting when the system is busy
- Disk I/O latency: reading from SSDs (µs) vs HDDs (ms) vs network storage (ms-s)
Single Server Latency: request and query round trips showing typical P50 vs outlier P99 latencies
The Latency-Throughput Trade-off
Here's where it gets interesting: latency and throughput often trade off against each other.
The classic example is batching. If you send database writes one at a time, each write has low latency (it completes quickly). But your throughput is limited since you can only do one write at a time, waiting for each to finish.
If instead you batch 100 writes together and send them in one transaction, your throughput skyrockets because you've amortized the overhead. But the first write in the batch now has to wait for 99 others before being committed. Higher throughput, higher latency.
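A toy model makes the trade-off concrete. It assumes each transaction pays a fixed overhead plus a small per-write cost; the 5ms and 0.1ms figures are invented for illustration, and the model ignores the extra time a write spends waiting for its batch to fill.

```python
def batch_stats(batch_size: int, overhead_ms: float = 5.0,
                per_write_ms: float = 0.1) -> tuple[float, float]:
    """Model one transaction as fixed overhead plus a per-write cost.

    Returns (writes per second, milliseconds until the batch commits).
    """
    batch_ms = overhead_ms + batch_size * per_write_ms
    throughput = batch_size / batch_ms * 1000
    return throughput, batch_ms


for n in (1, 10, 100):
    tput, latency = batch_stats(n)
    print(f"batch={n:>3}  throughput={tput:,.0f} writes/s  latency={latency:.1f} ms")
```

Under these assumed costs, going from a batch of 1 to a batch of 100 raises throughput from about 200 to about 6,700 writes per second, while commit latency rises from 5.1ms to 15ms.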
This trade-off shows up everywhere:
- Kafka batches messages for high throughput at the cost of some latency
- Nagle's algorithm in TCP buffers small packets to reduce overhead, which is great for throughput but bad for interactive applications
- Database connection pooling increases throughput but adds queueing latency under load
Batching Trade-off: Write Queue and Batch Writer boosting throughput at the cost of waiting latency
P50, P95, P99: Why Averages Lie
Here's a trap almost everyone falls into: using average latency as your performance metric.
Imagine a service where 99% of requests complete in 10ms and 1% take 10 seconds. The average works out to about 110ms (0.99 × 10ms + 0.01 × 10,000ms), which seems "acceptable." But 1% of your users are experiencing catastrophic slowness. At 1,000 requests per second, that's 10 requests every second giving someone a terrible time.
This is why the industry uses percentile latencies:
- P50 (median): Half of requests are faster than this, half are slower, representing the "typical" experience.
- P95: 95% of requests complete faster than this, while 5% are slower.
- P99: 99% of requests complete faster than this. Only 1% are slower, but at scale, 1% is a lot of people.
- P99.9: The "tail" latency or extreme outliers, often caused by garbage collection pauses, lock contention, or cold cache misses.
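The percentiles above can be computed from raw samples with the nearest-rank method, which is one of several common definitions. The sample below is a hypothetical mix (98% of requests at 10ms, 1.5% at 200ms, 0.5% at 10 seconds) chosen to show how the mean hides the tail.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds for 10,000 requests.
latencies_ms = [10.0] * 9_800 + [200.0] * 150 + [10_000.0] * 50

print("mean :", mean(latencies_ms))              # 62.8
print("P50  :", percentile(latencies_ms, 50))    # 10.0
print("P95  :", percentile(latencies_ms, 95))    # 10.0
print("P99  :", percentile(latencies_ms, 99))    # 200.0
print("P99.9:", percentile(latencies_ms, 99.9))  # 10000.0
```

The mean of 62.8ms matches no request that actually occurred, while the percentiles show that most requests are fast and a small fraction are very slow.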
Rule of thumb: optimize for P99, not P50. Your average user experience doesn't determine your reputation; instead, your worst 1% does.
Why P99 Matters More Than You Think
At scale, percentiles compound. Amazon's research showed that a page with 100 service calls, where each has a 99.9% success rate, has only a 90% chance of all calls succeeding (0.999^100 ≈ 0.905). Similarly, if each call independently has a 1% chance of being slow, a page with 100 calls has about a 63% chance of at least one slow call (1 - 0.99^100 ≈ 0.634).
This is the long-tail problem of distributed systems. High-percentile latency (P99, P99.9) affects a large share of users in complex systems even if each individual service looks healthy.
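Under the simplifying assumption that calls are independent, the compounding follows directly from multiplying probabilities. A small sketch, with function names chosen here for illustration:

```python
def p_all_succeed(calls: int, p_success: float) -> float:
    """Probability that every one of `calls` independent calls succeeds."""
    return p_success ** calls


def p_at_least_one_slow(calls: int, p_slow: float) -> float:
    """Probability that at least one of `calls` independent calls is slow."""
    return 1 - (1 - p_slow) ** calls


print(round(p_all_succeed(100, 0.999), 3))       # 0.905
print(round(p_at_least_one_slow(100, 0.01), 3))  # 0.634
```

Real services are rarely fully independent (a shared database stall slows many calls at once), so treat these as rough estimates.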
Real Examples
Database reads: A PostgreSQL query might show P50 = 2ms, P99 = 150ms. The gap usually comes from the few queries that miss an index and fall back to a full scan, or that wait on locks.
API responses: A typical web API might show P50 = 50ms, P99 = 800ms. The tail is often caused by garbage collection pauses in the JVM/Node.js, cold connections to downstream services, or request queueing during traffic spikes.
Network hops: A CDN edge node 20ms from a user beats an origin server 200ms away for every percentile. Geography matters.
Putting It Together
| Metric | Question it answers | Improved by |
|---|---|---|
| Bandwidth | What's the maximum data rate possible? | Better hardware, more network links |
| Throughput | How much work is the system actually doing? | Parallelism, caching, batching |
| Latency (P50) | What's the typical user experience? | Faster algorithms, caching, proximity |
| Latency (P99) | What's the worst common experience? | Eliminating outliers: GC tuning, timeouts, retry budgets |
Summary
Bandwidth is the capacity ceiling, throughput is what you actually achieve, and latency is how fast individual operations complete. The highway analogy holds: bandwidth is lanes, throughput is cars passing through, latency is travel time. Latency and throughput often trade off; batching increases throughput but raises latency. Most importantly: measure latency with percentiles, not averages. P99 latency is what your worst 1% of users experience, and at scale, 1% is a huge number of people. Optimize for the tail.