Rate Limiting
Updated June 3, 2026
In 2016, the GitHub API went down for several hours. The cause wasn't a hardware failure or a bad deploy; it was a single client hammering the API with far more requests than the system was designed to handle. Without sufficient rate limiting at the right layers, one misbehaving actor can take down the whole system.
Rate limiting is how you prevent that.
Why Rate Limiting Exists
Rate limiting controls how many requests a client can make in a given time window. The most direct motivation is infrastructure protection: a single misbehaving client shouldn't be able to exhaust server resources for everyone else. Related to this is fair usage — the "noisy neighbor" problem, where one tenant's traffic degrades service quality for all others.
Beyond protection, there are operational concerns. If your service calls expensive third-party APIs downstream, a buggy client loop can burn through your budget in minutes. Rate limiting keeps costs bounded. At the security layer, brute-force attacks against login endpoints depend on the ability to fire thousands of requests rapidly; a tight rate limit on authentication routes makes this class of attack impractical.
Finally, rate limiting is the mechanism behind API monetization. Enforcing usage tiers — free plans at 100 requests/day, paid plans at 10,000 — requires knowing and enforcing per-client limits.
When a client exceeds their limit, you return HTTP 429 Too Many Requests with a Retry-After header indicating when they can try again.
The Algorithms
There are five main algorithms, each with a different tradeoff profile.
Token Bucket
This is the most widely used algorithm in production systems. Here's the mental model:
Imagine a bucket with a maximum capacity of 100 tokens. Tokens are added to the bucket at a fixed rate — say, 10 tokens per second. Each request consumes one token. If the bucket is empty when a request arrives, the request is rejected (or queued). If the bucket has tokens, the request goes through.
- Handles bursts gracefully — a client that hasn't made requests in a while has a full bucket and can make many requests rapidly
- Enforces an average rate over time (the refill rate)
- Simple to implement and reason about
AWS API Gateway's default throttling and most Redis-based rate limiters use a token bucket variant.
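The mental model above can be sketched in a few lines. This is a minimal single-process illustration, not how any particular gateway implements it; the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class TokenBucket:
    """Token bucket: refill at a fixed rate, spend one token per request."""

    def __init__(self, capacity=100, refill_rate=10.0, clock=time.monotonic):
        self.capacity = capacity        # maximum tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock              # injectable so the sketch is testable
        self.tokens = float(capacity)   # start with a full bucket
        self.last_refill = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is rejected."""
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Refilling lazily on each call, rather than with a background timer, means the only state per client is a token count and a timestamp.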
Key: ratelimit:user:123
Tokens remaining: 87
Last refill: 2024-01-15T10:30:00Z
Leaky Bucket
Think of a bucket with a small hole at the bottom. Requests enter the bucket from the top (queued). They "leak" out the bottom at a fixed rate and get processed.
If requests arrive faster than the leak rate, the bucket fills up. Once full, new requests overflow — they're dropped.
- Enforces a perfectly smooth output rate — no bursts, ever
- Good for protecting downstream systems that can't handle spikes
- Queue depth provides natural backpressure
The downside: If there's a sudden burst of legitimate traffic (e.g., 10,000 users all opening the app at 9am), leaky bucket will queue them and process them at its fixed rate, adding latency to all of them. Token bucket would handle the burst immediately.
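A rough sketch of the queue-based behaviour described above, under some simplifying assumptions: time is passed in explicitly as `now`, and the names `LeakyBucket`, `offer`, and `drain` are invented for the example. A real deployment would more likely drain from a worker loop or timer.

```python
from collections import deque


class LeakyBucket:
    """Queue-based leaky bucket: bounded queue in, fixed-rate drain out."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # maximum number of queued requests
        self.leak_rate = leak_rate  # requests released per second
        self.queue = deque()
        self.last_leak = 0.0        # time up to which leaking is accounted for

    def offer(self, request, now):
        """Enqueue a request; return False if the bucket is full (overflow)."""
        if not self.queue:
            # An idle bucket earns no credit: restart the leak clock.
            self.last_leak = max(self.last_leak, now)
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self, now):
        """Return the requests that are due for processing by time `now`."""
        due = int((now - self.last_leak) * self.leak_rate)
        released = [self.queue.popleft() for _ in range(min(due, len(self.queue)))]
        if due > 0:
            self.last_leak += due / self.leak_rate
        return released
```

With `capacity=5` and `leak_rate=2`, a burst of 8 simultaneous requests has 3 dropped on arrival, and the remaining 5 are released two per second. That is the added latency the paragraph above describes.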
Fixed Window Counter
Divide time into fixed windows (e.g., each 1-minute interval). Count requests in the current window. If the count exceeds the limit, reject the request.
Window: 2024-01-15 10:30:00 → 10:31:00
Count: 847 / 1000
- Dead simple to implement
- Very low memory — just a counter per client per window
The classic problem: boundary exploitation. If the limit is 100 requests/minute, a client can make 100 requests at 10:59:59 and another 100 at 11:00:01. That is 200 requests in a 2-second span, double the intended rate, while technically staying within the "per-minute" limit. This is a real weakness that has been exploited in practice, not just a theoretical one.
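The boundary problem is easy to reproduce. The sketch below is an illustrative in-memory counter (the class name `FixedWindowCounter` and the explicit `now` argument, in seconds, are assumptions for the example) that sends one burst just before a window boundary and another just after it.

```python
class FixedWindowCounter:
    """Fixed window: one counter that resets whenever the window index changes."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window = None  # index of the window currently being counted
        self.count = 0

    def allow(self, now):
        window = int(now // self.window_seconds)
        if window != self.window:
            # A new window has started: forget the old count entirely.
            self.window = window
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


limiter = FixedWindowCounter(limit=100, window_seconds=60)
late = sum(limiter.allow(now=59.0) for _ in range(150))   # end of window 0
early = sum(limiter.allow(now=61.0) for _ in range(150))  # start of window 1
print(late + early)  # 200
```

Each burst is capped at 100 on its own, but because the counter resets at the boundary, 200 requests pass within two seconds.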
Sliding Window Log
Keep a log of timestamps for each request. When a new request arrives, remove timestamps older than the window (e.g., > 1 minute ago). If the remaining count is below the limit, allow the request.
- Perfectly accurate — no boundary exploitation possible
- Memory-heavy — you're storing individual timestamps, not just a counter
For most APIs, the memory cost makes this impractical at scale. You might store tens of millions of timestamps for a large API with many clients.
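A minimal sketch of the log-based approach for a single client, assuming timestamps in seconds are passed in as `now`; the name `SlidingWindowLog` is illustrative. The deque of timestamps is exactly the memory cost discussed above: one entry per allowed request in the window.

```python
from collections import deque


class SlidingWindowLog:
    """Sliding window log: keep one timestamp per allowed request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()  # timestamps of allowed requests, oldest first

    def allow(self, now):
        # Drop timestamps that have fallen out of the rolling window.
        cutoff = now - self.window_seconds
        while self.log and self.log[0] <= cutoff:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

With a limit of 100 per 60 seconds, 100 requests at second 59 followed by more at second 61 are all rejected on the second burst, because the first 100 timestamps are still inside the rolling window.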
Sliding Window Counter
This is the practical compromise between Fixed Window and Sliding Window Log. It uses two counters, the current window count and the previous window count, and estimates the rolling window count as:
estimated_count = prev_count × (1 - elapsed_fraction) + curr_count
where elapsed_fraction is the fraction of the current window that has already elapsed.
- Good accuracy: eliminates most of the boundary exploitation problem
- Low memory — just two counters per client
- Used by Cloudflare and by many Redis-backed rate limiters
This is generally the best practical choice for most rate limiting use cases.
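The estimate can be sketched as follows. This is an illustrative single-client version, not any vendor's implementation; the class name `SlidingWindowCounter` and the explicit `now` argument (seconds) are assumptions for the example.

```python
class SlidingWindowCounter:
    """Sliding window counter: weight the previous window by its remaining overlap."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.curr_window = 0  # index of the current fixed window
        self.curr_count = 0
        self.prev_count = 0

    def allow(self, now):
        window = int(now // self.window_seconds)
        if window != self.curr_window:
            # Roll over. If more than one window passed, the previous one was empty.
            adjacent = window == self.curr_window + 1
            self.prev_count = self.curr_count if adjacent else 0
            self.curr_count = 0
            self.curr_window = window
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        estimated = self.prev_count * (1 - elapsed_fraction) + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```

With a limit of 100 per 60 seconds and 100 requests sent at second 59, a follow-up burst at second 61 sees an estimate of roughly 100 × (59/60) ≈ 98.3 before it starts, so only 2 more requests get through instead of another 100.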
Comparison Table
| Algorithm | Handles Bursts | Memory | Accuracy | Complexity |
|---|---|---|---|---|
| Token Bucket | Yes | Low | Good | Low |
| Leaky Bucket | No | Medium | Perfect | Low |
| Fixed Window | Yes | Very Low | Poor (boundary issue) | Very Low |
| Sliding Window Log | Yes | High | Perfect | Medium |
| Sliding Window Counter | Yes | Low | Good | Medium |
Where to Implement Rate Limiting
Rate limiting can live at multiple layers of your stack, and often should.
Client-Side
Rate limit yourself. If you're calling a third-party API, implement backoff logic in your client so you don't accidentally burn through your quota. Exponential backoff with jitter is the standard pattern.
This doesn't protect you from external clients — it's purely defensive programming for when you're the client.
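One common form of this pattern is "full jitter": sleep for a random duration between zero and an exponentially growing ceiling. The sketch below assumes a hypothetical `RateLimitedError` that your HTTP wrapper raises on a 429; the function name, parameters, and default values are illustrative, not taken from any specific library.

```python
import random
import time


class RateLimitedError(Exception):
    """Hypothetical error raised by the caller's HTTP wrapper on a 429 response."""


def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call request_fn, retrying on RateLimitedError with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The ceiling doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so many clients don't retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

If the server sends a Retry-After header, honoring that value is generally preferable to a computed delay.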
API Gateway Level
The API gateway is the most common place to implement inbound rate limiting. It sits at the edge of your system and sees all requests before they reach any backend service. This is where you enforce per-API-key and per-IP limits.
Products like Kong, AWS API Gateway, and Nginx all have rate limiting built in. For simple cases, this is all you need.
Service Level
Microservices can also enforce their own rate limits, independent of the gateway. This is useful for:
- Protecting against internal service-to-service abuse (not just external clients)
- Enforcing limits specific to a single service that the gateway doesn't know about
- Defense in depth — even if the gateway misconfigures a limit, the service holds its own line
Distributed Rate Limiting
Here's the tricky part: if you have 10 instances of your API gateway running behind a load balancer, each instance has its own memory. A client could hit instance 1 with 100 requests and instance 2 with 100 requests, getting 200 through while each instance thinks they only served 100.
The solution is a centralized store — typically Redis. Instead of storing rate limit counters in local memory, every gateway instance reads and writes to Redis.
INCR ratelimit:user:123:window:1705315200
EXPIRE ratelimit:user:123:window:1705315200 60
Because INCR is atomic, concurrent increments from different gateway instances are never lost, which makes the shared counter race-condition-safe. This is how Stripe, GitHub, and most large-scale APIs implement distributed rate limiting.
The tradeoff: A Redis call adds latency to every single API request. This is typically acceptable (Redis round-trips are sub-millisecond in the same data center), but it's worth designing for. Use connection pooling, and make sure Redis is highly available — if the rate limiter goes down and you fail-open, you've lost your protection.
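Here is a sketch of the shared-counter check built on the INCR and EXPIRE commands. The function `allow_request` accepts any client object exposing `incr(key)` and `expire(key, seconds)`, which is the shape of the redis-py client's methods; `FakeRedis` is a hypothetical in-memory stand-in included only so the example runs without a server, and it does not simulate expiry.

```python
import time


class FakeRedis:
    """In-memory stand-in with incr/expire so the sketch runs without a server."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def expire(self, key, seconds):
        return True  # expiry is not simulated in this stand-in


def allow_request(client, user_id, limit, window_seconds=60, now=None):
    """Fixed-window check against a counter shared by all gateway instances."""
    now = time.time() if now is None else now
    window_start = int(now // window_seconds) * window_seconds
    key = f"ratelimit:user:{user_id}:window:{window_start}"
    count = client.incr(key)            # atomic on a real Redis server
    client.expire(key, window_seconds)  # let stale window keys clean themselves up
    return count <= limit


shared = FakeRedis()  # every "gateway instance" talks to the same store
results = [allow_request(shared, 123, limit=100, now=1705315200 + i % 2)
           for i in range(200)]
print(sum(results))  # 100
```

Because every caller increments the same key, the limit holds no matter which instance receives the request. On a real server, INCR and EXPIRE are two separate commands, so production code often wraps them in a transaction or Lua script to avoid leaving a key without a TTL.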
HTTP 429 and Client Communication
When you reject a request due to rate limiting, give the client enough information to recover gracefully:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1705315260
Retry-After: 47
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded the 1000 requests/hour limit.",
  "retry_after": 47
}
The X-RateLimit-* headers should be included on every response, not just 429s, so clients can proactively throttle themselves before hitting the limit.
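A framework-agnostic sketch of building these headers and the 429 body is shown below. The function names and the `(status, headers, body)` return shape are assumptions for illustration; adapt them to whatever response object your framework uses.

```python
import json
import math


def rate_limit_headers(limit, remaining, reset_epoch):
    """Headers to attach to every response, successful or not."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }


def too_many_requests(limit, reset_epoch, now, period="hour"):
    """Build the status, headers, and JSON body for a rejected request."""
    # Seconds until the window resets, rounded up so clients never retry early.
    retry_after = max(0, math.ceil(reset_epoch - now))
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(retry_after)
    headers["Content-Type"] = "application/json"
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": f"You have exceeded the {limit} requests/{period} limit.",
        "retry_after": retry_after,
    })
    return 429, headers, body
```

Calling `too_many_requests(1000, 1705315260, 1705315213)` reproduces the example response above, with a Retry-After of 47 seconds.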
Summary
Rate limiting is a fundamental reliability and security tool. It protects your system from both malicious and accidental abuse.
Token bucket is the most common algorithm in production — it handles bursts while enforcing an average rate over time. For most use cases, sliding window counter offers the best balance of accuracy and memory efficiency. Implement rate limiting at the API gateway for external traffic; add service-level limits where you need defense in depth. For distributed deployments, Redis gives you a shared, atomic counter across all gateway instances. Whenever you reject a request, return HTTP 429 with enough header information for clients to back off and retry intelligently.
The goal of rate limiting isn't to punish clients — it's to protect your system so it stays available for everyone.