Rate Limiting

Updated June 3, 2026
Magic Magnets Team
10 min read

In 2016, the GitHub API went down for several hours. The cause wasn't a hardware failure or a bad deploy — it was a single client that was hammering the API with requests far beyond what their system was designed for. Without sufficient rate limiting at the right layers, one misbehaving actor can take down the whole system.

Rate limiting is how you prevent that.

Why Rate Limiting Exists

Rate limiting controls how many requests a client can make in a given time window. The most direct motivation is infrastructure protection: a single misbehaving client shouldn't be able to exhaust server resources for everyone else. Related to this is fair usage — the "noisy neighbor" problem, where one tenant's traffic degrades service quality for all others.

Beyond protection, there are operational concerns. If your service calls expensive third-party APIs downstream, a buggy client loop can burn through your budget in minutes. Rate limiting keeps costs bounded. At the security layer, brute-force attacks against login endpoints depend on the ability to fire thousands of requests rapidly; a tight rate limit on authentication routes makes this class of attack impractical.

Finally, rate limiting is the mechanism behind API monetization. Enforcing usage tiers — free plans at 100 requests/day, paid plans at 10,000 — requires knowing and enforcing per-client limits.

When a client exceeds their limit, you return HTTP 429 Too Many Requests with a Retry-After header indicating when they can try again.

The Algorithms

There are five main algorithms, each with different tradeoffs.

Figure: Token bucket. 10 tokens refilled per second, each request consumes one, empty bucket = rejected

Token Bucket

This is the most widely used algorithm in production systems. Here's the mental model:

Imagine a bucket with a maximum capacity of 100 tokens. Tokens are added to the bucket at a fixed rate — say, 10 tokens per second. Each request consumes one token. If the bucket is empty when a request arrives, the request is rejected (or queued). If the bucket has tokens, the request goes through.

  • Handles bursts gracefully — a client that hasn't made requests in a while has a full bucket and can make many requests rapidly
  • Enforces an average rate over time (the refill rate)
  • Simple to implement and reason about

AWS API Gateway's throttling and many Redis-based rate limiters use a token bucket variant.

    Key: ratelimit:user:123
    Tokens remaining: 87
    Last refill: 2024-01-15T10:30:00Z
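
The token bucket model above can be sketched in a few lines. This is a minimal single-process illustration, not a production implementation: the class name `TokenBucket`, its parameters, and the idea of passing `now` explicitly (so the behavior is easy to test) are all assumptions made for this example. Tokens are refilled lazily on each call rather than by a background timer.

```python
import time
from typing import Optional


class TokenBucket:
    """Hypothetical token bucket: up to `capacity` tokens, refilled at `refill_rate` per second."""

    def __init__(self, capacity: float, refill_rate: float, now: Optional[float] = None):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # start full, so an idle client can burst
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None, cost: float = 1.0) -> bool:
        """Spend `cost` tokens and return True if enough are available, else return False."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last_refill)
        # Lazy refill: add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=100, refill_rate=10, now=0.0)
    burst = sum(bucket.allow(now=0.0) for _ in range(150))
    print("allowed in initial burst:", burst)
    print("allowed half a second later:", sum(bucket.allow(now=0.5) for _ in range(150)))
```

The per-client state is just the two fields shown in the record above (tokens remaining and last refill time), which is why this algorithm maps so easily onto a key-value store.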

Leaky Bucket

Think of a bucket with a small hole at the bottom. Requests enter the bucket from the top (queued). They "leak" out the bottom at a fixed rate and get processed.

If requests arrive faster than the leak rate, the bucket fills up. Once full, new requests overflow — they're dropped.

  • Enforces a perfectly smooth output rate — no bursts, ever
  • Good for protecting downstream systems that can't handle spikes
  • Queue depth provides natural backpressure

The downside: if there's a sudden burst of legitimate traffic (e.g., 10,000 users all opening the app at 9am), leaky bucket queues the requests and processes them at its fixed rate, adding latency to all of them, and drops whatever doesn't fit in the queue. Token bucket would admit the burst immediately, up to the bucket's capacity.
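
Here is a rough sketch of the queue-based leaky bucket described above. The names `LeakyBucket`, `offer`, and `leak` are hypothetical, and the sketch assumes the caller invokes `leak(now)` periodically (for example from a worker loop) to pull out the requests that are due for processing.

```python
from collections import deque


class LeakyBucket:
    """Hypothetical leaky bucket: a bounded queue drained at `leak_rate` requests per second."""

    def __init__(self, capacity: int, leak_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = deque()
        self.last_leak = now

    def offer(self, request, now: float) -> bool:
        """Queue a request; return False if the bucket is full (the request overflows)."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # The leak clock restarts when the bucket goes from empty to non-empty,
            # so idle time never turns into a burst of output later.
            self.last_leak = now
        self.queue.append(request)
        return True

    def leak(self, now: float) -> list:
        """Return the requests due for processing by `now`, oldest first."""
        due = int((now - self.last_leak) * self.leak_rate)
        count = min(due, len(self.queue))
        drained = [self.queue.popleft() for _ in range(count)]
        if due > 0:
            # Advance by whole leak intervals only, keeping the output rate steady.
            self.last_leak += due / self.leak_rate
        return drained


if __name__ == "__main__":
    bucket = LeakyBucket(capacity=3, leak_rate=2.0)
    print([bucket.offer(name, now=0.0) for name in "abcd"])  # the fourth overflows
    print(bucket.leak(now=0.5))
    print(bucket.leak(now=1.5))
```

Notice that nothing leaves the bucket faster than `leak_rate`, no matter how quickly requests arrive. That is the smooth output rate, and also the source of the added latency during bursts.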

Fixed Window Counter

Divide time into fixed windows (e.g., each 1-minute interval). Count requests in the current window. If the count exceeds the limit, reject the request.

    Window: 2024-01-15 10:30:00 → 10:31:00
    Count: 847 / 1000

  • Dead simple to implement
  • Very low memory — just a counter per client per window

The classic problem: Boundary exploitation. If the limit is 100 requests/minute, a client can make 100 requests at 10:59:59 and another 100 at 11:00:01. That is 200 requests in a 2-second window while technically staying within the "per-minute" limit. This is real and has been exploited.
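
The boundary problem is easy to reproduce. The sketch below is an in-memory fixed window counter (the class name `FixedWindowCounter` and its parameters are assumptions for illustration, and it never evicts old windows, which a real implementation would need to do). Timestamps are plain seconds, so 59 and 61 stand in for 10:59:59 and 11:00:01.

```python
from collections import defaultdict


class FixedWindowCounter:
    """Hypothetical fixed window limiter: one counter per (client, window)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (client_id, window_index) -> request count

    def allow(self, client_id: str, now: float) -> bool:
        # Every timestamp in the same window maps to the same integer index.
        window_index = int(now // self.window_seconds)
        key = (client_id, window_index)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


if __name__ == "__main__":
    limiter = FixedWindowCounter(limit=100, window_seconds=60)
    before = sum(limiter.allow("client-a", now=59.0) for _ in range(150))
    after = sum(limiter.allow("client-a", now=61.0) for _ in range(150))
    # Each window stays within its own limit, yet both batches land 2 seconds apart.
    print(before, after, before + after)
```

Each window on its own never exceeds 100, but the two batches together put 200 requests through within 2 seconds.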

Sliding Window Log

Keep a log of timestamps for each request. When a new request arrives, remove timestamps older than the window (e.g., > 1 minute ago). If the remaining count is below the limit, allow the request and record its timestamp.

  • Perfectly accurate — no boundary exploitation possible
  • Memory-heavy — you're storing individual timestamps, not just a counter

For most APIs, the memory cost makes this impractical at scale. You might store tens of millions of timestamps for a large API with many clients.
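
For comparison, a sliding window log for a single client might look like the sketch below. The class name `SlidingWindowLog` is hypothetical, and one instance per client is assumed. The deque holds one timestamp per accepted request, which is exactly where the memory cost comes from.

```python
from collections import deque


class SlidingWindowLog:
    """Hypothetical sliding window log for one client."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        cutoff = now - self.window_seconds
        # Drop timestamps that have fallen out of the rolling window.
        while self.log and self.log[0] <= cutoff:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False


if __name__ == "__main__":
    limiter = SlidingWindowLog(limit=100, window_seconds=60)
    print(sum(limiter.allow(now=59.0) for _ in range(150)))   # first batch
    print(sum(limiter.allow(now=61.0) for _ in range(150)))   # 2 seconds later
    print(sum(limiter.allow(now=119.5) for _ in range(150)))  # after the window has rolled past
```

Running the same boundary scenario that fools the fixed window counter, the second batch is rejected in full because the first 100 timestamps are still inside the rolling 60-second window.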

Sliding Window Counter

The practical compromise between Fixed Window and Sliding Window Log. Uses two counters: the current window count and the previous window count. Estimates the rolling window count as:

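    estimated_count = prev_count × (1 - elapsed_fraction) + curr_count

Here elapsed_fraction is the fraction of the current window that has already elapsed (0 at the start of the window, approaching 1 at the end).

  • Good accuracy — eliminates most of the boundary exploitation problem
  • Low memory — just two counters per client
  • Cloudflare has described using this approach for its rate limiting, and it is straightforward to build on plain Redis counters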

This is generally the best practical choice for most rate limiting use cases.
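
To make the estimate concrete, here is a small sketch for one client. The function and class names are assumptions for this example. The worked case uses a previous window count of 80, a current count of 30, and a window that is 25% elapsed: 80 × 0.75 + 30 = 90.

```python
def estimated_count(prev_count: int, curr_count: int, elapsed_fraction: float) -> float:
    """Weight the previous window by how much of it still overlaps the rolling window."""
    return prev_count * (1.0 - elapsed_fraction) + curr_count


class SlidingWindowCounter:
    """Hypothetical sliding window counter for one client: two counters, no timestamps."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_index = 0
        self.prev_count = 0
        self.curr_count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window_seconds)
        if index != self.window_index:
            # Roll forward. If more than one full window passed, the previous one was empty.
            self.prev_count = self.curr_count if index == self.window_index + 1 else 0
            self.curr_count = 0
            self.window_index = index
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        if estimated_count(self.prev_count, self.curr_count, elapsed_fraction) >= self.limit:
            return False
        self.curr_count += 1
        return True


if __name__ == "__main__":
    print(estimated_count(80, 30, 0.25))
    limiter = SlidingWindowCounter(limit=100, window_seconds=60)
    print(sum(limiter.allow(now=59.0) for _ in range(150)))
    print(sum(limiter.allow(now=61.0) for _ in range(150)))
```

In the boundary scenario, the previous window's 100 requests still carry a weight of roughly 98 one second into the new window, so only a couple of extra requests get through instead of another 100. The estimate assumes the previous window's requests were spread evenly, which is why the accuracy is "good" rather than perfect.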

Comparison Table

Algorithm               Handles Bursts  Memory    Accuracy               Complexity
Token Bucket            Yes             Low       Good                   Low
Leaky Bucket            No              Medium    Perfect                Low
Fixed Window            Yes             Very Low  Poor (boundary issue)  Very Low
Sliding Window Log      Yes             High      Perfect                Medium
Sliding Window Counter  Yes             Low       Good                   Medium

Where to Implement Rate Limiting

Rate limiting can live at multiple layers of your stack, and often should.

Client-Side

Rate limit yourself. If you're calling a third-party API, implement backoff logic in your client so you don't accidentally burn through your quota. Exponential backoff with jitter is the standard pattern.

This doesn't protect you from external clients — it's purely defensive programming for when you're the client.
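
A minimal sketch of exponential backoff with jitter follows. Everything here is illustrative: `RateLimited` is a hypothetical exception your own HTTP wrapper would raise on a 429, `request` is any zero-argument callable, and the "full jitter" variant (a random delay between zero and the exponential ceiling) is one common choice among several.

```python
import random
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical error raised by your HTTP wrapper when the server answers 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(request: Callable, max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep):
    """Call `request()`, retrying on RateLimited with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = backoff_delay(attempt)
            if exc.retry_after is not None:
                # Never retry sooner than the server asked us to.
                delay = max(delay, exc.retry_after)
            sleep(delay)
```

The jitter matters because many clients that were rejected at the same moment would otherwise all retry at the same moment too, recreating the spike.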

API Gateway Level

The most common place to implement inbound rate limiting. The gateway sits at the edge of your system and sees all requests before they reach any backend service. This is where you enforce per-API-key and per-IP limits.

Products like Kong, AWS API Gateway, and Nginx all have rate limiting built in. For simple cases, this is all you need.

Service Level

Microservices can also enforce their own rate limits, independent of the gateway. This is useful for:

  • Protecting against internal service-to-service abuse (not just external clients)
  • Enforcing limits specific to a single service that the gateway doesn't know about
  • Defense in depth — even if the gateway misconfigures a limit, the service holds its own line

Figure: Distributed rate limiting. Multiple gateway instances share a Redis counter with atomic INCR

Distributed Rate Limiting

Here's the tricky part: if you have 10 instances of your API gateway running behind a load balancer, each instance has its own memory. A client could hit instance 1 with 100 requests and instance 2 with 100 requests, getting 200 through while each instance thinks they only served 100.

The solution is a centralized store — typically Redis. Instead of storing rate limit counters in local memory, every gateway instance reads and writes to Redis.

    INCR ratelimit:user:123:window:1705315200
    EXPIRE ratelimit:user:123:window:1705315200 60

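Redis's atomic increment (INCR) keeps the count itself safe from races between gateway instances. INCR and EXPIRE are still two separate commands, so implementations commonly send them together in a MULTI/EXEC transaction or a Lua script so that a key is never left without an expiry. A shared Redis store is how Stripe, GitHub, and many other large-scale APIs implement distributed rate limiting.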

The tradeoff: A Redis call adds latency to every single API request. This is typically acceptable (Redis round-trips are sub-millisecond in the same data center), but it's worth designing for. Use connection pooling, and make sure Redis is highly available — if the rate limiter goes down and you fail-open, you've lost your protection.
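
The shared-counter idea can be sketched without a running Redis server. In the example below, `allow_request` only needs a store object with `incr(key)` and `expire(key, seconds)` methods. Those names were chosen to match the Redis commands shown above, and `InMemoryStore` is a hypothetical stand-in used purely so the example runs on its own. In a real deployment the store would be a Redis client, and the two calls would ideally be grouped atomically.

```python
class InMemoryStore:
    """Hypothetical stand-in for a shared Redis instance (single process, no real expiry)."""

    def __init__(self):
        self.counts = {}
        self.ttls = {}

    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key: str, seconds: int) -> bool:
        self.ttls[key] = seconds  # recorded only; a real store would delete the key later
        return True


def allow_request(store, user_id: str, limit: int, window_seconds: int, now: float) -> bool:
    """Fixed window check against a counter shared by every gateway instance."""
    window_start = int(now // window_seconds) * window_seconds
    key = f"ratelimit:user:{user_id}:window:{window_start}"
    count = store.incr(key)
    if count == 1:
        # First request in this window: make sure the key cleans itself up.
        store.expire(key, window_seconds)
    return count <= limit


if __name__ == "__main__":
    shared = InMemoryStore()
    now = 1705315213
    # Two "gateway instances" are just two callers that use the same store.
    instance_1 = sum(allow_request(shared, "123", 100, 60, now) for _ in range(100))
    instance_2 = sum(allow_request(shared, "123", 100, 60, now) for _ in range(100))
    print(instance_1, instance_2)
```

Because both callers increment the same key, the second batch of 100 is rejected, which is exactly what per-instance counters fail to do.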

HTTP 429 and Client Communication

When you reject a request due to rate limiting, give the client enough information to recover gracefully:

    HTTP/1.1 429 Too Many Requests
    X-RateLimit-Limit: 1000
    X-RateLimit-Remaining: 0
    X-RateLimit-Reset: 1705315260
    Retry-After: 47
    Content-Type: application/json

    {
      "error": "rate_limit_exceeded",
      "message": "You have exceeded the 1000 requests/hour limit.",
      "retry_after": 47
    }

The X-RateLimit-* headers should be included on every response, not just 429s — clients can proactively throttle themselves before hitting the limit.
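
As a rough illustration, the helper below assembles the status, headers, and JSON body for the response shown above. The function name `build_429_response` and its parameters are assumptions for this sketch, and it is framework-agnostic: it simply returns a tuple that you would hand to whatever web framework you use.

```python
import json
import math


def build_429_response(limit: int, reset_epoch: float, now: float, period: str = "hour"):
    """Return (status, headers, body) for a rate-limited request.

    `reset_epoch` is the Unix time at which the client's window resets.
    """
    # Round up so the client never retries a fraction of a second too early.
    retry_after = max(0, math.ceil(reset_epoch - now))
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_epoch)),
        "Retry-After": str(retry_after),
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": f"You have exceeded the {limit} requests/{period} limit.",
        "retry_after": retry_after,
    })
    return 429, headers, body


if __name__ == "__main__":
    status, headers, body = build_429_response(limit=1000, reset_epoch=1705315260, now=1705315213)
    print(status)
    for name, value in headers.items():
        print(f"{name}: {value}")
    print(body)
```

Deriving Retry-After from the reset time in one place keeps the two values consistent, so a client reading either header reaches the same conclusion about when to retry.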

Summary

Rate limiting is a fundamental reliability and security tool. It protects your system from both malicious and accidental abuse.

Token bucket is the most common algorithm in production — it handles bursts while enforcing an average rate over time. For most use cases, sliding window counter offers the best balance of accuracy and memory efficiency. Implement rate limiting at the API gateway for external traffic; add service-level limits where you need defense in depth. For distributed deployments, Redis gives you a shared, atomic counter across all gateway instances. Whenever you reject a request, return HTTP 429 with enough header information for clients to back off and retry intelligently.

The goal of rate limiting isn't to punish clients — it's to protect your system so it stays available for everyone.
