20.11 Building a Rate Limiter
Design a rate limiter that controls how often a client can perform an operation.
20.11 Building a Rate Limiter
Problem
Design a rate limiter that controls how often a client can perform an operation.
Rate limiters appear in APIs, login systems, payment services, message queues, search endpoints, web crawlers, background workers, and distributed systems. They protect services from overload, enforce fair usage, reduce abuse, and smooth traffic spikes.
A rate limiter answers one question:
Should this request be allowed right now?
Example policy:
Allow at most 100 requests per user per minute.
If user u1 sends 100 requests in the current minute, those requests are accepted. The 101st request is rejected or delayed.
The algorithmic problem is counting events over time. The engineering problem is choosing a time model that matches the desired behavior.
Solution
Start with a fixed-window counter. Store a counter for each client and reset it at fixed time boundaries.
import time
from dataclasses import dataclass
@dataclass
class Decision:
allowed: bool
remaining: int
retry_after: float
A fixed-window limiter:
class FixedWindowRateLimiter:
def __init__(self, limit, window_seconds):
self.limit = limit
self.window_seconds = window_seconds
self.counters = {}
def allow(self, key, now=None):
if now is None:
now = time.time()
window_id = int(now // self.window_seconds)
state_key = (key, window_id)
count = self.counters.get(state_key, 0)
if count >= self.limit:
window_end = (window_id + 1) * self.window_seconds
return Decision(
allowed=False,
remaining=0,
retry_after=window_end - now,
)
count += 1
self.counters[state_key] = count
return Decision(
allowed=True,
remaining=self.limit - count,
retry_after=0.0,
)
Use it:
limiter = FixedWindowRateLimiter(
limit=100,
window_seconds=60,
)
decision = limiter.allow("user:42")
if decision.allowed:
handle_request()
else:
reject_request(retry_after=decision.retry_after)
This implementation is simple and fast. It is also imperfect.
Fixed Window Boundary Problem
Fixed windows can allow bursts at boundaries.
Policy:
100 requests per minute
A user sends:
100 requests at 12:00:59
100 requests at 12:01:00
The limiter accepts 200 requests in roughly one second because the counter resets at the minute boundary.
This may be acceptable for administrative APIs or low-risk endpoints. It is usually too loose for login attempts, payment operations, and high-traffic public APIs.
Sliding Window Log
A sliding-window log stores timestamps of recent requests.
For each request:
- Remove timestamps older than the window.
- Count remaining timestamps.
- Accept if the count is below the limit.
- Add the current timestamp.
from collections import defaultdict, deque
class SlidingWindowLogRateLimiter:
def __init__(self, limit, window_seconds):
self.limit = limit
self.window_seconds = window_seconds
self.events = defaultdict(deque)
def allow(self, key, now=None):
if now is None:
now = time.time()
queue = self.events[key]
cutoff = now - self.window_seconds
while queue and queue[0] <= cutoff:
queue.popleft()
if len(queue) >= self.limit:
retry_after = queue[0] + self.window_seconds - now
return Decision(
allowed=False,
remaining=0,
retry_after=retry_after,
)
queue.append(now)
return Decision(
allowed=True,
remaining=self.limit - len(queue),
retry_after=0.0,
)
This enforces the policy precisely over any rolling interval.
Complexity:
| Operation | Complexity |
|---|---|
| Allow request | O(expired events removed) |
| Memory per key | O(limit) |
The memory cost is bounded by the number of accepted events in the window.
Sliding Window Counter
A sliding-window counter approximates a rolling window with two fixed buckets: the current window and the previous window.
Estimate current usage by weighting the previous bucket according to overlap.
Suppose the window is 60 seconds and the current window is 25 percent complete. Then 75 percent of the previous window still overlaps the last 60 seconds.
class SlidingWindowCounterRateLimiter:
def __init__(self, limit, window_seconds):
self.limit = limit
self.window_seconds = window_seconds
self.state = {}
def allow(self, key, now=None):
if now is None:
now = time.time()
window_id = int(now // self.window_seconds)
elapsed = now % self.window_seconds
previous_weight = (
self.window_seconds - elapsed
) / self.window_seconds
current_key = (key, window_id)
previous_key = (key, window_id - 1)
current_count = self.state.get(current_key, 0)
previous_count = self.state.get(previous_key, 0)
estimated = current_count + previous_count * previous_weight
if estimated >= self.limit:
retry_after = self.window_seconds - elapsed
return Decision(
allowed=False,
remaining=0,
retry_after=retry_after,
)
current_count += 1
self.state[current_key] = current_count
remaining = max(
0,
int(self.limit - estimated - 1),
)
return Decision(
allowed=True,
remaining=remaining,
retry_after=0.0,
)
This uses less memory than the sliding-window log, but it is approximate.
Token Bucket
The token bucket algorithm allows controlled bursts while enforcing an average rate.
The bucket contains tokens. Each request consumes one token. Tokens refill over time up to a maximum capacity.
Example policy:
Refill 10 tokens per second.
Bucket capacity 100.
Each request costs 1 token.
This allows a burst of 100 requests if the client has been idle, then limits sustained traffic to 10 requests per second.
class TokenBucketRateLimiter:
def __init__(self, capacity, refill_rate_per_second):
self.capacity = capacity
self.refill_rate = refill_rate_per_second
self.state = {}
def allow(self, key, cost=1.0, now=None):
if now is None:
now = time.time()
tokens, last_refill = self.state.get(
key,
(self.capacity, now),
)
elapsed = now - last_refill
tokens = min(
self.capacity,
tokens + elapsed * self.refill_rate,
)
if tokens < cost:
needed = cost - tokens
retry_after = needed / self.refill_rate
self.state[key] = (tokens, now)
return Decision(
allowed=False,
remaining=int(tokens),
retry_after=retry_after,
)
tokens -= cost
self.state[key] = (tokens, now)
return Decision(
allowed=True,
remaining=int(tokens),
retry_after=0.0,
)
Use it:
limiter = TokenBucketRateLimiter(
capacity=100,
refill_rate_per_second=10,
)
decision = limiter.allow("api_key:abc")
Token bucket is one of the most useful general-purpose rate-limiting algorithms.
Leaky Bucket
The leaky bucket algorithm smooths traffic into a steady output rate. Requests enter a queue. The queue drains at a fixed rate. If the queue is full, new requests are rejected.
Token bucket controls admission with burst capacity. Leaky bucket controls processing rate with queue capacity.
A simple admission model:
class LeakyBucketRateLimiter:
def __init__(self, capacity, leak_rate_per_second):
self.capacity = capacity
self.leak_rate = leak_rate_per_second
self.state = {}
def allow(self, key, now=None):
if now is None:
now = time.time()
level, last_check = self.state.get(key, (0.0, now))
elapsed = now - last_check
level = max(0.0, level - elapsed * self.leak_rate)
if level + 1 > self.capacity:
retry_after = (level + 1 - self.capacity) / self.leak_rate
self.state[key] = (level, now)
return Decision(
allowed=False,
remaining=0,
retry_after=retry_after,
)
level += 1
self.state[key] = (level, now)
return Decision(
allowed=True,
remaining=int(self.capacity - level),
retry_after=0.0,
)
Use leaky bucket when smoothing matters more than burst tolerance.
Choosing an Algorithm
Different rate limiters enforce different traffic shapes.
| Algorithm | Strength | Weakness |
|---|---|---|
| Fixed window | Simple and fast | Boundary bursts |
| Sliding log | Precise rolling window | More memory |
| Sliding counter | Low memory rolling approximation | Approximate |
| Token bucket | Allows bursts and controls average rate | More state |
| Leaky bucket | Smooths output rate | May add queueing semantics |
For most APIs, token bucket is a good default. For security-sensitive limits such as login attempts, use sliding-window log or a durable counter. For simple dashboards and internal tools, fixed window may be enough.
Per-Key Limits
A limiter usually applies to a key.
Common keys:
user_id
api_key
ip_address
organization_id
route
tenant
Composite keys are often necessary.
def rate_limit_key(request):
return (
f"tenant:{request.tenant_id}:"
f"user:{request.user_id}:"
f"route:{request.route_name}"
)
This lets you enforce policies such as:
100 requests per user per minute on /search
1000 requests per tenant per minute globally
10 login attempts per IP per hour
A real system often applies multiple limits to the same request.
Multiple Limits
Apply stricter and broader policies together.
def check_limits(request, limiters):
keys = [
("user", f"user:{request.user_id}"),
("tenant", f"tenant:{request.tenant_id}"),
("route", f"route:{request.route_name}:user:{request.user_id}"),
]
decisions = []
for name, key in keys:
decision = limiters[name].allow(key)
decisions.append(decision)
if not decision.allowed:
return decision
return min(
decisions,
key=lambda decision: decision.remaining,
)
Be careful with side effects. If the first limiter accepts and the second rejects, the first has already consumed capacity. For strict consistency, use a reservation or rollback pattern, or check all limits in a shared atomic store.
Distributed Rate Limiting
In a multi-server environment, local memory is not enough.
If each of 10 servers allows 100 requests per minute, the client may receive:
1000 requests per minute
instead of 100.
Use shared state:
- Redis
- Memcached
- Database counters
- Distributed key-value store
- API gateway
- Service mesh
Redis is a common choice because it supports atomic increments, expiration, and Lua scripts.
A fixed-window Redis pattern:
INCR key
EXPIRE key window_seconds
reject if count > limit
The increment and expiration should be atomic or carefully ordered. Lua scripts are commonly used to avoid race conditions.
Cleanup and Memory
In-memory limiters must expire old state.
Fixed-window counters can remove old windows.
def cleanup_fixed_window(counters, current_window_id):
old_keys = [
state_key
for state_key in counters
if state_key[1] < current_window_id - 1
]
for state_key in old_keys:
del counters[state_key]
Sliding-window logs naturally remove old timestamps when each key is checked. But inactive keys may remain forever. Periodic cleanup is still useful.
Token buckets should remove idle keys after a retention period. If a bucket is full and inactive, it carries no useful state.
Response Headers
HTTP APIs often expose rate-limit status.
Typical headers:
RateLimit-Limit: 100
RateLimit-Remaining: 42
RateLimit-Reset: 17
Retry-After: 17
Response behavior:
def rate_limited_response(decision):
return {
"status": 429,
"headers": {
"Retry-After": str(int(decision.retry_after)),
},
"body": {
"error": "rate_limited",
"retry_after": decision.retry_after,
},
}
Expose enough information for clients to back off correctly, but avoid leaking details that help attackers optimize abuse.
Testing
Rate limiters should use a fake clock. Real sleeping makes tests slow and flaky.
class FakeClock:
def __init__(self, now=0.0):
self.now = now
def time(self):
return self.now
def advance(self, seconds):
self.now += seconds
Test fixed window:
def test_fixed_window_rejects_after_limit():
limiter = FixedWindowRateLimiter(
limit=2,
window_seconds=60,
)
assert limiter.allow("u1", now=0).allowed is True
assert limiter.allow("u1", now=1).allowed is True
assert limiter.allow("u1", now=2).allowed is False
Test token refill:
def test_token_bucket_refills():
limiter = TokenBucketRateLimiter(
capacity=2,
refill_rate_per_second=1,
)
assert limiter.allow("u1", now=0).allowed is True
assert limiter.allow("u1", now=0).allowed is True
assert limiter.allow("u1", now=0).allowed is False
assert limiter.allow("u1", now=1).allowed is True
Test key isolation:
def test_keys_are_isolated():
limiter = FixedWindowRateLimiter(
limit=1,
window_seconds=60,
)
assert limiter.allow("u1", now=0).allowed is True
assert limiter.allow("u2", now=0).allowed is True
assert limiter.allow("u1", now=0).allowed is False
Common Bugs
The most common bug is using local process memory in a horizontally scaled service and assuming the limit is global.
Another common bug is fixed-window boundary bursts. The implementation is correct but the policy is weaker than expected.
Time bugs appear when wall-clock time moves backward. For internal refill calculations, monotonic clocks are safer.
Concurrency bugs appear when multiple requests update the same key at the same time. Use locks locally and atomic operations in shared stores.
Composite-key bugs apply a limit too broadly or too narrowly. A per-IP login limit and a per-user API limit should not share the same key.
Memory leaks appear when inactive keys are never removed.
Multi-limit bugs consume one limiter and then reject on a later limiter, reducing capacity for a request that never ran.
Recipe
Build the limiter in layers.
Start with a fixed-window limiter to define the interface. Add a sliding-window log when precise rolling limits matter. Use token bucket for general API traffic because it supports bursts and controls sustained rate. Define keys explicitly. Apply multiple limits when needed. Use shared atomic state for distributed services. Return clear retry information. Test with a fake clock. Add cleanup for inactive keys.
The main lesson is that rate limiting is not only counting. It is a policy encoded as a time-based data structure. The chosen algorithm determines whether traffic is bursty, smooth, precise, approximate, local, or globally enforced.