20.11 Building a Rate Limiter

Design a rate limiter that controls how often a client can perform an operation.

20.11 Building a Rate Limiter

Problem

Design a rate limiter that controls how often a client can perform an operation.

Rate limiters appear in APIs, login systems, payment services, message queues, search endpoints, web crawlers, background workers, and distributed systems. They protect services from overload, enforce fair usage, reduce abuse, and smooth traffic spikes.

A rate limiter answers one question:

Should this request be allowed right now?

Example policy:

Allow at most 100 requests per user per minute.

If user u1 sends 100 requests in the current minute, those requests are accepted. The 101st request is rejected or delayed.

The algorithmic problem is counting events over time. The engineering problem is choosing a time model that matches the desired behavior.

Solution

Start with a fixed-window counter. Store a counter for each client and reset it at fixed time boundaries.

import time
from dataclasses import dataclass

@dataclass
class Decision:
    allowed: bool
    remaining: int
    retry_after: float

A fixed-window limiter:

class FixedWindowRateLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = {}

    def allow(self, key, now=None):
        if now is None:
            now = time.time()

        window_id = int(now // self.window_seconds)
        state_key = (key, window_id)

        count = self.counters.get(state_key, 0)

        if count >= self.limit:
            window_end = (window_id + 1) * self.window_seconds

            return Decision(
                allowed=False,
                remaining=0,
                retry_after=window_end - now,
            )

        count += 1
        self.counters[state_key] = count

        return Decision(
            allowed=True,
            remaining=self.limit - count,
            retry_after=0.0,
        )

Use it:

limiter = FixedWindowRateLimiter(
    limit=100,
    window_seconds=60,
)

decision = limiter.allow("user:42")

if decision.allowed:
    handle_request()
else:
    reject_request(retry_after=decision.retry_after)

This implementation is simple and fast. It is also imperfect.

Fixed Window Boundary Problem

Fixed windows can allow bursts at boundaries.

Policy:

100 requests per minute

A user sends:

100 requests at 12:00:59
100 requests at 12:01:00

The limiter accepts 200 requests in roughly one second because the counter resets at the minute boundary.

This may be acceptable for administrative APIs or low-risk endpoints. It is usually too loose for login attempts, payment operations, and high-traffic public APIs.

Sliding Window Log

A sliding-window log stores timestamps of recent requests.

For each request:

  1. Remove timestamps older than the window.
  2. Count remaining timestamps.
  3. Accept if the count is below the limit.
  4. Add the current timestamp.
from collections import defaultdict, deque

class SlidingWindowLogRateLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)

    def allow(self, key, now=None):
        if now is None:
            now = time.time()

        queue = self.events[key]
        cutoff = now - self.window_seconds

        while queue and queue[0] <= cutoff:
            queue.popleft()

        if len(queue) >= self.limit:
            retry_after = queue[0] + self.window_seconds - now

            return Decision(
                allowed=False,
                remaining=0,
                retry_after=retry_after,
            )

        queue.append(now)

        return Decision(
            allowed=True,
            remaining=self.limit - len(queue),
            retry_after=0.0,
        )

This enforces the policy precisely over any rolling interval.

Complexity:

Operation Complexity
Allow request O(expired events removed)
Memory per key O(limit)

The memory cost is bounded by the number of accepted events in the window.

Sliding Window Counter

A sliding-window counter approximates a rolling window with two fixed buckets: the current window and the previous window.

Estimate current usage by weighting the previous bucket according to overlap.

Suppose the window is 60 seconds and the current window is 25 percent complete. Then 75 percent of the previous window still overlaps the last 60 seconds.

class SlidingWindowCounterRateLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.state = {}

    def allow(self, key, now=None):
        if now is None:
            now = time.time()

        window_id = int(now // self.window_seconds)
        elapsed = now % self.window_seconds
        previous_weight = (
            self.window_seconds - elapsed
        ) / self.window_seconds

        current_key = (key, window_id)
        previous_key = (key, window_id - 1)

        current_count = self.state.get(current_key, 0)
        previous_count = self.state.get(previous_key, 0)

        estimated = current_count + previous_count * previous_weight

        if estimated >= self.limit:
            retry_after = self.window_seconds - elapsed

            return Decision(
                allowed=False,
                remaining=0,
                retry_after=retry_after,
            )

        current_count += 1
        self.state[current_key] = current_count

        remaining = max(
            0,
            int(self.limit - estimated - 1),
        )

        return Decision(
            allowed=True,
            remaining=remaining,
            retry_after=0.0,
        )

This uses less memory than the sliding-window log, but it is approximate.

Token Bucket

The token bucket algorithm allows controlled bursts while enforcing an average rate.

The bucket contains tokens. Each request consumes one token. Tokens refill over time up to a maximum capacity.

Example policy:

Refill 10 tokens per second.
Bucket capacity 100.
Each request costs 1 token.

This allows a burst of 100 requests if the client has been idle, then limits sustained traffic to 10 requests per second.

class TokenBucketRateLimiter:
    def __init__(self, capacity, refill_rate_per_second):
        self.capacity = capacity
        self.refill_rate = refill_rate_per_second
        self.state = {}

    def allow(self, key, cost=1.0, now=None):
        if now is None:
            now = time.time()

        tokens, last_refill = self.state.get(
            key,
            (self.capacity, now),
        )

        elapsed = now - last_refill
        tokens = min(
            self.capacity,
            tokens + elapsed * self.refill_rate,
        )

        if tokens < cost:
            needed = cost - tokens
            retry_after = needed / self.refill_rate

            self.state[key] = (tokens, now)

            return Decision(
                allowed=False,
                remaining=int(tokens),
                retry_after=retry_after,
            )

        tokens -= cost
        self.state[key] = (tokens, now)

        return Decision(
            allowed=True,
            remaining=int(tokens),
            retry_after=0.0,
        )

Use it:

limiter = TokenBucketRateLimiter(
    capacity=100,
    refill_rate_per_second=10,
)

decision = limiter.allow("api_key:abc")

Token bucket is one of the most useful general-purpose rate-limiting algorithms.

Leaky Bucket

The leaky bucket algorithm smooths traffic into a steady output rate. Requests enter a queue. The queue drains at a fixed rate. If the queue is full, new requests are rejected.

Token bucket controls admission with burst capacity. Leaky bucket controls processing rate with queue capacity.

A simple admission model:

class LeakyBucketRateLimiter:
    def __init__(self, capacity, leak_rate_per_second):
        self.capacity = capacity
        self.leak_rate = leak_rate_per_second
        self.state = {}

    def allow(self, key, now=None):
        if now is None:
            now = time.time()

        level, last_check = self.state.get(key, (0.0, now))

        elapsed = now - last_check
        level = max(0.0, level - elapsed * self.leak_rate)

        if level + 1 > self.capacity:
            retry_after = (level + 1 - self.capacity) / self.leak_rate

            self.state[key] = (level, now)

            return Decision(
                allowed=False,
                remaining=0,
                retry_after=retry_after,
            )

        level += 1
        self.state[key] = (level, now)

        return Decision(
            allowed=True,
            remaining=int(self.capacity - level),
            retry_after=0.0,
        )

Use leaky bucket when smoothing matters more than burst tolerance.

Choosing an Algorithm

Different rate limiters enforce different traffic shapes.

Algorithm Strength Weakness
Fixed window Simple and fast Boundary bursts
Sliding log Precise rolling window More memory
Sliding counter Low memory rolling approximation Approximate
Token bucket Allows bursts and controls average rate More state
Leaky bucket Smooths output rate May add queueing semantics

For most APIs, token bucket is a good default. For security-sensitive limits such as login attempts, use sliding-window log or a durable counter. For simple dashboards and internal tools, fixed window may be enough.

Per-Key Limits

A limiter usually applies to a key.

Common keys:

user_id
api_key
ip_address
organization_id
route
tenant

Composite keys are often necessary.

def rate_limit_key(request):
    return (
        f"tenant:{request.tenant_id}:"
        f"user:{request.user_id}:"
        f"route:{request.route_name}"
    )

This lets you enforce policies such as:

100 requests per user per minute on /search
1000 requests per tenant per minute globally
10 login attempts per IP per hour

A real system often applies multiple limits to the same request.

Multiple Limits

Apply stricter and broader policies together.

def check_limits(request, limiters):
    keys = [
        ("user", f"user:{request.user_id}"),
        ("tenant", f"tenant:{request.tenant_id}"),
        ("route", f"route:{request.route_name}:user:{request.user_id}"),
    ]

    decisions = []

    for name, key in keys:
        decision = limiters[name].allow(key)
        decisions.append(decision)

        if not decision.allowed:
            return decision

    return min(
        decisions,
        key=lambda decision: decision.remaining,
    )

Be careful with side effects. If the first limiter accepts and the second rejects, the first has already consumed capacity. For strict consistency, use a reservation or rollback pattern, or check all limits in a shared atomic store.

Distributed Rate Limiting

In a multi-server environment, local memory is not enough.

If each of 10 servers allows 100 requests per minute, the client may receive:

1000 requests per minute

instead of 100.

Use shared state:

  • Redis
  • Memcached
  • Database counters
  • Distributed key-value store
  • API gateway
  • Service mesh

Redis is a common choice because it supports atomic increments, expiration, and Lua scripts.

A fixed-window Redis pattern:

INCR key
EXPIRE key window_seconds
reject if count > limit

The increment and expiration should be atomic or carefully ordered. Lua scripts are commonly used to avoid race conditions.

Cleanup and Memory

In-memory limiters must expire old state.

Fixed-window counters can remove old windows.

def cleanup_fixed_window(counters, current_window_id):
    old_keys = [
        state_key
        for state_key in counters
        if state_key[1] < current_window_id - 1
    ]

    for state_key in old_keys:
        del counters[state_key]

Sliding-window logs naturally remove old timestamps when each key is checked. But inactive keys may remain forever. Periodic cleanup is still useful.

Token buckets should remove idle keys after a retention period. If a bucket is full and inactive, it carries no useful state.

Response Headers

HTTP APIs often expose rate-limit status.

Typical headers:

RateLimit-Limit: 100
RateLimit-Remaining: 42
RateLimit-Reset: 17
Retry-After: 17

Response behavior:

def rate_limited_response(decision):
    return {
        "status": 429,
        "headers": {
            "Retry-After": str(int(decision.retry_after)),
        },
        "body": {
            "error": "rate_limited",
            "retry_after": decision.retry_after,
        },
    }

Expose enough information for clients to back off correctly, but avoid leaking details that help attackers optimize abuse.

Testing

Rate limiters should use a fake clock. Real sleeping makes tests slow and flaky.

class FakeClock:
    def __init__(self, now=0.0):
        self.now = now

    def time(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds

Test fixed window:

def test_fixed_window_rejects_after_limit():
    limiter = FixedWindowRateLimiter(
        limit=2,
        window_seconds=60,
    )

    assert limiter.allow("u1", now=0).allowed is True
    assert limiter.allow("u1", now=1).allowed is True
    assert limiter.allow("u1", now=2).allowed is False

Test token refill:

def test_token_bucket_refills():
    limiter = TokenBucketRateLimiter(
        capacity=2,
        refill_rate_per_second=1,
    )

    assert limiter.allow("u1", now=0).allowed is True
    assert limiter.allow("u1", now=0).allowed is True
    assert limiter.allow("u1", now=0).allowed is False

    assert limiter.allow("u1", now=1).allowed is True

Test key isolation:

def test_keys_are_isolated():
    limiter = FixedWindowRateLimiter(
        limit=1,
        window_seconds=60,
    )

    assert limiter.allow("u1", now=0).allowed is True
    assert limiter.allow("u2", now=0).allowed is True
    assert limiter.allow("u1", now=0).allowed is False

Common Bugs

The most common bug is using local process memory in a horizontally scaled service and assuming the limit is global.

Another common bug is fixed-window boundary bursts. The implementation is correct but the policy is weaker than expected.

Time bugs appear when wall-clock time moves backward. For internal refill calculations, monotonic clocks are safer.

Concurrency bugs appear when multiple requests update the same key at the same time. Use locks locally and atomic operations in shared stores.

Composite-key bugs apply a limit too broadly or too narrowly. A per-IP login limit and a per-user API limit should not share the same key.

Memory leaks appear when inactive keys are never removed.

Multi-limit bugs consume one limiter and then reject on a later limiter, reducing capacity for a request that never ran.

Recipe

Build the limiter in layers.

Start with a fixed-window limiter to define the interface. Add a sliding-window log when precise rolling limits matter. Use token bucket for general API traffic because it supports bursts and controls sustained rate. Define keys explicitly. Apply multiple limits when needed. Use shared atomic state for distributed services. Return clear retry information. Test with a fake clock. Add cleanup for inactive keys.

The main lesson is that rate limiting is not only counting. It is a policy encoded as a time-based data structure. The chosen algorithm determines whether traffic is bursty, smooth, precise, approximate, local, or globally enforced.