Rate Limiting & Throttling

Rate limiting controls how frequently clients can make requests to a service. It is a critical defense mechanism against abuse, DDoS attacks, and runaway clients, and an essential tool for maintaining fairness and stability in multi-tenant systems.

Why Rate Limit?

  • Prevent abuse: Stop malicious users or bots from overwhelming the system.
  • Protect resources: Ensure database connections, CPU, and memory are not exhausted by a single client.
  • Ensure fairness: In multi-tenant systems, prevent one tenant from consuming all capacity.
  • Control costs: Limit API calls to downstream paid services.
  • Contractual enforcement: Enforce API usage tiers (free: 100 req/min, pro: 10,000 req/min).

Rate Limiting Algorithms

Token Bucket

A bucket holds tokens. Tokens are added at a fixed rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected or queued. The bucket has a maximum capacity, allowing short bursts.

  • Rate: How many tokens are added per second (steady state throughput).
  • Capacity: Maximum tokens the bucket can hold (burst allowance).
  • Used by: AWS API Gateway, Stripe, many API platforms.
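The refill-on-demand logic above can be sketched in a few lines. This is a minimal single-process sketch, not a production implementation; the class name, the injectable `clock` parameter, and the choice to start with a full bucket are all illustrative assumptions.

```python
import time

class TokenBucket:
    """Token bucket: refill `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity        # start full: permits an initial burst
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Note that tokens are refilled lazily on each request, so no background timer is needed; this is the usual way token buckets are implemented in practice.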

Leaky Bucket

Requests enter a queue (the bucket) and are processed at a fixed rate, like water leaking from a bucket. If the bucket is full, new requests are rejected. The output rate is strictly constant, with no bursting.

  • Produces a perfectly smooth output rate.
  • Does not allow bursts, which can be a disadvantage for bursty workloads.
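A common way to implement this is the "leaky bucket as a meter" variant, which tracks the queue's depth as a number rather than holding the requests themselves. The sketch below follows that variant; the class name and fake-clock parameter are illustrative.

```python
import time

class LeakyBucket:
    """Leaky bucket modeled as a depth counter drained at a fixed rate."""

    def __init__(self, leak_rate, capacity, clock=time.monotonic):
        self.leak_rate = leak_rate    # requests drained per second
        self.capacity = capacity      # maximum queued requests
        self.clock = clock
        self.water = 0.0              # current queue depth
        self.last_leak = clock()

    def allow(self):
        now = self.clock()
        # Drain whatever has leaked out since the last check
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.water + 1 <= self.capacity:
            self.water += 1
            return True
        return False
```

Unlike the token bucket, a full `LeakyBucket` rejects everything until enough time has passed for requests to drain, so the admission rate can never exceed `leak_rate` for long.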

Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute windows). Count requests in each window. If the count exceeds the limit, reject additional requests until the next window starts.

  • Simple to implement.
  • Problem: Burst at window boundary. A client could send the maximum at the end of one window and the beginning of the next, effectively doubling the rate for a short period.
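The counter logic is a few lines. This sketch keeps counts in a plain dict keyed by window start time; a real implementation would expire old windows (e.g., via Redis EXPIRE) rather than let the dict grow.

```python
import time

class FixedWindowCounter:
    """Count requests per fixed time window; reject above `limit`."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counts = {}              # window start time -> request count

    def allow(self):
        # Align the current time down to the start of its window
        window_start = int(self.clock() // self.window) * self.window
        count = self.counts.get(window_start, 0)
        if count >= self.limit:
            return False
        self.counts[window_start] = count + 1
        return True
```

Because the counter resets abruptly at each window boundary, a client can spend its full limit at the end of one window and again at the start of the next, which is exactly the boundary-burst problem described above.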

Sliding Window Log

Store a timestamp for every request. When a new request arrives, remove timestamps older than the window size, then count remaining entries. If the count exceeds the limit, reject.

  • Precise: no boundary burst problem.
  • Memory-intensive: stores every request timestamp.
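A deque of timestamps makes the evict-then-count step straightforward. This is a minimal single-client sketch; a shared deployment would keep one log per client key.

```python
import collections
import time

class SlidingWindowLog:
    """Exact sliding-window limiter: one timestamp per accepted request."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.log = collections.deque()   # timestamps of accepted requests

    def allow(self):
        now = self.clock()
        # Evict timestamps that have slid out of the window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The memory cost is visible here: the deque holds up to `limit` timestamps per client at all times, which is why this algorithm is exact but expensive at scale.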

Sliding Window Counter

A hybrid approach. Combine the current fixed window count with a weighted portion of the previous window's count. This approximates a sliding window with the memory efficiency of fixed windows.
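The weighting works by assuming the previous window's requests were evenly spread, then counting the fraction of that window still covered by the sliding window. A sketch under that assumption (class name and bookkeeping details are illustrative):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: weight the previous window's count
    by how much of it still overlaps the sliding window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.current_start = 0.0
        self.current = 0
        self.previous = 0

    def allow(self):
        now = self.clock()
        window_start = (now // self.window) * self.window
        if window_start != self.current_start:
            # Roll windows; anything older than one full window counts as zero
            adjacent = (window_start - self.current_start == self.window)
            self.previous = self.current if adjacent else 0
            self.current = 0
            self.current_start = window_start
        # Fraction of the previous window still inside the sliding window
        prev_weight = 1.0 - (now - window_start) / self.window
        estimated = self.previous * prev_weight + self.current
        if estimated < self.limit:
            self.current += 1
            return True
        return False
```

Only two counters per client are stored, which is why this approach gets close to sliding-window precision at fixed-window memory cost.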

Algorithm              | Bursting                         | Memory   | Precision
Token Bucket           | Controlled bursts (configurable) | Low      | Good
Leaky Bucket           | No bursts                        | Low      | Good
Fixed Window           | Boundary burst problem           | Very low | Approximate
Sliding Window Log     | None                             | High     | Exact
Sliding Window Counter | Minimal                          | Low      | Approximate

Distributed Rate Limiting

When you have multiple application servers, each server needs to check the same counter. Options:

  • Centralized store (Redis): Use Redis with atomic INCR and EXPIRE commands. All servers check the same counters. Simple and effective but adds a network hop for every request.
  • Local rate limiting with sync: Each server maintains local counters and periodically syncs with a central store. Faster but allows temporary overages.
  • API Gateway: Rate limit at the gateway level (Kong, AWS API Gateway, NGINX) before requests reach application servers. Centralized enforcement without application-level logic.

Redis-Based Implementation

Pseudocode:

    # Token bucket with Redis. The read-modify-write below is shown
    # unrolled for clarity; production code must make it atomic,
    # e.g. by running it as a Lua script inside Redis.
    def allow_request(client_id, rate, capacity):
        key = f"rate:{client_id}"
        now = current_time()
        # A new (or expired) client starts with a full bucket
        tokens, last_refill = redis.get(key) or (capacity, now)
        # Refill tokens accrued since the last request
        elapsed = now - last_refill
        tokens = min(capacity, tokens + elapsed * rate)
        if tokens >= 1:
            tokens -= 1
            # Expire idle entries once the bucket would be full again
            redis.set(key, (tokens, now), expire=capacity / rate)
            return True    # Allow
        else:
            return False   # Reject (429 Too Many Requests)

Response Headers

Communicate rate limit status to clients through standard headers:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1612345678
Retry-After: 30

Rate Limiting by Dimension

  • Per user/API key: Most common. Each authenticated user has their own limits.
  • Per IP address: Useful for unauthenticated endpoints. Can be circumvented with proxies.
  • Per endpoint: Different limits for different endpoints (write endpoints may have lower limits).
  • Global: An overall system-wide limit to protect infrastructure.
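In practice these dimensions are layered: a request must pass every applicable limiter. A minimal sketch of that composition, using a deliberately trivial counter limiter for illustration (the key formats and names are assumptions, not a standard):

```python
class CountLimiter:
    """Trivial counter limiter, for illustration only (no time component)."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def allow(self):
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

def allow_request(limiters, keys):
    """Admit a request only if every configured dimension has capacity.
    Note: because `all` short-circuits, earlier limiters consume a slot
    even when a later one rejects; a real implementation should check
    all dimensions first or refund on failure."""
    return all(limiters[key].allow() for key in keys if key in limiters)
```

A request would then be checked against keys like `["user:alice", "ip:203.0.113.7", "endpoint:POST /orders", "global"]`, so the tightest applicable limit wins.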

Key Takeaways

  • Rate limiting protects services from abuse, ensures fairness, and enforces usage tiers.
  • Token bucket is the most popular algorithm: it handles bursts while enforcing a steady-state rate.
  • For distributed systems, use Redis-backed counters or rate limit at the API gateway layer.
  • Always return proper rate limit headers so clients can adapt their behavior.
  • Layer your rate limits: per-user, per-endpoint, and global limits working together.

Chapter Check-Up

Quick quiz to reinforce what you just learned.


Practice What You Learned

Configure rate limiting with token bucket algorithm in our API Gateway guided lab.
