Why Rate Limit?
- Prevent abuse: Stop malicious users or bots from overwhelming the system.
- Protect resources: Ensure database connections, CPU, and memory are not exhausted by a single client.
- Ensure fairness: In multi-tenant systems, prevent one tenant from consuming all capacity.
- Control costs: Limit API calls to downstream paid services.
- Contractual enforcement: Enforce API usage tiers (free: 100 req/min, pro: 10,000 req/min).
Rate Limiting Algorithms
Token Bucket
A bucket holds tokens. Tokens are added at a fixed rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected or queued. The bucket has a maximum capacity, allowing short bursts.
- Rate: How many tokens are added per second (steady state throughput).
- Capacity: Maximum tokens the bucket can hold (burst allowance).
- Used by: AWS API Gateway, Stripe, many API platforms.
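The refill-and-consume logic above can be sketched as a small in-memory class (the class name, parameters, and use of `time.monotonic()` are illustrative, not from any particular library):

```python
import time

class TokenBucket:
    """Minimal in-memory token bucket sketch."""
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second (steady-state throughput)
        self.capacity = capacity  # maximum tokens (burst allowance)
        self.tokens = capacity    # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill for the time elapsed since the last call, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # this request consumes one token
            return True
        return False
```

With `rate=1, capacity=2`, two back-to-back requests drain the burst allowance and a third is rejected until roughly a second has passed.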
Leaky Bucket
Requests enter a queue (the bucket). They are processed at a fixed rate, like water leaking from a bucket. If the bucket is full, new requests are rejected. The output rate is strictly constant, with no bursting.
- Produces a perfectly smooth output rate.
- Does not allow bursts, which can be a disadvantage for bursty workloads.
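A queue-based sketch of this idea (names are illustrative; a real implementation would drain the queue from a worker rather than lazily on each call):

```python
import time
from collections import deque

class LeakyBucket:
    """Minimal leaky bucket sketch: requests queue up and drain at a fixed rate."""
    def __init__(self, leak_rate, capacity):
        self.leak_rate = leak_rate  # requests processed per second
        self.capacity = capacity    # maximum queue length
        self.queue = deque()
        self.last_leak = time.monotonic()

    def allow(self, request):
        now = time.monotonic()
        # Drain however many requests would have been processed since the last check
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False
```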
Fixed Window Counter
Divide time into fixed windows (e.g., 1-minute windows). Count requests in each window. If the count exceeds the limit, reject additional requests until the next window starts.
- Simple to implement.
- Problem: Burst at window boundary. A client could send the maximum at the end of one window and the beginning of the next, effectively doubling the rate for a short period.
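A minimal single-process sketch (illustrative names; the boundary-burst problem described above is inherent to the algorithm, not this particular code):

```python
import time

class FixedWindowCounter:
    """Minimal fixed window counter sketch: one counter per discrete time window."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self):
        window = int(time.time() // self.window)
        if window != self.current_window:
            # A new window has started: reset the counter
            self.current_window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```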
Sliding Window Log
Store a timestamp for every request. When a new request arrives, remove timestamps older than the window size, then count remaining entries. If the count exceeds the limit, reject.
- Precise: no boundary burst problem.
- Memory-intensive: stores every request timestamp.
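The evict-then-count logic can be sketched with a deque of timestamps (names illustrative); the memory cost is visible here as one entry per allowed request:

```python
import time
from collections import deque

class SlidingWindowLog:
    """Minimal sliding window log sketch: one timestamp stored per request."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()

    def allow(self):
        now = time.monotonic()
        # Evict timestamps that have fallen outside the sliding window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```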
Sliding Window Counter
A hybrid approach. Combine the current fixed window count with a weighted portion of the previous window's count. This approximates a sliding window with the memory efficiency of fixed windows.
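The weighting described above can be sketched as follows (illustrative names; the estimate assumes requests were spread evenly across the previous window, which is why the result is approximate):

```python
import time

class SlidingWindowCounter:
    """Minimal sliding window counter sketch: current count plus a
    weighted share of the previous window's count."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = int(time.time() // window_seconds)
        self.current_count = 0
        self.previous_count = 0

    def allow(self):
        now = time.time()
        window = int(now // self.window)
        if window != self.current_window:
            # Roll over; if one or more whole windows were skipped, nothing carries over
            self.previous_count = self.current_count if window == self.current_window + 1 else 0
            self.current_count = 0
            self.current_window = window
        # Fraction of the previous window still covered by the sliding window
        overlap = 1.0 - (now % self.window) / self.window
        estimated = self.previous_count * overlap + self.current_count
        if estimated < self.limit:
            self.current_count += 1
            return True
        return False
```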
| Algorithm | Bursting | Memory | Precision |
|---|---|---|---|
| Token Bucket | Controlled bursts (configurable) | Low | Good |
| Leaky Bucket | No bursts | Low | Good |
| Fixed Window | Boundary burst problem | Very low | Approximate |
| Sliding Window Log | None | High | Exact |
| Sliding Window Counter | Minimal | Low | Approximate |
Distributed Rate Limiting
When you have multiple application servers, a client's requests may land on any of them, so all servers must enforce limits against shared state. Options:
- Centralized store (Redis): Use Redis with atomic INCR and EXPIRE commands. All servers check the same counters. Simple and effective but adds a network hop for every request.
- Local rate limiting with sync: Each server maintains local counters and periodically syncs with a central store. Faster but allows temporary overages.
- API Gateway: Rate limit at the gateway level (Kong, AWS API Gateway, NGINX) before requests reach application servers. Centralized enforcement without application-level logic.
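The centralized-store option maps naturally onto a fixed-window counter keyed per client per window. The sketch below is self-contained: `FakeRedis` is a minimal in-memory stand-in for just the two Redis commands used (in production you would use a real client, and pipeline or Lua-script the INCR/EXPIRE pair so they execute atomically):

```python
import time

class FakeRedis:
    """Tiny in-memory stand-in for the INCR and EXPIRE commands."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at or None)

    def incr(self, key):
        value, expires = self.store.get(key, (0, None))
        if expires is not None and time.monotonic() >= expires:
            value, expires = 0, None  # key expired: start over
        self.store[key] = (value + 1, expires)
        return value + 1

    def expire(self, key, seconds):
        if key in self.store:
            value, _ = self.store[key]
            self.store[key] = (value, time.monotonic() + seconds)

def allow_request(redis, client_id, limit, window_seconds):
    # One counter per client per fixed window
    window = int(time.time() // window_seconds)
    key = f"rate:{client_id}:{window}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, window_seconds)  # first hit in the window sets the TTL
    return count <= limit
```

Because every server calls INCR on the same key, the count is shared; the TTL lets stale windows clean themselves up.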
Redis-Based Implementation
# Token Bucket with Redis
import time

import redis  # assumes a running Redis server and the redis-py client

r = redis.Redis()

def allow_request(client_id, rate, capacity):
    # Note: in production, run this read-modify-write as a single Lua
    # script (EVAL) so it stays atomic across concurrent servers; it is
    # unrolled here for clarity.
    key = f"rate:{client_id}"
    now = time.time()
    # Read bucket state; a missing key means a full bucket
    state = r.hgetall(key)
    tokens = float(state.get(b"tokens", capacity))
    last_refill = float(state.get(b"last_refill", now))
    # Refill tokens for the time elapsed since the last refill
    tokens = min(capacity, tokens + (now - last_refill) * rate)
    if tokens >= 1:
        tokens -= 1
        allowed = True   # Allow
    else:
        allowed = False  # Reject (429 Too Many Requests)
    r.hset(key, mapping={"tokens": tokens, "last_refill": now})
    # Expire idle buckets once they would have fully refilled
    r.expire(key, int(capacity / rate) + 1)
    return allowed
Response Headers
Communicate rate limit status to clients through standard headers:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1612345678
Retry-After: 30
Rate Limiting by Dimension
- Per user/API key: Most common. Each authenticated user has their own limits.
- Per IP address: Useful for unauthenticated endpoints. Can be circumvented with proxies.
- Per endpoint: Different limits for different endpoints (write endpoints may have lower limits).
- Global: An overall system-wide limit to protect infrastructure.
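These dimensions are typically layered, with a request passing only if every applicable limit allows it. A small sketch (`SimpleLimiter` and `check_layers` are hypothetical names, and the plain counter stands in for any of the algorithms above):

```python
class SimpleLimiter:
    """Minimal stand-in limiter for illustration (no time window)."""
    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def allow(self):
        if self.count < self.limit:
            self.count += 1
            return True
        return False

def check_layers(user_id, endpoint, limiters):
    """A request passes only if every configured layer allows it."""
    keys = [f"user:{user_id}", f"endpoint:{endpoint}", "global"]
    applicable = [limiters[k] for k in keys if k in limiters]
    return all(lim.allow() for lim in applicable)
```

One subtlety: `all()` short-circuits, so earlier layers are consumed even when a later layer rejects; a production implementation would usually check every layer first and only then consume from each.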
Key Takeaways
- Rate limiting protects services from abuse, ensures fairness, and enforces usage tiers.
- Token bucket is the most popular algorithm: it handles bursts while enforcing a steady-state rate.
- For distributed systems, use Redis-backed counters or rate limit at the API gateway layer.
- Always return proper rate limit headers so clients can adapt their behavior.
- Layer your rate limits: per-user, per-endpoint, and global limits working together.