Rate Limiting & Throttling

Rate limiting controls how frequently clients can make requests to a service. It is a critical defense mechanism against abuse, DDoS attacks, and runaway clients, and an essential tool for maintaining fairness and stability in multi-tenant systems.

Why Rate Limit?

  • Prevent abuse: Stop malicious users or bots from overwhelming the system.
  • Protect resources: Ensure database connections, CPU, and memory are not exhausted by a single client.
  • Ensure fairness: In multi-tenant systems, prevent one tenant from consuming all capacity.
  • Control costs: Limit API calls to downstream paid services.
  • Contractual enforcement: Enforce API usage tiers (free: 100 req/min, pro: 10,000 req/min).

Rate Limiting Algorithms

Token Bucket

A bucket holds tokens. Tokens are added at a fixed rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected or queued. The bucket has a maximum capacity, allowing short bursts.

  • Rate: How many tokens are added per second (steady state throughput).
  • Capacity: Maximum tokens the bucket can hold (burst allowance).
  • Used by: AWS API Gateway, Stripe, many API platforms.
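The refill-on-demand logic above can be sketched in a few lines. This is a minimal single-process sketch, not a production implementation; the class name, the injectable `clock` parameter, and the choice to start with a full bucket are all illustrative assumptions.

```python
import time

class TokenBucket:
    """Token bucket: refill `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity        # start full: permits an initial burst
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Note that tokens are refilled lazily on each request, so no background timer is needed; this is the usual way token buckets are implemented in practice.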

Leaky Bucket

Requests enter a queue (the bucket) and are processed at a fixed rate, like water leaking from a bucket. If the bucket is full, new requests are rejected. The output rate is strictly constant, with no bursting.

  • Produces a perfectly smooth output rate.
  • Does not allow bursts, which can be a disadvantage for bursty workloads.
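A common way to implement this is the "leaky bucket as a meter" variant, which tracks the queue's depth as a number rather than holding the requests themselves. The sketch below follows that variant; the class name and fake-clock parameter are illustrative.

```python
import time

class LeakyBucket:
    """Leaky bucket modeled as a depth counter drained at a fixed rate."""

    def __init__(self, leak_rate, capacity, clock=time.monotonic):
        self.leak_rate = leak_rate    # requests drained per second
        self.capacity = capacity      # maximum queued requests
        self.clock = clock
        self.water = 0.0              # current queue depth
        self.last_leak = clock()

    def allow(self):
        now = self.clock()
        # Drain whatever has leaked out since the last check
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.water + 1 <= self.capacity:
            self.water += 1
            return True
        return False
```

Unlike the token bucket, a full `LeakyBucket` rejects everything until enough time has passed for requests to drain, so the admission rate can never exceed `leak_rate` for long.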

Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute windows). Count requests in each window. If the count exceeds the limit, reject additional requests until the next window starts.

  • Simple to implement.
  • Problem: Burst at window boundary. A client could send the maximum at the end of one window and the beginning of the next, effectively doubling the rate for a short period.
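The counter logic is a few lines. This sketch keeps counts in a plain dict keyed by window start time; a real implementation would expire old windows (e.g., via Redis EXPIRE) rather than let the dict grow.

```python
import time

class FixedWindowCounter:
    """Count requests per fixed time window; reject above `limit`."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counts = {}              # window start time -> request count

    def allow(self):
        # Align the current time down to the start of its window
        window_start = int(self.clock() // self.window) * self.window
        count = self.counts.get(window_start, 0)
        if count >= self.limit:
            return False
        self.counts[window_start] = count + 1
        return True
```

Because the counter resets abruptly at each window boundary, a client can spend its full limit at the end of one window and again at the start of the next, which is exactly the boundary-burst problem described above.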

Sliding Window Log

Store a timestamp for every request. When a new request arrives, remove timestamps older than the window size, then count remaining entries. If the count exceeds the limit, reject.

  • Precise: no boundary burst problem.
  • Memory-intensive: stores every request timestamp.
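A deque of timestamps makes the evict-then-count step straightforward. This is a minimal single-client sketch; a shared deployment would keep one log per client key.

```python
import collections
import time

class SlidingWindowLog:
    """Exact sliding-window limiter: one timestamp per accepted request."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.log = collections.deque()   # timestamps of accepted requests

    def allow(self):
        now = self.clock()
        # Evict timestamps that have slid out of the window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The memory cost is visible here: the deque holds up to `limit` timestamps per client at all times, which is why this algorithm is exact but expensive at scale.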

Sliding Window Counter

A hybrid approach. Combine the current fixed window count with a weighted portion of the previous window's count. This approximates a sliding window with the memory efficiency of fixed windows.
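The weighting works by assuming the previous window's requests were evenly spread, then counting the fraction of that window still covered by the sliding window. A sketch under that assumption (class name and bookkeeping details are illustrative):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: weight the previous window's count
    by how much of it still overlaps the sliding window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.current_start = 0.0
        self.current = 0
        self.previous = 0

    def allow(self):
        now = self.clock()
        window_start = (now // self.window) * self.window
        if window_start != self.current_start:
            # Roll windows; anything older than one full window counts as zero
            adjacent = (window_start - self.current_start == self.window)
            self.previous = self.current if adjacent else 0
            self.current = 0
            self.current_start = window_start
        # Fraction of the previous window still inside the sliding window
        prev_weight = 1.0 - (now - window_start) / self.window
        estimated = self.previous * prev_weight + self.current
        if estimated < self.limit:
            self.current += 1
            return True
        return False
```

Only two counters per client are stored, which is why this approach gets close to sliding-window precision at fixed-window memory cost.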

Algorithm              | Bursting                         | Memory   | Precision
Token Bucket           | Controlled bursts (configurable) | Low      | Good
Leaky Bucket           | No bursts                        | Low      | Good
Fixed Window           | Boundary burst problem           | Very low | Approximate
Sliding Window Log     | None                             | High     | Exact
Sliding Window Counter | Minimal                          | Low      | Approximate

Distributed Rate Limiting

When you have multiple application servers, each server needs to check the same counter. Options:

  • Centralized store (Redis): Use Redis with atomic INCR and EXPIRE commands. All servers check the same counters. Simple and effective but adds a network hop for every request.
  • Local rate limiting with sync: Each server maintains local counters and periodically syncs with a central store. Faster but allows temporary overages.
  • API Gateway: Rate limit at the gateway level (Kong, AWS API Gateway, NGINX) before requests reach application servers. Centralized enforcement without application-level logic.

Redis-Based Implementation

Pseudocode:

    # Token bucket with Redis. The read-modify-write below is shown
    # unrolled for clarity; production code must make it atomic,
    # e.g. by running it as a Lua script inside Redis.
    def allow_request(client_id, rate, capacity):
        key = f"rate:{client_id}"
        now = current_time()
        # A new (or expired) client starts with a full bucket
        tokens, last_refill = redis.get(key) or (capacity, now)
        # Refill tokens accrued since the last request
        elapsed = now - last_refill
        tokens = min(capacity, tokens + elapsed * rate)
        if tokens >= 1:
            tokens -= 1
            # Expire idle entries once the bucket would be full again
            redis.set(key, (tokens, now), expire=capacity / rate)
            return True    # Allow
        else:
            return False   # Reject (429 Too Many Requests)

Response Headers

Communicate rate limit status to clients through standard headers:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1612345678
Retry-After: 30

Rate Limiting by Dimension

  • Per user/API key: Most common. Each authenticated user has their own limits.
  • Per IP address: Useful for unauthenticated endpoints. Can be circumvented with proxies.
  • Per endpoint: Different limits for different endpoints (write endpoints may have lower limits).
  • Global: An overall system-wide limit to protect infrastructure.
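In practice these dimensions are layered: a request must pass every applicable limiter. A minimal sketch of that composition, using a deliberately trivial counter limiter for illustration (the key formats and names are assumptions, not a standard):

```python
class CountLimiter:
    """Trivial counter limiter, for illustration only (no time component)."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def allow(self):
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

def allow_request(limiters, keys):
    """Admit a request only if every configured dimension has capacity.
    Note: because `all` short-circuits, earlier limiters consume a slot
    even when a later one rejects; a real implementation should check
    all dimensions first or refund on failure."""
    return all(limiters[key].allow() for key in keys if key in limiters)
```

A request would then be checked against keys like `["user:alice", "ip:203.0.113.7", "endpoint:POST /orders", "global"]`, so the tightest applicable limit wins.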

Key Takeaways

  • Rate limiting protects services from abuse, ensures fairness, and enforces usage tiers.
  • Token bucket is the most popular algorithm: it handles bursts while enforcing a steady-state rate.
  • For distributed systems, use Redis-backed counters or rate limit at the API gateway layer.
  • Always return proper rate limit headers so clients can adapt their behavior.
  • Layer your rate limits: per-user, per-endpoint, and global limits working together.

Chapter Check-Up

Quick quiz to reinforce what you just learned.


Practice What You Learned

Configure rate limiting with token bucket algorithm in our API Gateway guided lab.
