Capacity Estimation & Back-of-Envelope Math

In system design interviews, you are frequently asked to estimate the scale of the system before diving into architecture. Back-of-the-envelope calculations help you determine how many servers, how much storage, and how much bandwidth your design needs. This chapter teaches you a systematic framework for producing reasonable estimates quickly and confidently.

Why Interviewers Ask for Estimates

Capacity estimation is not about getting the exact number. Interviewers want to see three things:

  • Structured thinking: Can you break a complex problem into smaller, manageable pieces?
  • Reasonable assumptions: Do you know the order of magnitude of common system metrics?
  • Design implications: Can you translate numbers into architectural decisions? For example, if your write QPS is 50,000, a single SQL database won't suffice, and you need sharding or a NoSQL store.

Estimates drive design choices. A system serving 100 users per day looks completely different from one serving 100 million. By doing the math early, you avoid over-engineering small systems and under-engineering large ones.

Numbers Every Engineer Should Know

These reference numbers, popularized by Jeff Dean, form the foundation of all back-of-the-envelope calculations. You do not need to memorize every number, but you should know their order of magnitude.

Operation                            Latency    Notes
L1 cache reference                   0.5 ns     Fastest memory access
L2 cache reference                   7 ns       ~14x L1
Main memory (RAM) reference          100 ns     ~200x L1
SSD random read                      150 μs     ~1,500x RAM
HDD random read (seek)               10 ms      ~100,000x RAM
Send 1 KB over 1 Gbps network        10 μs      Within same data center
Round trip within same data center   0.5 ms     Network + processing
Round trip CA to Netherlands         150 ms     Cross-continent
Read 1 MB sequentially from RAM      250 μs
Read 1 MB sequentially from SSD      1 ms       4x RAM
Read 1 MB sequentially from HDD      20 ms      80x RAM
Disk seek                            10 ms      Mechanical delay
Compress 1 KB with Snappy            3 μs       Fast compression
Mutex lock/unlock                    25 ns

Storage and Scale Reference

Unit                 Approximate Value             Useful For
1 KB                 1,000 bytes                   A small JSON payload, a tweet
1 MB                 1,000 KB                      A high-res photo, a short MP3
1 GB                 1,000 MB                      A movie, ~300 photos
1 TB                 1,000 GB                      A large database table
1 PB                 1,000 TB                      Entire company data warehouse
Seconds in a day     ~86,400 (~10^5)               QPS calculations
Seconds in a month   ~2.5 million (~2.5 × 10^6)    Monthly volume
Seconds in a year    ~31.5 million (~3 × 10^7)     Annual volume

The Estimation Framework

Every capacity estimation follows the same four-step workflow. Practice this order so it becomes second nature in interviews.

Step 1: Estimate Traffic (QPS)

Start with Daily Active Users (DAU). Multiply by actions per user per day. Divide by 86,400 seconds to get average queries per second (QPS). Multiply by a peak factor (typically 2x to 5x) for peak QPS.

Step 2: Estimate Storage

Calculate the size of a single record or object. Multiply by the number of new records per day. Multiply by the retention period (e.g., 5 years). Add overhead for indexes, replication, and backups.

Step 3: Estimate Bandwidth

Multiply average request/response size by QPS to get bytes per second. Calculate separately for incoming (write) and outgoing (read) traffic, as reads usually dominate.

Step 4: Estimate Memory (Cache)

Apply the 80/20 rule: 20% of data generates 80% of reads. Cache the top 20% of daily read requests. Memory needed = 20% × daily read volume × average object size.

Traffic Estimation: DAU to QPS

The fundamental formula is:

Average QPS = (DAU × actions_per_user_per_day) / seconds_in_a_day

Peak QPS   = Average QPS × peak_factor

Example:
  DAU = 100 million
  Actions per user per day = 10 (reads) + 2 (writes) = 12
  Average QPS = 100M × 12 / 86,400 ≈ 13,900 QPS
  Peak QPS (3x) ≈ 42,000 QPS
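The arithmetic above can be sketched in a few lines of Python, using the same example numbers:

```python
# Traffic estimate: 100M DAU, 12 actions/user/day, 3x peak factor.
DAU = 100_000_000
actions_per_user_per_day = 12      # 10 reads + 2 writes
SECONDS_PER_DAY = 86_400
peak_factor = 3

avg_qps = DAU * actions_per_user_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * peak_factor

print(f"Average QPS: ~{avg_qps:,.0f}")   # ~13,889
print(f"Peak QPS:    ~{peak_qps:,.0f}")  # ~41,667
```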

Always separate reads from writes, because they have different architectural implications. A system that is 90% reads and 10% writes benefits from read replicas and heavy caching. A system that is write-heavy needs sharding, write-ahead logs, and possibly eventual consistency.

Storage Estimation

Estimate per-record size, then scale up:

Record size = sum of all field sizes
  user_id:     8 bytes (bigint)
  content:   250 bytes (avg text)
  timestamp:   8 bytes
  metadata:   50 bytes
  Total:     ~316 bytes → round up to 500 bytes (with overhead)

Daily new records = DAU × writes_per_user = 100M × 2 = 200M
Daily storage = 200M × 500 bytes = 100 GB/day
Annual storage = 100 GB × 365 = 36.5 TB/year
5-year storage = 36.5 TB × 5 = 182.5 TB

With 3x replication: 182.5 TB × 3 ≈ 550 TB
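As a quick sanity check, the same storage math in Python:

```python
# Storage estimate: 200M new records/day at ~500 bytes, 5-year retention,
# 3x replication (numbers from the example above).
records_per_day = 200_000_000
record_bytes = 500
years = 5
replication = 3

daily_gb = records_per_day * record_bytes / 1e9          # 100 GB/day
five_year_tb = daily_gb * 365 * years / 1e3              # 182.5 TB
replicated_tb = five_year_tb * replication               # 547.5 TB -> ~550 TB
```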

Bandwidth Estimation

Incoming bandwidth (writes):
  Write QPS = 2,300
  Average write payload = 500 bytes
  Bandwidth = 2,300 × 500 = 1.15 MB/s

Outgoing bandwidth (reads):
  Read QPS = 11,600
  Average response size = 2 KB (record + metadata)
  Bandwidth = 11,600 × 2 KB = 23.2 MB/s
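The same calculation in Python, separating write and read paths:

```python
# Bandwidth estimate using the QPS figures derived earlier.
write_qps = 2_300
write_payload_bytes = 500
read_qps = 11_600
read_response_bytes = 2_000   # 2 KB

write_mb_per_s = write_qps * write_payload_bytes / 1e6   # 1.15 MB/s
read_mb_per_s = read_qps * read_response_bytes / 1e6     # 23.2 MB/s
```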

Memory Estimation (Caching)

The 80/20 rule (Pareto principle) states that roughly 80% of requests access 20% of the data. Caching this hot subset dramatically reduces database load.

Daily read requests = Read QPS × 86,400 = 11,600 × 86,400 ≈ 1 billion
Average response size = 2 KB
Total daily read data = 1B × 2 KB = 2 TB

Cache 20% of daily reads:
  Cache memory = 0.2 × 2 TB = 400 GB

This fits in a cluster of ~10 machines with 64 GB RAM each,
or a few Redis instances with high-memory configurations.
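The cache sizing above, written out in Python:

```python
# Cache estimate via the 80/20 rule: cache the hot 20% of daily read data.
read_qps = 11_600
SECONDS_PER_DAY = 86_400
response_bytes = 2_000    # 2 KB
hot_fraction = 0.2

daily_reads = read_qps * SECONDS_PER_DAY                  # ~1 billion
daily_read_tb = daily_reads * response_bytes / 1e12       # ~2 TB
cache_gb = hot_fraction * daily_read_tb * 1_000           # ~400 GB
```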

Worked Example 1: URL Shortener

Assumptions

  • 100 million new URLs shortened per month
  • Read-to-write ratio: 100:1 (URLs are created once, read many times)
  • Each shortened URL record: ~500 bytes (short code, original URL, creation timestamp, user ID, expiry)
  • Retention: 5 years

=== Traffic ===
Write QPS = 100M / (30 × 86,400) ≈ 100M / 2.6M ≈ 40 QPS
Read QPS  = 40 × 100 = 4,000 QPS
Peak read QPS (3x) = 12,000 QPS

=== Storage ===
Records over 5 years = 100M × 12 × 5 = 6 billion
Storage = 6B × 500 bytes = 3 TB
With replication (3x) = 9 TB

=== Bandwidth ===
Write: 40 × 500 bytes = 20 KB/s (negligible)
Read:  4,000 × 500 bytes = 2 MB/s

=== Cache ===
Daily reads = 4,000 × 86,400 ≈ 345 million
Cache 20%: 69M × 500 bytes ≈ 35 GB
Fits in a single Redis instance.
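The whole URL-shortener estimate can be packaged as one reusable function. This is a sketch: the function name and parameter defaults are illustrative, and the unrounded outputs differ slightly from the aggressively rounded figures above (e.g. ~39 write QPS rather than 40).

```python
def estimate_url_shortener(new_urls_per_month=100_000_000,
                           read_write_ratio=100,
                           record_bytes=500,
                           years=5,
                           replication=3,
                           peak_factor=3):
    """Reproduce the URL-shortener back-of-envelope estimate."""
    seconds_per_month = 30 * 86_400                      # ~2.6M
    write_qps = new_urls_per_month / seconds_per_month
    read_qps = write_qps * read_write_ratio
    total_records = new_urls_per_month * 12 * years      # months -> retention
    storage_tb = total_records * record_bytes / 1e12
    return {
        "write_qps": round(write_qps),                   # ~39
        "read_qps": round(read_qps),                     # ~3,858
        "peak_read_qps": round(read_qps * peak_factor),  # ~11,574
        "storage_tb": storage_tb,                        # 3.0
        "replicated_tb": storage_tb * replication,       # 9.0
    }
```

Parameterizing the estimate like this makes it easy to answer interview follow-ups such as "what if traffic grows 10x?" by changing one argument.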

Worked Example 2: Twitter-like Feed

Assumptions

  • 300 million Monthly Active Users (MAU), 150 million DAU
  • Each user posts 2 tweets/day on average
  • Each user reads their timeline 10 times/day, 20 tweets per load
  • Average tweet: 280 characters (~300 bytes) + metadata (~200 bytes) = 500 bytes
  • 10% of tweets include media (average image: 200 KB)

=== Traffic ===
Tweet writes = 150M × 2 / 86,400 ≈ 3,500 write QPS
Timeline reads = 150M × 10 / 86,400 ≈ 17,400 read QPS
Peak (3x): ~52,000 read QPS

=== Storage (text only, per year) ===
Daily tweets = 150M × 2 = 300M
Daily text storage = 300M × 500 bytes = 150 GB/day
Annual text = 150 GB × 365 ≈ 55 TB/year

=== Storage (media, per year) ===
Daily media tweets = 300M × 0.10 = 30M
Daily media storage = 30M × 200 KB = 6 TB/day
Annual media = 6 TB × 365 ≈ 2.2 PB/year
(Media goes to object storage like S3, not database)

=== Cache ===
Daily timeline reads = 17,400 QPS × 86,400 ≈ 1.5 billion
Each timeline fetch returns ~20 tweets × 500 bytes = 10 KB
Cache 20% of daily fetches: ~300M × 10 KB ≈ 3 TB
Distributed across ~50 cache machines.
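The media storage numbers, which dominate this example, check out in Python:

```python
# Media storage estimate: 10% of 300M daily tweets carry a ~200 KB image.
daily_tweets = 150_000_000 * 2
media_fraction = 0.10
media_bytes = 200_000

daily_media_tb = daily_tweets * media_fraction * media_bytes / 1e12  # 6 TB/day
annual_media_pb = daily_media_tb * 365 / 1e3                         # ~2.2 PB/yr
```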

Worked Example 3: YouTube-like Video Platform

Assumptions

  • 2 billion MAU, 800 million DAU
  • Each user watches 5 videos/day
  • 500,000 new videos uploaded per day
  • Average video: 300 MB (after transcoding to multiple resolutions)
  • Average video metadata: 5 KB

=== Traffic ===
Video views = 800M × 5 / 86,400 ≈ 46,000 read QPS
Video uploads = 500,000 / 86,400 ≈ 6 upload QPS

=== Storage (per day) ===
Video storage = 500,000 × 300 MB = 150 TB/day
Annual video storage = 150 TB × 365 ≈ 55 PB/year
(Stored in distributed object storage, CDN-served)

Metadata storage = 500,000 × 5 KB = 2.5 GB/day (trivial)

=== Bandwidth ===
Assume average video stream bitrate: 5 Mbps
Concurrent viewers (peak): ~10 million
Peak bandwidth = 10M × 5 Mbps = 50 Tbps
(This is why CDNs are essential for video platforms)
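The peak-bandwidth figure also illustrates the bits-versus-bytes distinction covered under Common Mistakes below; in Python:

```python
# Peak egress: 10M concurrent viewers at a 5 Mbps stream bitrate.
concurrent_viewers = 10_000_000
stream_mbps = 5                      # megabits per second, per viewer

peak_tbps = concurrent_viewers * stream_mbps / 1e6   # 50 Tbps (terabits/s)
peak_tb_per_s = peak_tbps / 8                        # 6.25 TB/s of actual data
```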

Common Mistakes

  • Confusing MB and Mb: 1 byte = 8 bits. A 100 Mbps network link carries 12.5 MB/s. Always clarify units.
  • Forgetting replication: Real systems replicate data 3x (or more) for durability. Your raw storage number should be multiplied accordingly.
  • Ignoring peak vs. average: Systems must handle peak load, not just average. Use a 2x-5x peak factor.
  • Over-precision: Saying "13,888.89 QPS" is false precision. Say "~14,000 QPS" or "~14K QPS." Round aggressively.
  • Forgetting indexes and overhead: Database indexes, filesystem metadata, and encoding overhead can add 20-50% to raw data size.
  • Not separating reads and writes: They have very different performance profiles and drive different architectural choices.
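The bit/byte conversion from the first bullet is worth internalizing; a tiny helper (the name is illustrative) makes the rule explicit:

```python
def mbps_to_megabytes_per_s(megabits_per_second):
    """Convert a link speed in megabits/s to a data rate in megabytes/s."""
    return megabits_per_second / 8   # 1 byte = 8 bits

mbps_to_megabytes_per_s(100)    # a 100 Mbps link carries 12.5 MB/s
mbps_to_megabytes_per_s(1000)   # a 1 Gbps link carries 125 MB/s
```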

Presentation Tips for Interviews

  • State your assumptions clearly before calculating. "Let's assume 100 million DAU" is better than jumping straight into math.
  • Round aggressively. Use powers of 10. 86,400 seconds/day becomes ~100,000 for quick math. 2.5 million seconds/month becomes ~2.5M.
  • Write on the whiteboard. Organize your numbers visually: Traffic | Storage | Bandwidth | Cache in four columns.
  • Sanity-check your results. If your URL shortener needs 50 PB of storage, something is wrong. Compare against known real-world systems.
  • Connect estimates to design decisions. "Since we need 12K read QPS and only 40 write QPS, the system is read-heavy. Let's add a cache layer and read replicas."

Quick Reference: Powers of Two

Power   Exact Value           Approximate       Name
2^10    1,024                 ~1 Thousand       1 KB
2^20    1,048,576             ~1 Million        1 MB
2^30    1,073,741,824         ~1 Billion        1 GB
2^40    1,099,511,627,776     ~1 Trillion       1 TB
2^50    ~1.13 × 10^15         ~1 Quadrillion    1 PB
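A quick check of how closely powers of two track powers of ten shows the approximation error growing from ~2% at the KB scale to ~13% at the PB scale, which is well within back-of-envelope tolerance:

```python
# Compare 2^(10k) against its power-of-ten approximation 10^(3k).
for power, name in [(10, "KB"), (20, "MB"), (30, "GB"), (40, "TB"), (50, "PB")]:
    exact = 2 ** power
    approx = 10 ** (3 * power // 10)
    error_pct = (exact - approx) / approx * 100
    print(f"2^{power} ({name}): off by {error_pct:.1f}%")
```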

Summary

Capacity estimation is a structured, repeatable skill. Follow the four-step framework (traffic, storage, bandwidth, cache), memorize the key reference numbers, round aggressively, and always connect your estimates back to design decisions. The goal is not perfection, it is demonstrating that you can reason about scale systematically.
