What Is System Design?
System design is the discipline of making high-level decisions about how a software system is structured. It answers questions like: Where does data live? How do components communicate? What happens when one piece fails? How does the system behave when traffic grows 100x?
Unlike algorithm problems, which ask you to optimize a single function, system design problems are open-ended. There is no single correct answer. Every design is a set of trade-offs. The goal is to pick trade-offs that best serve the system's requirements.
System design is relevant to every engineer, not just senior architects. Understanding how systems are put together helps you write better code, debug production incidents, and make informed decisions about technology choices in your day-to-day work.
Why System Design Matters
- Scalability: A well-designed system can handle growth in users, data, and traffic without being rewritten from scratch.
- Reliability: Good design anticipates failures and builds redundancy so the system continues operating even when components break.
- Performance: Architecture decisions determine latency and throughput. A bad choice at the design stage can rarely be fully compensated for by optimizing code later.
- Maintainability: Clean separation of concerns makes the system easier to understand, modify, and extend over time.
- Cost: Over-engineering wastes money. Under-engineering causes outages. System design is about finding the right balance.
Core Vocabulary
Before going further, internalize these terms. They recur in every chapter.
| Term | Definition |
|---|---|
| Scalability | The ability of a system to handle increased load by adding resources. |
| Latency | The time a single request takes to travel from client to server and back. Typically measured in milliseconds. |
| Throughput | The number of requests or operations a system can handle per unit of time (e.g., requests per second). |
| Availability | The percentage of time a system is operational. Often expressed as "nines" (99.9% = three nines). |
| Redundancy | Duplication of critical components so the system continues to function if one fails. |
| Partition | Splitting data or responsibility across multiple nodes. |
| Consistency | All nodes in the system see the same data at the same time. |
| Trade-off | Gaining one quality (e.g., consistency) at the expense of another (e.g., latency). |
| SLA | Service Level Agreement: a contract specifying expected uptime, latency, etc. |
| Bottleneck | The component that limits the overall performance of the system. |
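The "nines" in the Availability row translate directly into a downtime budget. A minimal sketch in Python that computes the allowed minutes of downtime per year for each level:

```python
# Downtime budget implied by each availability level ("nines").
# A quick sanity check for the Availability row in the table above.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of allowed downtime per year at a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):,.1f} min/year")
```

Three nines (99.9%) allows roughly 526 minutes, under nine hours, of downtime per year; each additional nine shrinks the budget tenfold.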
How to Approach a System Design Problem
Whether in an interview or a real architecture review, follow this structured approach:
1. Clarify Requirements
Ask questions. Define functional requirements (what the system should do) and non-functional requirements (performance, scale, availability, consistency). Establish scope boundaries.
2. Estimate Scale
How many users? How many requests per second? How much data per day? Back-of-the-envelope math prevents you from over- or under-designing.
3. Define the High-Level Design
Draw the major components: clients, servers, databases, caches, load balancers, message queues. Show how data flows between them.
4. Deep-Dive into Components
Pick the most critical or complex component and design it in detail. Discuss database schema, API contracts, caching strategy, or failure handling.
5. Identify Bottlenecks & Trade-offs
Where will the system break under extreme load? What happens during a network partition? Propose mitigations and acknowledge trade-offs.
Thinking in Trade-offs
Every architectural decision is a trade-off. There is no design that maximizes all desirable properties simultaneously. Here are common trade-off axes:
Consistency vs. Availability
- Strong consistency means every read returns the latest write.
- Favoring availability means the system always responds, even if data might be stale.
- The CAP theorem formalizes this choice: during a network partition, a system must sacrifice either consistency or availability (covered in Chapter 12).
Latency vs. Throughput
- Batching increases throughput but adds latency to individual requests.
- Processing immediately keeps latency low but may limit throughput.
- Choose based on user expectations and SLAs.
Simplicity vs. Flexibility
- A monolith is simpler to develop and deploy.
- Microservices offer flexibility but add operational complexity.
- Start simple; decompose later when the need is clear.
Cost vs. Performance
- More servers, more caching, more replicas all cost money.
- Premature optimization is wasted effort and budget.
- Design for current needs with a clear path to scale.
Back-of-the-Envelope Estimation
Before designing, estimate the numbers. This helps you pick the right tools and identify bottlenecks early.
- L1 cache reference: ~1 ns
- L2 cache reference: ~4 ns
- Main memory reference: ~100 ns
- SSD random read: ~150 us
- HDD seek: ~10 ms
- Round trip within same datacenter: ~0.5 ms
- Round trip CA to Netherlands: ~150 ms
- 1 MB over 1 Gbps network: ~10 ms
- Read 1 MB sequentially from SSD: ~1 ms
- Read 1 MB sequentially from HDD: ~20 ms
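These reference numbers become useful when you scale them up. A short sketch, using the sequential-read and round-trip figures from the list above:

```python
# Back-of-the-envelope arithmetic with the reference numbers above:
# sequential reads scale linearly with size, and chatty cross-region
# protocols pay the round-trip cost once per call.

SSD_MS_PER_MB = 1.0          # read 1 MB sequentially from SSD
HDD_MS_PER_MB = 20.0         # read 1 MB sequentially from HDD
CROSS_REGION_RTT_MS = 150.0  # round trip CA to Netherlands

def seq_read_seconds(size_mb: float, ms_per_mb: float) -> float:
    """Seconds to read size_mb megabytes sequentially."""
    return size_mb * ms_per_mb / 1000

# Reading 1 GB sequentially:
print(f"SSD: {seq_read_seconds(1024, SSD_MS_PER_MB):.1f} s")
print(f"HDD: {seq_read_seconds(1024, HDD_MS_PER_MB):.1f} s")

# 100 sequential calls between California and the Netherlands:
print(f"100 cross-region RTTs: {100 * CROSS_REGION_RTT_MS / 1000:.0f} s")
```

The last line is the important one: a protocol that makes 100 sequential cross-region calls spends 15 seconds on network round trips alone, regardless of how fast each server is.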
Example Estimation: A Social Media App
Suppose you are designing a social media app with 10 million daily active users (DAU).
- Each user makes ~20 requests/day on average.
- Total requests/day: 10M x 20 = 200M.
- Requests per second (average): 200M / 86,400 = ~2,300 RPS.
- Peak traffic is typically 3-5x average: ~7,000-12,000 RPS at peak.
- If each request transfers ~5 KB: 200M x 5 KB = 1 TB/day of bandwidth.
- If you store 1 post per user per day at ~2 KB: 10M x 2 KB = 20 GB/day of new data.
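The estimation above can be written out so each assumption is explicit and easy to change. All constants come from the bullets; nothing new is introduced:

```python
# Back-of-the-envelope estimation for the social media app, with every
# assumption named as a constant so it can be adjusted.

DAU = 10_000_000               # daily active users
REQUESTS_PER_USER_PER_DAY = 20
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = (3, 5)       # peak traffic is 3-5x average
KB_PER_REQUEST = 5
POSTS_PER_USER_PER_DAY = 1
KB_PER_POST = 2

requests_per_day = DAU * REQUESTS_PER_USER_PER_DAY           # 200M
avg_rps = requests_per_day / SECONDS_PER_DAY                 # ~2,300
peak_rps = tuple(avg_rps * m for m in PEAK_MULTIPLIER)
bandwidth_gb_per_day = requests_per_day * KB_PER_REQUEST / 1_000_000
new_data_gb_per_day = DAU * POSTS_PER_USER_PER_DAY * KB_PER_POST / 1_000_000

print(f"average RPS:   {avg_rps:,.0f}")
print(f"peak RPS:      {peak_rps[0]:,.0f}-{peak_rps[1]:,.0f}")
print(f"bandwidth/day: {bandwidth_gb_per_day:,.0f} GB (~1 TB)")
print(f"new data/day:  {new_data_gb_per_day:,.0f} GB")
```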
These numbers tell you: a single commodity server will not comfortably handle peak load. You need load balancing, caching, and likely database sharding.
The Building Blocks
Every system is composed of a small set of fundamental building blocks. The remainder of this course covers each in depth; the module overview below shows where each one appears.
What This Course Covers
This course is structured as five progressive modules:
- Foundations (Chapters 1-4): Client-server architecture, networking, databases.
- Performance & Scaling (Chapters 5-8): Caching, load balancing, database scaling, message queues.
- Architecture Patterns (Chapters 9-12): Microservices, API design, consistent hashing, CAP theorem.
- Production Systems (Chapters 13-15): Rate limiting, CDNs, monitoring.
- Case Studies (Chapters 16-18): Apply everything to real-world design problems.
After each module, a quiz tests your understanding. Read sequentially or jump to any chapter that interests you.
Chapter Check-Up
Quick quiz to reinforce what you just learned.