What Is System Design?
System design is the discipline of making high-level decisions about how a software system is structured. It answers questions like: Where does data live? How do components communicate? What happens when one piece fails? How does the system behave when traffic grows 100x?
Unlike algorithm problems, which ask you to optimize a single function, system design problems are open-ended. There is no single correct answer. Every design is a set of trade-offs. The goal is to pick trade-offs that best serve the system's requirements.
System design is relevant to every engineer, not just senior architects. Understanding how systems are put together helps you write better code, debug production incidents, and make informed decisions about technology choices in your day-to-day work.
Why System Design Matters
- Scalability: A well-designed system can handle growth in users, data, and traffic without being rewritten from scratch.
- Reliability: Good design anticipates failures and builds redundancy so the system continues operating even when components break.
- Performance: Architecture decisions determine latency and throughput. A bad choice at the design stage can rarely be fully compensated for by optimizing code later.
- Maintainability: Clean separation of concerns makes the system easier to understand, modify, and extend over time.
- Cost: Over-engineering wastes money. Under-engineering causes outages. System design is about finding the right balance.
Core Vocabulary
Before going further, internalize these terms. They recur in every chapter.
| Term | Definition |
|---|---|
| Scalability | The ability of a system to handle increased load by adding resources. |
| Latency | The time a single request takes to travel from client to server and back. Typically measured in milliseconds. |
| Throughput | The number of requests or operations a system can handle per unit of time (e.g., requests per second). |
| Availability | The percentage of time a system is operational. Often expressed as "nines" (99.9% = three nines). |
| Redundancy | Duplication of critical components so the system continues to function if one fails. |
| Partition | Splitting data or responsibility across multiple nodes. |
| Consistency | All nodes in the system see the same data at the same time. |
| Trade-off | Gaining one quality (e.g., consistency) at the expense of another (e.g., latency). |
| SLA | Service Level Agreement: a contract specifying expected uptime, latency, etc. |
| Bottleneck | The component that limits the overall performance of the system. |
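The "nines" in the Availability row translate directly into a downtime budget. A minimal sketch in Python that computes the allowed minutes of downtime per year for each level:

```python
# Downtime budget implied by each availability level ("nines").
# A quick sanity check for the Availability row in the table above.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of allowed downtime per year at a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):,.1f} min/year")
```

Three nines (99.9%) allows roughly 526 minutes, under nine hours, of downtime per year; each additional nine shrinks the budget tenfold.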
How to Approach a System Design Problem
Whether in an interview or a real architecture review, follow this structured approach:
1. Clarify Requirements
Ask questions. Define functional requirements (what the system should do) and non-functional requirements (performance, scale, availability, consistency). Establish scope boundaries.
2. Estimate Scale
How many users? How many requests per second? How much data per day? Back-of-the-envelope math prevents you from over- or under-designing.
3. Define the High-Level Design
Draw the major components: clients, servers, databases, caches, load balancers, message queues. Show how data flows between them.
4. Deep-Dive into Components
Pick the most critical or complex component and design it in detail. Discuss database schema, API contracts, caching strategy, or failure handling.
5. Identify Bottlenecks & Trade-offs
Where will the system break under extreme load? What happens during a network partition? Propose mitigations and acknowledge trade-offs.
Thinking in Trade-offs
Every architectural decision is a trade-off. There is no design that maximizes all desirable properties simultaneously. Here are common trade-off axes:
Consistency vs. Availability
- Strong consistency means every read returns the latest write.
- Favoring availability means the system always responds, even if data might be stale.
- The CAP theorem formalizes this choice: during a network partition, a system must sacrifice either consistency or availability (covered in Chapter 12).
Latency vs. Throughput
- Batching increases throughput but adds latency to individual requests.
- Processing immediately keeps latency low but may limit throughput.
- Choose based on user expectations and SLAs.
Simplicity vs. Flexibility
- A monolith is simpler to develop and deploy.
- Microservices offer flexibility but add operational complexity.
- Start simple; decompose later when the need is clear.
Cost vs. Performance
- More servers, more caching, more replicas all cost money.
- Premature optimization is wasted effort and budget.
- Design for current needs with a clear path to scale.
Back-of-the-Envelope Estimation
Before designing, estimate the numbers. This helps you pick the right tools and identify bottlenecks early.
- L1 cache reference: ~1 ns
- L2 cache reference: ~4 ns
- Main memory reference: ~100 ns
- SSD random read: ~150 us
- HDD seek: ~10 ms
- Round trip within same datacenter: ~0.5 ms
- Round trip CA to Netherlands: ~150 ms
- 1 MB over 1 Gbps network: ~10 ms
- Read 1 MB sequentially from SSD: ~1 ms
- Read 1 MB sequentially from HDD: ~20 ms
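These reference numbers become useful when you scale them up. A short sketch, using the sequential-read and round-trip figures from the list above:

```python
# Back-of-the-envelope arithmetic with the reference numbers above:
# sequential reads scale linearly with size, and chatty cross-region
# protocols pay the round-trip cost once per call.

SSD_MS_PER_MB = 1.0          # read 1 MB sequentially from SSD
HDD_MS_PER_MB = 20.0         # read 1 MB sequentially from HDD
CROSS_REGION_RTT_MS = 150.0  # round trip CA to Netherlands

def seq_read_seconds(size_mb: float, ms_per_mb: float) -> float:
    """Seconds to read size_mb megabytes sequentially."""
    return size_mb * ms_per_mb / 1000

# Reading 1 GB sequentially:
print(f"SSD: {seq_read_seconds(1024, SSD_MS_PER_MB):.1f} s")
print(f"HDD: {seq_read_seconds(1024, HDD_MS_PER_MB):.1f} s")

# 100 sequential calls between California and the Netherlands:
print(f"100 cross-region RTTs: {100 * CROSS_REGION_RTT_MS / 1000:.0f} s")
```

The last line is the important one: a protocol that makes 100 sequential cross-region calls spends 15 seconds on network round trips alone, regardless of how fast each server is.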
Example Estimation: A Social Media App
Suppose you are designing a social media app with 10 million daily active users (DAU).
- Each user makes ~20 requests/day on average.
- Total requests/day: 10M x 20 = 200M.
- Requests per second (average): 200M / 86,400 = ~2,300 RPS.
- Peak traffic is typically 3-5x average: ~7,000-12,000 RPS at peak.
- If each request transfers ~5 KB: 200M x 5 KB = 1 TB/day of bandwidth.
- If you store 1 post per user per day at ~2 KB: 10M x 2 KB = 20 GB/day of new data.
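The estimation above can be written out so each assumption is explicit and easy to change. All constants come from the bullets; nothing new is introduced:

```python
# Back-of-the-envelope estimation for the social media app, with every
# assumption named as a constant so it can be adjusted.

DAU = 10_000_000               # daily active users
REQUESTS_PER_USER_PER_DAY = 20
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = (3, 5)       # peak traffic is 3-5x average
KB_PER_REQUEST = 5
POSTS_PER_USER_PER_DAY = 1
KB_PER_POST = 2

requests_per_day = DAU * REQUESTS_PER_USER_PER_DAY           # 200M
avg_rps = requests_per_day / SECONDS_PER_DAY                 # ~2,300
peak_rps = tuple(avg_rps * m for m in PEAK_MULTIPLIER)
bandwidth_gb_per_day = requests_per_day * KB_PER_REQUEST / 1_000_000
new_data_gb_per_day = DAU * POSTS_PER_USER_PER_DAY * KB_PER_POST / 1_000_000

print(f"average RPS:   {avg_rps:,.0f}")
print(f"peak RPS:      {peak_rps[0]:,.0f}-{peak_rps[1]:,.0f}")
print(f"bandwidth/day: {bandwidth_gb_per_day:,.0f} GB (~1 TB)")
print(f"new data/day:  {new_data_gb_per_day:,.0f} GB")
```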
These numbers tell you: a single commodity server will not comfortably handle peak load. You need load balancing, caching, and likely database sharding.
The Building Blocks
Every system is composed of a small set of fundamental building blocks. The remainder of this course covers each in depth; the module overview below shows where each one appears.
What This Course Covers
This course is structured as five progressive modules:
- Foundations (Chapters 1-4): Client-server architecture, networking, databases.
- Performance & Scaling (Chapters 5-8): Caching, load balancing, database scaling, message queues.
- Architecture Patterns (Chapters 9-12): Microservices, API design, consistent hashing, CAP theorem.
- Production Systems (Chapters 13-15): Rate limiting, CDNs, monitoring.
- Case Studies (Chapters 16-18): Apply everything to real-world design problems.
After each module, a quiz tests your understanding. Read sequentially or jump to any chapter that interests you.
Chapter Check-Up
Quick quiz to reinforce what you just learned.