Public Solution

Design WhatsApp

This Design WhatsApp solution gives a production-minded baseline for the prompt. It includes a concise requirements recap, a component-by-component architecture breakdown, explicit tradeoffs across latency, availability, cost, and complexity, and failure mitigations with scoring rationale so you can benchmark your own design quickly.

Hard | WebSockets | Databases | Auth | Replication

Requirements Recap

Requirement	Target
Monthly active users	2,000,000,000
Messages per day	~100,000,000,000
Peak messages/second	~5,000,000
Max group size	1,024 members
Message delivery (online)	< 500 ms
Offline message retention	30 days
E2E encryption	Mandatory
Availability target	99.999%
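
The targets above can be sanity-checked with a little arithmetic. The sketch below derives the average message rate, the peak-to-average ratio, and the yearly downtime budget from the table values. The variable names are illustrative, and the 365-day year is an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Values copied from the requirements table.
messages_per_day = 100_000_000_000
peak_messages_per_second = 5_000_000
availability_target = 0.99999

# Average rate if traffic were spread evenly across the day.
average_messages_per_second = messages_per_day / SECONDS_PER_DAY

# How much headroom the peak target implies over the average.
peak_to_average_ratio = peak_messages_per_second / average_messages_per_second

# Downtime allowed per year at the availability target (assumes 365 days).
downtime_minutes_per_year = (1 - availability_target) * 365 * 24 * 60

print(f"average msgs/s: {average_messages_per_second:,.0f}")  # about 1.16 million
print(f"peak/average:   {peak_to_average_ratio:.2f}x")         # about 4.3x
print(f"downtime/year:  {downtime_minutes_per_year:.2f} min")  # about 5.3 minutes
```

The peak target is roughly four times the daily average, and 99.999% availability leaves a little over five minutes of downtime per year, so capacity should be planned against the peak, not the mean.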

Architecture Breakdown (Component-by-Component)

1. Mobile Clients
Represents mobile user traffic and request patterns.
Acts as an entry layer that routes traffic into the rest of the system.
2. DNS
Resolves domain names to reachable service endpoints.
Bridges 1 incoming flow to 1 downstream dependency.
3. Load Balancer
Distributes requests across healthy backend instances.
Bridges 1 incoming flow to 1 downstream dependency.
4. API Gateway
Serves as the single entry point for client API requests and routes them to backend services.
Bridges 1 incoming flow to 2 downstream dependencies.
5. Auth Service
Verifies identity, sessions, and authorization decisions.
Acts as a sink or system-of-record endpoint in the architecture flow.
6. Core Service
Implements the core application logic as a microservice.
Bridges 1 incoming flow to 2 downstream dependencies.
7. Realtime Bus
Provides publish/subscribe (pub/sub) messaging for real-time event distribution.
Acts as a sink or system-of-record endpoint in the architecture flow.
8. Primary NoSQL DB
Stores high-scale data with flexible schema and throughput.
Bridges 1 incoming flow to 1 downstream dependency.
9. Replica SQL DB
Persists relational data with transactional guarantees.
Acts as a sink or system-of-record endpoint in the architecture flow.
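
The breakdown above can be read as a directed graph. The sketch below encodes one plausible set of edges and checks them against the "1 incoming flow" and "sink" notes. The edge list is an assumption inferred from those notes, because the solution does not spell out the connections, and the helper names are hypothetical.

```python
# Assumed edges, inferred from the "Bridges N incoming flow(s) to M downstream"
# notes in the breakdown; the source does not list them explicitly.
TOPOLOGY = {
    "Mobile Clients": ["DNS"],
    "DNS": ["Load Balancer"],
    "Load Balancer": ["API Gateway"],
    "API Gateway": ["Auth Service", "Core Service"],
    "Auth Service": [],
    "Core Service": ["Realtime Bus", "Primary NoSQL DB"],
    "Realtime Bus": [],
    "Primary NoSQL DB": ["Replica SQL DB"],
    "Replica SQL DB": [],
}


def fan_in(topology):
    """Count incoming edges for every component."""
    counts = {name: 0 for name in topology}
    for targets in topology.values():
        for target in targets:
            counts[target] += 1
    return counts


def sinks(topology):
    """Return components with no downstream dependencies."""
    return [name for name, targets in topology.items() if not targets]


def reachable_from(topology, start):
    """Return every component downstream of `start` (depth-first walk)."""
    seen, stack = set(), list(topology[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(topology[node])
    return seen


print(sinks(TOPOLOGY))
print(sorted(reachable_from(TOPOLOGY, "Core Service")))
```

Under these assumed edges, every component on the request path has a single upstream, so a failure in DNS, the Load Balancer, or the API Gateway affects everything behind it. That is worth weighing against the 99.999% availability target.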

Tradeoffs (Latency / Availability / Cost / Complexity)

Decision	Latency	Availability	Cost	Complexity
Keep the request path focused on core business operations	Shorter synchronous path keeps average response time stable	Fewer inline dependencies reduce immediate failure blast radius	Avoids unnecessary infrastructure in the first rollout	Lower coordination overhead for small teams
Keep a clear system of record for transactional writes	Predictable read/write behavior with indexed access	Strong correctness with managed backup and recovery	Storage and IOPS spend grows with write volume	Schema evolution and query tuning required

Failure Modes and Mitigations

Failure mode: Primary datastore saturation increases latency and write timeouts
Mitigation: Tune indexes, add read offload where valid, and cap expensive query classes.
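
Two of these mitigations, read offload and capping expensive query classes, can be sketched in a few lines. The class below is a hypothetical in-process illustration, not a real database client: `QueryRouter`, `choose_target`, and `run_expensive` are made-up names, and the default cap of 2 is arbitrary.

```python
import threading


class QueryRouter:
    """Hypothetical router: offloads stale-tolerant reads, caps expensive queries."""

    def __init__(self, max_expensive_in_flight: int = 2):
        # Bounded pool of slots for the expensive query class.
        self._expensive_slots = threading.BoundedSemaphore(max_expensive_in_flight)

    def choose_target(self, is_write: bool, tolerates_stale_read: bool) -> str:
        # Writes and reads that need fresh data stay on the primary;
        # only reads that tolerate replication lag are offloaded.
        if is_write or not tolerates_stale_read:
            return "primary"
        return "replica"

    def run_expensive(self, query_fn):
        # Reject instead of queueing so a burst cannot pile up on the datastore.
        if not self._expensive_slots.acquire(blocking=False):
            raise RuntimeError("expensive query cap reached; retry later")
        try:
            return query_fn()
        finally:
            self._expensive_slots.release()


router = QueryRouter(max_expensive_in_flight=1)
print(router.choose_target(is_write=False, tolerates_stale_read=True))   # replica
print(router.choose_target(is_write=False, tolerates_stale_read=False))  # primary
```

Rejecting over-cap queries trades some failed requests for a bounded load on the primary. Whether a read counts as "valid" to offload depends on how much replication lag that read can tolerate.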

Why This Scores Well

Availability (35%): Redundant routing and data paths reduce single points of failure under burst traffic.
Latency (20%): Critical-path components are intentionally minimal to keep average latency stable.
Resilience (25%): Clear role separation and bounded dependencies reduce cascading-failure risk.
Cost Efficiency (10%) + Simplicity (10%): Higher complexity is scoped to requirements that actually demand scale or stronger fault tolerance.
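
To see how these weights combine, here is a minimal sketch of a weighted total. It assumes each metric is scored from 0 to 100 and that the total is a simple weighted sum. The solution states only the weights, so the formula and the sample scores are illustrative.

```python
import math

# Weights taken from the scoring rationale above.
WEIGHTS = {
    "availability": 0.35,
    "latency": 0.20,
    "resilience": 0.25,
    "cost_efficiency": 0.10,
    "simplicity": 0.10,
}
assert math.isclose(sum(WEIGHTS.values()), 1.0)


def weighted_score(metric_scores):
    """Combine per-metric scores (assumed 0-100) into one weighted total."""
    missing = set(WEIGHTS) - set(metric_scores)
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * metric_scores[name] for name in WEIGHTS)


# Illustrative scores, not results from the simulator.
sample = {
    "availability": 90,
    "latency": 80,
    "resilience": 70,
    "cost_efficiency": 60,
    "simplicity": 50,
}
print(f"{weighted_score(sample):.1f}")  # 76.0
```

Because availability and resilience together carry 60% of the weight, a change that improves either one moves the total more than an equal gain in cost efficiency or simplicity.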

Next Step

Validate this architecture by solving the prompt yourself, then practice the highest-leverage component in a guided lab and topic hub.

FAQ

What should I change first if traffic doubles?
Profile the bottleneck first, then scale the hot path component (usually compute, cache, or read path) before adding new system layers.
Why are WebSockets emphasized in this solution?
WebSockets are the highest-leverage topic for this challenge's constraints, and they directly improve score-impacting metrics such as latency, availability, and resilience.
How do I validate this architecture quickly?
Run the same challenge in the simulator, compare score breakdown metrics, and then test one tradeoff change at a time.
