Hard, Enterprise

Design WhatsApp

WebSockets, Databases, Auth, Replication, Geo Distribution, Consistency

Problem Statement

Design the architecture for WhatsApp - the world's most-used messaging app with 2 billion monthly active users sending 100 billion messages per day. Your design must cover:

- 1:1 messaging - real-time text messaging with end-to-end encryption (Signal Protocol). Messages are delivered in order, exactly once, with delivery receipts (sent ✓, delivered ✓✓, read 🔵).
- Group chats - groups of up to 1,024 members. Messages are fanned out to all members with E2EE (sender encrypts once per member using pairwise keys).
- Offline delivery - when a recipient is offline, messages are stored server-side (encrypted) and delivered when they reconnect. Messages are deleted from the server after delivery.
- Media sharing - photos, videos (up to 2 GB), documents, and voice notes. Media is encrypted, uploaded to a blob store, and a download link is sent in the message.
- Voice & video calls - peer-to-peer WebRTC calls with TURN server fallback for NAT traversal.
- Status / Stories - ephemeral 24-hour updates visible to contacts.
- Last seen & online status - real-time presence indicators.
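
The offline-delivery and exactly-once requirements above can be sketched as a per-recipient store that deduplicates by message ID and deletes only on acknowledgement. This is a minimal in-memory illustration, not WhatsApp's implementation: the `RecipientQueue` class, its method names, and the use of client-generated message IDs are all assumptions made for the example.

```python
from collections import OrderedDict


class RecipientQueue:
    """Per-recipient store of undelivered ciphertext, keyed by message ID (hypothetical)."""

    def __init__(self):
        # OrderedDict keeps enqueue order, which stands in for send order here.
        self._pending = OrderedDict()
        # IDs already confirmed; unbounded in this sketch, a real store would expire them.
        self._acked = set()

    def enqueue(self, message_id, ciphertext):
        """Store a message once; a retried send with the same ID is ignored."""
        if message_id in self._pending or message_id in self._acked:
            return False
        self._pending[message_id] = ciphertext
        return True

    def pending(self):
        """Messages to (re)send when the recipient reconnects, oldest first."""
        return list(self._pending.items())

    def ack(self, message_id):
        """Delete a message only after the recipient confirms delivery."""
        if message_id in self._pending:
            del self._pending[message_id]
            self._acked.add(message_id)
            return True
        return False


queue = RecipientQueue()
queue.enqueue("m1", b"ciphertext-1")
queue.enqueue("m2", b"ciphertext-2")
queue.enqueue("m1", b"ciphertext-1")  # sender retry, deduplicated
delivery_order = [mid for mid, _ in queue.pending()]
queue.ack("m1")
remaining = [mid for mid, _ in queue.pending()]
```

Delivery itself is at-least-once (an unacknowledged message is resent on reconnect); the exactly-once effect comes from deduplicating on the message ID at each hop.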

The key challenge is delivering 100 billion messages per day with E2E encryption, exactly-once semantics, and offline support - all while the server never sees plaintext.

What You'll Learn

Design WhatsApp's messaging platform - E2E encryption, group chats, media, and offline delivery for 2 billion users. Build this architecture under realistic production constraints, then validate trade-offs in the design lab simulation.

Constraints

  • Monthly active users: 2,000,000,000
  • Messages per day: ~100,000,000,000
  • Peak messages/second: ~5,000,000
  • Max group size: 1,024 members
  • Message delivery (online): < 500 ms
  • Offline message retention: 30 days
  • E2E encryption: Mandatory
  • Availability target: 99.999%
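
To make the availability target concrete, the short calculation below converts 99.999% into a yearly downtime budget. It is plain arithmetic on the constraint above; the only assumption is a 365-day year.

```python
availability = 0.99999                 # from the constraints
minutes_per_year = 365 * 24 * 60       # 525,600, ignoring leap years

# Fraction of time the service may be unavailable, expressed in minutes per year.
downtime_minutes_per_year = (1 - availability) * minutes_per_year
print(f"Allowed downtime: {downtime_minutes_per_year:.2f} minutes per year")
```

A budget of roughly five minutes per year leaves little room for manual intervention, so failover generally has to be automated.
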
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design WhatsApp's messaging platform - E2E encryption, group chats, media, and offline delivery for 2 billion users.
  • Design for a peak load target of around 5,000,000 messages/second, roughly 4x the ~1.16 million/second daily average implied by 100 billion messages per day.
  • Monthly active users: 2,000,000,000
  • Messages per day: ~100,000,000,000
  • Peak messages/second: ~5,000,000
  • Max group size: 1,024 members
  • Message delivery (online): < 500 ms

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
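
The method above can be worked through with the numbers from the constraints. In this sketch the traffic figures come from the problem statement, while the 2.5x margin, the per-gateway throughput, and the per-hop latency budget are hypothetical values chosen only to show the arithmetic.

```python
import math

SECONDS_PER_DAY = 86_400

messages_per_day = 100_000_000_000     # from the constraints
peak_messages_per_second = 5_000_000   # from the constraints
safety_margin = 2.5                    # assumption: midpoint of the 2-3x range

average_messages_per_second = messages_per_day / SECONDS_PER_DAY
peak_to_average = peak_messages_per_second / average_messages_per_second
provisioned_messages_per_second = peak_messages_per_second * safety_margin

# Hypothetical per-node figure, used only to turn a rate into a node count.
assumed_messages_per_gateway_per_second = 50_000
gateways_needed = math.ceil(
    provisioned_messages_per_second / assumed_messages_per_gateway_per_second
)

# Hypothetical per-hop latency budget (ms) that must fit inside the 500 ms target.
hop_budget_ms = {
    "sender uplink": 100,
    "sender gateway": 20,
    "routing lookup": 30,
    "durable enqueue": 50,
    "recipient gateway": 20,
    "recipient downlink": 100,
}
total_budget_ms = sum(hop_budget_ms.values())

print(f"average: {average_messages_per_second:,.0f} msg/s")
print(f"peak/average: {peak_to_average:.2f}x")
print(f"gateways at {safety_margin}x margin: {gateways_needed}")
print(f"latency budget used: {total_budget_ms} of 500 ms")
```

Swapping in measured per-node throughput and real hop latencies is what turns this from a sketch into a defensible capacity plan.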

3) Architecture Decisions

  • WebSockets: Use persistent connection gateways and decouple fanout via pub/sub or queues.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Auth: Centralize identity verification and keep authorization checks close to domain resources.
  • Replication: Separate primary write path from replicated read path and define lag tolerance per feature.
  • Geo Distribution: Route users to nearest region/edge while keeping write-consistency boundaries explicit.
  • Consistency: Classify operations by consistency requirement: strong for per-conversation message ordering and delivery acknowledgements, eventual for presence, last seen, and status views.
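
As one way to picture the gateway and fan-out decision above, the sketch below splits a group message's per-member ciphertexts by the gateway each member is connected to, and sets aside the members who are offline. The function name, the dictionary-based registry, and the gateway IDs are hypothetical; a production registry would be a distributed store rather than an in-process dict.

```python
from collections import defaultdict


def plan_group_fanout(envelopes, registry):
    """Split per-member ciphertext envelopes by where each member is connected.

    envelopes: dict of member ID -> ciphertext encrypted for that member.
    registry:  dict of user ID -> gateway ID for currently connected users.
    Returns (batches per gateway, envelopes for offline members).
    """
    by_gateway = defaultdict(list)
    offline = {}
    for member, ciphertext in envelopes.items():
        gateway = registry.get(member)
        if gateway is None:
            offline[member] = ciphertext  # would go to the offline store
        else:
            by_gateway[gateway].append((member, ciphertext))
    return dict(by_gateway), offline


registry = {"alice": "gw-1", "bob": "gw-2", "carol": "gw-1"}
envelopes = {
    "alice": b"ct-for-alice",
    "bob": b"ct-for-bob",
    "carol": b"ct-for-carol",
    "dave": b"ct-for-dave",  # not in the registry, so treated as offline
}
batches, offline = plan_group_fanout(envelopes, registry)
```

Batching by gateway means a 1,024-member group costs one publish per gateway rather than one per member, which is the main reason to decouple fan-out from the sender's connection.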

4) Reliability and Failure Strategy

  • Track connection churn, backpressure, and session resumption behavior.
  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Use short-lived tokens and secure key rotation workflows.
  • Monitor replication lag and have failover runbooks with recovery point objectives.
  • Design region failover and data residency controls as first-class requirements.
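
The short-lived token and key rotation point above can be illustrated with an HMAC-signed token that carries its own expiry and is verified against a list of accepted keys. This is a simplified sketch under stated assumptions: the token layout, the function names, and the "current plus previous key" rotation scheme are invented for the example and are not a description of WhatsApp's auth system.

```python
import base64
import hashlib
import hmac
import time


def issue_token(user_id, key, ttl_seconds, now=None):
    """Return 'payload.signature', where payload is 'user_id:expiry' (hypothetical format)."""
    now = time.time() if now is None else now
    payload = f"{user_id}:{int(now) + ttl_seconds}".encode()
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_token(token, accepted_keys, now=None):
    """Return the user ID if the token is unexpired and signed by an accepted key, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
        user_id, expires_at = payload.decode().rsplit(":", 1)
        expires_at = int(expires_at)
    except (ValueError, UnicodeDecodeError):
        return None  # malformed token
    if now >= expires_at:
        return None  # expired
    for key in accepted_keys:
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        if hmac.compare_digest(expected, signature):
            return user_id
    return None


old_key, new_key = b"previous-signing-key", b"current-signing-key"
token = issue_token("alice", old_key, ttl_seconds=300, now=1_000)
# During rotation the verifier accepts both keys, so tokens signed with the old one still work.
user = verify_token(token, [new_key, old_key], now=1_100)
```

Once every token signed with the previous key has expired, that key can be dropped from the accepted list, which bounds the rotation window by the token lifetime.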

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • WebSockets: WebSockets reduce interaction latency but complicate scaling and state management.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Auth: Central auth simplifies policy, but makes auth service availability/security critical.
  • Replication: Replication improves read scale and DR posture but complicates consistency semantics.
  • Geo Distribution: Global latency improves, but cross-region consistency and operations become harder.

Practical Notes

  • Each user maintains a persistent connection (WebSocket/MQTT) to a gateway server. A connection registry maps user → gateway server.
  • Messages are stored in an ordered queue per recipient - dequeued and deleted after delivery confirmation.
  • Signal Protocol: each chat has a ratcheting key. The server stores ciphertext only; key exchange happens client-side.
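
The "ratcheting key" idea in the last note can be illustrated with a symmetric key chain: each step derives a one-time message key and replaces the chain key, so old keys cannot be recomputed from newer state. This is a deliberately reduced sketch, assuming both clients already share an initial chain key. It omits the initial key agreement, the Diffie-Hellman ratchet, and the encryption itself, and should not be used as real cryptography; the byte constants and the name `ratchet_step` are illustrative choices.

```python
import hashlib
import hmac


def ratchet_step(chain_key):
    """Derive (message_key, next_chain_key) from the current chain key using HMAC-SHA256."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key


# Assumption: both clients obtained this value from an earlier key agreement.
shared_chain_key = hashlib.sha256(b"example shared secret").digest()

sender_chain = receiver_chain = shared_chain_key
sender_keys, receiver_keys = [], []
for _ in range(3):
    key, sender_chain = ratchet_step(sender_chain)
    sender_keys.append(key)
for _ in range(3):
    key, receiver_chain = ratchet_step(receiver_chain)
    receiver_keys.append(key)
```

Both sides derive the same sequence of message keys without the server taking part, which is why the server can store and forward ciphertext without ever holding key material.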

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Mobile Clients -> DNS -> Load Balancer -> API Gateway -> Core Service -> Auth Service -> Primary NoSQL DB -> Replica SQL DB

Design strengths

  • Security controls are enforced at ingress to protect downstream capacity.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.