Case Study: Design a Chat System

A real-time chat system like WhatsApp, Slack, or Discord requires persistent connections, message ordering, delivery guarantees, presence tracking, and efficient storage. This case study walks through designing a system that supports one-on-one messaging, group chats, and online presence at scale.

Step 1: Clarify Requirements

Functional Requirements

  • One-on-one messaging between two users.
  • Group chat with up to 500 members.
  • Online/offline presence indicators.
  • Message history and persistence: messages are stored permanently.
  • Multi-device support: a user can be logged in on phone and laptop simultaneously.
  • Read receipts and delivery status (sent, delivered, read).
  • Push notifications for offline users.

Non-Functional Requirements

  • Low latency: messages delivered in under 200ms between online users.
  • Message ordering must be preserved within a conversation.
  • At-least-once delivery: no messages are lost. Retries can deliver the same message more than once, so clients deduplicate by message ID.
  • High availability: the chat service should always be reachable.
  • Support 50 million daily active users (DAU).

Step 2: Back-of-Envelope Estimates

Estimation

  DAU: 50 million
  Average messages/user/day: 40
  Total messages/day: 50M * 40 = 2 billion
  Messages/second: 2B / 86,400 = ~23,000 msg/sec
  Message size (average): ~200 bytes (text + metadata)
  Daily storage: 2B * 200 B = 400 GB/day
  5-year storage: 400 GB * 365 * 5 = ~730 TB
  Concurrent connections (peak): ~10% of DAU online at peak = 5 million WebSocket connections
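
The arithmetic above can be reproduced with a short script. This is a minimal sketch; the variable names are illustrative, and the inputs are the assumptions stated in the estimate (decimal units, 365-day years).

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the estimate above.
dau = 50_000_000
messages_per_user_per_day = 40
avg_message_bytes = 200        # text plus metadata
peak_online_percent = 10       # share of DAU connected at peak
retention_years = 5

messages_per_day = dau * messages_per_user_per_day
messages_per_second = messages_per_day / SECONDS_PER_DAY
daily_storage_gb = messages_per_day * avg_message_bytes / 1e9
five_year_storage_tb = daily_storage_gb * 365 * retention_years / 1e3
peak_connections = dau * peak_online_percent // 100

print(f"messages/day:       {messages_per_day:,}")
print(f"messages/second:    {messages_per_second:,.0f}")
print(f"daily storage (GB): {daily_storage_gb:,.0f}")
print(f"5-year storage (TB): {five_year_storage_tb:,.0f}")
print(f"peak connections:   {peak_connections:,}")
```

Changing any single input (for example, the average message size once media metadata is included) shows how sensitive the storage figure is to that assumption.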

Step 3: High-Level Design

[Figure: high-level architecture. Components: Sender (WebSocket); Chat Service (stateful WebSocket servers); Presence Service (online/offline); Message Queue (Kafka); Message Store (Cassandra/HBase); Routing Service (finds the recipient's WebSocket server); Session Store (Redis: user -> server); Push Service (APNs/FCM), reached through the "offline?" branch; Receiver (WebSocket).]

Step 4: Deep Dive

Connection Protocol: WebSockets

HTTP is client-initiated: the server cannot push messages to the client without a request. For real-time chat, we need a persistent, bidirectional connection. WebSocket is the standard choice:

  • Full-duplex communication over a single TCP connection.
  • Client opens a WebSocket once and keeps it alive with heartbeats.
  • Server pushes messages to the client instantly without polling.
  • Each chat server maintains tens of thousands of concurrent WebSocket connections.
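
A WebSocket connection starts as an ordinary HTTP request that is upgraded in place. As a small illustration of that handshake, the sketch below computes the `Sec-WebSocket-Accept` value a server returns for a client's `Sec-WebSocket-Key`, as defined in RFC 6455. The function name is hypothetical; a production server would normally rely on a WebSocket library rather than hand-rolling this step.

```python
import base64
import hashlib

# Fixed GUID that RFC 6455 defines for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept_key(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept header value for a client key."""
    # The server appends the GUID to the client's key, hashes the result
    # with SHA-1, and base64-encodes the digest.
    digest = hashlib.sha1(
        (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    ).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key taken from the handshake example in RFC 6455.
print(websocket_accept_key("dGhlIHNhbXBsZSBub25jZQ=="))
```

Once the server replies with this header and a 101 status, both sides switch from HTTP to WebSocket frames on the same TCP connection.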

Message Flow: One-on-One

  1. Send: User A sends a message over their WebSocket connection to Chat Server 1.
  2. Persist: Chat Server 1 writes the message to the message store (Cassandra) and publishes it to the message queue (Kafka).
  3. Route: The routing service consumes from Kafka. It looks up which chat server User B is connected to (from the session store in Redis).
  4. Deliver: If User B is online, route the message to their chat server, which pushes it over User B's WebSocket. If offline, send a push notification via APNs/FCM.
  5. Acknowledge: User B's client sends a delivery acknowledgment. The chat server updates the message status to "delivered".
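
The five steps can be traced with an in-memory simulation. This is a sketch under loud assumptions: plain Python containers stand in for Cassandra, Kafka, and Redis, and the class and method names (`ChatSystem`, `route_pending`, and so on) are hypothetical, not part of any real client library.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    message_id: int
    conversation_id: str
    sender_id: str
    recipient_id: str
    content: str
    status: str = "sent"


@dataclass
class ChatSystem:
    store: dict = field(default_factory=dict)      # stand-in for the message store
    queue: deque = field(default_factory=deque)    # stand-in for the message queue
    sessions: dict = field(default_factory=dict)   # stand-in for Redis: user -> server
    delivered: list = field(default_factory=list)  # (server_id, message_id) pushed over WebSocket
    pushes: list = field(default_factory=list)     # message_ids sent as push notifications

    def send(self, msg: Message) -> None:
        # Steps 1-2: accept the message, persist it, then publish it.
        self.store[msg.message_id] = msg
        self.queue.append(msg.message_id)

    def route_pending(self) -> None:
        # Steps 3-4: consume from the queue and look up the recipient's server.
        while self.queue:
            msg = self.store[self.queue.popleft()]
            server_id = self.sessions.get(msg.recipient_id)
            if server_id is not None:
                self.delivered.append((server_id, msg.message_id))
            else:
                self.pushes.append(msg.message_id)  # recipient is offline

    def acknowledge(self, message_id: int) -> None:
        # Step 5: the recipient's client confirms receipt.
        self.store[message_id].status = "delivered"


chat = ChatSystem()
chat.sessions["user_b"] = "chat-server-2"   # user_b is online, user_c is not
chat.send(Message(1, "conv-ab", "user_a", "user_b", "hi"))
chat.send(Message(2, "conv-ac", "user_a", "user_c", "hello"))
chat.route_pending()
chat.acknowledge(1)
print(chat.delivered, chat.pushes, chat.store[1].status)
```

Because the message is persisted before it is routed, a message whose recipient is offline still exists in the store and can be fetched when that user reconnects.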

Message Flow: Group Chat

For a group with N members, the sender's message must be delivered to N-1 recipients:

  • Small groups (up to the 500-member limit): Publish the message to a per-group Kafka topic, or fan out to each member through the routing service.
  • Fan-out on write: When a message arrives, immediately create a copy in each recipient's inbox queue. Fast reads, expensive writes.
  • Fan-out on read: Store the message once in the group's message log. Each client reads messages it has not yet seen. Cheap writes, potentially slower reads.
  • Most systems use fan-out on write for small groups and fan-out on read for channels/large groups.
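
The trade-off between the two fan-out strategies can be seen in a toy model. The sketch below keeps everything in memory; the class names and the per-user read cursor are illustrative assumptions rather than a prescribed design.

```python
from collections import defaultdict


class FanOutOnWrite:
    """Copy each message into every recipient's inbox at send time."""

    def __init__(self, members):
        self.members = members              # group_id -> list of user ids
        self.inboxes = defaultdict(list)    # user_id -> list of messages

    def post(self, group_id, sender_id, text):
        # One write per recipient: cost grows with group size.
        for user_id in self.members[group_id]:
            if user_id != sender_id:
                self.inboxes[user_id].append((group_id, sender_id, text))

    def read(self, user_id):
        # Reads are a single inbox lookup.
        return list(self.inboxes[user_id])


class FanOutOnRead:
    """Store each message once; readers pull what they have not seen."""

    def __init__(self):
        self.logs = defaultdict(list)       # group_id -> append-only log
        self.cursors = defaultdict(int)     # (user_id, group_id) -> next index

    def post(self, group_id, sender_id, text):
        # One write regardless of group size.
        self.logs[group_id].append((sender_id, text))

    def read_new(self, user_id, group_id):
        # Each reader tracks its own position in the group's log.
        start = self.cursors[(user_id, group_id)]
        unseen = self.logs[group_id][start:]
        self.cursors[(user_id, group_id)] = len(self.logs[group_id])
        return unseen


on_write = FanOutOnWrite({"g1": ["a", "b", "c"]})
on_write.post("g1", "a", "hi")

on_read = FanOutOnRead()
on_read.post("g1", "a", "hi")
first_read = on_read.read_new("b", "g1")
second_read = on_read.read_new("b", "g1")
```

In the first class a post to a 500-member group performs 499 inbox writes; in the second it performs one write, and the cost moves to each reader's next fetch.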

Message Storage

Chat message storage has a write-heavy, append-only, time-ordered access pattern. This makes it ideal for wide-column stores:

Cassandra Schema

  CREATE TABLE messages (
    conversation_id UUID,
    message_id TIMEUUID,       -- time-ordered unique ID
    sender_id UUID,
    content TEXT,
    content_type TEXT,         -- 'text', 'image', 'file'
    created_at TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
  ) WITH CLUSTERING ORDER BY (message_id DESC);

  -- Partition key: conversation_id (all msgs in a conversation on one node)
  -- Clustering key: message_id (sorted by time within partition)

Why Not SQL?

Relational databases can work for smaller-scale chat, but at billions of messages per day, a wide-column store like Cassandra or HBase provides better write throughput, linear horizontal scaling, and efficient range queries by time within a conversation partition.
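
To make the partition-key and clustering-key roles concrete, here is a toy in-memory model of the schema's access pattern. It is only a sketch: integer message IDs stand in for TIMEUUIDs, and the `MessageTable` class and its methods are hypothetical names, not a Cassandra API.

```python
import bisect
from collections import defaultdict


class MessageTable:
    """Partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        # conversation_id -> list of (message_id, content), ascending by id
        self.partitions = defaultdict(list)

    def insert(self, conversation_id, message_id, content):
        # Appends are cheap when ids arrive in time order; insort also
        # handles the occasional out-of-order arrival.
        bisect.insort(self.partitions[conversation_id], (message_id, content))

    def latest(self, conversation_id, limit):
        # Newest first, mirroring CLUSTERING ORDER BY (message_id DESC).
        rows = self.partitions.get(conversation_id, [])
        return rows[::-1][:limit]

    def after(self, conversation_id, last_seen_message_id):
        # Range query within one partition: everything newer than a given id.
        rows = self.partitions.get(conversation_id, [])
        ids = [message_id for message_id, _ in rows]
        return rows[bisect.bisect_right(ids, last_seen_message_id):]


table = MessageTable()
table.insert("c1", 3, "third")
table.insert("c1", 1, "first")
table.insert("c1", 2, "second")
table.insert("c2", 1, "other conversation")
```

Both queries touch a single partition, which is why the conversation ID works well as the partition key for "load recent messages" and "sync since last seen" requests.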

Message ID and Ordering

Messages must be ordered within each conversation. Options for generating ordered IDs:

  • TIMEUUID / Snowflake ID: Encodes a timestamp in the ID, so IDs sort roughly by creation time. TIMEUUID is Cassandra's time-based UUID type; Snowflake-style IDs are used by Twitter and Discord.
  • Lamport timestamps: Logical clocks that ensure causal ordering even across servers.
  • Per-conversation sequence number: Incrementing counter per conversation. Simple, but requires coordination.
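
As an illustration of the first option, the sketch below generates Snowflake-style 64-bit IDs. The bit layout (41 bits of milliseconds, 10 bits of worker ID, 12 bits of sequence) follows the commonly described Snowflake layout, but the epoch value, the class name, and the handling of clock regressions are assumptions made for this example.

```python
import threading
import time


class SnowflakeLikeGenerator:
    """IDs laid out as: milliseconds since epoch | worker id | sequence."""

    def __init__(self, worker_id, epoch_ms=1_600_000_000_000, clock=None):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms            # arbitrary custom epoch (assumption)
        self.clock = clock or (lambda: int(time.time() * 1000))
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                # Clock went backwards: reuse the last timestamp so IDs
                # from this worker never decrease.
                now = self.last_ms
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # 4096 IDs already issued for this millisecond:
                    # borrow the next millisecond instead of blocking.
                    now = self.last_ms + 1
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.epoch_ms) << 22) | (self.worker_id << 12) | self.sequence


generator = SnowflakeLikeGenerator(worker_id=7)
first, second = generator.next_id(), generator.next_id()
print(first < second)
```

IDs from a single worker are strictly increasing. Across workers they are only ordered to the precision of the machines' clocks, which is why a per-conversation sequence number is sometimes preferred when strict ordering matters.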

Presence Service

Tracking who is online:

  • When a user connects via WebSocket, mark them online in a Redis hash: presence:{user_id} -> {server_id, last_seen}.
  • Use heartbeats (every 30 seconds) to keep the entry alive. If heartbeats stop arriving, mark the user offline, for example by letting the Redis entry expire via a TTL.
  • For a user's contact list, batch-fetch presence status for all contacts from Redis.
  • Publish presence changes to a Pub/Sub channel so friends are notified.

Presence at Scale

For 50M DAU, presence fan-out is expensive. If a user has 500 friends, going online triggers 500 notifications. Optimization: only send presence updates for users with the chat window actively open, or batch presence updates.
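
The heartbeat-based tracking described above can be sketched without Redis. In this minimal example a dictionary stands in for the presence hash, the 90-second timeout (three missed 30-second heartbeats) is an assumed value, and the clock is injectable so the expiry logic can be exercised deterministically.

```python
import time


class PresenceTracker:
    def __init__(self, timeout_seconds=90, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds   # assumed: three missed heartbeats
        self.clock = clock
        self.entries = {}                        # user_id -> (server_id, last_seen)

    def heartbeat(self, user_id, server_id):
        # Called on connect and on every heartbeat from the client.
        self.entries[user_id] = (server_id, self.clock())

    def is_online(self, user_id):
        entry = self.entries.get(user_id)
        if entry is None:
            return False
        _, last_seen = entry
        return self.clock() - last_seen <= self.timeout_seconds

    def batch_status(self, user_ids):
        # One call for a whole contact list instead of one call per contact.
        return {user_id: self.is_online(user_id) for user_id in user_ids}


now = [0.0]                                      # fake clock for the example
tracker = PresenceTracker(timeout_seconds=90, clock=lambda: now[0])
tracker.heartbeat("alice", "ws-1")
now[0] = 60.0
tracker.heartbeat("bob", "ws-2")
now[0] = 100.0                                   # alice's last heartbeat is now 100s old
statuses = tracker.batch_status(["alice", "bob", "carol"])
print(statuses)
```

Computing status lazily at read time, as `is_online` does here, avoids a background sweep and fits the optimization of only resolving presence for contacts a user is actually viewing.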

Multi-Device Sync

  • Each device maintains its own WebSocket connection, potentially to different chat servers.
  • The session store maps user_id -> [server1:device_phone, server2:device_laptop].
  • Messages are routed to all active devices for the recipient.
  • Each device tracks a last_seen_message_id to sync message history on reconnect.
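
A small sketch of the multi-device bookkeeping follows. It assumes an in-memory dictionary in place of the Redis session store and uses integer message IDs; `SessionStore` and `messages_to_sync` are hypothetical names introduced for illustration.

```python
from collections import defaultdict


class SessionStore:
    """Maps each user to the chat server holding each of their devices."""

    def __init__(self):
        self.devices = defaultdict(dict)    # user_id -> {device_id: server_id}

    def connect(self, user_id, device_id, server_id):
        self.devices[user_id][device_id] = server_id

    def disconnect(self, user_id, device_id):
        self.devices[user_id].pop(device_id, None)

    def route(self, user_id):
        # Every active device receives a copy of an incoming message.
        return sorted(self.devices.get(user_id, {}).items())


def messages_to_sync(conversation_log, last_seen_message_id):
    """Return the messages a reconnecting device has not seen yet."""
    return [
        (message_id, content)
        for message_id, content in conversation_log
        if message_id > last_seen_message_id
    ]


sessions = SessionStore()
sessions.connect("bob", "phone", "server1")
sessions.connect("bob", "laptop", "server2")
targets_before = sessions.route("bob")

sessions.disconnect("bob", "phone")
targets_after = sessions.route("bob")

log = [(1, "hi"), (2, "are you there?"), (3, "ping")]
missed = messages_to_sync(log, last_seen_message_id=1)
```

Because each device keeps its own `last_seen_message_id`, the phone and the laptop can reconnect at different times and each catch up independently.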

Step 5: Scaling & Reliability

  • WebSocket server scaling: Each server handles ~50K-100K connections. For 5M concurrent, deploy 50-100 chat servers behind a load balancer with sticky sessions (by user_id).
  • Message queue partitioning: Partition Kafka topics by conversation_id to preserve ordering within a conversation.
  • Storage sharding: Cassandra partitions by conversation_id automatically. Hot conversations (celebrity group chats) may need further splitting.
  • Offline message sync: When a user comes online, the client sends its last_seen_message_id. The server returns all messages after that ID.
  • End-to-end encryption: Messages are encrypted on the sender's device and decrypted on the recipient's device. The server cannot read message content. Uses the Signal Protocol (Double Ratchet algorithm).
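
Two of the points above reduce to simple calculations: choosing a queue partition from the conversation ID, and sizing the WebSocket fleet. The sketch below uses SHA-256 as a stand-in hash (it is not Kafka's own partitioner), and the function names and the partition count of 12 are illustrative assumptions.

```python
import hashlib
import math


def partition_for(conversation_id: str, num_partitions: int) -> int:
    """Map a conversation to a partition so its messages stay in order."""
    # A stable hash keeps every message of one conversation on the same
    # partition, and a single partition preserves publish order.
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def servers_needed(concurrent_connections: int, connections_per_server: int) -> int:
    """Minimum chat servers for a given number of open WebSockets."""
    return math.ceil(concurrent_connections / connections_per_server)


print(partition_for("conv-42", 12))
print(servers_needed(5_000_000, 100_000), servers_needed(5_000_000, 50_000))
```

The server counts are lower bounds with every machine at full capacity; a real deployment would add headroom for failover and uneven load.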

Key Takeaways

  • WebSocket provides the persistent, bidirectional connection needed for real-time messaging.
  • Decouple message persistence and delivery using a message queue (Kafka) for reliability.
  • Use a wide-column store (Cassandra) for write-heavy, time-ordered message storage.
  • A session store (Redis) maps each user to their active chat server for message routing.
  • Presence tracking and group message fan-out are the main scaling challenges: optimize with batching and lazy updates.
