Hard · Chat App · Part 2

Chat App 2 - End-to-End Encryption & Federation

WebSockets · Databases · Auth · Geo Distribution · Replication · Consistency

This challenge builds on Chat App 1 - Team Messaging. Complete it first for the best experience.

Problem Statement

ThreadSpace has grown to 50 million users and is adding enterprise-grade security:

- End-to-end encryption (E2EE) - messages are encrypted on the sender's device and decrypted only on recipient devices. The server never sees plaintext. Key management across multiple devices per user is critical.
- Cross-organization messaging - users in different organizations can be invited to shared channels (federation). This introduces trust boundaries and key exchange challenges.
- Multi-region deployment - the service runs in 5 regions. Users should connect to the nearest region. Messages between users in the same org (likely same region) should be fast; cross-region messages can tolerate slightly higher latency.
- Compliance - message audit logs for regulated industries (the encrypted content is opaque to the server, but metadata - who messaged whom, when - must be retained for 5 years).

This challenge tests your understanding of security architecture, key management, and the trade-offs of E2EE at scale.

What You'll Learn

Add E2E encryption, cross-org messaging, and scale to 50 M users across regions. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.

Constraints

  • Total users: 50,000,000
  • Concurrent connections: ~15,000,000
  • Messages per day: ~2,000,000,000
  • Regions: 5
  • Max devices per user: 5
  • Message delivery (same region): < 200 ms
  • Message delivery (cross-region): < 1 second
  • Metadata retention: 5 years
  • Availability target: 99.99%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Add E2E encryption, cross-org messaging, and scale to 50 M users across regions.
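  • Design for a peak load target around 80,000 RPS. That is roughly 3.5x the ~23,000 messages per second that 2 billion messages per day averages out to, so it includes burst headroom.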
  • Total users: 50,000,000
  • Concurrent connections: ~15,000,000
  • Messages per day: ~2,000,000,000
  • Regions: 5
  • Max devices per user: 5

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
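
The conversion described above can be worked through with the numbers from the Constraints section. This is a back-of-the-envelope sketch, not a sizing recommendation: the even split of connections across regions, the 50,000 connections per gateway node, and the 200 bytes per metadata record are illustrative assumptions that do not come from the challenge.

```python
SECONDS_PER_DAY = 86_400

# Figures taken from the Constraints section.
messages_per_day = 2_000_000_000
concurrent_connections = 15_000_000
regions = 5
peak_target_rps = 80_000
retention_years = 5
availability_target = 0.9999

# Assumptions (hypothetical, for illustration only).
CONNS_PER_GATEWAY = 50_000      # sockets one gateway node can hold
METADATA_BYTES_PER_MSG = 200    # sender, recipients, timestamps, ids

# Request rate: daily volume converted to an average per-second rate.
avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
peak_to_avg_ratio = peak_target_rps / avg_msgs_per_sec

# Concurrency budget: assumes connections spread evenly over regions.
conns_per_region = concurrent_connections // regions
gateways_per_region = -(-conns_per_region // CONNS_PER_GATEWAY)  # ceiling division

# Storage growth: metadata rows kept for the full retention window
# (365-day years, ignoring leap days).
metadata_records = messages_per_day * 365 * retention_years
metadata_terabytes = metadata_records * METADATA_BYTES_PER_MSG / 1e12

# Availability: yearly downtime budget implied by the target.
downtime_minutes_per_year = (1 - availability_target) * 365 * 24 * 60

print(f"average rate:        {avg_msgs_per_sec:,.0f} msg/s")
print(f"peak / average:      {peak_to_avg_ratio:.2f}x")
print(f"connections/region:  {conns_per_region:,}")
print(f"gateways/region:     {gateways_per_region}")
print(f"metadata records:    {metadata_records:,}")
print(f"metadata volume:     {metadata_terabytes:,.0f} TB")
print(f"downtime budget:     {downtime_minutes_per_year:.2f} min/year")
```

Applying the 2-3x safety margin from the list above would mean multiplying the gateway count and storage volume accordingly before committing to a tier size.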

3) Architecture Decisions

  • WebSockets: Use persistent connection gateways and decouple fanout via pub/sub or queues.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Auth: Centralize identity verification and keep authorization checks close to domain resources.
  • Geo Distribution: Route users to nearest region/edge while keeping write-consistency boundaries explicit.
  • Replication: Separate primary write path from replicated read path and define lag tolerance per feature.
  • Consistency: Classify operations by consistency requirement: strong for money/inventory, eventual for feeds/analytics.
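
The WebSockets decision above (persistent connection gateways with fanout decoupled through pub/sub) can be sketched in memory as follows. `Broker` and `Gateway` are hypothetical stand-ins for a real pub/sub system and a real WebSocket gateway, and a plain list plays the role of each socket's outbound buffer. The envelope is treated as opaque bytes, in keeping with the E2EE requirement that the server never sees plaintext.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a pub/sub system (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # channel_id -> gateways

    def subscribe(self, channel_id, gateway):
        self._subscribers[channel_id].add(gateway)

    def publish(self, channel_id, envelope):
        """Hand the opaque envelope to every subscribed gateway.

        Returns the number of sockets the envelope was queued on.
        """
        delivered = 0
        for gateway in self._subscribers.get(channel_id, ()):
            delivered += gateway.deliver(channel_id, envelope)
        return delivered


class Gateway:
    """Holds client connections; knows nothing about other gateways."""

    def __init__(self, name):
        self.name = name
        self._sockets = defaultdict(list)  # channel_id -> outboxes

    def attach(self, broker, channel_id, outbox):
        """Register a connected client and subscribe on its behalf."""
        self._sockets[channel_id].append(outbox)
        broker.subscribe(channel_id, self)

    def deliver(self, channel_id, envelope):
        outboxes = self._sockets.get(channel_id, [])
        for outbox in outboxes:
            outbox.append(envelope)
        return len(outboxes)


# Usage: two devices connected to different gateways share one channel.
broker = Broker()
gw_a, gw_b = Gateway("gw-a"), Gateway("gw-b")
alice_phone, bob_laptop = [], []
gw_a.attach(broker, "chan-1", alice_phone)
gw_b.attach(broker, "chan-1", bob_laptop)

queued = broker.publish("chan-1", b"<ciphertext>")
```

The sender's gateway only talks to the broker, so gateways can be added or drained without any gateway tracking where other users are connected.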

4) Reliability and Failure Strategy

  • Track connection churn, backpressure, and session resumption behavior.
  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Use short-lived tokens and secure key rotation workflows.
  • Design region failover and data residency controls as first-class requirements.
  • Monitor replication lag and have failover runbooks with recovery point objectives.
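
As one illustration of the short-lived tokens and key rotation item above, here is a minimal sketch of an HMAC-signed token with an expiry, verified against a list of accepted signing keys so that an old key keeps working during a rotation window. The token layout and the function names `issue_token` and `verify_token` are assumptions made for this example; a production system would more likely use an established token format and library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import List, Optional


def issue_token(secret: bytes, user_id: str, ttl_seconds: int,
                now: Optional[float] = None) -> str:
    """Return 'payload.signature', both URL-safe base64 encoded."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {"sub": user_id, "exp": int(now) + ttl_seconds}, sort_keys=True
    ).encode()
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "."
            + base64.urlsafe_b64encode(signature).decode())


def verify_token(accepted_secrets: List[bytes], token: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is valid and unexpired, else None.

    accepted_secrets lists the current signing key first, followed by
    any previous keys still honoured during rotation.
    """
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong part count or invalid base64
        return None
    for secret in accepted_secrets:
        expected = hmac.new(secret, payload, hashlib.sha256).digest()
        if hmac.compare_digest(signature, expected):
            claims = json.loads(payload)
            return claims["sub"] if now < claims["exp"] else None
    return None


# Usage: a 5-minute token survives a key rotation but not its expiry.
old_key, new_key = b"old-signing-key", b"new-signing-key"
token = issue_token(old_key, "alice", ttl_seconds=300, now=1_000)

during_rotation = verify_token([new_key, old_key], token, now=1_100)
after_expiry = verify_token([new_key, old_key], token, now=1_400)
after_old_key_retired = verify_token([new_key], token, now=1_100)
```

Because tokens are short-lived, the old key only has to stay in the accepted list for one token lifetime after rotation.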

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
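
The idempotency check in the validation plan can be exercised with a consumer that deduplicates on a message id. The sketch below keeps the seen ids in an in-memory set purely for illustration; in a real deployment that check would presumably be a conditional write against the durable store. The class and method names are hypothetical.

```python
class IdempotentConsumer:
    """Stores each message at most once, however often it is redelivered."""

    def __init__(self):
        self._seen_ids = set()
        self.stored = []  # (message_id, envelope) pairs, in arrival order

    def handle(self, message_id, envelope):
        """Return True if the message was stored, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        self.stored.append((message_id, envelope))
        return True


# Usage: simulate a retry after a lost acknowledgement.
consumer = IdempotentConsumer()
first = consumer.handle("msg-1", b"<ciphertext>")
retry = consumer.handle("msg-1", b"<ciphertext>")
other = consumer.handle("msg-2", b"<ciphertext>")
```

A validation run would replay the same batch twice and assert that the stored count does not change on the second pass.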

6) Trade-offs to Call Out in Interviews

  • WebSockets: WebSockets reduce interaction latency but complicate scaling and state management.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Auth: Central auth simplifies policy, but makes auth service availability/security critical.
  • Geo Distribution: Global latency improves, but cross-region consistency and operations become harder.
  • Replication: Replication improves read scale and DR posture but complicates consistency semantics.

Practical Notes

  • The Signal Protocol (Double Ratchet) is the gold standard for E2EE - study its key management model.
  • Per-device keys mean sending a message to a user with 5 devices requires 5 encrypted copies.
  • Federation introduces the challenge of cross-org key exchange - consider a key transparency log.
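
Two of the notes above, per-device copies and a key transparency log, can be illustrated together. The sketch below is deliberately simplified: it uses a hash chain over published device keys rather than the Merkle-tree structures that real key transparency systems typically use, it performs no actual encryption, and the names `KeyLog` and `envelopes_needed` are hypothetical. It only shows that tampering with a published key is detectable and that the number of encrypted copies follows the number of registered devices.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def entry_hash(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    body = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()


class KeyLog:
    """Append-only, hash-chained log of device public keys (simplified)."""

    def __init__(self):
        self.entries = []  # list of (record, hash) tuples

    def head(self):
        return self.entries[-1][1] if self.entries else GENESIS_HASH

    def publish(self, org, user, device, public_key_hex):
        record = {"org": org, "user": user, "device": device,
                  "public_key": public_key_hex}
        digest = entry_hash(self.head(), record)
        self.entries.append((record, digest))
        return digest

    def verify(self):
        """Recompute the chain; any edited record breaks it."""
        prev = GENESIS_HASH
        for record, digest in self.entries:
            if entry_hash(prev, record) != digest:
                return False
            prev = digest
        return True

    def device_keys(self, org, user):
        """Latest published key per device for one user."""
        return {r["device"]: r["public_key"] for r, _ in self.entries
                if r["org"] == org and r["user"] == user}


def envelopes_needed(log, recipients):
    """One encrypted copy per registered device of each recipient."""
    return sum(len(log.device_keys(org, user)) for org, user in recipients)


# Usage: bob (another org) has 5 devices, carol has 2.
log = KeyLog()
for i in range(5):
    log.publish("org-b", "bob", f"device-{i}", f"{i:064x}")
for i in range(2):
    log.publish("org-c", "carol", f"device-{i}", f"{i + 10:064x}")

copies_for_bob = envelopes_needed(log, [("org-b", "bob")])
copies_for_both = envelopes_needed(log, [("org-b", "bob"), ("org-c", "carol")])
chain_ok_before = log.verify()

# Simulate a server silently swapping one of bob's keys.
log.entries[0][0]["public_key"] = "f" * 64
chain_ok_after = log.verify()
```

In a federated setup, each organization could audit the other's log head before trusting keys fetched across the trust boundary.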

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Mobile Clients -> DNS -> Load Balancer -> API Gateway -> Core Service -> Auth Service -> Primary NoSQL DB -> Replica SQL DB

Design strengths

  • Security controls are enforced at ingress to protect downstream capacity.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.