
1-on-1 Video Calling

WebSockets · Databases · Auth · Monitoring

Problem Statement

CallDrop is adding 1-on-1 video calling to an existing messaging app. Unlike large group calls, 1-on-1 calls use peer-to-peer (WebRTC) when possible. The system needs:

- Signaling server - exchange WebRTC offer/answer SDP and ICE candidates between the two peers to establish a direct connection.
- NAT traversal - use STUN servers to discover each peer's public IP/port. When a direct connection fails (~20% of cases), fall back to a TURN relay server.
- Call flow - caller initiates → callee receives push notification or in-app alert → callee accepts → WebRTC connection established → call in progress → either party hangs up.
- Adaptive quality - automatically adjust video resolution and frame rate based on available bandwidth (measured via RTCP feedback).
- Call history - log every call: participants, start/end time, duration, and quality metrics (packet loss, jitter).
- Missed calls - if the callee doesn't answer within 30 seconds, log the call as missed and send a notification.
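The signaling server's job reduces to routing opaque SDP/ICE messages between two identified peers. A minimal in-memory sketch of that routing logic is below; in production each peer would be a WebSocket connection on a gateway, but here a peer is just a callback so the relay logic can be shown in isolation (the message field names are illustrative, not a fixed protocol):

```python
import json

class SignalingRouter:
    """Relays WebRTC offer/answer SDP and ICE candidates between two peers."""

    def __init__(self):
        self.peers = {}  # user_id -> send callback (stand-in for a WebSocket)

    def register(self, user_id, send):
        self.peers[user_id] = send

    def relay(self, sender_id, message):
        """Forward a signaling message to the peer named in 'to'.
        The server never inspects the SDP body; it only routes."""
        msg = json.loads(message)
        target = self.peers.get(msg["to"])
        if target is None:
            return False  # callee not connected -> push-notification path
        msg["from"] = sender_id
        target(json.dumps(msg))
        return True
```

The key design point this illustrates: the relay is stateless apart from the connection registry, which is why the practical notes below can call the signaling tier "lightweight."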

Targeting 500,000 concurrent calls at peak with 50 million calls per day.
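A quick back-of-envelope check on those targets (the per-call signaling message count is an assumption for illustration, not from the problem statement):

```python
# Back-of-envelope load derived from the stated daily-call target.
daily_calls = 50_000_000
avg_calls_per_sec = daily_calls / 86_400  # seconds per day -> ~579 setups/sec

# Assume roughly 20 signaling messages per setup (offer, answer,
# a handful of ICE candidates each way) -- an illustrative figure.
signaling_msgs_per_call = 20
avg_signaling_msgs_per_sec = avg_calls_per_sec * signaling_msgs_per_call

print(round(avg_calls_per_sec))           # ~579 call setups/sec (average)
print(round(avg_signaling_msgs_per_sec))  # ~11,574 signaling msgs/sec (average)
```

Peak traffic is typically several times the daily average, which is why the sizing later in this page carries explicit burst headroom.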

What You'll Learn

Design a peer-to-peer video calling feature with signaling, NAT traversal, call history, and adaptive quality. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

- Concurrent calls (peak): ~500,000
- Daily calls: ~50,000,000
- Signaling latency: < 1 second
- Call setup time: < 3 seconds
- Peer-to-peer success rate: ~80%
- TURN relay bandwidth: ~20% of calls
- Availability target: 99.9%
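The TURN relay fraction dominates the bandwidth bill, so it is worth sizing early. A rough estimate, assuming ~1 Mbps of video per direction (an assumed bitrate, not given in the constraints):

```python
# Rough TURN egress estimate at peak. The per-direction bitrate is an
# assumption for illustration; real calls vary with adaptive quality.
peak_concurrent_calls = 500_000
turn_fraction = 0.20            # ~20% of calls fail P2P and use the relay
video_bitrate_mbps = 1.0        # assumed per-direction video bitrate

relayed_calls = peak_concurrent_calls * turn_fraction
# Both peers' media streams pass through the relay, so egress per
# relayed call is roughly 2x the per-direction bitrate.
egress_gbps = relayed_calls * 2 * video_bitrate_mbps / 1000
print(egress_gbps)  # 200.0 Gbps of TURN egress at peak
```

Numbers of this magnitude are why the practical notes below recommend deploying TURN strategically and capping per-session bandwidth.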

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a peer-to-peer video calling feature with signaling, NAT traversal, call history, and adaptive quality.
  • Design for a peak load target around 75,000 RPS (including burst headroom).
  • Concurrent calls (peak): ~500,000
  • Daily calls: ~50,000,000
  • Signaling latency: < 1 second
  • Call setup time: < 3 seconds
  • Peer-to-peer success rate: ~80%

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.

3) Architecture Decisions

  • WebSockets: Use persistent connection gateways and decouple fanout via pub/sub or queues.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Auth: Centralize identity verification and keep authorization checks close to domain resources.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.

4) Reliability and Failure Strategy

  • Track connection churn, backpressure, and session resumption behavior.
  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Use short-lived tokens and secure key rotation workflows.
  • Alert on user-impact SLOs, not only infrastructure metrics.
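For the "short-lived tokens" point, TURN access is a concrete case: coturn's shared-secret mode (the "REST API for TURN" scheme) derives ephemeral credentials from an expiry timestamp and an HMAC, so no per-user TURN passwords are stored. A sketch under that scheme (secret value and TTL are placeholders):

```python
import base64
import hashlib
import hmac
import time

def turn_credentials(user_id, shared_secret, ttl_seconds=600):
    """Ephemeral TURN username/password per the coturn shared-secret scheme.

    username embeds an expiry timestamp; password is the base64-encoded
    HMAC-SHA1 of the username under the secret shared with the TURN server.
    """
    expiry = int(time.time()) + ttl_seconds
    username = f"{expiry}:{user_id}"  # coturn parses the expiry prefix
    digest = hmac.new(shared_secret.encode(), username.encode(),
                      hashlib.sha1).digest()
    password = base64.b64encode(digest).decode()
    return username, password
```

The API server hands these to the client at call setup; the TURN server validates them with the same secret and rejects anything past its expiry, which bounds the damage from a leaked credential.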

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
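The idempotency point matters for call-history writes in particular: hang-up events get retried, and async consumers can see duplicate deliveries. A minimal insert-if-absent sketch, with an in-memory dict standing in for a conditional write or unique-key insert in the real database:

```python
class CallHistory:
    """Idempotent call-record store keyed by call_id.

    Retried writes or duplicate queue deliveries with the same call_id
    must not create duplicate rows.
    """

    def __init__(self):
        self.records = {}  # call_id -> record; stand-in for the real DB

    def log_call(self, call_id, record):
        """Insert-if-absent; returns True only for the first write."""
        if call_id in self.records:
            return False  # duplicate delivery -> no-op
        self.records[call_id] = record
        return True
```

In a real store this becomes a conditional write (e.g. an insert guarded by a uniqueness constraint on call_id), so the retry behavior is enforced by the database rather than application memory.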

6) Trade-offs to Call Out in Interviews

  • WebSockets: persistent connections reduce interaction latency but complicate scaling and state management.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Auth: Central auth simplifies policy, but makes auth service availability/security critical.
  • Monitoring: Deep observability speeds incident response but raises ingestion and tooling costs.

Practical Notes

  • The signaling server is lightweight - it just relays SDP/ICE messages. Use WebSockets (already used in the messaging app).
  • Deploy STUN servers globally (they're stateless and cheap). TURN servers are bandwidth-heavy - deploy strategically and cap per-session bandwidth.
  • WebRTC handles encryption (SRTP) natively - no additional E2E encryption needed for the media stream.
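Browsers' WebRTC stacks do congestion control internally, but the adaptive-quality requirement can be illustrated with a toy AIMD-style controller driven by RTCP loss reports (thresholds and step sizes below are illustrative, not tuned values):

```python
def next_bitrate(current_kbps, loss_fraction, min_kbps=150, max_kbps=2500):
    """Toy AIMD bitrate controller reacting to an RTCP loss fraction.

    Heavy loss -> multiplicative decrease; clean link -> additive
    increase; moderate loss -> hold steady. All constants are
    illustrative, not production-tuned.
    """
    if loss_fraction > 0.10:
        return max(min_kbps, int(current_kbps * 0.5))
    if loss_fraction < 0.02:
        return min(max_kbps, current_kbps + 50)
    return current_kbps
```

The encoder then maps the target bitrate onto a resolution/frame-rate ladder, which is the user-visible half of "adaptive quality."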


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Auth Service -> Primary NoSQL DB -> Realtime Bus -> Monitoring

Design strengths

  • Monitoring and logs are wired in from day one for rapid incident triage.
  • Security controls are enforced at ingress to protect downstream capacity.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.