Medium · Intermediate

Collaborative Whiteboard

WebSockets, Databases, Storage, API Design

Problem Statement

BoardSync is building a real-time collaborative whiteboard. Multiple users open the same board and draw simultaneously. Features:

- Drawing tools - freehand pen, shapes (rectangles, circles, arrows), text boxes, sticky notes, and image uploads.
- Real-time sync - all changes appear on other users' screens within 200 ms. Users see each other's cursors moving in real time.
- Canvas - an infinite canvas that users pan and zoom. Objects are positioned on a coordinate system.
- Undo/redo - per-user undo/redo stack. Undoing your action shouldn't undo someone else's.
- Export - export the board as PNG, SVG, or PDF.
- Templates - pre-built templates for flowcharts, mind maps, Kanban boards, and retrospectives.
- Permissions - board owner controls who can view, edit, or comment.
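The per-user undo/redo requirement can be sketched as one undo stack per user, where each entry records the object state before and after that user's action. This is a minimal illustration under stated assumptions, not a prescribed design: `Op`, `Board`, and their methods are hypothetical names, and the sketch restores the whole prior object state, which could overwrite a later edit by another user to the same object.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Op:
    """One user action on one object (hypothetical structure)."""
    user: str
    object_id: str
    before: Optional[dict]  # state before the action; None = object did not exist
    after: Optional[dict]   # state after the action; None = object deleted


@dataclass
class Board:
    objects: Dict[str, dict] = field(default_factory=dict)
    undo_stacks: Dict[str, List[Op]] = field(default_factory=dict)
    redo_stacks: Dict[str, List[Op]] = field(default_factory=dict)

    def _set(self, object_id: str, state: Optional[dict]) -> None:
        if state is None:
            self.objects.pop(object_id, None)
        else:
            self.objects[object_id] = dict(state)

    def apply(self, user: str, object_id: str, new_state: Optional[dict]) -> None:
        op = Op(user, object_id, self.objects.get(object_id), new_state)
        self._set(object_id, new_state)
        self.undo_stacks.setdefault(user, []).append(op)
        self.redo_stacks[user] = []  # a new action invalidates this user's redo history

    def undo(self, user: str) -> bool:
        stack = self.undo_stacks.get(user, [])
        if not stack:
            return False
        op = stack.pop()  # only this user's most recent action is reverted
        self._set(op.object_id, op.before)
        self.redo_stacks.setdefault(user, []).append(op)
        return True

    def redo(self, user: str) -> bool:
        stack = self.redo_stacks.get(user, [])
        if not stack:
            return False
        op = stack.pop()
        self._set(op.object_id, op.after)
        self.undo_stacks.setdefault(user, []).append(op)
        return True
```

With this shape, one user's undo leaves another user's objects untouched because the two users never share a stack.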

Targeting 100,000 boards with up to 50 concurrent editors per board and 500,000 total DAU.

What You'll Learn

Design a real-time collaborative whiteboard (like Miro/Excalidraw) with drawing, shapes, sticky notes, and multi-user cursors. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Active boards: ~100,000
Concurrent editors per board: Up to 50
Daily active users: ~500,000
Sync latency: < 200 ms
Canvas objects per board: Up to 10,000
Availability target: 99.9%

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a real-time collaborative whiteboard (like Miro/Excalidraw) with drawing, shapes, sticky notes, and multi-user cursors.
  • Design for a peak load target around 100 RPS (including burst headroom).
  • Active boards: ~100,000
  • Concurrent editors per board: Up to 50
  • Daily active users: ~500,000
  • Sync latency: < 200 ms
  • Canvas objects per board: Up to 10,000

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
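The capacity method above can be illustrated with a back-of-envelope calculation. The DAU, editors-per-board, and 200 ms figures come from the constraints; the peak concurrency share, the cursor update rate, and every per-hop latency number are assumptions chosen only for illustration and should be replaced with measured values. The results count WebSocket connections and messages, not HTTP API requests.

```python
# Inputs taken from the stated constraints.
DAILY_ACTIVE_USERS = 500_000
MAX_EDITORS_PER_BOARD = 50
SYNC_BUDGET_MS = 200

# Assumptions for illustration only.
PEAK_CONCURRENCY_PERCENT = 10   # assumed share of DAU connected at peak
CURSOR_UPDATES_PER_SEC = 10     # assumed per-editor rate after client throttling
SAFETY_MARGIN = 3               # upper end of the 2-3x guidance


def peak_connections(dau: int, percent: int) -> int:
    """Concurrent WebSocket connections at peak."""
    return dau * percent // 100


def board_fanout_per_sec(editors: int, updates_per_sec: int) -> int:
    """Outbound messages per second for one board: every update from
    one editor is delivered to every other editor."""
    return editors * updates_per_sec * (editors - 1)


# Assumed per-hop latency budget (ms); the sum must stay under the sync target.
HOP_BUDGET_MS = {
    "client_to_gateway": 50,
    "gateway_processing": 10,
    "pubsub_relay": 20,
    "gateway_to_client": 50,
    "client_render": 30,
}

connections = peak_connections(DAILY_ACTIVE_USERS, PEAK_CONCURRENCY_PERCENT)
provisioned = connections * SAFETY_MARGIN
fanout = board_fanout_per_sec(MAX_EDITORS_PER_BOARD, CURSOR_UPDATES_PER_SEC)
headroom_ms = SYNC_BUDGET_MS - sum(HOP_BUDGET_MS.values())

print(f"peak connections:      {connections}")
print(f"provisioned (3x):      {provisioned}")
print(f"fanout per full board: {fanout} msgs/s")
print(f"latency headroom:      {headroom_ms} ms")
```

The quadratic term in the fanout is the reason cursor updates are usually throttled or batched on the client before they reach the gateway.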

3) Architecture Decisions

  • WebSockets: Use persistent connection gateways and decouple fanout via pub/sub or queues.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Storage: Use object storage for large blobs and keep metadata/authorization separate in the API tier.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
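To make the WebSockets decision concrete, the sketch below shows the shape of per-board fanout through a relay. It is an in-process stand-in, assuming a real deployment would replace `BoardBus` with an external pub/sub system; `BoardBus`, `subscribe`, `unsubscribe`, and `publish` are hypothetical names, and the `deliver` callbacks stand in for writes to WebSocket connections.

```python
from collections import defaultdict
from typing import Callable, Dict


class BoardBus:
    """In-memory stand-in for a pub/sub relay keyed by board ID (hypothetical)."""

    def __init__(self) -> None:
        # board_id -> {connection_id: deliver callback}
        self._subscribers: Dict[str, Dict[str, Callable[[dict], None]]] = defaultdict(dict)

    def subscribe(self, board_id: str, conn_id: str, deliver: Callable[[dict], None]) -> None:
        self._subscribers[board_id][conn_id] = deliver

    def unsubscribe(self, board_id: str, conn_id: str) -> None:
        self._subscribers[board_id].pop(conn_id, None)
        if not self._subscribers[board_id]:
            del self._subscribers[board_id]  # drop empty boards so idle state does not pile up

    def publish(self, board_id: str, sender_conn_id: str, message: dict) -> int:
        """Deliver a message to every connection on the board except the sender.
        Returns the number of deliveries."""
        delivered = 0
        for conn_id, deliver in list(self._subscribers.get(board_id, {}).items()):
            if conn_id == sender_conn_id:
                continue  # the sender has already applied its own change locally
            deliver(message)
            delivered += 1
        return delivered


# Usage: three connections, two on the same board.
bus = BoardBus()
inbox = {"c1": [], "c2": [], "c3": []}
bus.subscribe("board-1", "c1", inbox["c1"].append)
bus.subscribe("board-1", "c2", inbox["c2"].append)
bus.subscribe("board-2", "c3", inbox["c3"].append)
count = bus.publish("board-1", "c1", {"op": "move", "id": "r1"})
```

Keying the relay by board ID keeps fanout proportional to the editors on one board rather than to all connected users.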

4) Reliability and Failure Strategy

  • Track connection churn, backpressure, and session resumption behavior.
  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Enforce lifecycle policies, retention tiers, and checksum validation.
  • Apply strict input validation and backward-compatible versioning.
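The "conditional writes" point can be sketched as a version check on each write. This is an in-memory illustration only, assuming the real store offers an equivalent compare-and-set primitive; `BoardMetadataStore`, `put_if_version`, and `VersionConflict` are hypothetical names.

```python
from typing import Dict, Tuple


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class BoardMetadataStore:
    """In-memory stand-in for a store that supports conditional writes (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[int, dict]] = {}  # board_id -> (version, data)

    def get(self, board_id: str) -> Tuple[int, dict]:
        return self._rows.get(board_id, (0, {}))

    def put_if_version(self, board_id: str, expected_version: int, data: dict) -> int:
        """Write only if the stored version matches what the caller last read."""
        current_version, _ = self.get(board_id)
        if current_version != expected_version:
            raise VersionConflict(
                f"{board_id}: expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._rows[board_id] = (new_version, dict(data))
        return new_version
```

A writer that hits `VersionConflict` re-reads the row and retries, so a concurrent change (for example, to board permissions) is never silently overwritten.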

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
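The idempotency check in the plan above is easier to test when retried writes carry a client-generated key. The sketch below shows one way that could look, assuming keys are remembered alongside results; `IdempotentWriter` and its fields are hypothetical, and a real system would persist the key table and expire old entries.

```python
from typing import Dict


class IdempotentWriter:
    """Applies each keyed write at most once (hypothetical, in-memory)."""

    def __init__(self) -> None:
        self.store: Dict[str, dict] = {}      # object_id -> {"fields": ..., "version": ...}
        self._results: Dict[str, dict] = {}   # idempotency key -> result of the first attempt

    def apply(self, idempotency_key: str, object_id: str, patch: dict) -> dict:
        if idempotency_key in self._results:
            # A retry: return the original result without touching the store again.
            return self._results[idempotency_key]
        obj = self.store.setdefault(object_id, {"fields": {}, "version": 0})
        obj["fields"].update(patch)
        obj["version"] += 1
        result = {"object_id": object_id, "version": obj["version"]}
        self._results[idempotency_key] = result
        return result


# Usage: the second call simulates a client retry after a timeout.
writer = IdempotentWriter()
first = writer.apply("key-1", "note-7", {"text": "hello"})
retry = writer.apply("key-1", "note-7", {"text": "hello"})
second = writer.apply("key-2", "note-7", {"text": "hello again"})
```

A test can then assert that the retry returns the same version as the first attempt and that the object was written only once per key.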

6) Trade-offs to Call Out in Interviews

  • WebSockets: WebSockets reduce interaction latency but complicate scaling and state management.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Storage: Object storage is cheap and durable, but random low-latency reads are weaker than databases/caches.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.

Practical Notes

  • Use CRDTs (Conflict-free Replicated Data Types) for the board state - each shape is an independent CRDT object. Yjs and Automerge are proven libraries.
  • Open one WebSocket connection per client for each active board. Route all editors of the same board to the same server, or use a pub/sub relay when editors are spread across multiple servers.
  • Store the board as a set of objects with (id, type, position, style, z-index). Sync operations are insert/update/delete on individual objects.
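The notes above can be illustrated with a deliberately simplified last-writer-wins map, where each object is replaced as a whole and ties are broken by client ID. This is a sketch of the general idea, not the Yjs or Automerge API and not how those libraries work internally; `Entry`, `LwwBoard`, and their methods are hypothetical names, and whole-object replacement loses concurrent edits to different fields of the same object.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    stamp: Tuple[int, str]   # (logical clock, client_id): a total order across replicas
    value: Optional[dict]    # None is a tombstone marking a deleted object


class LwwBoard:
    """One replica of the board: a map of object ID to its latest entry."""

    def __init__(self, client_id: str) -> None:
        self.client_id = client_id
        self.clock = 0
        self.entries: Dict[str, Entry] = {}

    def _local(self, object_id: str, value: Optional[dict]) -> Tuple[str, Entry]:
        self.clock += 1  # always ahead of every stamp this replica has seen
        entry = Entry((self.clock, self.client_id), value)
        self.entries[object_id] = entry
        return object_id, entry  # the operation to broadcast to other replicas

    def upsert(self, object_id: str, kind: str, position: Tuple[float, float],
               style: dict, z_index: int) -> Tuple[str, Entry]:
        return self._local(object_id, {
            "type": kind, "position": position, "style": style, "z-index": z_index,
        })

    def delete(self, object_id: str) -> Tuple[str, Entry]:
        return self._local(object_id, None)

    def merge(self, object_id: str, entry: Entry) -> None:
        """Apply a remote operation; order of arrival does not affect the result."""
        self.clock = max(self.clock, entry.stamp[0])
        current = self.entries.get(object_id)
        if current is None or entry.stamp > current.stamp:
            self.entries[object_id] = entry

    def visible(self) -> Dict[str, dict]:
        return {k: e.value for k, e in self.entries.items() if e.value is not None}


# Usage: two replicas move the same rectangle concurrently, then exchange operations.
a, b = LwwBoard("a"), LwwBoard("b")
b.merge(*a.upsert("r1", "rect", (0, 0), {}, 0))
op_a = a.upsert("r1", "rect", (10, 10), {}, 0)
op_b = b.upsert("r1", "rect", (5, 5), {}, 0)
a.merge(*op_b)
b.merge(*op_a)
```

Both replicas settle on the same state regardless of delivery order, which is the property the sync layer relies on.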

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Primary NoSQL DB -> Realtime Bus -> Object Storage

Design strengths

  • The architecture keeps synchronous paths short and isolates stateful dependencies behind clear boundaries.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.