Hard · Enterprise

Design Google Docs

WebSockets, Databases, Consistency, Storage, Auth, API Design

Problem Statement

Design the architecture for Google Docs - the world's most popular collaborative document editor with over 1 billion users. Multiple users can edit the same document simultaneously and see each other's changes in real time. Your design must cover:

- Real-time collaborative editing - multiple cursors, real-time character-by-character updates visible to all editors within 100 ms. The system must handle up to 100 concurrent editors on a single document.
- Conflict resolution - when two users type at the same position simultaneously, the system must merge changes without losing either edit. Implement Operational Transformation (OT) or CRDTs for deterministic conflict resolution.
- Document storage - documents are stored as structured data (not flat text) supporting rich formatting (bold, italic, headers, tables, images). Autosave every few seconds.
- Version history - full revision history with the ability to view, compare, and restore any previous version. Track who made each change and when (like git blame).
- Comments & suggestions - inline comments anchored to text ranges, suggestion mode (track changes), and threaded discussions.
- Access control - owner, editor, commenter, and viewer roles. Share via link or email invitation. Organization-wide sharing policies.
- Offline editing - users can edit documents offline. Changes sync and merge when connectivity is restored.
- Real-time cursors & presence - see other editors' cursor positions, selections, and avatars in real time.

The core challenge is implementing a real-time collaboration engine that maintains consistency across distributed clients with low latency.
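To make the conflict-resolution requirement concrete, here is a minimal sketch of Operational Transformation for two concurrent inserts at the same position. It is an illustration under stated assumptions, not Google's implementation: the `Insert` type, the `site` tie-breaker, and the function names are all hypothetical, and a real engine would also handle deletes, formatting operations, and revision tracking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where the text is inserted
    text: str
    site: int   # hypothetical client id, used only to break position ties


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `other` has already been applied."""
    # If `other` landed before `op` (or at the same spot and wins the tie),
    # shift `op` right by the length of the text `other` inserted.
    if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


base = "helo"
a = Insert(pos=3, text="l", site=1)  # user A types "l" at index 3
b = Insert(pos=3, text="p", site=2)  # user B types "p" at the same index

# Replica 1 sees A first, then B transformed against A.
r1 = apply(apply(base, a), transform(b, a))
# Replica 2 sees B first, then A transformed against B.
r2 = apply(apply(base, b), transform(a, b))

# Both replicas converge and neither edit is lost.
assert r1 == r2
```

The deterministic tie-break on `site` is what keeps both orderings convergent when the positions are equal; any total order over clients would serve the same purpose.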

What You'll Learn

Design Google Docs - real-time collaborative editing, conflict resolution, version history, and offline support for 1B+ users. Build this architecture under realistic production constraints, then validate trade-offs in the design lab simulation.

Constraints

Total users: 1,000,000,000+
Active documents (concurrent): ~50,000,000
Max concurrent editors/doc: 100
Edit propagation latency: < 100 ms
Autosave interval: ~3 seconds
Revision history depth: Unlimited
Max document size: 50 MB
Availability target: 99.99%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design Google Docs - real-time collaborative editing, conflict resolution, version history, and offline support for 1B+ users.
  • Design for a peak load target of about 80,000 requests per second (RPS), including burst headroom.
  • Total users: 1,000,000,000+
  • Active documents (concurrent): ~50,000,000
  • Max concurrent editors/doc: 100
  • Edit propagation latency: < 100 ms
  • Autosave interval: ~3 seconds

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
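As a rough worked example of this method, the sketch below turns the stated constraints into per-tier budgets. The peak RPS, safety margin, editor limit, and 100 ms target come from this page; the per-hop latency figures and hop names are illustrative assumptions only and would need to be replaced with measured values.

```python
# Figures taken from the problem constraints.
PEAK_RPS = 80_000
SAFETY_MARGINS = (2, 3)            # "at least 2-3x safety margin per tier"
MAX_EDITORS_PER_DOC = 100
LATENCY_TARGET_MS = 100

# Capacity to provision per tier at the low and high end of the margin.
provisioned_rps = [PEAK_RPS * m for m in SAFETY_MARGINS]

# Each accepted edit must be delivered to every other editor on the document.
fanout_per_edit = MAX_EDITORS_PER_DOC - 1

# Hypothetical per-hop latency budget (ms); these numbers are assumptions.
HOP_BUDGET_MS = {
    "client_to_gateway": 30,
    "gateway_to_session_server": 5,
    "transform_and_log_append": 10,
    "fanout_bus": 10,
    "gateway_to_clients": 30,
}

total_budget_ms = sum(HOP_BUDGET_MS.values())
headroom_ms = LATENCY_TARGET_MS - total_budget_ms

print(f"Provision each tier for {provisioned_rps[0]:,}-{provisioned_rps[1]:,} RPS")
print(f"Worst-case fan-out per edit: {fanout_per_edit} deliveries")
print(f"Hop budgets sum to {total_budget_ms} ms, leaving {headroom_ms} ms headroom")
```

Writing the budget down per hop makes it easy to see which hop to renegotiate when a measured p95 exceeds its allocation.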

3) Architecture Decisions

  • WebSockets: Use persistent connection gateways and decouple fanout via pub/sub or queues.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Consistency: Classify operations by consistency requirement: strong (a single agreed order per document) for edit operations and access-control changes, eventual for presence, cursors, and analytics.
  • Storage: Use object storage for large blobs and keep metadata/authorization separate in the API tier.
  • Auth: Centralize identity verification and keep authorization checks close to domain resources.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
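One way to keep authorization checks close to the document resource is a small role comparison against the document's access list. The sketch below assumes the four roles from the problem statement form a strict hierarchy and that sharing is owner-only; both are simplifying assumptions, and `Role`, `REQUIRED_ROLE`, and `is_allowed` are hypothetical names.

```python
from enum import IntEnum


class Role(IntEnum):
    # Ordered so that a higher value implies every lower permission (assumption).
    VIEWER = 1
    COMMENTER = 2
    EDITOR = 3
    OWNER = 4


# Minimum role needed for each action; the mapping is illustrative.
REQUIRED_ROLE = {
    "read": Role.VIEWER,
    "comment": Role.COMMENTER,
    "edit": Role.EDITOR,
    "share": Role.OWNER,
}


def is_allowed(acl: dict[str, Role], user_id: str, action: str) -> bool:
    """Return True if `user_id` may perform `action` under this document's ACL."""
    role = acl.get(user_id)
    if role is None:
        return False  # users not on the ACL get no access
    return role >= REQUIRED_ROLE[action]


acl = {"alice": Role.OWNER, "bob": Role.COMMENTER}
print(is_allowed(acl, "bob", "comment"))   # bob may comment
print(is_allowed(acl, "bob", "edit"))      # but may not edit
```

Link sharing and organization-wide policies would add further lookups before this check, which this sketch leaves out.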

4) Reliability and Failure Strategy

  • Track connection churn, backpressure, and session resumption behavior.
  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Use idempotency keys and conflict-resolution rules on retried/distributed writes.
  • Enforce lifecycle policies, retention tiers, and checksum validation.
  • Use short-lived tokens and secure key rotation workflows.
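The idempotency bullet above can be sketched as a server-side log that deduplicates retried writes. This assumes each client tags every operation with its own id and a per-client sequence number, which together act as the idempotency key; the `OpLog` class and its fields are hypothetical.

```python
class OpLog:
    """Append-only operation log that ignores duplicate submissions."""

    def __init__(self) -> None:
        self.ops: list[str] = []
        # (client_id, client_seq) -> revision assigned on first acceptance
        self.revision_by_key: dict[tuple[str, int], int] = {}

    def submit(self, client_id: str, client_seq: int, op: str) -> int:
        """Accept `op` once; a retry with the same key returns the same revision."""
        key = (client_id, client_seq)
        if key in self.revision_by_key:
            return self.revision_by_key[key]  # retry: do not append again
        self.ops.append(op)
        revision = len(self.ops)
        self.revision_by_key[key] = revision
        return revision


log = OpLog()
first = log.submit("client-a", 1, "insert(0, 'H')")
retry = log.submit("client-a", 1, "insert(0, 'H')")  # e.g. resent after a reconnect
second = log.submit("client-a", 2, "insert(1, 'i')")
assert first == retry and len(log.ops) == 2
```

In production the key table would need bounded retention (for example, keeping only recent sequence numbers per client), which this sketch does not attempt.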

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • WebSockets: WebSockets reduce interaction latency but complicate scaling and state management.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Consistency: Stronger consistency improves correctness, but often increases latency and coordination costs.
  • Storage: Object storage is cheap and durable, but random low-latency reads are weaker than databases/caches.
  • Auth: Central auth simplifies policy, but makes auth service availability/security critical.

Practical Notes

  • OT (Operational Transformation): a central server transforms concurrent operations so that every client converges on the same document. Google Docs uses OT; the centralized model is simpler to reason about, but it requires a single source of truth per document.
  • Each document is managed by a single 'document session server' - all edits for that doc route to the same server for OT processing.
  • Store the document as a log of operations (event sourcing). Periodically snapshot the current state to avoid replaying the entire history.
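The operation-log-plus-snapshot idea can be sketched as follows. Operations are reduced to plain text inserts and the snapshot interval is set to three revisions purely for illustration; `DocumentStore` and its methods are hypothetical names, not part of any real API.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentStore:
    snapshot_every: int = 3                      # illustrative interval
    log: list = field(default_factory=list)      # (pos, text) inserts, in order
    snapshots: dict = field(default_factory=lambda: {0: ""})  # revision -> text

    def append(self, pos: int, text: str) -> int:
        """Record an insert and return the new revision number."""
        self.log.append((pos, text))
        revision = len(self.log)
        if revision % self.snapshot_every == 0:
            # Materialize the state so later loads can skip older operations.
            self.snapshots[revision] = self.load(revision)
        return revision

    def load(self, revision=None) -> str:
        """Rebuild the document at `revision` (default: latest)."""
        if revision is None:
            revision = len(self.log)
        # Start from the newest snapshot at or before the requested revision.
        base = max(r for r in self.snapshots if r <= revision)
        doc = self.snapshots[base]
        for pos, text in self.log[base:revision]:
            doc = doc[:pos] + text + doc[pos:]
        return doc


store = DocumentStore()
store.append(0, "Hi")
store.append(2, "!")
store.append(2, " there")   # third revision triggers a snapshot
store.append(0, ">> ")

print(store.load())     # latest state, replaying only one op after the snapshot
print(store.load(2))    # an older revision, as needed for version history
```

Because the log is never rewritten, the same structure serves version history: any past revision can be rebuilt from the nearest earlier snapshot.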

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> Core Service -> Auth Service -> Primary NoSQL DB -> Replica SQL DB -> Realtime Bus

Design strengths

  • Security controls are enforced at ingress to protect downstream capacity.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.