Hard · Cloud Drive · Part 2

Cloud Drive 2 - Enterprise Collaboration & Compliance

Storage · Databases · Search · Auth · Monitoring · Replication

This challenge builds on Cloud Drive 1 - Personal File Storage. Complete it first for the best experience.

Problem Statement

SkyVault is now an enterprise platform serving 100 million users across thousands of organizations. New requirements:

- Real-time collaborative editing - multiple users edit documents simultaneously (Google Docs-style). Conflict resolution with Operational Transformation (OT) or CRDTs must keep all clients in sync within 200 ms.
- Full-text search - search across all files a user has access to (including inside PDFs and Office documents), with results in < 1 second.
- eDiscovery & legal hold - admins can place a legal hold on a user's files, preventing deletion. All file activity (views, edits, shares, deletes) is audit-logged for compliance.
- Data Loss Prevention (DLP) - automatically scan uploaded files for sensitive data (SSNs, credit card numbers, API keys). Flag or block sharing of flagged files.
- Multi-region replication - enterprise customers can choose where their data resides (US, EU, Asia), and data must not leave the chosen region. Cross-region disaster recovery with RPO < 1 minute.
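The CRDT option above can be illustrated with a minimal sketch. A grow-only counter (G-Counter) is the simplest CRDT; it is not a text CRDT, but it demonstrates the convergence property (commutative, idempotent merges) that sequence CRDTs used for collaborative editing also guarantee. The class and replica names below are illustrative, not a production library.

```python
# Minimal G-Counter CRDT sketch: each replica increments only its own slot,
# and merge takes the per-replica maximum, so merges commute and converge
# no matter what order replicas sync in.
class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is associative, commutative, and idempotent --
        # the properties that make replica sync order-independent.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

# Two replicas diverge, then converge after merging in either order.
a, b = GCounter("us-east"), GCounter("eu-west")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

Real editing CRDTs (e.g. RGA, Yjs-style sequences) replace the counter with uniquely tagged characters, but the merge contract is the same: any two replicas that have seen the same updates hold the same state.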

This challenge combines storage at scale, real-time systems, search, and enterprise compliance.

What You'll Learn

Scale to 100 M users with real-time co-editing, eDiscovery, data loss prevention, and multi-region replication. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Total users: 100,000,000
Total storage: ~500 PB
Concurrent editors (per doc): up to 100
Collaboration sync latency: < 200 ms
Search latency: < 1 second
DLP scan latency: < 60 seconds per file
Regions: 3 (US, EU, Asia)
RPO (disaster recovery): < 1 minute
Availability target: 99.99%

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Scale to 100 M users with real-time co-editing, eDiscovery, data loss prevention, and multi-region replication.
  • Design for a peak load target around 6,944 RPS (including burst headroom).
  • Total users: 100,000,000
  • Total storage: ~500 PB
  • Concurrent editors (per doc): Up to 100
  • Collaboration sync latency: < 200 ms
  • Search latency: < 1 second

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
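As a worked example of this method, the ~6,944 RPS peak figure from step 1 can be reproduced with back-of-envelope arithmetic. The DAU ratio and per-user request count below are illustrative assumptions (they are not given in the problem statement); only the user total comes from the constraints.

```python
# Back-of-envelope capacity math; the ratios are assumptions for illustration.
TOTAL_USERS = 100_000_000
DAU_RATIO = 0.10             # assume 10% of users are active on a given day
REQS_PER_ACTIVE_USER = 60    # assume 60 API requests per active user per day
SECONDS_PER_DAY = 86_400

daily_requests = TOTAL_USERS * DAU_RATIO * REQS_PER_ACTIVE_USER
avg_rps = daily_requests / SECONDS_PER_DAY
print(f"average RPS: {avg_rps:,.0f}")  # -> average RPS: 6,944

# Apply the 2-3x safety margin from this step to get provisioning targets.
for margin in (2, 3):
    print(f"{margin}x margin: {avg_rps * margin:,.0f} RPS")
```

Different assumptions shift the number, which is exactly the point: in an interview, state the inputs explicitly so the reviewer can challenge the inputs rather than the arithmetic.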

3) Architecture Decisions

  • Storage: Use object storage for large blobs and keep metadata/authorization separate in the API tier.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Search: Use primary store for writes and async index updates for search relevance + scale.
  • Auth: Centralize identity verification and keep authorization checks close to domain resources.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.
  • Replication: Separate primary write path from replicated read path and define lag tolerance per feature.
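The Search bullet's "writes to the primary store, async index updates" pattern is often implemented with a transactional outbox. The sketch below uses SQLite as a stand-in for the primary SQL database and a plain dict as a stand-in for the search index; table and function names are illustrative.

```python
import sqlite3

# Outbox pattern sketch: the file row and its "index me" event commit in one
# transaction, so the search index can never silently miss a write.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, content TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, file_id INTEGER);
""")

def write_file(name: str, content: str) -> int:
    with db:  # one transaction: file row + outbox row succeed or fail together
        cur = db.execute("INSERT INTO files (name, content) VALUES (?, ?)",
                         (name, content))
        db.execute("INSERT INTO outbox (file_id) VALUES (?)", (cur.lastrowid,))
    return cur.lastrowid

search_index: dict[int, str] = {}  # stand-in for Elasticsearch

def drain_outbox() -> None:
    # An async worker runs this loop; the index lags writes (eventual consistency),
    # which is the lag tolerance the Replication bullet says to define per feature.
    rows = db.execute("SELECT id, file_id FROM outbox").fetchall()
    for row_id, file_id in rows:
        name, content = db.execute(
            "SELECT name, content FROM files WHERE id = ?", (file_id,)).fetchone()
        search_index[file_id] = f"{name} {content}"
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    db.commit()

fid = write_file("q3-report.pdf", "revenue forecast")
drain_outbox()
assert "revenue" in search_index[fid]
```

Because the outbox is drained from the system of record, the same mechanism supports the "reindex from source of truth" requirement in step 4: truncate the index and replay.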

4) Reliability and Failure Strategy

  • Enforce lifecycle policies, retention tiers, and checksum validation.
  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Track indexing lag and support reindex from source of truth.
  • Use short-lived tokens and secure key rotation workflows.
  • Alert on user-impact SLOs, not only infrastructure metrics.
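The "conditional writes" item above can be sketched as optimistic concurrency: each update carries the version the caller read, and the store rejects stale writes, so retried or concurrent writes cannot clobber newer data. The in-memory store and names below are illustrative; real systems use the equivalent database primitive (e.g. a compare-and-set or a `WHERE version = ?` update).

```python
class VersionConflict(Exception):
    pass

# Conditional-write sketch: an update applies only if the caller's expected
# version matches the stored one, so lost updates become explicit errors.
store: dict[str, tuple[int, str]] = {}  # key -> (version, value)

def put_if_version(key: str, expected_version: int, value: str) -> int:
    current_version = store.get(key, (0, None))[0]
    if current_version != expected_version:
        raise VersionConflict(
            f"{key}: expected v{expected_version}, found v{current_version}")
    store[key] = (current_version + 1, value)
    return current_version + 1

v = put_if_version("doc:42", 0, "draft")       # first write: v0 -> v1
v = put_if_version("doc:42", v, "final")       # read-modify-write: v1 -> v2
try:
    put_if_version("doc:42", 1, "stale edit")  # lost-update attempt is rejected
except VersionConflict:
    pass
assert store["doc:42"] == (2, "final")
```

The same check makes retries safe: replaying a write with its original expected version either succeeds once or fails loudly, never applies twice.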

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • Storage: Object storage is cheap and durable, but random low-latency reads are weaker than databases/caches.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Search: Search index gives rich querying but introduces eventual consistency and index ops overhead.
  • Auth: Central auth simplifies policy, but makes auth service availability/security critical.
  • Monitoring: Deep observability speeds incident response but raises ingestion and tooling costs.

Practical Notes

  • CRDTs (Conflict-free Replicated Data Types) are simpler to reason about than OT for collaborative editing at scale.
  • Index file contents with Elasticsearch - use Apache Tika to extract text from PDFs, DOCX, etc.
  • DLP scanning can be an async pipeline - upload → object store → DLP scanner → flag/approve → make available.
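The DLP scanner stage of that pipeline can be sketched as a regex pass over the text Tika extracts. The patterns below are deliberately simplified (production DLP adds validation such as the Luhn check for card numbers to cut false positives), and the category names are illustrative.

```python
import re

# Simplified DLP scan: one regex per sensitive-data class from the problem
# statement. A non-empty result means flag the file before it becomes shareable.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def scan(text: str) -> list[str]:
    """Return the sensitive-data classes detected in the extracted text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

flags = scan("Employee SSN: 123-45-6789, card 4111 1111 1111 1111")
assert flags == ["ssn", "credit_card"]
assert scan("quarterly roadmap notes") == []
```

Running this asynchronously (as the bullet suggests) keeps the < 60 s DLP budget off the upload path: the file is durable immediately but only becomes shareable once the scan clears it.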

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> Core Service -> Auth Service -> Primary SQL DB -> Read Model DB -> Object Storage

Design strengths

  • Monitoring and logs are wired in from day one for rapid incident triage.
  • Security controls are enforced at ingress to protect downstream capacity.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.