Social Feed · Part 3 (Hard)

Social Feed 3 - Global Platform with Media

Tags: CDN · Media Processing · Geo Distribution · Sharding · Message Queues · Monitoring

This challenge builds on Social Feed 2 - Going Viral. Complete it first for the best experience.

Problem Statement

Chirper is now a global platform with 500 million users across 100+ countries. It has evolved beyond text to support images and short video clips (up to 60 seconds). Key challenges:

- Media pipeline - users upload images and videos that need to be transcoded into multiple resolutions/formats, stored durably, and served via CDN. The system processes 2 million media uploads per day.
- Content moderation - AI-based screening must flag harmful content within 30 seconds of upload, before it appears in anyone's feed. False positive rate must be < 1%.
- Global latency - users in any country should see their timeline within 200 ms. This requires multi-region data centers with intelligent routing.
- Data sharding - the post database has grown to petabytes. Determine an effective sharding strategy.
- Observability - with hundreds of microservices, you need distributed tracing, metrics, and alerting to keep the platform healthy.

What You'll Learn

Scale to 500 M users, add video/image uploads, content moderation, and multi-region deployment. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

- Registered users: 500,000,000
- Daily active users: 150,000,000
- Media uploads/day: ~2,000,000
- Video length: ≤ 60 seconds
- Moderation latency: < 30 seconds
- Timeline latency (global): < 200 ms
- Storage growth: ~50 TB/day
- Availability target: 99.99%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Scale to 500 M users, add video/image uploads, content moderation, and multi-region deployment.
  • Design for a peak load target around 34,722 RPS (including burst headroom).
  • Registered users: 500,000,000
  • Daily active users: 150,000,000
  • Media uploads/day: ~2,000,000
  • Video length: ≤ 60 seconds
  • Moderation latency: < 30 seconds
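The peak-load target above can be reproduced with back-of-envelope arithmetic. The per-user request rate and burst multiplier below are illustrative assumptions, not given in the problem statement:

```python
# Back-of-envelope peak-load estimate (assumed inputs marked below).
DAU = 150_000_000            # daily active users (from constraints)
REQS_PER_USER_PER_DAY = 10   # assumption: avg timeline/API requests per DAU
BURST_HEADROOM = 2           # assumption: peak traffic is ~2x the daily average

avg_rps = DAU * REQS_PER_USER_PER_DAY / 86_400
peak_rps = avg_rps * BURST_HEADROOM
print(f"average: {avg_rps:,.0f} RPS, peak target: {peak_rps:,.0f} RPS")
# average: 17,361 RPS, peak target: 34,722 RPS
```

Stating the inputs explicitly lets an interviewer challenge each assumption separately instead of debating the final number.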

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
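One way to convert the upload constraint into a concurrency budget is Little's law (L = λ × W). The average transcode duration below is an assumption for illustration; the upload volume and safety margin come from the constraints above:

```python
# Concurrency budget for the transcoding tier via Little's law (L = lambda * W).
UPLOADS_PER_DAY = 2_000_000      # from constraints
AVG_TRANSCODE_SECONDS = 120      # assumption: wall-clock time per media job
SAFETY_MARGIN = 3                # keep a 2-3x margin per tier

arrival_rate = UPLOADS_PER_DAY / 86_400           # jobs/second (lambda)
concurrent_jobs = arrival_rate * AVG_TRANSCODE_SECONDS  # in-flight jobs (L)
provisioned = concurrent_jobs * SAFETY_MARGIN
print(f"{arrival_rate:.1f} jobs/s -> ~{concurrent_jobs:.0f} concurrent, "
      f"provision ~{provisioned:.0f} worker slots")
# 23.1 jobs/s -> ~2778 concurrent, provision ~8333 worker slots
```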

3) Architecture Decisions

  • CDN: Serve static and cacheable content from edge and keep origin strictly for misses and dynamic requests.
  • Media Processing: Split ingest, transform, and delivery into independent stages with async orchestration.
  • Geo Distribution: Route users to nearest region/edge while keeping write-consistency boundaries explicit.
  • Sharding: Choose shard keys around access patterns and growth hotspots, not just data size.
  • Message Queues: Move non-blocking and retry-heavy work to async consumers with explicit retry and DLQ policies.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.

4) Reliability and Failure Strategy

  • Define cache keys and purge workflows before launch to avoid stale/global outages.
  • Store original media durably and make transforms replayable.
  • Design region failover and data residency controls as first-class requirements.
  • Support rebalancing and hotspot detection from day one.
  • Guarantee idempotent consumers and trace every message with correlation IDs.
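The idempotent-consumer guarantee above can be sketched as follows. The in-memory set stands in for a durable dedup store (e.g. a Redis set or a DB table with a unique constraint), and all message fields and names are illustrative:

```python
# Minimal idempotent-consumer sketch. In production the processed-ID set
# must live in a durable store and be updated atomically with the side
# effect; a plain Python set is used here only to show the control flow.
processed_ids: set[str] = set()

def handle_message(msg: dict) -> bool:
    """Process a message at most once per correlation ID.
    Returns True if the side effect ran, False if it was a duplicate."""
    cid = msg["correlation_id"]   # every message carries a correlation ID
    if cid in processed_ids:
        return False              # duplicate delivery (retry): skip side effect
    # ... perform the real side effect here (write, fan-out, etc.) ...
    processed_ids.add(cid)        # record only after the effect succeeds
    return True

first = handle_message({"correlation_id": "m-1", "body": "new post"})
retry = handle_message({"correlation_id": "m-1", "body": "new post"})
print(first, retry)  # True False -- the retried delivery is a no-op
```

Because queues deliver at-least-once, every consumer must tolerate replays like this; the correlation ID also ties the message into distributed traces.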

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • CDN: Long TTL improves latency/cost; short TTL improves freshness.
  • Media Processing: Pre-processing improves playback UX, but requires substantial compute/storage budget.
  • Geo Distribution: Global latency improves, but cross-region consistency and operations become harder.
  • Sharding: Sharding improves horizontal scale but makes cross-shard queries and transactions harder.
  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.

Practical Notes

  • Use an object store (e.g. S3) + CDN for media delivery, with a transcoding pipeline (FFmpeg workers) behind a queue.
  • Shard posts by user_id to keep a user's data co-located; use a shard map (consistent hashing).
  • ML-based content moderation can run as an async step in the media processing pipeline.
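The shard-map idea above can be sketched with consistent hashing. Virtual nodes smooth the key distribution so adding or removing a shard moves only a small fraction of users; the shard names and vnode count are illustrative:

```python
import bisect
import hashlib

# Consistent-hash shard map sketch: user_id -> shard.
class ShardMap:
    def __init__(self, shards: list[str], vnodes: int = 100):
        # Build a sorted hash ring of (position, shard) pairs.
        self._ring: list[tuple[int, str]] = sorted(
            (self._hash(f"{shard}#{v}"), shard)
            for shard in shards
            for v in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, user_id: str) -> str:
        # Walk clockwise to the first ring position >= hash(user_id).
        i = bisect.bisect(self._positions, self._hash(user_id)) % len(self._ring)
        return self._ring[i][1]

shards = ShardMap(["posts-a", "posts-b", "posts-c"])
# A given user always maps to the same shard, so their posts stay co-located.
assert shards.shard_for("user:42") == shards.shard_for("user:42")
```

Keying by user_id keeps each user's timeline reads on a single shard; the cost, as noted in the trade-offs above, is that cross-user queries fan out across shards.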

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> DNS -> CDN Edge -> Load Balancer -> Core Service -> Primary NoSQL DB -> Replica SQL DB -> Message Queue

Design strengths

  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.
  • Media processing is handled by background workers so user-facing latency stays low.
  • Monitoring and logs are wired in from day one for rapid incident triage.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.