Hard · Enterprise

Design Spotify

CDN · Databases · Caching · Microservices · Storage · Analytics

Problem Statement

Design the architecture for Spotify - the world's largest audio streaming platform with 615 million monthly active users and a catalog of 100 million tracks. Your design must cover:

- Audio streaming - serve audio files in multiple quality levels (low 24 kbps → very high 320 kbps → lossless 1,411 kbps). The player pre-buffers the next 30 seconds and transitions between songs seamlessly with gapless playback.
- Content ingestion - artists/labels upload tracks via Spotify for Artists. Each track is transcoded into multiple formats (OGG Vorbis, AAC, FLAC), tagged with metadata, and distributed to CDN edge nodes.
- Personalization engine - powers Discover Weekly, Release Radar, Daily Mixes, and the home feed. Uses collaborative filtering, content-based analysis (audio features via ML), and contextual signals (time of day, mood, listening history).
- Search - search across 100 M tracks, artists, albums, podcasts, and playlists with type-ahead autocomplete in < 50 ms.
- Offline mode - premium users can download playlists for offline listening. DRM (Widevine/FairPlay) protects content.
- Social features - shared playlists, friend activity feed, collaborative queue ("Group Session").
- Royalty calculation - for every stream, calculate the fractional royalty owed to artists/labels based on the user's subscription revenue and total streams. This is a massive batch compute job.

The key challenge is low-latency audio streaming with personalization at massive scale, combined with a complex royalty/payment system.
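The royalty requirement above is often glossed over in interviews, so it helps to make the arithmetic concrete. A minimal sketch of a pro-rata payout model follows; the 70% payout share and the revenue-pool definition are assumptions for illustration, not Spotify's actual terms.

```python
def prorata_royalty(revenue_pool: float, artist_streams: int,
                    total_streams: int, payout_share: float = 0.7) -> float:
    """Fractional royalty owed under a simple pro-rata model.

    payout_share (70%) and the pool definition are illustrative
    assumptions; real label deals vary per contract.
    """
    if total_streams == 0:
        return 0.0
    return revenue_pool * payout_share * (artist_streams / total_streams)

# Example: $100M monthly pool, an artist with 2M of 1.5B total streams
owed = prorata_royalty(100_000_000, 2_000_000, 1_500_000_000)  # ≈ $93,333
```

At ~1.5 B streams/day this calculation runs per (rights-holder, period) pair, which is why the problem statement frames it as a batch compute job rather than a per-stream online write.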

What You'll Learn

Design Spotify's music streaming platform - audio delivery, personalized playlists, offline mode, and social features for 600 M users. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

  • Monthly active users: 615,000,000
  • Catalog size: 100,000,000 tracks
  • Concurrent streams (peak): ~30,000,000
  • Audio start time: < 200 ms
  • Search / autocomplete: < 50 ms
  • Storage (all formats): ~100 PB
  • Daily streams: ~1,500,000,000
  • Availability target: 99.99%
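A quick back-of-envelope pass turns these constraints into the numbers the rest of the design leans on. The peak multiplier and per-track footprint below are assumptions chosen to be consistent with the stated ~80k RPS target and ~100 PB total.

```python
SECONDS_PER_DAY = 86_400

daily_streams = 1_500_000_000
avg_start_rps = daily_streams / SECONDS_PER_DAY   # ≈ 17,400 stream starts/s
peak_start_rps = avg_start_rps * 4.6              # assumed diurnal peak factor ≈ 80k RPS

tracks = 100_000_000
bytes_per_track = 1_000_000_000                   # ~1 GB across all formats (assumption)
total_storage_pb = tracks * bytes_per_track / 1e15  # ≈ 100 PB, matching the constraint
```

Note this counts stream *starts* only; metadata, search, and social traffic layer on top, which is part of why the 2-3x safety margin below matters.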

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design Spotify's music streaming platform - audio delivery, personalized playlists, offline mode, and social features for 600 M users.
  • Design for a peak load target around 80,000 RPS (including burst headroom).
  • Monthly active users: 615,000,000
  • Catalog size: 100,000,000 tracks
  • Concurrent streams (peak): ~30,000,000
  • Audio start time: < 200 ms
  • Search / autocomplete: < 50 ms

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
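"Reserve explicit latency budgets per hop" can be shown concretely for the < 200 ms audio-start SLO. The per-hop allocations below are illustrative assumptions, not measured numbers; the point is that they must sum to the SLO with nothing left implicit.

```python
# Per-hop p95 budget for the < 200 ms audio-start SLO.
# Every allocation here is an assumption to be tuned against real traces.
audio_start_budget_ms = {
    "dns_and_tls": 30,
    "cdn_edge_lookup": 20,
    "playback_auth": 40,       # token check at the API tier
    "manifest_fetch": 30,
    "first_audio_chunk": 60,   # edge-to-client transfer of the opening segment
    "client_decode_start": 20,
}

assert sum(audio_start_budget_ms.values()) <= 200  # budget must fit the SLO
```

In review, each hop owner defends their slice; any hop that exceeds its slice at p95 is the bottleneck to attack first.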

3) Architecture Decisions

  • CDN: Serve static and cacheable content from edge and keep origin strictly for misses and dynamic requests.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Caching: Put cache on hot read paths first and pick cache-aside or write-through explicitly.
  • Microservices: Split services by business boundary, not by technical layer, and enforce ownership per domain.
  • Storage: Use object storage for large blobs and keep metadata/authorization separate in the API tier.
  • Analytics: Maintain separate OLTP and analytics paths; stream events into a warehouse/time-series layer.
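The "pick cache-aside or write-through explicitly" decision above can be sketched in a few lines. This is a minimal in-process cache-aside shape for track metadata, not production Redis; the loader stands in for the system-of-record read.

```python
import time

class CacheAside:
    """Minimal cache-aside read path with TTL (sketch, not production code)."""

    def __init__(self, loader, ttl_s: float = 300):
        self.loader = loader      # falls through to the system-of-record
        self.ttl_s = ttl_s
        self._store = {}          # key -> (value, expires_at)

    def get(self, key):
        hit = self._store.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]         # cache hit: DB untouched
        value = self.loader(key)  # miss: read from the DB, then populate
        self._store[key] = (value, time.monotonic() + self.ttl_s)
        return value

    def invalidate(self, key):
        self._store.pop(key, None)  # called from the write path to bound staleness
```

The explicit `invalidate` hook is what the reliability section below calls "TTL + invalidation hooks": TTL bounds worst-case staleness, invalidation keeps the common case fresh.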

4) Reliability and Failure Strategy

  • Define cache keys and purge workflows before launch to avoid stale-content incidents and accidental global purges.
  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Bound staleness with TTL + invalidation hooks for critical entities.
  • Add service-level timeout/retry budgets and contract tests.
  • Enforce lifecycle policies, retention tiers, and checksum validation.
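The "timeout/retry budgets" item deserves a concrete shape, since unbounded retries are a classic cascading-failure source. A sketch under assumed numbers (budget, base delay, attempt cap are all illustrative) follows; a real service would pair this with circuit breaking.

```python
import random
import time

def call_with_budget(fn, total_budget_s: float = 0.5,
                     base_delay_s: float = 0.05, max_attempts: int = 3):
    """Retry with exponential backoff + jitter under a hard latency budget.

    All parameters are illustrative assumptions; the key property is that
    retries never push total latency past the caller's budget.
    """
    deadline = time.monotonic() + total_budget_s
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                              # attempts exhausted
            # exponential backoff with jitter to avoid synchronized retries
            delay = base_delay_s * (2 ** attempt) * (0.5 + random.random() / 2)
            if time.monotonic() + delay > deadline:
                raise TimeoutError("retry budget exhausted")
            time.sleep(delay)
```

Because the budget is enforced locally, a degraded dependency sheds load instead of accumulating queued retries, which is what the dependency-degradation test in the validation plan should confirm.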

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
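"Verify idempotency for all retried writes and async consumers" matters most on the stream-count path, since duplicates there corrupt royalty math. A minimal sketch of an idempotent consumer keyed by event id follows; a real system would persist the seen-set durably (e.g. a unique constraint in the royalty ledger), not hold it in memory.

```python
class IdempotentConsumer:
    """Dedupes redelivered 'stream completed' events by event id (sketch).

    In production the dedupe state must be durable and co-committed with
    the count update; the in-memory set here only illustrates the contract.
    """

    def __init__(self):
        self._seen = set()
        self.stream_counts = {}   # track_id -> completed streams

    def handle(self, event_id: str, track_id: str) -> bool:
        if event_id in self._seen:
            return False          # duplicate delivery: no double-count
        self._seen.add(event_id)
        self.stream_counts[track_id] = self.stream_counts.get(track_id, 0) + 1
        return True
```

The validation test is then simple: replay the same event batch twice and assert counts are unchanged on the second pass.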

6) Trade-offs to Call Out in Interviews

  • CDN: Long TTL improves latency/cost; short TTL improves freshness.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Caching: Higher hit rate cuts latency/cost, but stale data and invalidation bugs become primary risks.
  • Microservices: Independent deployability improves scale but increases operational/debug complexity.
  • Storage: Object storage is cheap and durable, but random low-latency reads are weaker than databases/caches.

Practical Notes

  • Audio files are relatively small (~3-10 MB each) - CDN edge caching is extremely effective since popular tracks follow a power-law distribution.
  • Pre-buffer strategy: when a user is 80% through a song, start fetching the next track in the queue from CDN.
  • Personalization: train models offline (Spark/Hadoop), serve recommendations from a pre-computed cache (Redis) refreshed daily.
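The 80% pre-buffer rule above reduces to a one-line client-side check. A sketch (the function name and threshold default are ours, matching the note):

```python
def should_prefetch_next(position_s: float, duration_s: float,
                         threshold: float = 0.8) -> bool:
    """True once playback passes the prefetch threshold (80% per the note above)."""
    return duration_s > 0 and position_s / duration_s >= threshold
```

The client polls this on playback-progress ticks and, on the first True, starts fetching the next queued track's opening segments from the CDN, which is what makes gapless transitions possible within the < 200 ms start budget.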


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> DNS -> CDN Edge -> Load Balancer -> API Gateway -> Core Service -> Redis Cache -> Primary SQL DB (on cache miss)

Design strengths

  • Cache sits on the read path to absorb repeated queries and keep DB pressure stable.
  • Analytics pipeline is separated from OLTP path to avoid reporting workloads impacting transactions.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.