
Content Delivery Network

CDN · Caching · Storage · Monitoring · Geo Distribution

Problem Statement

EdgeBlast is building a CDN service. Customers configure their domain to route through EdgeBlast, which caches and serves their static content from edge servers worldwide. Features:

- Edge caching - cache static assets (images, JS, CSS, videos) at 50 edge locations worldwide. Cache hit ratio target: > 95%.
- Cache invalidation - customers can purge specific URLs or entire path prefixes. Purges must propagate to all edges within 30 seconds.
- Origin shielding - on an edge cache miss, don't let every edge hit the customer's origin. Route misses through a shield layer (a few regional caches) to coalesce requests.
- HTTP/2 & HTTP/3 - support modern protocols for multiplexing and reduced latency.
- Analytics - per-domain dashboards showing bandwidth usage, cache hit ratio, latency by edge, and top requested URLs.
- DDoS protection - basic L7 DDoS mitigation: detect anomalous traffic patterns and drop malicious requests at the edge.
- Custom rules - customers add headers, enable CORS, set redirect rules, and configure access control via a rules engine.

Serve 1 billion requests per day across all customers with 100 TB of cached content.

What You'll Learn

Design a CDN with edge caching, cache invalidation, origin shielding, and real-time purge support. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Edge locations: ~50 globally
Requests/day (total): ~1,000,000,000
Cached content: ~100 TB
Cache hit ratio target: > 95%
Purge propagation: < 30 seconds globally
Availability target: 99.99%

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a CDN with edge caching, cache invalidation, origin shielding, and real-time purge support.
  • Design for a peak load target around 57,870 RPS (roughly 5x the ~11,600 RPS daily average, to cover burst headroom).
  • Edge locations: ~50 globally
  • Requests/day (total): ~1,000,000,000
  • Cached content: ~100 TB
  • Cache hit ratio target: > 95%
  • Purge propagation: < 30 seconds globally
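The peak-RPS figure above falls out of simple arithmetic; the 5x burst multiplier is the assumption that links the daily total to the peak target:

```python
SECONDS_PER_DAY = 86_400
requests_per_day = 1_000_000_000
edge_locations = 50
miss_rate = 0.05  # cache hit ratio target > 95%

avg_rps = requests_per_day / SECONDS_PER_DAY  # ~11,574 RPS average
peak_rps = avg_rps * 5                        # 5x burst headroom -> ~57,870 RPS
per_edge_peak = peak_rps / edge_locations     # ~1,157 RPS per edge at peak
origin_facing = peak_rps * miss_rate          # ~2,894 RPS of misses for shields to absorb
```

The last line is the argument for origin shielding: even at a 95% hit ratio, thousands of requests per second would otherwise reach customer origins.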

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
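One way to make per-hop latency budgets defensible is to write them down and check they sum under the SLO. Every millisecond figure below is an illustrative assumption, not a measurement:

```python
# Illustrative p95 latency budgets in milliseconds (assumed values, for the method only).
hit_path = {"dns": 10, "tls_handshake": 15, "edge_cache_lookup": 5}
miss_path = {**hit_path, "edge_to_shield": 30, "shield_to_origin": 120, "cache_fill": 20}

hit_p95_budget = sum(hit_path.values())    # edge hits should land well under ~50 ms
miss_p95_budget = sum(miss_path.values())  # miss path defended against a ~250 ms SLO
```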

3) Architecture Decisions

  • CDN: Serve static and cacheable content from edge and keep origin strictly for misses and dynamic requests.
  • Caching: Put cache on hot read paths first and pick cache-aside or write-through explicitly.
  • Storage: Use object storage for large blobs and keep metadata/authorization separate in the API tier.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.
  • Geo Distribution: Route users to nearest region/edge while keeping write-consistency boundaries explicit.
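The cache-aside choice named above can be sketched minimally; a dict stands in for the edge cache and `fetch_from_origin` is a hypothetical callable:

```python
edge_cache = {}

def get_asset(url, fetch_from_origin):
    """Cache-aside read: serve from cache if present, otherwise fetch and populate."""
    if url in edge_cache:
        return edge_cache[url]
    body = fetch_from_origin(url)  # only misses reach the origin
    edge_cache[url] = body
    return body
```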

4) Reliability and Failure Strategy

  • Define cache keys and purge workflows before launch to avoid stale/global outages.
  • Bound staleness with TTL + invalidation hooks for critical entities.
  • Enforce lifecycle policies, retention tiers, and checksum validation.
  • Alert on user-impact SLOs, not only infrastructure metrics.
  • Design region failover and data residency controls as first-class requirements.
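Bounding staleness with a TTL plus an explicit invalidation hook can be sketched as below (a toy in-process version; real edges persist entries and receive purges over pub/sub):

```python
import time

class TTLCache:
    """Entries expire after ttl_seconds; purge() invalidates immediately."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:  # TTL bounds worst-case staleness
            del self.store[key]
            return None
        return value

    def purge(self, key):
        """Invalidation hook: a purge message removes the entry right away."""
        self.store.pop(key, None)
```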

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
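The idempotency check for retried async consumers can be verified with a sketch like this (an in-memory set stands in for a durable dedupe store):

```python
processed_ids = set()  # in production: a durable store with a retention window

def handle_purge(request_id, apply_purge):
    """Idempotent consumer: a redelivered message with a seen ID is a no-op."""
    if request_id in processed_ids:
        return "duplicate"
    apply_purge()
    processed_ids.add(request_id)
    return "applied"
```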

6) Trade-offs to Call Out in Interviews

  • CDN: Long TTL improves latency/cost; short TTL improves freshness.
  • Caching: Higher hit rate cuts latency/cost, but stale data and invalidation bugs become primary risks.
  • Storage: Object storage is cheap and durable, but random low-latency reads are weaker than databases/caches.
  • Monitoring: Deep observability speeds incident response but raises ingestion and tooling costs.
  • Geo Distribution: Global latency improves, but cross-region consistency and operations become harder.

Practical Notes

  • Cache key = HTTP method + URL + Vary headers. Use consistent hashing to distribute content across cache servers at each edge.
  • Origin shielding: an edge miss goes to the nearest shield POP. Only the shield talks to the customer's origin - this collapses thundering herds.
  • Purge: publish a purge message to a pub/sub system (Kafka, Redis Streams). All edges subscribe and invalidate matching keys.
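The cache-key and consistent-hashing notes above can be sketched as follows (MD5 over a virtual-node ring; the exact key layout is illustrative):

```python
import hashlib

def cache_key(method, url, vary, request_headers):
    """Key = method + URL + the request values of each header named by Vary."""
    headers = {k.lower(): v for k, v in request_headers.items()}
    vary_part = "|".join(f"{h}={headers.get(h, '')}"
                         for h in sorted(v.lower() for v in vary))
    return f"{method}:{url}:{vary_part}"

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def pick_server(key, servers, vnodes=100):
    """Consistent hashing: walk the ring to the first point at or past the key."""
    ring = sorted((_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
    h = _hash(key)
    for point, server in ring:
        if h <= point:
            return server
    return ring[0][1]  # wrap around the ring
```

Because the ring uses virtual nodes per server, adding or removing one cache server only remaps the keys that hashed to its ring segments rather than reshuffling everything.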

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> DNS -> CDN Edge -> Load Balancer -> API Service -> Primary SQL DB -> Read Model DB -> Redis Cache

Design strengths

  • Cache sits on the read path to absorb repeated queries and keep DB pressure stable.
  • Monitoring and logs are wired in from day one for rapid incident triage.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.