Medium, Intermediate

Image Optimization Service

Storage, CDN, API Design, Caching, Media Processing

Problem Statement

ImgFast is building an image optimization service (like Cloudinary / imgix). Website owners route their image URLs through ImgFast, which transforms the images on the fly. Features:

- On-the-fly transformations - resize, crop, rotate, blur, watermark, and change format via URL parameters (e.g., `/img/photo.jpg?w=400&h=300&format=webp&quality=80`).
- Format auto-detection - serve AVIF or WebP to browsers whose Accept header advertises support for them, and JPEG to all others.
- Responsive images - generate srcset variants automatically for different screen sizes.
- CDN caching - cache transformed images at edge PoPs. The same transformation should never be computed twice.
- Origin pull - fetch the original image from the customer's storage (S3, GCS, or any HTTP URL) on first request.
- Smart cropping - AI-based face/subject detection to center the crop on the important part of the image.
- Usage analytics - bandwidth saved, transformations per day, and cache hit ratio per customer.
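
The Accept-based format selection in the feature list can be sketched as below. This is a minimal illustration, not a complete content-negotiation implementation: the function name `negotiate_format` is hypothetical, the AVIF > WebP > JPEG preference order is an assumption, and q-values are ignored for brevity.

```python
def negotiate_format(accept_header: str) -> str:
    """Pick an output format from an HTTP Accept header.

    Assumed preference order: AVIF, then WebP, then JPEG as the fallback.
    Simplification: q-values (e.g. "image/webp;q=0") are not interpreted.
    """
    # Split "image/avif,image/webp,*/*;q=0.8" into bare media types.
    accepted = {
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
        if part.strip()
    }
    if "image/avif" in accepted:
        return "avif"
    if "image/webp" in accepted:
        return "webp"
    # Wildcards such as */* or image/* do not prove modern-format support.
    return "jpeg"
```

Because the response body then depends on the request's Accept header, any shared cache in front of this logic needs to account for that, typically with a `Vary: Accept` response header or by including the chosen format in the cache key.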

Target scale: 50 million image requests per day across 10 million unique transformations.

What You'll Learn

Design an image optimization CDN that resizes, compresses, and converts images on-the-fly for web performance. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Requests/day: ~50,000,000
Unique transformations/day: ~10,000,000
Transformation latency (cache miss): < 1 second
Cache hit ratio: > 90%
Edge PoPs: ~30
Max original image size: 50 MB
Availability target: 99.99%

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design an image optimization CDN that resizes, compresses, and converts images on-the-fly for web performance.
  • 50,000,000 requests/day averages about 579 RPS; design for a peak of about 2,894 RPS (roughly 5× the average, including burst headroom).
  • Requests/day: ~50,000,000
  • Unique transformations/day: ~10,000,000
  • Transformation latency (cache miss): < 1 second
  • Cache hit ratio: > 90%
  • Edge PoPs: ~30
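
The numbers above can be turned into request rates with a few lines of arithmetic. The sketch below is illustrative: the function name `capacity_summary` is hypothetical, and the 5× peak factor is an assumption chosen because it reproduces the 2,894 RPS peak target from the daily average.

```python
SECONDS_PER_DAY = 86_400


def capacity_summary(requests_per_day: int,
                     unique_transforms_per_day: int,
                     target_hit_ratio: float,
                     peak_factor: float) -> dict:
    """Derive average/peak request rates and the transform (miss) rate."""
    avg_rps = requests_per_day / SECONDS_PER_DAY
    # Misses are the requests that must reach a transform worker.
    avg_miss_rps = avg_rps * (1 - target_hit_ratio)
    return {
        "avg_rps": avg_rps,
        "peak_rps": avg_rps * peak_factor,
        # Best possible hit ratio if every unique transformation seen
        # today had to be computed today (nothing cached from earlier days).
        "cold_day_hit_ceiling": 1 - unique_transforms_per_day / requests_per_day,
        "avg_miss_rps": avg_miss_rps,
        "peak_miss_rps": avg_miss_rps * peak_factor,
    }


summary = capacity_summary(50_000_000, 10_000_000,
                           target_hit_ratio=0.90, peak_factor=5)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

One thing this surfaces: with 10 million unique transformations in 50 million requests, a cache that starts empty each day cannot exceed an 80% hit ratio. Under that reading, the > 90% target implies that a large share of transformations must stay cached across days, which argues for long TTLs and a durable cache tier.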

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
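
A per-hop latency budget can be written down and checked mechanically. In the sketch below, the hop names and millisecond values are hypothetical placeholders rather than measurements; only the 1-second cache-miss limit comes from the constraints. Adding per-hop p95 values is a simplification, since the p95 of an end-to-end path is not exactly the sum of per-hop p95s.

```python
MISS_SLO_MS = 1000  # constraint: transformation latency on cache miss < 1 second

# Hypothetical per-hop p95 budgets in milliseconds (placeholders, not measured).
HOP_BUDGETS_MS = {
    "edge_lookup": 5,
    "shield_lookup": 20,
    "origin_fetch": 400,
    "decode_and_transform": 300,
    "encode": 150,
    "response_to_edge": 50,
}


def budget_headroom_ms(budgets: dict, slo_ms: int) -> int:
    """Return the unallocated milliseconds, or raise if the budget overruns."""
    total = sum(budgets.values())
    if total > slo_ms:
        raise ValueError(f"hop budgets total {total} ms, over the {slo_ms} ms SLO")
    return slo_ms - total


print("headroom (ms):", budget_headroom_ms(HOP_BUDGETS_MS, MISS_SLO_MS))
```

Keeping the table in code or config makes it easy to fail a review or a CI check when a new hop is added without trimming another.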

3) Architecture Decisions

  • Storage: Use object storage for large blobs and keep metadata/authorization separate in the API tier.
  • CDN: Serve static and cacheable content from edge and keep origin strictly for misses and dynamic requests.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
  • Caching: Put cache on hot read paths first and pick cache-aside or write-through explicitly.
  • Media Processing: Split ingest, transform, and delivery into independent stages with async orchestration.
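
The cache-aside option mentioned in the list above can be sketched as a lookup over ordered tiers that falls back to computing the transform on a miss. This is a toy model under stated assumptions: the class `TieredCache` and function `get_or_transform` are hypothetical names, and plain dictionaries stand in for what would be memory, SSD, and object-store tiers in a real deployment.

```python
from typing import Callable, Dict, List, Optional


class TieredCache:
    """Ordered cache tiers, fastest first. Dicts stand in for real stores."""

    def __init__(self, tier_count: int = 3) -> None:
        self.tiers: List[Dict[str, bytes]] = [{} for _ in range(tier_count)]

    def get(self, key: str) -> Optional[bytes]:
        for depth, tier in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                # Promote the entry into every faster tier it was missing from.
                for faster in self.tiers[:depth]:
                    faster[key] = value
                return value
        return None

    def put(self, key: str, value: bytes) -> None:
        for tier in self.tiers:
            tier[key] = value


def get_or_transform(cache: TieredCache, key: str,
                     transform: Callable[[], bytes]) -> bytes:
    """Cache-aside: read the cache first, compute and store only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = transform()
    cache.put(key, result)
    return result
```

A real implementation would also need eviction in the faster tiers and size limits per entry; those are left out here to keep the control flow visible.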

4) Reliability and Failure Strategy

  • Enforce lifecycle policies, retention tiers, and checksum validation.
  • Define cache keys and purge workflows before launch to avoid stale/global outages.
  • Apply strict input validation and backward-compatible versioning.
  • Bound staleness with TTL + invalidation hooks for critical entities.
  • Store original media durably and make transforms replayable.
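
Defining cache keys up front mostly means deciding what gets normalized before hashing. The sketch below shows one possible scheme; the function name `cache_key` is hypothetical, and keying on the negotiated output format (rather than the raw Accept header) is an assumption made here to keep equivalent requests on the same key.

```python
import hashlib
from urllib.parse import urlencode


def cache_key(origin_url: str, params: dict, output_format: str) -> str:
    """Build a deterministic key for one (image, transformation, format)."""
    # Lowercase parameter names and sort them so that ?w=400&h=300 and
    # ?h=300&w=400 produce the same key.
    canonical_params = urlencode(
        sorted((name.lower(), str(value)) for name, value in params.items())
    )
    # Newlines separate the fields so adjacent values cannot run together.
    material = "\n".join([origin_url, canonical_params, output_format.lower()])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


key_a = cache_key("https://example.com/photo.jpg", {"w": 400, "h": 300}, "webp")
key_b = cache_key("https://example.com/photo.jpg", {"h": 300, "W": 400}, "webp")
print(key_a == key_b)
```

A purge workflow then needs a way to find every key derived from one origin URL, for example by storing the origin URL alongside each entry as metadata, since the hash itself cannot be reversed.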

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
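
The three SLO signals named above can be computed from a load-test window with very little code. The sketch below uses the nearest-rank percentile method; the function names and the shape of the input (a list of latencies for successful requests plus a failure count) are assumptions for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(success_latencies_ms: Sequence[float],
               failed_requests: int,
               window_seconds: float) -> dict:
    """Summarize p95 latency, error rate, and successful throughput for a window."""
    total = len(success_latencies_ms) + failed_requests
    return {
        "p95_ms": percentile(success_latencies_ms, 95),
        "error_rate": failed_requests / total,
        "successful_rps": len(success_latencies_ms) / window_seconds,
    }
```

For an image service it is usually worth producing this report separately for cache hits and cache misses, since the < 1 second target applies only to the miss path.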

6) Trade-offs to Call Out in Interviews

  • Storage: Object storage is cheap and durable, but random low-latency reads are weaker than databases/caches.
  • CDN: Long TTL improves latency/cost; short TTL improves freshness.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.
  • Caching: Higher hit rate cuts latency/cost, but stale data and invalidation bugs become primary risks.
  • Media Processing: Pre-processing improves playback UX, but requires substantial compute/storage budget.

Practical Notes

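  • Cache key = hash(origin_url + normalized transformation_params + negotiated output format). Key on the format chosen from the Accept header rather than the raw header value, because raw Accept strings differ across browsers and would fragment the cache. Store transformed images in a tiered cache (memory → SSD → object store).
  • Process images with libvips, a fast C library that is reported to be up to about 10× faster than ImageMagick, depending on the operation.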
  • Origin shielding: edge miss → shield cache → worker (transform) → origin. Shield collapses duplicate requests for the same transformation.
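
The request collapsing done at the shield can be sketched with a small "single-flight" helper: the first caller for a key runs the transform, and concurrent callers for the same key wait for that result instead of repeating the work. This is a thread-based illustration under stated assumptions; the names `SingleFlight` and `_Call` are hypothetical, and a production shield would more likely do this inside an async proxy.

```python
import threading
from typing import Callable, Dict


class _Call:
    """State shared between the leader and the waiters for one key."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], bytes]) -> bytes:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = fn()
            except BaseException as exc:
                # Waiters receive the same failure instead of hanging.
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

Collapsing only covers requests that overlap in time; later requests for the same key should be served from the cache that the leader populated.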

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> DNS -> CDN Edge -> Load Balancer -> API Gateway -> API Service -> Primary SQL DB -> Redis Cache

Design strengths

  • Cache sits on the read path to absorb repeated queries and keep DB pressure stable.
  • Media processing is handled by background workers so user-facing latency stays low.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.