Hard · Enterprise

Design Amazon

Databases · Caching · Microservices · Search · Message Queues · Sharding

Problem Statement

Design the architecture for Amazon's e-commerce platform - the world's largest online marketplace with 300+ million active customers, 12 million products (first-party), and 350+ million products including third-party sellers. Your design must cover:

- Product catalog - a massive, hierarchical catalog with categories, filters, variants (size, color), pricing, inventory, and seller information. The catalog is updated millions of times per day by sellers.
- Search & browse - product search with filters, faceted navigation, spell correction, and "did you mean" suggestions. Search must handle 50,000 queries per second at peak. Results are ranked by relevance, price, reviews, and sponsored placement.
- Shopping cart - a highly available cart service. The cart must survive server failures - users should never lose items from their cart. (Amazon's Dynamo paper was born from this requirement.)
- Checkout & payment - a multi-step checkout (address → shipping → payment → confirm) that processes thousands of orders per second. Inventory must be reserved atomically to prevent overselling.
- Recommendation engine - "Customers who bought X also bought Y", personalized home page, and "frequently bought together." Powers ~35% of Amazon's revenue.
- Order fulfillment - once an order is placed, route it to the optimal warehouse, generate a pick list, and hand off to shipping. Multiple items in one order may ship from different warehouses.
- Reviews & ratings - user-generated reviews with abuse detection, helpful votes, and verified purchase badges.

This challenge tests your ability to decompose a massive monolith into microservices with clear ownership boundaries.

What You'll Learn

Design Amazon's e-commerce platform - product catalog, cart, checkout, recommendations, and warehouse fulfillment at 300M+ users. Build this architecture under realistic production constraints, then validate trade-offs in the design lab simulation.


Constraints

- Active customers: 300,000,000+
- Products (total): 350,000,000+
- Peak orders/second: ~10,000
- Search QPS (peak): ~50,000
- Search latency (P99): < 300 ms
- Cart availability: 99.999%
- Warehouses: 200+
- Availability target: 99.99%

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design Amazon's e-commerce platform - product catalog, cart, checkout, recommendations, and warehouse fulfillment at 300M+ users.
  • Design the order path for a peak load target around 20,000 RPS - roughly 2x burst headroom over the ~10,000 peak orders/second constraint.
  • Active customers: 300,000,000+
  • Products (total): 350,000,000+
  • Peak orders/second: ~10,000
  • Search QPS (peak): ~50,000
  • Search latency (P99): < 300 ms

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so tail latency (p95/p99) can be defended in review.
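The capacity method above can be made concrete with a short back-of-envelope sketch. The margin and the per-hop split are assumptions for illustration, not figures from the problem statement:

```python
# Back-of-envelope capacity sketch. Peak numbers come from the
# constraints table; the safety margin and per-hop latency split
# are illustrative assumptions.

PEAK_SEARCH_QPS = 50_000
SAFETY_MARGIN = 2.5  # within the suggested 2-3x headroom per tier


def provisioned_qps(peak: float, margin: float = SAFETY_MARGIN) -> float:
    """Capacity to provision for a tier, given the observed peak."""
    return peak * margin


def latency_budget_ms(total_ms: int, hops: dict[str, int]) -> dict[str, int]:
    """Split an end-to-end P99 budget across hops; fail if overcommitted."""
    spent = sum(hops.values())
    if spent > total_ms:
        raise ValueError(f"budget overcommitted by {spent - total_ms} ms")
    return {**hops, "slack": total_ms - spent}


# Search tier: 50,000 QPS peak -> provision for 125,000 QPS.
search_capacity = provisioned_qps(PEAK_SEARCH_QPS)

# 300 ms P99 search budget split across hypothetical hops.
budget = latency_budget_ms(
    300, {"gateway": 20, "search_svc": 50, "index_query": 150, "rerank": 50}
)
```

Defending an explicit per-hop budget like this is what lets you answer "where does the 300 ms go?" in review.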

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Caching: Put cache on hot read paths first and pick cache-aside or write-through explicitly.
  • Microservices: Split services by business boundary, not by technical layer, and enforce ownership per domain.
  • Search: Use primary store for writes and async index updates for search relevance + scale.
  • Message Queues: Move non-blocking and retry-heavy work to async consumers with explicit retry and DLQ policies.
  • Sharding: Choose shard keys around access patterns and growth hotspots, not just data size.

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Bound staleness with TTL + invalidation hooks for critical entities.
  • Add service-level timeout/retry budgets and contract tests.
  • Track indexing lag and support reindex from source of truth.
  • Guarantee idempotent consumers and trace every message with correlation IDs.
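The idempotent-consumer and correlation-ID points can be sketched together. The message shape (`message_id`, `correlation_id`, `payload`) is an assumption, and the in-memory set stands in for a durable dedupe store:

```python
# Idempotent consumer sketch. Assumed message shape:
# {"message_id": ..., "correlation_id": ..., "payload": ...}.
# The in-memory set stands in for a durable dedupe store.

processed: set[str] = set()


def handle(message: dict, apply_effect, log=print) -> bool:
    """Apply a message's effect at most once per message_id.

    Returns True if the effect was applied, False on a duplicate delivery.
    """
    msg_id = message["message_id"]
    corr = message.get("correlation_id", msg_id)
    log(f"[corr={corr}] received {msg_id}")       # trace via correlation ID
    if msg_id in processed:                       # redelivery -> no-op
        log(f"[corr={corr}] duplicate {msg_id}, skipping")
        return False
    apply_effect(message["payload"])              # the actual side effect
    processed.add(msg_id)                         # mark done after the effect
    return True
```

Marking the message done only after the effect commits means a crash in between causes a redelivery rather than a lost effect, which is why the effect itself must be safe to repeat.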

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Caching: Higher hit rate cuts latency/cost, but stale data and invalidation bugs become primary risks.
  • Microservices: Independent deployability improves scale but increases operational/debug complexity.
  • Search: Search index gives rich querying but introduces eventual consistency and index ops overhead.
  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.

Practical Notes

  • Decompose into microservices: Catalog Service, Cart Service, Order Service, Payment Service, Inventory Service, Search Service, Recommendation Service, Fulfillment Service.
  • Cart: use a Dynamo-style AP system (always-writeable) - merge conflicts with 'union' strategy (no item is ever silently lost).
  • Inventory: two-phase approach - soft-reserve on 'add to cart', hard-reserve on checkout. Use pessimistic locking for the final decrement.
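The cart's union merge can be shown in a few lines. This sketch covers only the merge policy; real Dynamo-style systems also need version metadata (vector clocks) to detect that two replicas diverged in the first place:

```python
# Dynamo-style cart merge sketch: on replica divergence, take the
# union of item IDs and resolve conflicting quantities with max,
# so no item is ever silently dropped. Version detection (vector
# clocks) is out of scope here.


def merge_carts(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Merge two divergent cart replicas (item_id -> quantity)."""
    merged = dict(a)
    for item, qty in b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged
```

The known cost of union merge is that deleted items can resurrect after a partition heals; the design accepts that anomaly because "item reappears" is a far better failure mode for a cart than "item vanishes."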


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> Core Service -> Primary NoSQL DB -> Redis Cache -> Message Queue -> Background Workers

Design strengths

  • Cache sits on the read path to absorb repeated queries and keep DB pressure stable.
  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.