Medium · Intermediate

Food Delivery Backend

Databases · WebSockets · Message Queues · Caching · API Design

Problem Statement

QuickBite is building a food delivery platform (like DoorDash / Deliveroo) operating in a single metro area. The system needs:

- Restaurant catalog - list of 5,000 restaurants with menus, prices, photos, hours, and ratings. Filterable by cuisine, price range, delivery time, and rating.
- Ordering - users browse menus, add items to cart, customize orders (extra cheese, no onions), and check out. The order is sent to the restaurant and a delivery driver simultaneously.
- Driver matching - assign the nearest available driver to pick up the order. Drivers receive the assignment on their app with restaurant and customer details.
- Real-time tracking - after pickup, the customer sees the driver's live location on a map, with an updated ETA.
- ETA prediction - estimate delivery time based on food preparation time (per restaurant), distance, and current traffic.
- Reviews & ratings - rate the restaurant (food quality) and driver (delivery experience) separately.
- Promotions - coupon codes, "free delivery" campaigns, and loyalty rewards.
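The ETA-prediction requirement can be made concrete with a small model. This is a minimal sketch, not the platform's actual formula: the function name, the average-speed default, and the traffic multiplier are all assumptions; it treats food prep and the driver's approach leg as overlapping, with the slower one dominating.

```python
# Hypothetical ETA model (names and defaults are illustrative assumptions):
# prep runs in parallel with the driver's trip to the restaurant, then the
# restaurant -> customer leg is scaled by a traffic factor.
def estimate_eta_minutes(prep_min: float,
                         driver_to_restaurant_km: float,
                         restaurant_to_customer_km: float,
                         avg_speed_kmh: float = 20.0,
                         traffic_multiplier: float = 1.0) -> float:
    to_restaurant = driver_to_restaurant_km / avg_speed_kmh * 60 * traffic_multiplier
    # Food prep and the driver's approach overlap; the slower one dominates.
    wait_at_restaurant = max(prep_min, to_restaurant)
    to_customer = restaurant_to_customer_km / avg_speed_kmh * 60 * traffic_multiplier
    return wait_at_restaurant + to_customer

# e.g. 15 min prep, 2 km pickup leg, 3 km delivery leg, moderate traffic
eta = estimate_eta_minutes(15, 2.0, 3.0, avg_speed_kmh=20, traffic_multiplier=1.2)
```

A production system would replace the speed and traffic constants with per-restaurant prep-time history and a live traffic feed, but the max-of-two-legs structure is the key insight to state in an interview.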

Targeting 100,000 orders per day with 5,000 active drivers at peak.

What You'll Learn

Design a food delivery platform with restaurant listings, ordering, real-time driver tracking, and ETA estimates. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Daily orders: ~100,000
Active drivers (peak): ~5,000
Restaurants: ~5,000
Order placement latency: < 2 seconds
Driver match time: < 30 seconds
Location update frequency: every 5 seconds
Availability target: 99.9%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a food delivery platform with restaurant listings, ordering, real-time driver tracking, and ETA estimates.
  • Design order placement for roughly 100 RPS at peak (including burst headroom); note that driver location updates are a separate, higher-volume stream of ~1,000 messages/sec at peak (5,000 drivers reporting every 5 seconds).
  • Daily orders: ~100,000
  • Active drivers (peak): ~5,000
  • Restaurants: ~5,000
  • Order placement latency: < 2 seconds
  • Driver match time: < 30 seconds

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
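The capacity-planning method above reduces to a few lines of arithmetic on the stated constraints. The peak factor and safety margin below are assumptions (the 2-3x margin follows the guideline above); the order and driver numbers come from the constraints table.

```python
# Back-of-envelope capacity math from the stated constraints.
ORDERS_PER_DAY = 100_000
PEAK_FACTOR = 3          # assumed lunch/dinner peak vs. daily average
SAFETY_MARGIN = 2        # per the 2-3x headroom guideline

avg_orders_per_sec = ORDERS_PER_DAY / 86_400             # ~1.2 orders/sec
peak_orders_per_sec = avg_orders_per_sec * PEAK_FACTOR   # ~3.5 orders/sec
provisioned_order_rps = peak_orders_per_sec * SAFETY_MARGIN

# Driver location updates dominate write volume at peak:
DRIVERS_AT_PEAK = 5_000
UPDATE_INTERVAL_S = 5
location_updates_per_sec = DRIVERS_AT_PEAK / UPDATE_INTERVAL_S  # 1,000/sec
```

The takeaway to defend in review: order writes are tiny, but the location-update firehose is three orders of magnitude larger than average order traffic and should be sized (and stored) separately.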

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • WebSockets: Use persistent connection gateways and decouple fanout via pub/sub or queues.
  • Message Queues: Move non-blocking and retry-heavy work to async consumers with explicit retry and DLQ policies.
  • Caching: Put cache on hot read paths first and pick cache-aside or write-through explicitly.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Track connection churn, backpressure, and session resumption behavior.
  • Guarantee idempotent consumers and trace every message with correlation IDs.
  • Bound staleness with TTL + invalidation hooks for critical entities.
  • Apply strict input validation and backward-compatible versioning.
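The idempotent-consumer and retry/DLQ bullets above can be sketched together. This is a simplified, synchronous illustration, not a real broker integration: the handler signature, attempt count, and correlation-ID keying are assumptions.

```python
# Sketch of an async consumer with idempotency, bounded retries, and a DLQ.
MAX_ATTEMPTS = 3

def consume(message: dict, handler, dead_letters: list, processed: set) -> None:
    # Idempotency: skip messages already handled (keyed by correlation ID),
    # so at-least-once delivery cannot cause duplicate side effects.
    if message["correlation_id"] in processed:
        return
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handler(message)
            processed.add(message["correlation_id"])
            return
        except Exception:
            if attempt == MAX_ATTEMPTS:
                dead_letters.append(message)  # park for manual inspection

processed, dlq = set(), []
consume({"correlation_id": "c1", "type": "order_placed"},
        lambda m: None, dlq, processed)       # succeeds first try
```

In a real deployment the `processed` set and DLQ live in durable storage, and retries are spaced with exponential backoff rather than a tight loop.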

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • WebSockets: WebSockets reduce interaction latency but complicate scaling and state management.
  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.
  • Caching: Higher hit rate cuts latency/cost, but stale data and invalidation bugs become primary risks.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.

Practical Notes

  • Separate the customer-facing API, restaurant-facing API, and driver-facing API into distinct services - they have different traffic patterns and SLAs.
  • Use a geospatial index (PostGIS or Redis GEO) for 'nearest driver' queries.
  • Order state machine: placed → accepted (restaurant) → preparing → ready → picked up → en route → delivered. Use events to drive transitions.
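The order state machine above can be sketched as a transition table with explicit validation, so an event arriving out of order (e.g. a duplicate or delayed message) is rejected instead of corrupting order state. State names follow the list above; the function name is illustrative.

```python
# Order state machine: placed -> accepted -> preparing -> ready
#                      -> picked_up -> en_route -> delivered
VALID_TRANSITIONS = {
    "placed":    {"accepted"},
    "accepted":  {"preparing"},
    "preparing": {"ready"},
    "ready":     {"picked_up"},
    "picked_up": {"en_route"},
    "en_route":  {"delivered"},
    "delivered": set(),              # terminal state
}

def transition(current: str, target: str) -> str:
    """Apply an event-driven transition, rejecting illegal jumps."""
    if target not in VALID_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Driving transitions through one validated function (fed by queue events) makes duplicate and out-of-order events safe to discard and gives every state change a single audit point.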


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Primary NoSQL DB -> Redis Cache -> Message Queue -> Background Workers

Design strengths

  • Cache sits on the read path to absorb repeated queries and keep DB pressure stable.
  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.
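The cache-on-the-read-path strength is the cache-aside pattern. A minimal sketch, with plain dicts standing in for Redis and the primary DB; restaurant lookups are used as the example because the catalog is the hottest read path here.

```python
# Cache-aside read path: check cache, fall back to the DB, populate on miss.
def get_restaurant(restaurant_id: str, cache: dict, db: dict) -> dict:
    if restaurant_id in cache:
        return cache[restaurant_id]      # hit: DB is never touched
    record = db[restaurant_id]           # miss: read the system of record
    cache[restaurant_id] = record        # populate for subsequent readers
    return record

db = {"r1": {"name": "Pizza Palace", "rating": 4.6}}
cache: dict = {}
first = get_restaurant("r1", cache, db)   # miss, hits DB
second = get_restaurant("r1", cache, db)  # hit, served from cache
```

In production the cache entry carries a TTL and is invalidated when the restaurant updates its menu, which bounds staleness per the reliability notes above.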

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.