Notification System · Part 1 (Medium)

Notification System - Multi-Channel Delivery

Tags: Message Queues · Databases · API Design · Notifications · Monitoring

Problem Statement

PingHub is building a centralized notification service used by all product teams at a large tech company. Instead of each team building their own email/push/SMS integration, PingHub provides a single API:

`POST /notify { userId, channel, template, data }`

Core requirements:

- Multi-channel delivery - support push notifications (iOS/Android), email, SMS, and in-app (WebSocket). Each user has channel preferences (e.g., "send me emails but not SMS").
- Template engine - notifications are rendered from templates with variable substitution ("Hi {{name}}, your order #{{orderId}} shipped!").
- Delivery guarantees - at-least-once delivery with deduplication. Track delivery status (sent, delivered, opened, failed) per message.
- Rate limiting & batching - don't spam users. Aggregate multiple notifications of the same type into a digest (e.g., "You have 5 new followers" instead of 5 separate notifications).
- Priority levels - critical notifications (security alerts, OTP codes) bypass batching and are delivered immediately.
- Observability - real-time dashboards showing delivery rates, failure rates, and latency per channel.
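The template engine above can be sketched minimally in Python: substitute `{{var}}` placeholders from the request's `data` payload, and fail fast on a missing variable rather than sending a broken message. The regex-based approach here is an illustrative assumption, not the service's actual implementation.

```python
import re

def render(template: str, data: dict) -> str:
    """Substitute {{var}} placeholders from data.

    Raises KeyError on a missing variable so a bad payload is
    rejected at render time instead of reaching the user.
    """
    def repl(match: re.Match) -> str:
        key = match.group(1)
        if key not in data:
            raise KeyError(f"missing template variable: {key}")
        return str(data[key])

    return re.sub(r"\{\{(\w+)\}\}", repl, template)

print(render("Hi {{name}}, your order #{{orderId}} shipped!",
             {"name": "Ada", "orderId": 1042}))
# Hi Ada, your order #1042 shipped!
```

In production the ~500 templates would live in a versioned store and be cached in the rendering workers, but the substitution contract is the same.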

PingHub handles 100 million notifications per day across all channels.

What You'll Learn

Build a platform-wide notification service delivering 100M messages/day across push, email, SMS, and in-app. You'll design this architecture under realistic production constraints, then validate the tradeoffs in the design lab simulation.


Constraints

Daily notifications: ~100,000,000
Channels: 4 (push, email, SMS, in-app)
Critical delivery latency: < 5 seconds
Normal delivery latency: < 5 minutes
Digest window: 15 minutes
Delivery tracking: per-message status
Template count: ~500
Availability target: 99.95%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Build a platform-wide notification service delivering 100 M messages/day across push, email, SMS, and in-app.
  • Design for a peak load target around 5,787 RPS (~1,157 RPS average from 100M/day, with ~5x burst headroom).
  • Daily notifications: ~100,000,000
  • Channels: 4 (push, email, SMS, in-app)
  • Critical delivery latency: < 5 seconds
  • Normal delivery latency: < 5 minutes
  • Digest window: 15 minutes

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
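The capacity arithmetic above is quick to show on a whiteboard. The per-record size below is an assumed figure for illustration, not from the problem statement:

```python
DAILY = 100_000_000          # notifications/day (from constraints)
SECONDS_PER_DAY = 86_400

avg_rps = DAILY / SECONDS_PER_DAY   # ~1,157 RPS sustained
peak_rps = avg_rps * 5              # 5x burst headroom -> ~5,787 RPS

# Storage growth, assuming ~200 bytes per delivery-status record
bytes_per_day = DAILY * 200         # ~20 GB/day of tracking data

print(f"avg={avg_rps:.0f} rps, peak={peak_rps:.0f} rps, "
      f"tracking={bytes_per_day / 1e9:.0f} GB/day")
```

Deriving the numbers live, rather than quoting them, is what lets you defend the 2-3x per-tier safety margins in review.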

3) Architecture Decisions

  • Message Queues: Move non-blocking and retry-heavy work to async consumers with explicit retry and DLQ policies.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
  • Notifications: Model notifications as event-driven fanout with per-channel workers (email/push/webhook).
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.
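The fanout decision above can be sketched as a small router: check the user's channel preferences, then place the notification on a fast lane (critical) or a digest lane (normal). The preference store and the in-memory lanes are stand-ins for a real preferences service and real queues:

```python
from dataclasses import dataclass

@dataclass
class Notification:
    user_id: str
    channel: str      # "push" | "email" | "sms" | "in_app"
    priority: str     # "critical" | "normal"
    body: str

# Hypothetical stand-ins: a preference lookup and two queue tiers.
PREFS = {"u1": {"push", "email"}}
FAST_LANE: list = []    # critical: bypasses batching, < 5 s target
DIGEST_LANE: list = []  # normal: aggregated over the 15-minute window

def route(n: Notification) -> str:
    """Drop opted-out channels; otherwise pick a queue tier by priority."""
    if n.channel not in PREFS.get(n.user_id, set()):
        return "dropped"
    if n.priority == "critical":
        FAST_LANE.append(n)
        return "fast"
    DIGEST_LANE.append(n)
    return "digest"
```

Keeping the routing decision in one place means priority and preference policy changes don't touch the per-channel workers.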

4) Reliability and Failure Strategy

  • Guarantee idempotent consumers and trace every message with correlation IDs.
  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Apply strict input validation and backward-compatible versioning.
  • Track delivery state machine and dead-letter undeliverable events.
  • Alert on user-impact SLOs, not only infrastructure metrics.
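The delivery state machine mentioned above can be made explicit as a transition table, so illegal status updates (e.g. an "opened" webhook for a message never sent) are rejected rather than silently recorded. The exact transition set is an assumption based on the statuses in the problem statement:

```python
# Allowed transitions for per-message delivery tracking.
TRANSITIONS = {
    "queued":    {"sent", "failed"},
    "sent":      {"delivered", "failed"},
    "delivered": {"opened"},
    "failed":    {"queued"},   # retry; exhausted retries go to the DLQ
}

def advance(state: str, event: str) -> str:
    """Apply a status event, rejecting any transition not in the table."""
    if event not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {event}")
    return event
```

Storing the state with a conditional write (compare-and-set on the current status) keeps the machine consistent even when channel-provider callbacks arrive out of order.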

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.
  • Notifications: Multi-channel coverage increases reach but adds per-channel failure modes and policy complexity.
  • Monitoring: Deep observability speeds incident response but raises ingestion and tooling costs.

Practical Notes

  • Use a priority queue with at least two tiers - critical notifications skip batching and go to the fast lane.
  • A fan-out service reads user preferences and routes each notification to the appropriate channel worker.
  • Deduplication via a short-lived cache of (userId + notificationType + timeWindow) keys prevents spam.
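The deduplication idea above reduces to a set-if-absent check on a windowed key. A minimal in-process sketch, standing in for what would normally be a Redis `SET key NX EX ttl` call:

```python
import time

class DedupCache:
    """Suppress repeats of (user, type) within a time window.

    In production this would be a shared TTL cache (e.g. Redis),
    not per-process memory.
    """
    def __init__(self, window_s: float):
        self.window_s = window_s
        self._seen: dict[str, float] = {}

    def should_send(self, user_id: str, ntype: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        bucket = int(now // self.window_s)       # windowed key component
        key = f"{user_id}:{ntype}:{bucket}"
        if key in self._seen:
            return False                         # duplicate within window
        self._seen[key] = now
        return True
```

The window here should match the digest window (15 minutes = 900 s) so suppressed duplicates are folded into the next digest rather than lost.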


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Primary NoSQL DB -> Message Queue -> Background Workers -> Notification Fanout

Design strengths

  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.
  • Monitoring and logs are wired in from day one for rapid incident triage.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.