Case Study: Design a Notification System

A notification system delivers timely messages to users across multiple channels: push notifications (mobile/web), SMS, email, and in-app alerts. At scale, companies like Facebook send billions of notifications per day. This case study walks through designing a reliable, multi-channel notification platform that handles prioritization, user preferences, rate limiting, and delivery guarantees.

Step 1: Clarify Requirements

Functional Requirements

  • Send notifications via multiple channels: push (iOS/Android/Web), SMS, email, and in-app.
  • Users can configure notification preferences per channel and per notification type (e.g., marketing vs. transactional).
  • Support templated messages with dynamic personalization (e.g., "Hi {{name}}, your order {{order_id}} has shipped").
  • Support scheduled/delayed notifications (e.g., send at 9 AM in the user's timezone).
  • Track delivery status: sent, delivered, opened, clicked, failed.
  • Support bulk notifications (e.g., product announcements to all users).
  • Provide an API for internal services to trigger notifications.

Non-Functional Requirements

  • Reliability: No notification should be silently lost. At-least-once delivery with idempotency.
  • Low latency: Transactional notifications (password reset, OTP) must arrive within seconds.
  • Scalability: Handle 10 billion+ notifications per day across all channels.
  • Rate limiting: Prevent notification fatigue. Respect provider rate limits (e.g., APNs throttling).
  • Extensibility: Adding a new channel (e.g., WhatsApp, Slack) should not require rearchitecting the system.

Step 2: Back-of-Envelope Estimates

Estimation

```
Users: 500 million registered users
Daily active: 100 million DAU

Notifications per day:
  Push:   5 per DAU    = 500 million/day
  Email:  0.5 per user = 250 million/day
  SMS:    0.1 per user =  50 million/day
  In-app: 3 per DAU    = 300 million/day
  ----------------------------------------
  Total: ~1.1 billion/day

Throughput:
  1.1B / 86,400 sec = ~12,700 notifications/sec (average)
  Peak (3x average) = ~38,000 notifications/sec

Storage (per notification record):
  ~500 bytes (ID, user_id, channel, template, status, timestamps)
  1.1B * 500B = ~550 GB/day
  Retained 90 days = ~50 TB

Message queue throughput:
  ~38,000 messages/sec at peak
  Kafka or SQS can handle this comfortably
```

Step 3: High-Level Design

```
Internal Services
       |
       v
Notification Service ------> Preference Store (Redis + DB)
(validate + enrich)  ------> Template Engine (personalization)
       |
       v
+---------------------------------------------------+
| Priority Queue (high) | Standard Queue | Bulk Queue|
|                (Kafka / SQS)           |   (low)   |
+---------------------------------------------------+
       |
       v
Dispatch Workers (rate limiter)
       |
       +--> APNs Adapter   (iOS push)
       +--> FCM Adapter    (Android/Web push)
       +--> SMTP Adapter   (email)
       +--> Twilio Adapter (SMS)
       +--> In-App Adapter (WebSocket)
       |
       v
Analytics (delivery-tracking callbacks)
```

The architecture separates concerns into distinct layers: the notification service validates and enriches requests, priority queues order delivery by urgency, dispatch workers enforce rate limits and route to the correct provider adapter, and analytics tracks delivery outcomes via callbacks.

Step 4: Deep Dive

Data Model

The notification system requires three core tables: one for notification records, one for user preferences, and one for device/channel registrations.

Schema

```sql
TABLE notifications (
  id               UUID PRIMARY KEY,
  user_id          BIGINT NOT NULL,
  channel          ENUM('push','sms','email','in_app') NOT NULL,
  priority         ENUM('critical','high','standard','low') DEFAULT 'standard',
  category         VARCHAR(50) NOT NULL,  -- 'otp', 'order_update', 'marketing'
  template_id      VARCHAR(100),
  template_params  JSONB,                 -- {"name":"Alice","order_id":"#1234"}
  rendered_title   TEXT,
  rendered_body    TEXT,
  status           ENUM('queued','sent','delivered','opened','clicked','failed') DEFAULT 'queued',
  retry_count      INT DEFAULT 0,
  scheduled_at     TIMESTAMP,             -- NULL = send immediately
  sent_at          TIMESTAMP,
  delivered_at     TIMESTAMP,
  created_at       TIMESTAMP DEFAULT NOW(),
  idempotency_key  VARCHAR(64) UNIQUE     -- prevents duplicate sends
);

INDEX idx_user_status ON notifications(user_id, status);
INDEX idx_scheduled   ON notifications(scheduled_at) WHERE status = 'queued';
INDEX idx_idempotency ON notifications(idempotency_key);

TABLE user_preferences (
  user_id     BIGINT NOT NULL,
  category    VARCHAR(50) NOT NULL,       -- 'marketing', 'transactional', 'social'
  channel     ENUM('push','sms','email','in_app') NOT NULL,
  enabled     BOOLEAN DEFAULT TRUE,
  quiet_start TIME,                       -- do not disturb start
  quiet_end   TIME,                       -- do not disturb end
  timezone    VARCHAR(40) DEFAULT 'UTC',
  PRIMARY KEY (user_id, category, channel)
);

TABLE device_registrations (
  id            UUID PRIMARY KEY,
  user_id       BIGINT NOT NULL,
  channel       ENUM('apns','fcm','web_push') NOT NULL,
  device_token  TEXT NOT NULL,
  platform      VARCHAR(20),              -- 'ios', 'android', 'web'
  is_active     BOOLEAN DEFAULT TRUE,
  registered_at TIMESTAMP DEFAULT NOW(),
  last_used_at  TIMESTAMP
);

INDEX idx_user_devices ON device_registrations(user_id, is_active);
```
Idempotency Key

The idempotency_key column is critical for at-least-once delivery. If a service retries sending a notification (e.g., due to a timeout), the notification service checks this key and skips duplicates. The key is typically a hash of (user_id + event_type + event_id + channel).
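
As a minimal sketch, the key derivation might look like the following TypeScript (using Node's built-in crypto module; the function name and field order are illustrative, not a fixed convention). A SHA-256 hex digest is exactly 64 characters, which fits the VARCHAR(64) column:

```typescript
import { createHash } from "crypto";

// Derive a deterministic idempotency key from the event's identity.
// The same event always yields the same key, so a retried send maps
// to the same row and the UNIQUE constraint rejects the duplicate.
function makeIdempotencyKey(
  userId: number,
  eventType: string,
  eventId: string,
  channel: string
): string {
  const raw = `${userId}:${eventType}:${eventId}:${channel}`;
  return createHash("sha256").update(raw).digest("hex"); // 64 hex chars
}

const k1 = makeIdempotencyKey(12345, "order_shipped", "#A1B2C3", "push");
const k2 = makeIdempotencyKey(12345, "order_shipped", "#A1B2C3", "push");
```

Including the channel in the key matters: the same event legitimately fans out to push and email, and those two sends must not deduplicate against each other.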

Rate Limiting & Priority Queues

Notification fatigue is a real problem. Sending too many notifications degrades user experience and increases opt-out rates. Rate limiting operates at two levels:

User-Level Rate Limiting

  • Limit total notifications per user per time window (e.g., max 10 push notifications per hour).
  • Implemented in Redis: a fixed-window counter (INCR user:{id}:push:count with a TTL) is the simplest approach; a sorted set of send timestamps gives a true sliding window at the cost of more memory.
  • Critical notifications (OTP, security alerts) bypass user-level limits.
  • Marketing notifications are the first to be dropped when limits are reached.
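
The sliding-window check can be sketched as follows. This in-memory version illustrates the logic; in production the timestamps would live in Redis (e.g. a sorted set per user and channel) so all workers share one view. The class and method names are illustrative:

```typescript
// Sliding-window rate limiter: allow at most `limit` sends per key
// within the trailing `windowMs` milliseconds.
class SlidingWindowLimiter {
  private sends = new Map<string, number[]>();

  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Keep only timestamps still inside the window.
    const recent = (this.sends.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.sends.set(key, recent);
      return false; // over the per-user budget: drop or defer
    }
    recent.push(now);
    this.sends.set(key, recent);
    return true;
  }
}

// e.g. max 10 push notifications per user per hour
const limiter = new SlidingWindowLimiter(10, 60 * 60 * 1000);
```

A dispatch worker would call `limiter.allow("user:42:push")` before sending, bypassing the check entirely for critical notifications.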

Provider-Level Rate Limiting

  • APNs allows ~2,000-4,000 notifications/sec per connection (varies by priority).
  • FCM has per-project limits (~500K messages/sec for large projects).
  • Twilio SMS has per-number throughput limits (1 msg/sec for long codes).
  • Use token bucket rate limiters per provider, with circuit breakers for provider outages.
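
A token bucket for a provider connection can be sketched like this (the capacity and refill rate below are illustrative, e.g. roughly matching a single APNs stream):

```typescript
// Token bucket: refills continuously at `refillPerSec`, capped at
// `capacity`. Each send consumes one token; if none are available,
// the message stays queued instead of hitting the provider.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillPerSec: number,
    now: number = Date.now()
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  tryAcquire(now: number = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // provider budget exhausted for now
  }
}

const apnsBucket = new TokenBucket(2000, 2000); // ~2,000 sends/sec
```

Unlike the per-user sliding window, the bucket allows short bursts up to its capacity while enforcing the sustained rate, which matches how providers typically throttle.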

The queuing system uses three priority tiers to ensure urgent messages are never delayed by bulk sends:

| Priority | Queue | Use Case | Target Latency |
| --- | --- | --- | --- |
| Critical | High-priority queue | OTP, security alerts, password resets | < 5 seconds |
| Standard | Standard queue | Order updates, social interactions, reminders | < 30 seconds |
| Low | Bulk queue | Marketing campaigns, product announcements, digests | < 10 minutes |
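
Routing a notification to the right tier is a simple mapping from its priority field to a queue or topic name (the topic names below are assumptions, not a fixed convention):

```typescript
// Priorities mirror the ENUM in the notifications table.
type Priority = "critical" | "high" | "standard" | "low";

// Map a notification's priority to the queue it should be published to.
function queueFor(priority: Priority): string {
  switch (priority) {
    case "critical":
    case "high":
      return "notifications.high";
    case "low":
      return "notifications.bulk";
    default:
      return "notifications.standard";
  }
}
```

Keeping the tiers as physically separate queues (rather than one queue with a priority field) is what guarantees a backlog of bulk sends can never sit in front of an OTP.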

Delivery Guarantees & Retry Logic

Notifications operate under an at-least-once delivery model. The system must handle failures at every stage: network errors, provider outages, invalid device tokens, and throttling.

1. Dispatch worker pulls a message from the queue and attempts delivery via the appropriate provider adapter.
2. On success, update the notification status to sent. The provider may later send a delivery receipt callback, updating the status to delivered.
3. On transient failure (5xx, timeout, throttling), re-enqueue the message with exponential backoff: retry after 1s, 2s, 4s, 8s, 16s, up to a max of 5 retries.
4. On permanent failure (invalid token, unsubscribed, 4xx), mark as failed and do not retry. For invalid device tokens, mark the device registration as inactive.
5. After max retries are exhausted, move the message to a dead letter queue (DLQ) for manual inspection and alerting.
Retry Logic

```typescript
function getRetryDelay(attempt: number): number {
  // Exponential backoff with jitter
  const baseDelay = 1000;  // 1 second
  const maxDelay = 60000;  // 60 seconds
  const exponential = baseDelay * Math.pow(2, attempt);
  const jitter = Math.random() * 1000;
  return Math.min(exponential + jitter, maxDelay);
}

// Retry schedule (the first retry calls getRetryDelay(0)):
// Attempt 1: ~1-2 seconds
// Attempt 2: ~2-3 seconds
// Attempt 3: ~4-5 seconds
// Attempt 4: ~8-9 seconds
// Attempt 5: ~16-17 seconds
// After 5 failures -> Dead Letter Queue
```
Why Jitter Matters

Without jitter, if a provider goes down and recovers, all retrying workers would hit it simultaneously at the same backoff intervals (the "thundering herd" problem). Adding random jitter spreads retries over time, preventing sudden spikes that could cause the provider to fail again.

Template System & Personalization

Notifications use templates to separate content from logic. A template is defined once and rendered with per-user data at send time.

Template Example

```
// Template definition (stored in template service)
{
  "id": "order_shipped",
  "channels": {
    "push": {
      "title": "Your order is on its way!",
      "body": "Hi {{user.first_name}}, your order {{order.id}} has shipped via {{order.carrier}}. Track it here."
    },
    "email": {
      "subject": "Order {{order.id}} Shipped",
      "html_template": "email/order_shipped.mjml",
      "plain_text": "Hi {{user.first_name}}, your order {{order.id}} shipped..."
    },
    "sms": {
      "body": "{{user.first_name}}, order {{order.id}} shipped. Track: {{order.tracking_url}}"
    }
  },
  "category": "transactional",
  "default_priority": "standard"
}

// API call to trigger notification
POST /v1/notifications
{
  "user_id": 12345,
  "template_id": "order_shipped",
  "params": {
    "order": {
      "id": "#A1B2C3",
      "carrier": "FedEx",
      "tracking_url": "https://track.example.com/A1B2C3"
    }
  },
  "channels": ["push", "email"],   // override defaults
  "idempotency_key": "order_shipped:12345:#A1B2C3"
}
```

The template engine resolves channel-specific content, applies user preferences (checking if the user has enabled this category on each channel), and handles localization by selecting the correct language variant based on the user's locale setting.
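
The substitution step at the heart of the engine is small. The sketch below handles only `{{path.to.value}}` placeholders, with none of the helpers, escaping, or partials a real engine like Handlebars provides:

```typescript
// Replace {{dotted.path}} placeholders with values looked up in params.
// Unresolvable paths render as an empty string rather than throwing.
function render(template: string, params: Record<string, unknown>): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (_, path: string) => {
    const value = path
      .split(".")
      .reduce<any>((obj, key) => obj?.[key], params);
    return value === undefined ? "" : String(value);
  });
}

const body = render(
  "Hi {{user.first_name}}, your order {{order.id}} has shipped via {{order.carrier}}.",
  { user: { first_name: "Alice" }, order: { id: "#A1B2C3", carrier: "FedEx" } }
);
// body: "Hi Alice, your order #A1B2C3 has shipped via FedEx."
```

In practice rendering empty strings for missing params is debatable; many systems instead fail the send and alert, since "Hi , your order shipped" is worse than no notification.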

Analytics: Delivery, Open Rates & Click-Through

The analytics subsystem tracks the full notification lifecycle. Each state transition emits an event to a streaming pipeline (Kafka) for aggregation.

| Metric | How It Is Tracked | Typical Value |
| --- | --- | --- |
| Delivery rate | Provider delivery receipts (APNs, FCM callbacks) | 95-99% |
| Open rate (push) | App reports "notification opened" event via SDK | 5-15% |
| Open rate (email) | Tracking pixel in email body | 15-25% |
| Click-through rate | Redirect through tracking URL before landing page | 2-5% |
| Unsubscribe rate | Unsubscribe link or preference change | < 0.5% |
| Failure rate | Provider error responses, DLQ size | < 1% |
Analytics Pipeline

```
Notification Events Flow:

Dispatch Worker
      |
      | (emit event: sent/failed/retried)
      v
Kafka Topic: "notification.events"
      |
      |---> Real-time dashboard (Flink/Spark Streaming)
      |       - Live delivery rates
      |       - Failure alerts (PagerDuty)
      |
      |---> Batch analytics (ClickHouse / BigQuery)
              - Daily/weekly reports
              - A/B test results for notification copy
              - Channel effectiveness comparison
```

Step 5: Scaling & Optimizations

  • Horizontal scaling of workers: Dispatch workers are stateless. Scale them independently per channel based on queue depth. Use auto-scaling groups that respond to queue lag metrics.
  • Database partitioning: Partition the notifications table by created_at (monthly ranges). Old partitions can be archived to cold storage. User preferences are cached aggressively in Redis since they change infrequently.
  • Multi-region deployment: Deploy notification workers close to provider endpoints (e.g., APNs servers are in the US) to reduce network latency. Use regional queues for in-app notifications served via WebSocket.
  • Batching: Group email notifications into digest messages (e.g., "You have 5 new comments") to reduce volume and improve engagement. Use a time-window aggregator before the email adapter.
  • Provider failover: For SMS, maintain multiple providers (Twilio, Vonage, AWS SNS). If one provider fails or is rate-limited, the adapter automatically routes to a backup provider.
  • Scheduled notification handling: A cron-based scheduler scans for notifications where scheduled_at <= NOW() and enqueues them. Use database-level indexing on scheduled_at for efficient polling.
  • Token management: Periodically clean up invalid device tokens reported by APNs/FCM feedback services. Stale tokens waste capacity and increase failure rates.
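
The provider-failover point above can be sketched as a simple ordered fallback. The `SmsProvider` interface, provider names, and `send()` signature here are assumptions for illustration:

```typescript
interface SmsProvider {
  name: string;
  send(to: string, body: string): Promise<void>;
}

// Try providers in order; on any failure, fall through to the next.
// If every provider fails, rethrow so the message is retried or DLQ'd.
async function sendWithFailover(
  providers: SmsProvider[],
  to: string,
  body: string
): Promise<string> {
  let lastError: unknown;
  for (const p of providers) {
    try {
      await p.send(to, body);
      return p.name; // report which provider handled the message
    } catch (err) {
      lastError = err; // transient or rate-limit error: try the next one
    }
  }
  throw lastError;
}
```

A production version would also track per-provider error rates and skip a provider whose circuit breaker is open, rather than paying a timeout on every message.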

Architecture Summary

| Component | Technology | Purpose |
| --- | --- | --- |
| Notification Service | REST/gRPC API | Validate, enrich, route notifications |
| Template Engine | Handlebars / custom | Render personalized content per channel |
| Preference Store | PostgreSQL + Redis | User opt-in/opt-out, quiet hours, channels |
| Message Queues | Kafka (or SQS) | Decouple ingestion from delivery, priority ordering |
| Dispatch Workers | Stateless consumers | Rate limit, retry, route to provider adapters |
| Provider Adapters | APNs, FCM, SMTP, Twilio | Channel-specific delivery logic |
| Analytics Pipeline | Kafka + ClickHouse | Delivery tracking, open/click rates, alerting |
| Dead Letter Queue | SQS / Kafka DLQ | Capture permanently failed messages for review |

Key Takeaways

  • A notification system is fundamentally a fan-out problem: one event triggers messages across multiple channels to multiple devices. Design for channel independence so each adapter can scale and fail independently.
  • Priority queues are essential. Critical notifications (OTP, security alerts) must never be delayed by a bulk marketing campaign. Use separate queues per priority tier.
  • Use idempotency keys to prevent duplicate notifications. At-least-once delivery with deduplication is far simpler than exactly-once semantics.
  • Rate limiting at both user and provider levels protects user experience and prevents provider throttling. Always implement exponential backoff with jitter for retries.
  • The adapter pattern makes the system extensible. Adding a new channel (WhatsApp, Slack, in-app WebSocket) means implementing a single adapter interface without changing the core pipeline.
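
One way to express that adapter contract (a sketch; the interface shape and result type are assumptions, not a prescribed API):

```typescript
// Every channel implements the same contract, so the dispatch
// pipeline never branches on channel type.
type SendResult =
  | { ok: true; providerMessageId: string }
  | { ok: false; permanent: boolean; reason: string };

interface ChannelAdapter {
  channel: string;
  send(n: { userId: number; title: string; body: string }): Promise<SendResult>;
}

// Adding a channel means writing one new class like this one;
// queues, workers, and retry logic are untouched.
class InAppAdapter implements ChannelAdapter {
  channel = "in_app";

  async send(n: { userId: number; title: string; body: string }): Promise<SendResult> {
    // Real implementation: push over the user's WebSocket session.
    return { ok: true, providerMessageId: `inapp-${n.userId}-${Date.now()}` };
  }
}
```

The `permanent` flag on failures is what lets the dispatch worker decide between retrying with backoff and routing straight to the DLQ, keeping that policy out of the adapters themselves.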
