Step 1: Clarify Requirements
Functional Requirements
- Send notifications via multiple channels: push (iOS/Android/Web), SMS, email, and in-app.
- Users can configure notification preferences per channel and per notification type (e.g., marketing vs. transactional).
- Support templated messages with dynamic personalization (e.g., "Hi {{name}}, your order {{order_id}} has shipped").
- Support scheduled/delayed notifications (e.g., send at 9 AM in the user's timezone).
- Track delivery status: sent, delivered, opened, clicked, failed.
- Support bulk notifications (e.g., product announcements to all users).
- Provide an API for internal services to trigger notifications.
Non-Functional Requirements
- Reliability: No notification should be silently lost. At-least-once delivery with idempotency.
- Low latency: Transactional notifications (password reset, OTP) must arrive within seconds.
- Scalability: Handle billions of notifications per day across all channels.
- Rate limiting: Prevent notification fatigue. Respect provider rate limits (e.g., APNs throttling).
- Extensibility: Adding a new channel (e.g., WhatsApp, Slack) should not require rearchitecting the system.
Step 2: Back-of-Envelope Estimates
Users: 500 million registered users
Daily active: 100 million DAU
Notifications per day:
Push: 5 per DAU = 500 million/day
Email: 0.5 per user = 250 million/day
SMS: 0.1 per user = 50 million/day
In-app: 3 per DAU = 300 million/day
----------------------------------------
Total: ~1.1 billion/day
Peak throughput:
1.1B / 86,400 sec = ~12,700 notifications/sec (average)
Peak (3x average) = ~38,000 notifications/sec
Storage (per notification record):
~500 bytes (ID, user_id, channel, template, status, timestamps)
1.1B * 500B = ~550 GB/day
Retained 90 days = ~50 TB
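The volume and storage figures above can be sanity-checked with a few lines of arithmetic (the constants simply mirror the assumptions stated in this section):

```typescript
// Back-of-envelope check for the traffic and storage estimates.
const DAU = 100e6;   // daily active users
const USERS = 500e6; // registered users

const perDay = {
  push: 5 * DAU,      // 500M/day
  email: 0.5 * USERS, // 250M/day
  sms: 0.1 * USERS,   // 50M/day
  inApp: 3 * DAU,     // 300M/day
};
const totalPerDay = Object.values(perDay).reduce((a, b) => a + b, 0); // ~1.1B

const avgPerSec = totalPerDay / 86_400; // ~12,700 notifications/sec
const peakPerSec = 3 * avgPerSec;       // ~38,000 notifications/sec

const bytesPerRecord = 500;
const dailyStorageGB = (totalPerDay * bytesPerRecord) / 1e9; // ~550 GB/day
const retainedTB = (dailyStorageGB * 90) / 1e3;              // ~50 TB over 90 days
```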
Message queue throughput:
~38,000 messages/sec at peak
Kafka or SQS can handle this comfortably.
Step 3: High-Level Design
The architecture separates concerns into distinct layers: the notification service validates and enriches requests, priority queues order delivery by urgency, dispatch workers enforce rate limits and route to the correct provider adapter, and analytics tracks delivery outcomes via callbacks.
Step 4: Deep Dive
Data Model
The notification system requires three core tables: one for notification records, one for user preferences, and one for device/channel registrations.
TABLE notifications (
id UUID PRIMARY KEY,
user_id BIGINT NOT NULL,
channel ENUM('push','sms','email','in_app') NOT NULL,
priority ENUM('critical','high','standard','low') DEFAULT 'standard',
category VARCHAR(50) NOT NULL, -- 'otp', 'order_update', 'marketing'
template_id VARCHAR(100),
template_params JSONB, -- {"name":"Alice","order_id":"#1234"}
rendered_title TEXT,
rendered_body TEXT,
status ENUM('queued','sent','delivered','opened','clicked','failed') DEFAULT 'queued',
retry_count INT DEFAULT 0,
scheduled_at TIMESTAMP, -- NULL = send immediately
sent_at TIMESTAMP,
delivered_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW(),
idempotency_key VARCHAR(64) UNIQUE -- prevents duplicate sends
);
INDEX idx_user_status ON notifications(user_id, status);
INDEX idx_scheduled ON notifications(scheduled_at) WHERE status = 'queued';
INDEX idx_idempotency ON notifications(idempotency_key);
TABLE user_preferences (
user_id BIGINT NOT NULL,
category VARCHAR(50) NOT NULL, -- 'marketing', 'transactional', 'social'
channel ENUM('push','sms','email','in_app') NOT NULL,
enabled BOOLEAN DEFAULT TRUE,
quiet_start TIME, -- do not disturb start
quiet_end TIME, -- do not disturb end
timezone VARCHAR(40) DEFAULT 'UTC',
PRIMARY KEY (user_id, category, channel)
);
TABLE device_registrations (
id UUID PRIMARY KEY,
user_id BIGINT NOT NULL,
channel ENUM('apns','fcm','web_push') NOT NULL,
device_token TEXT NOT NULL,
platform VARCHAR(20), -- 'ios', 'android', 'web'
is_active BOOLEAN DEFAULT TRUE,
registered_at TIMESTAMP DEFAULT NOW(),
last_used_at TIMESTAMP
);
INDEX idx_user_devices ON device_registrations(user_id, is_active);
The idempotency_key column is critical for at-least-once delivery. If a service retries sending a notification (e.g., due to a timeout), the notification service checks this key and skips duplicates. The key is typically a hash of (user_id + event_type + event_id + channel).
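A minimal sketch of deriving such a key (the field names and hashing choice are illustrative, not a fixed spec):

```typescript
import { createHash } from "crypto";

// Derive a deterministic idempotency key from the event identity, so a
// retried call produces the same key and the duplicate insert is rejected
// by the UNIQUE constraint on notifications.idempotency_key.
function idempotencyKey(
  userId: number,
  eventType: string,
  eventId: string,
  channel: string
): string {
  const material = `${userId}:${eventType}:${eventId}:${channel}`;
  // SHA-256 hex is exactly 64 chars, matching VARCHAR(64).
  return createHash("sha256").update(material).digest("hex");
}

// Same inputs always yield the same key, so a retried send is deduplicated.
const k1 = idempotencyKey(12345, "order_shipped", "#A1B2C3", "push");
const k2 = idempotencyKey(12345, "order_shipped", "#A1B2C3", "push");
// k1 === k2
```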
Rate Limiting & Priority Queues
Notification fatigue is a real problem. Sending too many notifications degrades user experience and increases opt-out rates. Rate limiting operates at two levels:
User-Level Rate Limiting
- Limit total notifications per user per time window (e.g., max 10 push notifications per hour).
- Implemented with a Redis sliding window counter: INCR user:{id}:push:count with a TTL.
- Critical notifications (OTP, security alerts) bypass user-level limits.
- Marketing notifications are the first to be dropped when limits are reached.
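The counter above can be sketched as follows, with an in-memory map standing in for Redis (strictly speaking, INCR with a TTL yields a fixed-window approximation of the sliding window; in production the counter would live in Redis so all workers share state):

```typescript
// Windowed counter per (user, channel): INCR a key scoped to the current
// window and let it expire with the window. In-memory stand-in for Redis.
const WINDOW_MS = 60 * 60 * 1000; // 1 hour
const LIMIT = 10;                 // max push notifications per hour (illustrative)
const counters = new Map<string, number>();

function allowSend(userId: number, channel: string, now = Date.now()): boolean {
  const window = Math.floor(now / WINDOW_MS);
  const key = `user:${userId}:${channel}:${window}`; // TTL = WINDOW_MS in Redis
  const count = (counters.get(key) ?? 0) + 1;        // INCR
  counters.set(key, count);
  return count <= LIMIT;
}
// Critical categories (OTP, security alerts) would skip this check entirely.
```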
Provider-Level Rate Limiting
- APNs allows ~2,000-4,000 notifications/sec per connection (varies by priority).
- FCM has per-project limits (~500K messages/sec for large projects).
- Twilio SMS has per-number throughput limits (1 msg/sec for long codes).
- Use token bucket rate limiters per provider, with circuit breakers for provider outages.
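A minimal token bucket of the kind described above, one instance per provider connection (the rates are illustrative):

```typescript
// Token bucket: tokens refill continuously at ratePerSec up to capacity;
// each send consumes one token. When the bucket is empty the dispatch
// worker backs off (or re-queues) instead of hitting the provider.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private ratePerSec: number,
    now = Date.now()
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  tryConsume(now = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// e.g. one bucket per APNs connection at ~2,000 notifications/sec
const apnsBucket = new TokenBucket(2000, 2000);
```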
The queuing system uses three priority tiers to ensure urgent messages are never delayed by bulk sends:
| Priority | Queue | Use Case | Target Latency |
|---|---|---|---|
| Critical | High-priority queue | OTP, security alerts, password resets | < 5 seconds |
| Standard | Standard queue | Order updates, social interactions, reminders | < 30 seconds |
| Low | Bulk queue | Marketing campaigns, product announcements, digests | < 10 minutes |
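Routing by priority is then a simple lookup. Note that the schema's four priority values map onto the three queue tiers, with 'critical' and 'high' sharing the high-priority queue (topic names here are illustrative):

```typescript
type Priority = "critical" | "high" | "standard" | "low";

// Map each priority value to its queue/topic.
const PRIORITY_TOPIC: Record<Priority, string> = {
  critical: "notifications.high",
  high: "notifications.high",
  standard: "notifications.standard",
  low: "notifications.bulk",
};

function topicFor(priority: Priority): string {
  return PRIORITY_TOPIC[priority];
}
```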
Delivery Guarantees & Retry Logic
Notifications operate under an at-least-once delivery model. The system must handle failures at every stage: network errors, provider outages, invalid device tokens, and throttling.
On successful handoff to the provider, the worker marks the notification sent. The provider may later send a delivery receipt callback, updating the status to delivered. Permanent failures (e.g., invalid recipients, unsubscribed users) are marked failed and are not retried; for invalid device tokens, mark the device registration as inactive. Transient failures are retried with exponential backoff:
function getRetryDelay(attempt: number): number {
// Exponential backoff with jitter
const baseDelay = 1000; // 1 second
const maxDelay = 60000; // 60 seconds
const exponential = baseDelay * Math.pow(2, attempt);
const jitter = Math.random() * 1000;
return Math.min(exponential + jitter, maxDelay);
}
// Retry schedule (getRetryDelay called with attempt = 0..4):
// Attempt 1: ~1-2 seconds
// Attempt 2: ~2-3 seconds
// Attempt 3: ~4-5 seconds
// Attempt 4: ~8-9 seconds
// Attempt 5: ~16-17 seconds
// After 5 failures -> Dead Letter Queue
Without jitter, if a provider goes down and recovers, all retrying workers would hit it simultaneously at the same backoff intervals (the "thundering herd" problem). Adding random jitter spreads retries over time, preventing sudden spikes that could cause the provider to fail again.
Template System & Personalization
Notifications use templates to separate content from logic. A template is defined once and rendered with per-user data at send time.
// Template definition (stored in template service)
{
"id": "order_shipped",
"channels": {
"push": {
"title": "Your order is on its way!",
"body": "Hi {{user.first_name}}, your order {{order.id}} has shipped via {{order.carrier}}. Track it here."
},
"email": {
"subject": "Order {{order.id}} Shipped",
"html_template": "email/order_shipped.mjml",
"plain_text": "Hi {{user.first_name}}, your order {{order.id}} shipped..."
},
"sms": {
"body": "{{user.first_name}}, order {{order.id}} shipped. Track: {{order.tracking_url}}"
}
},
"category": "transactional",
"default_priority": "standard"
}
// API call to trigger notification
POST /v1/notifications
{
"user_id": 12345,
"template_id": "order_shipped",
"params": {
"order": {
"id": "#A1B2C3",
"carrier": "FedEx",
"tracking_url": "https://track.example.com/A1B2C3"
}
},
"channels": ["push", "email"], // override defaults
"idempotency_key": "order_shipped:12345:#A1B2C3"
}
The template engine resolves channel-specific content, applies user preferences (checking if the user has enabled this category on each channel), and handles localization by selecting the correct language variant based on the user's locale setting.
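The `{{path}}` substitution step can be sketched as below; the preference and localization checks are omitted, and leaving unknown placeholders intact (rather than blank) is one possible design choice that makes missing data visible in QA:

```typescript
// Render "{{user.first_name}}"-style placeholders by walking a dotted
// path through the params object.
function renderTemplate(
  template: string,
  params: Record<string, unknown>
): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match: string, path: string) => {
    const value = path
      .split(".")
      .reduce<unknown>((obj, key) => (obj as any)?.[key], params);
    // Unknown paths are left as-is instead of rendering an empty string.
    return value === undefined || value === null ? match : String(value);
  });
}

const body = renderTemplate(
  "Hi {{user.first_name}}, your order {{order.id}} has shipped via {{order.carrier}}.",
  { user: { first_name: "Alice" }, order: { id: "#A1B2C3", carrier: "FedEx" } }
);
// "Hi Alice, your order #A1B2C3 has shipped via FedEx."
```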
Analytics: Delivery, Open Rates & Click-Through
The analytics subsystem tracks the full notification lifecycle. Each state transition emits an event to a streaming pipeline (Kafka) for aggregation.
| Metric | How It Is Tracked | Typical Value |
|---|---|---|
| Delivery rate | Provider delivery receipts (APNs, FCM callbacks) | 95-99% |
| Open rate (push) | App reports "notification opened" event via SDK | 5-15% |
| Open rate (email) | Tracking pixel in email body | 15-25% |
| Click-through rate | Redirect through tracking URL before landing page | 2-5% |
| Unsubscribe rate | Unsubscribe link or preference change | < 0.5% |
| Failure rate | Provider error responses, DLQ size | < 1% |
Notification Events Flow:
Dispatch Worker
|
| (emit event: sent/failed/retried)
v
Kafka Topic: "notification.events"
|
|---> Real-time dashboard (Flink/Spark Streaming)
| - Live delivery rates
| - Failure alerts (PagerDuty)
|
|---> Batch analytics (ClickHouse / BigQuery)
- Daily/weekly reports
- A/B test results for notification copy
- Channel effectiveness comparison
Step 5: Scaling & Optimizations
- Horizontal scaling of workers: Dispatch workers are stateless. Scale them independently per channel based on queue depth. Use auto-scaling groups that respond to queue lag metrics.
- Database partitioning: Partition the notifications table by created_at (monthly ranges). Old partitions can be archived to cold storage. User preferences are cached aggressively in Redis since they change infrequently.
- Multi-region deployment: Deploy notification workers close to provider endpoints (e.g., APNs servers are in the US) to reduce network latency. Use regional queues for in-app notifications served via WebSocket.
- Batching: Group email notifications into digest messages (e.g., "You have 5 new comments") to reduce volume and improve engagement. Use a time-window aggregator before the email adapter.
- Provider failover: For SMS, maintain multiple providers (Twilio, Vonage, AWS SNS). If one provider fails or is rate-limited, the adapter automatically routes to a backup provider.
- Scheduled notification handling: A cron-based scheduler scans for notifications where scheduled_at <= NOW() and enqueues them. Use database-level indexing on scheduled_at for efficient polling.
- Token management: Periodically clean up invalid device tokens reported by APNs/FCM feedback services. Stale tokens waste capacity and increase failure rates.
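The scheduler's due-notification scan from the list above reduces to a simple selection step, sketched here as a pure function over in-memory rows (the SQL equivalent is `WHERE status = 'queued' AND scheduled_at <= NOW()`, served by the partial index on scheduled_at):

```typescript
interface ScheduledNotification {
  id: string;
  scheduledAt: number | null; // epoch ms; null = enqueued immediately at creation
  status: "queued" | "sent";
}

// One scheduler tick: pick queued notifications whose scheduled time has
// passed. Rows with scheduledAt null never reach the scheduler because
// they are enqueued directly on creation.
function dueNotifications(
  rows: ScheduledNotification[],
  now: number
): ScheduledNotification[] {
  return rows.filter(
    (n) => n.status === "queued" && n.scheduledAt !== null && n.scheduledAt <= now
  );
}
```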
Architecture Summary
| Component | Technology | Purpose |
|---|---|---|
| Notification Service | REST/gRPC API | Validate, enrich, route notifications |
| Template Engine | Handlebars / custom | Render personalized content per channel |
| Preference Store | PostgreSQL + Redis | User opt-in/opt-out, quiet hours, channels |
| Message Queues | Kafka (or SQS) | Decouple ingestion from delivery, priority ordering |
| Dispatch Workers | Stateless consumers | Rate limit, retry, route to provider adapters |
| Provider Adapters | APNs, FCM, SMTP, Twilio | Channel-specific delivery logic |
| Analytics Pipeline | Kafka + ClickHouse | Delivery tracking, open/click rates, alerting |
| Dead Letter Queue | SQS / Kafka DLQ | Capture permanently failed messages for review |
Key Takeaways
- A notification system is fundamentally a fan-out problem: one event triggers messages across multiple channels to multiple devices. Design for channel independence so each adapter can scale and fail independently.
- Priority queues are essential. Critical notifications (OTP, security alerts) must never be delayed by a bulk marketing campaign. Use separate queues per priority tier.
- Use idempotency keys to prevent duplicate notifications. At-least-once delivery with deduplication is far simpler than exactly-once semantics.
- Rate limiting at both user and provider levels protects user experience and prevents provider throttling. Always implement exponential backoff with jitter for retries.
- The adapter pattern makes the system extensible. Adding a new channel (WhatsApp, Slack, in-app WebSocket) means implementing a single adapter interface without changing the core pipeline.
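The adapter interface the last point describes might look like this (the type and method names are illustrative, not a prescribed API):

```typescript
interface Notification {
  userId: number;
  title: string;
  body: string;
}

interface DeliveryResult {
  ok: boolean;
  providerMessageId?: string;
  retryable?: boolean; // transient failure -> backoff; permanent -> DLQ / deactivate token
}

// Every channel implements the same narrow interface, so adding WhatsApp
// or Slack means writing one adapter and registering it; the core pipeline
// (queues, retries, rate limits) is untouched.
interface ChannelAdapter {
  readonly channel: string;
  send(notification: Notification): Promise<DeliveryResult>;
}

const adapters = new Map<string, ChannelAdapter>();

function registerAdapter(adapter: ChannelAdapter): void {
  adapters.set(adapter.channel, adapter);
}

// A stub adapter standing in for a real APNs client:
registerAdapter({
  channel: "push",
  send: async () => ({ ok: true, providerMessageId: "stub-1" }),
});
```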