Step 1: Clarify Requirements
Functional Requirements
- Send notifications via multiple channels: push (iOS/Android/Web), SMS, email, and in-app.
- Users can configure notification preferences per channel and per notification type (e.g., marketing vs. transactional).
- Support templated messages with dynamic personalization (e.g., "Hi {{name}}, your order {{order_id}} has shipped").
- Support scheduled/delayed notifications (e.g., send at 9 AM in the user's timezone).
- Track delivery status: sent, delivered, opened, clicked, failed.
- Support bulk notifications (e.g., product announcements to all users).
- Provide an API for internal services to trigger notifications.
Non-Functional Requirements
- Reliability: No notification should be silently lost. At-least-once delivery with idempotency.
- Low latency: Transactional notifications (password reset, OTP) must arrive within seconds.
- Scalability: Handle billions of notifications per day across all channels.
- Rate limiting: Prevent notification fatigue. Respect provider rate limits (e.g., APNs throttling).
- Extensibility: Adding a new channel (e.g., WhatsApp, Slack) should not require rearchitecting the system.
Step 2: Back-of-Envelope Estimates
Users: 500 million registered users
Daily active: 100 million DAU
Notifications per day:
Push: 5 per DAU = 500 million/day
Email: 0.5 per user = 250 million/day
SMS: 0.1 per user = 50 million/day
In-app: 3 per DAU = 300 million/day
----------------------------------------
Total: ~1.1 billion/day
Peak throughput:
1.1B / 86,400 sec = ~12,700 notifications/sec (average)
Peak (3x average) = ~38,000 notifications/sec
Storage (per notification record):
~500 bytes (ID, user_id, channel, template, status, timestamps)
1.1B * 500B = ~550 GB/day
Retained 90 days = ~50 TB
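The volume and storage figures above can be sanity-checked with a few lines of arithmetic (the constants simply mirror the assumptions stated in this section):

```typescript
// Back-of-envelope check for the traffic and storage estimates.
const DAU = 100e6;   // daily active users
const USERS = 500e6; // registered users

const perDay = {
  push: 5 * DAU,      // 500M/day
  email: 0.5 * USERS, // 250M/day
  sms: 0.1 * USERS,   // 50M/day
  inApp: 3 * DAU,     // 300M/day
};
const totalPerDay = Object.values(perDay).reduce((a, b) => a + b, 0); // ~1.1B

const avgPerSec = totalPerDay / 86_400; // ~12,700 notifications/sec
const peakPerSec = 3 * avgPerSec;       // ~38,000 notifications/sec

const bytesPerRecord = 500;
const dailyStorageGB = (totalPerDay * bytesPerRecord) / 1e9; // ~550 GB/day
const retainedTB = (dailyStorageGB * 90) / 1e3;              // ~50 TB over 90 days
```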
Message queue throughput:
~38,000 messages/sec at peak
Kafka or SQS can handle this comfortably.
Step 3: High-Level Design
The architecture separates concerns into distinct layers: the notification service validates and enriches requests, priority queues order delivery by urgency, dispatch workers enforce rate limits and route to the correct provider adapter, and analytics tracks delivery outcomes via callbacks.
Step 4: Deep Dive
Data Model
The notification system requires three core tables: one for notification records, one for user preferences, and one for device/channel registrations.
TABLE notifications (
id UUID PRIMARY KEY,
user_id BIGINT NOT NULL,
channel ENUM('push','sms','email','in_app') NOT NULL,
priority ENUM('critical','high','standard','low') DEFAULT 'standard',
category VARCHAR(50) NOT NULL, -- 'otp', 'order_update', 'marketing'
template_id VARCHAR(100),
template_params JSONB, -- {"name":"Alice","order_id":"#1234"}
rendered_title TEXT,
rendered_body TEXT,
status ENUM('queued','sent','delivered','opened','clicked','failed') DEFAULT 'queued',
retry_count INT DEFAULT 0,
scheduled_at TIMESTAMP, -- NULL = send immediately
sent_at TIMESTAMP,
delivered_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW(),
idempotency_key VARCHAR(64) UNIQUE -- prevents duplicate sends
);
INDEX idx_user_status ON notifications(user_id, status);
INDEX idx_scheduled ON notifications(scheduled_at) WHERE status = 'queued';
INDEX idx_idempotency ON notifications(idempotency_key);
TABLE user_preferences (
user_id BIGINT NOT NULL,
category VARCHAR(50) NOT NULL, -- 'marketing', 'transactional', 'social'
channel ENUM('push','sms','email','in_app') NOT NULL,
enabled BOOLEAN DEFAULT TRUE,
quiet_start TIME, -- do not disturb start
quiet_end TIME, -- do not disturb end
timezone VARCHAR(40) DEFAULT 'UTC',
PRIMARY KEY (user_id, category, channel)
);
TABLE device_registrations (
id UUID PRIMARY KEY,
user_id BIGINT NOT NULL,
channel ENUM('apns','fcm','web_push') NOT NULL,
device_token TEXT NOT NULL,
platform VARCHAR(20), -- 'ios', 'android', 'web'
is_active BOOLEAN DEFAULT TRUE,
registered_at TIMESTAMP DEFAULT NOW(),
last_used_at TIMESTAMP
);
INDEX idx_user_devices ON device_registrations(user_id, is_active);
The idempotency_key column is critical for at-least-once delivery. If a service retries sending a notification (e.g., due to a timeout), the notification service checks this key and skips duplicates. The key is typically a hash of (user_id + event_type + event_id + channel).
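A minimal sketch of deriving such a key (the field names and hashing choice are illustrative, not a fixed spec):

```typescript
import { createHash } from "crypto";

// Derive a deterministic idempotency key from the event identity, so a
// retried call produces the same key and the duplicate insert is rejected
// by the UNIQUE constraint on notifications.idempotency_key.
function idempotencyKey(
  userId: number,
  eventType: string,
  eventId: string,
  channel: string
): string {
  const material = `${userId}:${eventType}:${eventId}:${channel}`;
  // SHA-256 hex is exactly 64 chars, matching VARCHAR(64).
  return createHash("sha256").update(material).digest("hex");
}

// Same inputs always yield the same key, so a retried send is deduplicated.
const k1 = idempotencyKey(12345, "order_shipped", "#A1B2C3", "push");
const k2 = idempotencyKey(12345, "order_shipped", "#A1B2C3", "push");
// k1 === k2
```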
Rate Limiting & Priority Queues
Notification fatigue is a real problem. Sending too many notifications degrades user experience and increases opt-out rates. Rate limiting operates at two levels:
User-Level Rate Limiting
- Limit total notifications per user per time window (e.g., max 10 push notifications per hour).
- Implemented with a Redis sliding window counter: INCR user:{id}:push:count with a TTL.
- Critical notifications (OTP, security alerts) bypass user-level limits.
- Marketing notifications are the first to be dropped when limits are reached.
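The counter above can be sketched as follows, with an in-memory map standing in for Redis (strictly speaking, INCR with a TTL yields a fixed-window approximation of the sliding window; in production the counter would live in Redis so all workers share state):

```typescript
// Windowed counter per (user, channel): INCR a key scoped to the current
// window and let it expire with the window. In-memory stand-in for Redis.
const WINDOW_MS = 60 * 60 * 1000; // 1 hour
const LIMIT = 10;                 // max push notifications per hour (illustrative)
const counters = new Map<string, number>();

function allowSend(userId: number, channel: string, now = Date.now()): boolean {
  const window = Math.floor(now / WINDOW_MS);
  const key = `user:${userId}:${channel}:${window}`; // TTL = WINDOW_MS in Redis
  const count = (counters.get(key) ?? 0) + 1;        // INCR
  counters.set(key, count);
  return count <= LIMIT;
}
// Critical categories (OTP, security alerts) would skip this check entirely.
```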
Provider-Level Rate Limiting
- APNs allows ~2,000-4,000 notifications/sec per connection (varies by priority).
- FCM has per-project limits (~500K messages/sec for large projects).
- Twilio SMS has per-number throughput limits (1 msg/sec for long codes).
- Use token bucket rate limiters per provider, with circuit breakers for provider outages.
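A minimal token bucket of the kind described above, one instance per provider connection (the rates are illustrative):

```typescript
// Token bucket: tokens refill continuously at ratePerSec up to capacity;
// each send consumes one token. When the bucket is empty the dispatch
// worker backs off (or re-queues) instead of hitting the provider.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private ratePerSec: number,
    now = Date.now()
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  tryConsume(now = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// e.g. one bucket per APNs connection at ~2,000 notifications/sec
const apnsBucket = new TokenBucket(2000, 2000);
```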
The queuing system uses three priority tiers to ensure urgent messages are never delayed by bulk sends:
| Priority | Queue | Use Case | Target Latency |
|---|---|---|---|
| Critical | High-priority queue | OTP, security alerts, password resets | < 5 seconds |
| Standard | Standard queue | Order updates, social interactions, reminders | < 30 seconds |
| Low | Bulk queue | Marketing campaigns, product announcements, digests | < 10 minutes |
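Routing by priority is then a simple lookup. Note that the schema's four priority values map onto the three queue tiers, with 'critical' and 'high' sharing the high-priority queue (topic names here are illustrative):

```typescript
type Priority = "critical" | "high" | "standard" | "low";

// Map each priority value to its queue/topic.
const PRIORITY_TOPIC: Record<Priority, string> = {
  critical: "notifications.high",
  high: "notifications.high",
  standard: "notifications.standard",
  low: "notifications.bulk",
};

function topicFor(priority: Priority): string {
  return PRIORITY_TOPIC[priority];
}
```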
Delivery Guarantees & Retry Logic
Notifications operate under an at-least-once delivery model. The system must handle failures at every stage: network errors, provider outages, invalid device tokens, and throttling.
On successful handoff to the provider, the worker marks the notification sent. The provider may later send a delivery receipt callback, updating the status to delivered. Permanent failures (e.g., invalid recipients, unsubscribed users) are marked failed and are not retried; for invalid device tokens, mark the device registration as inactive. Transient failures are retried with exponential backoff:
function getRetryDelay(attempt: number): number {
// Exponential backoff with jitter
const baseDelay = 1000; // 1 second
const maxDelay = 60000; // 60 seconds
const exponential = baseDelay * Math.pow(2, attempt);
const jitter = Math.random() * 1000;
return Math.min(exponential + jitter, maxDelay);
}
// Retry schedule (getRetryDelay called with attempt = 0..4):
// Attempt 1: ~1-2 seconds
// Attempt 2: ~2-3 seconds
// Attempt 3: ~4-5 seconds
// Attempt 4: ~8-9 seconds
// Attempt 5: ~16-17 seconds
// After 5 failures -> Dead Letter Queue
Without jitter, if a provider goes down and recovers, all retrying workers would hit it simultaneously at the same backoff intervals (the "thundering herd" problem). Adding random jitter spreads retries over time, preventing sudden spikes that could cause the provider to fail again.
Template System & Personalization
Notifications use templates to separate content from logic. A template is defined once and rendered with per-user data at send time.
// Template definition (stored in template service)
{
"id": "order_shipped",
"channels": {
"push": {
"title": "Your order is on its way!",
"body": "Hi {{user.first_name}}, your order {{order.id}} has shipped via {{order.carrier}}. Track it here."
},
"email": {
"subject": "Order {{order.id}} Shipped",
"html_template": "email/order_shipped.mjml",
"plain_text": "Hi {{user.first_name}}, your order {{order.id}} shipped..."
},
"sms": {
"body": "{{user.first_name}}, order {{order.id}} shipped. Track: {{order.tracking_url}}"
}
},
"category": "transactional",
"default_priority": "standard"
}
// API call to trigger notification
POST /v1/notifications
{
"user_id": 12345,
"template_id": "order_shipped",
"params": {
"order": {
"id": "#A1B2C3",
"carrier": "FedEx",
"tracking_url": "https://track.example.com/A1B2C3"
}
},
"channels": ["push", "email"], // override defaults
"idempotency_key": "order_shipped:12345:#A1B2C3"
}
The template engine resolves channel-specific content, applies user preferences (checking if the user has enabled this category on each channel), and handles localization by selecting the correct language variant based on the user's locale setting.
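The `{{path}}` substitution step can be sketched as below; the preference and localization checks are omitted, and leaving unknown placeholders intact (rather than blank) is one possible design choice that makes missing data visible in QA:

```typescript
// Render "{{user.first_name}}"-style placeholders by walking a dotted
// path through the params object.
function renderTemplate(
  template: string,
  params: Record<string, unknown>
): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match: string, path: string) => {
    const value = path
      .split(".")
      .reduce<unknown>((obj, key) => (obj as any)?.[key], params);
    // Unknown paths are left as-is instead of rendering an empty string.
    return value === undefined || value === null ? match : String(value);
  });
}

const body = renderTemplate(
  "Hi {{user.first_name}}, your order {{order.id}} has shipped via {{order.carrier}}.",
  { user: { first_name: "Alice" }, order: { id: "#A1B2C3", carrier: "FedEx" } }
);
// "Hi Alice, your order #A1B2C3 has shipped via FedEx."
```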
Analytics: Delivery, Open Rates & Click-Through
The analytics subsystem tracks the full notification lifecycle. Each state transition emits an event to a streaming pipeline (Kafka) for aggregation.
| Metric | How It Is Tracked | Typical Value |
|---|---|---|
| Delivery rate | Provider delivery receipts (APNs, FCM callbacks) | 95-99% |
| Open rate (push) | App reports "notification opened" event via SDK | 5-15% |
| Open rate (email) | Tracking pixel in email body | 15-25% |
| Click-through rate | Redirect through tracking URL before landing page | 2-5% |
| Unsubscribe rate | Unsubscribe link or preference change | < 0.5% |
| Failure rate | Provider error responses, DLQ size | < 1% |
Notification Events Flow:
Dispatch Worker
|
| (emit event: sent/failed/retried)
v
Kafka Topic: "notification.events"
|
|---> Real-time dashboard (Flink/Spark Streaming)
| - Live delivery rates
| - Failure alerts (PagerDuty)
|
|---> Batch analytics (ClickHouse / BigQuery)
- Daily/weekly reports
- A/B test results for notification copy
- Channel effectiveness comparison
Step 5: Scaling & Optimizations
- Horizontal scaling of workers: Dispatch workers are stateless. Scale them independently per channel based on queue depth. Use auto-scaling groups that respond to queue lag metrics.
- Database partitioning: Partition the notifications table by created_at (monthly ranges). Old partitions can be archived to cold storage. User preferences are cached aggressively in Redis since they change infrequently.
- Multi-region deployment: Deploy notification workers close to provider endpoints (e.g., APNs servers are in the US) to reduce network latency. Use regional queues for in-app notifications served via WebSocket.
- Batching: Group email notifications into digest messages (e.g., "You have 5 new comments") to reduce volume and improve engagement. Use a time-window aggregator before the email adapter.
- Provider failover: For SMS, maintain multiple providers (Twilio, Vonage, AWS SNS). If one provider fails or is rate-limited, the adapter automatically routes to a backup provider.
- Scheduled notification handling: A cron-based scheduler scans for notifications where scheduled_at <= NOW() and enqueues them. Use database-level indexing on scheduled_at for efficient polling.
- Token management: Periodically clean up invalid device tokens reported by APNs/FCM feedback services. Stale tokens waste capacity and increase failure rates.
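The scheduler's due-notification scan from the list above reduces to a simple selection step, sketched here as a pure function over in-memory rows (the SQL equivalent is `WHERE status = 'queued' AND scheduled_at <= NOW()`, served by the partial index on scheduled_at):

```typescript
interface ScheduledNotification {
  id: string;
  scheduledAt: number | null; // epoch ms; null = enqueued immediately at creation
  status: "queued" | "sent";
}

// One scheduler tick: pick queued notifications whose scheduled time has
// passed. Rows with scheduledAt null never reach the scheduler because
// they are enqueued directly on creation.
function dueNotifications(
  rows: ScheduledNotification[],
  now: number
): ScheduledNotification[] {
  return rows.filter(
    (n) => n.status === "queued" && n.scheduledAt !== null && n.scheduledAt <= now
  );
}
```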
Architecture Summary
| Component | Technology | Purpose |
|---|---|---|
| Notification Service | REST/gRPC API | Validate, enrich, route notifications |
| Template Engine | Handlebars / custom | Render personalized content per channel |
| Preference Store | PostgreSQL + Redis | User opt-in/opt-out, quiet hours, channels |
| Message Queues | Kafka (or SQS) | Decouple ingestion from delivery, priority ordering |
| Dispatch Workers | Stateless consumers | Rate limit, retry, route to provider adapters |
| Provider Adapters | APNs, FCM, SMTP, Twilio | Channel-specific delivery logic |
| Analytics Pipeline | Kafka + ClickHouse | Delivery tracking, open/click rates, alerting |
| Dead Letter Queue | SQS / Kafka DLQ | Capture permanently failed messages for review |
Key Takeaways
- A notification system is fundamentally a fan-out problem: one event triggers messages across multiple channels to multiple devices. Design for channel independence so each adapter can scale and fail independently.
- Priority queues are essential. Critical notifications (OTP, security alerts) must never be delayed by a bulk marketing campaign. Use separate queues per priority tier.
- Use idempotency keys to prevent duplicate notifications. At-least-once delivery with deduplication is far simpler than exactly-once semantics.
- Rate limiting at both user and provider levels protects user experience and prevents provider throttling. Always implement exponential backoff with jitter for retries.
- The adapter pattern makes the system extensible. Adding a new channel (WhatsApp, Slack, in-app WebSocket) means implementing a single adapter interface without changing the core pipeline.
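The adapter interface the last point describes might look like this (the type and method names are illustrative, not a prescribed API):

```typescript
interface Notification {
  userId: number;
  title: string;
  body: string;
}

interface DeliveryResult {
  ok: boolean;
  providerMessageId?: string;
  retryable?: boolean; // transient failure -> backoff; permanent -> DLQ / deactivate token
}

// Every channel implements the same narrow interface, so adding WhatsApp
// or Slack means writing one adapter and registering it; the core pipeline
// (queues, retries, rate limits) is untouched.
interface ChannelAdapter {
  readonly channel: string;
  send(notification: Notification): Promise<DeliveryResult>;
}

const adapters = new Map<string, ChannelAdapter>();

function registerAdapter(adapter: ChannelAdapter): void {
  adapters.set(adapter.channel, adapter);
}

// A stub adapter standing in for a real APNs client:
registerAdapter({
  channel: "push",
  send: async () => ({ ok: true, providerMessageId: "stub-1" }),
});
```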