Step 1: Clarify Requirements
Functional Requirements
- Upload: Creators upload videos of varying length (seconds to hours) and resolution (up to 4K).
- Transcode: Convert uploaded videos into multiple resolutions and formats for adaptive streaming.
- Stream: Viewers watch videos with smooth playback, automatic quality adjustment based on bandwidth.
- Search & Discovery: Users search for videos by title, tags, and description. Trending and recommended videos on the homepage.
- Engagement: Like, comment, subscribe, share. View counts and watch history.
- Recommendations: Personalized video suggestions based on watch history and preferences.
Non-Functional Requirements
- Scale: Support 1 billion daily active users, 500 million videos in the catalog.
- Availability: 99.99% uptime for playback. Upload can tolerate slightly lower availability.
- Low latency: Video playback must start within 2 seconds. Seek operations under 1 second.
- Durability: Uploaded videos must never be lost. 11 nines of durability for raw and transcoded assets.
- Global reach: Low-latency playback across all continents via CDN.
- Cost efficiency: Storage and bandwidth are the dominant costs. Optimize aggressively.
Step 2: Back-of-Envelope Estimates
Users: 1 billion DAU
Videos in catalog: 500 million
UPLOAD:
New uploads/day: 500,000 videos
Average raw size: 500 MB (mix of short and long content)
Daily raw upload: 500K * 500 MB = 250 TB/day
After transcoding: ~3x raw size (multiple resolutions)
250 TB * 3 = 750 TB/day of transcoded output
Annual storage: 750 TB * 365 = ~274 PB/year
STREAMING:
Average watch time: 30 minutes/user/day
Average bitrate: 5 Mbps (mix of resolutions)
Concurrent viewers: ~50 million (peak)
Peak bandwidth: 50M * 5 Mbps = 250 Tbps
(CDN handles most of this; origin serves ~1-5%)
TRANSCODING:
500K videos/day, average 10 min each
Transcoding to 5 resolutions = 2.5M transcoding jobs/day
At ~2x real-time per resolution (encoding takes twice the video's duration):
Total compute: 500K * 10 min * 2 * 5 = 50M GPU-minutes/day
METADATA:
500M videos * 2 KB metadata = 1 TB
Comments: ~10 billion total, ~5 TB
Watch history: 1B users * 200 entries * 50B = 10 TB

Step 3: High-Level Design
The architecture splits into two distinct paths: the upload/transcode path (top) handles ingestion, processing, and storage, while the playback path (bottom) serves video content through CDN edge servers. These paths are decoupled: a video becomes available for streaming only after the transcoding pipeline completes and metadata is written.
Step 4: Deep Dive
Upload Pipeline
Uploading large video files over unreliable networks requires a robust, resumable upload protocol. The upload service implements chunked, resumable uploads.
// 1. Initiate upload
POST /v1/videos/upload
{
"title": "System Design in 10 Minutes",
"description": "Quick overview of system design...",
"tags": ["system-design", "tutorial"],
"file_size": 524288000, // 500 MB
"content_type": "video/mp4"
}
Response: {
"upload_id": "upl_abc123",
"upload_url": "https://s3.amazonaws.com/raw-videos/...",
"chunk_size": 5242880, // 5 MB recommended
"total_chunks": 100
}
// 2. Upload each chunk
PUT /v1/videos/upload/{upload_id}/chunks/{chunk_number}
Headers: Content-MD5: {checksum}
Body: [binary chunk data]
// 3. Complete upload
POST /v1/videos/upload/{upload_id}/complete
Response: {
"video_id": "vid_xyz789",
"status": "processing",
"estimated_ready": "2026-02-18T10:30:00Z"
}

Transcoding Pipeline
Transcoding converts a single uploaded video into multiple renditions (resolutions and bitrates) for adaptive streaming. This is the most compute-intensive part of the system.
The pipeline is modeled as a Directed Acyclic Graph (DAG) of tasks:
              ┌─────────────┐
              │  Raw Video  │
              └──────┬──────┘
                     │
              ┌──────▼──────┐
              │    Split    │  Split into 10-sec segments
              │  into GOPs  │  (Group of Pictures)
              └──────┬──────┘
                     │
     ┌───────────────┼───────────────┐
     │               │               │
┌────▼────────┐ ┌────▼────────┐ ┌────▼────────┐
│ Encode 1080p│ │ Encode 720p │ │ Encode 480p │  ... + 360p, 240p
│ H.264/H.265 │ │    H.264    │ │    H.264    │
└────┬────────┘ └────┬────────┘ └────┬────────┘
     │               │               │
     └───────────────┼───────────────┘
                     │
              ┌──────▼──────┐
              │   Package   │  Generate HLS (.m3u8 + .ts)
              │  HLS/DASH   │  and DASH (.mpd + .m4s)
              └──────┬──────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
┌───────▼─────┐ ┌────▼────┐ ┌─────▼───────┐
│ Thumbnails  │ │   DRM   │ │  Upload to  │
│ Generation  │ │ Encrypt │ │ CDN Origin  │
└─────────────┘ └─────────┘ └─────────────┘

HLS (HTTP Live Streaming)
- Apple's protocol. Dominant on iOS, Safari, and most players.
- Uses .m3u8 playlist files and .ts (MPEG-TS) segments.
- Segments are typically 2-10 seconds long.
- Master playlist references multiple quality levels.
- Widely supported: works on nearly every device.
DASH (Dynamic Adaptive Streaming)
- International standard (ISO/IEC 23009). Codec-agnostic.
- Uses .mpd manifest and .m4s (fMP4) segments.
- Supports more flexible segment durations.
- Better DRM integration (Widevine, PlayReady).
- Preferred for Android and smart TV platforms.
Splitting the video into short segments (GOP-aligned) before encoding enables massive parallelism. Instead of encoding a 2-hour video sequentially on one machine (which would take ~4 hours), you split it into 720 ten-second segments and encode each in parallel across hundreds of workers. This reduces total transcoding time from hours to minutes.
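The parallelism arithmetic above can be sketched as a quick calculation. The segment length, worker count, and the ~2x real-time encoding slowdown are the assumptions stated in the text; the worker count of 240 is illustrative:

```python
import math

def transcode_wall_time(duration_s: float, segment_s: float = 10.0,
                        workers: int = 240, slowdown: float = 2.0) -> dict:
    """Estimate sequential vs. parallel transcoding wall time for one rendition.

    slowdown=2.0 means encoding takes 2x the video's duration (2x real-time).
    """
    segments = math.ceil(duration_s / segment_s)
    sequential = duration_s * slowdown            # one machine, whole file
    waves = math.ceil(segments / workers)         # batches of parallel segment jobs
    parallel = waves * segment_s * slowdown       # each wave encodes one segment
    return {"segments": segments, "sequential_s": sequential, "parallel_s": parallel}

# A 2-hour video: 720 ten-second segments.
est = transcode_wall_time(2 * 3600)
print(est)  # sequential: 14400 s (~4 hours); parallel: 60 s with 240 workers
```

With 240 workers, 720 segments are processed in 3 waves, turning a ~4-hour sequential job into about a minute of wall time.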
Video Storage Architecture
Video storage is the largest cost center. A well-designed storage strategy uses tiered storage and intelligent lifecycle policies.
| Storage Tier | Content | Access Pattern | Cost (relative) |
|---|---|---|---|
| Hot (S3 Standard) | Videos uploaded in the last 30 days, popular videos | Frequent reads from CDN origin | $$$ |
| Warm (S3 IA) | Videos 30-180 days old, moderate views | Occasional CDN origin pulls | $$ |
| Cold (S3 Glacier) | Videos older than 180 days, rarely viewed | Rare access, minutes to retrieve | $ |
| Archive (Glacier Deep) | Raw uploads (kept for re-transcoding) | Almost never accessed | $ (lowest) |
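A lifecycle policy along these lines can be sketched as a tier-selection rule. The age thresholds follow the table above; the popularity override and its threshold are illustrative assumptions:

```python
def select_tier(age_days: int, views_last_30d: int, is_raw: bool = False) -> str:
    """Pick a storage tier from video age and recent popularity.

    Raw originals go straight to deep archive once processing completes;
    popular videos stay hot regardless of age (threshold is illustrative).
    """
    if is_raw:
        return "glacier-deep"
    if age_days <= 30 or views_last_30d > 10_000:
        return "s3-standard"      # hot: recent or still popular
    if age_days <= 180:
        return "s3-ia"            # warm: moderate, occasional origin pulls
    return "glacier"              # cold: rarely viewed long tail

print(select_tier(400, 50_000))   # old but popular -> stays hot
print(select_tier(400, 3))        # old and quiet  -> glacier
```

In practice the transition would be driven by S3 lifecycle rules plus a periodic job that promotes videos whose view counts spike.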
s3://video-platform-raw/
└── {video_id}/
    └── original.mp4            # Raw upload (archive after processing)

s3://video-platform-transcoded/
└── {video_id}/
    ├── master.m3u8             # HLS master playlist
    ├── 1080p/
    │   ├── playlist.m3u8       # 1080p variant playlist
    │   ├── segment_000.ts      # 2-second segments
    │   ├── segment_001.ts
    │   └── ...
    ├── 720p/
    │   ├── playlist.m3u8
    │   └── ...
    ├── 480p/
    │   └── ...
    ├── 360p/
    │   └── ...
    ├── thumbnails/
    │   ├── poster.jpg          # Main thumbnail
    │   ├── sprite.jpg          # Thumbnail sprite for scrubbing
    │   └── preview.webm        # 5-second hover preview
    └── subtitles/
        ├── en.vtt
        └── es.vtt

Metadata Database
CREATE TABLE videos (
id UUID PRIMARY KEY,
creator_id BIGINT NOT NULL REFERENCES users(id),
title VARCHAR(200) NOT NULL,
description TEXT,
duration_sec INT,
status ENUM('uploading','processing','ready','failed','removed')
DEFAULT 'uploading',
visibility ENUM('public','unlisted','private') DEFAULT 'public',
storage_path VARCHAR(500), -- S3 prefix for transcoded files
original_path VARCHAR(500), -- S3 path for raw upload
view_count BIGINT DEFAULT 0,
like_count BIGINT DEFAULT 0,
created_at TIMESTAMP DEFAULT NOW(),
published_at TIMESTAMP
);
CREATE INDEX idx_creator ON videos(creator_id, created_at DESC);
CREATE INDEX idx_status ON videos(status) WHERE status = 'processing';
CREATE INDEX idx_trending ON videos(view_count DESC, published_at DESC);
CREATE TABLE video_renditions (
video_id UUID NOT NULL REFERENCES videos(id),
resolution VARCHAR(10) NOT NULL, -- '1080p', '720p', '480p'
bitrate_kbps INT NOT NULL,
codec VARCHAR(20) NOT NULL, -- 'h264', 'h265', 'vp9', 'av1'
format ENUM('hls','dash') NOT NULL,
segment_count INT,
total_size_mb INT,
playlist_path VARCHAR(500),
PRIMARY KEY (video_id, resolution, format)
);
CREATE TABLE watch_history (
user_id BIGINT NOT NULL,
video_id UUID NOT NULL,
watched_at TIMESTAMP DEFAULT NOW(),
watch_duration INT, -- seconds watched
last_position INT, -- resume position in seconds
PRIMARY KEY (user_id, video_id)
);

Streaming: Adaptive Bitrate (ABR)
Adaptive bitrate streaming is the key to smooth playback across varying network conditions. The player dynamically switches between quality levels based on available bandwidth.
The player first fetches the master playlist (master.m3u8) from the CDN. This file lists all available quality levels with their bitrates.

#EXTM3U
#EXT-X-VERSION:6
#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2"
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=854x480,CODECS="avc1.4d401e,mp4a.40.2"
480p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360,CODECS="avc1.42e01e,mp4a.40.2"
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=426x240,CODECS="avc1.42e00a,mp4a.40.2"
240p/playlist.m3u8

Not all video segments are equally popular. The first few segments of a video are requested most often (many users click a video and leave within seconds). CDNs should prioritize caching early segments. For long-tail content that is rarely watched, the CDN will issue an origin pull on the first request, then cache locally for subsequent viewers. Popular videos are pre-warmed to edge locations before they trend.
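The player-side selection logic against a master playlist like the one above can be sketched as follows. The 0.8 safety factor is a common heuristic, not a fixed standard; real players also smooth throughput measurements and account for buffer level:

```python
# Variant bitrates (bps) from the master playlist, highest first.
VARIANTS = [
    (6_000_000, "1080p"), (3_000_000, "720p"), (1_500_000, "480p"),
    (800_000, "360p"), (400_000, "240p"),
]

def pick_variant(measured_bps: float, safety: float = 0.8) -> str:
    """Choose the highest rendition whose bitrate fits within a safety
    margin of the measured throughput; fall back to the lowest."""
    budget = measured_bps * safety
    for bandwidth, name in VARIANTS:
        if bandwidth <= budget:
            return name
    return VARIANTS[-1][1]   # even the lowest doesn't fit: serve it anyway

print(pick_variant(10_000_000))  # fast connection -> 1080p
print(pick_variant(2_000_000))   # 1.6 Mbps budget -> 480p
```

The player re-evaluates this after every downloaded segment, which is what produces the mid-stream quality switches viewers see.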
Content Recommendation
The recommendation engine drives engagement by surfacing relevant videos. At scale, recommendations account for over 70% of all video views on platforms like YouTube.
Collaborative Filtering
- "Users who watched X also watched Y."
- Build a user-item interaction matrix from watch history.
- Use matrix factorization (ALS) or neural collaborative filtering to find latent factors.
- Good for discovering content outside a user's usual interests.
- Cold start problem: cannot recommend for new users or new videos with no watch data.
Content-Based Filtering
- "This video is similar to others you have watched."
- Extract features from video metadata: title, tags, description, category, creator.
- Compute similarity scores using TF-IDF or embedding vectors.
- Works well for new users (uses explicit preferences) and new content.
- Tends to create "filter bubbles": recommending only similar content.
In practice, production systems use a two-stage approach:
- Candidate generation: A lightweight model retrieves hundreds of candidate videos from a pool of millions (using approximate nearest neighbors on embedding vectors).
- Ranking: A heavier model (deep neural network) scores each candidate based on features like watch history, time of day, device, video freshness, creator affinity, and predicted watch time. The top results are served.
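The two-stage shape can be sketched in a few lines. The embeddings, catalog, and freshness bonus here are toy stand-ins: production systems retrieve candidates from an approximate-nearest-neighbor index and rank with a learned model:

```python
def dot(a, b):
    """Similarity between a user vector and a video embedding."""
    return sum(x * y for x, y in zip(a, b))

# Toy video embeddings (in production: learned vectors in an ANN index).
CATALOG = {
    "vid_a": [0.9, 0.1], "vid_b": [0.8, 0.3],
    "vid_c": [0.1, 0.9], "vid_d": [0.2, 0.8],
}

def recommend(user_vec, watched, freshness, k=2):
    # Stage 1: candidate generation -- cheap similarity retrieval,
    # over-fetching so the ranker has something to reorder.
    candidates = sorted(
        (v for v in CATALOG if v not in watched),
        key=lambda v: dot(user_vec, CATALOG[v]), reverse=True)[: k * 2]
    # Stage 2: ranking -- a heavier score over richer features
    # (similarity + freshness here stands in for a DNN's prediction).
    ranked = sorted(
        candidates,
        key=lambda v: dot(user_vec, CATALOG[v]) + 0.1 * freshness.get(v, 0),
        reverse=True)
    return ranked[:k]

print(recommend([1.0, 0.0], watched={"vid_a"}, freshness={"vid_d": 5}))
```

Note how the freshness bonus lets vid_d outrank the more-similar vid_c: the ranking stage exists precisely to blend signals the retrieval stage ignores.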
Cost Optimization
Storage and bandwidth dominate costs at scale. Key strategies to control expenses:
| Strategy | Savings | Trade-off |
|---|---|---|
| Storage tiering (hot/warm/cold) | 60-80% on storage | Cold videos take seconds to minutes to retrieve on first access |
| Codec optimization (H.265/AV1) | 30-50% bitrate reduction at same quality | Higher transcoding cost, older devices may not support |
| Lazy transcoding | Save compute on never-watched videos | First viewer of a rare video experiences delay |
| CDN caching | 90%+ reduction in origin bandwidth | Cache invalidation complexity |
| Delete low-value renditions | 20-40% storage reduction | If requested, must re-transcode from raw |
| Per-title encoding | 20-30% bitrate reduction | Requires per-video encoding analysis (extra compute) |
Instead of transcoding every uploaded video into all resolutions immediately, transcode only the most common resolutions (720p, 480p) upfront. Higher resolutions (1080p, 4K) are transcoded on-demand when a viewer requests them, then cached. This dramatically reduces compute costs since many videos are never watched in high resolution, and some are never watched at all.
Step 5: Scaling & Optimizations
- Upload scaling: Use pre-signed URLs to upload directly to S3, bypassing your servers entirely. The upload service only handles metadata and orchestration, not data transfer.
- Transcoding scaling: Use spot/preemptible GPU instances for transcoding (70-90% cost savings). Jobs are idempotent and restartable, so preemption is safe. Auto-scale worker pools based on queue depth.
- CDN multi-layer caching: Use a two-tier CDN: edge PoPs (200+ locations) for hot content, and regional mid-tier caches to reduce origin load for warm content. Cache hit ratios should exceed 95%.
- Database scaling: Separate the read-heavy metadata queries (video info, search) from write-heavy analytics (view counts, watch history). Use read replicas for metadata. Use Redis for real-time view count aggregation, flushing to the database periodically.
- Search: Index video metadata in Elasticsearch for full-text search. Use separate indices for titles, tags, and descriptions with boosted relevance scoring. Auto-complete and typo correction via n-gram tokenizers.
- Live streaming extension: For live content, replace the transcode pipeline with real-time encoders (OBS -> RTMP ingest -> live transcoder -> HLS/DASH segments pushed to CDN in near-real-time). Latency target: 3-10 seconds.
- View count accuracy: At billions of views per day, real-time counting is expensive. Use a write-back cache: increment in Redis, flush to PostgreSQL every 30 seconds. Accept slight inconsistency in displayed counts.
Architecture Summary
| Component | Technology | Purpose |
|---|---|---|
| Upload Service | API + S3 multipart | Chunked, resumable video upload |
| Raw Storage | S3 (Glacier archive) | Durable storage of original files |
| Transcode Queue | Kafka / SQS | Decouple upload from processing |
| Transcoding Pipeline | FFmpeg on GPU workers | DAG: split, encode, package HLS/DASH |
| Video Storage | S3 (tiered) | Transcoded segments, thumbnails, subtitles |
| Metadata DB | PostgreSQL + read replicas | Video info, renditions, watch history |
| CDN | CloudFront / Akamai | Edge caching, global low-latency delivery |
| Streaming API | REST API | Auth, manifest URLs, playback tokens |
| Recommendation | ML pipeline (ALS + DNN) | Personalized video suggestions |
| Search | Elasticsearch | Full-text video search with autocomplete |
Key Takeaways
- Video streaming is dominated by storage and bandwidth costs. Every architectural decision (codec choice, storage tiering, CDN caching, lazy transcoding) should be evaluated through a cost lens.
- The transcoding pipeline as a DAG enables massive parallelism. Splitting video into segments and encoding each independently reduces processing time from hours to minutes.
- Adaptive bitrate streaming (HLS/DASH) is essential for smooth playback. The player dynamically adjusts quality based on network conditions, preventing buffering while maximizing visual quality.
- CDN is not optional: it is a core architectural component. At scale, 95%+ of all video bytes should be served from edge caches, not the origin. Pre-warm popular content and use tiered caching for the long tail.
- Separate the upload path from the playback path entirely. They have different availability requirements, scaling characteristics, and failure modes. Playback must be 99.99% available; upload can tolerate occasional delays.