Step 1: Clarify Requirements
Functional Requirements
- Upload: Creators upload videos of varying length (seconds to hours) and resolution (up to 4K).
- Transcode: Convert uploaded videos into multiple resolutions and formats for adaptive streaming.
- Stream: Viewers watch videos with smooth playback, automatic quality adjustment based on bandwidth.
- Search & Discovery: Users search for videos by title, tags, and description. Trending and recommended videos on the homepage.
- Engagement: Like, comment, subscribe, share. View counts and watch history.
- Recommendations: Personalized video suggestions based on watch history and preferences.
Non-Functional Requirements
- Scale: Support 1 billion daily active users, 500 million videos in the catalog.
- Availability: 99.99% uptime for playback. Upload can tolerate slightly lower availability.
- Low latency: Video playback must start within 2 seconds. Seek operations under 1 second.
- Durability: Uploaded videos must never be lost. 11 nines of durability for raw and transcoded assets.
- Global reach: Low-latency playback across all continents via CDN.
- Cost efficiency: Storage and bandwidth are the dominant costs. Optimize aggressively.
Step 2: Back-of-Envelope Estimates
Users: 1 billion DAU
Videos in catalog: 500 million
UPLOAD:
New uploads/day: 500,000 videos
Average raw size: 500 MB (mix of short and long content)
Daily raw upload: 500K * 500 MB = 250 TB/day
After transcoding: ~3x raw size (multiple resolutions)
250 TB * 3 = 750 TB/day of transcoded output
Annual storage: 750 TB * 365 = ~274 PB/year
STREAMING:
Average watch time: 30 minutes/user/day
Average bitrate: 5 Mbps (mix of resolutions)
Concurrent viewers: ~50 million (peak)
Peak bandwidth: 50M * 5 Mbps = 250 Tbps
(CDN handles most of this; origin serves ~1-5%)
TRANSCODING:
500K videos/day, average 10 min each
Transcoding to 5 resolutions = 2.5M transcoding jobs/day
At ~2x real-time per resolution (encoding takes twice the video's duration):
Total compute: 500K * 10 min * 2 * 5 = 50M GPU-minutes/day
METADATA:
500M videos * 2 KB metadata = 1 TB
Comments: ~10 billion total, ~5 TB
Watch history: 1B users * 200 entries * 50B = 10 TB

Step 3: High-Level Design
The architecture splits into two distinct paths: the upload/transcode path (top) handles ingestion, processing, and storage, while the playback path (bottom) serves video content through CDN edge servers. These paths are decoupled: a video becomes available for streaming only after the transcoding pipeline completes and metadata is written.
Step 4: Deep Dive
Upload Pipeline
Uploading large video files over unreliable networks requires a robust, resumable upload protocol. The upload service implements chunked, resumable uploads.
// 1. Initiate upload
POST /v1/videos/upload
{
"title": "System Design in 10 Minutes",
"description": "Quick overview of system design...",
"tags": ["system-design", "tutorial"],
"file_size": 524288000, // 500 MB
"content_type": "video/mp4"
}
Response: {
"upload_id": "upl_abc123",
"upload_url": "https://s3.amazonaws.com/raw-videos/...",
"chunk_size": 5242880, // 5 MB recommended
"total_chunks": 100
}
// 2. Upload each chunk
PUT /v1/videos/upload/{upload_id}/chunks/{chunk_number}
Headers: Content-MD5: {checksum}
Body: [binary chunk data]
// 3. Complete upload
POST /v1/videos/upload/{upload_id}/complete
Response: {
"video_id": "vid_xyz789",
"status": "processing",
"estimated_ready": "2026-02-18T10:30:00Z"
}

Transcoding Pipeline
Transcoding converts a single uploaded video into multiple renditions (resolutions and bitrates) for adaptive streaming. This is the most compute-intensive part of the system.
The pipeline is modeled as a Directed Acyclic Graph (DAG) of tasks:
              ┌─────────────┐
              │  Raw Video  │
              └──────┬──────┘
                     │
              ┌──────▼──────┐
              │    Split    │  Split into 10-sec segments
              │  into GOPs  │  (Group of Pictures)
              └──────┬──────┘
                     │
     ┌───────────────┼───────────────┐
     │               │               │
┌────▼────────┐ ┌────▼────────┐ ┌────▼────────┐
│ Encode 1080p│ │ Encode 720p │ │ Encode 480p │  ... + 360p, 240p
│ H.264/H.265 │ │    H.264    │ │    H.264    │
└────┬────────┘ └────┬────────┘ └────┬────────┘
     │               │               │
     └───────────────┼───────────────┘
                     │
              ┌──────▼──────┐
              │   Package   │  Generate HLS (.m3u8 + .ts)
              │  HLS/DASH   │  and DASH (.mpd + .m4s)
              └──────┬──────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
┌───────▼─────┐ ┌────▼────┐ ┌─────▼───────┐
│ Thumbnails  │ │   DRM   │ │  Upload to  │
│ Generation  │ │ Encrypt │ │ CDN Origin  │
└─────────────┘ └─────────┘ └─────────────┘

HLS (HTTP Live Streaming)
- Apple's protocol. Dominant on iOS, Safari, and most players.
- Uses .m3u8 playlist files and .ts (MPEG-TS) segments.
- Segments are typically 2-10 seconds long.
- Master playlist references multiple quality levels.
- Widely supported: works on nearly every device.
DASH (Dynamic Adaptive Streaming)
- International standard (ISO/IEC 23009). Codec-agnostic.
- Uses .mpd manifest and .m4s (fMP4) segments.
- Supports more flexible segment durations.
- Better DRM integration (Widevine, PlayReady).
- Preferred for Android and smart TV platforms.
Splitting the video into short segments (GOP-aligned) before encoding enables massive parallelism. Instead of encoding a 2-hour video sequentially on one machine (which would take ~4 hours), you split it into 720 ten-second segments and encode each in parallel across hundreds of workers. This reduces total transcoding time from hours to minutes.
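The parallelism arithmetic above can be sketched as a quick calculation. The segment length, worker count, and the ~2x real-time encoding slowdown are the assumptions stated in the text; the worker count of 240 is illustrative:

```python
import math

def transcode_wall_time(duration_s: float, segment_s: float = 10.0,
                        workers: int = 240, slowdown: float = 2.0) -> dict:
    """Estimate sequential vs. parallel transcoding wall time for one rendition.

    slowdown=2.0 means encoding takes 2x the video's duration (2x real-time).
    """
    segments = math.ceil(duration_s / segment_s)
    sequential = duration_s * slowdown            # one machine, whole file
    waves = math.ceil(segments / workers)         # batches of parallel segment jobs
    parallel = waves * segment_s * slowdown       # each wave encodes one segment
    return {"segments": segments, "sequential_s": sequential, "parallel_s": parallel}

# A 2-hour video: 720 ten-second segments.
est = transcode_wall_time(2 * 3600)
print(est)  # sequential: 14400 s (~4 hours); parallel: 60 s with 240 workers
```

With 240 workers, 720 segments are processed in 3 waves, turning a ~4-hour sequential job into about a minute of wall time.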
Video Storage Architecture
Video storage is the largest cost center. A well-designed storage strategy uses tiered storage and intelligent lifecycle policies.
| Storage Tier | Content | Access Pattern | Cost (relative) |
|---|---|---|---|
| Hot (S3 Standard) | Videos uploaded in the last 30 days, popular videos | Frequent reads from CDN origin | $$$ |
| Warm (S3 IA) | Videos 30-180 days old, moderate views | Occasional CDN origin pulls | $$ |
| Cold (S3 Glacier) | Videos older than 180 days, rarely viewed | Rare access, minutes to retrieve | $ |
| Archive (Glacier Deep) | Raw uploads (kept for re-transcoding) | Almost never accessed | $ (lowest) |
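A lifecycle policy along these lines can be sketched as a tier-selection rule. The age thresholds follow the table above; the popularity override and its threshold are illustrative assumptions:

```python
def select_tier(age_days: int, views_last_30d: int, is_raw: bool = False) -> str:
    """Pick a storage tier from video age and recent popularity.

    Raw originals go straight to deep archive once processing completes;
    popular videos stay hot regardless of age (threshold is illustrative).
    """
    if is_raw:
        return "glacier-deep"
    if age_days <= 30 or views_last_30d > 10_000:
        return "s3-standard"      # hot: recent or still popular
    if age_days <= 180:
        return "s3-ia"            # warm: moderate, occasional origin pulls
    return "glacier"              # cold: rarely viewed long tail

print(select_tier(400, 50_000))   # old but popular -> stays hot
print(select_tier(400, 3))        # old and quiet  -> glacier
```

In practice the transition would be driven by S3 lifecycle rules plus a periodic job that promotes videos whose view counts spike.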
s3://video-platform-raw/
└── {video_id}/
    └── original.mp4            # Raw upload (archive after processing)

s3://video-platform-transcoded/
└── {video_id}/
    ├── master.m3u8             # HLS master playlist
    ├── 1080p/
    │   ├── playlist.m3u8       # 1080p variant playlist
    │   ├── segment_000.ts      # 2-second segments
    │   ├── segment_001.ts
    │   └── ...
    ├── 720p/
    │   ├── playlist.m3u8
    │   └── ...
    ├── 480p/
    │   └── ...
    ├── 360p/
    │   └── ...
    ├── thumbnails/
    │   ├── poster.jpg          # Main thumbnail
    │   ├── sprite.jpg          # Thumbnail sprite for scrubbing
    │   └── preview.webm        # 5-second hover preview
    └── subtitles/
        ├── en.vtt
        └── es.vtt

Metadata Database
CREATE TABLE videos (
id UUID PRIMARY KEY,
creator_id BIGINT NOT NULL REFERENCES users(id),
title VARCHAR(200) NOT NULL,
description TEXT,
duration_sec INT,
status ENUM('uploading','processing','ready','failed','removed')
DEFAULT 'uploading',
visibility ENUM('public','unlisted','private') DEFAULT 'public',
storage_path VARCHAR(500), -- S3 prefix for transcoded files
original_path VARCHAR(500), -- S3 path for raw upload
view_count BIGINT DEFAULT 0,
like_count BIGINT DEFAULT 0,
created_at TIMESTAMP DEFAULT NOW(),
published_at TIMESTAMP
);
CREATE INDEX idx_creator ON videos(creator_id, created_at DESC);
CREATE INDEX idx_status ON videos(status) WHERE status = 'processing';
CREATE INDEX idx_trending ON videos(view_count DESC, published_at DESC);
CREATE TABLE video_renditions (
video_id UUID NOT NULL REFERENCES videos(id),
resolution VARCHAR(10) NOT NULL, -- '1080p', '720p', '480p'
bitrate_kbps INT NOT NULL,
codec VARCHAR(20) NOT NULL, -- 'h264', 'h265', 'vp9', 'av1'
format ENUM('hls','dash') NOT NULL,
segment_count INT,
total_size_mb INT,
playlist_path VARCHAR(500),
PRIMARY KEY (video_id, resolution, format)
);
CREATE TABLE watch_history (
user_id BIGINT NOT NULL,
video_id UUID NOT NULL,
watched_at TIMESTAMP DEFAULT NOW(),
watch_duration INT, -- seconds watched
last_position INT, -- resume position in seconds
PRIMARY KEY (user_id, video_id)
);

Streaming: Adaptive Bitrate (ABR)
Adaptive bitrate streaming is the key to smooth playback across varying network conditions. The player dynamically switches between quality levels based on available bandwidth.
The player first fetches the master playlist (master.m3u8) from the CDN. This file lists all available quality levels with their bitrates.

#EXTM3U
#EXT-X-VERSION:6
#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2"
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=854x480,CODECS="avc1.4d401e,mp4a.40.2"
480p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360,CODECS="avc1.42e01e,mp4a.40.2"
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=426x240,CODECS="avc1.42e00a,mp4a.40.2"
240p/playlist.m3u8

Not all video segments are equally popular. The first few segments of a video are requested most often (many users click a video and leave within seconds). CDNs should prioritize caching early segments. For long-tail content that is rarely watched, the CDN will issue an origin pull on the first request, then cache locally for subsequent viewers. Popular videos are pre-warmed to edge locations before they trend.
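The player-side selection logic against a master playlist like the one above can be sketched as follows. The 0.8 safety factor is a common heuristic, not a fixed standard; real players also smooth throughput measurements and account for buffer level:

```python
# Variant bitrates (bps) from the master playlist, highest first.
VARIANTS = [
    (6_000_000, "1080p"), (3_000_000, "720p"), (1_500_000, "480p"),
    (800_000, "360p"), (400_000, "240p"),
]

def pick_variant(measured_bps: float, safety: float = 0.8) -> str:
    """Choose the highest rendition whose bitrate fits within a safety
    margin of the measured throughput; fall back to the lowest."""
    budget = measured_bps * safety
    for bandwidth, name in VARIANTS:
        if bandwidth <= budget:
            return name
    return VARIANTS[-1][1]   # even the lowest doesn't fit: serve it anyway

print(pick_variant(10_000_000))  # fast connection -> 1080p
print(pick_variant(2_000_000))   # 1.6 Mbps budget -> 480p
```

The player re-evaluates this after every downloaded segment, which is what produces the mid-stream quality switches viewers see.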
Content Recommendation
The recommendation engine drives engagement by surfacing relevant videos. At scale, recommendations account for over 70% of all video views on platforms like YouTube.
Collaborative Filtering
- "Users who watched X also watched Y."
- Build a user-item interaction matrix from watch history.
- Use matrix factorization (ALS) or neural collaborative filtering to find latent factors.
- Good for discovering content outside a user's usual interests.
- Cold start problem: cannot recommend for new users or new videos with no watch data.
Content-Based Filtering
- "This video is similar to others you have watched."
- Extract features from video metadata: title, tags, description, category, creator.
- Compute similarity scores using TF-IDF or embedding vectors.
- Works well for new users (uses explicit preferences) and new content.
- Tends to create "filter bubbles": recommending only similar content.
In practice, production systems use a two-stage approach:
- Candidate generation: A lightweight model retrieves hundreds of candidate videos from a pool of millions (using approximate nearest neighbors on embedding vectors).
- Ranking: A heavier model (deep neural network) scores each candidate based on features like watch history, time of day, device, video freshness, creator affinity, and predicted watch time. The top results are served.
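The two-stage shape can be sketched in a few lines. The embeddings, catalog, and freshness bonus here are toy stand-ins: production systems retrieve candidates from an approximate-nearest-neighbor index and rank with a learned model:

```python
def dot(a, b):
    """Similarity between a user vector and a video embedding."""
    return sum(x * y for x, y in zip(a, b))

# Toy video embeddings (in production: learned vectors in an ANN index).
CATALOG = {
    "vid_a": [0.9, 0.1], "vid_b": [0.8, 0.3],
    "vid_c": [0.1, 0.9], "vid_d": [0.2, 0.8],
}

def recommend(user_vec, watched, freshness, k=2):
    # Stage 1: candidate generation -- cheap similarity retrieval,
    # over-fetching so the ranker has something to reorder.
    candidates = sorted(
        (v for v in CATALOG if v not in watched),
        key=lambda v: dot(user_vec, CATALOG[v]), reverse=True)[: k * 2]
    # Stage 2: ranking -- a heavier score over richer features
    # (similarity + freshness here stands in for a DNN's prediction).
    ranked = sorted(
        candidates,
        key=lambda v: dot(user_vec, CATALOG[v]) + 0.1 * freshness.get(v, 0),
        reverse=True)
    return ranked[:k]

print(recommend([1.0, 0.0], watched={"vid_a"}, freshness={"vid_d": 5}))
```

Note how the freshness bonus lets vid_d outrank the more-similar vid_c: the ranking stage exists precisely to blend signals the retrieval stage ignores.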
Cost Optimization
Storage and bandwidth dominate costs at scale. Key strategies to control expenses:
| Strategy | Savings | Trade-off |
|---|---|---|
| Storage tiering (hot/warm/cold) | 60-80% on storage | Cold videos take seconds to minutes to retrieve on first access |
| Codec optimization (H.265/AV1) | 30-50% bitrate reduction at same quality | Higher transcoding cost, older devices may not support |
| Lazy transcoding | Save compute on never-watched videos | First viewer of a rare video experiences delay |
| CDN caching | 90%+ reduction in origin bandwidth | Cache invalidation complexity |
| Delete low-value renditions | 20-40% storage reduction | If requested, must re-transcode from raw |
| Per-title encoding | 20-30% bitrate reduction | Requires per-video encoding analysis (extra compute) |
Instead of transcoding every uploaded video into all resolutions immediately, transcode only the most common resolutions (720p, 480p) upfront. Higher resolutions (1080p, 4K) are transcoded on-demand when a viewer requests them, then cached. This dramatically reduces compute costs since many videos are never watched in high resolution, and some are never watched at all.
Step 5: Scaling & Optimizations
- Upload scaling: Use pre-signed URLs to upload directly to S3, bypassing your servers entirely. The upload service only handles metadata and orchestration, not data transfer.
- Transcoding scaling: Use spot/preemptible GPU instances for transcoding (70-90% cost savings). Jobs are idempotent and restartable, so preemption is safe. Auto-scale worker pools based on queue depth.
- CDN multi-layer caching: Use a two-tier CDN: edge PoPs (200+ locations) for hot content, and regional mid-tier caches to reduce origin load for warm content. Cache hit ratios should exceed 95%.
- Database scaling: Separate the read-heavy metadata queries (video info, search) from write-heavy analytics (view counts, watch history). Use read replicas for metadata. Use Redis for real-time view count aggregation, flushing to the database periodically.
- Search: Index video metadata in Elasticsearch for full-text search. Use separate indices for titles, tags, and descriptions with boosted relevance scoring. Auto-complete and typo correction via n-gram tokenizers.
- Live streaming extension: For live content, replace the transcode pipeline with real-time encoders (OBS -> RTMP ingest -> live transcoder -> HLS/DASH segments pushed to CDN in near-real-time). Latency target: 3-10 seconds.
- View count accuracy: At billions of views per day, real-time counting is expensive. Use a write-back cache: increment in Redis, flush to PostgreSQL every 30 seconds. Accept slight inconsistency in displayed counts.
Architecture Summary
| Component | Technology | Purpose |
|---|---|---|
| Upload Service | API + S3 multipart | Chunked, resumable video upload |
| Raw Storage | S3 (Glacier archive) | Durable storage of original files |
| Transcode Queue | Kafka / SQS | Decouple upload from processing |
| Transcoding Pipeline | FFmpeg on GPU workers | DAG: split, encode, package HLS/DASH |
| Video Storage | S3 (tiered) | Transcoded segments, thumbnails, subtitles |
| Metadata DB | PostgreSQL + read replicas | Video info, renditions, watch history |
| CDN | CloudFront / Akamai | Edge caching, global low-latency delivery |
| Streaming API | REST API | Auth, manifest URLs, playback tokens |
| Recommendation | ML pipeline (ALS + DNN) | Personalized video suggestions |
| Search | Elasticsearch | Full-text video search with autocomplete |
Key Takeaways
- Video streaming is dominated by storage and bandwidth costs. Every architectural decision (codec choice, storage tiering, CDN caching, lazy transcoding) should be evaluated through a cost lens.
- The transcoding pipeline as a DAG enables massive parallelism. Splitting video into segments and encoding each independently reduces processing time from hours to minutes.
- Adaptive bitrate streaming (HLS/DASH) is essential for smooth playback. The player dynamically adjusts quality based on network conditions, preventing buffering while maximizing visual quality.
- CDN is not optional: it is a core architectural component. At scale, 95%+ of all video bytes should be served from edge caches, not the origin. Pre-warm popular content and use tiered caching for the long tail.
- Separate the upload path from the playback path entirely. They have different availability requirements, scaling characteristics, and failure modes. Playback must be 99.99% available; upload can tolerate occasional delays.