Design Case Studies · Lesson 07
Design: Video Streaming API
Uploading a 4K film and streaming it to a billion simultaneous viewers are two completely different problems — yet a video platform must solve both in the same system. This case study traces the full pipeline from the creator's browser through transcoding workers to the viewer's adaptive player, exposing the API design decisions that make each stage work at scale.
By the end you'll be able to
- Sketch the upload → transcode → CDN delivery pipeline and name the API surface at each stage.
- Explain why presigned URLs, async transcoding, and HLS/DASH manifests each exist, and what breaks if you skip them.
- Work through CDN hit-ratio math and articulate the origin-load savings it produces.
Requirements
Before drawing any boxes, pin down what the system must do. Video platforms are deceptively complex because they sit at the intersection of three very different workloads: large-file ingest, CPU-intensive batch processing, and globally distributed read-heavy delivery. Getting requirements wrong means you optimize for the wrong bottleneck.
Functional requirements
- Upload large video files — creators submit raw footage up to tens of gigabytes. The upload must be resumable: if the connection drops at 95%, the creator should not restart from zero.
- Transcode to multiple resolutions — the raw upload gets converted into at least four renditions: 360p, 720p, 1080p, and 4K where the source permits. Each rendition is cut into short segments (typically 2–6 seconds) for adaptive streaming.
- Adaptive playback via HLS or MPEG-DASH — the player continuously monitors available bandwidth and switches renditions mid-stream without interrupting playback. This requires a manifest file that describes every available quality tier and where its segments live.
- View count tracking at massive scale — billions of view events per day. Real-time exact counts are unnecessary and expensive; approximate counts updated every few minutes are fine.
Non-functional requirements
- Huge read-to-write ratio: a single uploaded video may be watched hundreds of millions of times, so the system must be optimized almost entirely for reads. Back-of-envelope: YouTube ingests ~500 hours of uploaded video per minute (about 720,000 hours per day) but serves roughly a billion hours of watch time per day. That is on the order of 1,400 hours watched for every hour uploaded, and each watched hour fans out into roughly 900 four-second segment requests.
- Low start latency — the first segment must start playing within ~2 seconds on a good connection. CDN edge caching of the first few segments is not optional.
- Transcode SLA — a 10-minute 1080p video should be available in all resolutions within 5 minutes of publish. A 2-hour 4K film may take 30–60 minutes. The upload response must not block on this.
Design decisions
Every major decision in this system comes with a "why" — and for each one, there is a common alternative that fails at scale. Interview panels reward candidates who explain what breaks before explaining what they chose.
Decision 1: Presigned URLs for upload (not API-proxied upload)
Naively, a creator POSTs their video to your API server, which writes it to S3. This works for 10 MB profile photos. For a 20 GB film, it means every byte crosses your API fleet's network twice: once inbound from the creator and once outbound to object storage. Your API servers become the bottleneck, you pay for bandwidth you never needed to carry, and a single slow upload ties up a connection slot for as long as it runs.
The fix: the API server issues a presigned URL — a time-limited, signed S3/GCS endpoint — and returns it to the client. The client uploads directly to object storage, bypassing the API server entirely. The API server sees only the tiny metadata request and the final "upload complete" callback. See the File Upload case study (cs-02) for the full presigned URL pattern including resumable chunked uploads.
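The idea behind a presigned URL can be sketched with a plain HMAC. This is an illustration of the concept only, not the S3 or GCS signing algorithm; `SECRET`, `sign_upload_url`, and `verify_upload_url` are hypothetical names, and a real deployment would use the storage provider's SDK to generate the URL.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical shared secret; a real service would load this from a secret store.
SECRET = b"demo-signing-secret"


def sign_upload_url(base_url: str, video_id: str, expires_at: int) -> str:
    """Return a URL that authorizes a PUT for video_id until expires_at (Unix seconds)."""
    message = f"PUT\n{video_id}\n{expires_at}".encode()
    signature = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires_at, "signature": signature})
    return f"{base_url}/{video_id}?{query}"


def verify_upload_url(url: str, now: int) -> bool:
    """Storage-side check: the signature must match and the URL must not be expired."""
    parsed = urlparse(url)
    video_id = parsed.path.lstrip("/")
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    message = f"PUT\n{video_id}\n{expires_at}".encode()
    expected = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected) and now < expires_at


issued_at = 1_750_000_000  # arbitrary example timestamp
url = sign_upload_url("https://uploads.example.com", "vid_2wXk9mPqRn7v", issued_at + 3600)
print(verify_upload_url(url, issued_at + 10))
```

The point of the sketch is that the storage layer can validate the request on its own: the API server only computes a signature and never sees the upload body.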
Decision 2: Async transcode pipeline (not synchronous in-request)
Transcoding a 4K video takes minutes of wall-clock time and gigabytes of scratch disk. Doing it synchronously — holding the HTTP connection open until the job finishes — would mean upload requests time out, clients implement complex retry logic, and a transcode backlog crashes your API. Instead, completing the upload triggers an event on a queue. Transcode workers pull jobs, process in parallel, and emit completion events when done. The API responds immediately with a 202 Accepted and a job handle. Clients poll or receive a webhook. See the Event-driven & Pub/Sub lesson (rel-10) for the queue mechanics.
Decision 3: CDN delivery for segments (not direct origin serving)
Once transcoded, video segments are immutable bytes that never change. CDN edge nodes are designed precisely for serving immutable, cacheable content to geographically distributed audiences. Serving every segment request from your origin would require network capacity and spend on the scale of a CDN, without the geographic proximity that makes a CDN fast. The CDN handles the read surge; the origin only exists to fill cache misses. See the Caching lesson (rel-07) for CDN cache semantics including Cache-Control: public, max-age=31536000, immutable for segment files.
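As a rough sketch, the caching policy can be expressed as a function from object path to Cache-Control value. The function name and the `no-store` fallback are assumptions for illustration; the 30-second manifest TTL and the immutable segment header are the values this lesson uses.

```python
def cache_control_for(path: str) -> str:
    """Pick a Cache-Control header for an object served from the CDN domain."""
    if path.endswith(".m3u8"):
        # Manifests are short-lived so newly published renditions show up quickly.
        return "public, max-age=30"
    if path.endswith((".ts", ".m4s", ".mp4")):
        # Segments never change once written, so edges may keep them for a year.
        return "public, max-age=31536000, immutable"
    # Assumed default for anything else: do not cache.
    return "no-store"


print(cache_control_for("vid_2wXk9mPqRn7v/1080p/seg_0001.ts"))
print(cache_control_for("vid_2wXk9mPqRn7v/manifest.m3u8"))
```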
Decision 4: HLS/DASH manifests for adaptive streaming
A plain progressive MP4 download serves one fixed bitrate, and seeking works reliably only when the file's index sits at the front of the file and the server honors byte-range requests. Adaptive streaming solves three problems at once: bandwidth adaptation (drop from 1080p to 360p mid-stream when the network degrades), fast start (buffer only the first 2–4 segments before playing), and seeking (jump directly to the segment containing the target timestamp). HLS uses a .m3u8 playlist; MPEG-DASH uses an XML .mpd manifest. Both index the same underlying segments.
Decision 5: Webhook + polling for transcode status
Clients need to know when a video is ready. Two mechanisms work together: polling for creator dashboards that can tolerate a GET request every 5 seconds, and webhooks for server-to-server integrations that want push notification. Never make the client wait on a long-poll — transcode jobs take minutes and connection timeouts make long-polling unreliable for this duration.
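A dashboard-side polling loop might look like the sketch below. `wait_until_ready` and its parameters are hypothetical; `fetch_status` stands in for whatever HTTP call returns the job's status string, and the terminal states are assumed to be `complete` and `failed`.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status every interval_s seconds until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in ("complete", "failed"):
            return status
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError("transcode did not finish before the polling deadline")
        sleep(interval_s)


# Usage with a canned sequence of responses instead of real HTTP calls.
canned = iter(["queued", "processing", "processing", "complete"])
print(wait_until_ready(lambda: next(canned), sleep=lambda _s: None))
```

Each poll is a short, independent request, so a dropped connection costs one retry rather than the whole wait.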
The API model
Six endpoints cover the full lifecycle. Note that the HLS manifest and segments live on the CDN domain — they are not routes on your API server.
POST /v1/videos — initiate upload
The creator sends metadata; the API returns a video ID and a presigned upload URL. No video bytes travel through the API server.
// Request
POST /v1/videos
Authorization: Bearer {token}
Content-Type: application/json
{
"title": "Climbing the Eiger: North Face",
"description": "Solo ascent, summer 2025.",
"content_type": "video/mp4",
"file_size": 8472983040 // bytes; used to configure multipart upload
}
// Response 201 Created
{
"video_id": "vid_2wXk9mPqRn7v",
"status": "awaiting_upload",
"upload_url": "https://uploads.example-cdn.com/vid_2wXk9mPqRn7v?X-Amz-Signature=...",
"upload_method": "PUT",
"upload_expires_at": "2026-06-20T17:30:00Z" // presigned URL TTL: 1 hour
}
PUT {upload_url} — chunked upload direct to object storage
The client PUTs to the presigned URL. For large files, S3 multipart upload allows up to 10,000 parts of 5 MiB to 5 GiB each (the last part may be smaller), and each part is its own signed PUT. The API server sees none of these bytes.
// Multipart part (repeated for each 50 MB chunk)
PUT https://uploads.example-cdn.com/vid_2wXk9mPqRn7v?partNumber=3&uploadId=xKFdH...
Content-Length: 52428800
Content-Type: video/mp4
[binary chunk body]
// Response 200 from object storage
ETag: "d8e8fca2dc0f896fd7cb4cb0031ba249" // save for CompleteMultipartUpload
POST /v1/videos/:id/publish — trigger transcode
After all parts are uploaded and CompleteMultipartUpload succeeds, the client calls publish. This moves the video out of draft and enqueues the transcode job. Returns immediately with 202.
// Request
POST /v1/videos/vid_2wXk9mPqRn7v/publish
Authorization: Bearer {token}
// Response 202 Accepted
{
"video_id": "vid_2wXk9mPqRn7v",
"status": "transcoding",
"transcode_job_id": "tjob_8RpQn3LvMz1w",
"estimated_completion_at": "2026-06-20T17:25:00Z"
}
GET /v1/videos/:id — fetch video metadata
Returns the canonical video record: manifest URL, metadata, current transcode status, and available renditions once processing completes.
// Request
GET /v1/videos/vid_2wXk9mPqRn7v
Authorization: Bearer {token}
// Response 200 OK (video ready)
{
"video_id": "vid_2wXk9mPqRn7v",
"title": "Climbing the Eiger: North Face",
"status": "ready",
"duration_s": 4320,
"view_count": 142873,
"manifest_url": "https://cdn.example.com/vid_2wXk9mPqRn7v/manifest.m3u8",
"thumbnail_url": "https://cdn.example.com/vid_2wXk9mPqRn7v/thumb.jpg",
"renditions": [
{ "quality": "360p", "bitrate_kbps": 400 },
{ "quality": "720p", "bitrate_kbps": 2500 },
{ "quality": "1080p", "bitrate_kbps": 5000 },
{ "quality": "4K", "bitrate_kbps": 18000 }
]
}
GET /v1/videos/:id/transcode-status — poll async job
Fine-grained progress for creator dashboards. It is kept separate from GET /v1/videos/:id so that frequent progress polling does not defeat caching of the main resource, which can be cached aggressively once the video is ready.
// Response 200 OK (in-progress)
{
"job_id": "tjob_8RpQn3LvMz1w",
"status": "processing",
"progress_pct": 62,
"current_pass": "1080p",
"passes_done": 2,
"passes_total": 4,
"eta_seconds": 183
}
// Response 200 OK (complete)
{
"job_id": "tjob_8RpQn3LvMz1w",
"status": "complete",
"progress_pct": 100,
"completed_at": "2026-06-20T17:22:47Z"
}
GET /{id}/manifest.m3u8 — HLS manifest (CDN, not API server)
This is not an API route. It is a static file stored in the origin bucket and served through the CDN. The URL is returned in the video metadata; the player fetches it independently.
# HLS master playlist — served from cdn.example.com, not api.example.com
#EXTM3U
#EXT-X-VERSION:3
# Each variant stream points to a sub-playlist of 4-second segments
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=640x360,CODECS="avc1.42c01e,mp4a.40.2"
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2"
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=18000000,RESOLUTION=3840x2160,CODECS="avc1.640033,mp4a.40.2"
4k/playlist.m3u8
When asked "design YouTube" in a system design interview, examiners want to hear three pipeline stages with specific justifications: (1) upload via presigned URL so your API fleet never touches video bytes; (2) async transcode via a job queue so the upload response is immediate and transcode workers scale independently; (3) CDN delivery with a target hit ratio above 95% so origin load is a rounding error. Bonus points: mention that the manifest file and segments use separate Cache-Control TTLs — manifests are short-lived (30 s) while segment files are immutable (max-age=31536000). Tying CDN hit ratio back to a latency budget (1st byte <50 ms from edge vs. ~200 ms from origin) demonstrates systems thinking beyond "add a CDN."
Diagrams
Two mistakes appear so often in system design interviews that they are worth calling out explicitly.
Proxying video bytes through your API server is the first. Your API fleet has finite network bandwidth — typically a few Gbps per instance. A single 4K stream at 18 Mbps times 10,000 concurrent viewers = 180 Gbps. No API fleet handles that without becoming the single most expensive line item on your AWS bill. Object storage + CDN costs a fraction of the equivalent API fleet bandwidth.
Synchronous transcoding inside the upload handler is the second. Transcoding a 1-hour 4K video can take 20–40 minutes of CPU time. Holding the HTTP connection open for that duration means: every client needs to implement timeout/retry logic; a transcode backlog ties up API worker threads; a single slow job degrades every concurrent upload. Async queues exist precisely for workloads with unbounded wall-clock duration.
Use three distinct hostnames for the three traffic types: api.example.com for your REST API, uploads.example.com (or a direct S3/GCS URL) for object storage ingest, and cdn.example.com for media delivery. This separation buys you several things: you can apply different rate limits and autoscaling policies per domain; CDN caching rules apply cleanly to cdn.example.com without accidentally caching API responses; and your TLS certificates, WAF rules, and CORS headers stay untangled. Never route media bytes through api.example.com — the second you do, your CDN configuration becomes impossible to reason about.
Evaluation & latency budget
CDN offload math
Video segments are the definition of cacheable: they are immutable (the bytes for the 00:02:00–00:02:04 window of a given rendition never change), they are large (a 4-second 1080p segment at 5 Mbps is ~2.5 MB), and they are popular at the head of the view distribution. Once a segment is in a CDN edge node's cache, every subsequent viewer in that region gets it without touching the origin.
Suppose a video has 1,000,000 views. Each view fetches an average of 30 segments (2 minutes of a 72-minute film). That is 30,000,000 segment requests. With a 98% CDN hit ratio:
- Origin serves: 30,000,000 × 0.02 = 600,000 requests
- CDN serves: 29,400,000 requests at edge latency (<20 ms)
- Without CDN: 30,000,000 requests at origin — 50× the origin fleet required
The first viewer per region per segment pays the origin-fetch cost (~150 ms). All subsequent viewers pay edge cost (~15 ms). Popular videos reach effective hit ratios above 99.5% because the one cold miss per edge per segment is amortized over far more requests: a video with 100M views warms each edge cache within its first few viewers in that region.
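The offload arithmetic above is simple enough to script, which makes it easy to try other hit ratios. This is a small sketch; `origin_offload` is a hypothetical helper name.

```python
def origin_offload(views: int, segments_per_view: int, hit_ratio: float) -> dict:
    """Split total segment requests into edge hits and origin fetches."""
    total = views * segments_per_view
    origin = round(total * (1 - hit_ratio))
    edge = total - origin
    # How many times larger the origin load would be with no CDN at all.
    reduction = total / origin if origin else float("inf")
    return {"total": total, "origin": origin, "edge": edge, "reduction_factor": reduction}


result = origin_offload(views=1_000_000, segments_per_view=30, hit_ratio=0.98)
print(result)
```

Raising the hit ratio from 98% to 99.5% cuts origin requests by a further factor of four, which is why the last few points of hit ratio matter so much operationally.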
Async transcode: why synchronous would fail
Consider a 2-hour 4K film. A single-pass transcode running at ~6× real time takes ~20 minutes per quality tier, so 4 tiers take ~80 minutes sequentially or ~20 minutes in parallel. The idle timeout on most load balancers is 60–300 seconds, so the upload handler would time out before the first rendition finishes. Even with long-polling or WebSockets, holding a server connection for 20 minutes ties up a worker thread (in a thread-per-request server) that cannot serve other requests. At 10,000 concurrent uploads, that is 10,000 idle-but-blocked threads, which is exactly the scenario job queues are designed to avoid.
The async pattern trades a single complex HTTP response for a simple 202 + a job-status poll. The complexity moves out of the HTTP layer and into the queue worker, where it belongs.
Latency breakdown table
| Request type | Served from | Typical latency (P50) | Typical latency (P99) |
|---|---|---|---|
| HLS master manifest (first load) | CDN edge (short TTL — 30 s) | 18 ms | 55 ms |
| Video segment (CDN hit) | CDN edge (immutable, long TTL) | 12 ms | 35 ms |
| Video segment (CDN miss, origin fill) | Object storage via CDN | 155 ms | 420 ms |
| GET /v1/videos/:id (metadata) | API server + Redis cache | 28 ms | 90 ms |
| POST /v1/videos/:id/publish | API server (writes DB + enqueues) | 45 ms | 140 ms |
Back-of-envelope: view scale
YouTube serves roughly 1 billion hours of video per day. That is approximately 3.6 × 10^12 seconds of playback per day, or ~41.7 million concurrent streams on average (3.6 × 10^12 s ÷ 86,400 s per day). At an average bitrate of 2 Mbps (mix of mobile 360p and desktop 1080p), that is ~83 Tbps of sustained CDN egress. No single origin data center can deliver that. CDN edge distribution is not an optimization; it is the architecture.
For a mid-scale platform with 10 million daily active viewers averaging 30 minutes of watch time per day:
- Daily playback seconds: 10M × 30 × 60 = 18 billion seconds
- Avg bitrate 1.5 Mbps: 18 × 10^9 s × 1.5 Mbps ÷ 8 bits per byte ≈ 3.4 PB/day egress
- CDN cost at $0.01/GB: ~$33,750/day, about a third of a cent per viewer per day
- Origin-only cost at $0.09/GB: ~$303,750/day, 9× more and still slower for users
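Bit-to-byte and unit-prefix conversions are the usual place where estimates like this go wrong, so it can help to script them. The sketch below uses the mid-scale inputs from the list above; `daily_egress` is a hypothetical helper, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
def daily_egress(viewers: int, minutes_per_day: int, avg_bitrate_mbps: float,
                 price_per_gb: float) -> dict:
    """Estimate one day of playback volume and egress cost."""
    playback_seconds = viewers * minutes_per_day * 60
    # Mbps -> bits per second, then bits -> bytes -> gigabytes (decimal units).
    total_bits = playback_seconds * avg_bitrate_mbps * 1_000_000
    gigabytes = total_bits / 8 / 1e9
    return {
        "playback_seconds": playback_seconds,
        "egress_tb": gigabytes / 1000,
        "cost_usd": gigabytes * price_per_gb,
    }


cdn = daily_egress(10_000_000, 30, 1.5, price_per_gb=0.01)
origin_only = daily_egress(10_000_000, 30, 1.5, price_per_gb=0.09)
print(cdn)
print(origin_only)
```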
Under the hood: the transcode pipeline
The words "upload → transcode → CDN" hide a multi-stage pipeline with specific data structures at every handoff. Here is how each stage actually works.
Stage 1: chunked upload to object storage
The client uses S3 multipart upload. It calls CreateMultipartUpload to receive an uploadId, then PUTs each 50–500 MB chunk as a numbered part. The object storage returns an ETag (MD5 of the part bytes) for each part. When all parts are uploaded, the client calls CompleteMultipartUpload with the ordered list of (partNumber, ETag) pairs. Object storage atomically assembles the parts into a single object. If the connection drops mid-upload, the client resumes from the last successful part number — only that part's bytes need to be retransmitted.
# Multipart state machine
CreateMultipartUpload → uploadId = "xKFdH..."
UploadPart part 1 → ETag: "abc123"
UploadPart part 2 → ETag: "def456"
...
UploadPart part N → ETag: "xyz789"
CompleteMultipartUpload(uploadId, [(1,"abc123"), (2,"def456"), ...(N,"xyz789")])
→ raw video object assembled atomically in object storage
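Client-side, the first step is deciding where the part boundaries fall. The following is a minimal sketch with hypothetical helper names (`plan_parts`, `remaining_parts`); the size limits are the S3 multipart limits, and the 50 MiB part size matches the example request earlier in this lesson.

```python
MIN_PART = 5 * 1024**2   # 5 MiB minimum (the last part may be smaller)
MAX_PART = 5 * 1024**3   # 5 GiB maximum
MAX_PARTS = 10_000


def plan_parts(file_size: int, part_size: int = 50 * 1024**2) -> list[tuple[int, int, int]]:
    """Return (part_number, byte_offset, length) for every part of the upload."""
    if not MIN_PART <= part_size <= MAX_PART:
        raise ValueError("part_size outside the allowed range")
    parts = []
    offset = 0
    number = 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts


def remaining_parts(parts, uploaded_numbers: set[int]):
    """After a dropped connection, only parts without a recorded ETag need sending."""
    return [p for p in parts if p[0] not in uploaded_numbers]


parts = plan_parts(8_472_983_040)  # file_size from the POST /v1/videos example
print(len(parts), parts[-1])
```

Because each part is addressed by number and byte offset, resuming is just a matter of filtering out the parts whose ETags the client already holds.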
Stage 2: job queue and transcode workers
Object storage emits an s3:ObjectCreated event (or equivalent) when CompleteMultipartUpload succeeds, which confirms that the raw object exists. When the creator then calls publish, a message is placed on a queue (SQS, Cloud Pub/Sub) and a transcode job record is created in a database table with this schema:
-- Transcode job record (simplified)
job_id TEXT PRIMARY KEY, -- "tjob_8RpQn3LvMz1w"
video_id TEXT NOT NULL,
status TEXT NOT NULL, -- queued | processing | complete | failed
renditions JSONB, -- per-rendition progress: [{quality:"1080p", status:"done", pct:100}, ...]
attempt INT DEFAULT 0,
max_attempts INT DEFAULT 3,
worker_id TEXT, -- which machine owns this job
locked_until TIMESTAMPTZ, -- distributed lock; worker must renew or job is re-queued
created_at TIMESTAMPTZ,
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ
A transcode worker pulls the job, sets worker_id and locked_until = now() + 10min, and begins FFmpeg transcoding. If the worker dies without renewing the lock, a supervisor re-queues the job for another worker. This prevents lost jobs without a single coordinator.
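The claim-and-renew logic can be sketched against an in-memory job record. This is an illustration under assumptions: `Job`, `try_claim`, and `renew` are hypothetical names, and in a real system each function would be a single conditional UPDATE against the job table so that the check and the write happen atomically.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 600  # matches locked_until = now() + 10min


@dataclass
class Job:
    job_id: str
    status: str = "queued"          # queued | processing | complete | failed
    attempt: int = 0
    max_attempts: int = 3
    worker_id: Optional[str] = None
    locked_until: float = 0.0


def try_claim(job: Job, worker_id: str, now: float) -> bool:
    """Claim a queued job, or take over one whose lease has expired."""
    lease_expired = job.status == "processing" and job.locked_until <= now
    if job.status != "queued" and not lease_expired:
        return False
    if job.attempt >= job.max_attempts:
        job.status = "failed"       # retries exhausted
        return False
    job.status = "processing"
    job.attempt += 1
    job.worker_id = worker_id
    job.locked_until = now + LEASE_SECONDS
    return True


def renew(job: Job, worker_id: str, now: float) -> bool:
    """Extend the lease; fails if this worker no longer owns a live lease."""
    if job.status == "processing" and job.worker_id == worker_id and job.locked_until > now:
        job.locked_until = now + LEASE_SECONDS
        return True
    return False


job = Job("tjob_8RpQn3LvMz1w")
print(try_claim(job, "worker-a", now=0.0), try_claim(job, "worker-b", now=100.0))
```

A worker that stops renewing simply loses the job once `locked_until` passes, so no component has to detect the crash directly.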
Stage 3: producing the rendition ladder
Each worker runs one or more FFmpeg passes. A typical rendition ladder for a 4K source (a 1080p source would stop at the 1080p row):
| Rendition | Resolution | Video codec | Target bitrate | Segment duration |
|---|---|---|---|---|
| 360p | 640×360 | H.264 baseline | 400 kbps | 4 s |
| 720p | 1280×720 | H.264 main | 2,500 kbps | 4 s |
| 1080p | 1920×1080 | H.264 high | 5,000 kbps | 4 s |
| 4K | 3840×2160 | H.265 / VP9 | 18,000 kbps | 4 s |
After FFmpeg produces each rendition as a continuous stream, a segmenter (FFmpeg with -f hls or a dedicated tool like Bento4) cuts it into fixed-duration .ts or .mp4 fragments and writes a per-rendition playlist.m3u8. A segment naming convention like 1080p/seg_0000.ts, 1080p/seg_0001.ts, ... makes each segment addressable independently, enabling the CDN to cache individual 2.5 MB chunks rather than multi-gigabyte files.
Stage 4: writing the HLS master manifest
Once all rendition playlists are written to object storage, the worker assembles the master manifest (manifest.m3u8) that links them together. Each #EXT-X-STREAM-INF line carries the BANDWIDTH (bits/second) and RESOLUTION that the ABR algorithm uses to choose a rendition. The manifest is written last — writing it atomically "publishes" the video: before it exists, any player request 404s; after it exists, the full ladder is reachable.
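Assembling the master manifest is mostly string formatting. The sketch below assumes each rendition is described by a small dict; the field names, `build_master_playlist`, and `publish_master` are hypothetical, and `existing_keys` stands in for a listing of what is already present in object storage.

```python
from typing import Optional


def build_master_playlist(renditions: list[dict]) -> str:
    """Render an HLS master playlist, lowest bandwidth first."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for r in sorted(renditions, key=lambda r: r["bandwidth"]):
        lines.append(
            f'#EXT-X-STREAM-INF:BANDWIDTH={r["bandwidth"]},'
            f'RESOLUTION={r["width"]}x{r["height"]},CODECS="{r["codecs"]}"'
        )
        lines.append(r["uri"])
    return "\n".join(lines) + "\n"


def publish_master(existing_keys: set[str], renditions: list[dict]) -> Optional[str]:
    """Return the manifest text only if every rendition playlist already exists."""
    if any(r["uri"] not in existing_keys for r in renditions):
        return None
    return build_master_playlist(renditions)


LADDER = [
    {"bandwidth": 2500000, "width": 1280, "height": 720,
     "codecs": "avc1.4d401f,mp4a.40.2", "uri": "720p/playlist.m3u8"},
    {"bandwidth": 400000, "width": 640, "height": 360,
     "codecs": "avc1.42c01e,mp4a.40.2", "uri": "360p/playlist.m3u8"},
]
print(publish_master({"360p/playlist.m3u8", "720p/playlist.m3u8"}, LADDER))
```

Returning nothing when a playlist is missing mirrors the "write the manifest last" rule: the video stays unpublished until the whole ladder is in place.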
Stage 5: how adaptive bitrate picks a rendition at playback
The player implements an ABR algorithm. A simple throughput-based algorithm works as follows:
- Player downloads segment N of the current rendition and measures actual download throughput (bytes received / time taken).
- It applies a safety factor (e.g. 0.8×) to avoid oscillation: safe_throughput = measured × 0.8.
- It scans the master manifest's BANDWIDTH values from highest to lowest and picks the first one at or below safe_throughput.
- For segment N+1 it requests the chosen rendition's next segment.
- If the buffer falls below a threshold (e.g. 4 s), it immediately steps down one quality tier regardless of throughput.
The player never interrupts playback during a quality switch — it finishes playing the buffered segments of the old rendition while fetching the new rendition's next segment. Because all renditions share the same GOP (group of pictures) alignment and segment duration, the switch point is seamless.
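The throughput-based steps above can be sketched in a few lines. This is a simplified illustration, not how any particular player implements ABR; `pick_rendition` and its parameter names are hypothetical, and the ladder values are the BANDWIDTH figures from the master playlist shown earlier.

```python
def pick_rendition(ladder_bps: list[int], measured_bps: float, buffer_s: float,
                   current_bps: int, safety: float = 0.8,
                   low_buffer_s: float = 4.0) -> int:
    """Choose the BANDWIDTH of the rendition to request for the next segment."""
    ladder = sorted(ladder_bps)
    if buffer_s < low_buffer_s:
        # Buffer is nearly empty: step down one tier regardless of throughput.
        index = ladder.index(current_bps)
        return ladder[max(index - 1, 0)]
    safe_throughput = measured_bps * safety
    # Scan from highest to lowest and take the first tier that fits.
    for bandwidth in reversed(ladder):
        if bandwidth <= safe_throughput:
            return bandwidth
    return ladder[0]  # nothing fits: fall back to the lowest tier


LADDER_BPS = [400_000, 2_500_000, 5_000_000, 18_000_000]
print(pick_rendition(LADDER_BPS, measured_bps=7_000_000, buffer_s=12.0,
                     current_bps=2_500_000))
```

Note the effect of the safety factor: a measured 3 Mbps yields a safe throughput of 2.4 Mbps, which is below the 720p tier's 2.5 Mbps, so the player stays at 360p rather than risking a stall.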
Worked transcode job trace
Timeline for a 10-minute 1080p upload (7 GB raw file, 4-worker parallel transcode):
Operating & debugging it
The transcode pipeline has four independently observable stages. Most production issues fall into one of them.
Key metrics to monitor
| Metric | Where to observe | Alert threshold (example) |
|---|---|---|
| Queue depth (jobs waiting) | SQS / Pub/Sub console; CloudWatch | >500 jobs queued for >5 min → scale workers |
| Job age (oldest queued job) | Custom metric: now() - created_at for status=queued | >10 min → worker may be stuck or under-provisioned |
| Transcode failure rate | Status=failed jobs / total jobs; log-based metric | >2% failure rate → investigate FFmpeg errors |
| Worker lock renewal failures | Application logs for "lock expired, job re-queued" | Any → worker OOM or crash |
| CDN hit ratio | CDN analytics dashboard; X-Cache HIT/MISS header sampling | <95% → check Cache-Control headers on segments |
| Origin segment request rate | Object storage access logs; CloudFront origin requests | Spike → CDN miss storm, likely new viral video or TTL misconfiguration |
Inspecting a stuck or failed transcode
| Symptom | Likely cause | Fix |
|---|---|---|
| Job stuck in queued for >5 min | No available workers; queue backlog; workers crashed | Scale out worker fleet; check worker logs for OOM or crash; verify queue subscription is active |
| Job repeatedly fails with "lock expired" | Worker machine is too slow or OOM; lock TTL too short | Increase worker instance size; extend lock TTL; ensure lock renewal is running on a background thread |
| Manifest 404 after job shows "complete" | Worker wrote manifest to wrong path; object storage replication lag | Compare expected vs actual manifest key in object storage; check worker path-generation code |
| CDN hit ratio drops suddenly | Cache-Control header removed from segments; CDN config change; new viral video warming edge caches | curl -I a segment URL and check cache-control; new viral video miss storm resolves itself quickly |
| Player stalls at quality switches | GOP misalignment between renditions; segment duration mismatch | Ensure all renditions use the same keyframe interval and segment duration in FFmpeg flags |
| FFmpeg "Invalid data" error | Corrupted or truncated upload; incomplete CompleteMultipartUpload | Re-upload the source file; verify all part ETags before calling CompleteMultipartUpload |
If your worker writes manifest.m3u8 as soon as the first rendition completes, CDN edges worldwide will immediately cache a manifest that references rendition playlists that do not yet exist. Players will fetch those playlists and get 404s for every quality tier above the one that finished first. The master manifest must be written atomically after all rendition playlist.m3u8 files are confirmed present in object storage. In a multi-worker setup, use a distributed counter or barrier: each worker atomically decrements a "renditions remaining" counter; the worker that brings it to zero writes the manifest.
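The "renditions remaining" barrier can be sketched in-process with a lock-protected counter. `RenditionBarrier` is a hypothetical class for illustration; across separate worker machines the same role would be played by an atomic decrement in a shared store (for example Redis DECR or a conditional database update), which is an assumption about deployment rather than something this lesson prescribes.

```python
import threading


class RenditionBarrier:
    """Counts down finished renditions; exactly one caller sees the count reach zero."""

    def __init__(self, total: int) -> None:
        self._remaining = total
        self._lock = threading.Lock()

    def rendition_done(self) -> bool:
        """Record one finished rendition. Returns True only for the final one."""
        with self._lock:
            self._remaining -= 1
            return self._remaining == 0


barrier = RenditionBarrier(total=4)
wrote_manifest = []


def worker(name: str) -> None:
    # ... transcode and upload this rendition's playlist and segments here ...
    if barrier.rendition_done():
        wrote_manifest.append(name)  # only the last finisher writes manifest.m3u8


threads = [threading.Thread(target=worker, args=(q,)) for q in ("360p", "720p", "1080p", "4K")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(wrote_manifest))
```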
Quiz
🧠 Quick check
Q1: Why use presigned URLs for video upload instead of routing the bytes through your API server?
The API server's network capacity is sized for JSON requests, not multi-gigabyte binary transfers. A presigned URL lets the client write directly to object storage — the API server issues only a small signed token, then steps completely out of the data path. Your API fleet stays available for the workload it was sized for.
Q2: Why is adaptive streaming (HLS/DASH) preferable to serving a single MP4 file for video playback?
Adaptive bitrate streaming continuously monitors download throughput and switches between renditions mid-playback — seamlessly stepping from 1080p down to 360p when a mobile viewer moves into weak signal, then back up again. A single MP4 at a fixed bitrate either buffers constantly on a slow connection or under-serves a fast viewer who could get better quality.
Q3: Your video platform achieves a CDN segment cache hit ratio of 98%. What does that mean for your origin servers?
A 98% hit ratio means 98 out of every 100 segment requests are served directly from CDN edge caches without reaching the origin at all. Your origin fleet only processes the remaining 2% — typically the first viewer per region per segment. This is what allows a platform with 10 million concurrent viewers to run on a modest origin cluster rather than a data-center-scale serving fleet.
Practice: design the transcode job schema and completion webhook
You are designing the internal job record that tracks a video transcode, and the webhook payload that fires when the job completes. A well-designed schema here makes retry logic, observability, and client integrations far simpler.
Part 1 — Transcode job record
Design the JSON schema for a transcode job stored in your database. Your schema must support:
- Job identity and association to a video
- Current status with valid state transitions: queued → processing → complete | failed
- Per-rendition progress so the dashboard can show "720p done, 1080p 40%"
- Retry tracking: attempt number and max retries
- Timestamps for audit: created, started, completed/failed
- Worker identity (which machine picked up the job) for debugging stuck jobs
Part 2 — Completion webhook payload
Design the JSON body sent to a registered webhook URL when the transcode job finishes (success or failure). Your payload must:
- Be idempotent: include a stable event ID so receivers can deduplicate retries
- Carry enough context that the receiver never needs to make a follow-up API call for the common case
- Distinguish success from partial success (e.g., 4K failed, lower tiers succeeded) from total failure
- Include the manifest URL if the video is playable
- Be signed with an HMAC-SHA256 header so receivers can verify authenticity
Rubric
- Job schema: includes job_id, video_id, status, renditions[] with per-rendition status/progress, attempt/max_attempts, worker_id, and four timestamps.
- Idempotency: webhook body includes a stable event_id (e.g., evt_{job_id}_{attempt}) so re-deliveries are safe to discard.
- Self-contained payload: body includes video_id, overall status, per-rendition outcome, manifest URL (if playable), and duration so the receiver avoids a round-trip.
- Partial success: status field distinguishes complete (all renditions), partial (some renditions failed), and failed (none usable).
- Security: X-Webhook-Signature: sha256={hmac} header described, with the signing secret documented as per-endpoint rather than global.
- Bonus: the job schema uses a locked_until timestamp for distributed lock management so two workers cannot double-process the same job.
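For the security and idempotency items, a receiver-side sketch may help make the rubric concrete. Everything named here (`sign_webhook`, `handle_webhook`, the example secret, the payload fields) is hypothetical and only one of many acceptable answers; the header format follows the sha256={hmac} convention from the rubric.

```python
import hashlib
import hmac
import json

ENDPOINT_SECRET = b"whsec_demo_per_endpoint"  # hypothetical per-endpoint secret
seen_event_ids: set[str] = set()              # stand-in for a durable dedup store


def sign_webhook(secret: bytes, body: bytes) -> str:
    """Sender side: value for the X-Webhook-Signature header."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def handle_webhook(body: bytes, signature_header: str) -> str:
    """Receiver side: verify the signature, then drop events already processed."""
    expected = sign_webhook(ENDPOINT_SECRET, body)
    if not hmac.compare_digest(expected, signature_header):
        return "rejected"
    event = json.loads(body)
    if event["event_id"] in seen_event_ids:
        return "duplicate"
    seen_event_ids.add(event["event_id"])
    return "processed"


body = json.dumps({
    "event_id": "evt_tjob_8RpQn3LvMz1w_1",
    "video_id": "vid_2wXk9mPqRn7v",
    "status": "complete",
}).encode()
header = sign_webhook(ENDPOINT_SECRET, body)
print(handle_webhook(body, header), handle_webhook(body, header))
```

The signature is computed over the raw request bytes, so a receiver should verify before parsing the JSON, as the sketch does.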
Key takeaways
- Presigned URLs keep the API server out of the data path. Issuing a time-limited signed token and stepping aside is the right architecture for any large-file ingest — the API server's job is orchestration, not bandwidth.
- Async transcoding is non-negotiable for video. Any workload measured in minutes of CPU time belongs in a queue/worker, not an HTTP handler. Respond with 202 + job handle; let clients poll or receive a webhook.
- HLS/DASH solves three problems simultaneously: bandwidth adaptation, fast start via segment pre-buffering, and accurate seeking — none of which plain MP4 progressive download provides reliably.
- CDN hit ratio is the most important operational metric for a video platform. At 98%+ hit rate, origin load is a rounding error. Below ~90%, your origin bill and latency both escalate non-linearly.
- Segment files and manifest files have different optimal TTLs. Segments are immutable, so use max-age=31536000, immutable. Master manifests must expire in ~30 s so newly published renditions become visible promptly.
- Separate API, upload, and CDN domains from day one. Mixing them couples your caching policy, WAF rules, and autoscaling in ways that are expensive to untangle at scale.
Sources & further reading
- Apple HTTP Live Streaming (HLS) specification — the authoritative reference for .m3u8 playlists, segment formats, and encryption.
- MPEG-DASH Industry Forum (DASHIF) — specifications and interoperability guidelines for Dynamic Adaptive Streaming over HTTP.
- AWS S3 Presigned URLs — developer guide — how to generate, scope, and expire signed upload/download URLs for S3.
- Cloudflare Stream documentation — a managed video pipeline that illustrates the upload, transcode, and playback architecture described in this lesson.
- Google / YouTube Engineering Blog — primary source for YouTube-scale infrastructure decisions including transcode pipelines and CDN architecture.