Design Case Studies · Lesson 10
Design: Video Conferencing API
A single "join meeting" click kicks off three separate sub-systems: a REST call that books a room, a WebSocket handshake that negotiates participants, and a UDP media stream that carries the actual video. The skill is keeping these planes separate and letting each one optimise for what it does best.
By the end you'll be able to
- Articulate why video conferencing splits into a control plane and a media plane, and what goes in each.
- Sketch the REST + WebSocket + SFU API surface and explain every message exchanged during join.
- Evaluate the latency budget across the full join flow and identify where each millisecond goes.
1 — Requirements
Before drawing a single endpoint, nail down what this system actually has to do. Video conferencing is a deceptively wide surface.
Functional requirements
- Room lifecycle: create a meeting, generate a shareable link, let others join, end it. A meeting has an id, a host, a title, and a scheduled start time.
- Participant signaling: every join/leave event must propagate to all current participants in real time — you cannot poll for this.
- Media routing: audio and video from each participant must reach every other participant with sub-200 ms end-to-end latency.
- Many participants: target 500 viewers, 25 active cameras. The naive approach of everyone sending to everyone (mesh) breaks down past 4–5 peers.
- Presence: who is muted, who is sharing screen, who has raised their hand — all participants need a consistent view of room state.
Non-functional requirements
- Latency: signaling under 100 ms; audio under 150 ms glass-to-glass; video under 200 ms.
- Reliability: temporary packet loss must not freeze frames — graceful degradation beats accuracy.
- Security: join tokens prevent uninvited guests; media must be encrypted in transit (SRTP/DTLS).
- Scalability: meeting creation is bursty; media load scales with participant count × bitrate, not meeting count.
2 — Design decisions
Decision 1: Separate the control plane from the media plane
This is the central insight. Two different jobs require two different protocols:
- The control plane manages rooms, issues tokens, and tracks who is in a meeting. It is infrequently used, needs a reliable delivery guarantee, and can tolerate 50–100 ms latency. REST over HTTP/2 is a natural fit.
- The media plane carries audio and video frames continuously. It needs sub-50 ms per-hop latency and must tolerate packet loss gracefully — late frames are discarded rather than retransmitted. This rules out TCP. UDP is the only realistic choice. (See the note on TCP vs UDP in Lesson 05 — Sockets.)
Decision 2: WebSocket for signaling
Between the REST control plane and the UDP media plane sits a signaling layer. Signaling messages are small (a few hundred bytes), must arrive in order, and must be pushed to all participants without polling. HTTP long-poll and SSE could handle server-to-client pushes, but participants also need to send to the server mid-call (mute, raise hand). WebSockets provide a persistent, bidirectional channel over TCP — exactly what signaling needs without any polling overhead.
Decision 3: Selective Forwarding Unit (SFU) instead of a mesh
In a peer-to-peer mesh, every participant sends their stream to every other participant. With N participants, each client uploads (N−1) streams. At N = 10, that is 9 simultaneous uploads — saturating a typical home connection. An SFU is a server that each client sends exactly one upstream to; the SFU then selects which streams to forward downstream to each subscriber. Upload cost per client stays fixed at 1× regardless of meeting size.
Decision 4: Short-lived join tokens
Meeting IDs are shareable links — they are not secrets. The join flow therefore requires a separate credential: a short-lived JWT that encodes {"meetingId", "participantId", "role", "exp"}. The client exchanges the meeting ID (plus auth) for a token from the REST API, then presents that token to the media server. The media server validates the token signature without calling back to the REST API — keeping the join hot path off the database.
Decision 5: Presence over the signaling WebSocket
Presence state (muted, hand raised, screen-sharing) is small, changes infrequently, and must be consistent across all participants. Routing it through the signaling WebSocket — not as media — keeps it decoupled from video packets and gives the server a single place to broadcast state changes.
3 — The API model
Control plane (REST)
These endpoints are called once per meeting lifecycle — not per frame.
# Create a meeting (host)
POST /v1/meetings
Authorization: Bearer <user-token>
Content-Type: application/json
{
"title": "Weekly standup",
"scheduled_at": "2025-09-01T09:00:00Z",
"settings": { "max_participants": 25, "waiting_room": true }
}
# Response
HTTP/1.1 201 Created
{
"id": "mtg_9kZxp2",
"join_url": "https://meet.example.com/j/9kZxp2",
"host_key": "hk_..."
}
# Join a meeting — returns a short-lived token + media server address
POST /v1/meetings/mtg_9kZxp2/join
Authorization: Bearer <user-token>
# Response: token is signed by REST API; media server validates it directly
HTTP/1.1 200 OK
{
"participant_id": "pt_Uw83",
"join_token": "eyJhbGc...", // JWT, exp = 5 min
"sfu_endpoint": "wss://sfu-eu-west.example.com",
"signal_endpoint": "wss://sig-eu-west.example.com",
"ice_servers": [ // STUN/TURN for NAT traversal
{ "urls": "stun:stun.example.com:3478" },
{ "urls": "turn:turn.example.com:443", "credential": "..." }
]
}
Signaling plane (WebSocket)
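After joining, the client opens a WebSocket to signal_endpoint and presents the join_token during the initial HTTP upgrade. Native clients can send it in a header; browser clients cannot set arbitrary headers through the WebSocket API, so they typically pass it as a query parameter or in the Sec-WebSocket-Protocol field. All subsequent messages are JSON envelopes typed by "type".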
// Client → server: announce presence and SDP offer
{
"type": "participant.hello",
"participant_id": "pt_Uw83",
"display_name": "Divya",
"sdp_offer": "v=0\r\no=..." // WebRTC SDP for media negotiation
}
// Server → client: SDP answer from SFU
{
"type": "sfu.answer",
"sdp_answer": "v=0\r\no=..."
}
// Server → all: someone joined
{
"type": "room.participant_joined",
"participant": { "id": "pt_Uw83", "display_name": "Divya", "role": "attendee" }
}
// Client → server: presence update
{
"type": "presence.update",
"audio_muted": true,
"hand_raised": false,
"screen_sharing": false
}
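One way a signaling server might validate incoming envelopes before acting on them is sketched below. The function name `parse_envelope` and the required-field table are assumptions for illustration; only the two client-to-server message types shown above are covered.

```python
import json

# Assumption: required fields inferred from the sample messages above.
REQUIRED_FIELDS = {
    "participant.hello": {"participant_id", "display_name", "sdp_offer"},
    "presence.update": set(),
}
PRESENCE_KEYS = {"audio_muted", "hand_raised", "screen_sharing"}


def parse_envelope(raw):
    """Decode one client-to-server frame and check it against its type."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("frame is not valid JSON") from exc
    if not isinstance(msg, dict) or not isinstance(msg.get("type"), str):
        raise ValueError("envelope must be an object with a string 'type'")
    msg_type = msg["type"]
    if msg_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown message type: {msg_type}")
    missing = REQUIRED_FIELDS[msg_type] - set(msg)
    if missing:
        raise ValueError(f"{msg_type} is missing {sorted(missing)}")
    if msg_type == "presence.update":
        # Accept partial updates, but every flag present must be a boolean.
        flags = {k: v for k, v in msg.items() if k in PRESENCE_KEYS}
        if not flags or not all(isinstance(v, bool) for v in flags.values()):
            raise ValueError("presence.update needs at least one boolean flag")
    return msg
```

Rejecting unknown types at the edge keeps media-shaped payloads from ever being routed over the signaling connection by accident.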
4 — Evaluation & latency budget
Control vs media plane: why the separation holds
The REST control plane sees at most one join request per participant per meeting, so its load scales with meeting creation and join events, not with call duration or packet rate. It can live behind a standard HTTP gateway with ordinary database-backed auth. The UDP media plane, by contrast, handles thousands of packets per second per participant (audio at 50 pps, 720p video at ~30 fps). Routing those through an HTTP layer would put every packet behind TCP's ordered delivery: one lost segment holds back everything queued after it for at least a retransmission round trip, often 50 ms or more, which is incompatible with real-time audio perception. Lesson 05 (Sockets) covers why TCP's head-of-line blocking is lethal for media.
Latency budget for the join flow
| Step | Protocol | Typical latency | Notes |
|---|---|---|---|
| POST /meetings/:id/join | HTTPS / TCP | 40–80 ms | One DB read for meeting, JWT signing. Can be cached. |
| WebSocket upgrade | TCP | 20–40 ms | One extra round trip (101 Switching Protocols). |
| SDP offer/answer via signaling | WebSocket | 10–30 ms | Signaling server relays to SFU; answer sent back. |
| ICE candidate gathering | STUN/TURN / UDP | 50–200 ms | Worst case: symmetric NAT requires TURN relay. |
| DTLS handshake | UDP (DTLS 1.2) | 20–60 ms | 1–2 round trips for key exchange. |
| First audio frame | SRTP/UDP | ≈ 140–410 ms total | Budget dominated by ICE, especially behind symmetric NAT. |
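The total in the last row is the sum of the five ranges above it. The short check below reproduces it, assuming the steps run strictly in sequence with no overlap, as the table implies. The names `JOIN_STEPS_MS` and `join_budget` are illustrative.

```python
# (step, best case ms, worst case ms), copied from the table above.
JOIN_STEPS_MS = [
    ("POST /meetings/:id/join", 40, 80),
    ("WebSocket upgrade", 20, 40),
    ("SDP offer/answer via signaling", 10, 30),
    ("ICE candidate gathering", 50, 200),
    ("DTLS handshake", 20, 60),
]


def join_budget(steps):
    """Return (best total, worst total, step with the widest range)."""
    best = sum(lo for _, lo, _ in steps)
    worst = sum(hi for _, _, hi in steps)
    widest = max(steps, key=lambda s: s[2] - s[1])[0]
    return best, worst, widest


best, worst, widest = join_budget(JOIN_STEPS_MS)
print(f"{best}-{worst} ms, most variance: {widest}")
```

ICE gathering has a 150 ms spread between best and worst case, more than the other four steps combined, which is why it dominates the budget.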
SFU fan-out under load
At 25 cameras × 1.5 Mbps each, the SFU handles 37.5 Mbps of inbound media. If it forwards each stream to all 24 other participants: 24 × 25 × 1.5 = 900 Mbps outbound. In practice, SFUs use simulcast (clients send three resolution tiers) and subscriber-side bandwidth estimation to send each recipient only what their connection can absorb. A well-implemented SFU sends each participant at most its own subscribed bandwidth, collapsing the 900 Mbps theoretical figure to roughly the sum of each participant's available downlink.
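The fan-out arithmetic can be checked with a short sketch. The function names and the 10 Mbps downlink figure are assumptions chosen for illustration; real SFUs estimate each subscriber's capacity continuously rather than using a fixed number.

```python
def naive_fanout(senders, bitrate_mbps):
    """Every stream forwarded at full bitrate to every other participant."""
    inbound = senders * bitrate_mbps
    outbound = senders * (senders - 1) * bitrate_mbps
    return inbound, outbound


def capped_fanout(senders, bitrate_mbps, downlinks_mbps):
    """Outbound when no subscriber is sent more than its downlink can absorb."""
    wanted = (senders - 1) * bitrate_mbps
    return sum(min(wanted, downlink) for downlink in downlinks_mbps)


print(naive_fanout(25, 1.5))                 # (37.5, 900.0)
# Assumption: all 25 subscribers have 10 Mbps of usable downlink.
print(capped_fanout(25, 1.5, [10.0] * 25))   # 250.0
```

With these assumed downlinks the SFU's outbound load falls from 900 Mbps to 250 Mbps, because each subscriber wants 36 Mbps of video but can only receive 10.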
Interviewers often ask "why not just use HTTP for everything?" The sharp answer is: HTTP is TCP-based, and TCP guarantees delivery by retransmitting lost packets. In audio, a retransmit arrives 50–100 ms too late to play — you're better off interpolating over the gap. That's why real-time media uses UDP and tolerates loss rather than retransmitting. Signaling is different: a lost "participant left" message would leave the UI wrong. TCP delivery guarantees are worth the slight latency there.
Common mistake: putting media routing through the signaling WebSocket. WebSocket is TCP-backed, so a large video keyframe burst can cause head-of-line blocking that stalls the control messages sharing the same connection. Always run media on a dedicated UDP transport; keep the WebSocket exclusively for signaling events.
Do issue short-lived JWTs from the REST API and validate them on the media server without a callback. Don't have the SFU call the REST API on every participant connect: that couples the media hot path to your database, adds 40–80 ms of latency, and introduces a correlated failure mode in which a database outage blocks every new join and reconnect.
Under the hood: the core mechanism
Three architectural facts determine everything about how a real SFU works. Understanding them lets you reason about scaling, latency, and failure modes without guessing.
Why mesh breaks: N(N-1) streams
In a pure peer-to-peer mesh each participant opens a direct connection to every other participant. With N active cameras, each client uploads N-1 streams and downloads N-1 streams:
| Participants | Streams per client | Total streams in meeting | Upload demand at 1.5 Mbps |
|---|---|---|---|
| 2 | 1 up + 1 down | 2 | 1.5 Mbps |
| 4 | 3 up + 3 down | 12 | 4.5 Mbps (saturates home upload) |
| 10 | 9 up + 9 down | 90 | 13.5 Mbps (impossible on most lines) |
| 25 | 24 up + 24 down | 600 | 36 Mbps — catastrophic |
The SFU collapses the upload cost to exactly 1 stream per sender regardless of participant count. Each client opens one UDP connection to the SFU and sends one stream; the SFU fans it out to all subscribers. Download cost per subscriber equals the number of streams they choose to receive — typically the N-1 other participants — but that load now falls on the SFU's outbound bandwidth, not the sender's upload.
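The rows of the mesh table follow directly from N and the per-stream bitrate. A small sketch (function names are illustrative) that reproduces them and contrasts the SFU case:

```python
def mesh_load(n, bitrate_mbps=1.5):
    """Per-client and total stream counts for a full peer-to-peer mesh."""
    per_client = n - 1
    return {
        "up_per_client": per_client,
        "down_per_client": per_client,
        "total_streams": n * (n - 1),
        "upload_mbps": per_client * bitrate_mbps,
    }


def sfu_load(n, bitrate_mbps=1.5):
    """Same meeting through an SFU: one upload per client, fan-out on the server."""
    return {
        "up_per_client": 1,
        "down_per_client": n - 1,
        "upload_mbps": bitrate_mbps,
        "sfu_outbound_streams": n * (n - 1),
    }


for n in (2, 4, 10, 25):
    print(n, mesh_load(n)["upload_mbps"], sfu_load(n)["upload_mbps"])
```

The total number of forwarded streams is the same N(N-1) in both designs; what changes is who pays for it. In the mesh it comes out of each client's uplink, and with an SFU it comes out of the server's outbound bandwidth.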
Signaling (WebSocket/TCP) vs media (WebRTC/UDP): two completely separate pipes
These are not different message types on the same connection — they are different transport protocols with different network sockets:
| Property | Signaling (WebSocket) | Media (WebRTC/UDP) |
|---|---|---|
| Transport | TCP (inside WS) | UDP (SRTP) |
| Delivery | Guaranteed, ordered | Best-effort, unordered |
| Loss handling | TCP retransmits | Lost frame = interpolate / conceal |
| Message size | Tens to hundreds of bytes | 1–1200 byte RTP packets at 50 pps audio, 30 fps video |
| Volume | Handful of events per second for the whole call | Thousands of packets per second per participant |
| Port | 443 (WSS) | Negotiated via ICE; often 443 UDP via TURN |
A signaling message that delivers a "mute" event 80 ms late is acceptable. An audio packet delivered 80 ms late has already missed its playback slot — it is more disruptive than silence. These fundamentally different tolerance profiles demand separate transports.
Worked trace: 3 participants join in sequence
Alice is already in the room. Bob joins. Then Carol joins. Trace every stream that the SFU handles:
─── Alice joins (already in room) ───────────────────────────────
Alice → SFU: 1 upstream (Alice's video)
SFU stores: track[Alice] = stream_A
─── Bob joins ────────────────────────────────────────────────────
Bob → REST API: POST /v1/meetings/mtg_9kZxp2/join
REST → Bob: { join_token, sfu_endpoint }
Bob → Signaling: WS upgrade + participant.hello + sdp_offer
Signaling → SFU: relay SDP offer
SFU → Signaling: sdp_answer (SFU will accept Bob's upstream + subscribe to Alice)
Signaling → Bob: sfu.answer
# ICE + DTLS handshake establishes Bob ↔ SFU UDP path
Bob → SFU: 1 upstream (Bob's video) ← new stream in
SFU → Bob: Alice's stream (forwarded) ← 1 stream out to Bob
SFU → Alice: Bob's stream (forwarded) ← 1 stream out to Alice
Signaling → Alice: room.participant_joined { id: Bob } ← UI update
Total SFU state: track[Alice], track[Bob]
Streams flowing: 2 upstreams in, 2 downstreams out (one per subscriber)
─── Carol joins ──────────────────────────────────────────────────
Carol → REST, Signaling, SFU: same join flow
Carol → SFU: 1 upstream (Carol's video)
SFU → Carol: Alice's stream + Bob's stream ← 2 streams out to Carol
SFU → Alice: Carol's stream ← Alice now gets 2 downstreams total
SFU → Bob: Carol's stream ← Bob now gets 2 downstreams total
Signaling → Alice, Bob: room.participant_joined { id: Carol }
Final SFU state (3 participants, all cameras on):
Upstreams: 3 (one per sender, fixed at O(1) per client)
Downstreams: 6 (each participant receives 2 others' streams)
Compare mesh: 3 peer connections carrying 6 streams (each client uploads 2 and downloads 2), all on client uplinks
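The stream counts in the trace can be reproduced with a toy model of the SFU's bookkeeping. This is a sketch only: the `Sfu` class and its method names are hypothetical, and it tracks who is subscribed to whom without touching any media.

```python
class Sfu:
    """Toy subscription table: every participant receives everyone but itself."""

    def __init__(self):
        self.tracks = {}         # publisher name -> stream label
        self.subscriptions = {}  # subscriber name -> set of publishers it receives

    def join(self, name):
        # The newcomer subscribes to every existing track (never its own)...
        self.subscriptions[name] = set(self.tracks)
        # ...and every existing participant subscribes to the newcomer.
        for other in self.tracks:
            self.subscriptions[other].add(name)
        self.tracks[name] = f"stream_{name[0]}"

    def counts(self):
        """Return (upstreams into the SFU, downstreams out of the SFU)."""
        upstreams = len(self.tracks)
        downstreams = sum(len(subs) for subs in self.subscriptions.values())
        return upstreams, downstreams


sfu = Sfu()
for person in ("Alice", "Bob", "Carol"):
    sfu.join(person)
    print(person, sfu.counts())
```

Building the newcomer's subscription set before registering its own track is what keeps a sender from being subscribed to itself.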
When simulcast is enabled, each sender actually sends 3 resolution tiers (e.g. 1080p, 360p, 180p). The SFU forwards only the tier that fits each subscriber's available bandwidth, which it estimates from RTCP feedback such as receiver reports. This reduces downstream bandwidth further; the only extra work for the sender is encoding and uploading the three tiers.
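Tier selection can be sketched as a simple threshold lookup. The per-tier bitrates and the 80% headroom factor below are assumptions for illustration, not values from any particular SFU; real implementations also add hysteresis so the forwarded tier does not flap.

```python
# Assumed bitrates per tier, highest first (illustrative numbers only).
TIERS_KBPS = [("1080p", 2500), ("360p", 500), ("180p", 150)]


def pick_tier(available_kbps, tiers=TIERS_KBPS, headroom=0.8):
    """Return the highest tier that fits within a fraction of the estimate.

    Returns None when even the lowest tier does not fit, in which case a
    server might pause video for that subscriber and keep forwarding audio.
    """
    budget = available_kbps * headroom
    for name, kbps in tiers:
        if kbps <= budget:
            return name
    return None


for estimate in (4000, 1000, 300, 100):
    print(estimate, pick_tier(estimate))
```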
Operating & debugging it
A video call failure looks different depending on which plane broke. The single most useful first step is identifying whether the symptom is in the signaling plane (WebSocket), the media plane (UDP/SRTP), or the join flow (REST + ICE). Each has distinct observable symptoms and distinct tools.
Production inspection
Symptom → cause → fix
| Symptom | Likely cause | Where to look | Fix |
|---|---|---|---|
| "Join" button spins, never connects | REST API down or join token request failing | Browser Network tab → POST /v1/meetings/:id/join status code | Check REST API health; if 401 confirm user token is valid; if 503 scale API tier |
| Joins REST but hangs on "Connecting…" | ICE gathering stuck — STUN unreachable or TURN needed but not configured | chrome://webrtc-internals → ICE candidates; "relay" candidate present? | Ensure TURN server is in ice_servers response; test TURN port 443 UDP is open |
| One-way audio (can hear others, they can't hear me) | Upstream to SFU not flowing; microphone permission denied or wrong device | webrtc-internals → outbound-rtp audio → packetsSent stuck at 0 | Confirm microphone permission; check SFU inbound stream count for that participant |
| Frozen video after 30–60 s, audio fine | Video bitrate exceeded subscriber's bandwidth; simulcast not configured or SFU not adapting | webrtc-internals → qualityLimitationReason = "bandwidth"; RTCP receiver reports | Enable simulcast; ensure SFU reads RTCP feedback and downgrades forwarded tier |
| Signaling events delayed (mute lag, join notifications slow) | WebSocket message queue backup; signaling server overloaded | Signaling server metrics → WS message queue depth, handler latency | Scale signaling tier; check for expensive operations blocking the WS event loop |
| Participant sees themselves in their own video grid | SFU forwarding sender's stream back to themselves | SFU subscription table: is sender subscribed to their own track? | Exclude self-track from subscription list when building subscriber routes |
| Media drops completely when signaling message is large | Large SDP or screen-share track renegotiation on WebSocket stalls signaling TCP buffer, causing apparent media stall | WS frame sizes; check for 100 KB+ SDP messages during renegotiation | Separate media from signaling; never route media over the WS connection |
When a participant's IP changes mid-call (Wi-Fi handoff, VPN toggle), the existing ICE candidate pair becomes invalid. WebRTC supports ICE restart, a new offer/answer exchange that gathers fresh candidates, but it must be triggered through the signaling channel. If the signaling WebSocket is also broken (e.g. the device was offline briefly), ICE restart cannot proceed until the WebSocket has been reconnected. Always implement exponential-backoff WebSocket reconnection and run it before attempting ICE restart: the signaling channel is the prerequisite for media recovery.
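The reconnection schedule and the recovery ordering can be sketched as follows. The base delay, cap, and full-jitter strategy are assumptions chosen for illustration, and `recovery_plan` is a hypothetical helper that only encodes the ordering rule described above.

```python
import random


def backoff_delays(attempts, base_s=0.5, cap_s=30.0, jitter=True, rng=random.random):
    """Delays before each reconnect attempt: doubling, capped, optionally jittered."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        # Full jitter spreads clients out so they do not reconnect in lockstep.
        delays.append(ceiling * rng() if jitter else ceiling)
    return delays


def recovery_plan(ws_open, ice_connected):
    """Order the recovery steps: signaling first, then media."""
    steps = []
    if not ws_open:
        steps.append("reconnect_websocket")
    if not ice_connected:
        steps.append("ice_restart")
    return steps


print(backoff_delays(8, jitter=False))
print(recovery_plan(ws_open=False, ice_connected=False))
```

Jitter matters most after a signaling server restart, when every client in every room loses its WebSocket at the same moment.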
🧠 Quick check
1. Why does video conferencing media use UDP rather than TCP?
TCP's retransmit guarantee is a liability for real-time media. A packet that arrives 80 ms late is useless — you've already passed the playback deadline. UDP lets the application decide how to handle loss (interpolate, conceal) rather than waiting for a retransmit.
2. In an SFU architecture, how many upstream video streams does each client send?
One. The SFU's job is to break the N×N mesh. Each client uploads once to the SFU; the SFU fans that stream out to subscribers. Upload cost is O(1) per client, not O(N).
3. A join token expires 5 minutes after issue. What problem does this solve?
Meeting IDs are public. The token is the secret — it binds a specific user to a specific meeting at a specific time. A short expiry limits replay attacks: capturing a token and trying to join later fails because the token has expired.
4. In the join latency budget, which step typically contributes the most variance?
ICE gathering involves probing multiple candidate paths (host, server-reflexive, relay). When TURN is required — typically behind a corporate symmetric NAT — gathering can take 150–200 ms on its own, dwarfing every other step in the join sequence.
✍️ Exercise: design the "raise hand" feature end to end
A participant clicks "Raise hand." Within 500 ms, every other participant's UI should show a hand icon next to that person's name. No media plane involvement. Sketch the full flow: which plane carries it, what the message looks like, and how the server broadcasts it.
Model answer:
Raise hand is a presence event — it carries no audio or video, it is small, and delivery matters (a dropped event would leave the UI inconsistent). It belongs entirely on the signaling WebSocket.
// Client → signaling server
{ "type": "presence.update", "hand_raised": true }
// Signaling server → broadcast to all participants in room
{
"type": "room.presence_changed",
"participant_id": "pt_Uw83",
"hand_raised": true
}
The server also updates its in-memory room state so latecomers receive current presence in their initial room snapshot. No REST API call needed — this is an in-call event, not a meeting lifecycle event.
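The server-side half of the model answer might look like the sketch below. The `Room` class, its method names, and the `room.snapshot` message type are hypothetical; the `room.presence_changed` shape matches the broadcast message shown above. Actual delivery over WebSockets is left out so the state logic stands on its own.

```python
class Room:
    """In-memory presence state for one meeting on the signaling server."""

    FLAGS = ("audio_muted", "hand_raised", "screen_sharing")

    def __init__(self):
        self.presence = {}  # participant_id -> {flag: bool}

    def join(self, participant_id):
        """Register a participant and return the snapshot to send to them."""
        self.presence[participant_id] = dict.fromkeys(self.FLAGS, False)
        return {
            "type": "room.snapshot",  # hypothetical message type
            "presence": {pid: dict(flags) for pid, flags in self.presence.items()},
        }

    def apply_update(self, sender_id, update):
        """Apply a presence.update; return the broadcast message, or None if no-op."""
        current = self.presence[sender_id]
        changed = {k: update[k] for k in self.FLAGS
                   if k in update and update[k] != current[k]}
        if not changed:
            return None
        current.update(changed)
        return {"type": "room.presence_changed",
                "participant_id": sender_id, **changed}


room = Room()
room.join("pt_Uw83")
print(room.apply_update("pt_Uw83", {"type": "presence.update", "hand_raised": True}))
```

Taking the sender's identity from the authenticated connection rather than from the message body means a client cannot raise someone else's hand.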
Rubric: ✓ Routed through signaling, not REST ✓ Correctly uses a broadcast pattern ✓ Noted late-joiner state sync ✓ Did not involve the media plane ✓ Identified why TCP delivery guarantee matters here (presence consistency).
Key takeaways
- Split control from media. REST for room lifecycle (infrequent, needs reliability), UDP/WebRTC for media (continuous, tolerates loss).
- WebSocket is the signaling bridge. Bidirectional, low-overhead, TCP-backed — right for presence and SDP negotiation, wrong for video frames.
- An SFU keeps each client's upload cost at O(1) by receiving one stream and fanning it out, rather than running a mesh.
- Short-lived JWTs decouple the media hot path from the database — the SFU validates the token signature locally without a callback.
- The join latency budget is dominated by ICE gathering, not by the REST call or SDP exchange.