Design Case Studies · Lesson 10

Design: Video Conferencing API

A single "join meeting" click kicks off three separate sub-systems: a REST call that books a room, a WebSocket handshake that negotiates participants, and a UDP media stream that carries the actual video. The skill is keeping these planes separate and letting each one optimise for what it does best.

⏱ 18 min · Difficulty: advanced · Prereq: REST, WebSockets, sockets & UDP

By the end you'll be able to

Articulate why video conferencing splits into a control plane and a media plane, and what goes in each.
Sketch the REST + WebSocket + SFU API surface and explain every message exchanged during join.
Evaluate the latency budget across the full join flow and identify where each millisecond goes.

1 — Requirements

Before drawing a single endpoint, nail down what this system actually has to do. Video conferencing is a deceptively wide surface.

Functional requirements

Room lifecycle: create a meeting, generate a shareable link, let others join, end it. A meeting has an id, a host, a title, and a scheduled start time.
Participant signaling: every join/leave event must propagate to all current participants in real time — you cannot poll for this.
Media routing: audio and video from each participant must reach every other participant with sub-200 ms end-to-end latency.
Many participants: target 500 viewers, 25 active cameras. The naive approach of everyone sending to everyone (mesh) breaks down past 4–5 peers.
Presence: who is muted, who is sharing screen, who has raised their hand — all participants need a consistent view of room state.

Non-functional requirements

Latency: signaling under 100 ms; audio under 150 ms glass-to-glass; video under 200 ms.
Reliability: temporary packet loss must not freeze frames — graceful degradation beats accuracy.
Security: join tokens prevent uninvited guests; media must be encrypted in transit (SRTP/DTLS).
Scalability: meeting creation is bursty; media load scales with participant count × bitrate, not meeting count.

2 — Design decisions

Decision 1: Separate the control plane from the media plane

This is the central insight. Two different jobs require two different protocols:

The control plane manages rooms, issues tokens, and tracks who is in a meeting. It is infrequently used, needs a reliable delivery guarantee, and can tolerate 50–100 ms latency. REST over HTTP/2 is a natural fit.
The media plane carries audio and video frames continuously. It needs sub-50 ms per-hop latency and must tolerate packet loss gracefully — late frames are discarded rather than retransmitted. This rules out TCP. UDP is the only realistic choice. (See the note on TCP vs UDP in Lesson 05 — Sockets.)

Decision 2: WebSocket for signaling

Between the REST control plane and the UDP media plane sits a signaling layer. Signaling messages are small (a few hundred bytes), must arrive in order, and must be pushed to all participants without polling. HTTP long-poll and SSE could handle server-to-client pushes, but participants also need to send to the server mid-call (mute, raise hand). WebSockets provide a persistent, bidirectional channel over TCP — exactly what signaling needs without any polling overhead.

Decision 3: Selective Forwarding Unit (SFU) instead of a mesh

In a peer-to-peer mesh, every participant sends their stream to every other participant. With N participants, each client uploads (N−1) streams. At N = 10, that is 9 simultaneous uploads — saturating a typical home connection. An SFU is a server that each client sends exactly one upstream to; the SFU then selects which streams to forward downstream to each subscriber. Upload cost per client stays fixed at 1× regardless of meeting size.

Decision 4: Short-lived join tokens

Meeting IDs are shareable links, so they are not secrets. The join flow therefore requires a separate credential: a short-lived JWT carrying the claims meetingId, participantId, role, and exp. The client exchanges the meeting ID (plus its user auth) for a token from the REST API, then presents that token when it connects to the signaling and media servers. Those servers validate the token signature locally, without calling back to the REST API, which keeps the join hot path off the database.
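To make the "validate without a callback" idea concrete, here is a minimal sketch of issuing and verifying such a token using only the Python standard library. It assumes an HS256 signature with a secret shared between the REST API and the media tier (`SECRET`, `issue_join_token`, and `verify_join_token` are hypothetical names, not part of the lesson's API); a real deployment might prefer an asymmetric algorithm and a maintained JWT library so the SFU never holds a signing key.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-shared-secret"  # assumption: symmetric key known to REST API and SFU


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> bytes:
    return hmac.new(SECRET, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_join_token(meeting_id, participant_id, role, now=None, ttl_s=300):
    """REST side: sign the four claims with a 5-minute expiry by default."""
    now = int(time.time()) if now is None else int(now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"meetingId": meeting_id, "participantId": participant_id,
              "role": role, "exp": now + ttl_s}
    parts = [_b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
             for part in (header, claims)]
    signing_input = ".".join(parts)
    return signing_input + "." + _b64url(_sign(signing_input))


def verify_join_token(token, meeting_id, now=None):
    """Media side: local check only, no database or REST call. Returns claims or None."""
    now = int(time.time()) if now is None else int(now)
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        expected = _sign(header_b64 + "." + claims_b64)
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(claims_b64))
    except (ValueError, TypeError):
        return None  # malformed token
    if not isinstance(claims, dict):
        return None
    if claims.get("meetingId") != meeting_id or claims.get("exp", 0) <= now:
        return None  # wrong room or expired
    return claims
```

The verifier touches nothing but the token and the key, which is the property the design relies on: the media tier stays up even when the REST API's database does not.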

Decision 5: Presence over the signaling WebSocket

Presence state (muted, hand raised, screen-sharing) is small, changes infrequently, and must be consistent across all participants. Routing it through the signaling WebSocket — not as media — keeps it decoupled from video packets and gives the server a single place to broadcast state changes.

Three planes, three protocols. Control (teal, REST), signaling (blue, WebSocket), media (amber, UDP/WebRTC). Clients connect to all three; the SFU fans media out without passing through the signaling path.

3 — The API model

Control plane (REST)

These endpoints are called once per meeting lifecycle — not per frame.

# Create a meeting (host)
POST /v1/meetings
Authorization: Bearer <user-token>
Content-Type: application/json

{
  "title": "Weekly standup",
  "scheduled_at": "2025-09-01T09:00:00Z",
  "settings": { "max_participants": 25, "waiting_room": true }
}

# Response
HTTP/1.1 201 Created
{
  "id": "mtg_9kZxp2",
  "join_url": "https://meet.example.com/j/9kZxp2",
  "host_key": "hk_..."
}

# Join a meeting — returns a short-lived token + media server address
POST /v1/meetings/mtg_9kZxp2/join
Authorization: Bearer <user-token>

# Response: token is signed by REST API; media server validates it directly
HTTP/1.1 200 OK
{
  "participant_id": "pt_Uw83",
  "join_token": "eyJhbGc...",       // JWT, exp = 5 min
  "sfu_endpoint": "wss://sfu-eu-west.example.com",
  "signal_endpoint": "wss://sig-eu-west.example.com",
  "ice_servers": [                        // STUN/TURN for NAT traversal
    { "urls": "stun:stun.example.com:3478" },
    { "urls": "turn:turn.example.com:443", "username": "...", "credential": "..." }
  ]
}

Signaling plane (WebSocket)

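After joining, the client opens a WebSocket to signal_endpoint and presents the join_token during the HTTP upgrade. Non-browser clients can send it in an Authorization header; the browser WebSocket API cannot set custom headers, so browser clients typically pass it as a query parameter or in the Sec-WebSocket-Protocol header instead. All subsequent messages are JSON envelopes distinguished by their "type" field.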

// Client → server: announce presence and SDP offer
{
  "type": "participant.hello",
  "participant_id": "pt_Uw83",
  "display_name": "Divya",
  "sdp_offer": "v=0\r\no=..."   // WebRTC SDP for media negotiation
}

// Server → client: SDP answer from SFU
{
  "type": "sfu.answer",
  "sdp_answer": "v=0\r\no=..."
}

// Server → all: someone joined
{
  "type": "room.participant_joined",
  "participant": { "id": "pt_Uw83", "display_name": "Divya", "role": "attendee" }
}

// Client → server: presence update
{
  "type": "presence.update",
  "audio_muted": true,
  "hand_raised": false,
  "screen_sharing": false
}
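
On the server, envelopes like these are usually routed by their "type" field. The following is a minimal sketch of such a dispatcher, not the lesson's actual server: `HANDLERS`, `on`, `dispatch`, the `error` envelope, and its `reason` values are all hypothetical. It also assumes the sender's identity comes from the validated join token rather than from the message body, so the `participant_id` field in `participant.hello` is not trusted here.

```python
import json

HANDLERS = {}  # envelope type -> handler function


def on(message_type):
    """Decorator that registers a handler for one envelope type."""
    def register(fn):
        HANDLERS[message_type] = fn
        return fn
    return register


@on("participant.hello")
def handle_hello(room, sender_id, msg):
    name = msg.get("display_name", "")
    room[sender_id] = {"display_name": name, "audio_muted": False}
    return {"type": "room.participant_joined",
            "participant": {"id": sender_id, "display_name": name}}


@on("presence.update")
def handle_presence(room, sender_id, msg):
    if sender_id not in room:
        return {"type": "error", "reason": "not_joined"}
    room[sender_id]["audio_muted"] = bool(msg.get("audio_muted", False))
    return {"type": "room.presence_changed", "participant_id": sender_id,
            "audio_muted": room[sender_id]["audio_muted"]}


def dispatch(room, sender_id, raw):
    """Parse one text frame and route it; returns the envelope to send or broadcast."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return {"type": "error", "reason": "invalid_json"}
    if not isinstance(msg, dict) or not isinstance(msg.get("type"), str):
        return {"type": "error", "reason": "missing_type"}
    handler = HANDLERS.get(msg["type"])
    if handler is None:
        return {"type": "error", "reason": "unknown_type"}
    return handler(room, sender_id, msg)
```

Keeping the routing table separate from the handlers means a new message type is one decorated function, and unknown or malformed frames get a uniform reply instead of an exception.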

Join sequence in full. REST handles authentication once; WebSocket carries signaling throughout the call; UDP carries media after the DTLS handshake. The SFU never touches the REST API: the JWT is self-contained, so its signature can be verified locally.

4 — Evaluation & latency budget

Control vs media plane: why the separation holds

The REST control plane sees at most one request per participant per join, so it scales with meeting creation and join events, not with call duration or packet rate. It can live behind a standard HTTP gateway with ordinary database-backed auth. The UDP media plane, by contrast, handles thousands of packets per second per participant (audio at 50 pps, 720p video at ~30 fps). Routing those through an HTTP layer would put every packet behind TCP's in-order delivery: one lost segment holds back everything queued after it for at least a retransmission round trip, often 50 ms or more, which is incompatible with real-time audio perception. Lesson 05 (Sockets) covers why TCP's head-of-line blocking is lethal for media.
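As a rough check on the packet-rate claim, the arithmetic can be sketched as follows. It assumes the 1.5 Mbps per-camera bitrate and the ~1,200-byte RTP packet size used elsewhere in this lesson, ignores header overhead, and treats a participant in a 25-camera meeting as receiving all 24 other streams; the function name is hypothetical.

```python
def packets_per_second(bitrate_bps, packet_bytes):
    """Packets needed per second to carry a bitrate at a fixed packet size."""
    return bitrate_bps / (packet_bytes * 8)


AUDIO_PPS = 50                                    # one packet per 20 ms audio frame
video_pps = packets_per_second(1_500_000, 1200)   # 156.25
per_stream_pps = AUDIO_PPS + video_pps            # 206.25 for one participant's upstream
received_pps = 24 * per_stream_pps                # 4950.0 when receiving 24 other streams

print(per_stream_pps, received_pps)
```

Under these assumptions a single upstream is a few hundred packets per second; the "thousands per participant" figure appears once the downstreams of a large meeting are counted.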

Latency budget for the join flow

Step	Protocol	Typical latency	Notes
POST /meetings/:id/join	HTTPS / TCP	40–80 ms	One DB read for meeting, JWT signing. Can be cached.
WebSocket upgrade	TCP	20–40 ms	One extra round trip (101 Switching Protocols).
SDP offer/answer via signaling	WebSocket	10–30 ms	Signaling server relays to SFU; answer sent back.
ICE candidate gathering	STUN/TURN / UDP	50–200 ms	Worst case: symmetric NAT requires TURN relay.
DTLS handshake	UDP (DTLS 1.2)	20–60 ms	1–2 round trips for key exchange.
First audio frame	SRTP/UDP	≈ 140–410 ms total	Budget dominated by ICE, especially behind symmetric NAT.
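
The total in the last row is just the sum of the per-step ranges, which can be verified with a short sketch. The step names and numbers are copied from the table; `total_budget` and `widest_step` are illustrative helper names.

```python
# (step, best-case ms, worst-case ms), taken from the table above
JOIN_STEPS = [
    ("POST /meetings/:id/join", 40, 80),
    ("WebSocket upgrade", 20, 40),
    ("SDP offer/answer via signaling", 10, 30),
    ("ICE candidate gathering", 50, 200),
    ("DTLS handshake", 20, 60),
]


def total_budget(steps):
    """Sum the best-case and worst-case columns."""
    return sum(lo for _, lo, _ in steps), sum(hi for _, _, hi in steps)


def widest_step(steps):
    """The step with the largest best-to-worst spread, i.e. the most variance."""
    return max(steps, key=lambda step: step[2] - step[1])[0]


print(total_budget(JOIN_STEPS))  # (140, 410)
print(widest_step(JOIN_STEPS))   # ICE candidate gathering
```

ICE accounts for 150 ms of the 270 ms spread between best and worst case, which is why it is the first place to look when joins feel slow.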

SFU fan-out under load

At 25 cameras × 1.5 Mbps each, the SFU handles 37.5 Mbps of inbound media. If it forwards each stream to all 24 other participants: 24 × 25 × 1.5 = 900 Mbps outbound, before counting any view-only participants. In practice, SFUs use simulcast (clients send three resolution tiers) and subscriber-side bandwidth estimation to send each recipient only what their connection can absorb. A well-implemented SFU never sends a participant more than that participant's downlink can carry, so real outbound traffic is bounded by the sum of the subscribers' available downlinks rather than by the 900 Mbps theoretical figure.
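The fan-out arithmetic can be reproduced with a small sketch. The 25-camera and 1.5 Mbps figures come from the text; the 10 Mbps downlink used for the capped case is a made-up illustrative value, and both function names are hypothetical.

```python
def sfu_bandwidth_mbps(cameras, bitrate_mbps, subscribers_per_stream):
    """Uncapped inbound and outbound load when every stream goes to every subscriber."""
    inbound = cameras * bitrate_mbps
    outbound = cameras * subscribers_per_stream * bitrate_mbps
    return inbound, outbound


def capped_outbound_mbps(downlinks_mbps, streams_per_subscriber, bitrate_mbps):
    """Outbound load when each subscriber is limited by its own downlink."""
    wanted = streams_per_subscriber * bitrate_mbps
    return sum(min(downlink, wanted) for downlink in downlinks_mbps)


print(sfu_bandwidth_mbps(25, 1.5, 24))              # (37.5, 900.0)
print(capped_outbound_mbps([10.0] * 25, 24, 1.5))   # 250.0
```

With every subscriber on a hypothetical 10 Mbps downlink, the SFU's outbound load drops from 900 Mbps to 250 Mbps: each subscriber would like 36 Mbps of video but can only absorb 10.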

🎯 Interview angle

Interviewers often ask "why not just use HTTP for everything?" The sharp answer is: HTTP is TCP-based, and TCP guarantees delivery by retransmitting lost packets. In audio, a retransmit arrives 50–100 ms too late to play — you're better off interpolating over the gap. That's why real-time media uses UDP and tolerates loss rather than retransmitting. Signaling is different: a lost "participant left" message would leave the UI wrong. TCP delivery guarantees are worth the slight latency there.

⚠️ Common trap

Putting media routing through the signaling WebSocket. WebSocket is TCP-backed: a large video keyframe burst can cause head-of-line blocking that stalls the signaling messages sharing the same connection. Always run media on a dedicated UDP transport; keep the WebSocket exclusively for signaling events.

✅ Do this, not that

Do issue short-lived JWTs from the REST API and validate them on the media server without a callback. Don't have the SFU call the REST API on every participant connect: that couples the media hot path to your database, adds 40–80 ms of latency, and introduces a correlated failure mode in which a database outage blocks every new join and reconnect.

Under the hood: the core mechanism

Three architectural facts determine everything about how a real SFU works. Understanding them lets you reason about scaling, latency, and failure modes without guessing.

Why mesh breaks: N(N-1) streams

In a pure peer-to-peer mesh each participant opens a direct connection to every other participant. With N active cameras, each client uploads N-1 streams and downloads N-1 streams:

Participants	Streams per client	Total streams in meeting	Upload demand at 1.5 Mbps
2	1 up + 1 down	2	1.5 Mbps
4	3 up + 3 down	12	4.5 Mbps (saturates home upload)
10	9 up + 9 down	90	13.5 Mbps (impossible on most lines)
25	24 up + 24 down	600	36 Mbps — catastrophic

The SFU collapses the upload cost to exactly 1 stream per sender regardless of participant count. Each client opens one UDP connection to the SFU and sends one stream; the SFU fans it out to all subscribers. Download cost per subscriber equals the number of streams they choose to receive — typically the N-1 other participants — but that load now falls on the SFU's outbound bandwidth, not the sender's upload.
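
The rows of the table follow directly from N, as this small sketch shows. It assumes every participant has a camera on at 1.5 Mbps, as the table does; `mesh_cost` and `sfu_cost` are illustrative names.

```python
def mesh_cost(n, bitrate_mbps=1.5):
    """Per-client and total stream counts in a full peer-to-peer mesh."""
    return {
        "uploads_per_client": n - 1,
        "total_streams": n * (n - 1),
        "upload_mbps": (n - 1) * bitrate_mbps,
    }


def sfu_cost(n, bitrate_mbps=1.5):
    """Same meeting behind an SFU: one upload per client, fan-out done server-side."""
    return {
        "uploads_per_client": 1,
        "sfu_inbound_streams": n,
        "sfu_outbound_streams": n * (n - 1),
        "upload_mbps": bitrate_mbps,
    }


for n in (2, 4, 10, 25):
    print(n, mesh_cost(n), sfu_cost(n))
```

The number of forwarded streams is the same N(N-1) in both designs; what changes is who pays for it. In the mesh it is each client's uplink, behind the SFU it is the server's outbound bandwidth.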

Signaling (WebSocket/TCP) vs media (WebRTC/UDP): two completely separate pipes

These are not different message types on the same connection — they are different transport protocols with different network sockets:

Property	Signaling (WebSocket)	Media (WebRTC/UDP)
Transport	TCP (inside WS)	UDP (SRTP)
Delivery	Guaranteed, ordered	Best-effort, unordered
Loss handling	TCP retransmits	Lost frame = interpolate / conceal
Message size	Tens to hundreds of bytes	RTP packets up to ~1,200 bytes; audio at 50 pps, video at 30 fps
Volume	Handful of events per second for the whole call	Thousands of packets per second per participant
Port	443 (WSS)	Negotiated via ICE; often 443 UDP via TURN

A signaling message that delivers a "mute" event 80 ms late is acceptable. An audio packet delivered 80 ms late has already missed its playback slot; it is as useless as a lost packet, and stalling playback to wait for it would be more disruptive than concealing the gap. These fundamentally different tolerance profiles demand separate transports.
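The "missed its playback slot" rule can be illustrated with a toy playout scheduler. This is a simplified sketch, not how any particular WebRTC stack implements its jitter buffer: it assumes 20 ms audio frames (50 pps), a fixed 60 ms buffer chosen for illustration, and a hypothetical `playout_plan` helper.

```python
FRAME_MS = 20  # one audio packet every 20 ms, i.e. 50 pps


def playout_plan(arrivals, buffer_ms=60):
    """Decide per packet whether it is played or concealed.

    arrivals maps sequence number -> arrival time in ms; a missing
    sequence number is a packet that never arrived.
    """
    first_seq = min(arrivals)
    # The first packet is held for buffer_ms; later slots follow every FRAME_MS.
    base = arrivals[first_seq] + buffer_ms
    plan = {}
    for seq in range(first_seq, max(arrivals) + 1):
        deadline = base + (seq - first_seq) * FRAME_MS
        arrival = arrivals.get(seq)
        if arrival is None:
            plan[seq] = "concealed (lost)"
        elif arrival > deadline:
            plan[seq] = "concealed (late)"  # arrived, but its slot has passed
        else:
            plan[seq] = "played"
    return plan


# Packet 2 arrives 80 ms later than its spacing suggests; packet 3 never arrives.
print(playout_plan({0: 30, 1: 50, 2: 150, 4: 110}))
```

From the listener's point of view, the late packet and the lost packet are handled identically: both slots are filled by concealment, and playback of packet 4 is not delayed.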

Worked trace: 3 participants join in sequence

Alice is already in the room. Bob joins. Then Carol joins. Trace every stream that the SFU handles:

─── Alice joins (already in room) ───────────────────────────────
Alice → SFU:  1 upstream  (Alice's video)
SFU stores:   track[Alice] = stream_A

─── Bob joins ────────────────────────────────────────────────────
Bob  → REST API:  POST /v1/meetings/mtg_9kZxp2/join
REST → Bob:       { join_token, sfu_endpoint }

Bob  → Signaling: WS upgrade + participant.hello + sdp_offer
Signaling → SFU:  relay SDP offer
SFU → Signaling:  sdp_answer (SFU will accept Bob's upstream + subscribe to Alice)
Signaling → Bob:  sfu.answer

# ICE + DTLS handshake establishes Bob ↔ SFU UDP path
Bob → SFU:  1 upstream (Bob's video)       ← new stream in
SFU → Bob:  Alice's stream (forwarded)      ← 1 stream out to Bob
SFU → Alice: Bob's stream (forwarded)       ← 1 stream out to Alice

Signaling → Alice: room.participant_joined { id: Bob }  ← UI update

Total SFU state:  track[Alice], track[Bob]
Streams flowing: 2 upstreams in, 2 downstreams out (one per subscriber)

─── Carol joins ──────────────────────────────────────────────────
Carol → REST, Signaling, SFU:  same join flow

Carol → SFU:  1 upstream (Carol's video)
SFU → Carol:  Alice's stream + Bob's stream   ← 2 streams out to Carol
SFU → Alice:  Carol's stream                   ← Alice now gets 2 downstreams total
SFU → Bob:    Carol's stream                   ← Bob now gets 2 downstreams total

Signaling → Alice, Bob: room.participant_joined { id: Carol }

Final SFU state (3 participants, all cameras on):
  Upstreams:   3  (one per sender, fixed at O(1) per client)
  Downstreams: 6  (each participant receives 2 others' streams)
  Compare mesh: 6 client-to-client streams (N(N-1) with N = 3); every client uploads 2 streams instead of 1
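
The trace above can be replayed with a toy model of the SFU's forwarding table. This is a sketch for counting streams only: the `Sfu` class and its methods are hypothetical, and it assumes every participant publishes one track and subscribes to every track except its own.

```python
class Sfu:
    """Toy forwarding table: who publishes, and which forwards a join creates."""

    def __init__(self):
        self.tracks = []  # publishers, in join order

    def join(self, participant):
        """Add a publisher; return the new (publisher, subscriber) forwards."""
        # Existing tracks start flowing to the newcomer...
        forwards = [(existing, participant) for existing in self.tracks]
        # ...and the newcomer's track starts flowing to everyone already present.
        forwards += [(participant, existing) for existing in self.tracks]
        self.tracks.append(participant)
        return forwards

    def subscriptions(self):
        """Each participant receives every track except its own."""
        return {p: [t for t in self.tracks if t != p] for p in self.tracks}

    def downstream_count(self):
        return sum(len(tracks) for tracks in self.subscriptions().values())


sfu = Sfu()
print(sfu.join("Alice"))  # []
print(sfu.join("Bob"))    # [('Alice', 'Bob'), ('Bob', 'Alice')]
print(sfu.join("Carol"))  # four new forwards
print(len(sfu.tracks), sfu.downstream_count())  # 3 6
```

Excluding the participant's own track in `subscriptions` is the same rule that prevents the "sees themselves in their own grid" bug.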

When simulcast is enabled, each sender actually sends 3 resolution tiers (e.g. 1080p, 360p, 180p). The SFU forwards only the tier that matches each subscriber's available bandwidth — determined by RTCP receiver reports. This collapses bandwidth further without the sender doing any extra work beyond the 3-tier upload.
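
Tier selection on the SFU can be sketched as a simple threshold lookup. The tier names come from the paragraph above, but the bitrates, the 20% headroom, and the `pick_tier` function are illustrative assumptions; a real SFU derives the bandwidth estimate from RTCP feedback and usually smooths tier switches to avoid flapping.

```python
# (tier, approximate bitrate in kbps), highest first. Bitrates are hypothetical.
TIERS = [("1080p", 2500), ("360p", 500), ("180p", 150)]


def pick_tier(estimated_kbps, tiers=TIERS, headroom=0.8):
    """Return the best tier that fits in a fraction of the estimated bandwidth.

    Returns None when even the lowest tier does not fit, in which case
    the SFU could pause video for that subscriber and keep audio flowing.
    """
    budget = estimated_kbps * headroom
    for name, kbps in tiers:
        if kbps <= budget:
            return name
    return None


for estimate in (4000, 1000, 300, 100):
    print(estimate, pick_tier(estimate))
```

Because the choice is made per subscriber, one participant on a weak connection receives 180p while everyone else keeps the high tier from the same sender.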

Each participant uploads exactly once (amber). The SFU fans out two streams per subscriber (green). Upload cost is fixed at O(1) per sender; download cost scales with the number of subscribed streams, but that burden falls on the SFU, not on the senders.

Operating & debugging it

A video call failure looks different depending on which plane broke. The single most useful first step is identifying whether the symptom is in the signaling plane (WebSocket), the media plane (UDP/SRTP), or the join flow (REST + ICE). Each has distinct observable symptoms and distinct tools.

Production inspection

# 1. Check the REST join endpoint is healthy
curl -i -X POST https://api.example.com/v1/meetings/mtg_9kZxp2/join \
  -H "Authorization: Bearer $USER_TOKEN"
HTTP/1.1 200 OK    ← join_token, sfu_endpoint, signal_endpoint returned
HTTP/1.1 503       ← REST API down; participant cannot start the join flow

# 2. Verify the signaling WebSocket accepts connections
wscat -c "wss://sig-eu-west.example.com" \
  -H "Authorization: Bearer $JOIN_TOKEN"
Connected (press CTRL+C to quit)    ← signaling reachable
error: connect ETIMEDOUT            ← firewall or server down

# 3. Probe the STUN/TURN server (ICE health check)
stunclient stun.example.com 3478
Binding test: pass (server reflexive address returned)    ← STUN reachable
Binding test: failed (no response)                        ← STUN port blocked

# 4. In Chrome: inspect WebRTC internals for an active call
chrome://webrtc-internals
→ select the PeerConnection
→ check ICE state, candidatePair state (should be "succeeded"),
  audio/video packetsReceived, and jitterBufferDelay.
  packetsSent > 0 confirms upstream is flowing.

# 5. SFU-side: monitor active stream counts (Prometheus example)
sfu_active_upstreams{meeting="mtg_9kZxp2"} 3      ← one per camera-on participant
sfu_active_downstreams{meeting="mtg_9kZxp2"} 6    ← 3 participants × 2 each
sfu_packet_loss_percent{track="pt_Uw83"} 0.3      ← healthy; alert above 5%

Symptom → cause → fix

Symptom	Likely cause	Where to look	Fix
"Join" button spins, never connects	REST API down or join token request failing	Browser Network tab → POST /v1/meetings/:id/join status code	Check REST API health; if 401 confirm user token is valid; if 503 scale API tier
REST join succeeds but client hangs on "Connecting…"	ICE gathering stuck: STUN unreachable, or TURN needed but not configured	chrome://webrtc-internals → ICE candidates; "relay" candidate present?	Ensure TURN server is in `ice_servers` response; test TURN port 443 UDP is open
One-way audio (can hear others, they can't hear me)	Upstream to SFU not flowing; microphone permission denied or wrong device	webrtc-internals → outbound-rtp audio → `packetsSent` stuck at 0	Confirm microphone permission; check SFU inbound stream count for that participant
Frozen video after 30–60 s, audio fine	Video bitrate exceeded subscriber's bandwidth; simulcast not configured or SFU not adapting	webrtc-internals → `qualityLimitationReason` = "bandwidth"; RTCP receiver reports	Enable simulcast; ensure SFU reads RTCP feedback and downgrades forwarded tier
Signaling events delayed (mute lag, join notifications slow)	WebSocket message queue backup; signaling server overloaded	Signaling server metrics → WS message queue depth, handler latency	Scale signaling tier; check for expensive operations blocking the WS event loop
Participant sees themselves in their own video grid	SFU forwarding sender's stream back to themselves	SFU subscription table: is sender subscribed to their own track?	Exclude self-track from subscription list when building subscriber routes
Media drops completely when signaling message is large	Large SDP or screen-share track renegotiation on WebSocket stalls signaling TCP buffer, causing apparent media stall	WS frame sizes; check for 100 KB+ SDP messages during renegotiation	Separate media from signaling; never route media over the WS connection

⚠️ The ICE restart trap

When a participant's IP changes mid-call (Wi-Fi handoff, VPN toggle), the existing ICE candidate pair becomes invalid. WebRTC supports ICE restart, a new offer/answer exchange that gathers fresh candidates, but the restart must be triggered through the signaling channel. If the signaling WebSocket is also broken (e.g. the device was briefly offline), the ICE restart cannot proceed until the WebSocket has been reconnected. Always implement exponential-backoff WebSocket reconnection and run it before attempting ICE restart: the signaling channel is the prerequisite for media recovery.
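The recovery order can be sketched as below. This is a minimal illustration with the network calls injected as plain callables: `connect_ws` and `restart_ice` are hypothetical stand-ins for the client's real reconnect and ICE-restart routines, and the 0.5 s base, 30 s cap, and full-jitter strategy are assumed values rather than recommendations from this lesson.

```python
import random
import time


def backoff_delays(base_s=0.5, cap_s=30.0, attempts=8, rng=None):
    """Yield exponentially growing delays with full jitter, capped at cap_s."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        yield rng.uniform(0, ceiling)


def recover(connect_ws, restart_ice, delays, sleep=time.sleep):
    """Reconnect signaling first; only then trigger the ICE restart.

    connect_ws() returns True once the WebSocket is open again.
    Returns False if every attempt failed.
    """
    for delay in delays:
        if connect_ws():
            restart_ice()  # safe now: the offer/answer has a channel to travel on
            return True
        sleep(delay)
    return False
```

Jitter matters here because a network blip often disconnects many participants at once; without it, they would all retry against the signaling tier at the same instants.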

🧠 Quick check

1. Why does video conferencing media use UDP rather than TCP?

TCP's retransmit guarantee is a liability for real-time media. A packet that arrives 80 ms late is useless — you've already passed the playback deadline. UDP lets the application decide how to handle loss (interpolate, conceal) rather than waiting for a retransmit.

2. In an SFU architecture, how many upstream video streams does each client send?

One. The SFU's job is to break the N×N mesh: each client uploads once to the SFU, and the SFU fans that stream out to subscribers. Upload cost is O(1) per client, not O(N).

3. A join token expires 5 minutes after issue. What problem does this solve?

Meeting IDs are public. The token is the secret — it binds a specific user to a specific meeting at a specific time. A short expiry limits replay attacks: capturing a token and trying to join later fails because the token has expired.

4. In the join latency budget, which step typically contributes the most variance?

ICE gathering involves probing multiple candidate paths (host, server-reflexive, relay). When TURN is required — typically behind a corporate symmetric NAT — gathering can take 150–200 ms on its own, dwarfing every other step in the join sequence.

✍️ Exercise: design the "raise hand" feature end to end

A participant clicks "Raise hand." Within 500 ms, every other participant's UI should show a hand icon next to that person's name. No media plane involvement. Sketch the full flow: which plane carries it, what the message looks like, and how the server broadcasts it.

Model answer:

Raise hand is a presence event — it carries no audio or video, it is small, and delivery matters (a dropped event would leave the UI inconsistent). It belongs entirely on the signaling WebSocket.

// Client → signaling server
{ "type": "presence.update", "hand_raised": true }

// Signaling server → broadcast to all participants in room
{
  "type": "room.presence_changed",
  "participant_id": "pt_Uw83",
  "hand_raised": true
}

The server also updates its in-memory room state so latecomers receive current presence in their initial room snapshot. No REST API call needed — this is an in-call event, not a meeting lifecycle event.
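
A minimal sketch of that server-side behaviour follows, with per-participant lists standing in for WebSocket sends. The `Room` class, the `room.snapshot` message type, and the choice to echo the event back to the sender are assumptions made for illustration; the field names match the presence messages shown earlier.

```python
PRESENCE_FIELDS = ("audio_muted", "hand_raised", "screen_sharing")


class Room:
    """In-memory presence state plus a broadcast to every connected participant."""

    def __init__(self):
        self.presence = {}  # participant_id -> presence dict
        self.outboxes = {}  # participant_id -> queued messages (stand-in for a socket)

    def join(self, participant_id):
        # Latecomers get the current state of everyone already in the room.
        snapshot = {pid: dict(state) for pid, state in self.presence.items()}
        self.outboxes[participant_id] = [{"type": "room.snapshot", "presence": snapshot}]
        self.presence[participant_id] = {field: False for field in PRESENCE_FIELDS}

    def presence_update(self, sender_id, msg):
        # Accept only known presence fields; ignore anything else in the envelope.
        changed = {k: bool(v) for k, v in msg.items() if k in PRESENCE_FIELDS}
        self.presence[sender_id].update(changed)
        event = {"type": "room.presence_changed", "participant_id": sender_id, **changed}
        for outbox in self.outboxes.values():
            outbox.append(event)
        return event
```

Updating the stored state before broadcasting means a participant who joins a moment later still sees the raised hand, even though they missed the original event.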

Rubric: ✓ Routed through signaling, not REST ✓ Correctly uses a broadcast pattern ✓ Noted late-joiner state sync ✓ Did not involve the media plane ✓ Identified why TCP delivery guarantee matters here (presence consistency).

Key takeaways

Split control from media. REST for room lifecycle (infrequent, needs reliability), UDP/WebRTC for media (continuous, tolerates loss).
WebSocket is the signaling bridge. Bidirectional, low-overhead, TCP-backed — right for presence and SDP negotiation, wrong for video frames.
An SFU keeps each client's upload cost at O(1) by receiving one stream and fanning it out, rather than running a mesh.
Short-lived JWTs decouple the media hot path from the database — the SFU validates the token signature locally without a callback.
The join latency budget is dominated by ICE gathering, not by the REST call or SDP exchange.