API Design

Design Case Studies · Lesson 10

Design: Video Conferencing API

A single "join meeting" click kicks off three separate sub-systems: a REST call that books a room, a WebSocket handshake that negotiates participants, and a UDP media stream that carries the actual video. The skill is keeping these planes separate and letting each one optimise for what it does best.

⏱ 18 min · Difficulty: advanced · Prereq: REST, WebSockets, sockets & UDP

By the end you'll be able to

1 — Requirements

Before drawing a single endpoint, nail down what this system actually has to do. Video conferencing is a deceptively wide surface.

Functional requirements

Non-functional requirements

2 — Design decisions

Decision 1: Separate the control plane from the media plane

This is the central insight. Two different jobs require two different protocols:

Decision 2: WebSocket for signaling

Between the REST control plane and the UDP media plane sits a signaling layer. Signaling messages are small (a few hundred bytes), must arrive in order, and must be pushed to all participants without polling. HTTP long-poll and SSE could handle server-to-client pushes, but participants also need to send to the server mid-call (mute, raise hand). WebSockets provide a persistent, bidirectional channel over TCP — exactly what signaling needs without any polling overhead.

Decision 3: Selective Forwarding Unit (SFU) instead of a mesh

In a peer-to-peer mesh, every participant sends their stream to every other participant. With N participants, each client uploads (N−1) streams. At N = 10, that is 9 simultaneous uploads — saturating a typical home connection. An SFU is a server that each client sends exactly one upstream to; the SFU then selects which streams to forward downstream to each subscriber. Upload cost per client stays fixed at 1× regardless of meeting size.

Decision 4: Short-lived join tokens

Meeting IDs are shareable links, so they are not secrets. The join flow therefore requires a separate credential: a short-lived JWT that encodes {"meetingId", "participantId", "role", "exp"}. The client exchanges the meeting ID (plus auth) for a token from the REST API, then presents that token when it connects to the signaling and media servers. Each server validates the token signature locally, without calling back to the REST API, which keeps the join hot path off the database.
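The issue-then-verify-locally flow can be sketched with nothing but the standard library. This is a minimal illustration, not the lesson's implementation: the function names, the demo secret, and the choice of HS256 (a shared HMAC key) are all assumptions. A real deployment would more likely sign with an asymmetric key so the media server holds only the public half.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared signing key, for illustration only.
SECRET = b"demo-signing-key"


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_join_token(meeting_id, participant_id, role, now, ttl_s=300):
    """REST API side: sign the four claims with a 5-minute expiry."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"meetingId": meeting_id, "participantId": participant_id,
              "role": role, "exp": int(now) + ttl_s}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims))
    sig = hmac.new(SECRET, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(sig)


def verify_join_token(token, meeting_id, now):
    """Media/signaling side: no REST or database call, only local checks."""
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        expected = hmac.new(SECRET, f"{header_b64}.{claims_b64}".encode("ascii"),
                            hashlib.sha256).digest()
        # Constant-time comparison of the presented and expected signatures.
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(claims_b64))
    except ValueError:
        return None  # malformed token
    if claims.get("meetingId") != meeting_id or claims.get("exp", 0) <= now:
        return None  # wrong meeting or expired
    return claims
```

Passing `now` explicitly (instead of reading the clock inside the functions) keeps the sketch deterministic; a server would pass `time.time()`.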

Decision 5: Presence over the signaling WebSocket

Presence state (muted, hand raised, screen-sharing) is small, changes infrequently, and must be consistent across all participants. Routing it through the signaling WebSocket — not as media — keeps it decoupled from video packets and gives the server a single place to broadcast state changes.

[Figure: architecture diagram. Nodes: Client A (browser/app), Client B (browser/app), REST API (control plane), Signaling Server (WebSocket / TCP), SFU (media plane / UDP), DB / Cache (rooms, tokens). Links: HTTPS REST, WS, UDP/WebRTC.]
Three planes, three protocols. Control (teal, REST), signaling (blue, WebSocket), media (amber, UDP/WebRTC). Clients connect to all three; the SFU fans media out without passing through the signaling path.

3 — The API model

Control plane (REST)

These endpoints are called once per meeting lifecycle — not per frame.

# Create a meeting (host)
POST /v1/meetings
Authorization: Bearer <user-token>
Content-Type: application/json

{
  "title": "Weekly standup",
  "scheduled_at": "2025-09-01T09:00:00Z",
  "settings": { "max_participants": 25, "waiting_room": true }
}

# Response
HTTP/1.1 201 Created
{
  "id": "mtg_9kZxp2",
  "join_url": "https://meet.example.com/j/9kZxp2",
  "host_key": "hk_..."
}

# Join a meeting — returns a short-lived token + media server address
POST /v1/meetings/mtg_9kZxp2/join
Authorization: Bearer <user-token>

# Response: token is signed by REST API; media server validates it directly
HTTP/1.1 200 OK
{
  "participant_id": "pt_Uw83",
  "join_token": "eyJhbGc...",       // JWT, exp = 5 min
  "sfu_endpoint": "wss://sfu-eu-west.example.com",
  "signal_endpoint": "wss://sig-eu-west.example.com",
  "ice_servers": [                        // STUN/TURN for NAT traversal
    { "urls": "stun:stun.example.com:3478" },
    { "urls": "turn:turn.example.com:443", "credential": "..." }
  ]
}

Signaling plane (WebSocket)

After joining, the client opens a WebSocket to signal_endpoint and presents the join_token in the initial HTTP upgrade headers. All subsequent messages are JSON envelopes typed by "type".

// Client → server: announce presence and SDP offer
{
  "type": "participant.hello",
  "participant_id": "pt_Uw83",
  "display_name": "Divya",
  "sdp_offer": "v=0\r\no=..."   // WebRTC SDP for media negotiation
}

// Server → client: SDP answer from SFU
{
  "type": "sfu.answer",
  "sdp_answer": "v=0\r\no=..."
}

// Server → all: someone joined
{
  "type": "room.participant_joined",
  "participant": { "id": "pt_Uw83", "display_name": "Divya", "role": "attendee" }
}

// Client → server: presence update
{
  "type": "presence.update",
  "audio_muted": true,
  "hand_raised": false,
  "screen_sharing": false
}

[Figure: join sequence diagram. Lanes: Client, REST API, Signaling WS, SFU (UDP). Messages in order: POST /meetings/:id/join; 200 { join_token, sfu_endpoint, signal_endpoint }; WS upgrade (Authorization: Bearer join_token); participant.hello + sdp_offer; SDP offer relay; SDP answer; sfu.answer; ICE candidates (STUN/TURN) + DTLS handshake; SRTP/UDP media stream (audio + video); forwarded streams from other participants.]
Join sequence in full. REST handles authentication once; WebSocket carries signaling throughout the call; UDP carries media after the DTLS handshake. The SFU never touches the REST API — the JWT is self-validating.
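Because every signaling message is a JSON envelope keyed by "type", the server side reduces to a lookup table from type to handler. The sketch below is one possible shape, under assumptions: the `on` decorator, the `dispatch` function, and the "error" envelope are hypothetical names that do not appear in the lesson.

```python
import json

HANDLERS = {}  # envelope "type" -> handler function


def on(message_type):
    """Register a handler for one envelope type."""
    def register(func):
        HANDLERS[message_type] = func
        return func
    return register


@on("participant.hello")
def handle_hello(msg):
    # Announce the newcomer to the room. The role is hard-coded here; in
    # practice it would come from the validated join token.
    return {
        "type": "room.participant_joined",
        "participant": {
            "id": msg["participant_id"],
            "display_name": msg["display_name"],
            "role": "attendee",
        },
    }


def dispatch(raw_frame):
    """Parse one WebSocket text frame and route it by its "type" field."""
    try:
        msg = json.loads(raw_frame)
    except ValueError:
        return {"type": "error", "reason": "invalid_json"}
    handler = HANDLERS.get(msg.get("type")) if isinstance(msg, dict) else None
    if handler is None:
        return {"type": "error", "reason": "unknown_type"}
    return handler(msg)
```

Rejecting unknown types explicitly, rather than ignoring them, makes client and server version mismatches visible during debugging.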

4 — Evaluation & latency budget

Control vs media plane: why the separation holds

The REST control plane sees roughly one request per participant per join, so it scales with meeting-creation and join events, not with call duration or packet rate. It can live behind a standard HTTP gateway with ordinary database-backed auth. The UDP media plane, by contrast, carries a continuous packet stream for every participant: audio at about 50 packets per second plus 720p video at ~30 fps, where each frame spans several packets. Once a client is also receiving other participants' streams, that adds up to thousands of packets per second. Sending this traffic as HTTP requests would put TCP's in-order, retransmit-on-loss delivery under every packet, so one lost segment stalls everything queued behind it for at least a round trip (often 50 ms or more), which is incompatible with real-time audio perception. Lesson 05 (Sockets) covers why TCP's head-of-line blocking is lethal for media.

Latency budget for the join flow

| Step | Protocol | Typical latency | Notes |
| --- | --- | --- | --- |
| POST /meetings/:id/join | HTTPS / TCP | 40–80 ms | One DB read for meeting, JWT signing. Can be cached. |
| WebSocket upgrade | TCP | 20–40 ms | One extra round trip (101 Switching Protocols). |
| SDP offer/answer via signaling | WebSocket | 10–30 ms | Signaling server relays to SFU; answer sent back. |
| ICE candidate gathering | STUN/TURN / UDP | 50–200 ms | Worst case: symmetric NAT requires TURN relay. |
| DTLS handshake | UDP (DTLS 1.2) | 20–60 ms | 1–2 round trips for key exchange. |
| First audio frame | SRTP/UDP | ≈ 140–410 ms total | Budget dominated by ICE, especially behind symmetric NAT. |
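The total in the last row is just the sum of the per-step ranges, which a few lines can confirm. The snippet copies the figures from the table; the function names are illustrative only.

```python
# (step, best-case ms, worst-case ms), copied from the latency table.
JOIN_STEPS = [
    ("POST /meetings/:id/join", 40, 80),
    ("WebSocket upgrade", 20, 40),
    ("SDP offer/answer via signaling", 10, 30),
    ("ICE candidate gathering", 50, 200),
    ("DTLS handshake", 20, 60),
]


def budget(steps):
    """Return (best, worst) total latency in ms for sequential steps."""
    best = sum(low for _, low, _ in steps)
    worst = sum(high for _, _, high in steps)
    return best, worst


def widest_step(steps):
    """Return the step whose latency range (worst - best) is largest."""
    return max(steps, key=lambda step: step[2] - step[1])[0]


print(budget(JOIN_STEPS))       # (140, 410)
print(widest_step(JOIN_STEPS))  # ICE candidate gathering
```

The spread of the ICE step (150 ms) is larger than the spread of all other steps combined (130 ms), which is why it dominates the variance of the join time.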

SFU fan-out under load

At 25 cameras × 1.5 Mbps each, the SFU handles 37.5 Mbps of inbound media. If it forwards each stream to all 24 other participants: 24 × 25 × 1.5 = 900 Mbps outbound. In practice, SFUs use simulcast (clients send three resolution tiers) and subscriber-side bandwidth estimation to send each recipient only what their connection can absorb. A well-implemented SFU sends each participant at most its own subscribed bandwidth, collapsing the 900 Mbps theoretical figure to roughly the sum of each participant's available downlink.
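The arithmetic in this paragraph can be reproduced directly. The sketch assumes every participant publishes one stream at the same bitrate; the function names and the 10 Mbps downlink in the usage lines are made-up illustration values.

```python
def sfu_load_mbps(participants, bitrate_mbps):
    """Inbound and naive full fan-out outbound bandwidth at the SFU."""
    inbound = participants * bitrate_mbps
    # Every stream is forwarded to each of the other (participants - 1) clients.
    outbound = participants * (participants - 1) * bitrate_mbps
    return inbound, outbound


def capped_outbound_mbps(participants, bitrate_mbps, downlinks_mbps):
    """Outbound when each subscriber receives no more than its downlink allows."""
    wanted = (participants - 1) * bitrate_mbps
    return sum(min(wanted, downlink) for downlink in downlinks_mbps)


print(sfu_load_mbps(25, 1.5))                    # (37.5, 900.0)
print(capped_outbound_mbps(25, 1.5, [10] * 25))  # 250
```

With 25 subscribers that can each absorb only 10 Mbps, the outbound figure drops from 900 Mbps to 250 Mbps, which is the "sum of each participant's available downlink" bound described above.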

🎯 Interview angle

Interviewers often ask "why not just use HTTP for everything?" The sharp answer is: HTTP is TCP-based, and TCP guarantees delivery by retransmitting lost packets. In audio, a retransmit arrives 50–100 ms too late to play — you're better off interpolating over the gap. That's why real-time media uses UDP and tolerates loss rather than retransmitting. Signaling is different: a lost "participant left" message would leave the UI wrong. TCP delivery guarantees are worth the slight latency there.

⚠️ Common trap

Putting media routing through the signaling WebSocket. WebSocket is TCP-backed — a large video keyframe burst can cause head-of-line blocking that stalls the control messages trying to share the same connection. Always run media on a dedicated UDP transport; keep the WebSocket exclusively for signaling events.

✅ Do this, not that

Do issue short-lived JWTs from the REST API and validate them on the media server without a callback. Don't have the SFU call the REST API on every participant connect: that couples the media hot path to your database, adds 40–80 ms of latency, and creates a correlated failure mode in which a database outage blocks every new or reconnecting media session.

Under the hood: the core mechanism

Three architectural facts determine everything about how a real SFU works. Understanding them lets you reason about scaling, latency, and failure modes without guessing.

Why mesh breaks: N(N-1) streams

In a pure peer-to-peer mesh each participant opens a direct connection to every other participant. With N active cameras, each client uploads N-1 streams and downloads N-1 streams:

| Participants | Streams per client | Total streams in meeting | Upload demand at 1.5 Mbps |
| --- | --- | --- | --- |
| 2 | 1 up + 1 down | 2 | 1.5 Mbps |
| 4 | 3 up + 3 down | 12 | 4.5 Mbps (saturates home upload) |
| 10 | 9 up + 9 down | 90 | 13.5 Mbps (impossible on most lines) |
| 25 | 24 up + 24 down | 600 | 36 Mbps (catastrophic) |

The SFU collapses the upload cost to exactly 1 stream per sender regardless of participant count. Each client opens one UDP connection to the SFU and sends one stream; the SFU fans it out to all subscribers. Download cost per subscriber equals the number of streams they choose to receive — typically the N-1 other participants — but that load now falls on the SFU's outbound bandwidth, not the sender's upload.
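The rows of the mesh table follow from two formulas: N−1 uploads per client and N(N−1) streams in total. A short sketch makes the contrast with the SFU explicit, assuming one 1.5 Mbps stream per camera; the function and key names are illustrative.

```python
def mesh_cost(n, bitrate_mbps=1.5):
    """Per-client and total cost of a full peer-to-peer mesh with n cameras."""
    return {
        "uploads_per_client": n - 1,
        "total_streams": n * (n - 1),
        "client_upload_mbps": (n - 1) * bitrate_mbps,
    }


def sfu_cost(n, bitrate_mbps=1.5):
    """Same meeting through an SFU: one upload per client, fan-out at the server."""
    return {
        "uploads_per_client": 1,
        "client_upload_mbps": bitrate_mbps,
        "sfu_forwarded_streams": n * (n - 1),
    }


for n in (2, 4, 10, 25):
    print(n, mesh_cost(n)["client_upload_mbps"], sfu_cost(n)["client_upload_mbps"])
```

The number of forwarded streams is the same N(N−1) in both designs; what changes is who pays for it. In the mesh it is each client's uplink, with the SFU it is the server's outbound bandwidth.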

Signaling (WebSocket/TCP) vs media (WebRTC/UDP): two completely separate pipes

These are not different message types on the same connection — they are different transport protocols with different network sockets:

| Property | Signaling (WebSocket) | Media (WebRTC/UDP) |
| --- | --- | --- |
| Transport | TCP (WebSocket runs on top of it) | UDP (SRTP) |
| Delivery | Guaranteed, ordered | Best-effort, unordered |
| Loss handling | TCP retransmits | Lost frame = interpolate / conceal |
| Message size | Tens to hundreds of bytes | RTP packets of up to ~1200 bytes; audio at 50 pps, video at 30 fps |
| Volume | Handful of events per second for the whole call | Thousands of packets per second per participant |
| Port | 443 (WSS) | Negotiated via ICE; often 443 UDP via TURN |

A signaling message that delivers a "mute" event 80 ms late is acceptable. An audio packet delivered 80 ms late has already missed its playback slot, and playing it anyway would be more disruptive than concealing the gap. These fundamentally different tolerance profiles demand separate transports.
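The "missed its playback slot" rule can be written as a playout-deadline check. The following is a simplified sketch, not a real jitter buffer: it assumes 20 ms audio packets (50 pps), synchronized clocks, and a fixed 60 ms playout delay, and all names are hypothetical.

```python
FRAME_MS = 20  # one audio packet every 20 ms, i.e. 50 packets per second


def schedule_playout(delays_ms, playout_delay_ms=60):
    """Decide per packet whether it can be played or must be concealed.

    delays_ms[i] is the one-way network delay of packet i in ms,
    or None if the packet was lost in transit.
    """
    decisions = []
    for seq, delay in enumerate(delays_ms):
        sent_at = seq * FRAME_MS
        deadline = sent_at + playout_delay_ms
        if delay is not None and sent_at + delay <= deadline:
            decisions.append((seq, "play"))
        else:
            # Lost or late: the slot has passed, so conceal instead of waiting.
            decisions.append((seq, "conceal"))
    return decisions


print(schedule_playout([30, 80, None, 45]))
# [(0, 'play'), (1, 'conceal'), (2, 'conceal'), (3, 'play')]
```

Packet 1 did arrive, but 80 ms is past its 60 ms deadline, so it is treated exactly like the lost packet 2. A TCP retransmission would only turn lost packets into late ones.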

Worked trace: 3 participants join in sequence

Alice is already in the room. Bob joins. Then Carol joins. Trace every stream that the SFU handles:

─── Alice joins (already in room) ───────────────────────────────
Alice → SFU:  1 upstream  (Alice's video)
SFU stores:   track[Alice] = stream_A

─── Bob joins ────────────────────────────────────────────────────
Bob  → REST API:  POST /v1/meetings/mtg_9kZxp2/join
REST → Bob:       { join_token, sfu_endpoint }

Bob  → Signaling: WS upgrade + participant.hello + sdp_offer
Signaling → SFU:  relay SDP offer
SFU → Signaling:  sdp_answer (SFU will accept Bob's upstream + subscribe to Alice)
Signaling → Bob:  sfu.answer

# ICE + DTLS handshake establishes Bob ↔ SFU UDP path
Bob → SFU:  1 upstream (Bob's video)       ← new stream in
SFU → Bob:  Alice's stream (forwarded)      ← 1 stream out to Bob
SFU → Alice: Bob's stream (forwarded)       ← 1 stream out to Alice

Signaling → Alice: room.participant_joined { id: Bob }  ← UI update

Total SFU state:  track[Alice], track[Bob]
Streams flowing: 2 upstreams in, 2 downstreams out (one per subscriber)

─── Carol joins ──────────────────────────────────────────────────
Carol → REST, Signaling, SFU:  same join flow

Carol → SFU:  1 upstream (Carol's video)
SFU → Carol:  Alice's stream + Bob's stream   ← 2 streams out to Carol
SFU → Alice:  Carol's stream                   ← Alice now gets 2 downstreams total
SFU → Bob:    Carol's stream                   ← Bob now gets 2 downstreams total

Signaling → Alice, Bob: room.participant_joined { id: Carol }

Final SFU state (3 participants, all cameras on):
  Upstreams:   3  (one per sender, fixed at O(1) per client)
  Downstreams: 6  (each participant receives 2 others' streams)
  Compare mesh: 6 streams (each client uploads 2 and downloads 2), every one carried on a client uplink
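The stream counts in the trace can be reproduced with a toy subscription table. This is a bookkeeping sketch only (no networking), and the `Sfu` class with its `join` and `counts` methods is a hypothetical structure, not any real SFU's API.

```python
class Sfu:
    """Tracks who publishes and who is subscribed to whom."""

    def __init__(self):
        self.tracks = {}         # publisher -> track id
        self.subscriptions = {}  # subscriber -> set of publishers received

    def join(self, name):
        # Everyone already in the room starts receiving the newcomer.
        for other in self.tracks:
            self.subscriptions[other].add(name)
        # The newcomer receives every existing track, never its own.
        self.subscriptions[name] = set(self.tracks)
        self.tracks[name] = f"stream_{name[0]}"

    def counts(self):
        """Return (upstreams, downstreams) currently flowing."""
        upstreams = len(self.tracks)
        downstreams = sum(len(subs) for subs in self.subscriptions.values())
        return upstreams, downstreams


sfu = Sfu()
for person in ("Alice", "Bob", "Carol"):
    sfu.join(person)
    print(person, sfu.counts())
# Alice (1, 0)
# Bob (2, 2)
# Carol (3, 6)
```

Building the newcomer's subscription set before registering its own track is what keeps a participant from being subscribed to itself.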

When simulcast is enabled, each sender actually sends 3 resolution tiers (e.g. 720p, 360p, 180p). The SFU forwards only the tier that matches each subscriber's available bandwidth, which it estimates from RTCP feedback such as receiver reports. This reduces downstream bandwidth further without the sender doing any extra work beyond the 3-tier upload.
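Tier selection itself is a small decision: forward the best tier that fits the subscriber's estimated bandwidth. The sketch below assumes three tiers with made-up bitrates; real encoders and SFUs use their own values and add hysteresis to avoid flapping between tiers.

```python
# Hypothetical simulcast tiers (name, bitrate in kbps), highest quality first.
TIERS = [("high", 1500), ("medium", 500), ("low", 150)]


def pick_tier(estimated_kbps, tiers=TIERS):
    """Return the best tier whose bitrate fits the subscriber's estimate."""
    for name, kbps in tiers:
        if kbps <= estimated_kbps:
            return name
    # Not even the lowest tier fits; an SFU might pause video for this subscriber.
    return None


print(pick_tier(2000))  # high
print(pick_tier(600))   # medium
print(pick_tier(100))   # None
```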

[Figure: SFU fan-out diagram, titled "SFU forwards selectively". Uploads: Alice, Bob, and Carol each send 1 upload. Downloads: Alice receives Bob + Carol; Bob receives Alice + Carol; Carol receives Alice + Bob.]
Each participant uploads exactly once (amber). The SFU fans out two streams per subscriber (green). Upload cost is fixed at O(1) per sender; download cost scales with the number of subscribed streams, but that burden falls on the SFU, not on the senders.

Operating & debugging it

A video call failure looks different depending on which plane broke. The single most useful first step is identifying whether the symptom is in the signaling plane (WebSocket), the media plane (UDP/SRTP), or the join flow (REST + ICE). Each has distinct observable symptoms and distinct tools.

Production inspection

# 1. Check the REST join endpoint is healthy
curl -i -X POST https://api.example.com/v1/meetings/mtg_9kZxp2/join \
  -H "Authorization: Bearer $USER_TOKEN"
HTTP/1.1 200 OK   ← join_token, sfu_endpoint, signal_endpoint returned
HTTP/1.1 503      ← REST API down; participant cannot start the join flow

# 2. Verify the signaling WebSocket accepts connections
wscat -c "wss://sig-eu-west.example.com" \
  -H "Authorization: Bearer $JOIN_TOKEN"
Connected (press CTRL+C to quit)   ← signaling reachable
error: connect ETIMEDOUT           ← firewall or server down

# 3. Probe the STUN/TURN server (ICE health check)
stunclient stun.example.com 3478
Binding test: pass (server reflexive address returned)   ← STUN reachable
Binding test: failed (no response)                       ← STUN port blocked

# 4. In Chrome: inspect WebRTC internals for an active call
chrome://webrtc-internals → select the PeerConnection → check ICE state,
candidatePair state (should be "succeeded"), audio/video packetsReceived,
and jitterBufferDelay. packetsSent > 0 confirms upstream is flowing.

# 5. SFU-side: monitor active stream counts (Prometheus example)
sfu_active_upstreams{meeting="mtg_9kZxp2"} 3     ← one per camera-on participant
sfu_active_downstreams{meeting="mtg_9kZxp2"} 6   ← 3 participants × 2 each
sfu_packet_loss_percent{track="pt_Uw83"} 0.3     ← healthy; alert above 5%

Symptom → cause → fix

| Symptom | Likely cause | Where to look | Fix |
| --- | --- | --- | --- |
| "Join" button spins, never connects | REST API down or join token request failing | Browser Network tab → POST /v1/meetings/:id/join status code | Check REST API health; if 401 confirm user token is valid; if 503 scale API tier |
| Joins REST but hangs on "Connecting…" | ICE gathering stuck: STUN unreachable or TURN needed but not configured | chrome://webrtc-internals → ICE candidates; "relay" candidate present? | Ensure TURN server is in ice_servers response; test TURN port 443 UDP is open |
| One-way audio (can hear others, they can't hear me) | Upstream to SFU not flowing; microphone permission denied or wrong device | webrtc-internals → outbound-rtp audio → packetsSent stuck at 0 | Confirm microphone permission; check SFU inbound stream count for that participant |
| Frozen video after 30–60 s, audio fine | Video bitrate exceeded subscriber's bandwidth; simulcast not configured or SFU not adapting | webrtc-internals → qualityLimitationReason = "bandwidth"; RTCP receiver reports | Enable simulcast; ensure SFU reads RTCP feedback and downgrades forwarded tier |
| Signaling events delayed (mute lag, join notifications slow) | WebSocket message queue backup; signaling server overloaded | Signaling server metrics → WS message queue depth, handler latency | Scale signaling tier; check for expensive operations blocking the WS event loop |
| Participant sees themselves in their own video grid | SFU forwarding sender's stream back to themselves | SFU subscription table: is sender subscribed to their own track? | Exclude self-track from subscription list when building subscriber routes |
| Media drops completely when signaling message is large | Large SDP or screen-share track renegotiation on WebSocket stalls signaling TCP buffer, causing apparent media stall | WS frame sizes; check for 100 KB+ SDP messages during renegotiation | Separate media from signaling; never route media over the WS connection |

⚠️ The ICE restart trap

When a participant's IP changes mid-call (Wi-Fi handoff, VPN toggle), the existing ICE candidate pair becomes invalid. WebRTC supports ICE restart, a new offer/answer exchange that gathers fresh candidates, but that exchange must travel over the signaling channel. If the signaling WebSocket is also broken (e.g. the device was offline briefly), the ICE restart cannot proceed until the WebSocket has been re-established. Reconnect the WebSocket first, with exponential backoff, and only then attempt the ICE restart: the signaling channel is the prerequisite for media recovery.
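The ordering constraint (signaling first, ICE restart second) can be sketched as a small recovery loop. The callbacks `connect_ws`, `restart_ice`, and `sleep` are placeholders for whatever the client actually uses, and the base delay, cap, and attempt count are assumed values. The backoff uses full jitter (a random delay up to the exponential ceiling), which is one common choice among several.

```python
import random


def backoff_delays(base_s=0.5, cap_s=30.0, attempts=6, rng=None):
    """Yield exponentially growing, jittered delays, capped at cap_s seconds."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        yield rng.uniform(0, ceiling)


def recover(connect_ws, restart_ice, sleep, **backoff_options):
    """Reconnect signaling first; trigger ICE restart only once it is back."""
    for delay in backoff_delays(**backoff_options):
        if connect_ws():
            # Signaling is up again, so the new offer/answer can be exchanged.
            restart_ice()
            return True
        sleep(delay)
    return False  # give up and surface the failure to the user
```

In a real client `sleep` would be `time.sleep` or an async equivalent; injecting it keeps the loop easy to exercise without waiting.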

🧠 Quick check

1. Why does video conferencing media use UDP rather than TCP?

TCP's retransmit guarantee is a liability for real-time media. A packet that arrives 80 ms late is useless — you've already passed the playback deadline. UDP lets the application decide how to handle loss (interpolate, conceal) rather than waiting for a retransmit.

2. In an SFU architecture, how many upstream video streams does each client send?

One. The SFU's job is to break the N×N mesh. Each client uploads once to the SFU; the SFU fans that stream out to subscribers. Upload cost is O(1) per client, not O(N).

3. A join token expires 5 minutes after issue. What problem does this solve?

Meeting IDs are public. The token is the secret — it binds a specific user to a specific meeting at a specific time. A short expiry limits replay attacks: capturing a token and trying to join later fails because the token has expired.

4. In the join latency budget, which step typically contributes the most variance?

ICE gathering involves probing multiple candidate paths (host, server-reflexive, relay). When TURN is required — typically behind a corporate symmetric NAT — gathering can take 150–200 ms on its own, dwarfing every other step in the join sequence.

✍️ Exercise: design the "raise hand" feature end to end

A participant clicks "Raise hand." Within 500 ms, every other participant's UI should show a hand icon next to that person's name. No media plane involvement. Sketch the full flow: which plane carries it, what the message looks like, and how the server broadcasts it.

Model answer:

Raise hand is a presence event — it carries no audio or video, it is small, and delivery matters (a dropped event would leave the UI inconsistent). It belongs entirely on the signaling WebSocket.

// Client → signaling server
{ "type": "presence.update", "hand_raised": true }

// Signaling server → broadcast to all participants in room
{
  "type": "room.presence_changed",
  "participant_id": "pt_Uw83",
  "hand_raised": true
}

The server also updates its in-memory room state so latecomers receive current presence in their initial room snapshot. No REST API call needed — this is an in-call event, not a meeting lifecycle event.
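The broadcast plus late-joiner sync from the model answer can be sketched as an in-memory room object. Lists stand in for the per-client WebSocket connections, and the `Room` class and the "room.snapshot" message type are hypothetical names introduced only for this illustration.

```python
class Room:
    """In-memory presence state for one meeting on the signaling server."""

    def __init__(self):
        self.presence = {}  # participant_id -> presence fields
        self.outboxes = {}  # participant_id -> messages "sent" to that client

    def join(self, participant_id):
        self.presence[participant_id] = {"hand_raised": False}
        self.outboxes[participant_id] = []
        # Late joiners receive the current presence of everyone in one message.
        snapshot = {
            "type": "room.snapshot",
            "presence": {pid: dict(state) for pid, state in self.presence.items()},
        }
        self.outboxes[participant_id].append(snapshot)

    def on_presence_update(self, sender_id, msg):
        raised = bool(msg["hand_raised"])
        self.presence[sender_id]["hand_raised"] = raised
        event = {
            "type": "room.presence_changed",
            "participant_id": sender_id,
            "hand_raised": raised,
        }
        # Broadcast to all participants in the room, sender included.
        for outbox in self.outboxes.values():
            outbox.append(event)
```

A short usage run: after `pt_A` raises a hand, a participant who joins later sees `hand_raised: True` for `pt_A` in their snapshot without any REST call.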

Rubric: ✓ Routed through signaling, not REST ✓ Correctly uses a broadcast pattern ✓ Noted late-joiner state sync ✓ Did not involve the media plane ✓ Identified why TCP delivery guarantee matters here (presence consistency).

Key takeaways

Sources & further reading