API Design

Design Case Studies · Lesson 08

Design: Chat/Messenger API

A chat system feels simple — two people typing to each other. But behind that interface sits one of the hardest distributed-systems problems in consumer software: routing a message in milliseconds from one connection on server A to another connection on server B, reliably, at hundreds of millions of simultaneous connections, without losing a word.

⏱ ~20 min Advanced Prereq: WebSockets, Idempotency, Pub/Sub

By the end you'll be able to

Requirements

Before designing anything, pin what the system must do. A messenger is not a single use case — it's a cluster of related problems that pull in different directions:

Design decisions

Transport: WebSockets for real-time, REST for history

Chat is the canonical use case for WebSockets. The server must push messages to clients without waiting to be asked; polling at any useful frequency would be prohibitively expensive at scale. A WebSocket gives each connected client a persistent, bidirectional channel: the client can send a message to the server at any time, and the server can push any message back, all over a single TCP connection opened once per session.

But not everything needs a WebSocket. Message history — fetching a past conversation — is a classic request/response: the client asks, the server answers with a paginated list, and there's nothing to push. Using the WebSocket channel for history fetches would mean inventing a query/response protocol on top of it, which is just reinventing REST. Keep the right tool for each job: WebSocket for the live stream, REST for the archive.

Message ordering: monotonic server-assigned sequence IDs

The natural instinct is to timestamp messages at send time. The problem is that client clocks lie. A phone that drifted 500 ms behind, an NTP correction that jumps the clock forward — these cause messages to appear out of order on the receiving side. Two clients sending simultaneously might both stamp the same millisecond.

The correct approach is to have the server assign a monotonically increasing sequence number to every message within a conversation at the moment it is accepted. The sequence number is the canonical ordering. Clients display messages in sequence-number order, not timestamp order (though they can show timestamps as a human-readable label). This is how iMessage, Signal, and WhatsApp handle it.

Cross-server fan-out: pub/sub per-conversation channel

With millions of connections, no single machine can hold them all. You need a fleet of WebSocket servers. The moment you have more than one server, you have a routing problem: Alice's connection might be on ws-server-7 and Bob's on ws-server-3. When Alice sends a message, ws-server-7 receives it — but it has no direct socket to Bob. It needs a way to hand the message to ws-server-3.

The solution is a pub/sub broker (Redis Pub/Sub, Kafka, or a purpose-built message bus like the one WhatsApp built). Each WebSocket server subscribes to a channel named after each conversation whose members are connected to it. When a message arrives on any server, that server publishes to the conversation's channel; all other servers subscribed to that channel receive it and push it to their connected members. The broker is the backbone that makes the WebSocket fleet behave like a single logical router.

Idempotent sends with client-generated idempotency keys

WebSocket connections drop. A client sends a message, the server receives it and writes it to storage, and then the TCP connection breaks before the ACK travels back. The client has no way to know whether the message arrived. Without a guard, a retry creates a duplicate.

The fix follows the same pattern as idempotency in REST APIs: the client generates a UUID (the idempotency key) before sending. It attaches this key to the message. The server stores the key alongside the message when it first persists it. If the server sees the same key again, it skips the write and returns the original sequence number. The client retries freely; duplicates are silently collapsed.

Store-and-forward for offline users

When the recipient is offline, there is no WebSocket connection to push to. The message still needs to reach them. The server writes the undelivered message to a durable message store keyed by recipient user ID and sends a push notification (APNs for iOS, FCM for Android) to wake the device. When the device comes back online, the client reconnects via WebSocket and immediately fetches any missed messages via the REST history endpoint using a cursor pointing to the last sequence number it received. This hybrid — push notification to trigger reconnect, REST pull to retrieve missed messages — is more reliable than trying to queue messages in the push notification payload itself (which has strict size limits).

Presence service

Presence (online/offline, typing indicators) is deliberately separated from the message-delivery path. Presence state is high-churn and low-durability: it doesn't matter if you miss a "user started typing" update, and presence state expires naturally when a connection closes. A dedicated presence service (typically backed by Redis with short TTLs) handles these updates independently so a presence storm (e.g., millions of users coming online after an app launch) does not interfere with message delivery.

The API model

WebSocket connection

Clients establish a persistent session by upgrading to WebSocket with a short-lived auth token:

# Upgrade request — client includes a short-lived auth token
GET wss://chat.example.com/v1/ws?token=eyJhbGci... HTTP/1.1
Upgrade: websocket
Connection: Upgrade

# Server accepts
HTTP/1.1 101 Switching Protocols
Upgrade: websocket

WebSocket message types

All messages over the WebSocket channel are JSON frames with a type field:

// Client → Server: send a message
{
  "type": "SEND",
  "idempotency_key": "c7f1a2b3-4d56-...",  // client-generated UUID
  "conversation_id": "conv_9kZp",
  "text": "Are you free tomorrow?",
  "client_timestamp": 1718908412341   // advisory only, not used for ordering
}

// Server → Client: deliver a message to the recipient
{
  "type": "MESSAGE",
  "message_id": "msg_7wQr",
  "conversation_id": "conv_9kZp",
  "seq": 1042,                            // server-assigned monotonic sequence number
  "sender_id": "usr_a1b2",
  "text": "Are you free tomorrow?",
  "ts": 1718908412399                   // server-assigned timestamp (display only)
}

// Server → Client: acknowledge the sender's SEND frame
{
  "type": "ACK",
  "idempotency_key": "c7f1a2b3-4d56-...",  // echoed back for matching
  "message_id": "msg_7wQr",
  "seq": 1042
}

// Client → Server: typing indicator (ephemeral, not stored)
{
  "type": "TYPING",
  "conversation_id": "conv_9kZp",
  "state": "started"        // or "stopped"
}

// Server → Client: presence update
{
  "type": "PRESENCE",
  "user_id": "usr_c3d4",
  "status": "online",        // "online" | "offline" | "typing"
  "conversation_id": "conv_9kZp"
}

REST endpoints

All history and metadata retrieval goes over REST, not the WebSocket channel:

# Paginated message history — cursor is the last seq the client has
GET /v1/conversations/:id/messages?cursor=1041&limit=50
Authorization: Bearer <token>

200 OK
{
  "messages": [
    { "message_id": "msg_7wQr", "seq": 1042, "sender_id": "usr_a1b2",
      "text": "Are you free tomorrow?", "ts": 1718908412399 },
    ...
  ],
  "next_cursor": 1091,    // null if no more pages
  "has_more": true
}

# List the authenticated user's active conversations
GET /v1/conversations
Authorization: Bearer <token>

200 OK
{
  "conversations": [
    { "conversation_id": "conv_9kZp", "type": "direct",
      "last_message_preview": "Are you free tomorrow?",
      "unread_count": 3, "last_seq": 1042 },
    ...
  ]
}

# Fallback: send a message via REST (offline clients, server-to-server)
POST /v1/conversations/:id/messages
Authorization: Bearer <token>
Idempotency-Key: c7f1a2b3-4d56-...

{
  "text": "Are you free tomorrow?"
}

201 Created
{ "message_id": "msg_7wQr", "seq": 1042 }

# Presence lookup
GET /v1/users/:id/presence

200 OK
{
  "user_id": "usr_c3d4",
  "status": "online",
  "last_seen": 1718908400000
}

Evaluation & latency budget

Connection state and keepalives

A WebSocket connection is a stateful TCP socket. Idle connections that go quiet can be silently dropped by NAT devices, load balancers, and mobile network middleboxes without either endpoint noticing. The standard guard is a ping/pong heartbeat: the server sends a WebSocket ping frame every 30 seconds; if no pong arrives within a deadline, the connection is declared dead and cleaned up. Clients should implement exponential backoff on reconnect — if ws-server-12 is overloaded and refusing connections, a thundering herd of immediate reconnect attempts will make it worse.

The cross-server fan-out problem in detail

Consider a direct message from Alice to Bob. Alice's device is connected to ws-server-4; Bob's to ws-server-11. Without coordination, ws-server-4 receives Alice's SEND frame but has no socket for Bob. It cannot deliver the message alone.

The pub/sub layer solves this. Every WebSocket server subscribes to a channel for each conversation whose members are currently connected to it. When Alice's message arrives on ws-server-4:

  1. ws-server-4 writes the message to the durable message store and assigns sequence number 1042.
  2. ws-server-4 publishes the message to the conv_9kZp pub/sub channel.
  3. The pub/sub broker fans the event to all subscribers — including ws-server-11.
  4. ws-server-11 sees it has Bob's connection, looks up Bob's socket, and pushes the MESSAGE frame.
  5. ws-server-4 sends an ACK back to Alice.

For group conversations, every server that holds at least one group member subscribes to the group's channel. A message to a 200-person group published once to the broker fans out to however many servers those 200 members are spread across — typically far fewer than 200 servers, since each server holds thousands of connections.

Ordering guarantees

Sequence numbers are assigned at the moment the message is written to the durable store (step 1 above). This creates a total ordering within a conversation: no two messages share a sequence number, and the order is the order the server accepted them — not the order clients sent them. Clients must sort their local message list by seq, not by local timestamp. Any gaps in the sequence seen by a client (e.g., jumping from seq 1040 to 1043 after a reconnect) indicate missed messages the client should fetch via the history endpoint.

Back-of-envelope: concurrent connections

How many open WebSocket connections must the fleet sustain?

# Sizing the WebSocket fleet

DAU                = 500_000_000   # 500 million daily active users
concurrency_rate   =         0.10   # 10% online at peak (conservative)
peak_connections   =  50_000_000   # 50 million simultaneous WS connections

connections_per_host =      50_000   # tuned OS file-descriptor limit
ws_hosts_needed      =       1_000   # 50M / 50K = 1,000 WebSocket servers

# Memory per open connection (kernel socket + app state)
mem_per_conn     =   50 # KB (kernel buffers + session context)
total_conn_mem   = 2_500 # GB across the fleet — i.e., ~2.5 GB per host

# Message throughput at a modest 2 msgs/user/min active
msgs_per_sec = 50_000_000 * 2 / 60  # ≈ 1.7 million messages/second peak

These numbers explain why WhatsApp ran on surprisingly few servers — each Erlang process handled tens of thousands of lightweight connections — and why Go and Rust have become attractive for WebSocket gateway services. The bottleneck is not CPU but file descriptors and memory per connection.

Sender Client WS Server A Alice connected here Pub/Sub Broker conv_9kZp channel WS Server B Bob connected here Recipient Client ACK publish fan-out push ① SEND frame ② write + seq ③ broker routes ④ deliver
Cross-server message delivery. WS Server A publishes to the pub/sub broker; the broker fans the message to WS Server B, which holds the recipient's connection. The ACK travels back to the sender independently.
🎯 Interview angle: the fan-out problem

In a system design interview, the cross-server fan-out problem is the crux of messenger design. Interviewers want to see that you understand why a naive fleet of WebSocket servers doesn't work without a shared routing layer. The right answer: each WebSocket server subscribes to a pub/sub channel per active conversation. A message sent to one server is published to that channel; all other servers subscribed to that channel receive it and deliver it to their connected clients. Name the broker type you'd use (Redis Pub/Sub for small scale, Kafka partitioned by conversation ID for high throughput) and explain the trade-offs — Redis is simpler and lower-latency; Kafka gives replay and stronger durability. Also mention that each pub/sub channel is scoped to a conversation, not a user, so group messages naturally fan out to all servers hosting group members.

Message Store durable, seq-indexed WS Server write Online Recipient push via WS Offline Recipient no socket Push Notif. APNs / FCM wake device reconnect + GET /messages ?cursor=last_seq
Store-and-forward. Online recipients receive the message over WebSocket immediately. Offline recipients get a push notification to wake their device; on reconnect they fetch missed messages from the message store via the REST history endpoint using a sequence cursor.
⚠️ Pitfall: using wall-clock timestamps for message ordering

It feels natural to order messages by the timestamp the client attached at send time. The problem is that client clocks are untrustworthy. A phone that gained 800 ms from NTP drift, a user with a manually set clock, two devices sending at the exact same millisecond — all of these produce orderings that differ between clients and can reorder the conversation differently for Alice than for Bob. Wall-clock timestamps are fine as a human-readable display label. They must never be the canonical ordering mechanism. Use server-assigned, atomically-incremented sequence numbers per conversation instead. The sequence is the truth; the timestamp is decoration.

✅ Tip: idempotency keys prevent ghost messages

The client generates a UUID for the message before opening the SEND frame — not after receiving the ACK. It holds onto this UUID. If the WebSocket drops between send and ACK, the client reconnects and retransmits the exact same frame with the exact same UUID. The server checks its deduplication table: if the key exists, it returns the original message_id and seq without writing a second row. The client never needs to worry about whether its message was saved once or twice — the UUID collapses all retries into a single record. This is the same idempotency key pattern as Stripe's charge endpoint, applied to WebSocket frames.

Under the hood: the core delivery mechanism

The phrase "pub/sub fan-out" hides the exact data structures and message flows that make cross-server delivery work. Here is each component in precise detail.

Per-server connection registry

Each WebSocket server maintains an in-process hash map that associates user IDs to their open socket handles:

// In-memory registry on ws-server-4 (pseudocode)
connection_registry: Map<user_id: string, socket: WebSocket> = {
  "usr_alice":  <socket fd=41, addr=192.0.2.10:58234>,
  "usr_carol":  <socket fd=42, addr=198.51.100.7:49812>,
  // ... tens of thousands of entries
}

// On WebSocket connect:
connection_registry.set(user_id, socket)

// On WebSocket disconnect (or ping timeout):
connection_registry.delete(user_id)

This registry is local to one server process. It cannot be used to deliver to a user on a different server — that is exactly the problem the pub/sub layer solves.

Pub/sub channel layout

The pub/sub broker (Redis Cluster, Kafka, or a purpose-built bus) holds one channel per active conversation. Channel names are deterministic strings like conv:{conversation_id}. When a user connects, the WebSocket server subscribes to the channels for every conversation that user belongs to. When the user disconnects, the server unsubscribes from channels where that user was the last local member.

// ws-server-4 subscriptions (per-process, managed as users connect/disconnect)
subscribed_channels: Set<channel_name> = {
  "conv:conv_9kZp",   // Alice is in this conversation
  "conv:conv_3xBm",   // Carol is in this conversation
  "conv:conv_7pWn",   // Both Alice and Carol are in this group
}

Monotonic sequence ID assignment

Every message in a conversation gets a server-assigned sequence number. The standard implementation uses a database row with an atomic increment, or a Redis counter keyed by conversation ID:

-- Atomic sequence assignment (PostgreSQL)
INSERT INTO messages (conversation_id, sender_id, text, idempotency_key, seq)
VALUES (
  'conv_9kZp',
  'usr_alice',
  'Are you free tomorrow?',
  'c7f1a2b3-4d56-...',
  nextval('conv_9kZp_seq')        -- atomically incremented sequence per conversation
)
ON CONFLICT (conversation_id, idempotency_key) DO NOTHING  -- idempotency guard
RETURNING seq, message_id;

-- The nextval() call is atomic: even if two messages arrive simultaneously
-- on different servers, they get distinct, ordered sequence numbers.

The sequence counter never resets and never has gaps under normal operation. A gap in the sequence seen by a client (jumping from 1040 to 1043) means messages 1041 and 1042 arrived but were not delivered — the client must fetch them via the REST cursor endpoint.

Worked cross-server delivery trace

Alice (on ws-server-4) sends a message to Bob (on ws-server-11) in conversation conv_9kZp. Full step-by-step:

T+0ms Alice's device sends SEND frame over WebSocket to ws-server-4: { type:"SEND", idempotency_key:"c7f1a2b3...", conversation_id:"conv_9kZp", text:"Are you free tomorrow?", client_timestamp:1718908412341 } T+1ms ws-server-4 receives frame; checks dedup table for idempotency_key → not seen T+2ms ws-server-4 executes INSERT ... ON CONFLICT DO NOTHING → seq=1042 assigned T+2ms ws-server-4 publishes to broker channel "conv:conv_9kZp": { message_id:"msg_7wQr", seq:1042, sender_id:"usr_alice", text:"Are you free tomorrow?", ts:1718908412399 } T+3ms Pub/sub broker fans the event to all subscribers of "conv:conv_9kZp" → ws-server-4 (Alice's server), ws-server-11 (Bob's server) T+3ms ws-server-4 receives its own publish back; looks up connection_registry: Alice is the sender → skip (she already knows; ACK handles it separately) T+4ms ws-server-11 receives the broker event; looks up connection_registry: connection_registry.get("usr_bob") → socket fd=78 Pushes MESSAGE frame to Bob's socket T+4ms ws-server-4 sends ACK frame back to Alice: { type:"ACK", idempotency_key:"c7f1a2b3...", message_id:"msg_7wQr", seq:1042 } T+5ms Bob's device receives MESSAGE frame and renders the message End-to-end delivery: ~5 ms on the same region. Cross-region: +20–60 ms for broker hop.

Offline store-and-forward flow

If Bob is offline, ws-server-11 has no entry in its connection_registry for Bob. The message is already durably stored (the INSERT at T+2ms). The delivery service checks presence state and, finding Bob offline, dispatches a push notification via APNs/FCM with a lightweight payload (just the conversation_id and a badge count). When Bob's device wakes, the client reconnects over WebSocket and fetches missed messages:

// Client reconnect: fetch all messages after last known seq
GET /v1/conversations/conv_9kZp/messages?cursor=1038&limit=50
→ returns messages seq 1039, 1040, 1041, 1042 — the four missed messages

Operating & debugging it

Real-time systems fail in ways that are invisible to end users until messages stop arriving. These are the key production signals and how to interpret them.

Key metrics to monitor

MetricWhere to observeAlert threshold (example)
Active WebSocket connections per hostApp metrics / Prometheus; OS ss -s>45,000/host → approaching fd limit; scale out
Message delivery latency (p99)Timestamp delta: ts_delivered - ts_server_accepted>200 ms p99 → broker lag or receiver overloaded
Pub/sub broker consumer lagRedis INFO replication; Kafka consumer group lag>10,000 pending events → consumers falling behind
Reconnect storm rateWebSocket upgrade request rate spikeSudden 10× spike → probable server restart; ensure exponential backoff on client
Sequence gap detectionsClient-side instrumentation (sequence number jumps logged)Any persistent gaps → broker delivery missed; check idempotency table
Push notification delivery failuresAPNs/FCM error rates>1% token-invalid errors → stale device tokens; prune the token table

Inspecting the delivery path

# Check active WebSocket connections on a server host $ ss -s | grep -i websocket # Or count open sockets (each WS connection = one TCP socket): $ ss -tn state established | wc -l 42318 # Close to 50,000 limit → provision additional hosts # Verify a message was written to the store $ psql -c "SELECT seq, sender_id, text, ts FROM messages WHERE conversation_id='conv_9kZp' ORDER BY seq DESC LIMIT 5;" seq | sender_id | text | ts ------+------------+----------------------------+-------------------- 1042 | usr_alice | Are you free tomorrow? | 2024-06-20 17:20:12 1041 | usr_bob | Let me check | 2024-06-20 17:18:44 # Check Redis pub/sub channel subscriber count $ redis-cli PUBSUB NUMSUB conv:conv_9kZp conv:conv_9kZp 3 # 3 WS servers subscribed → correct for a 2-person conversation if they're on different servers # 0 → no server has a member connected; offline delivery path only # Check idempotency key dedup table for a specific key $ psql -c "SELECT message_id, seq FROM message_idempotency WHERE key='c7f1a2b3-4d56-...';" message_id | seq ------------+------ msg_7wQr | 1042 # If this row exists, a retry with the same key will return this row without a second INSERT
SymptomLikely causeFix
Messages delivered out of order on recipient's screenClient sorting by client_timestamp instead of seqSort message list by seq field, not ts or local timestamp
Duplicate messages appearing after reconnectClient retrying with a new UUID instead of the original idempotency keyPersist the idempotency key in client-side storage before first send; reuse it on every retry for that message
Message delivered to sender but not recipientRecipient's server not subscribed to the conversation channel (connection registry stale)Verify subscription on connect; add re-subscription on broker reconnect
Messages lost after broker restartRedis Pub/Sub is fire-and-forget — in-flight events during restart are droppedFor durability, use Kafka with consumer group offsets, or Redis Streams with XREAD and persisted offsets
Offline user never receives push notificationDevice token expired; presence service shows user as online when they are not (stale TTL)Prune invalid APNs/FCM tokens on delivery failure callbacks; shorten presence TTL (e.g. 30 s) so stale presence expires quickly
Thundering herd on server restartAll clients reconnect simultaneously without backoffImplement exponential backoff with jitter on client reconnect: delay = min(base × 2^attempt + random(0,1000ms), 30s)
⚠️ Gotcha: Redis Pub/Sub drops messages silently during network partitions

Redis Pub/Sub is at-most-once: if a subscriber is momentarily disconnected from the broker (network hiccup, Redis failover, broker restart), any messages published during that window are silently lost. For a chat system where every message must be delivered, this is unacceptable. The fix is to use Redis Streams (XADD / XREAD with consumer group offsets) or Kafka, both of which maintain a persisted log of events that consumers can replay after reconnection. The cost is a small increase in complexity and latency, but guaranteed at-least-once delivery. Combine with idempotency keys on the receiver side to collapse any duplicates produced by re-delivery.

🧠 Quick check

1. Why would polling (the client asking "any new messages?" every second) be unacceptable as the real-time delivery mechanism for a 500 M DAU messenger?

500 M users polling once per second generates 500 M requests per second — the vast majority of which return nothing. The delivery lag is also up to the poll interval (1 s). A WebSocket opens once and lets the server push the instant a message arrives, at a tiny fraction of the request overhead.

2. Alice is connected to chat-server-1 and Bob is connected to chat-server-2. Alice sends Bob a message. What mechanism routes the message from chat-server-1 to Bob's connection on chat-server-2?

Direct server-to-server TCP connections create an O(N²) mesh that breaks as the fleet grows. The pub/sub broker is the single routing backbone: every server publishes messages it receives and subscribes to channels for conversations whose members are connected to it. The broker fans the message to all relevant servers regardless of fleet size.

3. A client sends a message and the WebSocket drops before the server's ACK arrives. The client reconnects and sends the same message again. How does the server ensure the message is stored exactly once?

Content-based deduplication fails for identical messages sent twice intentionally ("yes" / "yes"). Time-based deduplication is a heuristic that can still produce duplicates. The idempotency key is the only correct approach: a UUID generated by the client before the first send, reused on every retry, that the server can look up in O(1) to collapse all retries into a single canonical record.

✍️ Exercise: design group chat fan-out at scale

Problem. A group conversation has 500 members. One member sends a message. Describe the complete delivery path, identify the read amplification problem, and explain the fan-out-on-write vs fan-out-on-read trade-off for this use case. What's your recommendation and why?

Context to consider: The 500 members may be spread across many WebSocket servers. Not all of them are online. The message must be delivered in real time to online members and stored for offline members.

Model answer.

The fan-out path: The sender's WS server (1) writes the message to the durable store with a new sequence number; (2) publishes it to the group's pub/sub channel. Every WS server that has at least one of the 500 members connected receives the message via the broker and pushes it to those members' sockets. For offline members, the message is already in the store — the push notification triggers reconnect + REST fetch.

The read amplification problem: One write to the message store needs to result in up to 500 deliveries. If the group has 500 members and all are online spread across 50 servers, the broker must fan out to all 50 servers. This is fan-out on write: the work happens at write time, proportional to the number of active members. For a group of 500 this is fine. For a group of 50,000 (say, a public channel), publishing one message triggers 50,000 delivery events — the write amplification is now O(N) in group size.

Fan-out on read alternative: Instead of pushing to all members' servers, store the message once and let each client fetch it via REST when they open the conversation. This reduces write amplification to O(1) but increases read load — every member reconnecting fetches separately. For large groups (10,000+ members) with sporadic connection patterns, fan-out on read can actually be more efficient: the message is read only when someone actually looks, not preemptively pushed to all 10,000.

Recommendation: Use hybrid fan-out with a membership threshold. For groups under ~500 members, use fan-out on write (pub/sub push to all connected members) — low enough amplification that push latency is justified. For groups above 500 members (or channels with very high churn), switch to fan-out on read: store the message, send a lightweight notification event ("conversation_id has a new message, seq 1042"), and let clients fetch the content via REST. This is the same approach Instagram uses for direct messages vs public posts.

Rubric:

Four or more = strong answer. The threshold number matters less than the reasoning behind it.

Key takeaways

Sources & further reading