Design Case Studies · Lesson 08
Design: Chat/Messenger API
A chat system feels simple — two people typing to each other. But behind that interface sits one of the hardest distributed-systems problems in consumer software: routing a message in milliseconds from one connection on server A to another connection on server B, reliably, at hundreds of millions of simultaneous connections, without losing a word.
By the end you'll be able to
- Explain why WebSockets are the right transport for chat and REST is the right transport for history — and use them together.
- Describe how cross-server fan-out works using a pub/sub broker, and why clock-based message ordering fails at scale.
- Design the store-and-forward path for offline users and size a rough connection budget for a 500 M DAU product.
Requirements
Before designing anything, pin what the system must do. A messenger is not a single use case — it's a cluster of related problems that pull in different directions:
- 1:1 and group chat. Two people exchanging messages privately, and groups up to thousands of participants. Group chat is harder because one outgoing message becomes N incoming deliveries.
- Delivery receipts and read receipts. The sender must know whether their message reached the recipient's device (delivered) and whether the recipient opened it (read). Both require the server to track state and push updates back to the sender.
- Presence. Online/offline status and typing indicators. Typing indicators in particular are extremely high-frequency — the server must fan them out in near-real-time without storing them permanently.
- Message ordering. Within a single conversation, messages must appear in the order they were sent, globally agreed upon. Two clients sending simultaneously must converge on the same sequence.
- Message history with pagination. A user who opens a conversation must be able to scroll back through thousands of messages efficiently, without fetching the entire log.
- Real-time delivery. The target latency for delivering a message from sender to recipient is under 100 ms end-to-end on a good network.
- Scale. The system must sustain hundreds of millions of daily active users, with tens of millions of simultaneous open connections.
Design decisions
Transport: WebSockets for real-time, REST for history
Chat is the canonical use case for WebSockets. The server must push messages to clients without waiting to be asked; polling at any useful frequency would be prohibitively expensive at scale. A WebSocket gives each connected client a persistent, bidirectional channel: the client can send a message to the server at any time, and the server can push any message back, all over a single TCP connection opened once per session.
But not everything needs a WebSocket. Message history — fetching a past conversation — is a classic request/response: the client asks, the server answers with a paginated list, and there's nothing to push. Using the WebSocket channel for history fetches would mean inventing a query/response protocol on top of it, which is just reinventing REST. Keep the right tool for each job: WebSocket for the live stream, REST for the archive.
Message ordering: monotonic server-assigned sequence IDs
The natural instinct is to timestamp messages at send time. The problem is that client clocks lie. A phone that drifted 500 ms behind, an NTP correction that jumps the clock forward — these cause messages to appear out of order on the receiving side. Two clients sending simultaneously might both stamp the same millisecond.
The correct approach is to have the server assign a monotonically increasing sequence number to every message within a conversation at the moment it is accepted. The sequence number is the canonical ordering. Clients display messages in sequence-number order, not timestamp order (though they can show timestamps as a human-readable label). This is how iMessage, Signal, and WhatsApp handle it.
Cross-server fan-out: pub/sub per-conversation channel
With millions of connections, no single machine can hold them all. You need a fleet of WebSocket servers. The moment you have more than one server, you have a routing problem: Alice's connection might be on ws-server-7 and Bob's on ws-server-3. When Alice sends a message, ws-server-7 receives it — but it has no direct socket to Bob. It needs a way to hand the message to ws-server-3.
The solution is a pub/sub broker (Redis Pub/Sub, Kafka, or a purpose-built message bus like the one WhatsApp built). Each WebSocket server subscribes to a channel named after each conversation whose members are connected to it. When a message arrives on any server, that server publishes to the conversation's channel; all other servers subscribed to that channel receive it and push it to their connected members. The broker is the backbone that makes the WebSocket fleet behave like a single logical router.
Idempotent sends with client-generated idempotency keys
WebSocket connections drop. A client sends a message, the server receives it and writes it to storage, and then the TCP connection breaks before the ACK travels back. The client has no way to know whether the message arrived. Without a guard, a retry creates a duplicate.
The fix follows the same pattern as idempotency in REST APIs: the client generates a UUID (the idempotency key) before sending. It attaches this key to the message. The server stores the key alongside the message when it first persists it. If the server sees the same key again, it skips the write and returns the original sequence number. The client retries freely; duplicates are silently collapsed.
Store-and-forward for offline users
When the recipient is offline, there is no WebSocket connection to push to. The message still needs to reach them. The server writes the undelivered message to a durable message store keyed by recipient user ID and sends a push notification (APNs for iOS, FCM for Android) to wake the device. When the device comes back online, the client reconnects via WebSocket and immediately fetches any missed messages via the REST history endpoint using a cursor pointing to the last sequence number it received. This hybrid — push notification to trigger reconnect, REST pull to retrieve missed messages — is more reliable than trying to queue messages in the push notification payload itself (which has strict size limits).
Presence service
Presence (online/offline, typing indicators) is deliberately separated from the message-delivery path. Presence state is high-churn and low-durability: it doesn't matter if you miss a "user started typing" update, and presence state expires naturally when a connection closes. A dedicated presence service (typically backed by Redis with short TTLs) handles these updates independently so a presence storm (e.g., millions of users coming online after an app launch) does not interfere with message delivery.
The API model
WebSocket connection
Clients establish a persistent session by upgrading to WebSocket with a short-lived auth token:
# Upgrade request — client includes a short-lived auth token
GET wss://chat.example.com/v1/ws?token=eyJhbGci... HTTP/1.1
Upgrade: websocket
Connection: Upgrade
# Server accepts
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
WebSocket message types
All messages over the WebSocket channel are JSON frames with a type field:
// Client → Server: send a message
{
"type": "SEND",
"idempotency_key": "c7f1a2b3-4d56-...", // client-generated UUID
"conversation_id": "conv_9kZp",
"text": "Are you free tomorrow?",
"client_timestamp": 1718908412341 // advisory only, not used for ordering
}
// Server → Client: deliver a message to the recipient
{
"type": "MESSAGE",
"message_id": "msg_7wQr",
"conversation_id": "conv_9kZp",
"seq": 1042, // server-assigned monotonic sequence number
"sender_id": "usr_a1b2",
"text": "Are you free tomorrow?",
"ts": 1718908412399 // server-assigned timestamp (display only)
}
// Server → Client: acknowledge the sender's SEND frame
{
"type": "ACK",
"idempotency_key": "c7f1a2b3-4d56-...", // echoed back for matching
"message_id": "msg_7wQr",
"seq": 1042
}
// Client → Server: typing indicator (ephemeral, not stored)
{
"type": "TYPING",
"conversation_id": "conv_9kZp",
"state": "started" // or "stopped"
}
// Server → Client: presence update
{
"type": "PRESENCE",
"user_id": "usr_c3d4",
"status": "online", // "online" | "offline" | "typing"
"conversation_id": "conv_9kZp"
}
REST endpoints
All history and metadata retrieval goes over REST, not the WebSocket channel:
# Paginated message history — cursor is the last seq the client has
GET /v1/conversations/:id/messages?cursor=1041&limit=50
Authorization: Bearer <token>
200 OK
{
"messages": [
{ "message_id": "msg_7wQr", "seq": 1042, "sender_id": "usr_a1b2",
"text": "Are you free tomorrow?", "ts": 1718908412399 },
...
],
"next_cursor": 1091, // null if no more pages
"has_more": true
}
# List the authenticated user's active conversations
GET /v1/conversations
Authorization: Bearer <token>
200 OK
{
"conversations": [
{ "conversation_id": "conv_9kZp", "type": "direct",
"last_message_preview": "Are you free tomorrow?",
"unread_count": 3, "last_seq": 1042 },
...
]
}
# Fallback: send a message via REST (offline clients, server-to-server)
POST /v1/conversations/:id/messages
Authorization: Bearer <token>
Idempotency-Key: c7f1a2b3-4d56-...
{
"text": "Are you free tomorrow?"
}
201 Created
{ "message_id": "msg_7wQr", "seq": 1042 }
# Presence lookup
GET /v1/users/:id/presence
200 OK
{
"user_id": "usr_c3d4",
"status": "online",
"last_seen": 1718908400000
}
Evaluation & latency budget
Connection state and keepalives
A WebSocket connection is a stateful TCP socket. Idle connections that go quiet can be silently dropped by NAT devices, load balancers, and mobile network middleboxes without either endpoint noticing. The standard guard is a ping/pong heartbeat: the server sends a WebSocket ping frame every 30 seconds; if no pong arrives within a deadline, the connection is declared dead and cleaned up. Clients should implement exponential backoff on reconnect — if ws-server-12 is overloaded and refusing connections, a thundering herd of immediate reconnect attempts will make it worse.
The cross-server fan-out problem in detail
Consider a direct message from Alice to Bob. Alice's device is connected to ws-server-4; Bob's to ws-server-11. Without coordination, ws-server-4 receives Alice's SEND frame but has no socket for Bob. It cannot deliver the message alone.
The pub/sub layer solves this. Every WebSocket server subscribes to a channel for each conversation whose members are currently connected to it. When Alice's message arrives on ws-server-4:
- ws-server-4 writes the message to the durable message store and assigns sequence number 1042.
- ws-server-4 publishes the message to the
conv_9kZppub/sub channel. - The pub/sub broker fans the event to all subscribers — including ws-server-11.
- ws-server-11 sees it has Bob's connection, looks up Bob's socket, and pushes the MESSAGE frame.
- ws-server-4 sends an ACK back to Alice.
For group conversations, every server that holds at least one group member subscribes to the group's channel. A message to a 200-person group published once to the broker fans out to however many servers those 200 members are spread across — typically far fewer than 200 servers, since each server holds thousands of connections.
Ordering guarantees
Sequence numbers are assigned at the moment the message is written to the durable store (step 1 above). This creates a total ordering within a conversation: no two messages share a sequence number, and the order is the order the server accepted them — not the order clients sent them. Clients must sort their local message list by seq, not by local timestamp. Any gaps in the sequence seen by a client (e.g., jumping from seq 1040 to 1043 after a reconnect) indicate missed messages the client should fetch via the history endpoint.
Back-of-envelope: concurrent connections
How many open WebSocket connections must the fleet sustain?
# Sizing the WebSocket fleet
DAU = 500_000_000 # 500 million daily active users
concurrency_rate = 0.10 # 10% online at peak (conservative)
peak_connections = 50_000_000 # 50 million simultaneous WS connections
connections_per_host = 50_000 # tuned OS file-descriptor limit
ws_hosts_needed = 1_000 # 50M / 50K = 1,000 WebSocket servers
# Memory per open connection (kernel socket + app state)
mem_per_conn = 50 # KB (kernel buffers + session context)
total_conn_mem = 2_500 # GB across the fleet — i.e., ~2.5 GB per host
# Message throughput at a modest 2 msgs/user/min active
msgs_per_sec = 50_000_000 * 2 / 60 # ≈ 1.7 million messages/second peak
These numbers explain why WhatsApp ran on surprisingly few servers — each Erlang process handled tens of thousands of lightweight connections — and why Go and Rust have become attractive for WebSocket gateway services. The bottleneck is not CPU but file descriptors and memory per connection.
In a system design interview, the cross-server fan-out problem is the crux of messenger design. Interviewers want to see that you understand why a naive fleet of WebSocket servers doesn't work without a shared routing layer. The right answer: each WebSocket server subscribes to a pub/sub channel per active conversation. A message sent to one server is published to that channel; all other servers subscribed to that channel receive it and deliver it to their connected clients. Name the broker type you'd use (Redis Pub/Sub for small scale, Kafka partitioned by conversation ID for high throughput) and explain the trade-offs — Redis is simpler and lower-latency; Kafka gives replay and stronger durability. Also mention that each pub/sub channel is scoped to a conversation, not a user, so group messages naturally fan out to all servers hosting group members.
It feels natural to order messages by the timestamp the client attached at send time. The problem is that client clocks are untrustworthy. A phone that gained 800 ms from NTP drift, a user with a manually set clock, two devices sending at the exact same millisecond — all of these produce orderings that differ between clients and can reorder the conversation differently for Alice than for Bob. Wall-clock timestamps are fine as a human-readable display label. They must never be the canonical ordering mechanism. Use server-assigned, atomically-incremented sequence numbers per conversation instead. The sequence is the truth; the timestamp is decoration.
The client generates a UUID for the message before opening the SEND frame — not after receiving the ACK. It holds onto this UUID. If the WebSocket drops between send and ACK, the client reconnects and retransmits the exact same frame with the exact same UUID. The server checks its deduplication table: if the key exists, it returns the original message_id and seq without writing a second row. The client never needs to worry about whether its message was saved once or twice — the UUID collapses all retries into a single record. This is the same idempotency key pattern as Stripe's charge endpoint, applied to WebSocket frames.
Under the hood: the core delivery mechanism
The phrase "pub/sub fan-out" hides the exact data structures and message flows that make cross-server delivery work. Here is each component in precise detail.
Per-server connection registry
Each WebSocket server maintains an in-process hash map that associates user IDs to their open socket handles:
// In-memory registry on ws-server-4 (pseudocode)
connection_registry: Map<user_id: string, socket: WebSocket> = {
"usr_alice": <socket fd=41, addr=192.0.2.10:58234>,
"usr_carol": <socket fd=42, addr=198.51.100.7:49812>,
// ... tens of thousands of entries
}
// On WebSocket connect:
connection_registry.set(user_id, socket)
// On WebSocket disconnect (or ping timeout):
connection_registry.delete(user_id)
This registry is local to one server process. It cannot be used to deliver to a user on a different server — that is exactly the problem the pub/sub layer solves.
Pub/sub channel layout
The pub/sub broker (Redis Cluster, Kafka, or a purpose-built bus) holds one channel per active conversation. Channel names are deterministic strings like conv:{conversation_id}. When a user connects, the WebSocket server subscribes to the channels for every conversation that user belongs to. When the user disconnects, the server unsubscribes from channels where that user was the last local member.
// ws-server-4 subscriptions (per-process, managed as users connect/disconnect)
subscribed_channels: Set<channel_name> = {
"conv:conv_9kZp", // Alice is in this conversation
"conv:conv_3xBm", // Carol is in this conversation
"conv:conv_7pWn", // Both Alice and Carol are in this group
}
Monotonic sequence ID assignment
Every message in a conversation gets a server-assigned sequence number. The standard implementation uses a database row with an atomic increment, or a Redis counter keyed by conversation ID:
-- Atomic sequence assignment (PostgreSQL)
INSERT INTO messages (conversation_id, sender_id, text, idempotency_key, seq)
VALUES (
'conv_9kZp',
'usr_alice',
'Are you free tomorrow?',
'c7f1a2b3-4d56-...',
nextval('conv_9kZp_seq') -- atomically incremented sequence per conversation
)
ON CONFLICT (conversation_id, idempotency_key) DO NOTHING -- idempotency guard
RETURNING seq, message_id;
-- The nextval() call is atomic: even if two messages arrive simultaneously
-- on different servers, they get distinct, ordered sequence numbers.
The sequence counter never resets and never has gaps under normal operation. A gap in the sequence seen by a client (jumping from 1040 to 1043) means messages 1041 and 1042 arrived but were not delivered — the client must fetch them via the REST cursor endpoint.
Worked cross-server delivery trace
Alice (on ws-server-4) sends a message to Bob (on ws-server-11) in conversation conv_9kZp. Full step-by-step:
Offline store-and-forward flow
If Bob is offline, ws-server-11 has no entry in its connection_registry for Bob. The message is already durably stored (the INSERT at T+2ms). The delivery service checks presence state and, finding Bob offline, dispatches a push notification via APNs/FCM with a lightweight payload (just the conversation_id and a badge count). When Bob's device wakes, the client reconnects over WebSocket and fetches missed messages:
// Client reconnect: fetch all messages after last known seq
GET /v1/conversations/conv_9kZp/messages?cursor=1038&limit=50
→ returns messages seq 1039, 1040, 1041, 1042 — the four missed messages
Operating & debugging it
Real-time systems fail in ways that are invisible to end users until messages stop arriving. These are the key production signals and how to interpret them.
Key metrics to monitor
| Metric | Where to observe | Alert threshold (example) |
|---|---|---|
| Active WebSocket connections per host | App metrics / Prometheus; OS ss -s | >45,000/host → approaching fd limit; scale out |
| Message delivery latency (p99) | Timestamp delta: ts_delivered - ts_server_accepted | >200 ms p99 → broker lag or receiver overloaded |
| Pub/sub broker consumer lag | Redis INFO replication; Kafka consumer group lag | >10,000 pending events → consumers falling behind |
| Reconnect storm rate | WebSocket upgrade request rate spike | Sudden 10× spike → probable server restart; ensure exponential backoff on client |
| Sequence gap detections | Client-side instrumentation (sequence number jumps logged) | Any persistent gaps → broker delivery missed; check idempotency table |
| Push notification delivery failures | APNs/FCM error rates | >1% token-invalid errors → stale device tokens; prune the token table |
Inspecting the delivery path
| Symptom | Likely cause | Fix |
|---|---|---|
| Messages delivered out of order on recipient's screen | Client sorting by client_timestamp instead of seq | Sort message list by seq field, not ts or local timestamp |
| Duplicate messages appearing after reconnect | Client retrying with a new UUID instead of the original idempotency key | Persist the idempotency key in client-side storage before first send; reuse it on every retry for that message |
| Message delivered to sender but not recipient | Recipient's server not subscribed to the conversation channel (connection registry stale) | Verify subscription on connect; add re-subscription on broker reconnect |
| Messages lost after broker restart | Redis Pub/Sub is fire-and-forget — in-flight events during restart are dropped | For durability, use Kafka with consumer group offsets, or Redis Streams with XREAD and persisted offsets |
| Offline user never receives push notification | Device token expired; presence service shows user as online when they are not (stale TTL) | Prune invalid APNs/FCM tokens on delivery failure callbacks; shorten presence TTL (e.g. 30 s) so stale presence expires quickly |
| Thundering herd on server restart | All clients reconnect simultaneously without backoff | Implement exponential backoff with jitter on client reconnect: delay = min(base × 2^attempt + random(0,1000ms), 30s) |
Redis Pub/Sub is at-most-once: if a subscriber is momentarily disconnected from the broker (network hiccup, Redis failover, broker restart), any messages published during that window are silently lost. For a chat system where every message must be delivered, this is unacceptable. The fix is to use Redis Streams (XADD / XREAD with consumer group offsets) or Kafka, both of which maintain a persisted log of events that consumers can replay after reconnection. The cost is a small increase in complexity and latency, but guaranteed at-least-once delivery. Combine with idempotency keys on the receiver side to collapse any duplicates produced by re-delivery.
🧠 Quick check
1. Why would polling (the client asking "any new messages?" every second) be unacceptable as the real-time delivery mechanism for a 500 M DAU messenger?
500 M users polling once per second generates 500 M requests per second — the vast majority of which return nothing. The delivery lag is also up to the poll interval (1 s). A WebSocket opens once and lets the server push the instant a message arrives, at a tiny fraction of the request overhead.
2. Alice is connected to chat-server-1 and Bob is connected to chat-server-2. Alice sends Bob a message. What mechanism routes the message from chat-server-1 to Bob's connection on chat-server-2?
Direct server-to-server TCP connections create an O(N²) mesh that breaks as the fleet grows. The pub/sub broker is the single routing backbone: every server publishes messages it receives and subscribes to channels for conversations whose members are connected to it. The broker fans the message to all relevant servers regardless of fleet size.
3. A client sends a message and the WebSocket drops before the server's ACK arrives. The client reconnects and sends the same message again. How does the server ensure the message is stored exactly once?
Content-based deduplication fails for identical messages sent twice intentionally ("yes" / "yes"). Time-based deduplication is a heuristic that can still produce duplicates. The idempotency key is the only correct approach: a UUID generated by the client before the first send, reused on every retry, that the server can look up in O(1) to collapse all retries into a single canonical record.
✍️ Exercise: design group chat fan-out at scale
Problem. A group conversation has 500 members. One member sends a message. Describe the complete delivery path, identify the read amplification problem, and explain the fan-out-on-write vs fan-out-on-read trade-off for this use case. What's your recommendation and why?
Context to consider: The 500 members may be spread across many WebSocket servers. Not all of them are online. The message must be delivered in real time to online members and stored for offline members.
Model answer.
The fan-out path: The sender's WS server (1) writes the message to the durable store with a new sequence number; (2) publishes it to the group's pub/sub channel. Every WS server that has at least one of the 500 members connected receives the message via the broker and pushes it to those members' sockets. For offline members, the message is already in the store — the push notification triggers reconnect + REST fetch.
The read amplification problem: One write to the message store needs to result in up to 500 deliveries. If the group has 500 members and all are online spread across 50 servers, the broker must fan out to all 50 servers. This is fan-out on write: the work happens at write time, proportional to the number of active members. For a group of 500 this is fine. For a group of 50,000 (say, a public channel), publishing one message triggers 50,000 delivery events — the write amplification is now O(N) in group size.
Fan-out on read alternative: Instead of pushing to all members' servers, store the message once and let each client fetch it via REST when they open the conversation. This reduces write amplification to O(1) but increases read load — every member reconnecting fetches separately. For large groups (10,000+ members) with sporadic connection patterns, fan-out on read can actually be more efficient: the message is read only when someone actually looks, not preemptively pushed to all 10,000.
Recommendation: Use hybrid fan-out with a membership threshold. For groups under ~500 members, use fan-out on write (pub/sub push to all connected members) — low enough amplification that push latency is justified. For groups above 500 members (or channels with very high churn), switch to fan-out on read: store the message, send a lightweight notification event ("conversation_id has a new message, seq 1042"), and let clients fetch the content via REST. This is the same approach Instagram uses for direct messages vs public posts.
Rubric:
- Correctly describes the pub/sub fan-out path from write to all connected servers.
- Identifies that write amplification grows O(N) with group size.
- Explains fan-out on read as the alternative and its trade-offs (less write work, more read work).
- Proposes a threshold or hybrid strategy rather than a single answer.
- Mentions that offline members are handled by the store regardless of which strategy is chosen for online members.
Four or more = strong answer. The threshold number matters less than the reasoning behind it.
Key takeaways
- WebSockets for live delivery, REST for history. Use each transport where it fits; don't build a query protocol on top of WebSocket frames.
- Server-assigned monotonic sequence numbers are the only reliable ordering mechanism — client timestamps lie because client clocks drift.
- Cross-server fan-out requires a pub/sub broker. Each WebSocket server subscribes to channels for the conversations whose members are connected to it; messages are published once and delivered to every subscribed server.
- Idempotency keys, generated by the client before the first send and reused on every retry, prevent duplicate messages when connections drop mid-send.
- Store-and-forward handles offline users: write to durable storage, send a push notification to wake the device, let the client fetch missed messages via REST cursor-pagination on reconnect.
- At 500 M DAU with 10% concurrency, the fleet needs approximately 1,000 WebSocket servers at 50,000 connections each — connection count and file descriptor limits, not CPU, are the binding constraint.
Sources & further reading
- RFC 6455 — The WebSocket Protocol · The full IETF specification covering the upgrade handshake, frame format, and protocol details.
- Signal Protocol documentation · How Signal handles end-to-end encryption, sealed sender, and asynchronous key distribution — the gold standard for private messaging.
- Building Facebook Messenger (Meta Engineering, 2011) · The original architecture post describing Erlang-based connection handling, XMPP, and the fan-out design that underpinned the first 500 M user deployment.
- Slack Engineering — Real-time messaging · How Slack's WebSocket infrastructure handles connection lifecycle, reconnection, and message ordering across distributed servers.
- Discord — How Discord stores billions of messages · The journey from MongoDB to Cassandra for message storage, covering partition strategies, read/write patterns, and the operational challenges of a 50 M+ DAU messenger.