API Design

Production at Scale · Walkthrough 04

Real-time chat fan-out at millions of connections (modeled)

Every open chat window is a persistent socket. When millions of users are connected simultaneously, the central problem is no longer throughput — it is connection state: where each user's socket lives, how many sockets one server can hold, and how a message crosses the boundary between servers to reach the right socket.

⏱ 22 min advanced Prereq: WebSockets, Pub/sub

By the end you'll be able to

The core constraint: connection state is per-server memory

HTTP is stateless: any server in the fleet can answer any request, because no state is needed between calls. WebSocket connections break that assumption. Each persistent socket is a live piece of OS state — a file descriptor (fd), kernel send/receive buffers, and application-layer session data — that lives on exactly one server. You cannot route a message to a user unless you know which server holds their socket.

This is fundamentally different from scaling a REST API, where you add servers and distribute load freely. Scaling chat means tracking connection location and routing messages across server boundaries — the engineering cost that HTTP never had.

Per-connection cost: memory and file descriptors

Each WebSocket connection consumes three types of resource on the server:

ResourceCost per connection (illustrative)Notes
File descriptor (fd)1 fdOS limit: ulimit -n; default 1024–65536; tunable to ~1 M per process
Kernel TCP buffers (send + receive)~4–8 KBConfigurable via SO_SNDBUF / SO_RCVBUF; default Linux 4 KB each
Application-layer session state~2–10 KBUser id, auth token, room membership, read cursor, heartbeat timer
Total (conservative estimate)~10–20 KBHighly implementation-dependent; tuned systems report 10–100 KB range
⚠️ Modeled, not measured — per-connection cost varies widely

Published figures for WebSocket memory per connection range from under 10 KB (minimal server, small buffers) to over 100 KB (large application state, untuned buffers). The numbers here are illustrative midpoints for sizing purposes. Always benchmark your own stack at load before provisioning hardware.

How many connection servers? The arithmetic

Given the per-connection cost, the number of connection servers needed for a target concurrent connection count follows directly from available memory per server and the fd limit.

⚠️ Modeled, not measured — illustrative sizing

# Assumptions
mem_per_conn_kb       = 20           # ~20 KB total per connection (conservative)
server_ram_gb         = 32           # 32 GB RAM server, dedicate 28 GB to connections
usable_ram_kb         = 28 × 1_024 × 1_024 = 29_360_128 KB

# Connections per server (memory-limited)
conn_per_server_mem   = 29_360_128 / 201_468_006  # ~1.4 M

# File descriptor limit (must also raise ulimit)
fd_limit              = 1_000_000   # tunable; some systems reach 10 M with epoll
conn_per_server_fd    = 1_000_000   # fd-limited at ~1 M per process

# Practical ceiling: min of memory and fd limits
conn_per_server       = min(1_468_006, 1_000_000) ≈ 1_000_000  # ~1 M

# Number of servers for 10 M concurrent connections
target_concurrent     = 10_000_000
servers_needed        = 10_000_000 / 1_000_000 = 10  # ~10 servers
# With 2× redundancy for failover: ~20 servers

# For 100 M concurrent connections
servers_100M          = 100_000_000 / 1_000_000 = 100  # ~100 servers (+ redundancy)

The result — roughly 10–100 connection servers for 10–100 million concurrent users — is a surprisingly small fleet. The challenge is not raw server count; it is the routing problem: how does a message originating on server 3 reach a recipient whose socket lives on server 17?

🎯 Interview angle

"How would you design a chat system for 10 million concurrent users?" is a classic system design question. Anchor your answer with the connection-server arithmetic: ~20 KB per connection → ~1 M connections per tuned server → ~10 servers. Then immediately pivot to the routing problem: this is where most interviewers probe depth. The routing problem — not the server count — is the hard part.

The routing problem: crossing server boundaries

User A (on server 3) sends a message to user B (on server 17). Server 3 receives the WebSocket frame, but B's socket is not on server 3. How does the message get there?

The naive approach — server 3 directly calls server 17 over HTTP — requires every connection server to know about every other connection server, and to maintain a routing table of which user is on which server. With 10–100 servers this is manageable, but the table must be kept consistent as users connect, disconnect, and reconnect. A centralised pub/sub bus solves this more cleanly.

Connection Server 1 User A ←→ User C ←→ User E ←→ subscribed to room:42 Pub/Sub Bus (e.g. Redis Pub/Sub or Kafka topic) channel: room:42 Connection Server 2 ←→ User B ←→ User D ←→ User F subscribed to room:42 publish deliver ① User A sends message in room:42 ② Server 1 publishes to channel room:42 on the bus ③ Bus fans out to all servers subscribed to room:42 ④ Server 2 receives it, writes frame to User B's socket
A pub/sub bus decouples sender from receiver. Server 1 publishes to room:42; every server subscribed to that channel (including Server 2) receives the message and delivers it to locally-connected members. No server needs to know which server holds each user's socket.

Pub/sub bus design: channels, subscriptions, and delivery guarantees

Each connection server subscribes to one channel per room (or per user) that has an active member on it. When a message arrives on a channel, the bus delivers it to all subscribed servers; each server then fans the message out to the local sockets in that room.

Design dimensionCommon choiceTrade-off
Channel granularityOne channel per room (or per user for DMs)Coarser = fewer subscriptions; finer = more precise fan-out
Delivery semanticsAt-most-once (Redis Pub/Sub) or at-least-once (Kafka, NATS JetStream)At-most-once is simpler and faster; at-least-once needs idempotent handling
OrderingPer-channel monotonic (Kafka partition) or unordered (Redis P/S)Ordering needed for message sequence ids; harder with multi-partition Kafka
Bus throughputRedis: ~1 M msgs/sec per node; Kafka: tens of millions/sec with partitioningRedis simpler; Kafka durable + replayable
Presence awarenessStore user_id → server_id in shared store (Redis Hash) on connect/disconnectStale if server crashes; mitigate with TTL + heartbeat
✅ Subscribe per room, not per user

A server that subscribes to a channel per user must maintain one channel subscription for each locally-connected user — potentially hundreds of thousands of subscriptions per server. Subscribing per room (where the room has multiple members) collapses many user subscriptions into one: a server subscribes to a room once and delivers to all local members. The number of rooms with active local members is much smaller than the number of local users.

Presence and message ordering

Presence is the signal "user X is online." Tracking it requires a shared store updated on connect and disconnect. A Redis Hash keyed by user_id with a value of {server_id, last_heartbeat_ts} works well: the connection server writes on connect, deletes on clean disconnect, and refreshes the heartbeat every 30 seconds. A separate sweeper removes stale entries (heartbeat > 90 s old) to handle unclean disconnections.

Message ordering is harder. The pub/sub bus delivers messages to multiple servers and the order of delivery is not guaranteed to match insertion order. Each message should carry a monotonically increasing sequence id scoped to the room, generated at the point of acceptance (before publishing to the bus). The server that receives the send request assigns the id atomically (e.g., Redis INCR on a per-room counter). Clients order their local message list by sequence id, not by arrival time. Gaps in the sequence trigger a client-side "fetch missed messages" request against the message store.

# Message acceptance flow (server that receives the client's send)
def accept_message(room_id, sender_id, body):
    seq_id = redis.incr(f"seq:{room_id}")         # atomic, monotonic
    msg = {
        "id":      generate_uuid(),
        "seq":     seq_id,
        "room":    room_id,
        "sender":  sender_id,
        "body":    body,
        "ts":      utcnow_ms(),
    }
    db.insert("messages", msg)                  # durable store first
    redis.publish(f"room:{room_id}", json(msg)) # then fan-out
    return {"seq": seq_id, "status": "sent"}

# Client side: detect and fill gaps
if received_msg.seq > local_last_seq + 1:
    fetch_missed(room_id, since_seq=local_last_seq)  # backfill from REST endpoint

Worked: group message fan-out for a 500-member room

A message is sent in a room with 500 members. Derive the full fan-out cost from message acceptance to final socket write.

Sender 1 msg Server A assigns seq_id publishes to bus 1 publish Pub/Sub Bus channel: room:42 10 subscribed servers 10 deliveries Srv B ~50 members local Srv C ~50 members local Srv D… ~50 members local sockets sockets sockets 1 inbound → 1 bus publish → 10 server deliveries → 499 socket writes Fan-out ratio: 499× for a 500-member group
A single inbound message fans out: one publish to the bus, delivery to each connection server subscribed to the room, then local socket writes to each member connected to that server. The fan-out ratio equals (group_size − 1).
⚠️ Modeled, not measured — group fan-out arithmetic

# Setup
group_size           = 500      # members in the room
connection_servers   = 10       # total connection servers
members_per_server   = 500 / 10 = 50   # assume even distribution

# Step 1: one inbound WebSocket frame from sender
inbound_frames       = 1

# Step 2: accepting server publishes to bus (1 write)
bus_publishes        = 1

# Step 3: bus delivers to all subscribed servers
#          (all 10 servers subscribe to room:42)
bus_deliveries       = 10

# Step 4: each server writes to its local room members' sockets
#          server A: 50 members - 1 sender = 49 writes
#          servers B–J: 50 writes each
socket_writes_total  = 49 + (9 × 50) = 499   # = group_size - 1

# Fan-out ratio
fan_out_ratio        = 499 / 1 = 499×

# Throughput: 1,000 messages/sec to a 500-person group
# requires: 1,000 × 499 = 499,000 socket writes/sec across the fleet
fleet_socket_writes  = 1_000 × 499 = 499_000 writes/sec
per_server           = 499_000 / 10 = 49_900 writes/sec   # very achievable

# Large group (5,000 members): same message rate costs 5,000× more
large_group_writes   = 1_000 × 4_999 = 4_999_000 writes/sec
per_server_large     = 4_999_000 / 10 = 499_900 writes/sec   # heavier
⚠️ Fan-out scales linearly with group size

A group of 5,000 members produces 10× more socket writes per message than a group of 500. Platforms that allow very large groups (communities of 50,000+) must either cap real-time fan-out (only deliver to online members in the same shard) or use a different delivery model — for example, treating large-group messages more like a social feed (fan-out-on-read) rather than real-time push. This is why many chat platforms impose member limits on real-time channels, or degrade to eventual consistency for large rooms.

Offline delivery and message persistence

A user who is offline when a message is sent has no socket — there is nothing to write to. The message must be stored durably and delivered when the user reconnects. This is architecturally separate from the real-time path:

  1. Write to durable message store before publishing to the bus (as shown in the pseudocode above). The store (e.g. Cassandra, PostgreSQL, or a purpose-built message store) is the source of truth.
  2. Record an unread indicator per (user, room) in a fast counter store — a Redis Hash or sorted set keyed by user id is common. This tells the client how many missed messages to fetch on reconnect.
  3. On reconnect: the client sends its last-seen sequence id; the server fetches messages with seq > last_seen from the durable store and delivers them before joining the live pub/sub stream.
✅ Dual-write order matters: store first, then publish

Always write to the durable store before publishing to the pub/sub bus. If you publish first and the DB write fails, real-time recipients see a message that is not in the store — it cannot be fetched on reconnect and effectively disappears. The reverse failure (DB write succeeds, bus publish fails) means the recipient doesn't see it in real-time but can fetch it on reconnect — a much better failure mode.

File descriptor and OS tuning: the constraints in detail

Each TCP connection (including WebSockets) consumes one file descriptor on the server. The Linux defaults are conservative and must be raised for connection servers holding hundreds of thousands of sockets. See the sockets fundamentals lesson for the underlying OS model.

# Check current fd limit (soft and hard) $ ulimit -Sn 1024 $ ulimit -Hn 1048576 # Soft limit = default for new processes; hard limit = ceiling # Raise the soft limit for the connection server process: $ ulimit -n 1000000 # Or persistently via /etc/security/limits.conf: # chatserver soft nofile 1000000 # chatserver hard nofile 1000000 # Kernel-level total fd limit (all processes combined) $ cat /proc/sys/fs/file-max 9223372036854775807 # On modern Linux this is effectively unlimited; it's per-process ulimit that matters # Check ephemeral port range (matters if server also makes outbound connections) $ cat /proc/sys/net/ipv4/ip_local_port_range 32768 60999 # ~28 K ephemeral ports; fine for inbound sockets, may constrain outbound connections

Beyond file descriptors, the other OS resource that constrains connection count is TCP buffer memory. Each socket has kernel send and receive buffers; their defaults are set by net.core.rmem_default and net.core.wmem_default. Reducing buffers (e.g. to 4 KB each) doubles the connection density at the cost of reduced burst throughput per connection — a good trade for chat, where individual messages are small.

🧠 Quick check

1. At approximately 20 KB per connection, how many concurrent WebSocket connections can a single 32 GB RAM server hold (assuming 28 GB available for connections)?

28 GB = 28 × 1,024 × 1,024 KB ≈ 29,360,128 KB. Divided by 20 KB/connection ≈ 1,468,006. Rounding conservatively and accounting for the fd limit of ~1 M per process, a well-tuned server holds approximately 1–1.4 M connections. The fd limit usually becomes the binding constraint first.

2. Why does a chat system need a pub/sub bus between connection servers?

The pub/sub bus solves the cross-server routing problem. Without it, the accepting server would need to query "which server holds user B's socket?" and make a direct call. The bus decouples this: publish to a room channel, every server subscribed to that channel delivers locally. No global user-to-server lookup is needed at delivery time.

3. In the worked fan-out example (500 members, 10 servers), how many total socket writes does one message require?

Fan-out ratio = group_size − 1 = 499. The bus delivers to all 10 servers, but the actual socket writes total 499: 49 on the sender's server (50 local members minus the sender) and 50 each on the other 9 servers. Excluding the sender from their own delivery is standard — the client already shows the message optimistically on send.

4. For a message that must be delivered to offline users on reconnect, the correct order of operations is:

Durable store write must precede the bus publish. If the bus publish succeeds but the DB write fails, real-time recipients see a message that no longer exists — it cannot be fetched by offline users on reconnect. Failures in the other order (DB write succeeds, bus publish fails) are recoverable: the message exists in the store and will be delivered via the reconnect backfill path.

✍️ Exercise: size a chat system for 5 M concurrent connections

You are designing the connection layer for a chat platform targeting 5 million simultaneous WebSocket connections. (a) How many connection servers do you need, assuming 20 KB per connection, 28 GB usable RAM per server, and a fd limit of 1 M per process? (b) How many pub/sub bus nodes do you need if each Redis Pub/Sub node handles 100,000 subscriptions and your rooms average 50 members (therefore each connection server subscribes to approximately 10,000 active rooms)? (c) What happens to your sizing if average group size grows from 50 to 500?

Model answer

(a) Connection servers: Memory capacity: 28 GB × 1,024 KB/MB × 1,024 MB/GB ÷ 20 KB ≈ 1.47 M connections per server. FD limit: 1 M per process. Binding constraint: fd limit at 1 M. Servers needed: 5 M ÷ 1 M = 5 servers. With 2× redundancy for failover: 10 servers.

(b) Bus nodes: Each connection server subscribes to ~10,000 active rooms. With 5 connection servers: 5 × 10,000 = 50,000 total channel subscriptions on the bus. At 100,000 subscriptions per Redis node: 50,000 ÷ 100,000 = 1 node is sufficient, with 1 replica for availability. (In practice, also consider message throughput; a busy room may saturate a single bus node before hitting subscription limits.)

(c) Group size growth: Routing the message still requires 1 bus publish and 1 delivery per subscribed server — that part is unchanged. But socket writes grow proportionally: from 49 per server to 499 per server for a 10-server fleet. At 1,000 messages/sec to a 500-member group, each server now handles ~50,000 socket writes/sec instead of ~5,000 — a 10× increase in socket write load. This may require either adding connection servers (reducing members per server) or capping real-time delivery for very large groups, degrading to a feed-style eventual delivery.

Rubric: correct binding constraint identified (fd, not RAM) — 1 pt; correct server count 5 (or 10 with redundancy) — 1 pt; bus sizing reasoning shown — 1 pt; group size impact on socket writes correctly explained — 2 pts.

Key takeaways

Sources & further reading