API Design

Reliability & Scale · Lesson 08

Load Balancing

A single server has a ceiling. A load balancer turns many ordinary servers into one logical surface that can absorb traffic spikes, survive instance failures, and scale horizontally without clients ever knowing the difference.

⏱ 12 min Difficulty: core Prereq: Caching, API Gateway

By the end you'll be able to

Why one server is never enough

Picture a single toll booth on a highway. It works fine on a Tuesday morning. On a holiday weekend, a queue stretches for kilometres — not because the booth broke, but because one lane is a hard ceiling. The fix is to open ten booths and put a traffic officer in front who reads licence plates and waves cars to whichever lane is shortest.

That traffic officer is a load balancer. It sits in front of a group of identical server instances (the booths) and decides which one should handle each incoming request. It has three jobs: spread load evenly, steer traffic away from unhealthy instances, and do both without the caller noticing.

Clients Load Balancer Server A Server B Server C healthy healthy unhealthy skipped
The load balancer fans requests to healthy instances only; Server C is removed from rotation while its health check fails.

The four distribution algorithms

The balancer needs a rule for picking which server gets the next request. Four algorithms cover the vast majority of real deployments:

AlgorithmHow it picksBest for
Round-robin Cycles A → B → C → A → … Homogeneous instances, similar request cost
Least connections Sends to the server with the fewest open connections right now Long-lived connections (WebSockets, streaming)
Weighted Assigns more traffic to higher-capacity servers proportionally Mixed instance sizes; rolling upgrades
Consistent hashing Hashes a key (IP, user ID, session token) to a point on a ring; the next server clockwise owns it Caches, sticky routing without session state in the LB

Consistent hashing: the ring trick

Ordinary hashing — server = hash(key) % N — collapses when you add or remove a server, because N changes and almost every key remaps to a different bucket. Consistent hashing places both servers and keys on a notional ring (0–2³²). A key maps to whichever server is next clockwise on the ring. Add a server? Only the keys between the new server and its predecessor shift; the rest stay put. Remove a server? Only its keys migrate to the next server clockwise. This keeps cache hit rates high during scale events.

hash ring 0 – 2³² S-A S-B S-C key1 → S-B key2 → S-A
Each key travels clockwise to the nearest server. Adding a new server only displaces the keys between it and its predecessor.

L4 vs L7 load balancing

The OSI model layer at which the balancer operates determines what it can "see" and therefore how smart its decisions can be.

L4 — Transport layerL7 — Application layer
Works on TCP/UDP connections — IP + port only HTTP/HTTPS — headers, URL, cookies, body
Speed Very fast; minimal processing Slower; must parse full HTTP request
Can route by IP address, port Path (/api/* vs /static/*), header, cookie, method
Example use Raw TCP game servers, databases Microservices, canary deployments, A/B testing

An L7 load balancer can, for example, route all requests to /api/video to a GPU-equipped fleet and everything else to general-purpose nodes — something L4 cannot do because it never looks at the URL.

Health checks and connection draining

Health checks are periodic probes the balancer sends to each instance — usually an HTTP GET /health that returns 200 when the server is ready, or a TCP handshake for L4. If an instance fails N consecutive checks, the balancer removes it from rotation and stops sending new requests to it. When it recovers, it re-enters automatically.

Connection draining (sometimes called "deregistration delay") is what happens when you intentionally take a server out of rotation for a deploy. Rather than cutting existing connections immediately, the balancer lets in-flight requests finish (typically up to 30–60 seconds) before fully removing the instance. Clients never see an abrupt disconnection.

# Minimal NGINX upstream health-check config (simplified)
upstream api_servers {
    least_conn;                    # algorithm
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

server {
    location /api/ {
        proxy_pass http://api_servers;
        proxy_connect_timeout 2s;    # fail fast on dead instance
        proxy_next_upstream error timeout;
    }
}
# Active health checks require NGINX Plus or the health_check module

Sticky sessions

Some applications store per-user state in server memory (shopping cart, WebSocket context). If round-robin sends the same user to a different instance each time, that state is lost. Sticky sessions (also called session affinity) solve this by tagging each user — often via a cookie — and always routing them to the same instance.

The trade-off: stickiness re-introduces the uneven-distribution problem. If one "sticky" instance gets a surge of heavy users, the others sit idle. The cleaner long-term answer is to move session state out of the server (to Redis, a database, or a JWT) so that any instance can serve any user — making stickiness unnecessary.

Load balancer vs API gateway

Students often conflate these two. They overlap but are not the same. An API gateway (see the API gateway lesson) focuses on API-level cross-cutting concerns: authentication, rate limiting, request transformation, routing to different microservices, and API versioning. A load balancer focuses on distributing traffic across identical replicas of one service. In practice, modern systems often layer them: the gateway handles API logic at the edge, and a load balancer sits behind it distributing traffic within each service cluster.

🎯 Interview angle

"How would you distribute traffic across multiple server instances?" is a classic system design opener. Walk through: (1) put a load balancer in front, (2) choose an algorithm (round-robin for stateless; consistent hashing if routing to a cache tier), (3) add health checks so dead instances are removed automatically, (4) ensure servers are stateless so any instance can serve any request. Mentioning connection draining for zero-downtime deploys signals production awareness.

⚠️ Common trap

Running stateful servers behind a round-robin load balancer without sticky sessions. A user logs in on Server A, which stores the session in memory. The next request hits Server B — the session is gone and they're unexpectedly logged out. The real fix is stateless servers (externalise session state), not patching it with stickiness forever.

✅ Do this, not that

Do configure health checks on every server in your upstream pool and set proxy_next_upstream error timeout so NGINX retries on failure. Don't skip health checks and just hope all instances stay healthy forever — one silent crash will send ~1/N of your traffic into a black hole until someone notices the error rate spike.

Under the hood: how each algorithm actually picks a backend

Round-robin — a simple counter

The load balancer maintains a single atomic integer, the next-index counter. Every time a new connection arrives it reads next_index % N to pick the backend, then increments the counter. On a three-node pool the sequence is 0, 1, 2, 0, 1, 2, … The counter never resets — it just wraps via modulo. This is why adding or removing a server changes which backend gets the next request, but never changes the per-request cost.

pool = [A, B, C] # N = 3 counter = 0 request 1 → pool[0 % 3] = A counter → 1 request 2 → pool[1 % 3] = B counter → 2 request 3 → pool[2 % 3] = C counter → 3 request 4 → pool[3 % 3] = A counter → 4 # Add server D (N becomes 4): request 5 → pool[4 % 4] = A (D never picked until counter reaches 3, 7, 11…)

Least-connections — a per-backend counter

Each backend has a live active connection count. On each new request the balancer scans the pool and picks the entry with the lowest count; it increments that counter when the connection opens and decrements it when it closes. For short HTTP requests the counters fluctuate so fast that the result approximates round-robin. The algorithm shows its value for long-lived connections (WebSockets, streaming): a slow backend accumulates a higher counter and receives fewer new connections automatically.

state: {A: 5 conns, B: 2 conns, C: 8 conns} new request → pick min → B (2) state: {A:5, B:3, C:8} new request → pick min → B (3) state: {A:5, B:4, C:8} new request → pick min → A (5) state: {A:6, B:4, C:8} A closes a conn → state: {A:5, B:4, C:8}

Consistent hashing — a worked ring example

Hash space is the integers 0–2³² (about 4.3 billion). Both servers and keys are placed on this ring by hashing their identifiers. A key maps to the first server clockwise from its position.

Setup (3 servers, ring positions simplified to 0–100):

hash("server-A") = 10 → ring position 10 hash("server-B") = 45 → ring position 45 hash("server-C") = 80 → ring position 80 hash("user:1001") = 22 → first server clockwise from 22 = server-B (pos 45) hash("user:2055") = 60 → first server clockwise from 60 = server-C (pos 80) hash("user:3077") = 90 → first server clockwise from 90 = server-A (pos 10, wraps)

Adding server-D at position 35:

New ring: A@10, B@45, C@80, D@35 user:1001 (pos 22) → first clockwise = D@35 (moved from B to D) user:2055 (pos 60) → first clockwise = C@80 (unchanged) user:3077 (pos 90) → first clockwise = A@10 (unchanged) # Only keys in the arc (10 → 35) remapped. All others stayed put. # Fraction of keys remapped ≈ 1/(N+1) = 25% vs 100% with modulo hashing.

In practice, each server is given multiple virtual nodes (typically 100–200 per real server) spread around the ring. Without virtual nodes, a 3-server ring leaves large arcs unbalanced: the server with position 10 owns arc 80→10 (30 units), while the one at 45 only owns 10→45 (35 units). Virtual nodes make the distribution statistically uniform.

0 – 100 S-A pos 10 S-B pos 45 S-C pos 80 S-D pos 35 (new) key:1001 (22) → S-D key:2055 (60) → S-C (unchanged) After adding S-D: Only keys in arc 10→35 remapped (≈ 25% of keys) All other keys stay on same node
Adding server-D at position 35 only displaces keys in the arc from S-A (pos 10) to S-D (pos 35). All other keys remain on their current server — cache hit rates stay high during scale-out.

Health-check mechanics and connection draining

A typical active health check sends an HTTP GET /healthz every 5–10 s. The balancer tracks a rolling window: after N consecutive failures the backend is marked unhealthy and removed from the pool; after M consecutive successes it is re-added. The thresholds (N=2, M=3 are common) prevent flapping on transient timeouts.

Connection draining runs when a backend is voluntarily taken out (deploy, scale-down). The balancer stops routing new requests to it, waits up to the drain timeout (typically 30–60 s) for in-flight requests to complete, then severs the connection. From the client's perspective no request is interrupted.

# Sequence during a rolling deploy of backend B: t=0 B enters draining state — LB stops sending new requests to B t=0 B still has 12 open connections (long-polling, streaming) t=8 connections close naturally as requests finish t=30 drain timeout reached — LB forcibly closes remaining connections t=30 B removed from pool; deploy proceeds t=90 B rejoins after deploy, passes 3 health checks t=90 B re-added to active pool — traffic resumes

How to debug & inspect load balancing

The key insight: to know which backend served a request, the backend (or the load balancer) must add a header. Most production load balancers (AWS ALB, NGINX, HAProxy) can be configured to inject an X-Upstream-Server or X-Served-By header. Without it you are guessing.

# NGINX: add upstream server info to response header add_header X-Upstream-Server $upstream_addr; # Then verify distribution with curl in a loop: $ for i in $(seq 1 9); do curl -si https://api.example.com/healthz | grep -i x-upstream done X-Upstream-Server: 10.0.0.1:8080 X-Upstream-Server: 10.0.0.2:8080 X-Upstream-Server: 10.0.0.3:8080 X-Upstream-Server: 10.0.0.1:8080 # Even distribution = round-robin working. Skewed = a problem (see table). # Consistent hashing: confirm the same key always hits the same backend: $ for i in 1 2 3; do curl -si -H "X-User-Id: 1001" https://api.example.com/v1/data \ | grep -i x-upstream done X-Upstream-Server: 10.0.0.2:8080 # same every time = correct X-Upstream-Server: 10.0.0.2:8080 X-Upstream-Server: 10.0.0.2:8080
SymptomLikely causeFix
One backend receives far more traffic than others (hot node) Round-robin with persistent connections: HTTP/2 or HTTP keep-alive reuses the same TCP connection, so one connection runs many requests to the same backend Switch to least-connections; or use per-request (not per-connection) balancing at L7; or use NGINX's keepalive limit
Cache miss rate spikes when a node is added to the pool Using modulo hashing (hash(key) % N) — all keys rehash when N changes Switch to consistent hashing; use virtual nodes for even distribution
Users are randomly logged out after a deploy In-memory session state; round-robin routes the user to a different backend after the old one restarts Externalise session state to Redis; or ensure draining completes before backend restarts
Health checks pass but real requests fail on a backend Health endpoint is too shallow (just returns 200, doesn't check DB connectivity or downstream deps) Add a deep health check that exercises key dependencies; return 503 if any dependency is unhealthy
Requests time out to a backend the LB still considers healthy Health check interval is long (10 s) and the backend failed between checks; or health endpoint is on a different port/process than the app Shorten health-check interval + reduce failure threshold; run health check on the same process/port as the app
Deploy causes a spike of 502/503 errors for 5–30 seconds Backend removed from pool before draining, or new instance accepts connections before it is ready Increase drain timeout; add a readiness check separate from liveness; only add backend to pool after N successful health checks

Debug checklist for uneven distribution / hot node:

  1. Add X-Upstream-Server header to responses (NGINX: add_header X-Upstream-Server $upstream_addr).
  2. Run 100 requests via curl in a loop and count how many hit each backend (| sort | uniq -c).
  3. Check whether persistent connections (keep-alive / HTTP/2) are routing multiple requests per connection — switch to per-request balancing or least-connections.
  4. If using consistent hashing, verify the key being hashed is uniformly distributed (user IDs are; IP addresses less so — many users behind a single NAT IP collapse to one backend).
  5. Inspect the load balancer's backend metric counters (e.g. nginx_upstream_requests_total in Prometheus) — a sudden drop to zero on one backend means it failed a health check.

By the numbers

Make the math concrete. Scenario: a product-search cache tier at 12,000 req/s, currently running on 4 nodes (N=4). Each node can handle at most 4,000 req/s before latency degrades.

Formula: fraction remapped when adding the (N+1)-th node

Consistent hashing remaps approximately 1/(N+1) of keys when a new node is added. Naive modulo hashing (hash(key) % N) remaps nearly all keys because N changes.

fraction_remapped ≈ 1 / (N + 1) [consistent hashing] fraction_remapped ≈ 1 − 1/N [modulo hashing — almost everything moves]

Scale-out remap table

At 12,000 req/s, each remap is a cache miss — the request falls through to the origin database. The table shows the business impact of each node addition:

Before → AfterConsistent hashing: % remappedConsistent hashing: miss spike (req/s)Modulo hash: % remappedModulo hash: miss spike (req/s)
4 → 5 nodes1/5 = 20%12,000 × 0.20 = 2,400≈ 75%≈ 9,000
9 → 10 nodes1/10 = 10%12,000 × 0.10 = 1,200≈ 90%≈ 10,800
19 → 20 nodes1/20 = 5%12,000 × 0.05 = 600≈ 95%≈ 11,400

With consistent hashing, a 4→5 scale-out dumps ~2,400 req/s back to the origin for one replication cycle. With modulo hashing, it dumps ~9,000 req/s — likely exceeding the origin's capacity. (Google Maglev paper demonstrates this trade-off at planetary scale.)

Virtual nodes: reducing load variance

Without virtual nodes, three servers placed arbitrarily on a 0–100 ring produce unequal arc lengths. Virtual nodes (vnodes) give each real server multiple ring positions, making the distribution approach uniform.

# 3 real servers, no virtual nodes — actual arc sizes: S-A owns arc 80→10 = 30 units (30% of keys) ← heavy S-B owns arc 10→45 = 35 units (35% of keys) ← heavy S-C owns arc 45→80 = 35 units (35% of keys) Imbalance: max/min = 35/30 ≈ ±17% deviation from equal (33%) # With 100 virtual nodes per server (300 ring positions total): Each server owns ≈100 arcs, each ≈1 unit → statistical average ≈33% Observed imbalance empirically drops to ≈ ±2–5% (Dynamo paper, §4.3)
vnodes per serverTypical load imbalance (±% from equal)
1 (no vnodes)±15–25%
10±8–12%
100±2–5%
200+±1–2%

Source: DeCandia et al. — Dynamo: Amazon's Highly Available Key-Value Store, §4.3 (virtual node load distribution).

Hot-shard math: when one key dominates traffic

If a single key ("trending product") generates 30% of total traffic, its assigned node absorbs 30% of 12,000 = 3,600 req/s. The remaining 11 nodes split the other 70% = 8,400 req/s ÷ 11 = 764 req/s each. Consistent hashing does not solve hot shards — it only solves even distribution of normal keys.

hot_node_load = total_QPS × hot_key_traffic_fraction = 12,000 × 0.30 = 3,600 req/s # 3,600 req/s exceeds the 4,000 req/s per-node budget → only 11% headroom left # Adding more nodes does NOT help: the hot key always maps to one ring position. # Fix: local in-process cache for the hot key, or read replicas for that shard.

Decision math: how many nodes to stay within capacity

Per-node load at N nodes (ignoring hot shards, assuming even distribution): per_node = total_QPS / N. Solve for the minimum N that keeps each node under its capacity limit of 4,000 req/s:

N ≥ ceil(total_QPS / per_node_capacity) = ceil(12,000 / 4,000) = 3 nodes (minimum) With 10% hot-key factor, effective QPS on worst node ≈ total_QPS × (1/N + hot_fraction) N=4: 12,000×(0.25 + 0.10) = 12,000×0.35 = 4,200 req/s ← over budget! N=5: 12,000×(0.20 + 0.10) = 12,000×0.30 = 3,600 req/s ← safe # Add a hot-shard buffer: target ≤80% utilisation → N ≥ ceil(12,000 / (4,000×0.80)) = 4 nodes minimum, # but with a 30% hot key, need N=5 to stay under 80% on the hot node.

Rule of thumb: size to <80% utilisation per node at expected peak, then add one extra node as buffer for hot keys. At 12,000 req/s with a known 30%-traffic hot key, 5 nodes is the break-even; 6 gives comfortable headroom.

🧠 Quick check

1. You have a caching tier where it matters that the same cache key always reaches the same server. Which algorithm should you use?

Consistent hashing maps each key to a stable server on the ring. Adding or removing a server only remaps the keys nearest to that server, preserving cache hit rates far better than modulo-based hashing.

2. An L7 load balancer can do something an L4 balancer cannot. What is it?

L7 balancers parse the full HTTP request, so they can inspect URL paths, headers, and cookies. L4 balancers only see IP addresses and ports — they cannot look inside the HTTP envelope.

3. What does "connection draining" accomplish during a rolling deploy?

Connection draining (deregistration delay) stops sending new requests to a departing instance while allowing its open connections to complete naturally — zero mid-request failures for users during deploys.

4. A team stores shopping-cart data in the server process memory and enables sticky sessions to cope. What is the fundamental problem they have not solved?

Stickiness routes a user to the same instance while it's alive, but if that instance crashes, the in-memory cart is gone. The real fix is externalising session state (Redis, DB) so any instance can reconstruct it.

✍️ Exercise: design the load-balancing tier for a live streaming service

A live-streaming platform has two types of backend nodes: ingest servers (accept the streamer's video upload via persistent RTMP connection) and delivery servers (serve HLS video chunks to viewers over HTTP). Design the load-balancing strategy for each tier, noting algorithm choice and session/stickiness requirements.

Model answer:

  1. Ingest tier — use consistent hashing on the stream ID. Each RTMP connection is long-lived and must land on the same ingest server for the entire broadcast. Consistent hashing keeps the stream on one node; adding a new ingest server only redistributes future streams, not live ones.
  2. Delivery tier — use least-connections with stateless servers. Each HLS chunk request is short and independent. Stateless chunk servers can serve any viewer. Least-connections prevents overloading servers that are slow on high-definition segments.
  3. Health checks on both tiers. Ingest: TCP check on port 1935. Delivery: HTTP GET /healthz. Connection draining on delivery (30 s); drain time on ingest must be zero (RTMP connections should be migrated via reconnect logic, not drained).
  4. Placement: an L4 load balancer for ingest (TCP, no HTTP); an L7 load balancer for delivery (can split /live/* to delivery and /admin/* to control nodes).

Rubric: ✓ different algorithms for different tiers ✓ justifies persistent-connection need at ingest ✓ stateless delivery servers ✓ health checks on both ✓ correct L4/L7 choice ✓ notes drain complexity for RTMP. Five or more = strong answer.

Key takeaways

Sources & further reading