API Design

Foundations · Lesson 04

Latency vs throughput

These two get swapped constantly in interviews. Latency is how long one request takes. Throughput is how many requests you handle per second. They're different axes — and improving one often does nothing for the other.

⏱ 10 min · Difficulty: core · Prereq: Lesson 02

By the end you'll be able to

The pipe

Picture water moving through a pipe. Latency is how long a single drop takes to travel from one end to the other — set mostly by the pipe's length. Throughput is how much water exits per second — set mostly by the pipe's width. A long, fat pipe has high latency and high throughput. A short, thin straw has low latency and low throughput. They don't move together.

[Figure: a pipe, with length → LATENCY (time for one drop end-to-end) and width → THROUGHPUT]
Length = latency, cross-section = throughput. Widening the pipe (more servers) raises throughput but doesn't shorten the trip.

The practical punchline: adding servers usually buys throughput, not latency. Ten cashiers instead of one serve more shoppers per minute (throughput ↑), but each shopper's own checkout takes just as long (latency unchanged). To cut latency you shorten the path: move data closer (a CDN), cache it, remove a network hop, or do less work per request.
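A toy model makes the cashier arithmetic concrete. This is only a sketch: it assumes every checkout takes the same fixed time and that no one waits in line, and the 3-minute checkout time and the function name are illustrative values, not figures from the lesson.

```python
def checkout_model(cashiers: int, service_time_min: float) -> tuple[float, float]:
    """Return (throughput in shoppers/minute, latency in minutes per shopper).

    Assumes identical cashiers, a fixed service time, and no waiting line.
    """
    throughput = cashiers / service_time_min  # lanes work in parallel
    latency = service_time_min                # one shopper still uses one lane
    return throughput, latency


one = checkout_model(cashiers=1, service_time_min=3.0)
ten = checkout_model(cashiers=10, service_time_min=3.0)

print(f"1 cashier:   {one[0]:.2f} shoppers/min, {one[1]:.1f} min each")
print(f"10 cashiers: {ten[0]:.2f} shoppers/min, {ten[1]:.1f} min each")
```

Throughput scales with the number of lanes while the per-shopper time stays put, which is the whole point of the analogy.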

🎯 Interview angle

When asked to "make it faster," always clarify which one. "Faster for a single user" → attack latency (caching, CDN, fewer hops, smaller payloads). "Handle more users" → attack throughput (horizontal scale, queues, batching). Conflating them is the single most common stumble in a performance question.

Don't measure with averages — use percentiles

Say 99 requests take 50 ms and one takes 5 seconds. The average is ~100 ms — which sounds fine and describes nobody's actual experience. That's why teams report percentiles:

At scale the tail matters enormously: if a page makes 100 backend calls and each has a 1% chance of being slow, most page loads hit at least one slow call. Designing for p99, not the average, is a senior instinct.
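The 99-fast, one-slow example can be checked in a few lines. This sketch uses only the numbers from the text; the 20% band around the mean is an arbitrary illustrative choice for "close to the average".

```python
import statistics

# 99 fast requests at 50 ms and one slow request at 5,000 ms.
latencies_ms = [50] * 99 + [5000]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
worst_ms = max(latencies_ms)

# How many requests actually landed within 20% of the mean?
near_mean = [x for x in latencies_ms if abs(x - mean_ms) <= 0.2 * mean_ms]

print(f"mean   = {mean_ms} ms")
print(f"median = {median_ms} ms")
print(f"worst  = {worst_ms} ms")
print(f"requests within 20% of the mean: {len(near_mean)}")
```

No request lands anywhere near the mean: everyone saw either 50 ms or 5 seconds.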

⚠️ Common trap

Quoting "average response time" in a design discussion. Averages hide the tail and are skewed by outliers. If someone gives you only an average, ask for p99 — and watch their reaction. A flat average with an ugly p99 is a system with a hidden reliability problem.

Latency is the sum of four delays

"Latency" isn't one thing. A request's total time is the sum of four separate delays, and knowing which one dominates tells you exactly what to fix — speeding up the wrong one buys nothing.

Delay | What it is | Driven by
Transmission | Time to push the bits onto the wire | Message size ÷ bandwidth; big payloads hurt
Propagation | Time for a bit to physically travel the distance | Distance ÷ speed of light; geography, unbeatable
Queuing | Time spent waiting in line at a busy hop | Congestion / load; spikes under traffic
Processing | Time the server actually does the work | Your code, DB queries, CPU

Jitter is the variation in this total from one request to the next — mostly from changing queuing delay. Steady latency is easy to live with; jittery latency wrecks real-time apps like voice and video, which is why those buffer. The lever differs per delay: shrink payloads (transmission), move closer / use a CDN (propagation), add capacity or shed load (queuing), optimise code and queries (processing).
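The sum-of-four-delays model can be sketched as a small calculator. Everything numeric here is an assumption for illustration: the 1 MB payload, the 100 Mbit/s link, the 6,000 km distance, the queuing and processing figures, and a signal speed of roughly 200,000 km/s (about two-thirds of the speed of light, a common approximation for fiber).

```python
def delay_breakdown_ms(payload_bytes, bandwidth_bps, distance_km,
                       queuing_ms, processing_ms,
                       signal_speed_km_per_s=200_000):
    """Split one-way latency into the four delays, in milliseconds."""
    transmission = payload_bytes * 8 / bandwidth_bps * 1000   # size / bandwidth
    propagation = distance_km / signal_speed_km_per_s * 1000  # distance / speed
    return {
        "transmission": transmission,
        "propagation": propagation,
        "queuing": queuing_ms,
        "processing": processing_ms,
    }


# Hypothetical request: 1 MB over 100 Mbit/s across 6,000 km.
delays = delay_breakdown_ms(
    payload_bytes=1_000_000,
    bandwidth_bps=100_000_000,
    distance_km=6_000,
    queuing_ms=2,
    processing_ms=5,
)
total_ms = sum(delays.values())
dominant = max(delays, key=delays.get)

for name, ms in delays.items():
    print(f"{name:<13} {ms:6.1f} ms")
print(f"total         {total_ms:6.1f} ms, dominated by {dominant}")
```

With these made-up inputs the payload size dominates, so compressing or trimming the response would help more than moving the server.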

The numbers worth memorising

You don't need precision — you need the right order of magnitude so your back-of-envelope estimates land within 10×. These are the classic "latency numbers," rounded to remember:

Operation | Rough time | Intuition
Read from main memory (RAM) | ~100 ns | baseline "fast"
Read 1 MB sequentially from RAM | ~10 µs | 100× a single read
SSD random read | ~100 µs | ~1000× slower than RAM
Round trip within a datacenter | ~0.5 ms | network is not free
Read 1 MB from SSD | ~1 ms |
Disk (HDD) seek | ~10 ms | avoid in hot paths
Round trip across the world | ~150 ms | limited by speed of light

The headline ratios: memory is far faster than SSD, SSD is far faster than disk, and a cross-continent network hop dwarfs almost any local work. That last fact is why caching and CDNs exist: serving from a nearby cache turns a 150 ms global round trip into a local one that costs a few milliseconds or less.
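For back-of-envelope work it can help to keep the rounded numbers in one place and derive the ratios rather than memorise them. A minimal sketch using only the values from the table (the dictionary and variable names are made up for this example):

```python
# Rounded latency numbers from the table above, in nanoseconds.
LATENCY_NS = {
    "RAM read": 100,
    "1 MB sequential from RAM": 10_000,
    "SSD random read": 100_000,
    "datacenter round trip": 500_000,
    "1 MB from SSD": 1_000_000,
    "HDD seek": 10_000_000,
    "cross-world round trip": 150_000_000,
}

baseline = LATENCY_NS["RAM read"]
for name, ns in LATENCY_NS.items():
    print(f"{name:<26} {ns:>12,} ns  = {ns / baseline:>11,.0f}x a RAM read")

# How much a global round trip costs relative to one inside the datacenter.
global_vs_local = (LATENCY_NS["cross-world round trip"]
                   / LATENCY_NS["datacenter round trip"])
print(f"global vs in-datacenter round trip: {global_vs_local:.0f}x")
```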

✅ Do this, not that

In estimation questions, do reason in powers of ten ("RAM ~100 ns, cross-region ~100 ms — so the network dominates"). Don't agonise over whether an SSD read is 90 or 120 µs. Interviewers want to see that you know what dominates, not that you memorised a benchmark.

Under the hood: measuring each delay with curl -w and reading a timing log

The four delays in the table above are not theoretical — they are individually measurable. curl's -w flag writes out timing variables after the transfer completes, one variable per phase. Here is the format string that maps directly to the four-delay model:

$ curl -s -o /dev/null \
    -w "dns_lookup: %{time_namelookup}s\ntcp_connect: %{time_connect}s\ntls_handshake: %{time_appconnect}s\nttfb: %{time_starttransfer}s\ntotal: %{time_total}s\nbytes_down: %{size_download} bytes\n" \
    https://api.example.com/v1/users/42
dns_lookup: 0.018s
tcp_connect: 0.067s
tls_handshake: 0.119s
ttfb: 0.143s
total: 0.145s
bytes_down: 312 bytes

Each variable is cumulative from t=0, so to get the duration of each phase you subtract consecutive values:

Translating cumulative timestamps → per-phase durations:

DNS resolution:     0.018 s          ← time_namelookup
TCP connect:        0.067 − 0.018 = 0.049 s   ← propagation + queueing (one RTT)
TLS handshake:      0.119 − 0.067 = 0.052 s   ← TLS round trips (1-2 RTTs)
Server processing:  0.143 − 0.119 = 0.024 s   ← TTFB minus TLS done = server work
Transfer:           0.145 − 0.143 = 0.002 s   ← transmission delay (body was only 312 bytes)

Dominant cost here: TLS + TCP setup = ~101 ms (propagation/queueing)
Fix: keep-alive / connection reuse eliminates TCP+TLS on subsequent requests
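The same subtraction can be scripted so it is not redone by hand for every sample. This is a sketch, assuming an HTTPS request (so time_appconnect is non-zero) and that the five curl timings have already been parsed into a dictionary; the function name and phase labels are hypothetical.

```python
def phase_durations(timings):
    """Convert curl's cumulative timings (seconds) into per-phase durations."""
    steps = [
        ("time_namelookup", "dns"),
        ("time_connect", "tcp_connect"),
        ("time_appconnect", "tls_handshake"),
        ("time_starttransfer", "wait_first_byte"),
        ("time_total", "transfer"),
    ]
    durations = {}
    previous = 0.0
    for key, phase in steps:
        # Each phase is the gap between this timestamp and the one before it.
        durations[phase] = round(timings[key] - previous, 6)
        previous = timings[key]
    return durations


# Values copied from the sample curl output above.
sample = {
    "time_namelookup": 0.018,
    "time_connect": 0.067,
    "time_appconnect": 0.119,
    "time_starttransfer": 0.143,
    "time_total": 0.145,
}
phases = phase_durations(sample)
slowest = max(phases, key=phases.get)

for phase, seconds in phases.items():
    print(f"{phase:<16} {seconds * 1000:6.1f} ms")
print(f"slowest phase: {slowest}")
```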

Computing p50 / p99 from a log. Once you have a stream of response times (e.g. from an access log or a simple loop), percentiles are straightforward to compute. Here is the pattern using command-line tools:

# Collect 200 samples of total time into a file
$ for i in $(seq 1 200); do
    curl -s -o /dev/null -w "%{time_total}\n" https://api.example.com/v1/users/42
  done > latencies.txt

# Sort and extract p50 (line 100) and p99 (line 198 of 200)
$ sort -n latencies.txt | awk '
    NR==100 { printf "p50: %ss\n", $1 }
    NR==198 { printf "p99: %ss\n", $1 }
  '
p50: 0.143s
p99: 0.821s
# p99 is ~5.7× the median, a typical sign of queueing under bursty load

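That p99 gap is the "tail latency" problem in action. In production, use your APM or a Prometheus histogram (e.g. histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))) instead of a shell loop. The idea is the same, although histogram_quantile estimates the value by interpolating within histogram buckets rather than sorting raw samples.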
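Picking a line number out of a sorted file is the nearest-rank percentile rule. A minimal sketch of that rule follows; the synthetic sample list stands in for latencies.txt and is not real measurement data.

```python
import math


def nearest_rank(samples, p):
    """Return the p-th percentile (p in 1..100) using the nearest-rank rule."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in sorted order
    return ordered[max(rank, 1) - 1]


# Synthetic stand-in for latencies.txt: 200 values from 0.001 s to 0.200 s, unsorted.
samples = [i / 1000 for i in range(200, 0, -1)]

p50 = nearest_rank(samples, 50)  # sorted position 100
p99 = nearest_rank(samples, 99)  # sorted position 198
print(f"p50: {p50}s")
print(f"p99: {p99}s")
```

With 200 samples the rule lands on positions 100 and 198, matching the line numbers used in the awk one-liner.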

How to debug & inspect it — slow-endpoint triage

When an endpoint is slow, the four delays give you a structured triage: identify which phase is large, and only then look for fixes in that area. The table below maps symptoms to the delay type and the lever to pull.

Symptom (from curl -w or APM) | Delay type | Likely cause | Fix
High time_namelookup (>50 ms) | DNS (propagation) | Uncached DNS; very low TTLs forcing frequent re-resolution; slow or overloaded DNS resolver | Raise DNS TTL for stable hot records; use a local DNS cache (e.g. systemd-resolved); pre-resolve in connection pool startup
High TCP connect time (time_connect − time_namelookup > 100 ms) | Propagation | Server is geographically distant from the client; no CDN/edge | Deploy to a region closer to users; use a CDN or anycast network
High TLS time (time_appconnect − time_connect > 100 ms) | Propagation + processing | TLS 1.2 requires 2 RTTs; slow certificate validation; OCSP stapling missing | Upgrade to TLS 1.3 (1 RTT); enable OCSP stapling; use session resumption
High TTFB − TLS time (> 100 ms for a simple read) | Processing | Slow DB query; missing index; N+1 query; CPU-bound computation; cold JVM/container | Instrument with an APM trace; add DB indexes; cache hot reads; warm containers ahead of traffic
High transfer time for large bodies | Transmission | Large response payload; no compression; bandwidth-constrained client | Enable gzip/brotli compression; paginate large collections; use streaming; reduce response fields
p99 ≫ p50 (long tail but good median) | Queuing | Thread-pool or DB connection-pool exhaustion under burst; GC pauses; lock contention | Increase connection pool size; add a queue/backpressure mechanism; tune GC; add more capacity before the bottleneck
Latency good on first call; spikes after idle | Processing (cold) | Container/function cold start; JIT warm-up; lazy cache fill; TCP keep-alive timeout | Use keep-alive pings; schedule warm-up requests; pre-warm caches; provision for minimum-instance-count

Slow-endpoint triage checklist:

  1. Run curl -w with the timing format above (or pull from APM) and compute per-phase durations.
  2. Identify the dominant phase — that is the only one worth optimising. Fixing the wrong phase wastes effort.
  3. If processing (TTFB) is large: add tracing inside the endpoint to find the slow DB call or computation.
  4. If connect or TLS is large: the fix is geographic (CDN/edge), not code.
  5. Collect p50 and p99. A high p99/p50 ratio (> 3×) almost always indicates queuing — look for thread-pool or DB-pool exhaustion.
  6. Always re-measure after each change to confirm the phase shrank, not just the total.
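The first pass of the checklist can be sketched as a small rule table. The thresholds below come from the symptom table (50 ms for DNS, 100 ms for the other phases, a 3× p99/p50 ratio); the phase names, the advice strings, and the function itself are hypothetical, and the thresholds are starting points rather than hard limits.

```python
# (phase name, threshold in seconds, first thing to look at)
RULES = [
    ("dns", 0.050, "DNS: check resolver caching and TTLs"),
    ("tcp_connect", 0.100, "Propagation: move closer to users (region, CDN, anycast)"),
    ("tls_handshake", 0.100, "TLS: consider TLS 1.3, OCSP stapling, session resumption"),
    ("server_wait", 0.100, "Processing: trace the endpoint, look for slow queries"),
]


def triage(phases, p50=None, p99=None):
    """Return a list of suspects given per-phase durations in seconds."""
    findings = [advice for name, limit, advice in RULES
                if phases.get(name, 0.0) > limit]
    # A long tail over a healthy median points at queuing.
    if p50 and p99 and p99 / p50 > 3:
        findings.append("Queuing: p99 is more than 3x p50, check pool exhaustion")
    return findings


# Per-phase values and percentiles from the curl examples earlier in the lesson.
healthy_phases = {"dns": 0.018, "tcp_connect": 0.049,
                  "tls_handshake": 0.052, "server_wait": 0.024}
findings = triage(healthy_phases, p50=0.143, p99=0.821)
for line in findings:
    print(line)
```

For the sample numbers no single phase crosses its threshold, so the only flag raised is the p99/p50 ratio.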

By the numbers

Concrete scenario: a checkout API handles 5,000 req/s with a mean service time W = 40 ms (0.04 s). A fan-out page loader calls N = 100 microservices per page, each with a p = 1% chance of hitting the slow tail (>500 ms).

Little's Law — how many requests are in flight right now?

Little's Law is the single most useful identity in queuing theory. For any stable system:

L = λ · W

L = average number of requests in-flight (concurrency)
λ = arrival rate (req/s)
W = average time a request spends in the system (seconds)

For our checkout API: L = 5,000 × 0.04 = 200 concurrent requests in-flight at steady state. That is how many threads, goroutines, or connection-pool slots must be available just to keep up. If you size the thread pool at 150, you are already 25% short at today's load.
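The checkout calculation is a one-liner worth keeping handy for pool sizing. A minimal sketch using the numbers above (the function and variable names are made up for this example):

```python
def in_flight(arrival_rate_per_s, time_in_system_ms):
    """Little's Law: L = lambda * W, with W given in milliseconds."""
    return arrival_rate_per_s * time_in_system_ms / 1000


needed = in_flight(arrival_rate_per_s=5_000, time_in_system_ms=40)
pool_size = 150
shortfall = (needed - pool_size) / needed  # fraction of required slots missing

print(f"concurrent requests needed: {needed:.0f}")
print(f"pool of {pool_size} is {shortfall:.0%} short")
```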

Queueing blow-up as utilisation ρ approaches 1

Utilisation ρ = λ / μ, where μ is the server's maximum throughput (requests per second it can handle). In an M/M/1 model the average wait in queue is:

wait = ρ / (1 − ρ) × service_time

The wait multiplier ρ / (1 − ρ) blows up near ρ = 1 — this is why "90% utilised feels fine but 99% utilised is catastrophic":

Utilisation ρ | Wait multiplier | Wait (service_time = 40 ms) | Total latency
0.50 | 1.0× | 40 ms | 80 ms
0.80 | 4.0× | 160 ms | 200 ms
0.90 | 9.0× | 360 ms | 400 ms
0.95 | 19.0× | 760 ms | 800 ms
0.99 | 99.0× | 3,960 ms | 4,000 ms

At ρ = 0.9 the wait alone is 9× the service time. The queue is not a linear function of load: it is a hyperbola that goes vertical as ρ approaches 1. The practical rule: target ρ ≤ 0.7 for latency-sensitive paths, so that a 20% traffic spike only pushes you to about ρ = 0.85 rather than over the cliff.
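The utilisation table can be regenerated from the M/M/1 formula, which makes it easy to try other service times. A minimal sketch, assuming the single-queue M/M/1 model used above (the function name is hypothetical):

```python
def mm1_wait_ms(rho, service_ms):
    """Average M/M/1 queue wait: rho / (1 - rho) * service_time."""
    if not 0 <= rho < 1:
        raise ValueError("M/M/1 is only stable for 0 <= rho < 1")
    return rho / (1 - rho) * service_ms


SERVICE_MS = 40
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    wait = mm1_wait_ms(rho, SERVICE_MS)
    print(f"rho={rho:.2f}  multiplier={wait / SERVICE_MS:5.1f}x  "
          f"wait={wait:7.0f} ms  total={wait + SERVICE_MS:7.0f} ms")
```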

Fan-out tail: P(at least one slow call) = 1 − (1 − p)^N

A page that fans out to N independent services, each with a p = 1% chance of being slow on any given call:

P(≥1 slow) = 1 − (1 − p)^N

Trace for p = 0.01 (1% per service):

N (services called) | Formula | P(≥1 slow) | Practical effect
1 | 1 − 0.99^1 | 1.0% | Rare: one in a hundred pages affected
10 | 1 − 0.99^10 | 9.6% | Nearly 1-in-10 pages hits a slow call
100 | 1 − 0.99^100 | 63.4% | The majority of page loads are slow
200 | 1 − 0.99^200 | 86.6% | Slow is the norm, not the exception

At N = 100 services per page, a 1% per-service tail hits 63% of all page loads. That is why a p50 of 40 ms can coexist with a p99 of 800 ms — the average is fine; the fan-out exposes the tail on almost every complex request. (Dean & Barroso, "The Tail at Scale," CACM 2013)
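The fan-out probabilities are easy to recompute for other values of p and N. A minimal sketch, assuming the calls are independent as the formula requires (the function name is made up for this example):

```python
def p_any_slow(p_slow, n_calls):
    """Probability that at least one of n independent calls is slow."""
    return 1 - (1 - p_slow) ** n_calls


for n in (1, 10, 100, 200):
    print(f"N={n:>3}  P(>=1 slow) = {p_any_slow(0.01, n):.1%}")
```

Real services often share dependencies, so slow calls tend to be correlated; treat the independent-call result as a rough guide rather than an exact prediction.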

Worked trace — a burst at 08:15:00 UTC

Scenario: the checkout fleet processes at μ = 6,000 req/s max capacity. A flash sale fires up at 08:15:00:

Time (UTC) | Arrival rate λ | ρ = λ/μ | Queue wait (M/M/1) | p99 observed | Action needed?
08:14:00 | 3,000 req/s | 0.50 | 40 ms | ~85 ms | No: headroom is comfortable
08:15:00 | 5,000 req/s | 0.83 | 200 ms | ~420 ms | Watch: p99 rising fast
08:15:30 | 5,700 req/s | 0.95 | 760 ms | ~1,600 ms | Alert: scale now or shed load
08:16:00 | 6,200 req/s | 1.03 | unbounded | timeout | Emergency: queue growing without bound
08:16:30 | 5,000 req/s | 0.83 | 200 ms | ~420 ms | Recovery after auto-scaling adds capacity

The key lesson from the trace: by the time ρ = 0.95 is visible in metrics, you are already 30 seconds from a full queue collapse. You need to scale at the 0.83 alarm, not at the timeout.
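To see what "growing without bound" means in requests, a simple fluid approximation helps: while λ exceeds μ, the backlog grows at λ − μ per second. This sketch assumes capacity stays fixed at 6,000 req/s and that each arrival rate holds for its full 30-second window. Both are simplifications that go beyond the trace, and the function name is hypothetical.

```python
def backlog_after(arrival_rate, capacity, seconds, start_backlog=0.0):
    """Fluid model: backlog changes by (arrival - capacity) per second, floor 0."""
    return max(0.0, start_backlog + (arrival_rate - capacity) * seconds)


MU = 6_000  # req/s, fleet capacity from the scenario

# Overload window 08:16:00 to 08:16:30 at 6,200 req/s.
backlog = backlog_after(arrival_rate=6_200, capacity=MU, seconds=30)

# Once arrivals fall back to 5,000 req/s, spare capacity drains the backlog.
drain_seconds = backlog / (MU - 5_000)

print(f"backlog after 30 s of overload: {backlog:,.0f} requests")
print(f"time to drain at 5,000 req/s:   {drain_seconds:.0f} s")
```

Even a 3% overload leaves thousands of queued requests behind, and every one of them waits far longer than the 40 ms service time.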

Decision math — how much headroom do you need?

To keep the queueing wait within a chosen multiple of the service time under a 2× traffic spike, work backwards:

# After a 2× spike, the new ρ must stay below your comfort ceiling:
ρ_after_spike = steady_ρ × 2  →  steady_ρ must be ≤ comfort_ceiling / 2

# For comfort_ceiling = 0.80 (wait ≤ 4× service_time):
steady_ρ ≤ 0.80 / 2 = 0.40  →  target utilisation at steady load ≤ 40%

# For comfort_ceiling = 0.70 (a more conservative target):
steady_ρ ≤ 0.70 / 2 = 0.35

Translation: if your flash sales can double traffic, run at ≤ 40% utilisation in steady state. At μ = 6,000 req/s that means planning for a maximum of 6,000 × 0.40 = 2,400 req/s steady-state before you start provisioning more capacity. A closely related rule of thumb is to size the fleet so that peak-day traffic uses no more than half the installed capacity, a guideline attributed to Google SRE (Google SRE Book, Chapter 22).
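The headroom rule generalises to any spike factor and comfort ceiling. A minimal sketch, with the function name and the loop values chosen for illustration:

```python
def steady_state_ceiling(comfort_ceiling, spike_factor):
    """Highest steady-state utilisation that stays at or below the ceiling after a spike."""
    return comfort_ceiling / spike_factor


MU = 6_000  # req/s, fleet capacity from the scenario

for ceiling in (0.80, 0.70):
    rho = steady_state_ceiling(ceiling, spike_factor=2)
    print(f"ceiling={ceiling:.2f}  steady rho <= {rho:.2f}  "
          f"max steady load = {rho * MU:,.0f} req/s")
```

Changing spike_factor to 3 shows how quickly the affordable steady-state utilisation drops when bursts get larger.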

🧠 Quick check

1. You add 9 more identical servers behind a load balancer. What most likely improves?

More servers = a wider pipe = more requests per second (throughput). Each individual request still travels the same path, so its latency doesn't shrink.

2. Why prefer p99 over the average response time?

A few slow outliers skew the average and mask real pain. p99 captures the tail, which dominates user experience once requests fan out.

3. Roughly, which is slowest?

~150 ms cross-world vs ~100 µs SSD vs ~100 ns RAM: the global network hop is roughly 1,500× slower than the SSD read and about 1.5 million times slower than the RAM read. This is why moving data closer (CDN/cache) is the biggest latency lever.

✍️ Drill: estimate a feed load

A user in London loads a feed served only from a US datacenter. The server work is ~5 ms. Roughly what end-to-end latency do they feel, and what's the cheapest fix? Decide first.

Model answer: The cross-Atlantic round trip dominates — ~80–150 ms — so the user feels ~100 ms+, not 5 ms. The server work is noise next to the network. Cheapest fix: serve from an edge/CDN closer to London (and cache the feed), turning the global hop into a local one. Adding more US servers wouldn't help this user at all — that's throughput, not latency.

Rubric: ✓ identifies the network as the dominant term ✓ ignores the tiny server time correctly ✓ proposes a latency fix (edge/cache), not a throughput one.

Key takeaways

Sources & further reading