Foundations · Lesson 04
Latency vs throughput
These two get swapped constantly in interviews. Latency is how long one request takes. Throughput is how many requests you handle per second. They're different axes — and improving one often does nothing for the other.
By the end you'll be able to
- Define latency and throughput precisely and keep them straight under pressure.
- Use percentiles (p50, p99) instead of averages — and say why averages lie.
- Recall the order-of-magnitude "numbers every engineer should know."
The pipe
Picture water moving through a pipe. Latency is how long a single drop takes to travel from one end to the other — set mostly by the pipe's length. Throughput is how much water exits per second — set mostly by the pipe's width. A long, fat pipe has high latency and high throughput. A short, thin straw has low latency and low throughput. They don't move together.
The practical punchline: adding servers usually buys throughput, not latency. Ten cashiers instead of one serve more shoppers per minute (throughput ↑) but each shopper's own checkout takes just as long (latency unchanged). To cut latency you shorten the path: move data closer (a CDN), cache it, remove a network hop, or do less work per request.
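The cashier example can be put into numbers. The sketch below is a deliberately simple model that assumes no shopper ever waits in line; the function name and the 2-minute checkout time are illustrative assumptions, not figures from the lesson.

```python
def checkout_model(cashiers: int, minutes_per_shopper: float) -> tuple[float, float]:
    """Return (throughput in shoppers/minute, latency in minutes per shopper).

    Assumes every cashier is always busy and nobody queues, so latency
    is just the time one checkout takes.
    """
    throughput = cashiers / minutes_per_shopper
    latency = minutes_per_shopper
    return throughput, latency


one = checkout_model(cashiers=1, minutes_per_shopper=2.0)
ten = checkout_model(cashiers=10, minutes_per_shopper=2.0)
print(f"1 cashier:   {one[0]:.1f} shoppers/min, {one[1]:.1f} min each")
print(f"10 cashiers: {ten[0]:.1f} shoppers/min, {ten[1]:.1f} min each")
```

Throughput scales with the number of cashiers while the per-shopper time stays put, which is the "different axes" point in miniature.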
When asked to "make it faster," always clarify which one. "Faster for a single user" → attack latency (caching, CDN, fewer hops, smaller payloads). "Handle more users" → attack throughput (horizontal scale, queues, batching). Conflating them is the single most common stumble in a performance question.
Don't measure with averages — use percentiles
Say 99 requests take 50 ms and one takes 5 seconds. The average is ~100 ms — which sounds fine and describes nobody's actual experience. That's why teams report percentiles:
- p50 (median): half of requests are faster than this. The typical experience.
- p99: 99% are faster; the slowest 1% are worse. This is the experience your unluckiest, often most valuable, users feel.
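The 99-fast-one-slow example above is easy to check. A minimal sketch using only Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# 99 requests at 50 ms and one at 5 seconds (5,000 ms), as in the example.
latencies_ms = [50] * 99 + [5000]

mean_ms = statistics.mean(latencies_ms)      # pulled up by the single outlier
median_ms = statistics.median(latencies_ms)  # p50: what a typical request saw
worst_ms = max(latencies_ms)                 # the one request in the tail

print(f"mean={mean_ms} ms  p50={median_ms} ms  max={worst_ms} ms")
```

The mean lands near 100 ms even though 99 of the 100 requests finished in 50 ms and one took 5 s; no request actually took about 100 ms.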
At scale the tail matters enormously: if a page makes 100 backend calls and each has a 1% chance of being slow, most page loads hit at least one slow call. Designing for p99, not the average, is a senior instinct.
A common mistake is quoting "average response time" in a design discussion. Averages hide the tail and are skewed by outliers. If someone gives you only an average, ask for p99 and watch their reaction. A flat average with an ugly p99 is a system with a hidden reliability problem.
Latency is the sum of four delays
"Latency" isn't one thing. A request's total time is the sum of four separate delays, and knowing which one dominates tells you exactly what to fix — speeding up the wrong one buys nothing.
| Delay | What it is | Driven by |
|---|---|---|
| Transmission | Time to push the bits onto the wire | Message size ÷ bandwidth — big payloads hurt |
| Propagation | Time for a bit to physically travel the distance | Distance ÷ signal speed (about two-thirds of light speed in fibre); set by geography, unbeatable |
| Queuing | Time spent waiting in line at a busy hop | Congestion / load — spikes under traffic |
| Processing | Time the server actually does the work | Your code, DB queries, CPU |
Jitter is the variation in this total from one request to the next — mostly from changing queuing delay. Steady latency is easy to live with; jittery latency wrecks real-time apps like voice and video, which is why those buffer. The lever differs per delay: shrink payloads (transmission), move closer / use a CDN (propagation), add capacity or shed load (queuing), optimise code and queries (processing).
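Transmission and propagation are the two delays you can estimate from first principles. The sketch below does so under stated assumptions: signals travel at roughly 200,000 km/s in optical fibre (about two-thirds of the speed of light in a vacuum), and the payload size, link bandwidth, and distance in the example are made-up inputs, not measurements.

```python
FIBRE_SPEED_KM_PER_S = 200_000  # assumed signal speed in fibre, ~2/3 of c


def transmission_delay_s(payload_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Time to push the payload onto the wire: size / bandwidth."""
    return payload_bytes * 8 / bandwidth_bits_per_s


def propagation_delay_s(distance_km: float) -> float:
    """One-way time for a bit to cover the distance: distance / signal speed."""
    return distance_km / FIBRE_SPEED_KM_PER_S


# Hypothetical request: 1 MB response over a 100 Mbit/s link, 6,000 km away.
tx = transmission_delay_s(1_000_000, 100e6)
prop = propagation_delay_s(6_000)
print(f"transmission: {tx * 1000:.0f} ms, one-way propagation: {prop * 1000:.0f} ms")
```

Compressing the payload shrinks only the first number; moving the server closer shrinks only the second.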
The numbers worth memorising
You don't need precision — you need the right order of magnitude so your back-of-envelope estimates land within 10×. These are the classic "latency numbers," rounded to remember:
| Operation | Rough time | Intuition |
|---|---|---|
| Read from main memory (RAM) | ~100 ns | baseline "fast" |
| Read 1 MB sequentially from RAM | ~10 µs | 100× a single read |
| SSD random read | ~100 µs | ~1000× slower than RAM |
| Round trip within a datacenter | ~0.5 ms | network is not free |
| Read 1 MB from SSD | ~1 ms | |
| Disk (HDD) seek | ~10 ms | avoid in hot paths |
| Round trip across the world | ~150 ms | limited by speed of light |
The headline ratios: memory is far faster than SSD, which is far faster than disk, and a cross-continent network hop dwarfs almost any local work. That last fact is why caching and CDNs exist: serving from a nearby cache turns a 150 ms global round trip into a local one of a few milliseconds or less.
In estimation questions, do reason in powers of ten ("RAM ~100 ns, cross-region ~100 ms — so the network dominates"). Don't agonise over whether an SSD read is 90 or 120 µs. Interviewers want to see that you know what dominates, not that you memorised a benchmark.
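Working in powers of ten is easier with every figure in a single unit. A small sketch that puts the rounded numbers from the table into nanoseconds and prints how each compares with a RAM read; the dictionary name is illustrative.

```python
# Rounded latency figures from the table, all converted to nanoseconds.
LATENCY_NS = {
    "RAM read": 100,
    "1 MB sequential from RAM": 10_000,
    "SSD random read": 100_000,
    "datacenter round trip": 500_000,
    "1 MB from SSD": 1_000_000,
    "HDD seek": 10_000_000,
    "round trip across the world": 150_000_000,
}

baseline = LATENCY_NS["RAM read"]
for operation, ns in LATENCY_NS.items():
    # Ratio to a single RAM read shows the order of magnitude at a glance.
    print(f"{operation:<30} {ns:>12,} ns  {ns / baseline:>12,.0f}x RAM")

world_vs_ssd = LATENCY_NS["round trip across the world"] / LATENCY_NS["SSD random read"]
world_vs_ram = LATENCY_NS["round trip across the world"] / baseline
```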
Under the hood: measuring each delay with curl -w and reading a timing log
The four delays in the table above are not theoretical; they are individually measurable. curl's -w flag writes out timing variables after the transfer completes, one per phase. The variables that map onto the four-delay model are time_namelookup, time_connect, time_appconnect, time_starttransfer (time to first byte, TTFB), and time_total.
Each variable is cumulative from t=0, so to get the duration of each phase you subtract consecutive values:
| Phase | Calculation | Duration | What it reflects |
|---|---|---|---|
| DNS resolution | time_namelookup | 0.018 s | DNS lookup |
| TCP connect | 0.067 − 0.018 | 0.049 s | Propagation + queuing (one RTT) |
| TLS handshake | 0.119 − 0.067 | 0.052 s | TLS round trips (1-2 RTTs) |
| Server processing | 0.143 − 0.119 | 0.024 s | TTFB minus TLS done: server work, plus the request's own trip over the network |
| Transfer | 0.145 − 0.143 | 0.002 s | Transmission delay (body was only 312 bytes) |
Dominant cost here: TLS + TCP setup, about 101 ms (propagation/queuing).
Fix: keep-alive / connection reuse eliminates TCP and TLS setup on subsequent requests.
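The subtraction is mechanical, so it is easy to script. The sketch below takes the five cumulative values that curl reports (`time_namelookup`, `time_connect`, `time_appconnect`, `time_starttransfer`, `time_total`) and returns per-phase durations. The function name and the phase labels are illustrative; the sample values are the ones from the trace above.

```python
def phase_durations(t: dict[str, float]) -> dict[str, float]:
    """Convert curl's cumulative timings (seconds) into per-phase durations."""
    return {
        "dns": t["time_namelookup"],
        "tcp_connect": t["time_connect"] - t["time_namelookup"],
        "tls_handshake": t["time_appconnect"] - t["time_connect"],
        "wait_for_first_byte": t["time_starttransfer"] - t["time_appconnect"],
        "transfer": t["time_total"] - t["time_starttransfer"],
    }


# Cumulative values as curl would print them, for example with:
#   curl -s -o /dev/null \
#     -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n' <url>
sample = {
    "time_namelookup": 0.018,
    "time_connect": 0.067,
    "time_appconnect": 0.119,
    "time_starttransfer": 0.143,
    "time_total": 0.145,
}

phases = phase_durations(sample)
for name, seconds in phases.items():
    print(f"{name:<20} {seconds * 1000:6.1f} ms")
```

For plain-HTTP URLs curl typically reports `time_appconnect` as 0, so this sketch only makes sense for HTTPS requests.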
Computing p50 / p99 from a log. Once you have a stream of response times (e.g. from an access log or a simple loop), percentiles are straightforward to compute: sort the values and read off the one at the 50% or 99% position.
A p99 far above the p50 is the "tail latency" problem in action. In production, use your APM or a Prometheus histogram (e.g. histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))) instead of a shell loop. The idea is the same, although histogram_quantile estimates the value from bucket boundaries rather than from raw samples.
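A minimal nearest-rank percentile in Python, as one way to do the computation. The sample latencies are invented for illustration, and nearest-rank is only one of several percentile definitions; monitoring tools may interpolate and report slightly different values.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of
    the samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in ms: 190 fast requests and 10 slow ones.
response_times_ms = [40.0] * 190 + [900.0] * 10

p50 = percentile(response_times_ms, 50)
p99 = percentile(response_times_ms, 99)
print(f"p50={p50} ms  p99={p99} ms")
```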
How to debug & inspect it — slow-endpoint triage
When an endpoint is slow, the four delays give you a structured triage: identify which phase is large, and only then look for fixes in that area. The table below maps symptoms to the delay type and the lever to pull.
| Symptom (from curl -w or APM) | Delay type | Likely cause | Fix |
|---|---|---|---|
| High time_namelookup (> 50 ms) | DNS (propagation) | Uncached lookup; very low TTL forcing frequent re-resolution; slow or overloaded DNS resolver | Raise the TTL on stable hot records; use a local DNS cache (e.g. systemd-resolved); pre-resolve at connection-pool startup |
| High TCP connect time (time_connect − time_namelookup > 100 ms) | Propagation | Server is geographically distant from the client; no CDN/edge | Deploy to a region closer to users; use a CDN or anycast network |
| High TLS time (time_appconnect − time_connect > 100 ms) | Propagation + processing | TLS 1.2 requires 2 RTTs; slow certificate validation; OCSP stapling missing | Upgrade to TLS 1.3 (1 RTT); enable OCSP stapling; use session resumption |
| High TTFB − TLS time (> 100 ms for a simple read) | Processing | Slow DB query; missing index; N+1 query; CPU-bound computation; cold JVM/container | Instrument with an APM trace; add DB indexes; cache hot reads; warm containers ahead of traffic |
| High transfer time for large bodies | Transmission | Large response payload; no compression; bandwidth-constrained client | Enable gzip/brotli compression; paginate large collections; use streaming; reduce response fields |
| p99 ≫ p50 (long tail but good median) | Queuing | Thread-pool or DB connection-pool exhaustion under burst; GC pauses; lock contention | Increase connection pool size; add a queue/backpressure mechanism; tune GC; add more capacity before the bottleneck |
| Latency good under steady traffic; spikes on the first call after idle | Processing (cold) | Container/function cold start; JIT warm-up; lazy cache fill; idle connections closed by a keep-alive timeout | Use keep-alive pings; schedule warm-up requests; pre-warm caches; set a minimum instance count |
Slow-endpoint triage checklist:
- Run curl -w with the timing variables above (or pull from APM) and compute per-phase durations.
- Identify the dominant phase; that is the only one worth optimising. Fixing the wrong phase wastes effort.
- If processing (TTFB) is large: add tracing inside the endpoint to find the slow DB call or computation.
- If connect or TLS is large: the fix is geographic (CDN/edge), not code.
- Collect p50 and p99. A high p99/p50 ratio (> 3×) almost always indicates queuing — look for thread-pool or DB-pool exhaustion.
- Always re-measure after each change to confirm the phase shrank, not just the total.
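The "find the dominant phase" and "check the p99/p50 ratio" steps can be folded into a small helper. This is a sketch with hypothetical names; the 3× threshold is the rule of thumb from the checklist, and the p50/p99 figures in the example are illustrative.

```python
def triage(phases_ms: dict[str, float], p50_ms: float, p99_ms: float) -> dict[str, object]:
    """Name the dominant phase and flag a long tail (p99 more than 3x p50)."""
    dominant = max(phases_ms, key=phases_ms.get)
    tail_ratio = p99_ms / p50_ms
    return {
        "dominant_phase": dominant,
        "tail_ratio": tail_ratio,
        "suspect_queuing": tail_ratio > 3,
    }


# Per-phase durations from the curl example, in milliseconds.
report = triage(
    {"dns": 18, "tcp_connect": 49, "tls_handshake": 52, "server": 24, "transfer": 2},
    p50_ms=40,
    p99_ms=800,
)
print(report)
```

Running the same helper before and after a change makes it easy to confirm that the targeted phase shrank, not just the total.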
By the numbers
Concrete scenario: a checkout API handles 5,000 req/s with a mean service time W = 40 ms (0.04 s). A fan-out page loader calls N = 100 microservices per page, each with a p = 1% chance of hitting the slow tail (>500 ms).
Little's Law — how many requests are in flight right now?
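Little's Law is the single most useful identity in queuing theory. For any stable system, L = λ × W, where L is the average number of requests in the system, λ is the average arrival rate, and W is the average time a request spends in the system.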
For our checkout API: L = 5,000 × 0.04 = 200 concurrent requests in-flight at steady state. That is how many threads, goroutines, or connection-pool slots must be available just to keep up. If you size the thread pool at 150, you are already 25% short at today's load.
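Little's Law is a one-line calculation. A sketch using the checkout figures from the scenario; the function name is illustrative.

```python
def in_flight(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W, the average number of requests in flight."""
    return arrival_rate_per_s * time_in_system_s


concurrent = in_flight(5_000, 0.04)      # 5,000 req/s, 40 ms each
pool_size = 150
shortfall = 1 - pool_size / concurrent   # fraction of the needed slots missing

print(f"in flight: {concurrent:.0f}, pool of {pool_size} is {shortfall:.0%} short")
```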
Queueing blow-up as utilisation ρ approaches 1
Utilisation ρ = λ / μ, where μ is the server's maximum throughput (requests per second it can handle). In an M/M/1 model the average wait in queue is W_q = ρ / (1 − ρ) × S, where S is the mean service time.
The wait multiplier ρ / (1 − ρ) blows up near ρ = 1 — this is why "90% utilised feels fine but 99% utilised is catastrophic":
| Utilisation ρ | Wait multiplier | Wait (service_time = 40 ms) | Total latency |
|---|---|---|---|
| 0.50 | 1.0× | 40 ms | 80 ms |
| 0.80 | 4.0× | 160 ms | 200 ms |
| 0.90 | 9.0× | 360 ms | 400 ms |
| 0.95 | 19.0× | 760 ms | 800 ms |
| 0.99 | 99.0× | 3,960 ms | 4,000 ms |
At ρ = 0.9 the wait alone is 9× the service time. Queue wait is not linear in utilisation; it is a hyperbola that goes vertical as ρ approaches 1. The practical rule: target ρ ≤ 0.7 for latency-sensitive paths, so that a 20% traffic spike only pushes you to ρ ≈ 0.85 rather than over the cliff.
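The table can be regenerated from the M/M/1 wait multiplier ρ / (1 − ρ). A sketch, assuming the single-server M/M/1 model and the 40 ms service time used in the table; the function name is illustrative.

```python
def mm1_wait_ms(rho: float, service_time_ms: float) -> float:
    """Mean time spent queuing in an M/M/1 system, in milliseconds."""
    if not 0 <= rho < 1:
        raise ValueError("the queue is only stable for 0 <= rho < 1")
    return rho / (1 - rho) * service_time_ms


SERVICE_TIME_MS = 40
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    wait = mm1_wait_ms(rho, SERVICE_TIME_MS)
    total = wait + SERVICE_TIME_MS
    print(f"rho={rho:.2f}  wait={wait:7.0f} ms  total={total:7.0f} ms")
```

The guard on ρ matters: at ρ ≥ 1 the formula no longer applies because the queue never reaches a steady state.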
Fan-out tail: P(at least one slow call) = 1 − (1 − p)^N
A page fans out to N independent services, each with a p = 1% chance of being slow on any given call. The probability that at least one of those calls is slow is 1 − (1 − p)^N.
Trace for p = 0.01 (1% per service):
| N (services called) | Formula | P(≥1 slow) | Practical effect |
|---|---|---|---|
| 1 | 1 − 0.99^1 | 1.0% | Rare — one-in-a-hundred pages affected |
| 10 | 1 − 0.99^10 | 9.6% | Nearly 1-in-10 pages hits a slow call |
| 100 | 1 − 0.99^100 | 63.4% | The majority of page loads are slow |
| 200 | 1 − 0.99^200 | 86.6% | Slow is the norm, not the exception |
At N = 100 services per page, a 1% per-service tail hits 63% of all page loads. That is why a p50 of 40 ms can coexist with a p99 of 800 ms: the median is fine, but the fan-out exposes the tail on most complex requests. (Dean & Barroso, "The Tail at Scale," CACM 2013)
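The fan-out formula is simple to evaluate for any N. A sketch that assumes, as the formula does, that the services are slow independently of one another; the function name is illustrative.

```python
def p_any_slow(p_slow: float, n_calls: int) -> float:
    """Probability that at least one of n independent calls is slow."""
    return 1 - (1 - p_slow) ** n_calls


for n in (1, 10, 100, 200):
    print(f"N={n:>3}: {p_any_slow(0.01, n):.1%} of page loads hit a slow call")
```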
Worked trace — a burst at 08:15:00 UTC
Scenario: the checkout fleet processes at μ = 6,000 req/s max capacity. A flash sale fires up at 08:15:00:
| Time (UTC) | Arrival rate λ | ρ = λ/μ | Queue wait (M/M/1) | p99 observed | Action needed? |
|---|---|---|---|---|---|
| 08:14:00 | 3,000 req/s | 0.50 | 40 ms | ~85 ms | No: headroom is comfortable |
| 08:15:00 | 5,000 req/s | 0.83 | 200 ms | ~420 ms | Watch: p99 rising fast |
| 08:15:30 | 5,700 req/s | 0.95 | 760 ms | ~1,600 ms | Alert: scale now or shed load |
| 08:16:00 | 6,200 req/s | 1.03 | unbounded | timeout | Emergency: queue growing without bound |
| 08:16:30 | 5,000 req/s | 0.83 | 200 ms | ~420 ms | Recovery after auto-scaling adds capacity |
The key lesson from the trace: by the time ρ = 0.95 is visible in metrics, you are already 30 seconds from a full queue collapse. You need to scale at the 0.83 alarm, not at the timeout.
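"Unbounded" in the 08:16:00 row can be made concrete with a simple fluid approximation: while arrivals exceed capacity, the backlog grows by λ − μ requests every second. This is a rough sketch rather than a queuing-theory result, it ignores timeouts and retries, and the function names are illustrative.

```python
def backlog_after(arrival_rate: float, capacity: float, seconds: float) -> float:
    """Requests queued after `seconds` of sustained overload (fluid model)."""
    return max(0.0, (arrival_rate - capacity) * seconds)


def added_delay_s(backlog: float, capacity: float) -> float:
    """Extra wait for a request that joins behind the current backlog."""
    return backlog / capacity


# Figures from the 08:16:00 row: 6,200 req/s arriving, 6,000 req/s capacity.
queued = backlog_after(6_200, 6_000, seconds=30)
delay = added_delay_s(queued, 6_000)
print(f"after 30 s: {queued:.0f} requests queued, about {delay:.1f} s extra wait")
```

Under this model every further 30 seconds of overload adds roughly another second of delay, which is why requests soon hit their timeouts.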
Decision math — how much headroom do you need?
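To keep p99 at or below some target latency T under a 2× traffic spike, work backwards: find the highest utilisation at which latency still meets T, then halve it, because doubling the traffic doubles ρ. A spike-time ceiling of ρ = 0.8 therefore means a steady-state ceiling of ρ = 0.4.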
Translation: if your flash sales can double traffic, run at ≤ 40% utilisation in steady state. At μ = 6,000 req/s that means planning for a maximum of 6,000 × 0.40 = 2,400 req/s steady-state before you start provisioning more capacity. A related rule of thumb is to size the fleet so that peak-day traffic uses no more than half the installed capacity, a guideline attributed to Google SRE (Google SRE Book, Chapter 22).
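One way to arrive at the 40% figure, sketched under two labelled assumptions: latency follows the M/M/1 mean, total = service_time / (1 − ρ), and the target is T = 200 ms (a value chosen here because it reproduces the 40% result; the lesson does not state T). Note that this bounds mean latency, not p99.

```python
def max_utilisation(service_time_ms: float, target_ms: float) -> float:
    """Highest rho at which mean M/M/1 latency stays within the target.

    Solves service_time / (1 - rho) <= target for rho.
    """
    if target_ms <= service_time_ms:
        raise ValueError("target must exceed the bare service time")
    return 1 - service_time_ms / target_ms


def steady_state_budget(service_time_ms: float, target_ms: float,
                        spike_factor: float, capacity_rps: float) -> tuple[float, float]:
    """Return (steady-state utilisation ceiling, matching request rate)."""
    rho_steady = max_utilisation(service_time_ms, target_ms) / spike_factor
    return rho_steady, rho_steady * capacity_rps


# Assumed inputs: 40 ms service time, 200 ms target, 2x spike, 6,000 req/s fleet.
rho_steady, rps = steady_state_budget(40, 200, spike_factor=2, capacity_rps=6_000)
print(f"run at <= {rho_steady:.0%} utilisation, about {rps:,.0f} req/s")
```

A tighter target or a larger expected spike both lower the steady-state ceiling, so the same function can be reused to see how much extra capacity a stricter latency goal would cost.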
🧠 Quick check
1. You add 9 more identical servers behind a load balancer. What most likely improves?
More servers = a wider pipe = more requests per second (throughput). Each individual request still travels the same path, so its latency doesn't shrink.
2. Why prefer p99 over the average response time?
A few slow outliers skew the average and mask real pain. p99 captures the tail, which dominates user experience once requests fan out.
3. Roughly, which is slowest?
~150 ms cross-world vs ~100 µs SSD vs ~100 ns RAM: the global network hop is roughly 1,500 times slower than an SSD read and over a million times slower than a RAM read. This is why moving data closer (CDN/cache) is the biggest latency lever.
✍️ Drill: estimate a feed load
A user in London loads a feed served only from a US datacenter. The server work is ~5 ms. Roughly what end-to-end latency do they feel, and what's the cheapest fix? Decide first.
Model answer: The cross-Atlantic round trip dominates — ~80–150 ms — so the user feels ~100 ms+, not 5 ms. The server work is noise next to the network. Cheapest fix: serve from an edge/CDN closer to London (and cache the feed), turning the global hop into a local one. Adding more US servers wouldn't help this user at all — that's throughput, not latency.
Rubric: ✓ identifies the network as the dominant term ✓ ignores the tiny server time correctly ✓ proposes a latency fix (edge/cache), not a throughput one.
Key takeaways
- Latency = time for one request; throughput = requests per second. Different axes.
- More servers usually buys throughput; shortening the path (cache, CDN, fewer hops) buys latency.
- Report percentiles (p50/p99), never averages — the tail is what users feel.
- Know the orders of magnitude: RAM ≪ SSD ≪ disk, and a global network hop dominates everything.