Production at Scale · Simulator 06

Capacity & the latency hockey-stick

Why do engineers keep servers at 60–70% utilization and refuse to "just run them hotter"? The M/M/1 queueing model answers it: latency rises hyperbolically as utilization approaches 100%, producing the famous hockey-stick curve. Drag QPS up, watch the dot climb the curve — and watch the cost.

Interactive · Drag the sliders · Models rel-14, rel-13

Curve: response time vs utilization (M/M/1). Current operating point shown as a dot (●). The vertical dashed line marks ρ = target utilization. Notice how small increases in ρ near 1.0 produce massive latency increases.

What's happening — the math

Given a target utilization (targetUtil, in percent), you need enough servers that no single server runs above it. Each server is then treated as an M/M/1 queue (Poisson arrivals, exponentially distributed service times, one server), which gives the mean response time:

# Minimum servers to keep utilization ≤ target
servers      = ceil(QPS / (perServer × targetUtil / 100))

# Actual utilization with that fleet size
ρ            = QPS / (servers × perServer)

# M/M/1 mean response time (service time + queueing delay)
responseTime = serviceMs / (1 − ρ)           # diverges as ρ → 1

# Illustrative monthly cost ($0.10/hr per server, 730 hr/month)
monthlyCost  = servers × $0.10 × 730
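
The four formulas above can be sketched as one small Python function. This is a minimal illustration, not the simulator's actual source: the function name `capacity_plan`, its parameter names, and the example inputs are assumptions made for this sketch, and the cost figures are the same illustrative $0.10/hr and 730 hr/month used above.

```python
import math

def capacity_plan(qps, per_server_qps, target_util_pct, service_ms,
                  hourly_cost=0.10, hours_per_month=730):
    """Apply the four formulas above and return the results as a dict.

    All names here are hypothetical; they mirror the pseudocode variables.
    """
    # Minimum servers to keep utilization <= target
    servers = math.ceil(qps / (per_server_qps * target_util_pct / 100))

    # Actual utilization with that fleet size
    rho = qps / (servers * per_server_qps)

    # M/M/1 mean response time; the formula is only valid for rho < 1
    response_ms = math.inf if rho >= 1 else service_ms / (1 - rho)

    # Illustrative monthly cost
    monthly_cost = servers * hourly_cost * hours_per_month

    return {
        "servers": servers,
        "rho": rho,
        "response_ms": response_ms,
        "monthly_cost": monthly_cost,
    }

# Hypothetical inputs: 10k QPS, 1k QPS per server, 70% target, 10 ms service time
plan = capacity_plan(qps=10_000, per_server_qps=1_000,
                     target_util_pct=70, service_ms=10)
print(plan)
```

With these made-up inputs the sketch gives 15 servers, ρ ≈ 0.67, a mean response time of about 30 ms, and roughly $1,095 per month. Because the server count is rounded up, the actual utilization usually lands slightly below the target.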

The key insight: responseTime = serviceMs / (1 − ρ) has a singularity at ρ = 1. At 50% utilization you pay 2× the service time; at 90% you pay 10×; at 99% you pay 100×. That is why keeping utilization at or below 70% (at most about 3.3× the service time) is not waste: it is the headroom that absorbs traffic spikes without blowing latency SLOs.
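
For readers who want to see where the 1/(1 − ρ) factor comes from, the standard M/M/1 result can be written out as follows. The symbols λ (arrival rate per server) and μ (service rate per server, so that serviceMs = 1/μ) are introduced here for the derivation only; they do not appear in the simulator's formulas.

```latex
\begin{aligned}
\rho &= \frac{\lambda}{\mu}, \qquad \mathrm{serviceMs} = \frac{1}{\mu} \\[4pt]
\mathrm{responseTime} &= \frac{1}{\mu - \lambda}
  = \frac{1/\mu}{1 - \rho}
  = \frac{\mathrm{serviceMs}}{1 - \rho} \\[4pt]
\frac{\mathrm{responseTime}}{\mathrm{serviceMs}} &= \frac{1}{1 - \rho}
  \quad\Rightarrow\quad
  \rho = 0.5 \mapsto 2, \quad
  \rho = 0.9 \mapsto 10, \quad
  \rho = 0.99 \mapsto 100 \\[4pt]
\frac{d\,\mathrm{responseTime}}{d\rho} &= \frac{\mathrm{serviceMs}}{(1 - \rho)^{2}}
\end{aligned}
```

The last line is what makes the curve a hockey stick: the slope grows with the square of 1/(1 − ρ), so one extra point of utilization costs 25 times more latency at ρ = 0.9 (slope 100 × serviceMs) than at ρ = 0.5 (slope 4 × serviceMs).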

✅ Try this

1. Set target utilization to 95%: notice how the response time blows up near that operating point and how little headroom you have for spikes.
2. Drop it back to 65%: you need more servers and it costs more, but the operating point sits in the flat part of the curve, well away from the knee.
3. Raise QPS to 100M with a small per-server capacity: watch servers and cost scale up.
4. Increase per-server capacity (vertical scaling) vs increasing server count (horizontal): compare the cost change.
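
The first two experiments can also be approximated offline. The sketch below sizes a fleet for a 95% and a 65% target, then applies a 5% traffic spike without adding servers. The helper names and all input numbers (100k QPS, 1k QPS per server, 10 ms service time, a 5% spike) are hypothetical choices for illustration, not values taken from the simulator.

```python
import math

def servers_for(qps, per_server_qps, target_util_pct):
    """Minimum servers that keep utilization at or below the target."""
    return math.ceil(qps / (per_server_qps * target_util_pct / 100))

def response_ms(qps, servers, per_server_qps, service_ms):
    """M/M/1 mean response time per server; infinite once rho >= 1."""
    rho = qps / (servers * per_server_qps)
    return math.inf if rho >= 1 else service_ms / (1 - rho)

# Hypothetical inputs
QPS, PER_SERVER_QPS, SERVICE_MS = 100_000, 1_000, 10
SPIKE = 1.05  # 5% more traffic, fleet size unchanged

results = {}
for target in (95, 65):
    n = servers_for(QPS, PER_SERVER_QPS, target)
    steady = response_ms(QPS, n, PER_SERVER_QPS, SERVICE_MS)
    spiked = response_ms(QPS * SPIKE, n, PER_SERVER_QPS, SERVICE_MS)
    results[target] = (n, steady, spiked)
    print(f"target {target}%: {n} servers, "
          f"{steady:.1f} ms steady, {spiked:.1f} ms under spike")
```

With these made-up inputs, the 95% fleet is smaller (106 servers versus 154), but the 5% spike multiplies its mean latency about sixfold, from roughly 177 ms to roughly 1,060 ms. The 65% fleet moves from about 28.5 ms to about 31.4 ms under the same spike.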

⚠️ Modeled, not measured

This uses the M/M/1 queueing approximation (Poisson arrivals, exponential service times, a single server per queue). Real systems have multiple servers sharing queues, non-exponential service times, connection limits, and GC pauses that change the shape of the curve, but the qualitative behaviour (latency climbs steeply as ρ → 1) holds for essentially any system with variable arrivals or service times. The $0.10/hr cost is purely illustrative. Treat numbers as directional, not operational.
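
To give a feel for how the curve's shape can change, the sketch below swaps in a different textbook model, M/M/c, in which c servers share one queue, and computes its mean response time with the Erlang C formula. This is not what the simulator uses; the function names are hypothetical, and the model still assumes Poisson arrivals and exponential service times.

```python
import math

def erlang_c(c, rho):
    """Probability that an arrival must wait in an M/M/c queue (Erlang C).

    c is the number of servers, rho the per-server utilization (0 <= rho < 1).
    Uses plain factorials, so it is only suitable for modest c.
    """
    a = c * rho  # offered load in Erlangs
    head = sum(a**k / math.factorial(k) for k in range(c))
    tail = a**c / (math.factorial(c) * (1 - rho))
    return tail / (head + tail)

def mmc_response_ms(c, rho, service_ms):
    """Mean response time for M/M/c: service time plus mean queueing delay."""
    if rho >= 1:
        return math.inf
    wait_ms = erlang_c(c, rho) * service_ms / (c * (1 - rho))
    return service_ms + wait_ms

# Same per-server utilization, different numbers of servers behind one queue
SERVICE_MS = 10  # hypothetical service time
for c in (1, 2, 8):
    print(c, round(mmc_response_ms(c, 0.9, SERVICE_MS), 1))
```

For c = 1 this reduces to serviceMs / (1 − ρ), the M/M/1 formula used above. With more servers behind a shared queue, the same 90% utilization gives a lower mean response time, so the knee moves to the right, but the response time still diverges as ρ → 1.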

Sources & further reading

Scaling strategies · Capacity estimation (the lessons this models)
M/M/1 queue — Wikipedia
Google SRE — Addressing cascading failures (utilization & overload)
AWS — Scale your web application