Performance · Lesson 02
Latency budgets
A latency budget is a promise you make to yourself before you build: "this entire response will take no more than X ms at p99, and here is exactly how that X is divided among every component that touches the request." It is the single most effective tool for preventing the death-by-a-thousand-cuts that turns a fast system into a slow one over time.
By the end you'll be able to
- Define a latency budget and explain why budgets outperform ad-hoc performance review.
- Allocate a p99 target across a realistic multi-hop call graph, accounting for serial vs parallel work and tail amplification from fan-out.
- Identify an over-budget component in a concrete worked example and name the architectural levers to fix it.
What is a latency budget?
Think of household finances. You have $3,000 a month and you divide it: $1,200 rent, $400 food, $200 transport, $300 savings, remainder for discretionary. You can't spend $2,000 on rent and still make the numbers work — the total is fixed, so every allocation is a trade-off against the others.
A latency budget works identically. You pick a target — say, p99 < 300 ms — and divide it among every component in the request path: client network, API gateway, your service's logic, a cache hit, a cache miss (database), and any downstream calls. The components must sum to ≤ 300 ms. If someone adds a new external call without evicting something else, the budget is broken.
The target itself usually comes from user-experience research and product SLOs: under 100 ms feels instant, 100–300 ms feels responsive, 300 ms–1 s feels sluggish, over 1 s users disengage. A p99 < 300 ms is a defensible goal for most interactive APIs.
Why budget at all?
Three concrete reasons budgets beat ad-hoc performance review:
- Forces trade-offs up front. "We want to add a fraud-detection call on every checkout" is easy to approve in a feature review. It's harder to approve when the budget already shows that the payment service has 20 ms left and the fraud call costs 40 ms. The budget makes the cost visible before the commit.
- Catches regressions automatically. If you instrument each component and alert when any component exceeds its allocation, a slow dependency surfaces in staging — not three months after it ships as a customer complaint about checkout times.
- Aligns teams without coordination overhead. The auth team, the payment team, and the inventory team can each own their allocation independently. No weekly meeting needed: as long as each team stays within their slice, the overall contract holds.
The anatomy of a budget: client → gateway → service → storage → downstream
A typical web API request passes through five layers. Here is a worked allocation for a p99 target of 300 ms:
| Layer | Component | Allocated (ms) | Notes |
|---|---|---|---|
| Client ↔ Edge | Network (client to nearest PoP) | 25 | Assume regional users on good connections |
| Edge ↔ Gateway | CDN / TLS offload | 5 | TLS handshake amortised by connection reuse |
| Gateway | API gateway (auth, rate-limit, routing) | 10 | Token validation from in-memory cache |
| Service | Business logic (no I/O) | 10 | Pure CPU — serialise, validate, transform |
| Cache | Redis (same DC) — hit path | 2 | ~80% of requests; negligible |
| Storage | Postgres (same DC) — miss path | 30 | ~20% of requests; indexed query |
| Downstream | Downstream pricing service | 50 | Separate microservice, same region |
| Response path | Network (edge back to client) | 25 | Symmetric with inbound |
| Headroom | Tail variance, GC, queue jitter | 143 | Difference to reach 300 ms p99 target |
| Total | | 300 | |
The largest single allocation — 143 ms of headroom — is not slack; it is the acknowledged gap between your median estimates and the p99 tail. The tail factor captures GC pauses, OS scheduling jitter, queue back-pressure spikes, and the variance in your slowest database queries. If your measurements consistently use less of the headroom, you can tighten the target. If they frequently exceed it, you have a budget over-run that demands architectural attention.
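The arithmetic in the table can be checked mechanically. The following is a minimal sketch, assuming the allocations are kept in a plain dictionary; the names `ALLOCATIONS_MS`, `TARGET_P99_MS`, and `check_budget` are illustrative and not part of any library.

```python
# Allocations copied from the table above, in milliseconds (headroom excluded).
TARGET_P99_MS = 300

ALLOCATIONS_MS = {
    "network_in": 25,
    "cdn_tls": 5,
    "gateway": 10,
    "business_logic": 10,
    "redis_hit": 2,
    "postgres_miss": 30,
    "pricing_downstream": 50,
    "network_out": 25,
}


def check_budget(allocations: dict[str, int], target_ms: int) -> int:
    """Return the headroom left after all allocations; raise if over target."""
    spent = sum(allocations.values())
    if spent > target_ms:
        raise ValueError(f"budget overdrawn: {spent} ms > {target_ms} ms")
    return target_ms - spent


headroom_ms = check_budget(ALLOCATIONS_MS, TARGET_P99_MS)
print(f"allocated={TARGET_P99_MS - headroom_ms} ms, headroom={headroom_ms} ms")
```

Running a check like this in review makes a new allocation fail loudly when it pushes the total past the target, rather than silently eating headroom.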
Diagram 1: stacked budget across components
Serial vs parallel: why it changes everything
The most consequential decision in latency budgeting is which calls are serial and which are parallel. Get it wrong and your budget arithmetic is off by 2×.
Serial: call A must finish before call B can start (B needs A's result, or you've written them that way). Total time = sum of all calls.
Parallel: calls A and B are dispatched simultaneously, neither waits on the other. Total time = max(A, B) — you only pay for the slowest one.
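The sum-versus-max rule composes over nested call graphs. Below is a small sketch of that idea, assuming a call graph can be modelled as nested `Serial` and `Parallel` groups of leaf `Call` nodes; these class names and the 30 ms / 50 ms figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Call:
    name: str
    ms: float


@dataclass
class Serial:
    steps: list  # each step is a Call, Serial, or Parallel


@dataclass
class Parallel:
    branches: list  # each branch is a Call, Serial, or Parallel


Node = Union[Call, Serial, Parallel]


def wall_clock_ms(node: Node) -> float:
    """Serial groups add their children; parallel groups take the slowest child."""
    if isinstance(node, Call):
        return node.ms
    if isinstance(node, Serial):
        return sum(wall_clock_ms(step) for step in node.steps)
    return max(wall_clock_ms(branch) for branch in node.branches)


a, b = Call("A", 30), Call("B", 50)
serial_ms = wall_clock_ms(Serial([a, b]))      # 30 + 50
parallel_ms = wall_clock_ms(Parallel([a, b]))  # max(30, 50)
print(serial_ms, parallel_ms)
```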
Fan-out and tail amplification
Parallel calls save time on average, but they hide a trap: tail amplification. If a single call has a 1% chance of being slow, and you make 10 parallel calls, the probability that at least one is slow is roughly 10%. The latency that was an individual call's p99 now shows up at the composite's p90.
Formally: if each call is independently slow with probability q, N parallel calls are all fast with probability (1-q)^N. The complement is the probability of at least one slow call. For q=0.01 (1%) and N=10, that's 1-(0.99)^10 ≈ 9.6%. You've paid the tail rate 10× more often. See Lesson 04 for the percentile foundations.
This is why fan-out architectures — BFF patterns, GraphQL resolvers calling N services, search suggestions that fan out to N ranking models — can have terrible p99 even when each individual call is fast.
You have 5 downstream services each with p99 = 50 ms. You think your composite p99 is 50 ms because they run in parallel. In fact it's closer to the p99.8 of a single call, because all five must be fast at once and (0.998)^5 ≈ 0.99. Put the other way round: the probability that none of five services is in its tail is (0.99)^5 ≈ 95%, meaning 5% of composite requests will hit at least one tail. Your p95 equals their individual p99. Budget for the composite tail, not the individual one.
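The two directions of this calculation can be sketched in a few lines. This assumes the calls are independent, as the text does; the function names are illustrative.

```python
def p_any_slow(q: float, n: int) -> float:
    """Probability that at least one of n independent calls is slow."""
    return 1 - (1 - q) ** n


def required_individual_quantile(composite_quantile: float, n: int) -> float:
    """Per-call quantile needed so that all n calls are fast with the
    given composite probability (the inverse of the formula above)."""
    return composite_quantile ** (1 / n)


for n in (1, 5, 10):
    print(f"N={n:>2}: P(at least one slow) = {p_any_slow(0.01, n):.1%}")

needed = required_individual_quantile(0.99, 5)
print(f"composite p99 over 5 calls needs each call at about p{needed * 100:.1f}")
```

The second function is the one to use when allocating: it tells you which percentile of each dependency has to fit inside the fan-out slice if the composite is to meet its p99.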
Where to spend or cut the budget
Once you have an allocation, you can reason systematically about where work buys the most:
- Cache at the right layer. Serving from an in-process cache costs ~100 ns; Redis costs ~1 ms; Postgres costs ~10–50 ms. Moving a hot lookup one layer closer cuts that allocation dramatically.
- Colocate. A same-data-center hop is ~0.5 ms. A cross-region hop is ~30–100 ms. Colocating a service with its primary caller removes the network hop from the budget entirely.
- Batch. Ten sequential SQL queries at 5 ms each = 50 ms. One query returning the same rows = 5 ms. Batching collapses serial I/O into parallel or single I/O.
- Precompute. If a complex aggregation would cost 200 ms at query time but is needed on every page load, compute it on write and store the result. You pay once on write; every read is a cache hit.
- Drop a hop. Does every request need to call the fraud service synchronously, or can fraud scoring be async (accepting the request, checking after, acting on failures)? Removing a synchronous call removes its allocation entirely.
- Leave headroom. Do not budget to 100%. A system that runs at its ceiling has no room for traffic spikes, slow rollouts, noisy neighbours, or the next feature that adds a call. The 143 ms headroom in the table above is not waste — it is the margin that keeps the budget honest under real-world variance.
Do wire the budget into your observability: instrument each component, track its p99 latency, and alert when any component exceeds its allocation by more than 20%. A downstream service that silently degrades from 50 ms to 120 ms will now trigger an alert before it blows the customer SLO. Don't review performance only after a customer complains — by then the budget has been overdrawn for weeks.
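As a rough illustration of that alert rule, the sketch below compares measured p99 values against allocations with a 20% tolerance. The function name and the gateway and Postgres measurements are hypothetical; only the 50 ms to 120 ms downstream degradation comes from the text. A real setup would express this as an alerting rule in the monitoring system rather than application code.

```python
def over_allocation(allocated_ms: dict[str, float],
                    measured_p99_ms: dict[str, float],
                    tolerance: float = 0.20) -> dict[str, float]:
    """Return components whose measured p99 exceeds allocation * (1 + tolerance),
    mapped to how many ms they are over the raw allocation."""
    breaches = {}
    for name, budget in allocated_ms.items():
        measured = measured_p99_ms.get(name)
        if measured is not None and measured > budget * (1 + tolerance):
            breaches[name] = measured - budget
    return breaches


allocated = {"gateway": 10, "pricing": 50, "postgres": 30}
measured = {"gateway": 11, "pricing": 120, "postgres": 33}  # illustrative readings
breaches = over_allocation(allocated, measured)
print(breaches)  # {'pricing': 70}
```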
Worked example: allocate a 300 ms p99 budget and find the over-budget component
Scenario: an e-commerce product-detail page calls three services and a database. The product team has agreed on p99 < 300 ms. Here is the call graph and initial measurements:
# Proposed call graph (all serial — each call needs the previous result)
Step 1: Network (client US → server US-East) 30 ms budget
Step 2: API gateway (auth, routing) 10 ms budget
Step 3: Product service — fetch product details 15 ms budget
Step 4: Inventory service — check stock levels 25 ms budget ← serial on product
Step 5: Pricing service — fetch dynamic price 40 ms budget ← serial on product
Step 6: Review service — fetch top 3 reviews 20 ms budget ← serial on product
Step 7: Serialise & transmit response 15 ms budget
Step 8: Headroom (tail variance) 145 ms budget
─────────────────────────────────────────────────
Total 300 ms
# Now we measure in staging:
Step 1: 28 ms ✓
Step 2: 11 ms ⚠️ 1 ms over budget
Step 3: 14 ms ✓
Step 4: 31 ms ⚠️ 6 ms over budget
Step 5: 92 ms ✗ 52 ms over budget ← PROBLEM
Step 6: 18 ms ✓
Step 7: 16 ms ⚠️ 1 ms over budget
Total measured: 210 ms raw → p99 estimate (assuming a 1.7× tail multiplier): 210 × 1.7 = 357 ms, OVER target
The pricing service (Step 5) is the culprit: 92 ms measured against a 40 ms budget, 52 ms over. Even accounting for the fact that Steps 4–6 could theoretically be parallelised (they only need the product ID from Step 3), the pricing service would still blow the parallel composite budget.
# Fix attempt 1: parallelise steps 4, 5, 6 after step 3
Steps 4, 5, 6 dispatched concurrently after step 3 returns.
Critical path: 28 + 11 + 14 + max(31, 92, 18) + 16 = 161 ms raw
p99 estimate: 161 × 1.7 ≈ 274 ms ← under target, but with less than 10% margin
But: fan-out tail risk — 3 parallel calls → composite p99 worse than individual p99.
With p99=92 ms for pricing, this is still risky.
# Fix attempt 2: cache pricing service results (price changes at most every hour)
Add Redis cache in front of pricing service.
Cache hit (80% of calls): ~1 ms ← budget easily met
Cache miss (20% of calls): ~92 ms ← still slow but rare
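Effective weighted latency: 0.8×1 + 0.2×92 = 19.2 ms average
p99 is set by the cache-miss rate, not the average: with 20% misses the p99 request is still a ~92 ms miss.
The cache moves the p99 only once the hit rate exceeds 99%, so aim the TTL and cache sizing at that.
This is the right fix: pricing data is cacheable, cache TTL matches business rules.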
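To see how hit rate affects the mean and the tail differently, here is a small sketch. It assumes a simplified two-point latency model (every hit costs exactly 1 ms, every miss exactly 92 ms); the function names are illustrative, and real distributions are smoother than this.

```python
def mean_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Weighted average latency of the cached call."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms


def percentile_ms(pct: float, hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Percentile of the two-point distribution, assuming hit_ms < miss_ms:
    the fastest hit_rate share of requests are hits, the rest are misses."""
    return hit_ms if pct <= hit_rate else miss_ms


for hit_rate in (0.80, 0.95, 0.995):
    avg = mean_ms(hit_rate, 1, 92)
    p99 = percentile_ms(0.99, hit_rate, 1, 92)
    print(f"hit rate {hit_rate:.1%}: mean={avg:.1f} ms, p99={p99} ms")
```

Under this model the mean drops as soon as the cache is added, while the p99 of the step stays at the miss latency until misses are rarer than 1 in 100.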
Stating a latency budget in a system design interview is a strong signal that separates senior candidates. Here is the exact move: after sketching the call graph, say "I'd budget this endpoint at p99 < 300 ms and allocate it like this: 30 ms network, 10 ms gateway, [list components], 145 ms headroom." Then add "if we parallelise these three calls, the composite tail risk is about 1-(0.99)³ ≈ 3%, so the budget for each must be tighter." Most candidates guess; you've just done architecture under constraints. That is the senior-engineer frame for performance.
Under the hood: how it actually works
A latency budget is a constraint satisfaction problem: allocate a total of B ms across N components such that the time along the critical path stays ≤ B. What makes it non-trivial is the interaction between serial vs parallel topology, the tail amplification of fan-out, and the difference between "median budget" and "p99 budget".
Concrete budget allocation — 300 ms p99, step by step
Scenario: a checkout API. Client is in the US (same region as server). The call graph has one serial backbone with three branches that can be parallelised after the product lookup.
────────────────────────────────────────────────────────
Call graph with topology labels
────────────────────────────────────────────────────────
SERIAL chain (must be sequential):
1. Network in (client → server, same region) 30 ms
2. API gateway: JWT validate + rate-limit check 10 ms (token in gateway cache)
3. Product service: fetch product row 15 ms (Postgres, indexed, warm)
PARALLEL fan-out (dispatched together after step 3):
4a. Inventory service: check stock 25 ms
4b. Pricing service: dynamic price lookup 40 ms ← slowest sibling
4c. Promotions service: eligible discounts 20 ms
Fan-out total = max(25, 40, 20) = 40 ms
SERIAL chain resumes:
5. App: merge results + build response 5 ms
6. Network out (server → client) 30 ms
────────────────────────────────────────────────────────
Sum to median
────────────────────────────────────────────────────────
Median estimate = 30+10+15 + 40 + 5+30 = 130 ms
Headroom to 300 ms target = 300 − 130 = 170 ms
────────────────────────────────────────────────────────
p99 estimate
────────────────────────────────────────────────────────
Variance sources:
• Postgres (step 3) p99 ≈ 3× median = 45 ms (+30 ms)
• Pricing service (step 4b) p99 ≈ 2.5× median = 100 ms (+60 ms)
• Network jitter at p99: +20 ms
p99 estimate = 130 + 30 + 60 + 20 = 240 ms ← under 300 ms ✓
Budget consumed: 240 / 300 = 80% of the target (110 of the 170 ms headroom, about 65%).
Recommendation: healthy — leave it; tighten the budget if you add a new service call.
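The same estimate can be reproduced in code so that it is easy to re-run when a number changes. This is a sketch of the arithmetic above, not a statistical model: adding per-hop p99 uplifts is a deliberately conservative shortcut, and the variable names are illustrative.

```python
TARGET_MS = 300

serial_before = [30, 10, 15]  # network in, gateway, product service
fan_out = {"inventory": 25, "pricing": 40, "promotions": 20}
serial_after = [5, 30]        # merge results, network out

# Serial hops add; the fan-out group costs as much as its slowest sibling.
median_ms = sum(serial_before) + max(fan_out.values()) + sum(serial_after)

# p99 uplift over the median for the high-variance hops listed above.
tail_uplift_ms = {
    "postgres": 15 * 3 - 15,    # p99 assumed to be 3x the 15 ms median
    "pricing": 40 * 2.5 - 40,   # p99 assumed to be 2.5x the 40 ms median
    "network_jitter": 20,
}
p99_ms = median_ms + sum(tail_uplift_ms.values())
share_of_target = p99_ms / TARGET_MS

print(f"median={median_ms} ms, p99 estimate={p99_ms:.0f} ms, "
      f"share of target={share_of_target:.0%}")
```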
Fan-out tail amplification — the math
Parallel calls save median latency but increase the probability of hitting a tail on any given request. The math is straightforward:
# Probability that at least one of N independent calls is slow
# q = probability a single call is slow (e.g. 0.01 for 1%)
P(at least one slow) = 1 − (1 − q)^N
N=1, q=0.01: 1 − (0.99)^1 = 1.0% ← baseline
N=3, q=0.01: 1 − (0.99)^3 = 2.97% ← 3 parallel calls in checkout
N=5, q=0.01: 1 − (0.99)^5 ≈ 4.9%
N=10, q=0.01: 1 − (0.99)^10 ≈ 9.6%
N=100,q=0.01: 1 − (0.99)^100 ≈ 63% ← 100 fan-out calls, ~63% of requests hit a tail
# Practical rule: if each call has 1% slow rate and you fan out to 100 calls,
# 63% of composite requests will wait for at least one tail. The composite's
# median is already at or beyond the p99 of the individual call.
# Threshold heuristic: N × q < 0.05 keeps composite tail probability under ~5%.
# For q=0.01: safe up to N≈5. For q=0.001: safe up to N≈50.
This is why you must budget for the composite p99 of a fan-out group, not the individual call p99. If each of 5 downstream calls has a 50 ms p99, it is tempting to say the parallel group's p99 is also 50 ms (the slowest sibling). But the probability that a request waits 50 ms or longer is 1-(0.99)^5 ≈ 5%, not 1%. The group's p95 is now equal to their individual p99, and the group's p99 is higher still.
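The closed-form result can be sanity-checked with a quick simulation. The sketch below assumes independent calls, each slow with probability 1%, and uses a fixed seed so the run is repeatable; `simulate_fan_out` is an illustrative name.

```python
import random


def simulate_fan_out(n_calls: int, q_slow: float, trials: int, seed: int = 0) -> float:
    """Fraction of composite requests in which at least one call is slow."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A composite request is slow if any one of its parallel calls is slow.
        if any(rng.random() < q_slow for _ in range(n_calls)):
            hits += 1
    return hits / trials


observed = simulate_fan_out(n_calls=5, q_slow=0.01, trials=200_000)
expected = 1 - 0.99 ** 5
print(f"simulated={observed:.2%}, closed-form={expected:.2%}")
```

Real dependencies are rarely fully independent (shared hosts, shared databases), so treat both numbers as a starting estimate and confirm against measured composite percentiles.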
A common mistake: calling three services that could be parallel but writing sequential await calls in the code, then budgeting as if they are parallel. The budget says 40 ms (max of three calls); the code pays 85 ms (25+40+20 serially). The mismatch shows up as a budget overrun that's hard to trace until someone reads the code. Always verify parallelism in the implementation, not just in the design.
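The difference is easy to demonstrate. The sketch below uses Python's `asyncio` with `asyncio.sleep` standing in for the three downstream calls at their budgeted latencies; `call_service` is a hypothetical placeholder, not a real client.

```python
import asyncio
import time


async def call_service(name: str, latency_s: float) -> str:
    """Stand-in for a downstream call; sleeps instead of doing network I/O."""
    await asyncio.sleep(latency_s)
    return name


async def sequential() -> float:
    """Three awaits in a row: each call starts only after the previous one ends."""
    start = time.perf_counter()
    await call_service("inventory", 0.025)
    await call_service("pricing", 0.040)
    await call_service("promotions", 0.020)
    return time.perf_counter() - start


async def concurrent() -> float:
    """asyncio.gather dispatches all three calls before waiting on any of them."""
    start = time.perf_counter()
    await asyncio.gather(
        call_service("inventory", 0.025),
        call_service("pricing", 0.040),
        call_service("promotions", 0.020),
    )
    return time.perf_counter() - start


sequential_s = asyncio.run(sequential())
concurrent_s = asyncio.run(concurrent())
print(f"sequential={sequential_s * 1000:.0f} ms, concurrent={concurrent_s * 1000:.0f} ms")
```

The sequential version pays roughly the sum of the three latencies and the concurrent version roughly the largest one, plus scheduling overhead in both cases.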
How to debug & inspect it
A budget overrun has two possible root causes: a component drifted past its allocation, or a new component was added without removing budget from elsewhere. Both require the same first step: find which hop in the trace exceeds its allocation. Distributed tracing tools (Jaeger, Tempo, AWS X-Ray, Datadog APM) show the waterfall you need; so does curl with OpenTelemetry-instrumented endpoints and the Server-Timing response header.
Reading a trace waterfall to find the over-budget hop
In a trace waterfall tool, look for:
- The longest bar in the parallel fan-out group — this is the span that sets the wall-clock cost of the whole group.
- Any span that starts only after another has finished when they should be parallel — this indicates sequential dispatch in code (see callout above).
- Gaps between spans with no active span — these indicate thread pool queuing, connection pool wait, or request serialisation overhead.
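The three checks above can also be run against exported span data. This is a sketch under the assumption that each span is reduced to a name plus start and end offsets in milliseconds; the `Span` class and the sample values are hypothetical and do not mirror any tracing tool's API.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def longest_span(spans: list[Span]) -> Span:
    """The span that sets the wall-clock cost of a parallel group."""
    return max(spans, key=lambda s: s.duration_ms)


def looks_sequential(spans: list[Span]) -> bool:
    """True if no two spans overlap, i.e. each starts only after the previous ended."""
    ordered = sorted(spans, key=lambda s: s.start_ms)
    return len(ordered) > 1 and all(
        later.start_ms >= earlier.end_ms
        for earlier, later in zip(ordered, ordered[1:])
    )


def idle_gaps(spans: list[Span]) -> list[tuple[float, float]]:
    """Intervals inside the group during which no span is active."""
    if not spans:
        return []
    ordered = sorted(spans, key=lambda s: s.start_ms)
    gaps = []
    covered_until = ordered[0].end_ms
    for span in ordered[1:]:
        if span.start_ms > covered_until:
            gaps.append((covered_until, span.start_ms))
        covered_until = max(covered_until, span.end_ms)
    return gaps


# Sample group that was meant to be parallel but was dispatched one by one.
group = [
    Span("inventory", 0, 25),
    Span("pricing", 25, 65),
    Span("promotions", 72, 92),
]
print(longest_span(group).name, looks_sequential(group), idle_gaps(group))
```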
Budget-blown triage table
| Symptom in trace / monitoring | Cause | Action |
|---|---|---|
| One span in a parallel group is 3–5× its budget; others are fine | That downstream service regressed (new query, added a serial call, DB migration in progress) | Alert the owning team; short-term: add a cache in front of that service if data allows; long-term: root-cause the regression |
| All spans in a parallel group are within budget, but wall-clock time of the group is still over | Calls are actually serial in code: await A; await B; instead of Promise.all | Inspect the call site; replace sequential awaits with concurrent dispatch |
| Endpoint is over budget only at high load (p50 fine, p99 blows up) | Connection pool or thread pool exhaustion — requests queue before dispatch | Add pool-wait metric; increase pool size or reduce concurrency of callers; add a circuit breaker |
| Budget was fine; new feature team added a synchronous call; p99 jumped | Budget not enforced as a gate — call was approved without budget analysis | Wire the budget into CI/CD: fail deployments that add a new synchronous call without a budget amendment document; or make the new call async |
| A service that was under budget drifts over slowly over weeks | Data growth: queries that were fast on small tables slow down as row count grows | Add index, partition, or archive old rows; set a monitoring alert at 80% of budget allocation, not 100% |
| Budget is met at p99 but a few users see 2–3 s responses | p99.9 or p99.99 events — GC stop-the-world, DB lock timeout, or cold-start after deploy | Look at p99.9 histogram; if GC: tune heap or switch to low-pause collector; if cold-start: use canary deploys + connection pre-warming |
Debug checklist:
- Confirm the endpoint is actually over budget by querying your monitoring for p99 latency by component (not just total).
- Open the slowest trace in your trace tool; look at the waterfall. Identify which span is the longest and how far it exceeds its budget.
- Check whether over-budget spans are in a serial chain or a parallel group — the fix differs.
- Look at the time distribution (histogram) of the over-budget span — is it bimodal (cache hit/miss) or has a long tail (GC/lock)? Each implies a different fix.
- Check Server-Timing headers in the failing request to confirm your trace tool's reading.
- Verify in code that intended parallel dispatches are actually concurrent: search for sequential awaits on independent calls.
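For the Server-Timing step, a header value can be compared against the budget with a few lines of parsing. The sketch below assumes the service emits one metric per hop with a `dur` parameter in milliseconds; the metric names and the sample header are hypothetical, and the parser deliberately ignores the edge case of commas or semicolons inside quoted `desc` values.

```python
def parse_server_timing(header_value: str) -> dict[str, float]:
    """Parse 'name;dur=12.3, other;desc="x";dur=4' into {name: duration_ms}.

    Simplification: assumes no commas or semicolons inside quoted desc values.
    Metrics without a dur parameter are skipped.
    """
    durations = {}
    for entry in header_value.split(","):
        parts = [part.strip() for part in entry.split(";")]
        name = parts[0]
        if not name:
            continue
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "dur":
                durations[name] = float(value.strip().strip('"'))
    return durations


# Hypothetical header value from a failing request.
header = 'gateway;dur=11, product;dur=14, pricing;desc="dynamic price";dur=92'
timings = parse_server_timing(header)

budget_ms = {"gateway": 10, "product": 15, "pricing": 40}
over = {
    name: ms - budget_ms[name]
    for name, ms in timings.items()
    if name in budget_ms and ms > budget_ms[name]
}
print(over)  # {'gateway': 1.0, 'pricing': 52.0}
```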
🧠 Quick check
1. You have a p99 budget of 200 ms for an endpoint. The network costs 50 ms each way, the app costs 10 ms, the DB costs 40 ms. How much headroom is left?
Serial sum: 50 (in) + 10 (app) + 40 (DB) + 50 (out) = 150 ms. Budget is 200 ms. Headroom = 200 - 150 = 50 ms.
2. You make 5 parallel downstream calls, each with an independent 1% probability of being slow. What is the approximate probability that at least one call is slow?
P(at least one slow) = 1 - P(all fast) = 1 - (0.99)^5 ≈ 4.9%. Each independent call contributes its own tail risk; fan-out multiplies, not dilutes, tail exposure.
3. A new feature team wants to add a synchronous call to a recommendation service on every product page load. The rec service p99 is 60 ms. The current budget already uses 260 ms of a 300 ms target. What should you do?
The budget is a contract. Adding 60 ms to a 260 ms serial path breaks the 300 ms target. The fix is either async processing (fire-and-forget after response) or cutting 60 ms from another component to make room.
4. Which lever gives the largest latency saving for a call that costs 80 ms due to an uncached database query on a hot, frequently-requested resource?
An in-memory cache turns an 80 ms DB hit into a ~1 ms cache hit for the majority of requests. That's an 80× improvement on the hot path. An index cuts to ~10 ms (8×) and better SSD to ~40 ms (2×). Caching is the highest-leverage lever when the resource is frequently re-read.
✍️ Exercise 1: split a 250 ms budget across a call graph
Design a latency budget for a social-feed endpoint: p99 target = 250 ms. The call graph (all serial): client (mobile, average 4G) → CDN edge → app server → Redis (same DC) → if miss, Postgres (same DC) → downstream user-profile service (same region, different DC) → response. Assume 15% cache miss rate.
Model answer:
# Component allocations (your numbers may differ; reasoning matters)
Mobile 4G to CDN edge: ~20 ms
CDN edge to app server: ~5 ms
App logic: ~5 ms
Redis (hit, 85% path): ~1 ms
Postgres (miss, 15% path): ~25 ms (weighted: 0.15 × 25 = 3.75 ms avg)
User-profile service: ~35 ms (different DC, ~15 ms network + ~20 ms logic)
Response path (app → client): ~25 ms
Headroom: ~134 ms
─────────────────────────────────────────
Total: 250 ms
# Key decisions:
- Redis hit path keeps the common case fast (85% of requests ≈ 91 ms serial to response: 20 + 5 + 5 + 1 + 35 + 25)
- Postgres miss is 25 ms — acceptable because rare
- User-profile service is the biggest single slice: candidate to cache or colocate
- Headroom = ~54% of budget — healthy; tighten if measurements show consistent use below 50%
Rubric: ✓ components sum to ≤ 250 ms ✓ cache hit vs miss path distinguished ✓ identifies user-profile service as largest controllable slice ✓ leaves meaningful headroom ✓ notes which component to optimise first. Five of five = full marks.
✍️ Exercise 2: a request is over budget — where do you look and what do you cut?
Your monitoring shows the checkout API p99 has drifted from 210 ms to 410 ms over the past two weeks. The budget is 300 ms. Trace data shows: network (in+out) = 60 ms (unchanged), gateway = 12 ms (unchanged), app logic = 8 ms (unchanged), payment service = 280 ms (was 90 ms), inventory check = 15 ms (unchanged). Where is the regression and what are your options?
Model answer:
# Diagnosis:
Payment service has regressed from 90 ms → 280 ms (+190 ms).
All other components are within original budgets.
The total overrun (410 - 300 = 110 ms) is explained entirely by this regression.
# Options (in order of architectural preference):
1. Root-cause the payment service regression.
Check: did a dependency (their DB, their upstream, their deployment) change?
Check: is it a hot-spot query missing an index? A new synchronous call they added?
Fix the root cause — restore it to ~90 ms.
2. If the 280 ms is unavoidable (external vendor slowdown):
Make the payment call async: accept the order optimistically, confirm async.
Cost: complexity of saga/compensation pattern.
Saves: 280 ms removed from p99 hot path.
3. Cache payment method validation results (e.g., card validity = 15-min TTL).
Reduces payment service calls for repeat customers.
Typical e-commerce: ~70% of checkouts are returning users → 70% cache hit rate.
Effective payment latency: 0.7 × 2ms + 0.3 × 280ms = ~85 ms on average (the 30% of misses still pay 280 ms, so this lowers the mean, not the p99).
# Never do: raise the SLO target to 500 ms to "fix" the alert.
# The budget is a regression gate — changing it hides the problem, not the cost.
Rubric: ✓ correctly identifies payment service as the only changed component ✓ quantifies the regression precisely ✓ proposes root-cause investigation before architectural changes ✓ offers async as the high-leverage architectural option ✓ explicitly rejects raising the SLO. Five of five = full marks.
Key takeaways
- A latency budget is a p99 target divided among every component in the request path. It forces trade-offs, catches regressions, and aligns teams — all without requiring coordination on every feature.
- Serial work adds; parallel work takes the max. Parallelising calls saves time but amplifies tail: N parallel calls make at least one tail hit ~N× more likely.
- Fan-out tail amplification means the individual call's p99 shows up at roughly the p(100−N)th percentile of the composite of N parallel calls (for small N). Budget for the composite, not the individual.
- The highest-leverage budget cuts come from caching, colocation, batching, precomputation, and dropping synchronous hops entirely.
- Leave headroom. A budget fully spent is a budget already broken under the first traffic spike or noisy-neighbour event.
- Wire budgets into alerts. A component that quietly drifts past its allocation is a future incident. The budget must be observable and enforced, not just documented.
- Stating and defending a latency budget in a design interview is senior-engineer signal: it shows you think in systems, constraints, and trade-offs — not just features.
Sources & further reading
- Google SRE Book — Service Level Objectives (SLO framing and latency SLOs)
- Google SRE Book — Monitoring Distributed Systems (percentiles and alerting on SLOs)
- Lesson 04 — Latency vs throughput (percentile foundations and latency reference numbers)
- Lesson perf-01 — Estimating response time (STAMP method, the estimation foundation this lesson builds on)
- Tail at Scale — Dean & Barroso, Communications of the ACM (the canonical reference on tail amplification)
- web.dev — HTTP/2 and request multiplexing (parallelising browser requests)