API Design

Performance · Lesson 02

Latency budgets

A latency budget is a promise you make to yourself before you build: "this entire response will take no more than X ms at p99, and here is exactly how that X is divided among every component that touches the request." It is the single most effective tool for preventing the death-by-a-thousand-cuts that turns a fast system into a slow one over time.

⏱ 16 min Difficulty: advanced Prereq: perf-01 (Estimating response time)

By the end you'll be able to

What is a latency budget?

Think of household finances. You have $3,000 a month and you divide it: $1,200 rent, $400 food, $200 transport, $300 savings, remainder for discretionary. You can't spend $2,000 on rent and still make the numbers work — the total is fixed, so every allocation is a trade-off against the others.

A latency budget works identically. You pick a target — say, p99 < 300 ms — and divide it among every component in the request path: client network, API gateway, your service's logic, a cache hit, a cache miss (database), and any downstream calls. The components must sum to ≤ 300 ms. If someone adds a new external call without evicting something else, the budget is broken.

The target itself usually comes from user-experience research and product SLOs: under 100 ms feels instant, 100–300 ms feels responsive, 300 ms–1 s feels sluggish, over 1 s users disengage. A p99 < 300 ms is a defensible goal for most interactive APIs.

Why budget at all?

Three concrete reasons budgets beat ad-hoc performance review:

  1. Forces trade-offs up front. "We want to add a fraud-detection call on every checkout" is easy to approve in a feature review. It's harder to approve when the budget already shows that the payment service has 20 ms left and the fraud call costs 40 ms. The budget makes the cost visible before the commit.
  2. Catches regressions automatically. If you instrument each component and alert when any component exceeds its allocation, a slow dependency surfaces in staging — not three months after it ships as a customer complaint about checkout times.
  3. Aligns teams without coordination overhead. The auth team, the payment team, and the inventory team can each own their allocation independently. No weekly meeting needed: as long as each team stays within their slice, the overall contract holds.

The anatomy of a budget: client → gateway → service → storage → downstream

A typical web API request passes through five layers. Here is a worked allocation for a p99 target of 300 ms:

| Layer | Component | Allocated (ms) | Notes |
|---|---|---|---|
| Client ↔ Edge | Network (client to nearest PoP) | 25 | Assume regional users on good connections |
| Edge ↔ Gateway | CDN / TLS offload | 5 | TLS handshake amortised by connection reuse |
| Gateway | API gateway (auth, rate-limit, routing) | 10 | Token validation from in-memory cache |
| Service | Business logic (no I/O) | 10 | Pure CPU: serialise, validate, transform |
| Cache | Redis (same DC), hit path | 2 | ~80% of requests; negligible |
| Storage | Postgres (same DC), miss path | 30 | ~20% of requests; indexed query |
| Downstream | Downstream pricing service | 50 | Separate microservice, same region |
| Response path | Network (edge back to client) | 25 | Symmetric with inbound |
| Headroom | Tail variance, GC, queue jitter | 143 | Difference to reach 300 ms p99 target |
| Total | | 300 | |

The largest single allocation — 143 ms of headroom — is not slack; it is the acknowledged gap between your median estimates and the p99 tail. The tail factor captures GC pauses, OS scheduling jitter, queue back-pressure spikes, and the variance in your slowest database queries. If your measurements consistently use less of the headroom, you can tighten the target. If they frequently exceed it, you have a budget over-run that demands architectural attention.
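As a quick sanity check on the allocation above, the arithmetic can be sketched in a few lines of Python. The dictionary keys and the `headroom` helper are illustrative names chosen here, not part of any tool; the numbers are the ones from the table.

```python
# Allocations from the worked table, in milliseconds (headroom excluded).
ALLOCATIONS_MS = {
    "network_in": 25,
    "edge_tls": 5,
    "gateway": 10,
    "business_logic": 10,
    "redis_hit": 2,
    "postgres_miss": 30,
    "downstream_pricing": 50,
    "network_out": 25,
}


def headroom(target_ms: int, allocations: dict) -> int:
    """Return the unallocated milliseconds; raise if the budget is overdrawn."""
    spent = sum(allocations.values())
    if spent > target_ms:
        raise ValueError(f"budget overdrawn by {spent - target_ms} ms")
    return target_ms - spent


print(headroom(300, ALLOCATIONS_MS))  # 143
```

Adding a new entry to the dictionary without shrinking another one reduces the headroom directly, which is the trade-off the budget is meant to expose.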

Diagram 1: stacked budget across components

[Figure: stacked horizontal bar, p99 target = 300 ms. Segments: Net in 25 ms, Gateway, App logic, Postgres 30 ms, Downstream service 50 ms, Net out 25 ms, headroom / tail variance 143 ms. Annotations: the downstream service is the biggest controllable slice (50 ms), so it is the first candidate to parallelise or cache. Every segment must fit inside the 300 ms bar: growing one slice means shrinking another or busting the budget.]
A 300 ms p99 budget visualised as a stacked horizontal bar. The downstream service (red, 50 ms) is the largest single controllable allocation. The headroom block (dark, 143 ms) is the acknowledged tail variance gap between median estimates and the p99 ceiling.

Serial vs parallel: why it changes everything

The most consequential decision in latency budgeting is which calls are serial and which are parallel. Get it wrong and your budget arithmetic is off by 2×.

Serial: call A must finish before call B can start (B needs A's result, or you've written them that way). Total time = sum of all calls.

Parallel: calls A and B are dispatched simultaneously, neither waits on the other. Total time = max(A, B) — you only pay for the slowest one.

[Figure: two timelines. Serial (total = A + B): Call A 40 ms, then Call B 60 ms; total 40 + 60 = 100 ms. Parallel (total = max(A, B)): Call A 40 ms and Call B 60 ms dispatched together; total max(40, 60) = 60 ms, saving 40 ms versus serial. ⚠️ Fan-out (N parallel calls) means you always wait for the slowest call, so the composite tail is worse than any single call's tail: three independent calls with p90 = 20 ms each all finish within 20 ms only about 73% of the time (0.9³).]
Left: serial calls add (100 ms total). Right: parallel calls cost the maximum (60 ms total). Fan-out amplifies tail: the more parallel calls you make, the more likely one of them will be a tail hit.
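The two rules reduce to `sum` and `max`. A minimal sketch, using the 40 ms and 60 ms calls from the diagram (the function names are illustrative):

```python
def serial_total(durations_ms):
    """Calls run one after another: the total is the sum."""
    return sum(durations_ms)


def parallel_total(durations_ms):
    """Calls are dispatched together: the total is the slowest call."""
    return max(durations_ms)


calls_ms = [40, 60]
print(serial_total(calls_ms))    # 100
print(parallel_total(calls_ms))  # 60
```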

Fan-out and tail amplification

Parallel calls save time on average, but they hide a trap: tail amplification. If a single call has a 1% chance of being slow, and you make 10 parallel calls, the probability that at least one is slow is roughly 10%. The latency that was an individual call's p99 is now roughly the composite's p90.

Formally: if each call is independently slow with probability q, N parallel calls are all fast with probability (1-q)^N. The complement is the probability of at least one slow call. For q=0.01 (1%) and N=10, that's 1-(0.99)^10 ≈ 9.6%. You've paid the tail rate 10× more often. See Lesson 04 for the percentile foundations.
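The formula is short enough to check numerically. The sketch below assumes, as the text does, that calls are slow independently of one another; `p_any_slow` is an illustrative name.

```python
def p_any_slow(q: float, n: int) -> float:
    """Probability that at least one of n independent calls is slow,
    where q is the probability that a single call is slow."""
    return 1.0 - (1.0 - q) ** n


for n in (1, 3, 5, 10, 100):
    print(f"N={n:>3}: {p_any_slow(0.01, n):.1%}")
```

In real systems slow calls are often correlated (a shared database, a noisy host), so treat the independent-calls figure as a model rather than a measurement.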

This is why fan-out architectures — BFF patterns, GraphQL resolvers calling N services, search suggestions that fan out to N ranking models — can have terrible p99 even when each individual call is fast.

⚠️ Common trap: the fan-out p99 trap

You have 5 downstream services each with p99 = 50 ms. You think your composite p99 is 50 ms because they run in parallel. In fact it's closer to the p99.8 of a single call, because each call must be fast with probability 0.99^(1/5) ≈ 0.998 for all five to be fast 99% of the time. Seen from the other side: the probability that none of five services is in its tail is (0.99)^5 ≈ 95%, meaning 5% of composite requests will hit at least one tail. Your p95 equals their individual p99. Budget for the composite tail, not the individual one.
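To budget for the composite tail, it helps to invert the formula: given a composite target, which per-call percentile does each sibling need to meet? A minimal sketch, assuming independent calls (the function name is illustrative):

```python
def required_per_call_quantile(composite_quantile: float, n: int) -> float:
    """Quantile each of n independent parallel calls must meet so that
    all n finish within the deadline with probability composite_quantile."""
    return composite_quantile ** (1.0 / n)


# Five parallel calls, composite p99 target:
print(f"{required_per_call_quantile(0.99, 5):.4f}")  # 0.9980
```

In other words, each service's allocation in a five-way fan-out should be compared against roughly its p99.8 latency, not its p99.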

Where to spend or cut the budget

Once you have an allocation, you can reason systematically about where work buys the most:

✅ Budgets as regression gates

Do wire the budget into your observability: instrument each component, track its p99 latency, and alert when any component exceeds its allocation by more than 20%. A downstream service that silently degrades from 50 ms to 120 ms will now trigger an alert before it blows the customer SLO. Don't review performance only after a customer complains — by then the budget has been overdrawn for weeks.
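The alert rule can be sketched as a pure function. This is an illustration under stated assumptions: `over_budget` and the component names are hypothetical, and in practice the measured p99 values would come from your metrics backend rather than a literal dictionary.

```python
def over_budget(budget_ms, measured_p99_ms, tolerance=0.20):
    """Return {component: overrun_ms} for every component whose measured
    p99 exceeds its allocation by more than `tolerance` (20% by default)."""
    alerts = {}
    for name, allocated in budget_ms.items():
        measured = measured_p99_ms.get(name)
        if measured is not None and measured > allocated * (1 + tolerance):
            alerts[name] = measured - allocated
    return alerts


budget = {"gateway": 10, "downstream": 50}
measured = {"gateway": 11, "downstream": 120}
print(over_budget(budget, measured))  # {'downstream': 70}
```

The gateway at 11 ms is over its allocation but inside the 20% tolerance, so only the downstream service raises an alert.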

Worked example: allocate a 300 ms p99 budget and find the over-budget component

Scenario: an e-commerce product-detail page calls three services and a database. The product team has agreed on p99 < 300 ms. Here is the call graph and initial measurements:

# Proposed call graph (all serial as written: each call waits for the previous one to finish)

Step 1: Network (client US → server US-East)      30 ms  budget
Step 2: API gateway (auth, routing)               10 ms  budget
Step 3: Product service — fetch product details   15 ms  budget
Step 4: Inventory service — check stock levels    25 ms  budget  ← serial on product
Step 5: Pricing service — fetch dynamic price     40 ms  budget  ← serial on product
Step 6: Review service — fetch top 3 reviews      20 ms  budget  ← serial on product
Step 7: Serialise & transmit response             15 ms  budget
Step 8: Headroom (tail variance)                  145 ms budget
       ─────────────────────────────────────────────────
       Total                                      300 ms

# Now we measure in staging:
Step 1: 28 ms   ✓
Step 2: 11 ms   ⚠️  1 ms over budget
Step 3: 14 ms   ✓
Step 4: 31 ms   ⚠️  6 ms over budget
Step 5: 92 ms   ✗  52 ms over budget  ← PROBLEM
Step 6: 18 ms   ✓
Step 7: 16 ms   ⚠️  1 ms over budget
Total measured: 210 ms raw  →  p99 estimate (1.7× tail multiplier): 210 × 1.7 = 357 ms, OVER target

The pricing service (Step 5) is the culprit: 92 ms measured against a 40 ms budget, 52 ms over. Even accounting for the fact that Steps 4–6 could theoretically be parallelised (they only need the product ID from Step 3), the pricing service would still blow the parallel composite budget.
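The comparison of measurements against allocations is mechanical, so it is worth scripting. A minimal sketch using the step numbers and figures from the example; the 1.7 tail multiplier is the one the example uses, and the variable names are illustrative.

```python
# Allocations and staging measurements per step, in milliseconds.
BUDGET_MS = {1: 30, 2: 10, 3: 15, 4: 25, 5: 40, 6: 20, 7: 15}
MEASURED_MS = {1: 28, 2: 11, 3: 14, 4: 31, 5: 92, 6: 18, 7: 16}
TAIL_FACTOR = 1.7  # raw-to-p99 multiplier taken from the worked example

# Steps whose measurement exceeds their allocation, with the overrun in ms.
overruns = {
    step: MEASURED_MS[step] - BUDGET_MS[step]
    for step in BUDGET_MS
    if MEASURED_MS[step] > BUDGET_MS[step]
}
raw_total = sum(MEASURED_MS.values())   # all seven steps run serially
p99_estimate = raw_total * TAIL_FACTOR

print(overruns)                        # {2: 1, 4: 6, 5: 52, 7: 1}
print(raw_total, round(p99_estimate))  # 210 357
```

Sorting the overruns by size puts Step 5 at the top, which is the order in which to investigate.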

# Fix attempt 1: parallelise steps 4, 5, 6 after step 3
  Steps 4, 5, 6 dispatched concurrently after step 3 returns.
  Critical-path total: 28 + 11 + 14 + max(31, 92, 18) + 16 = 161 ms raw
  p99 estimate: 161 × 1.7 = 274 ms  ← under target, but with only ~26 ms to spare

  But: fan-out tail risk. 3 parallel calls → composite p99 worse than individual p99.
  With pricing measured at 92 ms against a 40 ms allocation, this is still risky.

# Fix attempt 2: cache pricing service results (price changes at most every hour)
  Add Redis cache in front of pricing service.
  Cache hit (80% of calls):  ~1 ms   ← budget easily met
  Cache miss (20% of calls): ~92 ms  ← still slow but rare
  Effective weighted latency: 0.8×1 + 0.2×92 = 19 ms average
  p99 now depends on the cache-miss rate: while misses exceed 1% of calls, the p99 of this call is still the ~92 ms miss path.

  This is the right fix: pricing data is cacheable, cache TTL matches business rules.
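
The weighted figure above is an average, and averages and percentiles respond differently to a cache. The sketch below uses a deliberately simple two-point model (every hit costs exactly `hit_ms`, every miss exactly `miss_ms`); both function names are illustrative.

```python
def expected_latency(hit_rate, hit_ms, miss_ms):
    """Mean latency of a cached call under the two-point model."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms


def percentile_latency(percentile, hit_rate, hit_ms, miss_ms):
    """Latency at `percentile` (0..1) under the two-point model, assuming
    hit_ms <= miss_ms: hits fill the bottom hit_rate share of requests."""
    return hit_ms if percentile <= hit_rate else miss_ms


print(round(expected_latency(0.8, 1, 92), 1))   # 19.2
print(percentile_latency(0.99, 0.8, 1, 92))     # 92
print(percentile_latency(0.99, 0.995, 1, 92))   # 1
```

Under this model the mean drops as soon as the cache is added, while the p99 only moves once the hit rate reaches 99%, so the TTL and key design that drive the hit rate matter as much as the cache itself.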
🎯 Interview angle — senior signal

Stating a latency budget in a system design interview is a strong signal that separates senior candidates. Here is the exact move: after sketching the call graph, say "I'd budget this endpoint at p99 < 300 ms and allocate it like this: 30 ms network, 10 ms gateway, [list components], 145 ms headroom." Then add "if we parallelise these three calls, the composite tail risk is about 1-(0.99)³ ≈ 3%, so the budget for each must be tighter." Most candidates guess; you've just done architecture under constraints. That is the senior-engineer frame for performance.

Under the hood: how it actually works

A latency budget is a constraint satisfaction problem: divide a total of B ms across N components so that the allocations along the critical path sum to ≤ B. What makes it non-trivial is the interaction between serial vs parallel topology, the tail amplification of fan-out, and the difference between "median budget" and "p99 budget".

Concrete budget allocation — 300 ms p99, step by step

Scenario: a checkout API. Client is in the US (same region as server). The call graph has one serial backbone with three branches that can be parallelised after the product lookup.

────────────────────────────────────────────────────────
  Call graph with topology labels
────────────────────────────────────────────────────────

SERIAL chain (must be sequential):
  1. Network in (client → server, same region)        30 ms
  2. API gateway: JWT validate + rate-limit check     10 ms   (token in gateway cache)
  3. Product service: fetch product row               15 ms   (Postgres, indexed, warm)

PARALLEL fan-out (dispatched together after step 3):
  4a. Inventory service: check stock           25 ms
  4b. Pricing service: dynamic price lookup    40 ms   ← slowest sibling
  4c. Promotions service: eligible discounts   20 ms
  Fan-out total = max(25, 40, 20) = 40 ms

SERIAL chain resumes:
  5. App: merge results + build response               5 ms
  6. Network out (server → client)                   30 ms

────────────────────────────────────────────────────────
  Sum to median
────────────────────────────────────────────────────────

Median estimate = 30+10+15 + 40 + 5+30 = 130 ms

Headroom to 300 ms target = 300 − 130 = 170 ms

────────────────────────────────────────────────────────
  p99 estimate
────────────────────────────────────────────────────────

Variance sources:
  • Postgres (step 3) p99 ≈ 3× median = 45 ms (+30 ms)
  • Pricing service (step 4b) p99 ≈ 2.5× median = 100 ms (+60 ms)
  • Network jitter at p99: +20 ms

p99 estimate = 130 + 30 + 60 + 20 = 240 ms  ← under 300 ms ✓

Budget consumed: 240 / 300 = 80%.
Recommendation: healthy — leave it; tighten the budget if you add a new service call.
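
The same calculation can be expressed as a small tree of serial and parallel groups, which makes the topology explicit. This is a sketch: the `Call`, `Serial`, and `Parallel` classes are illustrative, not taken from any library, and the tail adders are the three variance figures listed above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Call:
    name: str
    ms: float


@dataclass
class Serial:
    parts: list


@dataclass
class Parallel:
    parts: list


Node = Union[Call, Serial, Parallel]


def critical_path_ms(node: Node) -> float:
    """Serial groups add their parts; parallel groups cost their slowest part."""
    if isinstance(node, Call):
        return node.ms
    totals = [critical_path_ms(part) for part in node.parts]
    return sum(totals) if isinstance(node, Serial) else max(totals)


checkout = Serial([
    Call("network_in", 30),
    Call("gateway", 10),
    Call("product", 15),
    Parallel([Call("inventory", 25), Call("pricing", 40), Call("promotions", 20)]),
    Call("merge", 5),
    Call("network_out", 30),
])

median_ms = critical_path_ms(checkout)
tail_adders_ms = {"postgres": 30, "pricing": 60, "network_jitter": 20}
p99_ms = median_ms + sum(tail_adders_ms.values())
print(median_ms, p99_ms)  # 130 240
```

Changing `Parallel` to `Serial` in the fan-out group raises the median from 130 ms to 175 ms, which is a quick way to see what sequential dispatch would cost.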

Fan-out tail amplification — the math

Parallel calls save median latency but increase the probability of hitting a tail on any given request. The math is straightforward:

# Probability that at least one of N independent calls is slow
# q = probability a single call is slow (e.g. 0.01 for 1%)

P(at least one slow) = 1 − (1 − q)^N

N=1,  q=0.01:  1 − (0.99)^1  = 1.0%   ← baseline
N=3,  q=0.01:  1 − (0.99)^3  = 2.97%  ← 3 parallel calls in checkout
N=5,  q=0.01:  1 − (0.99)^5  ≈ 4.9%
N=10, q=0.01:  1 − (0.99)^10 ≈ 9.6%
N=100,q=0.01:  1 − (0.99)^100 ≈ 63%   ← 100 fan-out calls, ~63% of requests hit a tail

# Practical rule: if each call has 1% slow rate and you fan out to 100 calls,
# 63% of composite requests will wait for at least one tail. Even the p50 of
# the composite is at or beyond the p99 of the individual call.

# Threshold heuristic: N × q < 0.05 keeps composite tail probability under ~5%.
# For q=0.01: safe up to N≈5. For q=0.001: safe up to N≈50.
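
The threshold heuristic is a linear approximation; the exact bound follows from solving 1 − (1 − q)^N ≤ t for N. A minimal sketch, assuming independent calls (`max_fanout` is an illustrative name):

```python
import math


def max_fanout(q: float, max_tail_probability: float) -> int:
    """Largest N such that 1 - (1 - q)**N <= max_tail_probability,
    assuming each call is slow independently with probability q."""
    return math.floor(math.log(1 - max_tail_probability) / math.log(1 - q))


print(max_fanout(0.01, 0.05))   # 5
print(max_fanout(0.001, 0.05))  # 51
```

The exact values agree with the N × q rule of thumb to within a call or two at these probabilities.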

This is why you must budget for the composite p99 of a fan-out group, not the individual call p99. If each of 5 downstream calls has a 50 ms p99, it is tempting to assume the parallel group's p99 is also 50 ms (the slowest sibling). But the probability that at least one call takes 50 ms or longer on any given request is 1-(0.99)^5 ≈ 5%, not 1%. The group's p95 is now equal to their individual p99, and the group's p99 is higher still.

⚠️ Serial total vs parallel total: never mix them up

A common mistake: calling three services that could be parallel but writing sequential await calls in the code, then budgeting as if they are parallel. The budget says 40 ms (max of three calls); the code pays 85 ms (25+40+20 serially). The mismatch shows up as a budget overrun that's hard to trace until someone reads the code. Always verify parallelism in the implementation, not just in the design.
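The difference is easy to demonstrate. The sketch below uses Python's `asyncio`, where `asyncio.gather` plays the role that `Promise.all` plays in JavaScript; `fake_call` is a stand-in for a real service call, and the delays are the 25, 40, and 20 ms figures from the example.

```python
import asyncio
import time

DELAYS_S = (0.025, 0.040, 0.020)  # inventory, pricing, promotions


async def fake_call(delay_s: float) -> float:
    """Stand-in for a downstream call that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return delay_s


async def sequential() -> float:
    """Each await finishes before the next call starts: cost is the sum."""
    start = time.perf_counter()
    for delay in DELAYS_S:
        await fake_call(delay)
    return time.perf_counter() - start


async def concurrent() -> float:
    """All calls are dispatched together: cost is the slowest call."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_call(delay) for delay in DELAYS_S))
    return time.perf_counter() - start


sequential_s = asyncio.run(sequential())
concurrent_s = asyncio.run(concurrent())
print(f"sequential: {sequential_s * 1000:.0f} ms, concurrent: {concurrent_s * 1000:.0f} ms")
```

Both functions look similar at a glance, which is why this mismatch tends to survive code review. Exact timings vary by machine, but the sequential version should land near 85 ms and the concurrent one near 40 ms.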

How to debug & inspect it

A budget overrun has two possible root causes: a component drifted past its allocation, or a new component was added without removing budget from elsewhere. Both require the same first step: find which hop in the trace exceeds its allocation. Distributed tracing tools (Jaeger, Tempo, AWS X-Ray, Datadog APM) show the waterfall you need; so does curl with OpenTelemetry-instrumented endpoints and the Server-Timing response header.

Reading a trace waterfall to find the over-budget hop

# The traceparent value is a placeholder. %{stderr} (curl 7.63.0+) sends the
# timing to stderr so it does not corrupt the JSON piped to jq.
$ curl -s -w "%{stderr}%{time_total}\n" -H "traceparent: 00-abc123-001-01" \
    https://api.example.com/v1/checkout/42 | jq '.timing'

# Or read Server-Timing header if your service emits it:
$ curl -I https://api.example.com/v1/checkout/42
Server-Timing: gateway;dur=11, product-svc;dur=14, pricing-svc;dur=97, inventory-svc;dur=28, promotions-svc;dur=22, app;dur=6

# pricing-svc at 97 ms is 57 ms over its 40 ms budget: immediately visible
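
Comparing a Server-Timing header against the budget can be automated. The sketch below is a simplified parser that handles only the `name;dur=value` form shown above (it does not handle quoted descriptions containing commas); the function name is illustrative, and the budget figures are the per-component allocations from the checkout call graph earlier in this lesson.

```python
def parse_server_timing(header_value: str) -> dict:
    """Parse 'a;dur=11, b;dur=14' into {'a': 11.0, 'b': 14.0}.
    Entries without a dur parameter are skipped."""
    timings = {}
    for entry in header_value.split(","):
        name, *params = [part.strip() for part in entry.split(";")]
        for param in params:
            key, _, value = param.partition("=")
            if name and key.strip().lower() == "dur":
                timings[name] = float(value.strip().strip('"'))
    return timings


HEADER = ("gateway;dur=11, product-svc;dur=14, pricing-svc;dur=97, "
          "inventory-svc;dur=28, promotions-svc;dur=22, app;dur=6")
BUDGET_MS = {"gateway": 10, "product-svc": 15, "pricing-svc": 40,
             "inventory-svc": 25, "promotions-svc": 20, "app": 5}

timings = parse_server_timing(HEADER)
overruns = {
    name: ms - BUDGET_MS[name]
    for name, ms in timings.items()
    if name in BUDGET_MS and ms > BUDGET_MS[name]
}
worst = max(overruns, key=overruns.get)
print(worst, overruns[worst])  # pricing-svc 57.0
```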

In a trace waterfall tool, look for:

Budget-blown triage table

| Symptom in trace / monitoring | Cause | Action |
|---|---|---|
| One span in a parallel group is 3–5× its budget; others are fine | That downstream service regressed (new query, added a serial call, DB migration in progress) | Alert the owning team; short-term: add a cache in front of that service if data allows; long-term: root-cause the regression |
| All spans in a parallel group are within budget, but wall-clock time of the group is still over | Calls are actually serial in code: await A; await B; instead of Promise.all | Inspect the call site; replace sequential awaits with concurrent dispatch |
| Endpoint is over budget only at high load (p50 fine, p99 blows up) | Connection pool or thread pool exhaustion: requests queue before dispatch | Add pool-wait metric; increase pool size or reduce concurrency of callers; add a circuit breaker |
| Budget was fine; new feature team added a synchronous call; p99 jumped | Budget not enforced as a gate: call was approved without budget analysis | Wire the budget into CI/CD: fail deployments that add a new synchronous call without a budget amendment document; or make the new call async |
| A service that was under budget drifts over slowly over weeks | Data growth: queries that were fast on small tables slow down as row count grows | Add index, partition, or archive old rows; set a monitoring alert at 80% of budget allocation, not 100% |
| Budget is met at p99 but a few users see 2–3 s responses | p99.9 or p99.99 events: GC stop-the-world, DB lock timeout, or cold-start after deploy | Look at p99.9 histogram; if GC: tune heap or switch to low-pause collector; if cold-start: use canary deploys + connection pre-warming |

Debug checklist:

  1. Confirm the endpoint is actually over budget by querying your monitoring for p99 latency by component (not just total).
  2. Open the slowest trace in your trace tool; look at the waterfall. Identify which span is the longest and how far it exceeds its budget.
  3. Check whether over-budget spans are in a serial chain or a parallel group — the fix differs.
  4. Look at the time distribution (histogram) of the over-budget span — is it bimodal (cache hit/miss) or has a long tail (GC/lock)? Each implies a different fix.
  5. Check Server-Timing headers in the failing request to confirm your trace tool's reading.
  6. Verify in code that intended parallel dispatches are actually concurrent — search for sequential awaits on independent calls.

🧠 Quick check

1. You have a p99 budget of 200 ms for an endpoint. The network costs 50 ms each way, the app costs 10 ms, the DB costs 40 ms. How much headroom is left?

Serial sum: 50 (in) + 10 (app) + 40 (DB) + 50 (out) = 150 ms. Budget is 200 ms. Headroom = 200 - 150 = 50 ms.

2. You make 5 parallel downstream calls, each with an independent 1% probability of being slow. What is the approximate probability that at least one call is slow?

P(at least one slow) = 1 - P(all fast) = 1 - (0.99)^5 ≈ 4.9%. Each independent call contributes its own tail risk; fan-out multiplies, not dilutes, tail exposure.

3. A new feature team wants to add a synchronous call to a recommendation service on every product page load. The rec service p99 is 60 ms. The current budget already uses 260 ms of a 300 ms target. What should you do?

The budget is a contract. Adding 60 ms to a 260 ms serial path breaks the 300 ms target. The fix is either async processing (fire-and-forget after response) or cutting 60 ms from another component to make room.

4. Which lever gives the largest latency saving for a call that costs 80 ms due to an uncached database query on a hot, frequently-requested resource?

An in-memory cache turns an 80 ms DB hit into a ~1 ms cache hit for the majority of requests. That's an 80× improvement on the hot path. An index cuts to ~10 ms (8×) and better SSD to ~40 ms (2×). Caching is the highest-leverage lever when the resource is frequently re-read.

✍️ Exercise 1: split a 250 ms budget across a call graph
Scenario

Design a latency budget for a social-feed endpoint: p99 target = 250 ms. The call graph (all serial): client (mobile, average 4G) → CDN edge → app server → Redis (same DC) → if miss, Postgres (same DC) → downstream user-profile service (same region, different DC) → response. Assume 15% cache miss rate.

Model answer:

# Component allocations (your numbers may differ; reasoning matters)
Mobile 4G to CDN edge:          ~20 ms
CDN edge to app server:          ~5 ms
App logic:                       ~5 ms
Redis (hit, 85% path):           ~1 ms
Postgres (miss, 15% path):      ~25 ms  (weighted: 0.15 × 25 = 3.75 ms avg)
User-profile service:           ~35 ms  (different DC, ~15 ms network + ~20 ms logic)
Response path (app → client):   ~25 ms
Headroom:                      ~134 ms
─────────────────────────────────────────
Total:                          250 ms

# Key decisions:
- Redis hit path keeps the common case fast (85% of requests ≈ 91 ms serial to response: 20 + 5 + 5 + 1 + 35 + 25)
- Postgres miss is 25 ms — acceptable because rare
- User-profile service is the biggest single slice: candidate to cache or colocate
- Headroom = ~54% of budget — healthy; tighten if measurements show consistent use below 50%

Rubric: ✓ components sum to ≤ 250 ms ✓ cache hit vs miss path distinguished ✓ identifies user-profile service as largest controllable slice ✓ leaves meaningful headroom ✓ notes which component to optimise first. Five of five = full marks.

✍️ Exercise 2: a request is over budget — where do you look and what do you cut?
Scenario

Your monitoring shows the checkout API p99 has drifted from 210 ms to 410 ms over the past two weeks. The budget is 300 ms. Trace data shows: network (in+out) = 60 ms (unchanged), gateway = 12 ms (unchanged), app logic = 8 ms (unchanged), payment service = 280 ms (was 90 ms), inventory check = 15 ms (unchanged). Where is the regression and what are your options?

Model answer:

# Diagnosis:
Payment service has regressed from 90 ms → 280 ms (+190 ms).
All other components are within original budgets.
The total overrun (410 - 300 = 110 ms) is explained entirely by this regression.

# Options (in order of architectural preference):

1. Root-cause the payment service regression.
   Check: did a dependency (their DB, their upstream, their deployment) change?
   Check: is it a hot-spot query missing an index? A new synchronous call they added?
   Fix the root cause — restore it to ~90 ms.

2. If the 280 ms is unavoidable (external vendor slowdown):
   Make the payment call async: accept the order optimistically, confirm async.
   Cost: complexity of saga/compensation pattern.
   Saves: 280 ms removed from p99 hot path.

3. Cache payment method validation results (e.g., card validity = 15-min TTL).
   Reduces payment service calls for repeat customers.
   Typical e-commerce: ~70% of checkouts are returning users → 70% cache hit rate.
   Effective payment latency: 0.7 × 2ms + 0.3 × 280ms = ~85 ms.

# Never do: raise the SLO target to 500 ms to "fix" the alert.
# The budget is a regression gate — changing it hides the problem, not the cost.

Rubric: ✓ correctly identifies payment service as the only changed component ✓ quantifies the regression precisely ✓ proposes root-cause investigation before architectural changes ✓ offers async as the high-leverage architectural option ✓ explicitly rejects raising the SLO. Five of five = full marks.
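
The diagnosis step amounts to diffing two per-component snapshots. A minimal sketch using the figures from the scenario (the dictionary names are illustrative; in practice both snapshots would come from your tracing or metrics backend):

```python
# Per-component latency two weeks ago and today, in milliseconds.
BEFORE_MS = {"network": 60, "gateway": 12, "app": 8, "payment": 90, "inventory": 15}
AFTER_MS = {"network": 60, "gateway": 12, "app": 8, "payment": 280, "inventory": 15}

# Keep only the components that got slower, with the size of the change.
regressed = {
    name: AFTER_MS[name] - BEFORE_MS[name]
    for name in AFTER_MS
    if AFTER_MS[name] > BEFORE_MS.get(name, 0)
}
print(regressed)  # {'payment': 190}
```

A component that appears only in the newer snapshot is reported with its full latency as the change, which also surfaces newly added calls.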

Key takeaways

Sources & further reading