API Design

Performance · Lesson 01

Estimating response time

Before you build, you should be able to sketch on a napkin whether your endpoint will feel instant or glacial. The STAMP method — five ordered steps — turns fuzzy guesses into defensible estimates in under five minutes.

⏱ 11 min · Difficulty: core · Prereq: Lesson 04 (Latency & Throughput)

By the end you'll be able to

Why estimate at all?

Two anti-patterns haunt performance conversations: build then measure (you discover a 3-second response time after shipping) and guess without grounding ("it'll be fast enough"). Neither is engineering.

An estimate — even a rough one — forces you to name every step a request takes and attach a number to each. Those numbers come from the latency reference table you already know (see Lesson 04): RAM reads in ~100 ns, SSD reads in ~100 µs, a round-trip across a data center in ~500 µs, a cross-continental network round-trip in ~100–150 ms. When you can add those up, you stop being surprised by production profiles.

The STAMP method

STAMP is a five-step walk through a request's lifecycle. Run the steps in order; jot numbers as you go; sum at the end.

  1. S — Sketch the hops. Draw (mentally or on paper) the full path of the request from client to response: DNS, TLS handshake, load balancer, app server, cache layer, database, any downstream services, and the return trip. Every arrow is a hop; every hop is a potential latency source.
  2. T — Tag each hop with a time. Attach an order-of-magnitude figure to each hop using the standard reference numbers. You rarely need precision — rounding to the nearest 10× is fine at this stage.
  3. A — Arrange as serial or parallel. Serial work (step A must finish before step B can start) adds. Parallel work (steps A and B run concurrently) takes the maximum of the parallel group. Misidentifying this is the most common arithmetic error in an estimate.
  4. M — Multiply for tail. An average is a lie (see Lesson 04). For p99, apply a tail factor: database queries often have a 3–5× spread between median and p99 due to lock contention, GC pauses, and cold cache; network hops add variance from congestion. A conservative working rule: multiply your raw sum by 1.5–2× to reach a realistic p99 estimate.
  5. P — Pressure-test the dominant term. One hop almost always dominates. Find it, then ask: "what would cutting this in half do to the total?" If it barely moves the total, you found the wrong term to optimise.
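
The five steps above can be sketched as a small call-graph model. This is a minimal illustration, not a prescribed implementation: the class names, the `estimate_ms` and `p99_range_ms` helpers, and the example numbers (including the "pricing service" hop) are all hypothetical. It shows the two combination rules from the Arrange step and the 1.5–2× working rule from the Multiply step.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Hop:
    """One step with an order-of-magnitude latency in milliseconds."""
    name: str
    ms: float


@dataclass
class Serial:
    """Steps that run one after another: latencies add."""
    steps: List["Node"]


@dataclass
class Parallel:
    """Steps dispatched together and all awaited: the slowest one wins."""
    steps: List["Node"]


Node = Union[Hop, Serial, Parallel]


def estimate_ms(node: Node) -> float:
    """Combine tagged hops into a raw (median-style) estimate."""
    if isinstance(node, Hop):
        return node.ms
    if isinstance(node, Serial):
        return sum(estimate_ms(step) for step in node.steps)
    return max(estimate_ms(step) for step in node.steps)


def p99_range_ms(raw_ms: float, low: float = 1.5, high: float = 2.0) -> Tuple[float, float]:
    """Apply the lesson's 1.5-2x working rule to a raw sum."""
    return raw_ms * low, raw_ms * high


# Hypothetical endpoint: two independent backend calls run in parallel.
graph = Serial([
    Hop("network in", 40),
    Hop("app logic", 3),
    Parallel([Hop("db query", 5), Hop("pricing service (hypothetical)", 12)]),
    Hop("network out", 40),
])

raw_ms = estimate_ms(graph)
low_ms, high_ms = p99_range_ms(raw_ms)
print(f"raw estimate: {raw_ms} ms, p99 range: {low_ms}-{high_ms} ms")
```

Modelling the parallel group explicitly keeps the arithmetic honest: the 5 ms query disappears inside the 12 ms call instead of being added to it.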

Reference latency numbers

These are the numbers STAMP's "Tag" step draws from (full discussion in Lesson 04):

| Operation | Typical latency | Order of magnitude |
| --- | --- | --- |
| CPU cache read (L1/L2) | ~1–10 ns | Nanoseconds |
| RAM read (main memory) | ~100 ns | Nanoseconds |
| SSD random read | ~100 µs | Microseconds |
| HDD random read | ~10 ms | Milliseconds |
| Same data-center round-trip | ~500 µs – 1 ms | Sub-millisecond |
| Cross-region round-trip | ~30–100 ms | Tens of ms |
| Cross-continental round-trip | ~100–150 ms | ~100 ms |
| Simple DB query (indexed, warm cache) | ~1–5 ms | Low ms |
| Simple DB query (cold, disk-backed) | ~20–50 ms | Tens of ms |

Diagram: a request broken into timed steps

[Figure: request timeline. Client sends request → network in (~40 ms) → load balancer (<1 ms) → 1 ms → app server (5 ms) → 1 ms → database (10 ms, the largest server-side step) → network out (~40 ms) → client receives response, with headroom left for tail variance.]
Serial sum: 40 + 1 + 5 + 10 + 1 + 40 = 97 ms; p99 estimate (×1.8) ≈ 175 ms. Network dominates (80 of 97 ms raw): cutting DB time from 10 ms to 2 ms saves only 8 ms of the total.
Each labelled block is one step in the STAMP walk. The serial sum gives a median estimate; applying the tail multiplier gives p99. The network hops (blue) dwarf the DB work (amber) — the right target to optimise is the round-trip distance, not SQL tuning.
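
The pressure test on the diagram's numbers can be checked in a few lines. This is a rough sketch: the grouping of terms, the `share_saved` helper, and the reading of the two 1 ms terms as internal hops are assumptions made for illustration.

```python
from typing import Dict

# Terms of the diagram's serial sum, grouped by kind. Treating the two 1 ms
# terms as internal hops is an assumption; the figure does not name them.
GROUPS_MS: Dict[str, float] = {
    "network": 40 + 40,
    "internal_hops": 1 + 1,
    "app_server": 5,
    "database": 10,
}


def share_saved(groups: Dict[str, float], name: str, new_ms: float) -> float:
    """Fraction of the total saved if one group drops to new_ms."""
    total = sum(groups.values())
    return (groups[name] - new_ms) / total


total_ms = sum(GROUPS_MS.values())            # raw serial sum
p99_ms = total_ms * 1.8                       # the diagram's tail multiplier
dominant = max(GROUPS_MS, key=GROUPS_MS.get)  # largest group

db_saving = share_saved(GROUPS_MS, "database", 2)   # DB 10 ms -> 2 ms
net_saving = share_saved(GROUPS_MS, "network", 40)  # network halved

print(f"total {total_ms} ms, p99 ~{p99_ms:.0f} ms, dominant: {dominant}")
print(f"DB 10->2 ms saves {db_saving:.1%}; halving network saves {net_saving:.1%}")
```

An 80% cut in database time moves the total by about 8%, while halving the network moves it by about 41%, which is the comparison the P step asks for.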

Worked example: estimate a read endpoint's p99

Scenario: a mobile app fetches a user's activity feed. The app server is in US-East; the user is in Western Europe; there is a Redis cache; if the cache misses, the app queries Postgres.

# STAMP walk for GET /v1/feed (cache-miss path)

S — Hops:
  Client (Europe) → CDN edge (Europe) → App server (US-East)
  → Redis (same DC as app) → Postgres (same DC) → back

T — Tag each hop:
  1. Client to CDN edge:          ~10 ms  (nearby PoP)
  2. CDN edge to US-East:         ~80 ms  (transatlantic)
  3. Load balancer overhead:       ~1 ms
  4. App processing (auth, logic): ~3 ms
  5. Redis lookup (same DC):       ~1 ms  (MISS — fall through)
  6. Postgres query (indexed):     ~8 ms  (warm buffer pool)
  7. App serialise JSON:           ~2 ms
  8. US-East back to CDN edge:    ~80 ms
  9. CDN edge to client:          ~10 ms

A — All serial (no parallel calls):
  Total = 10+80+1+3+1+8+2+80+10 = 195 ms  (median estimate)

M — Tail factor for p99:
  Postgres p99 can spike to ~25 ms (3× median) under load.
  Network jitter adds ~20 ms.
  Revised sum: 10+80+1+3+1+25+2+80+10+20 = 232 ms

P — Dominant term check:
  Network (transatlantic, both directions) = 160 ms of 195 ms raw.
  DB is 8 ms. Even eliminating DB entirely saves only 4%.
  → The right lever is edge caching (serve from CDN), not DB tuning.
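
The M step in the walk above swaps in a p99 figure for one hop and adds a jitter allowance, rather than multiplying the whole sum. A minimal sketch of that bookkeeping follows; the dictionary keys and the `p99_estimate` helper are hypothetical names, and the numbers are the ones tagged in the walk.

```python
from typing import Dict

# Median tags from the T step of the walk, in milliseconds.
median_ms: Dict[str, float] = {
    "client_to_edge": 10,
    "edge_to_origin": 80,
    "load_balancer": 1,
    "app_processing": 3,
    "redis_miss": 1,
    "postgres_query": 8,
    "serialise_json": 2,
    "origin_to_edge": 80,
    "edge_to_client": 10,
}

# Hops whose p99 differs from the median, plus a flat jitter allowance.
p99_overrides: Dict[str, float] = {"postgres_query": 25}
jitter_ms = 20


def p99_estimate(median: Dict[str, float],
                 overrides: Dict[str, float],
                 extra_ms: float = 0) -> float:
    """Serial sum using the p99 value wherever one is given."""
    return sum(overrides.get(name, ms) for name, ms in median.items()) + extra_ms


raw_ms = sum(median_ms.values())
tail_ms = p99_estimate(median_ms, p99_overrides, jitter_ms)
implied_multiplier = tail_ms / raw_ms

print(f"median {raw_ms} ms, p99 {tail_ms} ms, implied x{implied_multiplier:.2f}")
```

The implied multiplier here is about 1.19, below the 1.5–2× working rule from the M step, so it is worth stating which of the two approaches an estimate used.
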
🎯 Interview angle

When an interviewer says "how fast will this be?", walk STAMP out loud: name every hop, attach numbers, identify whether work is serial or parallel, and finish with a p99 estimate rather than an average. Doing this — especially the "P" step that identifies the dominant term — signals you understand that knowing what to optimise matters more than optimising blindly. Most candidates guess; you'll calculate.

⚠️ Common trap

Forgetting the return trip. Engineers often add up the inbound hops and forget that the response travels back along roughly the same path. On a cross-continental request, the two network legs together cost 100 ms or more, usually far more than the database work in between. Always count both directions.

✅ Do this, not that

Do report p99 and state your tail multiplier ("I assumed 1.8× for tail"). Don't report a raw average — it describes no real user's experience and hides the worst-case behaviour that will dominate support tickets. A p99 estimate that says "this could hit 230 ms" gives your team an actionable SLO target before you build.

Under the hood: how it actually works

STAMP's arithmetic rests on two rules that come directly from how CPU scheduling and I/O work. Understanding why they hold lets you apply them correctly in novel situations.

Rule 1 — Serial steps add

When step B can only start after step A finishes (B reads A's output, or you've written them sequentially in code), the wall-clock time is simply the sum. There is no concurrency, so no overlap:

# Serial chain: A then B then C
T_total = T_A + T_B + T_C

# Example: auth lookup → DB query → JSON serialise
T_total = 5 ms + 20 ms + 2 ms = 27 ms

Rule 2 — Parallel steps take the MAX

When two tasks are dispatched at the same moment and neither needs the other's result, the caller waits until the last one finishes. The fast sibling's time is hidden inside the slow sibling's time:

# Parallel dispatch: A and B start simultaneously
T_total = max(T_A, T_B)

# Example: cache lookup (1 ms) and DB query (20 ms) dispatched together, both awaited
T_total = max(1 ms, 20 ms) = 20 ms
# The cache result arrives first, but the caller still waits for the DB
# (unless you implement a "first wins" hedge, which is a separate pattern)
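
Both rules can be observed directly with simulated calls. The sketch below uses Python's `asyncio`, with `asyncio.sleep` standing in for I/O; `fake_call` and the 50 ms and 100 ms durations are made up for the demonstration.

```python
import asyncio
import time


async def fake_call(name: str, seconds: float) -> str:
    """Stand-in for an I/O-bound call such as a query or an HTTP request."""
    await asyncio.sleep(seconds)
    return name


async def run_serial() -> float:
    """B starts only after A finishes, so the durations add."""
    start = time.perf_counter()
    await fake_call("A", 0.05)
    await fake_call("B", 0.10)
    return time.perf_counter() - start


async def run_parallel() -> float:
    """A and B start together; the caller waits for the slower one."""
    start = time.perf_counter()
    await asyncio.gather(fake_call("A", 0.05), fake_call("B", 0.10))
    return time.perf_counter() - start


serial_s = asyncio.run(run_serial())
parallel_s = asyncio.run(run_parallel())
print(f"serial:   {serial_s * 1000:.0f} ms (roughly 50 + 100)")
print(f"parallel: {parallel_s * 1000:.0f} ms (roughly max(50, 100))")
```

Exact timings vary with scheduler overhead, so the useful observation is the shape: the serial run lands near the sum and the parallel run near the maximum.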

Worked numeric example: one call graph, end to end

Scenario: a regional user (US-West) calls GET /v1/feed. The server is US-East. There is an in-process LRU cache checked first; on a miss, a Redis lookup is tried; on a second miss, Postgres is queried. Auth token validation runs in-process against a cached JWKS. Response is gzip-compressed before transmission.

───────────────────────────────────────────────────────
  Step-by-step call graph with relationship types
───────────────────────────────────────────────────────

SERIAL group 1 — must happen in order:
  A. Request transit US-West→US-East       60 ms   (warm connection; a cold TCP + TLS handshake adds round trips)
  B. Receive request headers + body         2 ms   (parse + TLS decrypt)
  C. JWT validation (in-process, JWKS cached)  1 ms (signature verify, e.g. RS256)

SERIAL group 2 — cache waterfall (each tried only on previous miss):
  D1. LRU cache check (in-process RAM)      0 ms  (nanoseconds, rounds to 0)
      HIT (70% of calls) → skip D2, D3
  D2. Redis lookup (same DC, ~500 µs)       1 ms
      HIT (20% of all calls) → skip D3
  D3. Postgres query (same DC, indexed)    20 ms  (10% of all calls reach here)

SERIAL group 3 — response assembly:
  E. JSON serialise + gzip compress         2 ms
  F. Network US-East → US-West (return)   60 ms  (same distance as step A)

───────────────────────────────────────────────────────
  SUM by path
───────────────────────────────────────────────────────

Cache-hit path (70%):   60+2+1+0+2+60 = 125 ms  (median)
Redis-hit path (20%):   60+2+1+0+1+2+60 = 126 ms  (median)
DB-miss path (10%):     60+2+1+0+1+20+2+60 = 146 ms (median)

Weighted mean across paths:
  0.70 × 125 + 0.20 × 126 + 0.10 × 146
  = 87.5 + 25.2 + 14.6 ≈ 127 ms
  (The p50 itself falls on the cache-hit path, 125 ms, because 70% of
  calls take it.)

p99 estimate (M step):
  The DB-miss path (10%) is the tail.
  Postgres p99 can reach ~60 ms (3× median under write contention).
  Add network jitter: +15 ms at p99.
  DB-miss p99 path: 60+2+1+0+1+60+2+60+15 = 201 ms
  Because 10% of calls reach the DB, the endpoint's overall p99 sits at
  about the 90th percentile of this path, so ≈ 200 ms is a conservative
  (upper) estimate for the overall p99.

Dominant term (P step):
  Network (in+out) = 120 of 146 ms on the slow path (82%).
  Moving the server to US-West would cut ~110 ms. DB tuning saves ≤20 ms.
⚠️ Cache waterfalls are serial, not parallel

A tiered cache (LRU → Redis → DB) is a serial chain — you check the first tier, wait for the answer, then decide whether to check the next. It is not parallel. The common mistake is to use the average across tiers as if they were simultaneous branches. In reality, the miss path pays all three latencies in sequence. Budget each tier individually and account for the miss-rate-weighted sum when estimating real-world p50.
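
A small sketch of that budgeting, using the tier latencies and path shares from the call graph above. The names `TIER_MS`, `PATHS`, `path_ms`, `weighted_mean_ms`, and `percentile_ms` are hypothetical, and each path is treated as a single median value, so tail spread within a path is ignored. The full-miss path counts the 1 ms Redis miss, which gives 146 ms.

```python
from typing import List, Tuple

# Time spent outside the cache waterfall:
# network in + parse + JWT check + serialise + network out.
FIXED_MS = 60 + 2 + 1 + 2 + 60

TIER_MS = {"lru": 0, "redis": 1, "postgres": 20}

# (path name, share of calls in percent, tiers visited in order)
Path = Tuple[str, int, List[str]]
PATHS: List[Path] = [
    ("lru_hit", 70, ["lru"]),
    ("redis_hit", 20, ["lru", "redis"]),
    ("full_miss", 10, ["lru", "redis", "postgres"]),
]


def path_ms(tiers: List[str]) -> int:
    """A waterfall is serial: every tier visited adds its latency."""
    return FIXED_MS + sum(TIER_MS[tier] for tier in tiers)


def weighted_mean_ms(paths: List[Path]) -> float:
    """Share-weighted mean latency across the paths."""
    return sum(pct * path_ms(tiers) for _, pct, tiers in paths) / 100


def percentile_ms(paths: List[Path], q_pct: int) -> int:
    """Latency of the path containing the q-th percentile request."""
    cumulative = 0
    for _, pct, tiers in sorted(paths, key=lambda p: path_ms(p[2])):
        cumulative += pct
        if cumulative >= q_pct:
            return path_ms(tiers)
    raise ValueError("path shares must sum to 100")


mean_ms = weighted_mean_ms(PATHS)
p50_ms = percentile_ms(PATHS, 50)
p99_ms = percentile_ms(PATHS, 99)
print(f"mean {mean_ms:.1f} ms, p50 {p50_ms} ms, p99 path median {p99_ms} ms")
```

The weighted sum is a mean, while the p50 is simply the latency of whichever path the middle request takes; with a 70% hit rate, that is the in-process cache path.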

How to debug & inspect it

An estimate is a model. Reality differs — sometimes by a little (good calibration), sometimes by a lot (a hidden hop, a misconfigured timeout, a serial call you thought was parallel). The debug workflow is: measure → compare → identify the gap → fix the model or the system.

Measuring real end-to-end timing with curl

curl's -w (write-out) flag produces a per-phase timing breakdown. This gives you the same structure as your STAMP walk so you can compare them directly:

$ curl -o /dev/null -s -w "
   namelookup:   %{time_namelookup}s
   connect:      %{time_connect}s
   tls:          %{time_appconnect}s
   pretransfer:  %{time_pretransfer}s
   starttransfer:%{time_starttransfer}s
   total:        %{time_total}s
   size_download:%{size_download} bytes
" https://api.example.com/v1/feed

   namelookup:   0.008s   # DNS only
   connect:      0.068s   # +TCP handshake
   tls:          0.128s   # +TLS handshake
   pretransfer:  0.128s   # ready to send first byte
   starttransfer:0.183s   # time to first byte (TTFB) = server + network
   total:        0.201s   # +download body
   size_download:4821 bytes

Map curl's fields onto your STAMP estimate:

| curl field | What it measures | STAMP step |
| --- | --- | --- |
| time_namelookup | DNS resolution only | T: DNS hop |
| time_connect − time_namelookup | TCP 3-way handshake | T: first RTT |
| time_appconnect − time_connect | TLS handshake | T: TLS cost |
| time_starttransfer − time_appconnect | Time to first byte (server + network) | T: server steps + return trip |
| time_total − time_starttransfer | Body download time | T: payload size / bandwidth |
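
The subtractions in the mapping can be automated. The sketch below assumes the cumulative timings have already been collected into a dictionary keyed by curl's write-out variable names; `phase_breakdown_ms` and the phase labels are hypothetical, and the sample values are the ones from the curl run shown earlier.

```python
from typing import Dict


def phase_breakdown_ms(t: Dict[str, float]) -> Dict[str, float]:
    """Turn curl's cumulative timings (seconds) into per-phase durations (ms)."""
    def delta(later: str, earlier: str) -> float:
        return round((t[later] - t[earlier]) * 1000, 1)

    return {
        "dns": round(t["time_namelookup"] * 1000, 1),
        "tcp_handshake": delta("time_connect", "time_namelookup"),
        "tls_handshake": delta("time_appconnect", "time_connect"),
        "ttfb_after_tls": delta("time_starttransfer", "time_appconnect"),
        "body_download": delta("time_total", "time_starttransfer"),
    }


# Values copied from the sample curl output.
sample = {
    "time_namelookup": 0.008,
    "time_connect": 0.068,
    "time_appconnect": 0.128,
    "time_pretransfer": 0.128,
    "time_starttransfer": 0.183,
    "time_total": 0.201,
}

phases = phase_breakdown_ms(sample)
for name, ms in phases.items():
    print(f"{name:>15}: {ms:6.1f} ms")
```

Because curl reports each field as time elapsed since the start of the request, the per-phase numbers only become comparable to a STAMP walk after this differencing.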

Symptom → cause → fix table

| Observed symptom | Likely cause | Fix / next step |
| --- | --- | --- |
| Reality is 2–3× your estimate; time_appconnect is the gap | You estimated a warm (reused) connection; curl opened a cold one, so DNS + TCP + TLS added 100–200 ms | Add cold-connection setup time to your STAMP estimate; deploy keep-alive / HTTP/2 to amortise in production |
| time_starttransfer − time_appconnect is much larger than your STAMP server total | A hidden serial hop (auth service call, config fetch, lazy DB pool connection) not in your sketch | Add distributed tracing (OpenTelemetry) or server-side timing headers (Server-Timing) to expose the hidden step |
| Estimate matches p50 measurements but p99 is 3–4× higher than your tail multiplier predicted | GC pause, lock contention, or a co-located noisy-neighbour causing multi-second spikes at the tail | Look at the distribution (histogram), not just p99; investigate GC logs, DB slow-query log, thread pool exhaustion |
| time_total − time_starttransfer is large even though server processing is fast | Response body is large (no compression, or very wide JSON) and bandwidth-limited on client side | Enable Brotli/gzip, add field selection, paginate; re-run curl and verify size_download drops |
| Two independent calls you sketched as parallel show up sequential in traces | Code dispatches them with await callA(); await callB(); (sequential await, not Promise.all()) | Replace with await Promise.all([callA(), callB()]) or equivalent concurrent dispatch in your language |
| Estimate and measurement agree at median, but SLO alerts fire nightly at ~2 AM | Tail events correlate with a batch job, DB maintenance, or GC cycle, which a single-request model cannot show | Add a time-of-day dimension to your latency data; schedule batch work to avoid peak read windows |

Debug checklist:

  1. Run curl -w (cold connection, same region as your production origin) and capture all six timing fields.
  2. Map each field to the corresponding STAMP step. Identify which phase is the largest gap from your estimate.
  3. If TTFB is large, add Server-Timing headers to your app to see which server-side step dominates.
  4. Check whether calls you expected to be parallel are actually dispatched concurrently — add a trace or log timestamps at dispatch and response receipt.
  5. Compare p50 to p99 in your monitoring. If the ratio exceeds your tail multiplier, look for bimodal latency (cache miss vs hit) or periodic spikes.
  6. Identify the dominant term (largest measured segment) and confirm it matches the STAMP dominant term. If not, update the estimate model.
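
Step 5 of the checklist can be sketched with the standard library, assuming latency samples in milliseconds have been exported from monitoring. The helper names `p50_p99` and `tail_exceeds_budget` are hypothetical, and the sample data is synthetic, chosen to mimic a bimodal distribution.

```python
import statistics
from typing import List, Tuple


def p50_p99(samples_ms: List[float]) -> Tuple[float, float]:
    """Return (p50, p99) from the 99 percentile cut points."""
    cuts = statistics.quantiles(samples_ms, n=100)
    return cuts[49], cuts[98]


def tail_exceeds_budget(samples_ms: List[float], assumed_multiplier: float) -> bool:
    """True when the measured p99/p50 ratio is above the assumed multiplier."""
    p50, p99 = p50_p99(samples_ms)
    return p99 / p50 > assumed_multiplier


# Synthetic, bimodal data: 95% fast requests, 5% slow ones (e.g. cache misses).
samples = [100.0] * 95 + [400.0] * 5

p50, p99 = p50_p99(samples)
print(f"p50 {p50:.0f} ms, p99 {p99:.0f} ms, ratio {p99 / p50:.1f}x")
print("investigate tail:", tail_exceeds_budget(samples, assumed_multiplier=1.8))
```

A ratio well above the assumed multiplier is a prompt to look at the histogram for a second mode instead of simply raising the multiplier.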

🧠 Quick check

1. You run STAMP and find: network in 40 ms, DB 5 ms, app logic 3 ms, network out 40 ms. Which step most deserves optimisation effort?

The "P" step in STAMP tells you to find the dominant term. Here network is 80 of 88 ms — optimising the DB from 5 ms to 1 ms saves only 5%, while moving the user closer (CDN/edge) would cut 80 ms.

2. Why does STAMP produce a p99 estimate rather than an average?

Averages are skewed by outliers and mask the tail. At scale, if 1% of requests are slow, that's a large number of real users. p99 is the figure to set latency targets against because it describes what the slowest 1% of requests actually experience.

3. An endpoint makes two database calls that are completely independent of each other — neither uses the other's result. How do you combine their latencies in STAMP?

Independent calls that are dispatched concurrently finish when the slowest one finishes. Parallel latency = max(A, B), not A + B. You never get "the average" — you always wait for the tail.

4. An app is in US-East. Most users are in US-East too, but 10% are in Southeast Asia (round-trip ~220 ms). Your raw median estimate is 30 ms. What is a defensible p99 target?

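p99 must account for the slowest 1% of requests. Because 10% of users (far more than 1%) start from a ~220 ms round-trip, the p99 bucket falls entirely within their requests. A p99 of 30 ms is therefore impossible; a defensible target starts at roughly 250 ms (220 ms of network plus the server-side work), with extra margin for tail variance.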

✍️ Exercise: estimate a write endpoint's p99
Scenario

Your service handles POST /v1/orders. The call graph is: client (US-West) → app server (US-East, 70 ms one-way) → validate input (1 ms) → write to Postgres (15 ms) → publish to a message queue (2 ms) → respond. The message queue publish and Postgres write are serial. Estimate the median and p99 response time.

Model answer:

# S — Hops
Client (US-West) → App (US-East) → Postgres → MQ → App → Client

# T — Tag
  Network US-West→US-East (in):  70 ms
  Input validation:               1 ms
  Postgres write:                15 ms
  MQ publish (same DC):           2 ms
  Network US-East→US-West (out): 70 ms

# A — All serial
  Sum = 70 + 1 + 15 + 2 + 70 = 158 ms  (median)

# M — Tail: Postgres write p99 ~45 ms (3× under write contention)
  Revised = 70 + 1 + 45 + 2 + 70 = 188 ms  (p99 estimate)
  Add network jitter ~15 ms → ~203 ms p99

# P — Dominant term
  Network = 140 of 158 ms (89%). Postgres is 15 ms.
  To beat 150 ms p99, move the server closer (multi-region) or serve from edge.
  Postgres optimisation alone cannot get there.

Rubric: ✓ identifies all hops ✓ tags correct reference latencies ✓ correctly labels all work as serial ✓ applies a tail multiplier and explains why ✓ identifies network as the dominant term and draws the correct architectural conclusion (not DB tuning). Five of five = full marks.

Key takeaways

Sources & further reading