API Design

Performance · Lesson 03

Speeding up page & API loads

The single most common performance mistake is optimising the server while the network is the bottleneck. This lesson gives you a systematic toolkit — critical path, round-trip reduction, payload trimming, connection reuse, CDN placement, parallelism, and lazy loading — in the order that actually moves the needle.

⏱ 13 min Difficulty: core Prereq: perf-02 (Latency budgets)

By the end you'll be able to

Start with the critical path

The critical path is the longest chain of dependent work between a user action and a visible response. Nothing else matters until the critical path is short. A page that makes 10 API calls, 8 of which are independent, has a critical path of 2 serial calls — optimise those, not the 8.

Think of it as a project timeline: the critical path is the sequence of tasks where a delay in any one delays the finish. Parallel tasks off the critical path can be slow without affecting the user-visible result.

Concretely: draw your call graph. Trace the longest serial chain from request to first meaningful paint. That chain is your budget to attack. See Lesson perf-02 for how to allocate time across the chain.
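Tracing the longest serial chain can be sketched as a small script over the call graph. The graph below is purely illustrative: the call names, durations, and dependencies are assumptions, not measurements from a real page, and the function assumes the graph has no cycles.

```python
from functools import lru_cache

# Hypothetical call graph: name -> (duration_ms, calls it must wait for).
CALLS = {
    "session": (100, []),
    "product": (150, ["session"]),
    "banner": (40, []),
    "footer_links": (30, []),
}


def critical_path(calls):
    """Return (total_ms, chain) for the longest chain of dependent calls."""

    @lru_cache(maxsize=None)
    def finish(name):
        duration, deps = calls[name]
        # A call can start only when its slowest dependency has finished.
        best_ms, best_chain = 0, ()
        for dep in deps:
            dep_ms, dep_chain = finish(dep)
            if dep_ms > best_ms:
                best_ms, best_chain = dep_ms, dep_chain
        return best_ms + duration, best_chain + (name,)

    total_ms, chain = max(finish(name) for name in calls)
    return total_ms, list(chain)


total_ms, chain = critical_path(CALLS)
print(total_ms, " -> ".join(chain))
```

In this toy graph the independent `banner` and `footer_links` calls never appear in the result: only the `session -> product` chain sets the budget.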

Why the network dominates

From the latency reference table (see Lesson 04): a cross-continental round-trip is ~100–150 ms, a cross-region round-trip is ~30–100 ms, and a same-data-center round-trip is ~0.5–1 ms. Your server's CPU processes a request in 2–10 ms. A typical client-to-server round-trip is 30–150 ms depending on geography.

The implication is stark: at these figures, one additional network round trip costs roughly 10–50× what a plausible server-side optimisation saves. Reducing round trips is almost always the highest-leverage move before any server tuning.

Lever 1: reduce round trips

Every extra round trip is a latency tax paid at network speed. Common sources of unnecessary round trips:

Lever 2: reduce payload size

Smaller payloads transmit faster and parse faster, especially on mobile connections where bandwidth is the constraint.
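To get a rough feel for how payload size turns into transfer time, the sketch below compresses a made-up JSON response and estimates transfer time on a slow link. The record shape and the 2 Mbps figure are assumptions for illustration. gzip from the Python standard library stands in for whatever compression your server actually uses, and the model ignores latency and TCP slow start.

```python
import gzip
import json


def transfer_ms(size_bytes, bandwidth_mbps):
    """Time to move size_bytes over a link, ignoring latency and TCP slow start."""
    bytes_per_second = bandwidth_mbps * 1_000_000 / 8
    return size_bytes / bytes_per_second * 1000


# Hypothetical response: 2,000 product records with repetitive field names.
records = [
    {"id": i, "name": f"Product {i}", "currency": "USD", "in_stock": True}
    for i in range(2000)
]
raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

for label, body in (("raw", raw), ("gzip", compressed)):
    print(f"{label}: {len(body)} bytes, {transfer_ms(len(body), 2):.0f} ms at 2 Mbps")
```

Repetitive JSON like this compresses well; the exact ratio depends on the data, so measure your own responses rather than relying on this example.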

Lever 3: connection reuse

TCP and TLS connections have setup costs. A cold TCP+TLS connection to an HTTPS server takes 2–3 round trips before the first byte of application data can flow:

On a 100 ms RTT link, that's 200–300 ms of overhead before your API call even starts. Connection reuse eliminates this on subsequent requests. See Lesson 05 for the full socket lifecycle.

Practical implications: use HTTP/2 (which multiplexes many requests over one connection) or HTTP/1.1 persistent connections (keep-alive, which is the default in HTTP/1.1). On the client side, pool connections in your HTTP client rather than opening a new one per request. On the server side, configure keep-alive timeouts long enough that idle connections survive between a user's requests.
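The cost of skipping connection reuse can be estimated with a few lines of arithmetic. This is a simplified model under stated assumptions: every handshake costs a fixed number of round trips, and a pooled client pays it once per burst of requests. The function name and parameters are illustrative, not part of any HTTP library.

```python
def connection_overhead_ms(requests, rtt_ms, handshake_rtts=2, reuse=True):
    """Total handshake time paid across a burst of HTTPS requests.

    handshake_rtts=2 models TCP (1 RTT) + TLS 1.3 (1 RTT); use 3 for TLS 1.2.
    """
    # With reuse, only the first request pays for the handshake.
    handshakes = min(requests, 1) if reuse else requests
    return handshakes * handshake_rtts * rtt_ms


cold_each_time = connection_overhead_ms(5, rtt_ms=100, reuse=False)
pooled = connection_overhead_ms(5, rtt_ms=100, reuse=True)
print(cold_each_time, pooled)  # 1000 200
```

Five requests on a 100 ms link spend a full second on handshakes without reuse and 200 ms with it, before any application data is counted.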

Lever 4: CDN and edge placement

A CDN (Content Delivery Network) places servers at Points of Presence (PoPs) geographically close to users. The key insight: the speed of light is fixed — you cannot make a packet travel from Sydney to New York faster. You can serve the Sydney user from a Sydney PoP.

Three things CDNs give you:

  1. Static asset caching. JS, CSS, images, fonts served from a PoP 10 ms away instead of an origin 150 ms away. One-time cache miss pays origin cost; all subsequent requests are local.
  2. API response caching. For read-heavy endpoints with data that changes infrequently (product details, reference data, user-public profiles), caching the API response at the CDN edge turns every cache hit into a ~10 ms local response.
  3. TLS termination at the edge. The expensive TLS handshake happens between the user and the nearby PoP (~10 ms RTT), not between the user and the distant origin (~100 ms RTT). The PoP-to-origin connection is pre-warmed and persistent. This alone can save 200–400 ms on a cold connection to a distant origin.
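The effect of edge caching and edge TLS termination can be approximated with two small formulas. This is a deliberately simple model, assuming every request pays the edge round trip and only cache misses also pay a PoP-to-origin fetch; the function names and the sample numbers are illustrative.

```python
def expected_latency_ms(hit_ratio, edge_rtt_ms, origin_fetch_ms):
    """Average response time seen by the user for a cacheable endpoint.

    Every request pays the edge round trip; only misses also pay the
    PoP-to-origin fetch.
    """
    return edge_rtt_ms + (1 - hit_ratio) * origin_fetch_ms


def handshake_saving_ms(handshake_rtts, origin_rtt_ms, edge_rtt_ms):
    """Setup time saved by terminating TCP+TLS at the PoP instead of the origin."""
    return handshake_rtts * (origin_rtt_ms - edge_rtt_ms)


for hit_ratio in (0.0, 0.5, 0.9, 0.99):
    print(hit_ratio, round(expected_latency_ms(hit_ratio, 10, 150), 1))
print(handshake_saving_ms(2, 100, 10))  # 180
```

The average only approaches the ~10 ms edge figure at high hit ratios, which is why cacheability of the endpoint matters as much as PoP placement.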

Lever 5: parallelise independent calls

If the critical path includes calls whose inputs do not depend on each other's outputs, dispatch them simultaneously rather than sequentially. Three 50 ms calls dispatched in parallel cost ~50 ms; dispatched serially they cost 150 ms.

The caveat (from Lesson perf-02): fan-out amplifies tail latency. If each of three parallel calls has p99 = 50 ms, the chance that at least one of them exceeds 50 ms is 1 − 0.99³ ≈ 3%, so about 3% of requests (not 1%) will hit at least one tail. Budget for the composite. In practice: if your parallelism is bounded (2–4 calls), the win almost always outweighs the tail risk.
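Both halves of this trade-off can be seen in a short simulation. The sketch below uses `asyncio.sleep` as a stand-in for real HTTP calls (an assumption; no network is involved) and `asyncio.gather` as Python's counterpart to dispatching calls concurrently. The 50 ms durations mirror the example above; the call names are placeholders.

```python
import asyncio
import time


async def fake_call(name, seconds):
    """Stand-in for an HTTP request: just waits for the given duration."""
    await asyncio.sleep(seconds)
    return name


async def serial(durations):
    return [await fake_call(n, s) for n, s in durations.items()]


async def parallel(durations):
    # gather() starts every call before waiting on any of them.
    return await asyncio.gather(*(fake_call(n, s) for n, s in durations.items()))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


def tail_hit_probability(calls, percentile=0.99):
    """Chance that at least one of `calls` independent requests lands in its tail."""
    return 1 - percentile ** calls


DURATIONS = {"reviews": 0.05, "inventory": 0.05, "recommendations": 0.05}
_, serial_s = timed(serial(DURATIONS))
_, parallel_s = timed(parallel(DURATIONS))
print(f"serial {serial_s * 1000:.0f} ms, parallel {parallel_s * 1000:.0f} ms")
print(f"tail risk with 3 calls: {tail_hit_probability(3):.1%}")
```

The tail calculation assumes the calls are independent; calls that share a slow backend will be correlated, so treat the figure as a rough guide.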

Lever 6: lazy loading

Not everything on a page needs to be loaded before the user sees anything useful. Lazy loading defers non-critical data until after the critical path has rendered:

Lazy loading shortens the critical path payload even if total data transfer is unchanged. The user sees a response faster; the rest loads progressively.
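The ordering that lazy loading enforces can be shown with a minimal simulation. Here `asyncio.sleep` stands in for network requests, and the resource names and durations are assumptions; the point is only that the deferred fetch starts after the first render has been recorded.

```python
import asyncio

events = []


async def fetch(name, seconds):
    """Stand-in for a network request; records when it completes."""
    await asyncio.sleep(seconds)
    events.append(f"loaded {name}")


async def load_page():
    # Critical path: only the data needed for the first render.
    await fetch("product", 0.02)
    events.append("first render")

    # Deferred: started after the render, so it cannot delay it.
    lazy = asyncio.create_task(fetch("recommendations", 0.02))
    await lazy
    events.append("recommendations shown")


asyncio.run(load_page())
print(events)
```

In a real page the UI stays responsive while the deferred task is in flight; this sketch simply waits for it so the script can print the final order.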

Diagram: critical path before and after optimisation

BEFORE (serial, no optimisation): 590 ms to first render
  DNS                       60 ms
  TLS cold handshake       200 ms   (2 RTTs × 100 ms)
  Auth API                 100 ms
  Content API              150 ms
  Render                    80 ms
  Images load (serial)     130 ms
  Critical path: DNS + TLS + Auth + Content + Render = 590 ms (images add 130 ms more)

AFTER (CDN edge, parallel calls, connection reuse, lazy images): 155 ms to first render
  DNS                       10 ms   (CDN PoP)
  TLS                       15 ms   (warm CDN connection)
  Auth 30 ms and Content 50 ms, in parallel
  Render                    80 ms
  Images                    lazy (after render)
  Critical path to first render: 10 + 15 + max(30, 50) + 80 = 155 ms

Optimisation impact summary
  DNS: 60 ms → 10 ms (CDN PoP)
  TLS: 200 ms → 15 ms (warm CDN connection)
  Serial API calls: 250 ms → 50 ms (parallel)
  First-render critical path: 590 ms → 155 ms. Images deferred; they do not block render.

What did NOT help: server CPU tuning
  Server logic was ~5 ms before and after. Cutting it to 0 ms would have saved 5 ms on a 590 ms path (<1%). The network (DNS + TLS + round trips) was 510 of 590 ms. That is always where you start.

Before: DNS, cold TLS handshake, and serial auth and content calls produce a 590 ms critical path to first render, with blocking image loads adding 130 ms more. After: CDN edge cuts DNS/TLS, parallel API calls collapse the serial wait, and lazy loading removes images from the critical path, giving first render in 155 ms. Server CPU was irrelevant throughout.

Worked example: optimising a product page API

# Before: three serial API calls from browser, no CDN, cold connections

Browser:
  1. DNS resolve api.example.com          60 ms
  2. TCP + TLS handshake (cold)          200 ms
  3. GET /v1/auth/session                100 ms  (blocking — client waits)
  4. GET /v1/products/42                 150 ms  (serial on auth)
  5. GET /v1/products/42/recommendations 120 ms  (serial on product)
  6. Render page                          80 ms
                                         ─────
  Total to render:                       710 ms

# After: CDN, connection reuse, parallel calls, lazy recommendations

Browser:
  1. DNS resolve (CDN PoP, nearby)        10 ms
  2. TCP + TLS (CDN warm connection)      15 ms
  3. GET /v1/auth/session     ┐
     GET /v1/products/42      ├ parallel  max(30, 50) = 50 ms
                              ┘
  4. Render page                          80 ms
                                         ─────
  Critical path to render:               155 ms

  5. GET /v1/products/42/recommendations  ← lazy, after render
     loaded in background, displayed when ready.

# Net saving: 710 ms → 155 ms to first render (78% improvement)
# Server CPU before and after: ~5 ms. Irrelevant to the outcome.
⚠️ Common trap: optimising server CPU when network dominates

Teams spend weeks micro-optimising database queries, switching serialisers, or profiling hot loops — and then measure a 5 ms improvement on a 600 ms page load. The network round trips and TLS handshake cost 500 of those 600 ms. The server was never the bottleneck. Before tuning any server-side code, draw the full critical path (including all network hops) and verify which segment actually dominates. If it's network, fix network. If it's server, then tune server.

🎯 Interview angle

When asked "how would you make this page/API faster?", the most impressive answer starts with: "First I'd identify the critical path and see where time is actually spent — network hops, serial API calls, payload size, or server processing — because the right lever depends entirely on which term dominates." Then work through the levers in order: reduce round trips, add CDN/edge, parallelise independent calls, trim payloads, lazy-load non-critical data. Jumping straight to "add a cache" or "index the DB" without identifying the bottleneck is a junior signal.

✅ Do this, not that

Do measure first: use browser DevTools (Network tab, waterfall view) or distributed tracing to see the actual critical path. Optimise the largest block you can see. Don't assume the server is the problem because you can control it — network and client-side rendering often dwarf server time, and those are equally optimisable with CDN placement, connection reuse, and parallelism.

Under the hood: how it actually works

Critical-path mechanics come down to one rule: dependent round trips stack; independent round trips collapse. Understanding exactly what forces a new round trip — and what allows requests to share one — lets you design the waterfall rather than just observe it.

What forces a new round trip

A new network round trip is forced when:

Before/after waterfall with numbers

Concrete scenario: a product detail page. The user is 40 ms (round trip) from the CDN PoP; the CDN PoP is 80 ms from the origin, so a cold client-to-origin round trip is 120 ms. The original implementation has no CDN, no parallelism, and cold connections.

═══════════════════════════════════════════════════════════════════
  BEFORE: no CDN, cold connections, serial API calls, no lazy load
═══════════════════════════════════════════════════════════════════

RTT to origin = 120 ms (client is 120 ms from server)

Step          Duration    Cumulative    Notes
────────────────────────────────────────────────────────
DNS           60 ms       60 ms         Full recursive lookup
TCP handshake 120 ms      180 ms        1 RTT (SYN/SYN-ACK)
TLS 1.3       120 ms      300 ms        1 RTT (ClientHello/ServerHello)
GET /session  120+5 ms    425 ms        1 RTT + 5 ms server auth logic
GET /product  120+15 ms   560 ms        serial on session; 15 ms DB
GET /reviews  120+10 ms   690 ms        serial on product; 10 ms DB
Render        80 ms       770 ms        parse + paint
Images (6)    300 ms      1,070 ms      6 serial loads; delay time to interactive
─────────────────────────────────────────────────────────────────
Total to interactive: 1,070 ms
Total to first render: 770 ms

Critical path (serial chain): DNS→TCP→TLS→session→product→reviews→render


═══════════════════════════════════════════════════════════════════
  AFTER: CDN edge, warm connections, parallel calls, lazy images
═══════════════════════════════════════════════════════════════════

RTT to CDN PoP = 40 ms (CDN is geographically close)

Step                  Duration   Cumulative    Notes
──────────────────────────────────────────────────────────────────
DNS (CDN PoP)         8 ms       8 ms          CDN handles DNS
TCP+TLS (CDN, warm)   10 ms      18 ms         CDN pre-warmed connection
GET /session  ┐
GET /product  ├ parallel  max(40+5, 40+15) = 55 ms   73 ms   both at once
              ┘           (product at 55 ms is the slower sibling)
GET /reviews  ─ parallel with above (no session dep)  (inside the 55 ms window)
Render        80 ms      153 ms        parse + paint
─────────────────────────────────────────────────────────────────
Total to first render: 153 ms  (80% improvement on the original 770 ms)

Images:  loaded lazily AFTER render, so not on the critical path.
         User sees content at 153 ms; images fill in as they download.

# Server CPU in both scenarios: ~5 ms total. It contributed 0.5% of the
# 1,070 ms before and 3.3% of the 153 ms after. Not the bottleneck.

Why parallelising independent calls collapses the timeline

In the serial case, three 40 ms API calls add up to 120 ms — you pay three full round trips. When you dispatch all three simultaneously after the session resolves, you pay one round trip (the slowest sibling determines wall-clock cost). The other two complete inside that window "for free".

The key insight is that the graph of dependencies, not the list of calls, determines the minimum number of round trips. Formally: the minimum number of RTTs equals the length of the longest serial dependency chain in the call graph. Anything not on that chain can be collapsed into whichever serial step it depends on.

# Call graph for the product page (edges = "requires result of")

session ──→ product ──→ reviews
                    ├──→ related-items
                    └──→ inventory

# Longest serial chain: session → product → reviews = 3 RTTs minimum
# related-items and inventory depend only on product → dispatch with reviews
# Minimum RTTs = 3: (1) session, (2) product, (3) reviews + related-items + inventory
# (connection setup on a cold connection is paid on top of these)

# If we add a "user preferences" call that depends ONLY on the session (not product):
session ──→ product ──→ reviews
        └──→ preferences   ← parallel with product, no extra RTT
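The rule that the dependency graph sets the minimum number of round trips can be sketched by grouping calls into "waves", where every call in a wave can be dispatched at the same time. The dictionary below transcribes the graph above, including the preferences call; the function name is illustrative, and the code assumes the graph has no cycles.

```python
# Edges mean "requires the result of", as in the graph above.
DEPENDS_ON = {
    "session": [],
    "product": ["session"],
    "preferences": ["session"],
    "reviews": ["product"],
    "related-items": ["product"],
    "inventory": ["product"],
}


def dispatch_waves(depends_on):
    """Group calls into waves; each wave costs one round trip."""
    depth = {}

    def wave_of(name):
        if name not in depth:
            deps = depends_on[name]
            # A call belongs one wave after its deepest dependency.
            depth[name] = 1 + max((wave_of(d) for d in deps), default=0)
        return depth[name]

    waves = {}
    for name in depends_on:
        waves.setdefault(wave_of(name), []).append(name)
    return [waves[i] for i in sorted(waves)]


for number, calls in enumerate(dispatch_waves(DEPENDS_ON), start=1):
    print(f"RTT {number}: {', '.join(calls)}")
```

Six calls collapse into three waves, matching the length of the longest chain rather than the number of calls.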
⚠️ Resource discovery adds RTTs that don't appear in API design

A page loads main.js, which contains an import('./feature.js') that dynamically loads another module. That is a three-hop dependency chain just for JavaScript: HTML → main.js → feature.js. Each hop costs at least one full RTT, because the browser cannot request a file before it has discovered the reference to it. This is why bundling (collapsing many modules into one file), preload hints (<link rel="preload">), and HTTP/2 server push exist: they break the discovery waterfall so resources load in parallel rather than one-by-one as the parser encounters them. (Chrome has since removed support for HTTP/2 server push; 103 Early Hints is the usual replacement.)

How to debug & inspect it

Three tools give you the full waterfall picture: browser DevTools Network tab (for browser-initiated loads), curl -w (for server-side API timing), and distributed traces (for backend service calls). The critical-path analysis process is: render the waterfall, find the longest serial chain, identify the longest single bar on that chain, fix it.

Reading a DevTools / curl waterfall

# Simulate a browser's cold-load timing with curl (verbose timing):
$ curl -o /dev/null -s -w "
   dns:   %{time_namelookup}s
   tcp:   %{time_connect}s
   tls:   %{time_appconnect}s
   ttfb:  %{time_starttransfer}s
   total: %{time_total}s
   bytes: %{size_download}
" https://api.example.com/v1/product/42

   dns:   0.062s
   tcp:   0.182s   # +120 ms TCP (1 RTT)
   tls:   0.304s   # +122 ms TLS (another full RTT, TLS 1.3)
   ttfb:  0.426s   # +122 ms server logic + network return
   total: 0.441s   # +15 ms body download
   bytes: 8240

# Pattern: dns(62) + tcp(120) + tls(122) + server(122) + body(15) = 441 ms
# TLS is as expensive as TCP: this is a cold connection to a far origin.
# CDN would cut tcp+tls from 242 ms to ~10–20 ms combined.
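curl reports each timer as a cumulative offset from the start of the request, so the per-phase cost is the difference between consecutive values. A small helper (the function name and phase labels are illustrative) can do the subtraction, using the sample output above as input:

```python
def phase_breakdown_ms(timings):
    """Convert curl's cumulative timestamps (seconds) into per-phase durations (ms)."""
    order = [
        ("dns", "time_namelookup"),
        ("tcp", "time_connect"),
        ("tls", "time_appconnect"),
        ("wait", "time_starttransfer"),
        ("download", "time_total"),
    ]
    phases, previous = {}, 0.0
    for label, key in order:
        # Each phase is the gap between this timestamp and the previous one.
        phases[label] = round((timings[key] - previous) * 1000)
        previous = timings[key]
    return phases


sample = {
    "time_namelookup": 0.062,
    "time_connect": 0.182,
    "time_appconnect": 0.304,
    "time_starttransfer": 0.426,
    "time_total": 0.441,
}
phases = phase_breakdown_ms(sample)
print(phases)
```

Feeding it timings from several endpoints makes it easy to see whether connection setup or server wait dominates each one.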

In DevTools Network tab: select a request, look at the Timing sub-tab. The colour-coded bars map to:

DevTools colour / label | What it measures | Optimization lever
--- | --- | ---
Queueing | Waiting for a connection or for the browser to schedule the request | HTTP/2 multiplexing; reduce concurrent requests per host
DNS Lookup | Resolving the hostname | CDN PoP (local resolver); DNS prefetch hints
Initial connection | TCP handshake | CDN edge; connection reuse; HTTP/2
SSL | TLS handshake | CDN TLS termination at edge; TLS 1.3 (1-RTT); session resumption
Request sent / Waiting (TTFB) | Time to first byte: includes server processing and network return trip | Server-side optimisation; CDN caching; edge function
Content download | Body transfer time | Compression; field selection; pagination; binary formats

Slow-load symptom → cause → fix table

Symptom (what you see in waterfall) | Root cause | Fix
--- | --- | ---
DNS + TCP + TLS together = 200–400 ms; server processing is small | Cold connection to a far-away origin; no CDN | Add CDN with PoP close to users; CDN terminates TLS at the edge, reducing per-request connection cost to ~10 ms
Several API calls form a staircase (each starts after the previous ends); could be parallel | Sequential await in client code despite no data dependency | Refactor to Promise.all / concurrent dispatch; collapse into a single BFF call if calls always happen together
A long chain of small requests, each starting after prior completes (e.g. JS import chains) | Resource discovery waterfall: each document references the next | Bundle JS modules; add <link rel="preload"> for critical resources; use HTTP/2 push or Early Hints for known dependencies
Content download bar is large even though TTFB is fast | Response body is large; bandwidth-limited (mobile) or uncompressed | Enable Brotli/gzip; implement field selection; paginate; use binary format for high-volume endpoints
Many requests start at the same time but only 6 complete per batch (staggered waves) | Browser per-host connection limit (browsers typically open at most 6 concurrent HTTP/1.1 connections per host) | Switch to HTTP/2 (multiplexes many concurrent streams on one connection); or shard static assets across subdomains (HTTP/1.1 only)
First meaningful paint is late; images and non-critical API data are on the critical path | Lazy loading not implemented; all resources fetched synchronously on page load | Add loading="lazy" to below-the-fold images; defer non-critical API calls until after first render; use skeleton screens
TTFB is fast on a warm cache but spikes 10× when cache is cold (deploy or low-traffic hour) | Cold database or in-process cache; first request after deploy pays full DB cost | Warm caches on deploy (cache priming); use stale-while-revalidate to serve cached data while refreshing in background

Debug checklist — systematic critical-path analysis:

  1. Open DevTools Network tab; perform a hard reload (Shift+Refresh) to simulate a cold-start load; disable cache to get worst-case timing.
  2. Sort by start time; draw the dependency arrows: which request could not start until the previous one completed?
  3. Identify the long pole: the single request on the critical path with the longest bar. This is where to focus first.
  4. Expand the Timing breakdown of the long pole. Is most time in DNS/TLS (network problem), TTFB (server problem), or download (payload problem)? Each points to a different fix.
  5. Count serial round trips to first render. Compare to the theoretical minimum (length of the dependency chain). Every extra round trip is an opportunity.
  6. Check whether below-the-fold images and non-critical API calls have loading="lazy" or are deferred until after render.
  7. Run the same analysis with a throttled mobile connection (DevTools → Network → Slow 3G) to expose bandwidth-sensitive paths that are invisible on fast connections.
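Step 4 of the checklist amounts to a three-way classification of the long pole's timing breakdown. A rough sketch of that triage follows; the dictionary keys and bucket labels are hypothetical names chosen for this example, not fields exported by DevTools.

```python
def diagnose(timing_ms):
    """Map a request's timing breakdown to the kind of fix it suggests."""
    buckets = {
        "network setup": timing_ms.get("dns", 0)
        + timing_ms.get("connect", 0)
        + timing_ms.get("ssl", 0),
        "server or cache (TTFB)": timing_ms.get("waiting", 0),
        "payload": timing_ms.get("download", 0),
    }
    # The largest bucket is where the first fix should go.
    return max(buckets, key=buckets.get)


print(diagnose({"dns": 60, "connect": 120, "ssl": 120, "waiting": 125, "download": 15}))
print(diagnose({"waiting": 400, "download": 20}))
print(diagnose({"waiting": 40, "download": 900}))
```

A large TTFB bucket still includes one network return trip, so compare it against the known RTT before concluding the server is slow.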

By the numbers

Make the critical-path formula concrete. The governing rule is:

critical_path_ms = Σ(serial_dependent_RTTs)        # serial dependencies stack
parallel_cost_ms = max(RTT_1, RTT_2, …, RTT_N)     # independent calls collapse to the slowest

Scenario: a product-detail page at an e-commerce site. The client is 50 ms from the origin server. Five API calls are needed before render:

Before / After timeline table

Step | Serial (before) | Optimised (after) | Notes
--- | --- | --- | ---
DNS | 60 ms | 8 ms | CDN PoP nearby resolves locally
TCP + TLS | 150 ms (3 RTTs) | 10 ms | Pre-warmed CDN edge connection
GET /session | 55 ms | 55 ms | Serial dependency; cannot be parallelised
GET /product | 65 ms (after session) | 65 ms (after session) | Depends on auth token from session
GET /reviews + GET /inventory + GET /recommendations | 60 + 58 + 70 = 188 ms (serial) | max(60, 58, 70) = 70 ms (parallel) | All depend only on product ID or session; dispatch together
Render | 80 ms | 80 ms | Same paint budget
Total to first render | 60+150+55+65+188+80 = 598 ms | 8+10+55+65+70+80 = 288 ms | 52% faster, entirely from network fixes, not server code

Formula:

serial_before  = DNS + TLS + session + product + reviews + inventory + recs + render
               = 60 + 150 + 55 + 65 + 60 + 58 + 70 + 80
               = 598 ms
parallel_after = DNS_cdn + TLS_warm + session + product + max(reviews, inventory, recs) + render
               = 8 + 10 + 55 + 65 + max(60, 58, 70) + 80
               = 8 + 10 + 55 + 65 + 70 + 80
               = 288 ms
RTTs_saved       = 5 serial hops → 3 serial hops (removed 2 RTTs × 50 ms = 100 ms saved from parallelism alone)
connection_saved = 150 ms − 10 ms = 140 ms saved from CDN termination
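The same arithmetic can be checked in code. The helper below is a sketch with an illustrative name and signature; the numbers are the ones from the table above.

```python
def timeline_ms(setup, chain, fan_out, render, parallel):
    """Sum serial steps; a fan-out costs max() when parallel, sum() when serial."""
    fan_out_cost = max(fan_out) if parallel else sum(fan_out)
    return sum(setup) + sum(chain) + fan_out_cost + render


chain = [55, 65]          # /session, then /product
fan_out = [60, 58, 70]    # /reviews, /inventory, /recommendations

before = timeline_ms(setup=[60, 150], chain=chain, fan_out=fan_out, render=80, parallel=False)
after = timeline_ms(setup=[8, 10], chain=chain, fan_out=fan_out, render=80, parallel=True)
print(before, after, f"{1 - after / before:.0%}")  # 598 288 52%
```

Changing `parallel` or the `setup` values one at a time shows how much each fix contributes on its own.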

Decision math (when to collapse a dependency chain): each removed serial hop saves one RTT. At RTT = 50 ms, collapsing the 3-call chain (reviews + inventory + recs) into one parallel step saves 2 × 50 = 100 ms of network wait regardless of server logic speed. By contrast, cutting server latency on each of those calls from 20 ms to 5 ms saves only 15 ms of wall-clock time once the calls run in parallel, 6.7× less than fixing the serial dependency (and still only 3 × 15 = 45 ms if the calls stay serial). The formula quantifies the decision:

savings_from_parallelism   = (N_parallel_calls - 1) × RTT        # remove N-1 serial hops
savings_from_server_tuning = N_parallel_calls × delta_server     # shave server ms on each
break_even: (N-1)×RTT > N×delta_server
  → at N=3, RTT=50ms: 100 ms > 3×delta → delta < 33 ms
  → server tuning only beats parallelism when you can shave >33 ms per call (rarely true)
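The break-even point can be computed for other values of N and RTT. This sketch follows the formula above, which compares against server tuning applied to calls that remain serial; the function names are illustrative.

```python
def parallelism_saving_ms(n_calls, rtt_ms):
    """Round-trip time removed by turning n serial calls into one parallel step."""
    return (n_calls - 1) * rtt_ms


def break_even_delta_ms(n_calls, rtt_ms):
    """Per-call server saving needed to match parallelism, if calls stay serial."""
    return parallelism_saving_ms(n_calls, rtt_ms) / n_calls


for n in (2, 3, 5):
    print(n, parallelism_saving_ms(n, 50), round(break_even_delta_ms(n, 50), 1))
```

The break-even saving grows with N and approaches one full RTT per call, so the longer the serial chain, the harder it is for server tuning to compete.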

Sources: web.dev — HTTP/2 and performance; Chrome DevTools Network reference; High Performance Browser Networking — Ilya Grigorik, O'Reilly.

🧠 Quick check

1. A page makes 6 API calls: 2 depend on the login session, and 4 are completely independent of any other call. What is the minimum number of network round trips to load all 6?

The session call must come first (1 RTT), and the 4 independent calls can be dispatched alongside it. The 2 session-dependent calls go out together once the session resolves (1 more RTT = 2 total). Parallelism collapses all calls at the same dependency depth into one round trip.

2. A JSON API response is 200 KB uncompressed. After enabling Brotli compression it becomes 28 KB. On a mobile connection with 2 Mbps available bandwidth, how much does this save in transfer time?

200 KB / (2 Mbps / 8 bits per byte) = 200 000 / 250 000 bytes/s = 0.8 s = 800 ms. 28 KB = 112 ms. Saving ≈ 688 ms. On constrained mobile connections, payload size directly translates to transfer time — compression matters enormously.

3. Which technique removes a network round trip from the critical path without changing the data the user ultimately sees?

Lazy loading removes the image requests from the critical path entirely — they happen after first render, so the user sees content without waiting for them. Compression and indexing speed up existing round trips but do not eliminate them.

4. Why does TLS termination at the CDN edge improve performance for a user far from the origin server?

TLS 1.3 takes 1 RTT to establish. At 10 ms (client to nearby PoP) that costs 10 ms. At 150 ms (client to distant origin) it costs 150 ms. The CDN PoP forwards over a pre-warmed, persistent connection. The RTT reduction is pure geography — physics, not software.

✍️ Exercise: diagnose and fix a slow page load
Scenario

A browser waterfall shows the following (all serial, no CDN): DNS 55 ms, TCP+TLS 210 ms, GET /api/user 95 ms, GET /api/feed 180 ms, GET /api/ads 60 ms, render 70 ms, then 12 images × 40 ms each loaded serially = 480 ms more. Total to first render: 670 ms (plus 480 ms for images). Identify the critical path, find the biggest win, and propose optimisations in priority order.

Model answer:

# Critical path to render (serial): DNS+TLS + /user + /feed + /ads + render
55 + 210 + 95 + 180 + 60 + 70 = 670 ms

# Optimisations in priority order (highest impact first):

1. Add CDN/edge — saves ~250 ms
   DNS: 55 ms → ~8 ms (nearby PoP)
   TCP+TLS: 210 ms → ~12 ms (warm edge connection)
   Saving: ~245 ms on the critical path

2. Parallelise /user, /feed, /ads — saves ~155 ms
   /user and /ads do not depend on each other. /feed may need user ID (1 serial dep).
   Option A: all 3 parallel if /feed only needs a session token → max(95, 180, 60) = 180 ms
   vs serial 95 + 180 + 60 = 335 ms. Saving: ~155 ms.

3. Lazy-load images — removes 480 ms from critical path entirely
   Images below the fold should not block render. Use loading="lazy" on <img> tags.
   First-render path is now: 8 + 12 + 180 + 70 = 270 ms (down from 670 ms)

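4. Merge /user + /ads into one BFF call: saves 1 RTT while the calls are still serial
   If a BFF endpoint returns both user + ad data in one call, you eliminate one round trip.
   Combined with parallelism: max(merged_call, /feed). Once the calls already run in parallel,
   the merge no longer shortens the critical path; it only reduces the request count.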

# What did NOT help:
Server CPU was not in the trace. No DB queries shown.
Optimising server logic would have zero effect on this critical path.

Rubric: ✓ correctly identifies the critical path ✓ CDN/edge as highest-impact first step ✓ parallelism as second step ✓ lazy loading for images ✓ explicitly notes server CPU was not the bottleneck. Five of five = full marks.

Key takeaways

Sources & further reading