Performance · Lesson 03
Speeding up page & API loads
The single most common performance mistake is optimising the server while the network is the bottleneck. This lesson gives you a systematic toolkit — critical path, round-trip reduction, payload trimming, connection reuse, CDN placement, parallelism, and lazy loading — in the order that actually moves the needle.
By the end you'll be able to
- Identify the critical path of a page or API call and distinguish it from non-critical work.
- Apply each major optimisation lever — round-trip reduction, payload size, connection reuse, CDN, parallelism, lazy loading — and know which to reach for first.
- Avoid the classic trap of tuning server CPU when network round trips dominate load time.
Start with the critical path
The critical path is the longest chain of dependent work between a user action and a visible response. Nothing else matters until the critical path is short. A page that makes 10 API calls, 8 of which are independent, has a critical path of 2 serial calls — optimise those, not the 8.
Think of it as a project timeline: the critical path is the sequence of tasks where a delay in any one delays the finish. Parallel tasks off the critical path can be slow without affecting the user-visible result.
Concretely: draw your call graph. Trace the longest serial chain from request to first meaningful paint. That chain is your budget to attack. See Lesson perf-02 for how to allocate time across the chain.
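To make "trace the longest serial chain" concrete, here is a minimal sketch that finds the critical path in a small call graph. The call names and durations are illustrative assumptions, not measurements, and the graph is assumed to have no cycles.

```python
from functools import lru_cache

# Hypothetical call graph: each call maps to (duration_ms, calls it must wait for).
CALLS = {
    "session": (55, []),
    "product": (65, ["session"]),
    "reviews": (60, ["product"]),
    "inventory": (58, ["product"]),
    "recommendations": (70, ["session"]),
}


def critical_path(calls):
    """Return (total_ms, chain) for the longest chain of dependent calls."""

    @lru_cache(maxsize=None)
    def finish(name):
        duration, deps = calls[name]
        # A call can start only once its slowest dependency has finished.
        best_ms, best_chain = max((finish(d) for d in deps), default=(0, ()))
        return best_ms + duration, best_chain + (name,)

    return max(finish(name) for name in calls)


total_ms, chain = critical_path(CALLS)
print(total_ms, " -> ".join(chain))  # 180 session -> product -> reviews
```

Everything not on the printed chain (inventory and recommendations here) can be slower or faster without moving the total, which is why the chain is the budget to attack.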
Why the network dominates
From the latency reference table (see Lesson 04): a cross-continental round-trip is ~100–150 ms, a cross-region round-trip is ~30–100 ms, and a same-data-center round-trip is ~0.5–1 ms. Your server's CPU processes a request in 2–10 ms. A typical client-to-server round-trip is 30–150 ms depending on geography.
The implication is stark: a single extra network round trip (30–150 ms) typically costs several times more than the server spends processing the whole request (2–10 ms), so it outweighs any plausible server-side optimisation. Reducing round trips is almost always the highest-leverage move before any server tuning.
Lever 1: reduce round trips
Every extra round trip is a latency tax paid at network speed. Common sources of unnecessary round trips:
- Redirect chains. HTTP 301 redirects add a full round trip each. Avoid chaining: http:// → https:// → https://www. costs two RTTs before any content loads. Serve the canonical URL directly.
- Multiple sequential API calls for one page. A page that makes a user-info call, waits, then makes a permissions call, waits, then makes content calls has serial round trips built into its architecture. Batch or aggregate these into a single BFF (backend-for-frontend) call where the user-agent only needs one trip.
- Synchronous auth handshakes. A new TLS connection adds at least one round trip on top of the TCP handshake. Subsequent requests on the same connection reuse it (see Lever 3 below), so avoiding new connections avoids the handshake cost.
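A rough cost model shows why aggregating serial calls behind a BFF helps. This is a sketch under assumed numbers: a 100 ms client RTT, a 1 ms RTT inside the data centre, and made-up server times for the three calls. The function names are hypothetical.

```python
def serial_client_calls_ms(client_rtt_ms, server_ms_per_call):
    """The client makes each call in turn: one client round trip per call."""
    return sum(client_rtt_ms + s for s in server_ms_per_call)


def bff_call_ms(client_rtt_ms, internal_rtt_ms, server_ms_per_call):
    """One client round trip; the BFF makes the same calls inside the data centre."""
    return client_rtt_ms + sum(internal_rtt_ms + s for s in server_ms_per_call)


# Assumed server times for user info, permissions, and content.
server_ms = [5, 5, 10]
print(serial_client_calls_ms(100, server_ms))  # 320
print(bff_call_ms(100, 1, server_ms))          # 123
```

The BFF still makes three calls, even serially in this model, but each one pays the same-data-center round trip rather than the client's.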
Lever 2: reduce payload size
Smaller payloads transmit faster and parse faster, especially on mobile connections where bandwidth is the constraint.
- Compression. Enable gzip or Brotli on your API responses. A 100 KB JSON response commonly compresses to 15–20 KB. The CPU cost is negligible; the bandwidth saving is real, especially for mobile. Brotli achieves ~15–20% better ratios than gzip at equivalent CPU cost.
- Field selection. If the client only needs five fields but the endpoint returns 50, you are transmitting and parsing roughly 10× more data than necessary. REST APIs can add a ?fields=id,name,avatar parameter; GraphQL makes this the default. On a list endpoint of 100 items, field trimming can reduce payload by 80%+.
- Pagination. Never return unbounded result sets (covered in depth in Lesson perf-04). Returning 1,000 items when the UI shows 20 is 50× the payload and serialisation cost.
- Binary formats. For high-throughput internal APIs, Protocol Buffers or MessagePack produce 3–10× smaller payloads than equivalent JSON. The trade-off is loss of human-readability.
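The effect of compression and field selection is easy to measure on your own payloads. The sketch below uses only the standard library (gzip rather than Brotli, which is not in the standard library) and a synthetic payload, so the ratios it prints are not representative of real data. The select_fields helper is a hypothetical stand-in for a ?fields= parameter.

```python
import gzip
import json


def select_fields(items, fields):
    """Keep only the requested keys, as a ?fields=... parameter would."""
    return [{k: item[k] for k in fields if k in item} for item in items]


# Synthetic list-endpoint payload: 100 items with 50 filler fields each.
items = [{f"field_{j}": f"value-{i}-{j}" for j in range(50)} for i in range(100)]
for i, item in enumerate(items):
    item.update(id=i, name=f"item-{i}", avatar=f"/img/{i}.png")

full = json.dumps(items).encode("utf-8")
trimmed = json.dumps(select_fields(items, ["id", "name", "avatar"])).encode("utf-8")

print("full:   ", len(full), "bytes; gzipped:", len(gzip.compress(full)))
print("trimmed:", len(trimmed), "bytes; gzipped:", len(gzip.compress(trimmed)))
```

Running the same measurement against a captured production response is a quick way to decide which of the two levers is worth reaching for first.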
Lever 3: connection reuse
TCP and TLS connections have setup costs. A cold TCP+TLS connection to an HTTPS server takes 2–3 round trips before the first byte of application data can flow:
- TCP SYN / SYN-ACK / ACK: 1 RTT
- TLS ClientHello / ServerHello / Finished: 1–2 RTTs (TLS 1.3 reduced this to 1)
On a 100 ms RTT link, that's 200–300 ms of overhead before your API call even starts (200 ms with TLS 1.3). Connection reuse eliminates this on subsequent requests. See Lesson 05 for the full socket lifecycle.
Practical implications: use HTTP/2 (multiplexes multiple requests over one connection) or HTTP/1.1 with Connection: keep-alive. From the client side, pool connections in your HTTP client rather than opening a new one per request. From the server side, configure appropriate keep-alive timeouts.
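As a back-of-the-envelope check, the following sketch compares five sequential requests with and without connection reuse. It is a simplified model under stated assumptions: each request costs exactly one RTT, server time is ignored, and the handshake costs 1 RTT for TCP plus 1 (TLS 1.3) or 2 (TLS 1.2) for TLS.

```python
def setup_overhead_ms(rtt_ms, tls_round_trips=1):
    """TCP handshake (1 RTT) plus TLS handshake (1 RTT for TLS 1.3, 2 for TLS 1.2)."""
    return rtt_ms * (1 + tls_round_trips)


def total_ms(n_requests, rtt_ms, reuse, tls_round_trips=1):
    """Sequential requests; each costs 1 RTT, plus setup on every cold connection."""
    setups = 1 if reuse else n_requests
    return setups * setup_overhead_ms(rtt_ms, tls_round_trips) + n_requests * rtt_ms


print(total_ms(5, 100, reuse=False))  # 1500
print(total_ms(5, 100, reuse=True))   # 700
```

In this model more than half of the no-reuse total is handshake time, which is the cost a pooled or multiplexed connection pays only once.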
Lever 4: CDN and edge placement
A CDN (Content Delivery Network) places servers at Points of Presence (PoPs) geographically close to users. The key insight: the speed of light is fixed — you cannot make a packet travel from Sydney to New York faster. You can serve the Sydney user from a Sydney PoP.
Three things CDNs give you:
- Static asset caching. JS, CSS, images, fonts served from a PoP 10 ms away instead of an origin 150 ms away. One-time cache miss pays origin cost; all subsequent requests are local.
- API response caching. For read-heavy endpoints with data that changes infrequently (product details, reference data, user-public profiles), caching the API response at the CDN edge turns every response into a ~10 ms local hit.
- TLS termination at the edge. The expensive TLS handshake happens between the user and the nearby PoP (~10 ms RTT), not between the user and the distant origin (~100 ms RTT). The PoP-to-origin connection is pre-warmed and persistent. This alone can save 200–400 ms on a cold connection to a distant origin.
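For API response caching, the origin usually tells the CDN what it may store through the Cache-Control response header. The helper below is a hypothetical sketch that assembles such a header from standard directives (s-maxage applies to shared caches such as a CDN; stale-while-revalidate lets a cache serve a stale copy while it refreshes). How a given CDN honours each directive varies, so treat the values as assumptions to check against your provider.

```python
def edge_cache_headers(edge_ttl_s, stale_s=0, browser_ttl_s=0):
    """Build a Cache-Control header for a response a shared cache (CDN) may store."""
    directives = ["public", f"max-age={browser_ttl_s}", f"s-maxage={edge_ttl_s}"]
    if stale_s:
        directives.append(f"stale-while-revalidate={stale_s}")
    return {"Cache-Control": ", ".join(directives)}


# Reference data that changes rarely: cache at the edge for 5 minutes.
print(edge_cache_headers(300, stale_s=60))
# {'Cache-Control': 'public, max-age=0, s-maxage=300, stale-while-revalidate=60'}
```

Only responses that are identical for every user are safe to mark public; per-user data needs a different policy.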
Lever 5: parallelise independent calls
If the critical path includes calls whose inputs do not depend on each other's outputs, dispatch them simultaneously rather than sequentially. Three 50 ms calls dispatched in parallel cost ~50 ms; dispatched serially they cost 150 ms.
The caveat (from Lesson perf-02): fan-out amplifies tail latency. If each of three parallel calls has a p99 of 50 ms and the calls are independent, the chance that at least one of them exceeds 50 ms is 1 − 0.99^3 ≈ 3%, so roughly 3% of requests hit at least one tail rather than 1%. Budget for the composite. In practice: if your parallelism is bounded (2–4 calls), the win almost always outweighs the tail risk.
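The tail amplification is quick to compute for other fan-out widths. This sketch assumes the calls are statistically independent, which real backends often are not.

```python
def p_any_tail(n_calls, quantile=0.99):
    """Probability that at least one of n independent calls exceeds its own quantile latency."""
    return 1 - quantile ** n_calls


for n in (1, 3, 10, 100):
    print(n, round(p_any_tail(n), 3))
```

At a fan-out of 3 the result is about 0.03; at 100 it is well over half, which is why bounded fan-out is the safe case.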
Lever 6: lazy loading
Not everything on a page needs to be loaded before the user sees anything useful. Lazy loading defers non-critical data until after the critical path has rendered:
- Below-the-fold content. Images and sections the user cannot yet see can be loaded on scroll, not on page load. On a long article page this can cut initial payload by 50–70%.
- Secondary API data. User avatar, notification count, recommendation panels — none of these are needed for the first meaningful render. Fetch them after the main content is shown.
- Low-priority background data. Telemetry, A/B experiment assignments, analytics enrichment — fire these after the critical path, not before.
Lazy loading shortens the critical path payload even if total data transfer is unchanged. The user sees a response faster; the rest loads progressively.
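The ordering that lazy loading imposes can be sketched with asyncio. The fetch function is a stand-in that sleeps instead of making a network call, and the names and delays are assumptions chosen for illustration.

```python
import asyncio

events = []


async def fetch(name, delay_s):
    """Stand-in for a network call; a real client would issue an HTTP request here."""
    await asyncio.sleep(delay_s)
    events.append(f"loaded {name}")
    return name


async def load_page():
    # Critical path: only the main content blocks the first render.
    await fetch("main content", 0.02)
    events.append("first render")
    # Deferred: secondary data is requested only after the render, in parallel.
    await asyncio.gather(
        fetch("notification count", 0.01),
        fetch("recommendations", 0.03),
    )


asyncio.run(load_page())
print(events)
```

The total work is unchanged; what moves is the position of "first render" in the event list.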
Diagram: critical path before and after optimisation
Worked example: optimising a product page API
# Before: three serial API calls from browser, no CDN, cold connections
Browser:
1. DNS resolve api.example.com 60 ms
2. TCP + TLS handshake (cold) 200 ms
3. GET /v1/auth/session 100 ms (blocking — client waits)
4. GET /v1/products/42 150 ms (serial on auth)
5. GET /v1/products/42/recommendations 120 ms (serial on product)
6. Render page 80 ms
─────
Total to render: 710 ms
# After: CDN, connection reuse, parallel calls, lazy recommendations
Browser:
1. DNS resolve (CDN PoP, nearby) 10 ms
2. TCP + TLS (CDN warm connection) 15 ms
3. GET /v1/auth/session ┐
GET /v1/products/42 ├ parallel max(30, 50) = 50 ms
┘
4. Render page 80 ms
─────
Critical path to render: 155 ms
5. GET /v1/products/42/recommendations ← lazy, after render
loaded in background, displayed when ready.
# Net saving: 710 ms → 155 ms to first render (78% improvement)
# Server CPU before and after: ~5 ms. Irrelevant to the outcome.
Teams spend weeks micro-optimising database queries, switching serialisers, or profiling hot loops — and then measure a 5 ms improvement on a 600 ms page load. The network round trips and TLS handshake cost 500 of those 600 ms. The server was never the bottleneck. Before tuning any server-side code, draw the full critical path (including all network hops) and verify which segment actually dominates. If it's network, fix network. If it's server, then tune server.
When asked "how would you make this page/API faster?", the most impressive answer starts with: "First I'd identify the critical path and see where time is actually spent — network hops, serial API calls, payload size, or server processing — because the right lever depends entirely on which term dominates." Then work through the levers in order: reduce round trips, add CDN/edge, parallelise independent calls, trim payloads, lazy-load non-critical data. Jumping straight to "add a cache" or "index the DB" without identifying the bottleneck is a junior signal.
Do measure first: use browser DevTools (Network tab, waterfall view) or distributed tracing to see the actual critical path. Optimise the largest block you can see. Don't assume the server is the problem because you can control it — network and client-side rendering often dwarf server time, and those are equally optimisable with CDN placement, connection reuse, and parallelism.
Under the hood: how it actually works
Critical-path mechanics come down to one rule: dependent round trips stack; independent round trips collapse. Understanding exactly what forces a new round trip — and what allows requests to share one — lets you design the waterfall rather than just observe it.
What forces a new round trip
A new network round trip is forced when:
- Data dependency: call B needs a value returned by call A (e.g. you need the session token before you can call a protected endpoint).
- Code dependency: B is dispatched inside the then()/await of A, whether or not it actually needs A's data. This is a very common accident.
- Resource discovery: the browser doesn't know a resource exists until it parses a document that references it (classic HTML/CSS/JS waterfall, where each document can reference more documents).
- Connection setup: a cold connection to a new host forces DNS + TCP + TLS before the first byte of data can flow — typically 2–3 RTTs added to that host's first request.
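The code-dependency accident is easiest to see side by side. In this sketch the call function sleeps for 50 ms to stand in for one round trip; asyncio.gather plays the role that Promise.all plays in browser code. All names are hypothetical.

```python
import asyncio
import time


async def call(name, delay_s=0.05):
    """Stand-in for an API call that takes about one round trip."""
    await asyncio.sleep(delay_s)
    return name


async def accidental_serial():
    # Neither call uses the other's result, yet the second waits for the first.
    reviews = await call("reviews")
    inventory = await call("inventory")
    return reviews, inventory


async def concurrent():
    # Both requests are in flight at once; wall time is about the slower one.
    return tuple(await asyncio.gather(call("reviews"), call("inventory")))


def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


serial_result, serial_s = timed(accidental_serial)
concurrent_result, concurrent_s = timed(concurrent)
print(f"serial {serial_s * 1000:.0f} ms, concurrent {concurrent_s * 1000:.0f} ms")
```

Both versions return the same data; only the second lets the two round trips overlap.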
Before/after waterfall with numbers
Concrete scenario: a product detail page. The user is 40 ms (round trip) from the CDN PoP and the CDN PoP is 80 ms from the origin, so the client-to-origin RTT is 120 ms. The original implementation has no CDN, no parallelism, and cold connections.
═══════════════════════════════════════════════════════════════════
BEFORE: no CDN, cold connections, serial API calls, no lazy load
═══════════════════════════════════════════════════════════════════
RTT to origin = 120 ms
Step             Duration     Cumulative   Notes
─────────────────────────────────────────────────────────────────
DNS              60 ms        60 ms        Full recursive lookup
TCP handshake    120 ms       180 ms       1 RTT (SYN/SYN-ACK)
TLS 1.3          120 ms       300 ms       1 RTT (ClientHello/ServerHello)
GET /session     120+5 ms     425 ms       1 RTT + 5 ms server auth logic
GET /product     120+15 ms    560 ms       serial on session; 15 ms DB
GET /reviews     120+10 ms    690 ms       serial on product; 10 ms DB
Render           80 ms        770 ms       parse + paint
Images (6)       120+20 ms    910 ms       one more serial step after render
─────────────────────────────────────────────────────────────────
Total to interactive: 910 ms
Total to first render: 770 ms
Critical path (serial chain): DNS→TCP→TLS→session→product→reviews→render
═══════════════════════════════════════════════════════════════════
AFTER: CDN edge, warm connections, parallel calls, lazy images
═══════════════════════════════════════════════════════════════════
RTT to CDN PoP = 40 ms (CDN is geographically close)
Step                  Duration                     Cumulative   Notes
─────────────────────────────────────────────────────────────────
DNS (CDN PoP)         8 ms                         8 ms         CDN handles DNS
TCP+TLS (CDN, warm)   10 ms                        18 ms        CDN pre-warmed connection
GET /session ┐
GET /product ├ parallel  max(40+5, 40+15) = 55 ms  73 ms        both at once
             ┘ (product at 55 ms is the slower sibling)
GET /reviews ─ parallel with above (no session dep) (inside the 55 ms window)
Render                80 ms                        153 ms       parse + paint
─────────────────────────────────────────────────────────────────
Total to first render: 153 ms (80% improvement over 770 ms)
Images: loaded lazily AFTER render, so they are not on the critical path.
User sees content at 153 ms; images fill in as they download.
# Server-side work on the critical path: 30 ms before (5 + 15 + 10) and 15 ms
# after (the /product call). That is about 4% of the 770 ms to first render
# before and about 10% of the 153 ms after. Not the bottleneck.
Why parallelising independent calls collapses the timeline
In the serial case, three 40 ms API calls add up to 120 ms — you pay three full round trips. When you dispatch all three simultaneously after the session resolves, you pay one round trip (the slowest sibling determines wall-clock cost). The other two complete inside that window "for free".
The key insight is that the graph of dependencies, not the list of calls, determines the minimum number of round trips. Formally: the minimum number of RTTs equals the length of the longest serial dependency chain in the call graph. Anything not on that chain can be collapsed into whichever serial step it depends on.
# Call graph for the product page (edges = "requires result of")
session ──→ product ──→ reviews
                    ├──→ related-items
                    └──→ inventory
# Longest serial chain: session → product → reviews = 3 RTTs minimum
# related-items and inventory depend only on product → dispatch with reviews
# Minimum RTTs = 3: (1) session, (2) product, (3) reviews + related-items + inventory
# If we add a "user preferences" call that depends ONLY on the session (not product):
session ──→ product ──→ reviews
        └──→ preferences   ← parallel with product, no extra RTT
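The call graph above can be turned into dispatch waves mechanically: a call's wave number is one more than the deepest call it requires, and the number of waves is the minimum number of round trips. A minimal sketch, assuming the graph has no cycles:

```python
def dispatch_waves(deps):
    """Group calls into waves; each wave needs only results from earlier waves."""
    depth = {}

    def level(name):
        if name not in depth:
            depth[name] = 1 + max((level(d) for d in deps[name]), default=0)
        return depth[name]

    waves = {}
    for name in deps:
        waves.setdefault(level(name), []).append(name)
    return [sorted(waves[k]) for k in sorted(waves)]


# Edges from the call graph above, written as "call: what it requires".
DEPS = {
    "session": [],
    "product": ["session"],
    "preferences": ["session"],
    "reviews": ["product"],
    "related-items": ["product"],
    "inventory": ["product"],
}

for i, wave in enumerate(dispatch_waves(DEPS), start=1):
    print(f"RTT {i}: {', '.join(wave)}")
# RTT 1: session
# RTT 2: preferences, product
# RTT 3: inventory, related-items, reviews
```

Adding preferences did not add a wave, which matches the "no extra RTT" note in the graph.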
A page that loads main.js, which contains an import('./feature.js') that dynamically imports another module, has a three-hop dependency chain just for JavaScript: HTML → main.js → feature.js. Each hop costs at least one full RTT. This is why bundling (collapsing many modules into one file), preload hints (<link rel="preload">), Early Hints, and HTTP/2 server push (since deprecated in Chrome) exist: they break the discovery waterfall so resources load in parallel rather than one-by-one as the parser encounters them.
How to debug & inspect it
Three tools give you the full waterfall picture: browser DevTools Network tab (for browser-initiated loads), curl -w (for server-side API timing), and distributed traces (for backend service calls). The critical-path analysis process is: render the waterfall, find the longest serial chain, identify the longest single bar on that chain, fix it.
Reading a DevTools / curl waterfall
In DevTools Network tab: select a request, look at the Timing sub-tab. The colour-coded bars map to:
| DevTools colour / label | What it measures | Optimisation lever |
|---|---|---|
| Queueing | Waiting for a connection or for the browser to schedule the request | HTTP/2 multiplexing; reduce concurrent requests per host |
| DNS Lookup | Resolving the hostname | CDN PoP (local resolver); DNS prefetch hints |
| Initial connection | TCP handshake | CDN edge; connection reuse; HTTP/2 |
| SSL | TLS handshake | CDN TLS termination at edge; TLS 1.3 (1-RTT); session resumption |
| Request sent / Waiting (TTFB) | Time to first byte — includes server processing and network return trip | Server-side optimisation; CDN caching; edge function |
| Content download | Body transfer time | Compression; field selection; pagination; binary formats |
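The same breakdown is available outside the browser. curl's -w option can print timing variables such as time_namelookup, time_connect, time_appconnect, time_starttransfer, and time_total, all measured in seconds from the start of the request. The sketch below converts such cumulative timestamps into per-phase durations; the sample values are made up, and the phase names are this sketch's own labels. It assumes an HTTPS request, since time_appconnect is 0 for plain HTTP.

```python
def phase_durations(t):
    """Convert curl's cumulative timestamps (seconds) into per-phase durations (ms)."""
    marks = [
        ("dns", "time_namelookup"),
        ("tcp", "time_connect"),
        ("tls", "time_appconnect"),
        ("wait_ttfb", "time_starttransfer"),
        ("download", "time_total"),
    ]
    phases, previous = {}, 0.0
    for phase, key in marks:
        # Each phase is the gap between its timestamp and the previous one.
        phases[phase] = round((t[key] - previous) * 1000, 1)
        previous = t[key]
    return phases


# Made-up sample, as might be captured with:
#   curl -s -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n' URL
sample = {
    "time_namelookup": 0.060,
    "time_connect": 0.180,
    "time_appconnect": 0.300,
    "time_starttransfer": 0.425,
    "time_total": 0.440,
}
print(phase_durations(sample))
# {'dns': 60.0, 'tcp': 120.0, 'tls': 120.0, 'wait_ttfb': 125.0, 'download': 15.0}
```

The phases line up with the DevTools rows in the table above, so the same lever column applies.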
Slow-load symptom → cause → fix table
| Symptom (what you see in waterfall) | Root cause | Fix |
|---|---|---|
| DNS + TCP + TLS together = 200–400 ms; server processing is small | Cold connection to a far-away origin; no CDN | Add CDN with PoP close to users; CDN terminates TLS at the edge, reducing per-request connection cost to ~10 ms |
| Several API calls form a staircase (each starts after the previous ends); could be parallel | Sequential await in client code despite no data dependency | Refactor to Promise.all / concurrent dispatch; collapse into a single BFF call if calls always happen together |
| A long chain of small requests, each starting after prior completes (e.g. JS import chains) | Resource discovery waterfall — each document references the next | Bundle JS modules; add <link rel="preload"> for critical resources; use HTTP/2 push or Early Hints for known dependencies |
| Content download bar is large even though TTFB is fast | Response body is large; bandwidth-limited (mobile) or uncompressed | Enable Brotli/gzip; implement field selection; paginate; use binary format for high-volume endpoints |
| Many requests start at the same time but only 6 complete per batch (staggered waves) | Browser per-host connection limit (browsers typically open at most 6 concurrent HTTP/1.1 connections per host) | Switch to HTTP/2 (multiplexes many concurrent streams on one connection); or shard static assets across subdomains (HTTP/1.1 only) |
| First meaningful paint is late; images and non-critical API data are on the critical path | Lazy loading not implemented; all resources fetched synchronously on page load | Add loading="lazy" to below-the-fold images; defer non-critical API calls until after first render; use skeleton screens |
| TTFB is fast on a warm cache but spikes 10× when cache is cold (deploy or low-traffic hour) | Cold database or in-process cache; first request after deploy pays full DB cost | Warm caches on deploy (cache priming); use stale-while-revalidate to serve cached data while refreshing in background |
Debug checklist — systematic critical-path analysis:
- Open DevTools Network tab; perform a hard reload (Shift+Refresh) to simulate a cold-start load; disable cache to get worst-case timing.
- Sort by start time; draw the dependency arrows: which request could not start until the previous one completed?
- Identify the long pole: the single request on the critical path with the longest bar. This is where to focus first.
- Expand the Timing breakdown of the long pole. Is most time in DNS/TLS (network problem), TTFB (server problem), or download (payload problem)? Each points to a different fix.
- Count serial round trips to first render. Compare to the theoretical minimum (length of the dependency chain). Every extra round trip is an opportunity.
- Check whether below-the-fold images and non-critical API calls have loading="lazy" or are deferred until after render.
- Run the same analysis with a throttled mobile connection (DevTools → Network → Slow 3G) to expose bandwidth-sensitive paths that are invisible on fast connections.
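The "which fix does the long pole point to" step of the checklist can be written down as a small rule. This is a hypothetical helper with made-up durations; note that TTFB always includes one network round trip, so a large TTFB only implicates the server when it clearly exceeds the RTT.

```python
def diagnose(phases_ms):
    """Return the dominant timing group for one request and the fixes it points to."""
    groups = {
        "network setup": sum(phases_ms.get(k, 0) for k in ("dns", "tcp", "tls")),
        "waiting (TTFB)": phases_ms.get("wait_ttfb", 0),
        "download": phases_ms.get("download", 0),
    }
    fixes = {
        "network setup": "CDN edge, connection reuse, TLS 1.3",
        "waiting (TTFB)": "server-side work, caching, distance to origin",
        "download": "compression, field selection, pagination",
    }
    dominant = max(groups, key=groups.get)
    return dominant, fixes[dominant]


# Hypothetical long-pole request, durations in ms.
long_pole = {"dns": 55, "tcp": 105, "tls": 105, "wait_ttfb": 95, "download": 12}
print(diagnose(long_pole))
# ('network setup', 'CDN edge, connection reuse, TLS 1.3')
```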
By the numbers
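Make the critical-path formula concrete. The governing rule is: dependent round trips add up, while independent round trips dispatched together cost only as much as the slowest one.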
Scenario: a product-detail page at an e-commerce site. The client is 50 ms from the origin server. Five API calls are needed before render:
- GET /session — must come first (auth gate): 50 ms RTT + 5 ms server = 55 ms
- GET /product — depends on session (needs auth token): 50 ms RTT + 15 ms DB = 65 ms
- GET /reviews — depends on product ID: 50 ms RTT + 10 ms DB = 60 ms
- GET /inventory — depends on product ID (parallel-able with /reviews): 50 ms RTT + 8 ms = 58 ms
- GET /recommendations — independent of everything except session: 50 ms RTT + 20 ms ML = 70 ms
Before / After timeline table
| Step | Serial (before) | Optimised (after) | Notes |
|---|---|---|---|
| DNS | 60 ms | 8 ms | CDN PoP nearby resolves locally |
| TCP + TLS | 150 ms (3 RTTs) | 10 ms | Pre-warmed CDN edge connection |
| GET /session | 55 ms | 55 ms | Serial dependency — cannot be parallelised |
| GET /product | 65 ms (after session) | 65 ms (after session) | Depends on auth token from session |
| GET /reviews + GET /inventory + GET /recommendations | 60 + 58 + 70 = 188 ms (serial) | max(60, 58, 70) = 70 ms (parallel) | All depend only on product ID or session — dispatch together |
| Render | 80 ms | 80 ms | Same paint budget |
| Total to first render | 60+150+55+65+188+80 = 598 ms | 8+10+55+65+70+80 = 288 ms | 52% faster — entirely from network fixes, not server code |
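Formula: time to first render = DNS + connection setup + Σ over serial stages of max(RTT + server time of each call in the stage) + render. For the optimised column: 8 + 10 + 55 + 65 + max(60, 58, 70) + 80 = 288 ms.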
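Decision math, when to collapse a dependency chain: each removed serial hop saves at least 1 RTT. At RTT = 50 ms, collapsing the three serial calls (reviews + inventory + recs) into one parallel step removes 2 hops and saves at least 2 × 50 = 100 ms regardless of server logic speed (118 ms in the table, 188 − 70, because the faster siblings' server time also leaves the path). By contrast, cutting server latency on the slowest of those calls from 20 ms to 5 ms saves only 15 ms, 6.7× less impact than fixing the serial dependency. The formula quantifies the decision: saving from collapsing ≈ (serial hops removed) × RTT, versus saving from server tuning ≈ (server time removed from the critical path).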
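The table totals can be reproduced in a few lines, which also makes it easy to try other groupings. This is a sketch of the arithmetic only; the function name and the list-of-stages representation are assumptions of this example.

```python
def time_to_render(dns_ms, connect_ms, stages, render_ms):
    """Sum the serial stages; a stage of parallel calls costs only its slowest call."""
    return dns_ms + connect_ms + sum(max(stage) for stage in stages) + render_ms


# Per-call times (RTT + server) from the scenario: session 55, product 65,
# reviews 60, inventory 58, recommendations 70.
before = time_to_render(60, 150, [[55], [65], [60], [58], [70]], 80)
after = time_to_render(8, 10, [[55], [65], [60, 58, 70]], 80)
print(before, after, f"{1 - after / before:.0%}")  # 598 288 52%
```

Each inner list is one serial stage, so moving a call between lists shows directly what a dependency costs.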
Sources: web.dev — HTTP/2 and performance; Chrome DevTools Network reference; High Performance Browser Networking — Ilya Grigorik, O'Reilly.
🧠 Quick check
1. A page makes 6 API calls: 2 depend on the login session, and 4 are completely independent of any other call. What is the minimum number of network round trips to load all 6?
The session call goes first (1 RTT), and every call that does not depend on it can be dispatched in that same round trip. The 2 session-dependent calls go out once the session resolves (1 more RTT), for 2 total. Parallelism collapses all calls that do not depend on each other into one round trip.
2. A JSON API response is 200 KB uncompressed. After enabling Brotli compression it becomes 28 KB. On a mobile connection with 2 Mbps available bandwidth, how much does this save in transfer time?
200 KB / (2 Mbps / 8 bits per byte) = 200 000 / 250 000 bytes/s = 0.8 s = 800 ms. 28 KB = 112 ms. Saving ≈ 688 ms. On constrained mobile connections, payload size directly translates to transfer time — compression matters enormously.
3. Which technique removes a network round trip from the critical path without changing the data the user ultimately sees?
Lazy loading removes the image requests from the critical path entirely — they happen after first render, so the user sees content without waiting for them. Compression and indexing speed up existing round trips but do not eliminate them.
4. Why does TLS termination at the CDN edge improve performance for a user far from the origin server?
TLS 1.3 takes 1 RTT to establish. At 10 ms (client to nearby PoP) that costs 10 ms. At 150 ms (client to distant origin) it costs 150 ms. The CDN PoP forwards over a pre-warmed, persistent connection. The RTT reduction is pure geography — physics, not software.
✍️ Exercise: diagnose and fix a slow page load
A browser waterfall shows the following (all serial, no CDN): DNS 55 ms, TCP+TLS 210 ms, GET /api/user 95 ms, GET /api/feed 180 ms, GET /api/ads 60 ms, render 70 ms, then 12 images × 40 ms each = 480 ms more. Total to first render: 670 ms, plus 480 ms for images before the page is complete. Identify the critical path, find the biggest win, and propose optimisations in priority order.
Model answer:
# Critical path to render (serial): DNS+TLS + /user + /feed + /ads + render
55 + 210 + 95 + 180 + 60 + 70 = 670 ms
# Optimisations in priority order (highest impact first):
1. Add CDN/edge: saves ~245 ms
DNS: 55 ms → ~8 ms (nearby PoP)
TCP+TLS: 210 ms → ~12 ms (warm edge connection)
Saving: 47 + 198 = ~245 ms on the critical path
2. Parallelise /user, /feed, /ads: saves up to ~155 ms
/user and /ads do not depend on each other. /feed may need user ID (1 serial dep).
Option A: all 3 parallel if /feed only needs a session token → max(95, 180, 60) = 180 ms
vs serial 95 + 180 + 60 = 335 ms. Saving: ~155 ms.
Option B: if /feed needs the user ID → /user, then /feed, with /ads alongside: 95 + 180 = 275 ms. Saving: ~60 ms.
3. Lazy-load images: keeps the 480 ms of image loading off the critical path
Images below the fold should not block render. Use loading="lazy" on <img> tags.
With steps 1 and 2 (Option A), the first-render path is now: 8 + 12 + 180 + 70 = 270 ms (down from 670 ms)
4. Merge /user + /ads into one BFF call: saves a request, and 1 RTT wherever the two would otherwise run serially
If a BFF endpoint returns both user + ad data in one call, you eliminate one round trip.
Combined with parallelism: max(merged_call, /feed).
# What did NOT help:
Server CPU was not in the trace. No DB queries shown.
Optimising server logic would have zero effect on this critical path.
Rubric: ✓ correctly identifies the critical path ✓ CDN/edge as highest-impact first step ✓ parallelism as second step ✓ lazy loading for images ✓ explicitly notes server CPU was not the bottleneck. Five of five = full marks.
Key takeaways
- The critical path is the longest serial chain of dependent work. Optimise the critical path; parallel non-critical work is irrelevant to user-perceived latency.
- The network dominates. A cross-region round trip is 30–100 ms; server CPU is 2–10 ms. Reducing round trips is almost always the highest-leverage move.
- CDN/edge termination cuts DNS, TLS handshake, and network latency by placing the TLS endpoint physically close to the user.
- Parallelising independent API calls collapses multiple serial round trips into one. Bounded fan-out (2–4 calls) is almost always worthwhile.
- Compression (Brotli/gzip) and field selection reduce payload size significantly — especially important on mobile where bandwidth constrains transfer time.
- Lazy loading removes non-critical data from the critical path; the user sees a response faster even though total data is the same.
- Don't tune server CPU when network dominates. Draw the full waterfall first; fix the largest block you can see.
Sources & further reading
- web.dev — Performance (comprehensive guide to web performance including critical path, resource loading, and Core Web Vitals)
- MDN — Web Performance (browser performance concepts, lazy loading, connection management)
- web.dev — Content delivery networks (CDNs)
- MDN — Connection management in HTTP/1.x (keep-alive, pipelining)
- Lesson 05 — Sockets & connections (TCP/TLS lifecycle)
- Lesson 04 — Latency vs throughput (reference latency numbers)
- Lesson perf-02 — Latency budgets (allocating time across the critical path)