API Design

Reliability & Scale · Lesson 07

Caching at Every Layer

A cache is the closest thing to a free lunch in distributed systems — the same byte of data, served a thousand times, need only be fetched once. But caches are only free when you get invalidation right, and that turns out to be one of the genuinely hard problems in computer science.

⏱ 14 min Difficulty: core Prereq: Circuit Breaker (rel-06)

By the end you'll be able to

Why caching exists: two problems at once

Every request that reaches your database costs money in CPU, I/O, and time. Popular data such as a product page, a user profile, or a configuration value is read far more often than it is written. Access is usually heavily skewed: as a rule of thumb, roughly 20% of your data accounts for roughly 80% of your reads. A cache exploits this asymmetry by keeping the popular 20% in fast memory so the database never sees most requests at all.

The analogy: a reference librarian who gets asked the same twenty questions every day eventually keeps a cheat sheet on her desk. The first asker went to the stacks; everyone after that gets an instant answer from the desk. The database is the stacks. The cache is the cheat sheet.

Two distinct problems are solved simultaneously:

The five layers of caching

Caching is not a single thing — it is a stack of independent stores, each closer to the user than the last. A request that misses every cache falls all the way to the database; one that hits the first layer never touches the network.

[Figure: caching layers and typical latency. Browser (HTTP cache, service worker, localStorage) ~0 ms → CDN / Edge (Cloudflare, Fastly, etc.; geo-distributed) 1–20 ms → API Gateway (per-route/TTL response cache, auth-aware) 5–50 ms → App cache (in-process map, Redis / Memcached; object, fragment) 0.5–5 ms → Database (buffer pool, query result cache) 5–100 ms. Inset: ETag / 304 revalidation flow between a client holding a stale copy and the server.]
Top: the five caching layers between a user and the database. A cache miss falls through each layer until it reaches the DB; a hit at any layer stops the traversal. Bottom: the ETag revalidation flow — the client asks "do you have anything newer than this ETag?" and gets a 304 with no body if the cached copy is still fresh.

HTTP caching: Cache-Control, ETag, and 304

HTTP has a built-in caching protocol that browsers, CDNs, and reverse proxies all understand. The core mechanism is the Cache-Control response header.

# Response from an API endpoint
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-a1b2c3d4"
Last-Modified: Fri, 19 Jun 2026 08:00:00 GMT
Vary: Accept-Encoding

{ "product_id": 9, "name": "Titanium Frame", "price": 499 }

# Client re-requests 6 minutes later (past max-age=300)
GET /v1/products/9 HTTP/1.1
If-None-Match: "v3-a1b2c3d4"

# Server — data unchanged — responds without a body
HTTP/1.1 304 Not Modified
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-a1b2c3d4"

Key directives:

| Directive | Meaning |
| --- | --- |
| public | Any cache (browser, CDN) may store this response. |
| private | Only the end-user's browser may cache; CDNs must not. |
| no-store | Nothing may cache this, ever. Use for sensitive data. |
| no-cache | May cache but must revalidate with the server before serving. Despite the name, it does not prevent caching. |
| max-age=N | The response is fresh for N seconds from when it was received. |
| s-maxage=N | Like max-age but applies only to shared caches (CDNs). Overrides max-age for CDNs. |
| stale-while-revalidate=N | Serve the stale response immediately while fetching a fresh one in the background; useful for low-latency APIs where a slightly stale response is acceptable. |
| must-revalidate | Once stale, must not serve the stale copy; must revalidate or return 504. |

The ETag header is a version token for the resource — a hash, a version number, or any opaque string that changes whenever the data changes. On subsequent requests the client sends If-None-Match: "<etag>". The server compares this to the current ETag. If unchanged, it returns 304 Not Modified with no body — saving the bandwidth of the full response. If changed, it returns 200 OK with the new body and a new ETag.

The bandwidth saving is significant: a 304 response carries only headers, not the payload. A 50 KB JSON response becomes a few hundred bytes when revalidation succeeds.
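To make the 200-versus-304 decision concrete, here is a minimal Python sketch of the server side. The names make_etag and respond are hypothetical, the ETag is assumed to be a truncated SHA-256 of the body (the text allows any opaque token), and splitting If-None-Match on commas is a simplification that suits hash-style tags.

```python
import hashlib
import json
from typing import Optional


def make_etag(body: bytes) -> str:
    """Derive an ETag from the response body (quoted, as HTTP requires)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> tuple[int, dict[str, str], bytes]:
    """Return (status, headers, body) for a GET, honouring If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=300"}
    if if_none_match is not None:
        # The header may carry several comma-separated tags, or "*".
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        # Treat weak validators (W/"...") as equal to their strong form.
        candidates = [tag[2:] if tag.startswith("W/") else tag for tag in candidates]
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # validators only, no payload
    return 200, headers, body


product = json.dumps({"product_id": 9, "name": "Titanium Frame", "price": 499}).encode()
status, headers, payload = respond(product, None)           # first fetch
status2, _, payload2 = respond(product, headers["ETag"])    # revalidation
print(status, len(payload), status2, len(payload2))
```

The second call returns an empty body because the tag the client presented still matches; changing any byte of the product changes the hash and the same call would return 200 with the new payload.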

Write strategies: getting data into the cache

Reading from a cache is simple. The hard question is when and how the cache gets updated after a write. There are three main strategies:

Cache-aside (lazy loading)

The application is responsible for the cache. On a read, it checks the cache; if it misses, it reads from the database and populates the cache. On a write, it writes to the database and then invalidates (deletes) the cache entry, so the next read re-populates it with fresh data.

Pros: only data that is actually read gets cached — no wasted memory for cold data. Easy to implement. Cons: the first read after a write always hits the database (cache miss). Two operations (DB write + cache delete) are not atomic; a failed invalidation leaves stale data.

Write-through

On every write, the application writes to both the cache and the database synchronously, so the cache never lags behind the database and reads of previously written keys always hit. Cons: every write pays the overhead of two stores. Data that is never read is still cached, wasting memory.

Write-back (write-behind)

Writes go to the cache immediately; the database is updated asynchronously in a background process. Very fast write latency. Cons: if the cache node fails before the DB write commits, data is lost. Complex to implement correctly. Use primarily for high-throughput counters, analytics aggregations, and other loss-tolerant workloads.

| Strategy | Read latency | Write latency | Durability | Best for |
| --- | --- | --- | --- | --- |
| Cache-aside | Miss = slow | Fast (1 write) | High | General-purpose; read-heavy |
| Write-through | Always fast | Slower (2 writes) | High | Consistent reads, low write volume |
| Write-back | Always fast | Very fast | Risk of loss | Counters, analytics, loss-tolerant |
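The difference between the two write-path strategies is easiest to see side by side. The sketch below is a simplified, single-process illustration: WriteThroughStore and WriteBackStore are hypothetical classes, both "database" and "cache" are plain dictionaries, and flush() stands in for the asynchronous background process.

```python
class WriteThroughStore:
    """Every write updates the database and the cache before returning."""

    def __init__(self) -> None:
        self.db: dict[str, int] = {}
        self.cache: dict[str, int] = {}

    def write(self, key: str, value: int) -> None:
        self.db[key] = value      # durable store first
        self.cache[key] = value   # then keep the cache in step


class WriteBackStore:
    """Writes land in the cache; a later flush() persists them in bulk."""

    def __init__(self) -> None:
        self.db: dict[str, int] = {}
        self.cache: dict[str, int] = {}
        self.dirty: set[str] = set()  # keys whose latest value is not yet in the db

    def write(self, key: str, value: int) -> None:
        self.cache[key] = value
        self.dirty.add(key)       # remember that the database is behind

    def flush(self) -> int:
        """Persist dirty keys; return how many database writes were needed."""
        flushed = 0
        for key in list(self.dirty):
            self.db[key] = self.cache[key]
            flushed += 1
        self.dirty.clear()
        return flushed


wb = WriteBackStore()
for views in range(1000):
    wb.write("views:9", views)    # 1000 fast cache writes, zero database writes
print(wb.flush())                 # database writes needed to catch up
```

A thousand updates to one counter collapse into a single database write, which is where the write-back speed comes from. Everything still sitting in dirty when the process dies is lost, which is the durability risk in the table.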

Invalidation: the genuinely hard problem

Phil Karlton's famous quip — "There are only two hard things in computer science: cache invalidation and naming things" — is funny because it is true. A cache that holds data nobody reads is harmless. A cache that serves wrong data to users is a production incident.

The three main invalidation approaches:

Cache stampede (thundering herd on expiry)

A cache stampede occurs when a popular cached item expires and dozens of concurrent requests all miss simultaneously, all query the database at once, all compute the same answer, and all try to write it back to the cache. The database experiences a sudden spike for exactly the duration of one slow query.

Mitigations:
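Two mitigations this lesson names in its quick check, TTL jitter and a per-key mutex, can be sketched in Python as below. Treat it as a single-process illustration under assumptions: get_or_compute and jittered_ttl are hypothetical helpers, the cache is a plain dictionary, and a multi-node deployment would need a distributed lock rather than threading.Lock.

```python
import random
import threading
import time
from typing import Any, Callable

_cache: dict[str, tuple[Any, float]] = {}   # key -> (value, expires_at)
_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()


def jittered_ttl(base_ttl: float, spread: float = 0.1) -> float:
    """Randomize the TTL by +/- spread so hot keys do not expire together."""
    return base_ttl * random.uniform(1 - spread, 1 + spread)


def _lock_for(key: str) -> threading.Lock:
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())


def get_or_compute(key: str, compute: Callable[[], Any], base_ttl: float = 300) -> Any:
    entry = _cache.get(key)
    if entry is not None and entry[1] > time.monotonic():
        return entry[0]                      # fresh hit, no locking needed
    with _lock_for(key):                     # only one thread recomputes
        entry = _cache.get(key)              # re-check: another thread may have filled it
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]
        value = compute()                    # the expensive database query
        _cache[key] = (value, time.monotonic() + jittered_ttl(base_ttl))
        return value
```

With the lock in place, concurrent requests for an expired key queue behind the first one and then read its result from the cache, so the database sees one query instead of dozens. The jitter addresses a different case: many keys written at the same moment that would otherwise all expire in the same second.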

🎯 Interview angle

"What would you cache and where?" is one of the most common system-design interview questions. A strong answer layers the response: (1) static/public content (product images, JS bundles) → CDN with long TTL; (2) per-route API responses for unauthenticated users → CDN or gateway cache, short TTL; (3) computed aggregates (user feed, search results) → Redis with TTL; (4) hot database rows (user session, config) → in-process LRU or Redis; (5) never cache authenticated per-user data at CDN (use Cache-Control: private). Mentioning invalidation trade-offs and stampede mitigation earns senior marks.

⚠️ Common trap

Caching authenticated or personalized data at a shared layer. Serving User A's account balance to User B because the response was cached at the CDN is a privacy incident. Use Cache-Control: private for any response that contains user-specific data. Only responses that are identical for all users belong in a shared (CDN or gateway) cache.

Stale data at scale. A 5-minute TTL sounds harmless. But a pricing update at minute 0 that doesn't reach users until minute 5 can mean thousands of orders placed at the wrong price. Match your TTL to the business tolerance for staleness — not to the "sounds reasonable" heuristic.

✅ Do this, not that

Do: set Vary: Accept-Encoding (and other relevant request headers) so CDNs store separate cache entries for gzip vs. non-gzip responses. Don't: omit Vary and let the CDN serve a gzip-compressed response to a client that didn't send Accept-Encoding: gzip — the client will show the user raw compressed garbage. Also: always set a TTL. A cache entry with no expiry stays forever, even after the data it represents has been deleted.

Under the hood: how the HTTP cache decision actually works

When the browser (or a CDN) receives a response, it runs a deterministic decision algorithm before touching the network again. Here is that algorithm traced end-to-end for a single resource.

Step 1 — Parse Cache-Control and compute the freshness lifetime

The cache reads the directives in order of precedence: s-maxage (shared caches only) > max-age > Expires header > heuristic freshness (typically 10% of Last-Modified age, capped). The result is a single integer: the freshness lifetime in seconds.

# Response received at t=0:
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-a1b2c3d4"
Date: Sat, 20 Jun 2026 10:00:00 GMT

# Computed:
freshness_lifetime = 300 s (from max-age)
stale_grace = 60 s (from stale-while-revalidate)
stored_at = 1781949600 (unix)

Step 2 — Serve from cache while fresh

On every subsequent request, the cache computes current age = now − stored_at + Age_header_value. If current_age < freshness_lifetime, the response is fresh: serve it immediately, no network. The server sees zero bytes.
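Steps 1 and 2 can be sketched in Python roughly as follows. The function names are hypothetical, the Cache-Control parser is deliberately naive (it ignores quoted commas), and the heuristic branch applies the 10% rule without the cap that real caches add.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_cache_control(value: str) -> dict[str, Optional[str]]:
    """Split 'public, max-age=300' into {'public': None, 'max-age': '300'}."""
    directives: dict[str, Optional[str]] = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg.strip('"') or None
    return directives


def freshness_lifetime(headers: dict[str, str], shared_cache: bool) -> int:
    """Freshness lifetime in seconds: s-maxage > max-age > Expires > heuristic."""
    cc = parse_cache_control(headers.get("Cache-Control", ""))
    if shared_cache and cc.get("s-maxage"):
        return int(cc["s-maxage"])
    if cc.get("max-age"):
        return int(cc["max-age"])
    date = parsedate_to_datetime(headers["Date"]) if "Date" in headers else None
    if date and "Expires" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        return max(0, int((expires - date).total_seconds()))
    if date and "Last-Modified" in headers:
        modified = parsedate_to_datetime(headers["Last-Modified"])
        return max(0, int((date - modified).total_seconds() * 0.10))
    return 0  # nothing to go on: treat as immediately stale


def is_fresh(headers: dict[str, str], shared_cache: bool, stored_at: float, now: float) -> bool:
    """current_age = now - stored_at + Age header; fresh while below the lifetime."""
    current_age = now - stored_at + int(headers.get("Age", "0"))
    return current_age < freshness_lifetime(headers, shared_cache)
```

For the example response, freshness_lifetime returns 300 for both browser and CDN because no s-maxage is present; adding s-maxage=600 would change the answer only when shared_cache is True.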

Step 3 — On expiry, revalidate with If-None-Match

At t=301 s the cached copy becomes stale. Rather than fetching the full body again, the cache sends a conditional GET using the stored ETag:

# Conditional request (t = 301 s):
GET /v1/products/9 HTTP/1.1
Host: api.example.com
If-None-Match: "v3-a1b2c3d4"

# Server checks: is the current ETag still "v3-a1b2c3d4"?
# Product price hasn't changed → YES → responds with NO BODY:
HTTP/1.1 304 Not Modified
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-a1b2c3d4"
Date: Sat, 20 Jun 2026 10:05:01 GMT

# Bytes exchanged: request ≈ 120 B, 304 response ≈ 180 B
# vs a full 200:   request ≈ 120 B, 200 response ≈ 48 320 B (48 KB JSON)
# Bandwidth saved: ~48 KB per revalidation, a 99.4% reduction

The cache resets its freshness timer and continues serving from its copy. If the product had changed, the server returns 200 OK with the new body and a new ETag, and the cache replaces its stored copy.

The stale-while-revalidate grace window

Between t=300 and t=360 the entry is stale-but-within-grace. The cache serves the stale copy immediately (zero latency impact) and fires the conditional GET in the background. The user sees no delay; the next request (after the background fetch completes) gets the fresh copy. This is how high-traffic APIs can have a 5-minute cache but still feel instant during revalidation.
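The three states an entry passes through (fresh, stale within grace, fully stale) reduce to two comparisons. The sketch below uses hypothetical names (CacheState, classify) and takes the already computed current age, freshness lifetime, and grace window as inputs.

```python
from enum import Enum


class CacheState(Enum):
    FRESH = "serve from cache, no network"
    STALE_WHILE_REVALIDATE = "serve stale copy, revalidate in background"
    STALE = "revalidate before serving"


def classify(current_age: float, freshness_lifetime: float, stale_grace: float) -> CacheState:
    """Map an entry's age onto the state that decides how the cache responds."""
    if current_age < freshness_lifetime:
        return CacheState.FRESH
    if current_age < freshness_lifetime + stale_grace:
        return CacheState.STALE_WHILE_REVALIDATE
    return CacheState.STALE


# Timeline for max-age=300, stale-while-revalidate=60:
for age in (0, 299, 300, 359, 360):
    print(age, classify(age, 300, 60).name)
```

Only the last state makes the user wait on the origin; during the grace window the revalidation cost is paid off the request path.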

Cache key and Vary

The cache key is not just the URL. A Vary header tells the cache which request headers are part of the key. Vary: Accept-Encoding means the cache stores separate entries for gzip-compressed and uncompressed responses to the same URL. Vary: Authorization would defeat a shared cache: every user would get their own entry, so almost nothing is ever reused. The cache key algorithm is:

cache_key = method + url + SORTED(vary_header_values_from_request)
# e.g.:  GET:/v1/products/9:gzip   ≠   GET:/v1/products/9:identity
[Figure: cache decision flowchart. Request arrives at cache → cached copy exists and is fresh? Yes: serve from cache, no network. Stale or not cached: send conditional GET (If-None-Match) → 304 Not Modified: reset timer, reuse body; 200 OK + new body: replace stored copy. no-store / private: fetch in full, don't cache.]
The cache decision flowchart. "Fresh" means current_age < freshness_lifetime. A conditional GET with If-None-Match costs only header bytes when the resource hasn't changed.
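The cache key rule from the Vary discussion above can be written out in a few lines of Python. This is a simplified sketch: cache_key is a hypothetical helper, header values are compared as raw strings, and production caches typically normalize values such as Accept-Encoding first so that "gzip, br" and "br, gzip" do not create separate entries.

```python
def cache_key(method: str, url: str, request_headers: dict[str, str], vary: str) -> str:
    """Build a cache key from the method, the URL, and the request headers named by Vary."""
    # Header names are case-insensitive, so index the request by lowercase name.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    # Sort the Vary list so "A, B" and "B, A" produce the same key.
    varied = sorted(h.strip().lower() for h in vary.split(",") if h.strip())
    parts = [method.upper(), url]
    for header in varied:
        # A missing header is part of the key too: it selects its own variant.
        parts.append(f"{header}={lowered.get(header, '').strip()}")
    return "|".join(parts)


gzip_key = cache_key("GET", "/v1/products/9", {"Accept-Encoding": "gzip"}, "Accept-Encoding")
plain_key = cache_key("GET", "/v1/products/9", {}, "Accept-Encoding")
print(gzip_key)
print(plain_key)
```

The two printed keys differ only in the Accept-Encoding component, which is what keeps a compressed body from being served to a client that never asked for one.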

How to debug & inspect caching

The fastest tool is curl -I (short for --head), which sends a HEAD request and prints only the response headers. Add -H "Cache-Control: no-cache" to force a fresh fetch and compare the headers before and after TTL expiry.

# 1. Check what the origin server sends:
$ curl -sI https://api.example.com/v1/products/9
HTTP/2 200
cache-control: public, max-age=300, stale-while-revalidate=60
etag: "v3-a1b2c3d4"
x-cache: MISS    # CDN header: first hit, fetched from origin
age: 0

# 2. Repeat after a few seconds; it should now be a CDN hit:
$ curl -sI https://api.example.com/v1/products/9
HTTP/2 200
cache-control: public, max-age=300
etag: "v3-a1b2c3d4"
x-cache: HIT     # served from CDN edge
age: 47          # 47 s since origin fetch

# 3. Force revalidation manually:
$ curl -sI https://api.example.com/v1/products/9 \
    -H "If-None-Match: \"v3-a1b2c3d4\""
HTTP/2 304       # no body, bandwidth saved

# 4. Cache-bust a CDN by changing the URL query param:
$ curl -sI "https://api.example.com/v1/products/9?v=4"
HTTP/2 200
x-cache: MISS    # new cache key, fresh fetch

In Chrome DevTools: open Network, click the request, and look at the Response Headers section of the Headers tab. The x-cache header (CDN) and age header reveal whether the CDN served the response and how old it was. A "(disk cache)" label in the Size column means the browser served it without any network request at all.

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Response is never cached at CDN (x-cache: MISS every time) | Cache-Control: private or no-store on origin; or Set-Cookie header present (most CDNs skip caching when cookies are set) | Check origin response headers; remove unnecessary Set-Cookie from cacheable endpoints; add public directive |
| Stale data persists after a deploy / data update | Old max-age hasn't expired; CDN hasn't been purged; browser still has a fresh copy | Purge CDN via API after deploy; use cache-busting query param (?v=<git-sha>) for static assets; shorten max-age for mutable resources |
| Different users see each other's responses at CDN | User-specific response cached without private; or CDN ignores Authorization header | Add Cache-Control: private on any response that varies by user; add Vary: Authorization if you must cache per-token at a shared layer |
| Browser ignores 304 and re-downloads the full body | ETag format mismatch (server changed ETag format between releases); ETag not sent on 304 response | Ensure 304 response echoes the same ETag; keep ETag format stable across deploys |
| CDN caches a gzip response and serves it to a client that sent no Accept-Encoding | Missing Vary: Accept-Encoding | Add Vary: Accept-Encoding on all compressed responses |
| Every request reaches the origin even with correct headers | Cache-Control: no-cache from the request (browser hard-refresh, Ctrl+Shift+R) bypasses the CDN; or CDN is misconfigured to pass all requests through | For CDN bypass from hard-refresh: normal, the browser sends Cache-Control: no-cache on purpose. For always-miss at CDN: check CDN rule configuration |

Debug checklist for "why isn't this cached?"

  1. Run curl -sI <url> and read Cache-Control, Vary, ETag, x-cache, and age headers.
  2. Check for Set-Cookie on the response — most CDNs bypass caching whenever a cookie is set.
  3. Confirm the public directive is present and no-store / private are absent.
  4. If caching per content-encoding (gzip vs plain), verify Vary: Accept-Encoding is set.
  5. For "stale after deploy": purge the CDN and verify the ETag changed in the new response.
⚠️ Cache-busting gone wrong

A common pattern for "force fresh after deploy" is appending a build hash to API URLs: /v1/products/9?v=abc123. This works — but if you forget to also update the URL the clients call, they keep requesting the old URL and hitting the cached copy. The safer pattern for mutable API responses is a short max-age (30–300 s) with stale-while-revalidate rather than cache-busting by URL, which is best reserved for immutable assets like JS bundles.

By the numbers

Scenario: a product catalogue API at 10 000 req/s. Redis cache latency L_cache = 1 ms; origin (PostgreSQL) latency L_origin = 50 ms. Current hit ratio h = 0.90 (90% of requests served from cache).

Effective latency formula

effective_latency = h × L_cache + (1 − h) × L_origin

At h = 0.90:

effective_latency = 0.90 × 1 ms + 0.10 × 50 ms
                  = 0.9 ms + 5.0 ms
                  = 5.9 ms
# vs. 50 ms if there were no cache at all: an 8.5× latency improvement.

Hit ratio table: latency and origin load

Origin load is the fraction of requests that still reach the database, (1 − h). At a 10 000 req/s baseline:

| Hit ratio h | effective_latency (ms) | Origin fraction (1−h) | Origin QPS (of 10 000) | Latency vs. no cache |
| --- | --- | --- | --- | --- |
| 0.50 | 0.50×1 + 0.50×50 = 25.5 ms | 50% | 5 000 | 2.0× improvement |
| 0.80 | 0.80×1 + 0.20×50 = 10.8 ms | 20% | 2 000 | 4.6× improvement |
| 0.90 | 0.90×1 + 0.10×50 = 5.9 ms | 10% | 1 000 | 8.5× improvement |
| 0.99 | 0.99×1 + 0.01×50 = 1.5 ms | 1% | 100 | 33× improvement |

The origin QPS column shows why caching is a load-reduction tool as much as a latency tool: at h=0.99 the database sees only 100 req/s instead of 10 000. Sources: AWS — Caching best practices; Redis latency benchmarks: Redis — Benchmarks.
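The table can be regenerated, or re-run with your own latencies, using a few lines of Python. The function names are hypothetical; the defaults are the scenario's 1 ms cache, 50 ms origin, and 10 000 req/s.

```python
def effective_latency(h: float, l_cache_ms: float = 1.0, l_origin_ms: float = 50.0) -> float:
    """Average latency when a fraction h of requests is served from the cache."""
    return h * l_cache_ms + (1 - h) * l_origin_ms


def origin_qps(h: float, total_qps: int = 10_000) -> float:
    """Requests per second that still reach the database."""
    return (1 - h) * total_qps


for h in (0.50, 0.80, 0.90, 0.99):
    latency = effective_latency(h)
    print(f"h={h:.2f}  latency={latency:5.2f} ms  "
          f"origin={origin_qps(h):6.0f} req/s  speedup={50.0 / latency:4.1f}x")
```

The last row prints the unrounded 1.49 ms, so its speedup comes out slightly higher than the table's 33×, which was computed from the rounded 1.5 ms.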

Decision math: cache size vs. working set to reach a target hit ratio

Cache hit ratio is dominated by whether the working set (the hot keys actually requested) fits in the cache. How much of the key space that takes depends on how skewed the access pattern is. Under the 80/20 rule of thumb, caching the top 20% of keys by request frequency gives h ≈ 0.80. Reaching h = 0.90 or 0.99 requires caching more keys than that, never fewer, because each additional key is colder than the one before it.

Worked example: the catalogue has 1 000 000 products. Average cached value size = 2 KB. To cover the top 20% of keys:

keys_needed = 1 000 000 × 0.20 = 200 000 keys
cache_size  = 200 000 × 2 KB = 400 MB
# A 400 MB working set fits easily on a single r7g.large node (~13 GB usable)
# and gives h ≈ 0.80 under the 80/20 assumption.
# Upper bound: caching all 1 000 000 keys takes 2 GB, which the same node can also hold.

For this catalogue the whole dataset is only 2 GB, so memory is not the binding constraint; TTLs and invalidation are. For larger datasets, size the cache from the measured access distribution. If f(k) is the fraction of requests that go to the k most popular keys, the minimum number of cache entries for a target hit ratio h is:

min_keys_cached = smallest k such that f(k) ≥ h   # f is measured from access logs
cache_bytes     = min_keys_cached × avg_value_size
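As a sanity check on sizing, the sketch below computes how many of the most popular keys must be cached to reach a target hit ratio. It assumes an idealized Zipf distribution in which the key at popularity rank r receives traffic proportional to 1/r^s with s = 1; real workloads have their own skew, so the numbers are illustrative. keys_for_hit_ratio is a hypothetical helper name.

```python
def keys_for_hit_ratio(total_keys: int, target_h: float, exponent: float = 1.0) -> int:
    """Smallest k such that caching the k most popular keys yields a hit ratio >= target_h."""
    # Relative request weight of each key, most popular first.
    weights = [1.0 / rank ** exponent for rank in range(1, total_keys + 1)]
    total = sum(weights)
    covered = 0.0
    for k, weight in enumerate(weights, start=1):
        covered += weight
        if covered / total >= target_h:
            return k
    return total_keys


AVG_VALUE_KB = 2
for target in (0.80, 0.90, 0.99):
    k = keys_for_hit_ratio(1_000_000, target)
    print(f"h={target:.2f}: {k} keys, about {k * AVG_VALUE_KB / 1000:.0f} MB")
```

The required key count rises with the target hit ratio, and steeply so near the top: under this assumption the closed-form approximation H_k ≈ ln k + 0.577 suggests on the order of 56 000 keys for h = 0.80, 237 000 for h = 0.90, and 866 000 for h = 0.99 out of 1 000 000.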

When caching pays off

Caching is worth its operational cost when both conditions hold:

| Condition | Why it matters | Counter-example (don’t cache) |
| --- | --- | --- |
| Read-heavy: reads ≫ writes | Cache amortizes the miss cost over many hits. If every key is written once and read once (h≈0), the overhead of cache misses and invalidations outweighs any benefit. | Event ingestion pipeline: each row written once, never re-read |
| Repeated keys: same keys hit frequently | A uniform-random access pattern has h ≈ cache_size / total_keys, which is tiny. Zipfian access (popular items are much more popular) gives high h with small cache. | Time-series sensor data: each timestamp is unique, no key repeats |

Break-even: if every request checks the cache first, a miss costs L_cache + L_origin, so caching pays off when h × L_origin > L_cache + invalidation_overhead. With L_origin = 50 ms, L_cache = 1 ms, and negligible invalidation overhead, break-even is at h > 1/50 = 2%. Any hit ratio above 2% makes caching faster than going to origin every time, which is why even a small in-process LRU cache is almost always worthwhile for read-heavy endpoints.
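The break-even inequality can be checked numerically. The sketch assumes the look-aside cost model in which every request pays the cache lookup and misses additionally pay the origin; the function names and the overhead_ms parameter are illustrative.

```python
def break_even_hit_ratio(l_cache_ms: float, l_origin_ms: float, overhead_ms: float = 0.0) -> float:
    """Hit ratio above which a look-aside cache beats going to origin every time."""
    return (l_cache_ms + overhead_ms) / l_origin_ms


def latency_with_cache(h: float, l_cache_ms: float, l_origin_ms: float,
                       overhead_ms: float = 0.0) -> float:
    """Every request pays the cache lookup; misses additionally pay the origin."""
    return l_cache_ms + overhead_ms + (1 - h) * l_origin_ms


threshold = break_even_hit_ratio(1.0, 50.0)
print(f"break-even h = {threshold:.0%}")
for h in (0.01, 0.02, 0.05):
    print(f"h={h:.2f}: {latency_with_cache(h, 1.0, 50.0):.2f} ms with cache vs 50.00 ms without")
```

Raising overhead_ms shows how quickly the threshold moves: if invalidation work added 4 ms per request, the cache would need a hit ratio above 10% before it paid for itself.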

🧠 Quick check

1. A response includes Cache-Control: no-cache. What does this mean?

Despite the confusing name, no-cache does not prevent caching — it forces revalidation. The cache may store the response; it just can't serve it stale without first checking with the server. To truly prevent storage, use no-store.

2. A client stores a response with ETag "xyz99". The response is now past its max-age. What does the client send on the next request?

ETags use the If-None-Match conditional. The client says "I have this version — if you still have it, save me the bandwidth and return 304." If the data has changed, the server returns 200 with the new body and a new ETag.

3. Your API returns personalized user dashboards. Which Cache-Control directive must you include?

Personalized data is different for each user. A shared cache (CDN) would serve one user's dashboard to another. private restricts storage to the user's own browser cache only, preventing data leaks at shared caches.

4. What is a cache stampede and what is the simplest mitigation?

When a hot cache entry expires, every in-flight request that was relying on it misses simultaneously and hammers the database. Jitter on TTL spreads expirations so they don't all align. A mutex ensures only one request races to recompute while others wait or serve stale data.

5. Which write strategy has the highest risk of data loss if the cache node fails?

Write-back writes to the cache first and the database later. If the cache node fails before the background flush, the write is permanently lost. Cache-aside and write-through always write to the database synchronously before (or at the same time as) the cache, so data is safe even if the cache disappears.

✍️ Exercise: design the caching strategy for a news feed

You're building a news feed API. Each user gets a personalized feed (latest 20 articles from accounts they follow). The feed is expensive to compute — it requires joining 4 tables. Articles are published every few minutes; users expect their feed to feel "roughly real-time" (within 2 minutes is fine). Design the caching strategy: which layers, what TTLs, what write strategy, and how you handle invalidation when a user publishes a new article.

Model answer:

Rubric: ✓ correct layer choice (Redis not CDN; explains why private) ✓ TTL tied to business SLA ✓ jitter mentioned ✓ fan-out invalidation + large-follower exception ✓ stampede mitigation. Five = excellent; four = strong; three = good start but missing nuance.

Key takeaways

Sources & further reading