Reliability & Scale · Lesson 07

Caching at Every Layer

A cache is the closest thing to a free lunch in distributed systems — the same byte of data, served a thousand times, need only be fetched once. But caches are only free when you get invalidation right, and that turns out to be one of the genuinely hard problems in computer science.

⏱ 14 min Difficulty: core Prereq: Circuit Breaker (rel-06)

By the end you'll be able to

Name the five caching layers from client to database and explain what each one protects.
Read and write HTTP Cache-Control directives and explain the ETag / 304 revalidation flow.
Choose between write-through, write-back, and cache-aside strategies and reason about their trade-offs.

Why caching exists: two problems at once

Every request that reaches your database costs money in CPU, I/O, and time. Popular data is read far more than it is written — a product page, a user profile, a configuration value. The 80/20 rule applies with brutal consistency: 20% of your data accounts for 80% of your reads. A cache exploits this asymmetry by keeping the popular 20% in fast, cheap memory so the database never sees most requests at all.

The analogy: a reference librarian who gets asked the same twenty questions every day eventually keeps a cheat sheet on her desk. The first asker went to the stacks; everyone after that gets an instant answer from the desk. The database is the stacks. The cache is the cheat sheet.

Two distinct problems are solved simultaneously:

Latency: memory access takes nanoseconds to microseconds; disk and network round trips take milliseconds. A cache hit can be 100–10 000× faster than a cache miss.
Load: a database receiving 1 000 requests/s sees only 50 requests/s once the cache in front of it reaches a 95% hit rate. The system serves the same traffic while the database does a twentieth of the work.

The five layers of caching

Caching is not a single thing — it is a stack of independent stores, each closer to the user than the last. A request that misses every cache falls all the way to the database; one that hits the first layer never touches the network.

Top: the five caching layers between a user and the database. A cache miss falls through each layer until it reaches the DB; a hit at any layer stops the traversal. Bottom: the ETag revalidation flow. The client asks "has the resource changed since this ETag?" and gets a 304 with no body if the cached copy is still current.

HTTP caching: Cache-Control, ETag, and 304

HTTP has a built-in caching protocol that browsers, CDNs, and reverse proxies all understand. The core mechanism is the Cache-Control response header.

# Response from an API endpoint
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-a1b2c3d4"
Last-Modified: Fri, 19 Jun 2026 08:00:00 GMT
Vary: Accept-Encoding

{ "product_id": 9, "name": "Titanium Frame", "price": 499 }

# Client re-requests 6 minutes later (past max-age=300)
GET /v1/products/9 HTTP/1.1
If-None-Match: "v3-a1b2c3d4"

# Server — data unchanged — responds without a body
HTTP/1.1 304 Not Modified
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-a1b2c3d4"

Key directives:

Directive	Meaning
`public`	Any cache (browser, CDN) may store this response.
`private`	Only the end-user's browser may cache; CDNs must not.
`no-store`	Nothing may cache this — ever. Use for sensitive data.
`no-cache`	May cache but must revalidate with the server before serving. Despite the name, it does not prevent caching.
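`max-age=N`	The response is fresh for N seconds, counted from when the origin generated it; time already spent in upstream caches (reported in the `Age` header) counts against N.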
`s-maxage=N`	Like max-age but applies only to shared caches (CDNs). Overrides `max-age` for CDNs.
`stale-while-revalidate=N`	Serve the stale response immediately while fetching a fresh one in the background; useful for low-latency APIs where a slightly stale response is acceptable.
`must-revalidate`	Once stale, must not serve the stale copy — must revalidate or return 504.
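
To make the directive table concrete, here is a minimal sketch of how a cache might turn a Cache-Control header into something it can act on. The function name `parse_cache_control` is hypothetical, and the parsing is deliberately simplified: it ignores quoted-string edge cases that a production parser would need to handle.

```python
def parse_cache_control(header: str) -> dict:
    """Parse a Cache-Control header into {directive: value}.

    Valueless directives (public, no-store, ...) map to True;
    directives with a numeric argument (max-age=300) map to an int.
    """
    directives = {}
    for part in header.split(","):
        part = part.strip().lower()
        if not part:
            continue  # tolerate stray commas
        name, sep, value = part.partition("=")
        value = value.strip('"')
        if sep and value.isdigit():
            directives[name] = int(value)
        else:
            directives[name] = True
    return directives


parsed = parse_cache_control("public, max-age=300, stale-while-revalidate=60")
print(parsed)
```

A cache would then branch on keys such as `no-store` or `private` before deciding whether to store the response at all.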

The ETag header is a version token for the resource — a hash, a version number, or any opaque string that changes whenever the data changes. On subsequent requests the client sends If-None-Match: "<etag>". The server compares this to the current ETag. If unchanged, it returns 304 Not Modified with no body — saving the bandwidth of the full response. If changed, it returns 200 OK with the new body and a new ETag.

The bandwidth saving is significant: a 304 response carries only headers, not the payload. A 50 KB JSON response shrinks to a few hundred bytes when revalidation succeeds, although the request still costs one round trip to the server.
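
The server side of this flow can be sketched in a few lines. This is an illustration under stated assumptions, not a framework API: `make_etag` and `respond` are hypothetical names, the ETag is derived from a hash of the body, and weak validators (the `W/` prefix) are not handled.

```python
import hashlib
from typing import Optional


def make_etag(body: bytes) -> str:
    """Derive an ETag from the body; HTTP requires the surrounding quotes."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str] = None):
    """Return (status, headers, payload) for a GET with an optional If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=300"}
    if if_none_match is not None:
        # The header may list several ETags separated by commas, or "*".
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            return 304, headers, b""   # unchanged: headers only, no body
    return 200, headers, body          # first fetch, or the resource changed


status, headers, payload = respond(b'{"product_id": 9}')
status_again, _, payload_again = respond(b'{"product_id": 9}', headers["ETag"])
print(status, status_again, len(payload_again))
```

Hashing the body is only one way to produce a version token; a row version number or an update timestamp works equally well as long as it changes whenever the data does.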

Write strategies: getting data into the cache

Reading from a cache is simple. The hard question is when and how does the cache get updated after a write? There are three main strategies:

Cache-aside (lazy loading)

The application is responsible for the cache. On a read, it checks the cache; if it misses, it reads from the database and populates the cache. On a write, it writes to the database and then invalidates (deletes) the cache entry, so the next read re-populates it with fresh data.

Pros: only data that is actually read gets cached — no wasted memory for cold data. Easy to implement. Cons: the first read after a write always hits the database (cache miss). Two operations (DB write + cache delete) are not atomic; a failed invalidation leaves stale data.
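
A minimal sketch of cache-aside, with two plain dictionaries standing in for the cache and the database. The class name `CacheAsideStore` and the `db_reads` counter are illustrative assumptions; a real deployment would talk to Redis and a database client instead.

```python
class CacheAsideStore:
    """Cache-aside over two dicts standing in for the cache and the database."""

    def __init__(self, database: dict):
        self.database = database
        self.cache = {}
        self.db_reads = 0  # how many reads fell through to the database

    def read(self, key):
        if key in self.cache:          # hit: the database is not touched
            return self.cache[key]
        self.db_reads += 1             # miss: go to the source of truth
        value = self.database[key]
        self.cache[key] = value        # populate for the next reader
        return value

    def write(self, key, value):
        self.database[key] = value     # 1. write to the database
        self.cache.pop(key, None)      # 2. invalidate; the next read repopulates


store = CacheAsideStore({"product:9": {"price": 499}})
store.read("product:9")
store.read("product:9")
store.write("product:9", {"price": 459})
print(store.read("product:9"), store.db_reads)
```

The two steps in `write` are exactly the non-atomic pair described above: if step 2 fails after step 1 succeeds, readers keep seeing the old value until the entry is evicted some other way.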

Write-through

On every write, the application writes to both the cache and the database synchronously. Pros: any key that has been written is served from the cache, and the cached value always matches the database. Cons: every write pays the overhead of two stores. Data that is never read is still cached, wasting memory.

Write-back (write-behind)

Writes go to the cache immediately; the database is updated asynchronously in a background process. Very fast write latency. Cons: if the cache node fails before the DB write commits, data is lost. Complex to implement correctly. Use primarily for high-throughput counters, analytics aggregations, and other loss-tolerant workloads.

Strategy	Read latency	Write latency	Durability	Best for
Cache-aside	Miss = slow	Fast (1 write)	High	General-purpose; read-heavy
Write-through	Always fast	Slower (2 writes)	High	Consistent reads, low write volume
Write-back	Always fast	Very fast	Risk of loss	Counters, analytics, loss-tolerant
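
The durability difference between write-through and write-back is easiest to see side by side. The sketch below uses dictionaries for both stores and an explicit `flush()` call in place of a background process; the class names and the `dirty` set are illustrative assumptions.

```python
class WriteThroughCache:
    """Every write goes to the database and the cache in the same operation."""

    def __init__(self, database: dict):
        self.database = database
        self.cache = {}

    def write(self, key, value):
        self.database[key] = value   # synchronous database write
        self.cache[key] = value      # cache updated alongside it


class WriteBackCache:
    """Writes land in the cache first and reach the database on flush()."""

    def __init__(self, database: dict):
        self.database = database
        self.cache = {}
        self.dirty = set()           # keys not yet persisted

    def write(self, key, value):
        self.cache[key] = value      # fast path: memory only
        self.dirty.add(key)

    def flush(self) -> int:
        """Persist dirty keys; a real system would run this in the background."""
        flushed = len(self.dirty)
        for key in self.dirty:
            self.database[key] = self.cache[key]
        self.dirty.clear()
        return flushed


db = {}
counters = WriteBackCache(db)
counters.write("views:9", 1)
counters.write("views:9", 2)
print(db, counters.flush(), db)
```

Anything still in `dirty` when the cache node dies is lost, which is the risk the table summarises. The same example also shows the upside: two writes to one key cost a single database write.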

Invalidation: the genuinely hard problem

Phil Karlton's famous quip — "There are only two hard things in computer science: cache invalidation and naming things" — is funny because it is true. A cache that holds data nobody reads is harmless. A cache that serves wrong data to users is a production incident.

The three main invalidation approaches:

TTL (Time-To-Live) — every cache entry has a maximum age. After it expires the entry is evicted and the next read goes to the source of truth. Simple to implement; accepts a bounded window of staleness. The TTL is a trade-off knob: short TTL = fresher data, higher load; long TTL = more load reduction, higher staleness risk.
Explicit invalidation — on every write, delete the corresponding cache key. Guarantees freshness immediately after a write. Fails silently if the cache delete fails (retry with care; delete is idempotent). Fragile when one logical object is cached at multiple keys or derived from multiple tables.
Event-driven invalidation — write to the database, publish a "resource updated" event, cache subscribers listen and evict the relevant keys. Decoupled and scalable; adds complexity of an event bus and eventual consistency lag.
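
TTL expiry and explicit invalidation can live in the same small structure. The sketch below is an in-process illustration, not a Redis client: `TTLCache` is a hypothetical class, and the clock is injectable so that expiry can be exercised without sleeping.

```python
import time


class TTLCache:
    """In-process cache with per-entry expiry and idempotent invalidation."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}                    # key -> (value, expires_at)

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:        # expired: evict, report a miss
            del self._entries[key]
            return default
        return value

    def invalidate(self, key):
        self._entries.pop(key, None)          # safe to call more than once


fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.set("config", "v1")
fake_now[0] = 61.0
print(cache.get("config"))
```

Calling `invalidate` on a write gives immediate freshness, while the TTL acts as a safety net that bounds staleness if an invalidation is ever missed.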

Cache stampede (thundering herd on expiry)

A cache stampede occurs when a popular cached item expires and dozens of concurrent requests all miss simultaneously, all query the database at once, all compute the same answer, and all try to write it back to the cache. The database experiences a sudden spike for exactly the duration of one slow query.

Mitigations:

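Probabilistic early expiry: each request may refresh the entry slightly before it expires, with a probability that rises as expiry approaches. One request usually gets ahead of the stampede and repopulates the cache before the entry lapses. A related but distinct technique is TTL jitter, which randomizes each key's TTL so that many keys written together do not all expire in the same instant.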
Mutex / lock: the first miss acquires a lock; subsequent misses return the stale value (or a loading indicator) while the lock-holder recomputes. Only one database query per stampede event.
Background refresh: a background job pre-warms popular keys before they expire, so the cache is never empty.
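
The mutex mitigation can be sketched with a per-key lock. This is a single-process illustration using threads; `SingleFlightCache` is a hypothetical name, and in this variant the other requests block until the value is ready rather than serving a stale copy. Across several servers the lock would need to live in a shared store instead.

```python
import threading


class SingleFlightCache:
    """On a miss, only one caller per key runs the loader; the rest wait for it."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()   # protects the dictionary of locks

    def get(self, key):
        if key in self._values:          # fast path: hit
            return self._values[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            if key in self._values:      # another caller filled it while we waited
                return self._values[key]
            value = self._loader(key)    # the only database query for this miss
            self._values[key] = value
            return value


load_calls = []

def load_from_db(key):
    load_calls.append(key)               # stands in for one slow query
    return key.upper()

cache = SingleFlightCache(load_from_db)
workers = [threading.Thread(target=cache.get, args=("hot",)) for _ in range(20)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(load_calls))
```

The second check inside the lock is what prevents the stampede: callers that queued behind the first one find the value already present and never call the loader.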

🎯 Interview angle

"What would you cache and where?" is one of the most common system-design interview questions. A strong answer layers the response: (1) static/public content (product images, JS bundles) → CDN with long TTL; (2) per-route API responses for unauthenticated users → CDN or gateway cache, short TTL; (3) computed aggregates (user feed, search results) → Redis with TTL; (4) hot database rows (user session, config) → in-process LRU or Redis; (5) never cache authenticated per-user data at CDN (use Cache-Control: private). Mentioning invalidation trade-offs and stampede mitigation earns senior marks.

⚠️ Common trap

Caching authenticated or personalized data at a shared layer. Serving User A's account balance to User B because the response was cached at the CDN is a privacy incident. Use Cache-Control: private for any response that contains user-specific data. Only responses that are identical for all users belong in a shared (CDN or gateway) cache.

Stale data at scale. A 5-minute TTL sounds harmless. But a pricing update at minute 0 that doesn't reach users until minute 5 can mean thousands of orders placed at the wrong price. Match your TTL to the business tolerance for staleness — not to the "sounds reasonable" heuristic.

✅ Do this, not that

Do: set Vary: Accept-Encoding (and other relevant request headers) so CDNs store separate cache entries for gzip vs. non-gzip responses. Don't: omit Vary and let the CDN serve a gzip-compressed response to a client that didn't send Accept-Encoding: gzip — the client will show the user raw compressed garbage. Also: always set a TTL. A cache entry with no expiry stays forever, even after the data it represents has been deleted.

Under the hood: how the HTTP cache decision actually works

When the browser (or a CDN) receives a response, it runs a deterministic decision algorithm before touching the network again. Here is that algorithm traced end-to-end for a single resource.

Step 1 — Parse `Cache-Control` and compute the freshness lifetime

The cache reads the directives in order of precedence: s-maxage (shared caches only) > max-age > Expires header > heuristic freshness (typically 10% of Last-Modified age, capped). The result is a single integer: the freshness lifetime in seconds.

# Response received at t=0:
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-a1b2c3d4"
Date: Sat, 20 Jun 2026 10:00:00 GMT

# Computed:
freshness_lifetime = 300 s        (from max-age)
stale_grace        = 60 s         (from stale-while-revalidate)
stored_at          = 1781949600   (unix time of the Date header above)

Step 2 — Serve from cache while fresh

On every subsequent request, the cache computes current age = now − stored_at + Age_header_value. If current_age < freshness_lifetime, the response is fresh: serve it immediately, no network. The server sees zero bytes.
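
Steps 1 and 2 can be sketched as three small functions. This is a simplification under stated assumptions: `directives` is a hypothetical dictionary of already-parsed Cache-Control values such as `{"max-age": 300}`, and the `Expires` header and heuristic freshness are left out, so a response with no explicit lifetime is treated as immediately stale.

```python
def freshness_lifetime(directives: dict, shared_cache: bool) -> int:
    """Freshness lifetime in seconds: s-maxage (shared caches only) beats max-age."""
    if shared_cache and "s-maxage" in directives:
        return directives["s-maxage"]
    if "max-age" in directives:
        return directives["max-age"]
    return 0  # simplification: Expires and heuristic freshness are not modelled


def current_age(now: float, stored_at: float, age_header: int = 0) -> float:
    """Time spent in this cache plus the Age reported by upstream caches."""
    return (now - stored_at) + age_header


def is_fresh(now, stored_at, directives, shared_cache=False, age_header=0) -> bool:
    lifetime = freshness_lifetime(directives, shared_cache)
    return current_age(now, stored_at, age_header) < lifetime


directives = {"max-age": 300, "s-maxage": 600}
print(is_fresh(now=299, stored_at=0, directives=directives))
print(is_fresh(now=301, stored_at=0, directives=directives))
print(is_fresh(now=301, stored_at=0, directives=directives, shared_cache=True))
```

The third call shows why a CDN and a browser can disagree about the same response: the shared cache reads `s-maxage` and still considers the entry fresh.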

Step 3 — On expiry, revalidate with `If-None-Match`

At t=301 s the cached copy becomes stale. Rather than fetching the full body again, the cache sends a conditional GET using the stored ETag:

# Conditional request (t = 301 s):
GET /v1/products/9 HTTP/1.1
Host: api.example.com
If-None-Match: "v3-a1b2c3d4"

# Server checks: is the current ETag still "v3-a1b2c3d4"?
# Product price hasn't changed → YES → responds with NO BODY:
HTTP/1.1 304 Not Modified
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-a1b2c3d4"
Date: Sat, 20 Jun 2026 10:05:01 GMT

# Bytes exchanged:  request ≈ 120 B, 304 response ≈ 180 B
# vs. a full 200:   request ≈ 120 B, 200 response ≈ 48 320 B (48 KB JSON)
# Bandwidth saved: ~48 KB per revalidation, a 99.4% reduction

The cache resets its freshness timer and continues serving from its copy. If the product had changed, the server returns 200 OK with the new body and a new ETag, and the cache replaces its stored copy.

The `stale-while-revalidate` grace window

Between t=300 and t=360 the entry is stale-but-within-grace. The cache serves the stale copy immediately (zero latency impact) and fires the conditional GET in the background. The user sees no delay; the next request (after the background fetch completes) gets the fresh copy. This is how high-traffic APIs can have a 5-minute cache but still feel instant during revalidation.
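
The three states an entry passes through can be written as one function. `cache_state` is a hypothetical helper, and treating the exact boundary seconds as belonging to the later state is an assumption made for the sketch.

```python
def cache_state(age: float, max_age: int, stale_while_revalidate: int = 0) -> str:
    """Classify a cached entry by its age in seconds."""
    if age < max_age:
        return "fresh"                    # serve from cache, no network
    if age < max_age + stale_while_revalidate:
        return "stale-while-revalidate"   # serve stale now, refresh in background
    return "stale"                        # must revalidate before serving


for age in (120, 330, 400):
    print(age, cache_state(age, max_age=300, stale_while_revalidate=60))
```

Only the last state makes the user wait on the network; the middle state is where the grace window hides the revalidation latency.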

Cache key and `Vary`

The cache key is not just the URL. A Vary header tells the cache which request headers are part of the key. Vary: Accept-Encoding means the cache stores separate entries for gzip-compressed and uncompressed responses to the same URL. Vary: Authorization, by contrast, would give every user their own entry in a shared cache, which fragments the cache and defeats the purpose of sharing it. The cache key algorithm is:

cache_key = method + url + SORTED(vary_header_values_from_request)
# e.g.:  GET:/v1/products/9:gzip   ≠   GET:/v1/products/9:identity
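
The pseudocode above can be turned into a small runnable sketch. The function name `cache_key` and the `|` separator are illustrative choices; real caches use their own internal key formats and usually normalise header values further.

```python
def cache_key(method: str, url: str, request_headers: dict, vary: str = "") -> str:
    """Build a cache key from the method, URL, and the request headers named in Vary."""
    # Header names are case-insensitive, so compare them in lowercase.
    names = sorted(name.strip().lower() for name in vary.split(",") if name.strip())
    lowered = {name.lower(): value for name, value in request_headers.items()}
    parts = [method.upper(), url]
    parts += [f"{name}={lowered.get(name, '')}" for name in names]
    return "|".join(parts)


gzip_key = cache_key("GET", "/v1/products/9", {"Accept-Encoding": "gzip"}, "Accept-Encoding")
plain_key = cache_key("GET", "/v1/products/9", {"Accept-Encoding": "identity"}, "Accept-Encoding")
print(gzip_key)
print(plain_key)
```

With an empty `vary` argument both requests collapse to the same key, which is the failure mode that serves a compressed body to a client that cannot decode it.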

The cache decision flowchart. "Fresh" means current_age < freshness_lifetime. A conditional GET with If-None-Match costs only header bytes when the resource hasn't changed.

How to debug & inspect caching

The fastest tool is curl -I (short for --head), which sends a HEAD request and prints only the response headers. Add -H "Cache-Control: no-cache" to ask intermediate caches for a fresh fetch, and compare the headers before and after TTL expiry.

# 1. Check what the origin server sends:
$ curl -sI https://api.example.com/v1/products/9
HTTP/2 200
cache-control: public, max-age=300, stale-while-revalidate=60
etag: "v3-a1b2c3d4"
x-cache: MISS    # CDN header: first hit, fetched from origin
age: 0

# 2. Repeat after a few seconds; should now be a CDN hit:
$ curl -sI https://api.example.com/v1/products/9
HTTP/2 200
cache-control: public, max-age=300
etag: "v3-a1b2c3d4"
x-cache: HIT     # served from CDN edge
age: 47          # 47 s since origin fetch

# 3. Force revalidation manually:
$ curl -sI https://api.example.com/v1/products/9 \
    -H "If-None-Match: \"v3-a1b2c3d4\""
HTTP/2 304       # no body, bandwidth saved

# 4. Cache-bust a CDN by changing the URL query param:
$ curl -sI "https://api.example.com/v1/products/9?v=4"
HTTP/2 200
x-cache: MISS    # new cache key, fresh fetch

In Chrome DevTools: open Network, click the request, and look at the Response Headers section of the Headers tab. The x-cache header (the exact name varies by CDN) and the age header reveal whether the CDN served the response and how old it was. A "(disk cache)" label in the Size column means the browser served it without any network request at all.

Symptom	Likely cause	Fix
Response is never cached at CDN (`x-cache: MISS` every time)	`Cache-Control: private` or `no-store` on origin; or `Set-Cookie` header present (most CDNs skip caching when cookies are set)	Check origin response headers; remove unnecessary `Set-Cookie` from cacheable endpoints; add `public` directive
Stale data persists after a deploy / data update	Old `max-age` hasn't expired; CDN hasn't been purged; browser still has a fresh copy	Purge CDN via API after deploy; use cache-busting query param (`?v=<git-sha>`) for static assets; shorten `max-age` for mutable resources
Different users see each other's responses at CDN	User-specific response cached without `private`; or CDN ignores `Authorization` header	Add `Cache-Control: private` on any response that varies by user; add `Vary: Authorization` if you must cache per-token at a shared layer
Browser ignores `304` and re-downloads the full body	ETag format mismatch (server changed ETag format between releases); `ETag` not sent on 304 response	Ensure 304 response echoes the same `ETag`; keep ETag format stable across deploys
CDN caches a gzip response and serves it to a client that sent no `Accept-Encoding`	Missing `Vary: Accept-Encoding`	Add `Vary: Accept-Encoding` on all compressed responses
Every request reaches the origin even with correct headers	`Cache-Control: no-cache` from the request (browser hard-refresh, Ctrl+Shift+R) bypasses the CDN; or CDN is misconfigured to pass all requests through	For CDN bypass from hard-refresh: normal — the browser sends `Cache-Control: no-cache` on purpose. For always-miss at CDN: check CDN rule configuration

Debug checklist for "why isn't this cached?"

Run curl -sI <url> and read Cache-Control, Vary, ETag, x-cache, and age headers.
Check for Set-Cookie on the response — most CDNs bypass caching whenever a cookie is set.
Confirm the public directive is present and no-store / private are absent.
If responses vary by content encoding (gzip vs. plain), verify Vary: Accept-Encoding is set.
For "stale after deploy": purge the CDN and verify the ETag changed in the new response.

⚠️ Cache-busting gone wrong

A common pattern for "force fresh after deploy" is appending a build hash to API URLs: /v1/products/9?v=abc123. This works — but if you forget to also update the URL the clients call, they keep requesting the old URL and hitting the cached copy. The safer pattern for mutable API responses is a short max-age (30–300 s) with stale-while-revalidate rather than cache-busting by URL, which is best reserved for immutable assets like JS bundles.

By the numbers

Scenario: a product catalogue API at 10 000 req/s. Redis cache latency L_cache = 1 ms; origin (PostgreSQL) latency L_origin = 50 ms. Current hit ratio h = 0.90 (90% of requests served from cache).

Effective latency formula

effective_latency = h × L_cache + (1 − h) × L_origin

At h = 0.90:

effective_latency = 0.90 × 1 ms + 0.10 × 50 ms = 0.9 ms + 5.0 ms = 5.9 ms
# vs. 50 ms if there were no cache at all: an 8.5× latency improvement.

Hit ratio table: latency and origin load

Origin load = fraction of requests that reach the database = (1 − h). At 10 000 req/s baseline:

Hit ratio h	effective_latency (ms)	Origin fraction (1−h)	Origin QPS (of 10 000)	Latency vs. no cache
0.50	0.50×1 + 0.50×50 = 25.5 ms	50%	5 000	2.0× improvement
0.80	0.80×1 + 0.20×50 = 10.8 ms	20%	2 000	4.6× improvement
0.90	0.90×1 + 0.10×50 = 5.9 ms	10%	1 000	8.5× improvement
0.99	0.99×1 + 0.01×50 = 1.49 ms	1%	100	33.6× improvement

The origin QPS column shows why caching is a load-reduction tool as much as a latency tool: at h=0.99 the database sees only 100 req/s instead of 10 000. Sources: AWS — Caching best practices; Redis latency benchmarks: Redis — Benchmarks.
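
The table can be reproduced with a few lines of arithmetic, which makes it easy to plug in your own latencies. The function names are illustrative; the default values are the scenario's 1 ms cache, 50 ms origin, and 10 000 req/s.

```python
def effective_latency_ms(hit_ratio: float, cache_ms: float = 1.0, origin_ms: float = 50.0) -> float:
    """Average latency: hits pay the cache cost, misses pay the origin cost."""
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * origin_ms


def origin_qps(hit_ratio: float, total_qps: float = 10_000) -> float:
    """Requests per second that still reach the database."""
    return (1.0 - hit_ratio) * total_qps


for h in (0.50, 0.80, 0.90, 0.99):
    latency = effective_latency_ms(h)
    speedup = 50.0 / latency  # relative to the uncached 50 ms origin
    print(f"h={h:.2f}  latency={latency:.2f} ms  origin={origin_qps(h):.0f} req/s  speedup={speedup:.1f}x")
```

Because the miss term dominates, the last few percentage points of hit ratio matter most: moving from 0.90 to 0.99 cuts origin traffic by a factor of ten.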

Decision math: cache size vs. working set to reach a target hit ratio

Cache hit ratio is dominated by whether the working set (the hot keys actually requested) fits in the cache. If you cache the most popular fraction f of keys, h is the share of requests that go to those keys, so h can only rise as the cache grows. Under the 80/20 rule used earlier, caching the top 20% of keys gives h ≈ 0.80. Reaching h = 0.90 or 0.99 requires caching more than 20% of keys; how much more depends on how skewed the access distribution is, so measure it on real traffic rather than assuming it.

Worked example: the catalogue has 1 000 000 products. Average cached value size = 2 KB. To cover the top 20% of keys:

keys_cached = 1 000 000 × 0.20 = 200 000 keys
cache_size  = 200 000 × 2 KB = 400 MB
# A 400 MB working set fits easily in a single r7g.large node (~13 GB usable)
# and yields h ≈ 0.80 under the 80/20 assumption.
# Caching all 1 000 000 keys (h → 1.0) needs about 2 GB, which still fits on that node.

For read-heavy workloads with strongly skewed access, a cache that holds a modest fraction of the data can therefore absorb most reads. Sizing starts from the fraction of keys cached, f, and then checks the resulting hit ratio:

cache_keys  = total_keys × f              # the top fraction f of keys by request frequency
cache_bytes = cache_keys × avg_value_size
h           = share of all requests that target those cache_keys   # measured, or estimated from the access distribution
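
As a rough cross-check on cache sizing, the hit ratio from caching the top k keys can be estimated under an assumed access distribution. The sketch below assumes a Zipf law with exponent 1, which is an assumption for illustration only; real traffic should be measured. The function names are hypothetical.

```python
def zipf_weights(total_keys: int, exponent: float = 1.0) -> list:
    """Relative request frequency of each key, most popular first."""
    return [1.0 / rank ** exponent for rank in range(1, total_keys + 1)]


def zipf_hit_ratio(cached_keys: int, total_keys: int, exponent: float = 1.0) -> float:
    """Share of requests served when the top `cached_keys` keys are cached."""
    weights = zipf_weights(total_keys, exponent)
    return sum(weights[:cached_keys]) / sum(weights)


def keys_for_target(target: float, total_keys: int, exponent: float = 1.0) -> int:
    """Smallest number of top keys whose cumulative share reaches `target`."""
    weights = zipf_weights(total_keys, exponent)
    total = sum(weights)
    running = 0.0
    for rank, weight in enumerate(weights, start=1):
        running += weight
        if running / total >= target:
            return rank
    return total_keys


total = 100_000  # smaller than the 1 000 000-key example so the loop runs quickly
for target in (0.80, 0.90, 0.99):
    needed = keys_for_target(target, total)
    print(f"target h={target:.2f}: cache {needed} keys ({needed / total:.1%} of keys)")
```

The number of keys needed always grows with the target hit ratio. Under this assumed distribution with 1 000 000 keys, h = 0.90 needs roughly a quarter of the keys and h = 0.99 needs the large majority of them, so high targets are much more expensive than the first 80%.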

When caching pays off

Caching is worth its operational cost when both conditions hold:

Condition	Why it matters	Counter-example (don’t cache)
Read-heavy: reads ≫ writes	Cache amortizes the miss cost over many hits. If every key is written once and read once (h≈0), the overhead of cache misses and invalidations outweighs any benefit.	Event ingestion pipeline: each row written once, never re-read
Repeated keys: same keys hit frequently	A uniform-random access pattern has h ≈ cache_size / total_keys — tiny. Zipfian access (popular items are much more popular) gives high h with small cache.	Time-series sensor data: each timestamp is unique, no key repeats

Break-even: caching pays off when h × L_origin > L_cache + invalidation_overhead. This version of the inequality charges the cache lookup to every request, misses included, so it is slightly more conservative than the effective-latency formula above. With L_origin = 50 ms, L_cache = 1 ms, and negligible invalidation overhead, break-even is at h > 1/50 = 2%. Any hit ratio above 2% makes caching faster on average than going to origin every time, which is why even a small in-process LRU cache is almost always worthwhile for read-heavy endpoints.

🧠 Quick check

1. A response includes Cache-Control: no-cache. What does this mean?

Despite the confusing name, no-cache does not prevent caching — it forces revalidation. The cache may store the response; it just can't serve it stale without first checking with the server. To truly prevent storage, use no-store.

2. A client stores a response with ETag "xyz99". The response is now past its max-age. What does the client send on the next request?

The client sends a conditional GET carrying If-None-Match: "xyz99". In effect it says "I have this version; if it is still current, save the bandwidth and return 304." If the data has changed, the server returns 200 with the new body and a new ETag.

3. Your API returns personalized user dashboards. Which Cache-Control directive must you include?

Personalized data is different for each user. A shared cache (CDN) would serve one user's dashboard to another. private restricts storage to the user's own browser cache only, preventing data leaks at shared caches.

4. What is a cache stampede and what is the simplest mitigation?

When a hot cache entry expires, every in-flight request that was relying on it misses simultaneously and hammers the database. Jitter on TTL spreads expirations so they don't all align. A mutex ensures only one request races to recompute while others wait or serve stale data.

5. Which write strategy has the highest risk of data loss if the cache node fails?

Write-back writes to the cache first and the database later. If the cache node fails before the background flush, the write is permanently lost. Cache-aside and write-through always write to the database synchronously before (or at the same time as) the cache, so data is safe even if the cache disappears.

✍️ Exercise: design the caching strategy for a news feed

You're building a news feed API. Each user gets a personalized feed (latest 20 articles from accounts they follow). The feed is expensive to compute — it requires joining 4 tables. Articles are published every few minutes; users expect their feed to feel "roughly real-time" (within 2 minutes is fine). Design the caching strategy: which layers, what TTLs, what write strategy, and how you handle invalidation when a user publishes a new article.

Model answer:

Layer: Application cache (Redis), not CDN — feeds are personalized (private), so no shared cache.
Key: feed:user:{user_id}:page:{page}. Distinct user IDs already produce distinct keys, so no extra hashing is needed to avoid collisions, even when IDs are sequential.
TTL: 90 seconds. Slightly below the "2 minute" SLA so stale feeds auto-expire. Add ±10 s jitter to prevent all users' feeds expiring in the same second (stampede mitigation).
Write strategy: Cache-aside. On a cache miss, compute the feed, store in Redis with TTL. On a new article publish, explicitly delete the feed cache keys for all followers of the author (fan-out invalidation). For authors with > 10 000 followers, use TTL expiry only (fan-out is too expensive) and accept up to 90 s staleness.
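Stampede protection: When an author near the 10 000-follower fan-out limit publishes, thousands of followers' feed keys are deleted at once and their next requests all miss together. Use a per-key Redis mutex (SET with NX and an expiry, or SETNX): the first request recomputes; concurrent requests for the same feed wait or serve a retained stale copy.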
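HTTP headers: Cache-Control: private, max-age=30, stale-while-revalidate=60. The browser caches for 30 s, then allows 60 s of stale serving while it revalidates in the background. Staleness adds up across layers: a feed that was already close to 90 s old in Redis can be served by the browser for up to another 90 s, so tighten these values if the 2-minute target is a hard limit.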

Rubric: ✓ correct layer choice (Redis not CDN; explains why private) ✓ TTL tied to business SLA ✓ jitter mentioned ✓ fan-out invalidation + large-follower exception ✓ stampede mitigation. Five = excellent; four = strong; three = good start but missing nuance.
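
The key format and the jittered TTL from the model answer can be sketched as two helpers. The function names are hypothetical, and the injectable `rng` parameter is an assumption added so that the jitter can be seeded in tests.

```python
import random


def feed_cache_key(user_id: int, page: int) -> str:
    """Build the per-user feed key used in the model answer."""
    return f"feed:user:{user_id}:page:{page}"


def jittered_ttl(base_seconds: float = 90.0, jitter_seconds: float = 10.0, rng=random) -> float:
    """Return the base TTL shifted by a random amount within ±jitter_seconds."""
    return base_seconds + rng.uniform(-jitter_seconds, jitter_seconds)


seeded = random.Random(7)
print(feed_cache_key(42, 1))
print([round(jittered_ttl(rng=seeded), 1) for _ in range(3)])
```

Each feed written at the same moment receives a TTL somewhere between 80 and 100 seconds, so recomputation is spread over a 20-second window instead of landing in a single second.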

Key takeaways

Caching solves two problems simultaneously: latency (memory vs disk) and load (fewer DB queries).
There are five layers: browser, CDN/edge, API gateway, application (Redis), database — each closer to the user saves more.
Cache-Control: max-age controls freshness; ETag + 304 saves bandwidth on revalidation; stale-while-revalidate hides latency while refreshing.
Use Cache-Control: private for any personalized or authenticated response — never let a shared cache serve one user's data to another.
Write strategies: cache-aside (lazy, general-purpose), write-through (always fresh, slower writes), write-back (fast writes, loss risk).
Invalidation is hard: TTL is simple but allows staleness; explicit deletion is immediate but not atomic with the database write; add jitter to TTLs so that many keys do not expire at once.