Reliability & Scale · Lesson 07
Caching at Every Layer
A cache is the closest thing to a free lunch in distributed systems — the same byte of data, served a thousand times, need only be fetched once. But caches are only free when you get invalidation right, and that turns out to be one of the genuinely hard problems in computer science.
By the end you'll be able to
- Name the five caching layers from client to database and explain what each one protects.
- Read and write HTTP Cache-Control directives and explain the ETag / 304 revalidation flow.
- Choose between write-through, write-back, and cache-aside strategies and reason about their trade-offs.
Why caching exists: two problems at once
Every request that reaches your database costs money in CPU, I/O, and time. Popular data is read far more than it is written — a product page, a user profile, a configuration value. The 80/20 rule applies with brutal consistency: 20% of your data accounts for 80% of your reads. A cache exploits this asymmetry by keeping the popular 20% in fast, cheap memory so the database never sees most requests at all.
The analogy: a reference librarian who gets asked the same twenty questions every day eventually keeps a cheat sheet on her desk. The first asker went to the stacks; everyone after that gets an instant answer from the desk. The database is the stacks. The cache is the cheat sheet.
Two distinct problems are solved simultaneously:
- Latency: memory access takes nanoseconds; disk and network access take milliseconds. A cache hit can be 100–10 000× faster than a cache miss.
- Load: at a 95% hit rate, 1 000 requests/s of client traffic becomes 50 requests/s at the database, so the same database can sit behind far more total traffic.
The five layers of caching
Caching is not a single thing; it is a stack of independent stores, each closer to the user than the last: the browser, the CDN or edge, the API gateway, the application cache (Redis), and the database. A request that misses every cache falls all the way to the database; one that hits the first layer never touches the network.
HTTP caching: Cache-Control, ETag, and 304
HTTP has a built-in caching protocol that browsers, CDNs, and reverse proxies all understand. The core mechanism is the Cache-Control response header.
# Response from an API endpoint
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-a1b2c3d4"
Last-Modified: Fri, 19 Jun 2026 08:00:00 GMT
Vary: Accept-Encoding
{ "product_id": 9, "name": "Titanium Frame", "price": 499 }
# Client re-requests 6 minutes later (past max-age=300)
GET /v1/products/9 HTTP/1.1
If-None-Match: "v3-a1b2c3d4"
# Server — data unchanged — responds without a body
HTTP/1.1 304 Not Modified
Cache-Control: public, max-age=300, stale-while-revalidate=60
ETag: "v3-a1b2c3d4"
Key directives:
| Directive | Meaning |
|---|---|
| public | Any cache (browser, CDN) may store this response. |
| private | Only the end-user's browser may cache; CDNs must not. |
| no-store | Nothing may cache this, ever. Use for sensitive data. |
| no-cache | May cache but must revalidate with the server before serving. Despite the name, it does not prevent caching. |
| max-age=N | The response is fresh for N seconds from when it was received. |
| s-maxage=N | Like max-age but applies only to shared caches (CDNs). Overrides max-age for CDNs. |
| stale-while-revalidate=N | Serve the stale response immediately while fetching a fresh one in the background; useful for low-latency APIs where a slightly stale response is acceptable. |
| must-revalidate | Once stale, the copy must not be served; the cache must revalidate or return 504. |
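To make the directive table concrete, here is a minimal sketch of how a cache might split a Cache-Control header into its directives. The function name `parse_cache_control` is hypothetical, and the parser deliberately ignores edge cases such as quoted field lists; it only shows the name/value structure.

```python
from __future__ import annotations


def parse_cache_control(header: str) -> dict[str, int | bool]:
    """Split a Cache-Control header into {directive: value}.

    Directives with a numeric value (max-age=300) map to ints;
    bare directives (public, no-store) map to True.
    """
    directives: dict[str, int | bool] = {}
    for part in header.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, sep, value = part.partition("=")
        value = value.strip('"')
        if sep and value.isdigit():
            directives[name] = int(value)
        else:
            directives[name] = True
    return directives


parsed = parse_cache_control("public, max-age=300, stale-while-revalidate=60")
# parsed == {"public": True, "max-age": 300, "stale-while-revalidate": 60}
```

A real cache would also need to handle directives it does not recognise and conflicting directives; this sketch simply records whatever it sees.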
The ETag header is a version token for the resource — a hash, a version number, or any opaque string that changes whenever the data changes. On subsequent requests the client sends If-None-Match: "<etag>". The server compares this to the current ETag. If unchanged, it returns 304 Not Modified with no body — saving the bandwidth of the full response. If changed, it returns 200 OK with the new body and a new ETag.
The bandwidth saving is significant: a 304 response carries only headers, not the payload. A 50 KB JSON response becomes a few hundred bytes when revalidation succeeds.
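The server side of the ETag / 304 flow can be sketched in a few lines. This is an illustration under stated assumptions, not a framework API: `make_etag` and `handle_get` are hypothetical names, the ETag is a truncated SHA-256 of the serialized body, and weak validators (the `W/` prefix) are not handled.

```python
from __future__ import annotations

import hashlib
import json


def make_etag(body: bytes) -> str:
    """Derive an ETag from the response bytes (quoted, as HTTP requires)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def handle_get(resource: dict, if_none_match: str | None = None):
    """Return (status, headers, body) for a GET, honouring If-None-Match."""
    body = json.dumps(resource, sort_keys=True).encode()
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=300"}
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # unchanged: headers only, no body
    return 200, headers, body


product = {"product_id": 9, "name": "Titanium Frame", "price": 499}
status, headers, body = handle_get(product)                  # first fetch: 200
status2, _, body2 = handle_get(product, headers["ETag"])     # revalidation: 304
product["price"] = 449                                       # the data changes
status3, headers3, _ = handle_get(product, headers["ETag"])  # 200 with a new ETag
```

Hashing the body is only one way to produce a validator; a version column or an updated-at timestamp serves equally well, provided it changes whenever the representation changes.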
Write strategies: getting data into the cache
Reading from a cache is simple. The hard question is when and how the cache gets updated after a write. There are three main strategies:
Cache-aside (lazy loading)
The application is responsible for the cache. On a read, it checks the cache; if it misses, it reads from the database and populates the cache. On a write, it writes to the database and then invalidates (deletes) the cache entry, so the next read re-populates it with fresh data.
Pros: only data that is actually read gets cached — no wasted memory for cold data. Easy to implement. Cons: the first read after a write always hits the database (cache miss). Two operations (DB write + cache delete) are not atomic; a failed invalidation leaves stale data.
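The read and write paths described above can be sketched as follows. Two plain dicts stand in for Redis and the database, and the class name `CacheAside` and the `db_reads` counter are illustrative assumptions added to make the behaviour observable.

```python
class CacheAside:
    """Cache-aside over two plain dicts standing in for Redis and the database."""

    def __init__(self, db: dict):
        self.db = db
        self.cache: dict = {}
        self.db_reads = 0                # counts how often a read falls through

    def read(self, key):
        if key in self.cache:            # hit: the database is not touched
            return self.cache[key]
        self.db_reads += 1               # miss: fall through to the database
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value      # populate for the next reader
        return value

    def write(self, key, value):
        self.db[key] = value             # 1. update the source of truth
        self.cache.pop(key, None)        # 2. invalidate; the next read repopulates


store = CacheAside({"product:9": {"price": 499}})
store.read("product:9")                       # miss, loads from the database
store.read("product:9")                       # hit
store.write("product:9", {"price": 449})      # database write plus cache delete
fresh = store.read("product:9")               # miss again, sees the new price
```

The two steps in `write` are exactly the non-atomic pair mentioned in the cons: if the delete fails after the database write succeeds, readers keep seeing the old value until the entry expires or is deleted by a retry.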
Write-through
On every write, the application writes to both the cache and the database synchronously. Pros: reads of written keys hit the cache, and the cached copy is never behind the database. Cons: every write pays the latency of two stores. Data that is never read is still cached, wasting memory.
Write-back (write-behind)
Writes go to the cache immediately; the database is updated asynchronously in a background process. Very fast write latency. Cons: if the cache node fails before the DB write commits, data is lost. Complex to implement correctly. Use primarily for high-throughput counters, analytics aggregations, and other loss-tolerant workloads.
| Strategy | Read latency | Write latency | Durability | Best for |
|---|---|---|---|---|
| Cache-aside | Miss = slow | Fast (1 write) | High | General-purpose; read-heavy |
| Write-through | Always fast | Slower (2 writes) | High | Consistent reads, low write volume |
| Write-back | Always fast | Very fast | Risk of loss | Counters, analytics, loss-tolerant |
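The durability difference between the last two rows of the table is easier to see in code. The sketch below uses dicts for both stores; the class names, the `flush` method, and the `crash` helper are hypothetical and exist only to simulate a cache node failing before the asynchronous write reaches the database.

```python
class WriteThrough:
    """Every write goes to the database and the cache before returning."""

    def __init__(self):
        self.cache: dict = {}
        self.db: dict = {}

    def write(self, key, value):
        self.db[key] = value
        self.cache[key] = value


class WriteBack:
    """Writes land in the cache; flush() persists them later."""

    def __init__(self):
        self.cache: dict = {}
        self.db: dict = {}
        self.dirty: set = set()          # keys written but not yet persisted

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        """Persist dirty keys; a background worker would call this periodically."""
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()

    def crash(self):
        """Simulate losing the cache node before the next flush."""
        self.cache.clear()
        self.dirty.clear()


wt = WriteThrough()
wt.write("price:9", 499)                 # both stores updated together

wb = WriteBack()
wb.write("views:9", 1)
wb.flush()                               # first write reaches the database
wb.write("views:9", 2)
wb.crash()                               # second write never reaches the database
```

After the crash the database still holds the value 1 for `views:9`, which is why write-back suits counters and aggregates where losing the most recent increments is tolerable.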
Invalidation: the genuinely hard problem
Phil Karlton's famous quip — "There are only two hard things in computer science: cache invalidation and naming things" — is funny because it is true. A cache that holds data nobody reads is harmless. A cache that serves wrong data to users is a production incident.
The three main invalidation approaches:
- TTL (Time-To-Live) — every cache entry has a maximum age. After it expires the entry is evicted and the next read goes to the source of truth. Simple to implement; accepts a bounded window of staleness. The TTL is a trade-off knob: short TTL = fresher data, higher load; long TTL = more load reduction, higher staleness risk.
- Explicit invalidation — on every write, delete the corresponding cache key. Guarantees freshness immediately after a write. Fails silently if the cache delete fails (retry with care; delete is idempotent). Fragile when one logical object is cached at multiple keys or derived from multiple tables.
- Event-driven invalidation — write to the database, publish a "resource updated" event, cache subscribers listen and evict the relevant keys. Decoupled and scalable; adds complexity of an event bus and eventual consistency lag.
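The first two approaches can live in the same small structure. Below is a minimal sketch of a TTL cache with an explicit `invalidate` method; the class name and the injectable `clock` parameter are assumptions made so that expiry can be demonstrated without sleeping.

```python
import time


class TTLCache:
    """Dict-backed cache where every entry carries an absolute expiry time."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock               # injectable so expiry can be tested
        self._entries: dict = {}         # key -> (value, expires_at)

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:   # expired: evict and report a miss
            del self._entries[key]
            return default
        return value

    def invalidate(self, key):
        """Explicit invalidation; safe to repeat because delete is idempotent."""
        self._entries.pop(key, None)


now = [0.0]                              # fake clock, in seconds
cache = TTLCache(300, clock=lambda: now[0])
cache.set("price:9", 499)
now[0] = 299
before_expiry = cache.get("price:9")     # still within the TTL
now[0] = 300
after_expiry = cache.get("price:9")      # TTL reached: treated as a miss
cache.set("price:9", 449)
cache.invalidate("price:9")
cache.invalidate("price:9")              # repeating the delete is harmless
after_delete = cache.get("price:9")
```

Event-driven invalidation would call the same `invalidate` method from an event-bus subscriber instead of from the write path.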
Cache stampede (thundering herd on expiry)
A cache stampede occurs when a popular cached item expires and dozens of concurrent requests all miss simultaneously, all query the database at once, all compute the same answer, and all try to write it back to the cache. The database experiences a sudden spike for exactly the duration of one slow query.
Mitigations:
- Probabilistic early expiry: each request may refresh the entry slightly before it expires, with a probability that rises as expiry approaches. One request gets ahead of the stampede, so there is no thundering herd. (TTL jitter is a related but distinct technique: it randomizes TTLs so that many keys do not expire in the same instant.)
- Mutex / lock: the first miss acquires a lock; subsequent misses return the stale value (or a loading indicator) while the lock-holder recomputes. Only one database query per stampede event.
- Background refresh: a background job pre-warms popular keys before they expire, so the cache is never empty.
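The mutex mitigation can be sketched with a per-key lock. Note that this is a variation on the bullet above: waiting threads block until the lock-holder finishes rather than returning a stale value. The class name `SingleFlightCache` and the `slow_query` loader are hypothetical.

```python
import threading
import time


class SingleFlightCache:
    """On a miss, only one thread per key runs the loader; the rest wait for it."""

    def __init__(self, loader):
        self.loader = loader
        self.values: dict = {}
        self.locks: dict = {}
        self.guard = threading.Lock()    # protects the per-key lock table

    def get(self, key):
        if key in self.values:           # fast path: no locking on a hit
            return self.values[key]
        with self.guard:
            lock = self.locks.setdefault(key, threading.Lock())
        with lock:
            if key not in self.values:   # re-check: another thread may have filled it
                self.values[key] = self.loader(key)
            return self.values[key]


calls = []


def slow_query(key):
    calls.append(key)                    # record each trip to the "database"
    time.sleep(0.05)                     # pretend the query is slow
    return key.upper()


cache = SingleFlightCache(slow_query)
threads = [threading.Thread(target=cache.get, args=("feed",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))                        # 1: only one thread ran the query
```

The re-check inside the lock is what prevents the stampede: the nineteen threads that lose the race find the value already present once they acquire the lock. In a multi-process deployment the lock would have to live in a shared store rather than in process memory.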
"What would you cache and where?" is one of the most common system-design interview questions. A strong answer layers the response: (1) static/public content (product images, JS bundles) → CDN with long TTL; (2) per-route API responses for unauthenticated users → CDN or gateway cache, short TTL; (3) computed aggregates (user feed, search results) → Redis with TTL; (4) hot database rows (user session, config) → in-process LRU or Redis; (5) never cache authenticated per-user data at CDN (use Cache-Control: private). Mentioning invalidation trade-offs and stampede mitigation earns senior marks.
Caching authenticated or personalized data at a shared layer. Serving User A's account balance to User B because the response was cached at the CDN is a privacy incident. Use Cache-Control: private for any response that contains user-specific data. Only responses that are identical for all users belong in a shared (CDN or gateway) cache.
Stale data at scale. A 5-minute TTL sounds harmless. But a pricing update at minute 0 that doesn't reach users until minute 5 can mean thousands of orders placed at the wrong price. Match your TTL to the business tolerance for staleness — not to the "sounds reasonable" heuristic.
Do: set Vary: Accept-Encoding (and other relevant request headers) so CDNs store separate cache entries for gzip vs. non-gzip responses. Don't: omit Vary and let the CDN serve a gzip-compressed response to a client that didn't send Accept-Encoding: gzip — the client will show the user raw compressed garbage. Also: always set a TTL. A cache entry with no expiry stays forever, even after the data it represents has been deleted.
Under the hood: how the HTTP cache decision actually works
When the browser (or a CDN) receives a response, it runs a deterministic decision algorithm before touching the network again. Here is that algorithm traced end-to-end for a single resource.
Step 1 — Parse Cache-Control and compute the freshness lifetime
The cache reads the directives in order of precedence: s-maxage (shared caches only) > max-age > Expires header > heuristic freshness (typically 10% of Last-Modified age, capped). The result is a single integer: the freshness lifetime in seconds.
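The precedence order in Step 1 can be written as a short function. This is a sketch under assumptions: the directives arrive as an already-parsed dict, the function and parameter names are hypothetical, and the one-day cap on the heuristic is an illustrative value rather than anything the specification mandates.

```python
from __future__ import annotations


def freshness_lifetime(
    directives: dict,
    shared_cache: bool,
    expires_minus_date: int | None = None,
    seconds_since_last_modified: int | None = None,
    heuristic_cap: int = 86_400,
) -> int:
    """Pick the freshness lifetime in seconds using the precedence from Step 1."""
    if shared_cache and "s-maxage" in directives:
        return directives["s-maxage"]
    if "max-age" in directives:
        return directives["max-age"]
    if expires_minus_date is not None:
        return max(0, expires_minus_date)        # Expires minus Date, never negative
    if seconds_since_last_modified is not None:
        # Heuristic: 10% of the time since Last-Modified, capped (cap is an assumption).
        return min(seconds_since_last_modified // 10, heuristic_cap)
    return 0                                     # nothing to go on: treat as stale


directives = {"max-age": 300, "s-maxage": 3600}
cdn_lifetime = freshness_lifetime(directives, shared_cache=True)       # 3600
browser_lifetime = freshness_lifetime(directives, shared_cache=False)  # 300
heuristic = freshness_lifetime({}, False, seconds_since_last_modified=36_000)  # 3600
```

The same response therefore has different lifetimes at the CDN and in the browser, which is the point of `s-maxage`.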
Step 2 — Serve from cache while fresh
On every subsequent request, the cache computes current_age = (now − stored_at) + Age_header_value. If current_age < freshness_lifetime, the response is fresh and is served immediately with no network traffic. The server sees zero bytes.
Step 3 — On expiry, revalidate with If-None-Match
At t=301 s the cached copy is stale. Rather than fetching the full body again, the cache sends a conditional GET carrying the stored ETag in If-None-Match, exactly as in the request/response example earlier in this lesson.
If the server answers 304 Not Modified, the cache resets its freshness timer and continues serving from its copy. If the product had changed, the server returns 200 OK with the new body and a new ETag, and the cache replaces its stored copy.
The stale-while-revalidate grace window
Between t=300 and t=360 the entry is stale-but-within-grace. The cache serves the stale copy immediately (zero latency impact) and fires the conditional GET in the background. The user sees no delay; the next request (after the background fetch completes) gets the fresh copy. This is how high-traffic APIs can have a 5-minute cache but still feel instant during revalidation.
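The three states a stored response can be in (fresh, stale but within the grace window, and fully stale) reduce to two comparisons on the current age. The function name `cache_state` and its return strings are illustrative assumptions.

```python
def cache_state(stored_at: float, now: float, age_header: float,
                max_age: float, swr: float = 0.0) -> str:
    """Classify a stored response as 'fresh', 'stale-while-revalidate', or 'stale'."""
    current_age = (now - stored_at) + age_header
    if current_age < max_age:
        return "fresh"                       # serve from cache, no network
    if current_age < max_age + swr:
        return "stale-while-revalidate"      # serve stale, refresh in background
    return "stale"                           # must revalidate before serving


# max-age=300, stale-while-revalidate=60, response stored at t=0 with Age: 0
at_120 = cache_state(0, 120, 0, max_age=300, swr=60)   # "fresh"
at_330 = cache_state(0, 330, 0, max_age=300, swr=60)   # "stale-while-revalidate"
at_400 = cache_state(0, 400, 0, max_age=300, swr=60)   # "stale"
```

A non-zero Age header shifts every boundary earlier: a response that already spent 100 s in an upstream cache is fresh locally for only 200 s more.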
Cache key and Vary
The cache key is not just the URL. A Vary header tells the cache which request headers are part of the key. Vary: Accept-Encoding means the cache stores separate entries for gzip-compressed and uncompressed responses to the same URL. Vary: Authorization would wreck the hit ratio of a shared cache: every user would get their own entry, defeating the purpose of sharing. The cache key algorithm is:
cache_key = method + url + SORTED(vary_header_values_from_request)
# e.g.: GET:/v1/products/9:gzip ≠ GET:/v1/products/9:identity
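A runnable version of the key construction might look like the sketch below. The function name, the `|` separator, and the choice to sort by header name (and to include the name in the key) are assumptions for illustration; real caches differ in how they normalise header values.

```python
def http_cache_key(method: str, url: str, vary: str, request_headers: dict) -> str:
    """Build a cache key from method, URL, and the request headers named in Vary."""
    lowered = {name.lower(): value for name, value in request_headers.items()}
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    parts = [method.upper(), url]
    parts += [f"{name}={lowered.get(name, '')}" for name in names]
    return "|".join(parts)


gzip_key = http_cache_key(
    "GET", "/v1/products/9", "Accept-Encoding", {"Accept-Encoding": "gzip"}
)
plain_key = http_cache_key(
    "GET", "/v1/products/9", "Accept-Encoding", {"Accept-Encoding": "identity"}
)
# gzip_key  == "GET|/v1/products/9|accept-encoding=gzip"
# plain_key == "GET|/v1/products/9|accept-encoding=identity"
```

Headers not named in Vary never enter the key, so two requests that differ only in, say, User-Agent share one entry.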
A cached response is served without touching the network while current_age < freshness_lifetime. A conditional GET with If-None-Match costs only header bytes when the resource hasn't changed.
How to debug & inspect caching
The fastest tool is curl -I (long form --head), which sends a HEAD request and prints only the response headers. Add -H "Cache-Control: no-cache" to ask caches along the path to revalidate with the origin, then compare the headers before and after TTL expiry.
In Chrome DevTools: open Network, click the request, look at the Response Headers tab. The x-cache header (CDN) and age header reveal whether the CDN served the response and how old it was. A "(disk cache)" label in the Size column means the browser served it without any network request at all.
| Symptom | Likely cause | Fix |
|---|---|---|
| Response is never cached at CDN (x-cache: MISS every time) | Cache-Control: private or no-store on origin; or Set-Cookie header present (most CDNs skip caching when cookies are set) | Check origin response headers; remove unnecessary Set-Cookie from cacheable endpoints; add public directive |
| Stale data persists after a deploy / data update | Old max-age hasn't expired; CDN hasn't been purged; browser still has a fresh copy | Purge CDN via API after deploy; use cache-busting query param (?v=<git-sha>) for static assets; shorten max-age for mutable resources |
| Different users see each other's responses at CDN | User-specific response cached without private; or CDN ignores Authorization header | Add Cache-Control: private on any response that varies by user; add Vary: Authorization if you must cache per-token at a shared layer |
| Browser ignores 304 and re-downloads the full body | ETag format mismatch (server changed ETag format between releases); ETag not sent on 304 response | Ensure 304 response echoes the same ETag; keep ETag format stable across deploys |
| CDN caches a gzip response and serves it to a client that sent no Accept-Encoding | Missing Vary: Accept-Encoding | Add Vary: Accept-Encoding on all compressed responses |
| Every request reaches the origin even with correct headers | Cache-Control: no-cache from the request (browser hard-refresh, Ctrl+Shift+R) bypasses the CDN; or CDN is misconfigured to pass all requests through | For CDN bypass from hard-refresh: this is normal, the browser sends Cache-Control: no-cache on purpose. For always-miss at CDN: check CDN rule configuration |
Debug checklist for "why isn't this cached?"
- Run curl -sI <url> and read the Cache-Control, Vary, ETag, x-cache, and age headers.
- Check for Set-Cookie on the response; most CDNs bypass caching whenever a cookie is set.
- Confirm the public directive is present and no-store / private are absent.
- If caching per content encoding (gzip vs plain), verify Vary: Accept-Encoding is set.
- For "stale after deploy": purge the CDN and verify the ETag changed in the new response.
A common pattern for "force fresh after deploy" is appending a build hash to API URLs: /v1/products/9?v=abc123. This works — but if you forget to also update the URL the clients call, they keep requesting the old URL and hitting the cached copy. The safer pattern for mutable API responses is a short max-age (30–300 s) with stale-while-revalidate rather than cache-busting by URL, which is best reserved for immutable assets like JS bundles.
By the numbers
Scenario: a product catalogue API at 10 000 req/s. Redis cache latency L_cache = 1 ms; origin (PostgreSQL) latency L_origin = 50 ms. Current hit ratio h = 0.90 (90% of requests served from cache).
Effective latency formula
effective_latency = h × L_cache + (1 − h) × L_origin
At h = 0.90: effective_latency = 0.90 × 1 ms + 0.10 × 50 ms = 0.9 ms + 5.0 ms = 5.9 ms.
Hit ratio table: latency and origin load
Origin load = fraction of requests that reach the database = (1 − h). At 10 000 req/s baseline:
| Hit ratio h | effective_latency (ms) | Origin fraction (1−h) | Origin QPS (of 10 000) | Latency vs. no cache |
|---|---|---|---|---|
| 0.50 | 0.50×1 + 0.50×50 = 25.5 ms | 50% | 5 000 | 2.0× improvement |
| 0.80 | 0.80×1 + 0.20×50 = 10.8 ms | 20% | 2 000 | 4.6× improvement |
| 0.90 | 0.90×1 + 0.10×50 = 5.9 ms | 10% | 1 000 | 8.5× improvement |
| 0.99 | 0.99×1 + 0.01×50 = 1.5 ms | 1% | 100 | 33× improvement |
The origin QPS column shows why caching is a load-reduction tool as much as a latency tool: at h=0.99 the database sees only 100 req/s instead of 10 000. Sources: AWS — Caching best practices; Redis latency benchmarks: Redis — Benchmarks.
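The table can be regenerated from the formula with a few lines of Python, which is handy for plugging in your own latencies. The function names are illustrative; the inputs are the scenario values from this section (1 ms cache, 50 ms origin, 10 000 req/s).

```python
def effective_latency(h: float, l_cache: float, l_origin: float) -> float:
    """Average latency in ms at hit ratio h (hits cost l_cache, misses l_origin)."""
    return h * l_cache + (1 - h) * l_origin


def origin_qps(h: float, total_qps: float) -> float:
    """Requests per second that miss the cache and reach the origin."""
    return (1 - h) * total_qps


for h in (0.50, 0.80, 0.90, 0.99):
    latency = effective_latency(h, l_cache=1.0, l_origin=50.0)
    print(f"h={h:.2f}  latency={latency:5.2f} ms  "
          f"origin={origin_qps(h, 10_000):7.0f} req/s  "
          f"speedup={50.0 / latency:4.1f}x")
```

Note how the last few points of hit ratio matter most: going from 0.90 to 0.99 cuts origin traffic by a factor of ten, while going from 0.50 to 0.80 cuts it by only 2.5.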
Decision math: cache size vs. working set to reach a target hit ratio
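Cache hit ratio is dominated by whether the working set (the hot keys actually requested) fits in the cache. Under a Zipf-like distribution a small fraction of keys draws most of the traffic; the 80/20 rule of thumb (top 20% of keys, ~80% of traffic) is one such point. The worked example below assumes a steeper skew in which the top 10% of keys by request frequency receive about 90% of requests, so caching those keys gives h ≈ 0.90. Measure your own access distribution before sizing a cache.
Worked example: the catalogue has 1 000 000 products. Average cached value size = 2 KB. To cover the top 10% of keys: 1 000 000 × 0.10 = 100 000 keys, and 100 000 × 2 KB = 200 000 KB ≈ 200 MB of cache memory, before per-key overhead.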
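This means for most read-heavy workloads with repeated keys, an h of 0.90–0.99 is often achievable with a cache that is 2–20% of total data size, depending on how skewed the traffic is. Cache size follows from the fraction of keys that must stay resident; a higher target h always needs a larger fraction, never a smaller one:
cached_fraction = share of keys, hottest first, needed to reach target h   # 0.10 for h = 0.90 in the example above; measure it from real traffic
min_keys_cached = total_keys × cached_fraction
cache_bytes = min_keys_cached × avg_value_size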
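How the hit ratio grows with the cached fraction depends on the skew of the traffic, and it is easy to explore numerically. The sketch below assumes an idealised Zipf distribution in which the key at popularity rank r is requested with frequency proportional to 1 / r^s; the function names and the exponent s = 1 are illustrative assumptions, and 2 KB is taken as 2 000 bytes.

```python
def zipf_hit_ratio(total_keys: int, cached_keys: int, s: float = 1.0) -> float:
    """Fraction of requests served when the cached_keys most popular keys are cached.

    Assumes the request frequency of the rank-r key is proportional to 1 / r**s.
    """
    weights = [1.0 / (rank ** s) for rank in range(1, total_keys + 1)]
    return sum(weights[:cached_keys]) / sum(weights)


def cache_bytes(cached_keys: int, avg_value_size: int) -> int:
    """Memory needed for the cached values alone, ignoring per-key overhead."""
    return cached_keys * avg_value_size


total = 1_000_000
for fraction in (0.01, 0.10, 0.20):
    k = round(total * fraction)
    print(f"top {fraction:.0%}: h={zipf_hit_ratio(total, k):.2f}, "
          f"memory={cache_bytes(k, 2_000) / 1e6:.0f} MB")
```

Running it with different values of `s` shows why measuring real traffic matters: a steeper skew reaches a given hit ratio with far fewer cached keys, while a flatter one needs far more.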
When caching pays off
Caching is worth its operational cost when both conditions hold:
| Condition | Why it matters | Counter-example (don’t cache) |
|---|---|---|
| Read-heavy: reads ≫ writes | Cache amortizes the miss cost over many hits. If every key is written once and read once (h≈0), the overhead of cache misses and invalidations outweighs any benefit. | Event ingestion pipeline: each row written once, never re-read |
| Repeated keys: same keys hit frequently | A uniform-random access pattern has h ≈ cache_size / total_keys — tiny. Zipfian access (popular items are much more popular) gives high h with small cache. | Time-series sensor data: each timestamp is unique, no key repeats |
Break-even: if every request pays the cache lookup and only misses go on to the origin, latency with a cache is L_cache + (1 − h) × L_origin, which beats L_origin alone when h × L_origin > L_cache + invalidation_overhead. With L_origin = 50 ms, L_cache = 1 ms, and negligible invalidation overhead, break-even is at h > 1/50 = 2%. Any hit ratio above 2% makes caching faster than going to origin every time, which is why even a small in-process LRU cache is almost always worthwhile for read-heavy endpoints.
Multi-tenant: the cache key must include the tenant
The single most common caching security bug in multi-tenant APIs is the cross-tenant cache leak: a per-user response is cached under a key that does not include the user's identity, and a different user's request gets a cache hit — receiving the first user's private data. This is not a theoretical concern; it surfaces regularly in production systems where caching is added to an existing authenticated endpoint without adjusting the key scheme.
The bug: caching GET /v1/me under the path alone
Two users — uA (Alice, user_id: 1001) and uB (Bob, user_id: 2055) — both call GET /v1/me to fetch their own profiles. A naive Redis cache stores the response under the key "cache:GET:/v1/me" — no user identity, just method + path.
| t (ms) | User | Request | Cache key looked up | Cache result | Response returned |
|---|---|---|---|---|---|
| 0 | uA (Alice) | GET /v1/me with Authorization: Bearer tok_A | "cache:GET:/v1/me" | MISS — key absent | Alice's profile fetched from DB; stored under "cache:GET:/v1/me" |
| 120 | uB (Bob) | GET /v1/me with Authorization: Bearer tok_B | "cache:GET:/v1/me" | HIT — returns Alice's profile ← BUG | Bob receives Alice's email, plan, and billing_address |
The cache has no concept of who made the request — it only sees the key. If the key is the same for two different users, the first writer wins and every subsequent reader gets the same response. The cache is operating correctly; the key design is wrong. This is categorically a security incident, not just a data-freshness issue.
The fix: key = (user_id, method, path [, relevant headers])
The corrected key scheme binds the cache entry to the tenant's identity extracted from the authenticated context (never from a user-supplied value in the URL):
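# Correct: derive user identity from the validated token, never from the path
def cache_key(user_id, method, path, vary_headers=()):
    # vary_headers: values of any request headers that change the response
    parts = ["cache", str(user_id), method, path, *sorted(vary_headers)]
    return ":".join(parts)

# e.g. cache_key(1001, "GET", "/v1/me") -> "cache:1001:GET:/v1/me"
#      cache_key(2055, "GET", "/v1/me") -> "cache:2055:GET:/v1/me"
# Completely separate entries; uB can never read uA's slot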
Corrected trace with the same two users:
| t (ms) | User | Request | Cache key looked up | Cache result | Response returned |
|---|---|---|---|---|---|
| 0 | uA (Alice) | GET /v1/me tok_A | "cache:1001:GET:/v1/me" | MISS | Alice's profile from DB; stored under her key |
| 120 | uB (Bob) | GET /v1/me tok_B | "cache:2055:GET:/v1/me" | MISS — different key | Bob's profile from DB; stored under his key |
| 850 | uA (Alice) | GET /v1/me tok_A | "cache:1001:GET:/v1/me" | HIT | Alice's own cached profile — correct |
| 980 | uB (Bob) | GET /v1/me tok_B | "cache:2055:GET:/v1/me" | HIT | Bob's own cached profile — correct |
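Both traces can be reproduced with a dict standing in for Redis. The `PROFILES` data and the `get_me` function are hypothetical stand-ins for the database and the endpoint handler; the `scoped` flag switches between the buggy and the corrected key scheme.

```python
PROFILES = {1001: {"name": "Alice"}, 2055: {"name": "Bob"}}  # stand-in database


def get_me(cache: dict, user_id: int, scoped: bool) -> dict:
    """Serve GET /v1/me through a dict cache, with or without a tenant-scoped key."""
    key = f"cache:{user_id}:GET:/v1/me" if scoped else "cache:GET:/v1/me"
    if key not in cache:
        cache[key] = PROFILES[user_id]   # miss: load the caller's own profile
    return cache[key]


naive: dict = {}
get_me(naive, 1001, scoped=False)               # Alice populates the shared slot
leaked = get_me(naive, 2055, scoped=False)      # Bob hits Alice's entry

fixed: dict = {}
get_me(fixed, 1001, scoped=True)
isolated = get_me(fixed, 2055, scoped=True)     # Bob misses, then gets his own
```

In the naive run `leaked` is Alice's profile even though Bob asked for it; in the fixed run the cache ends up holding two separate entries.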
The Vary header and Cache-Control: private for shared caches
The same isolation principle applies at HTTP caches (CDNs, reverse proxies). Two tools enforce it:
- Cache-Control: private instructs any shared cache (CDN, proxy) that this response must not be stored at all; only the end-user's own browser may cache it. This is the correct directive for any response containing per-user data. Adding private is simpler than configuring a shared cache to key by identity.
- Vary: Authorization tells a shared cache to treat each distinct value of the Authorization header as a separate cache key, for the unusual case where you do need per-token responses stored at a shared layer. In practice CDN support for Vary is inconsistent, and a CDN that ignores it can cache the first response and serve it to everyone, which makes Cache-Control: private the far safer choice. See MDN: Vary header and OWASP: Cache Poisoning / Web Messaging for attack scenarios.
Any response that differs by user must have Cache-Control: private on the HTTP layer and a tenant-scoped key on the application cache layer. The two controls operate at different layers and neither substitutes for the other: private guards the CDN edge; a keyed application cache guards the Redis or in-process layer.
By the numbers: the cost of per-tenant key isolation
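With 50,000 active users each having a separate cache entry for GET /v1/me, and an average profile size of 1 KB, the isolated entries total 50,000 × 1 KB = 50,000 KB ≈ 50 MB of cache memory, before key and per-entry overhead.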
🧠 Quick check
1. A response includes Cache-Control: no-cache. What does this mean?
Despite the confusing name, no-cache does not prevent caching — it forces revalidation. The cache may store the response; it just can't serve it stale without first checking with the server. To truly prevent storage, use no-store.
2. A client stores a response with ETag "xyz99". The response is now past its max-age. What does the client send on the next request?
ETags use the If-None-Match conditional. The client says "I have this version — if you still have it, save me the bandwidth and return 304." If the data has changed, the server returns 200 with the new body and a new ETag.
3. Your API returns personalized user dashboards. Which Cache-Control directive must you include?
Personalized data is different for each user. A shared cache (CDN) would serve one user's dashboard to another. private restricts storage to the user's own browser cache only, preventing data leaks at shared caches.
4. What is a cache stampede and what is the simplest mitigation?
When a hot cache entry expires, every in-flight request that was relying on it misses simultaneously and hammers the database. Jitter on TTL spreads expirations so they don't all align. A mutex ensures only one request races to recompute while others wait or serve stale data.
5. Which write strategy has the highest risk of data loss if the cache node fails?
Write-back writes to the cache first and the database later. If the cache node fails before the background flush, the write is permanently lost. Cache-aside and write-through always write to the database synchronously before (or at the same time as) the cache, so data is safe even if the cache disappears.
✍️ Exercise: design the caching strategy for a news feed
You're building a news feed API. Each user gets a personalized feed (latest 20 articles from accounts they follow). The feed is expensive to compute — it requires joining 4 tables. Articles are published every few minutes; users expect their feed to feel "roughly real-time" (within 2 minutes is fine). Design the caching strategy: which layers, what TTLs, what write strategy, and how you handle invalidation when a user publishes a new article.
Model answer:
- Layer: Application cache (Redis), not CDN. Feeds are personalized (private), so no shared cache.
- Key: feed:user:{user_id}:page:{page}. Distinct user IDs already produce distinct keys, so no extra hashing is needed to avoid collisions.
- TTL: 90 seconds. Slightly below the "2 minute" SLA so stale feeds auto-expire. Add ±10 s jitter to prevent all users' feeds expiring in the same second (stampede mitigation).
- Write strategy: Cache-aside. On a cache miss, compute the feed, store in Redis with TTL. On a new article publish, explicitly delete the feed cache keys for all followers of the author (fan-out invalidation). For authors with > 10 000 followers, use TTL expiry only (fan-out is too expensive) and accept up to 90 s staleness.
- Stampede protection: When a popular user publishes, millions of followers' caches expire at once. Use a Redis SETNX mutex: the first request recomputes; others serve stale while waiting.
- HTTP headers: Cache-Control: private, max-age=30, stale-while-revalidate=60. The browser caches for 30 s and may serve a stale copy for a further 60 s while revalidating in the background.
Rubric: ✓ correct layer choice (Redis not CDN; explains why private) ✓ TTL tied to business SLA ✓ jitter mentioned ✓ fan-out invalidation + large-follower exception ✓ stampede mitigation. Five = excellent; four = strong; three = good start but missing nuance.
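Two pieces of the model answer, the jittered TTL and the fan-out invalidation with a large-follower cut-off, can be sketched as below. A dict stands in for Redis, only the first feed page is invalidated, and the function names and the `max_fanout` parameter are hypothetical.

```python
import random


def jittered_ttl(base: int = 90, jitter: int = 10, rng=random) -> int:
    """TTL in seconds, spread uniformly over base ± jitter."""
    return base + rng.randint(-jitter, jitter)


def invalidate_followers(cache: dict, follower_ids: list,
                         max_fanout: int = 10_000) -> int:
    """Delete first-page feed keys for each follower; skip very large fan-outs."""
    if len(follower_ids) > max_fanout:
        return 0                          # too many followers: rely on TTL expiry
    deleted = 0
    for user_id in follower_ids:
        if cache.pop(f"feed:user:{user_id}:page:1", None) is not None:
            deleted += 1
    return deleted


feeds = {
    "feed:user:1:page:1": ["article-7"],
    "feed:user:2:page:1": ["article-7"],
}
removed = invalidate_followers(feeds, [1, 2, 3])   # user 3 had nothing cached
ttl = jittered_ttl()                               # somewhere between 80 and 100
```

With Redis the loop would typically become a pipelined batch of deletes, but the decision logic (fan out for small audiences, fall back to TTL for large ones) stays the same.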
Key takeaways
- Caching solves two problems simultaneously: latency (memory vs disk) and load (fewer DB queries).
- There are five layers: browser, CDN/edge, API gateway, application (Redis), database. The closer to the user a hit occurs, the more work it saves.
- Cache-Control: max-age controls freshness; ETag + 304 saves bandwidth on revalidation; stale-while-revalidate hides latency while refreshing.
- Use Cache-Control: private for any personalized or authenticated response; never let a shared cache serve one user's data to another.
- Write strategies: cache-aside (lazy, general-purpose), write-through (always fresh, slower writes), write-back (fast writes, loss risk).
- Invalidation is hard: TTL is simple but allows staleness; explicit deletion is immediate but not atomic with the database write; add jitter to TTLs to avoid cache stampedes.