Production at Scale · Walkthrough 03
Read-heavy feed at 100 M+ (Twitter-style, modeled)
The defining asymmetry of a social feed is that reads dwarf writes by two orders of magnitude. Routing each write the wrong way turns that imbalance into a system-threatening wall — and a single celebrity post can trigger more cache writes than your entire infrastructure can absorb.
By the end you'll be able to
- Derive write-amplification numbers for fan-out-on-write and explain where the wall appears.
- Compare fan-out-on-write, fan-out-on-read, and the hybrid strategy — and justify which segment each applies to.
- Size a timeline cache from first principles and identify the hot-key / celebrity problem.
Why reads dominate: the 100-to-1 asymmetry
On a mature social platform, writes (new posts) and reads (timeline loads) are not symmetric. A platform with 100 million daily active users might receive 5 million new posts per day while serving 500 million timeline reads per day — a 100:1 read-to-write ratio. That gap matters architecturally because the engineering cost of a write is paid once, while the cost of making that write readable is paid by every follower who fetches their feed.
The central decision is: when do you pay that fan-out cost? At write time (pre-compute every follower's timeline) or at read time (assemble it on demand)? Both extremes break under different conditions. A hybrid pays each cost only where it is tolerable.
Fan-out-on-write: the write-amplification wall
In fan-out-on-write, each new post triggers a cache insert for every follower of the author. For a typical user with 1,000 followers that is manageable. The wall appears when a celebrity posts.
| Author type | Follower count (F) | Cache writes per post | At 1 post/min rate |
|---|---|---|---|
| Regular user | 1,000 | 1,000 | 1,000 writes/min |
| Micro-influencer | 100,000 | 100,000 | 100 K writes/min |
| Celebrity | 10,000,000 | 10,000,000 | 10 M writes/min |
| Top-tier celebrity | 100,000,000 | 100,000,000 | 100 M writes/min |
A single post from a 100-million-follower account triggers 100 million cache inserts. If each insert takes 0.1 ms, completing the fan-out serially would take 10,000 seconds — nearly three hours. Even with 100 parallel workers each handling 1 M inserts, completion takes 100 seconds. By then, hundreds of millions of followers have already requested the timeline and gotten stale data or cache misses. This is not a theoretical edge case; it is the concrete failure mode that mandates the hybrid strategy.
The ceiling is not just latency — it is also the raw throughput of the cache tier. A Redis cluster handling 500 K writes/second can sustain fan-out for users with up to roughly 500 K followers before a single post saturates it. Beyond that, concurrent celebrity posts create a write storm.
⚠️ Modeled, not measured — illustrative arithmetic
# Fan-out-on-write amplification factor
writes_per_post = follower_count # F
total_writes/day = posts_per_day × avg_follower_count # P × F̄
# Modeled scenario: 100 M DAU platform
posts_per_day = 5_000_000 # 5 M new posts/day (illustrative)
avg_follower_count = 200 # average follows per author (illustrative)
total_writes/day = 5_000_000 × 200 = 1_000_000_000 # 1 B writes/day
writes/sec (avg) = 1_000_000_000 / 86_400 ≈ 11_574
# One celebrity post (100 M followers) spikes this by:
spike_writes = 100_000_000 # 100 M inserts in seconds
# That is ~8,600× the average per-second write rate, absorbed instantly
Interviewers test whether you recognise the asymmetry: "Why not just pre-compute all timelines?" Answer: it works for normal users, but a single celebrity post produces O(follower_count) writes. Name the number — 100 M cache writes for a top-tier account — then pivot to the hybrid as the solution. Quantifying the problem before proposing the fix is what distinguishes a senior answer.
Fan-out-on-read: the read-time assembly cost
Fan-out-on-read stores each post once in a chronological post store. When a user requests their timeline, the system fetches the M accounts they follow and merges the N most recent posts from each — a scatter-gather over M database or cache shards.
| Parameter | Illustrative value | Note |
|---|---|---|
| Accounts followed per user (M) | 300 | Typical median for an active social account |
| Recent posts fetched per followee | 10 | Retrieve to merge; show top 20 |
| DB/cache round-trips per timeline read | 1 (scatter-gather) | Sharded; parallelised |
| Post rows read per timeline fetch | 300 × 10 = 3,000 | Merge overhead at read time |
| Timeline reads/day (100 M DAU, 5 loads/day) | 500,000,000 | Illustrative |
| Total post-store reads/day | 500 M × 3,000 = 1.5 trillion | Entirely impractical without caching |
Pure fan-out-on-read is clearly unworkable at scale: the post-store read amplification is enormous, and the scatter-gather latency compounds with the number of followed accounts. But it has one critical advantage: a celebrity's post costs exactly one write, regardless of how many people follow them.
The hybrid strategy: push for normals, pull for celebrities
The hybrid cuts both amplification problems by segmenting authors:
- Normal users (follower count below threshold T): fan-out-on-write. Their posts are pushed into each follower's timeline cache immediately. Timeline reads are a single cache lookup — fast and cheap.
- Celebrity users (follower count ≥ T): fan-out-on-read, but only for celebrity posts. The celebrity's recent posts live in a dedicated "celebrity post cache" keyed by author id. At timeline read time, the system stitches the pre-built normal timeline (from the cache) with a lightweight fetch of any celebrities the user follows.
⚠️ Modeled, not measured — illustrative threshold arithmetic
# Choosing the celebrity threshold T
# Goal: a single post must not saturate the write tier
# Assume: cache write throughput = 500 K writes/sec
# acceptable fan-out duration = 10 seconds
max_writes = 500_000 × 10 = 5_000_000 # 5 M writes in 10 s
# Therefore T ≈ 1 M followers (fan-out workers × throughput)
# Accounts above T get the celebrity treatment; below T get fan-out-on-write
# Read-time stitch cost for celebrity hybrid
celebs_followed = 5 # typical user follows ~5 celebrities (illustrative)
fetch_per_celeb = 1 # one cache lookup per celeb
stitch_latency_ms = 5 × 1 × 0.5 = 2.5 # <3 ms total — acceptable
Timeline cache sizing from first principles
A timeline cache stores a list of post identifiers (or lightweight post summaries) per user. The goal is to keep the hot users' timelines in memory to serve reads without hitting the post store.
| Parameter | Illustrative value | Justification |
|---|---|---|
| Active users (with cached timeline) | 50,000,000 | 50 M of 100 M DAU have a warm cache entry |
| Timeline entries per user | 500 | ~500 recent post ids; user rarely scrolls further |
| Bytes per entry (post id + timestamp) | 16 bytes | 8-byte int64 post id + 8-byte int64 timestamp |
| Raw timeline data per user | 500 × 16 = 8,000 bytes = 8 KB | |
| Total raw data | 50 M × 8 KB = 400 GB | |
| Redis overhead (hash + pointer) | ~2× | Redis per-key overhead; sorted-set nodes |
| Total cache memory needed | ≈ 800 GB | Illustrative |
| Cache nodes at 64 GB RAM each | 800 / 64 ≈ 13 nodes | With replication: ~26–39 nodes |
A timeline cache should store post identifiers (and perhaps scores/timestamps for ordering), not full post objects. Full post content lives in a separate post cache. This separation means updating a post only invalidates one object, not 50 M timeline cache entries. It also lets the post cache use a different eviction policy tuned to post popularity rather than recency.
The hot-key problem: celebrities as cache thundering herds
Even after adopting the hybrid strategy, a celebrity's post cache entry becomes a hot key: millions of read-time timeline stitches all hit the same cache key for the celebrity's recent posts. A single Redis node owns that key and must handle every read — this is the hot-key problem, and it is the read-side analog of write amplification. See load balancing strategies and the load balancer simulation for techniques including key sharding and local in-process caching.
⚠️ Modeled, not measured — hot-key read estimate
# Suppose celebrity X has 10 M followers, all active at the same time
# Each timeline read fetches the celebrity's post cache entry
timeline_reads_per_min = 10_000_000 / 60 ≈ 167_000
# Equivalent to 167 K Redis GET ops/min on ONE key
# A single Redis node handles ~100 K–1 M simple GETs/sec — so it is manageable,
# but leaves no headroom when multiple celebrities post simultaneously.
# Mitigation 1: local in-process cache (e.g., LRU in each app server)
# Suppose 100 app servers each cache the celebrity post list for 5 seconds
redis_reads = 167_000 / 100 = 1_670 # 100× reduction — well within one node
# Mitigation 2: replicate the hot key to multiple Redis replicas and
# distribute reads across them (read replicas or key sharding)
When a celebrity cache entry expires (or is first populated), millions of requests simultaneously find a cache miss and all race to the post store — a thundering herd. Mitigate with: (1) probabilistic early re-population (start refreshing before expiry with probability proportional to remaining TTL); (2) a distributed lock so only one request populates the cache while others wait; (3) never fully expiring hot keys — refresh in the background on a separate worker thread.
Worked example: a single timeline read, end to end
User Alice (follows 280 normals + 3 celebrities) requests her home feed. The system executes these steps:
- Cache hit check. Look up
timeline:{alice_id}in the Redis sorted set. The set holds 500 post-id/score pairs. HIT: read the top 20 ids. MISS: rebuild from post store (cold path, not shown). - Celebrity stitch. For each of Alice's 3 celebrity follows, fetch
celeb_posts:{celeb_id}from the celebrity post cache — typically 3 round-trips parallelised in a Redis pipeline, returning the last 20 posts per celebrity. - Merge and rank. Merge the normal timeline (20 ids) with up to 60 celebrity post ids; sort all by timestamp descending; take the top 20.
- Post hydration. Batch-fetch the 20 post objects from the post content cache (or post store on miss).
- Return. Serialise and return the 20 posts.
⚠️ Modeled, not measured — latency budget (illustrative)
Step 1 TL cache lookup ≈ 0.5 ms (local Redis, pipeline)
Step 2 3 celeb cache fetches ≈ 1.5 ms (parallel, same Redis cluster)
Step 3 Merge + sort (in memory) ≈ 0.1 ms (20+60 items, trivial)
Step 4 20 post hydrations ≈ 2.0 ms (batch GET, cache hit assumed)
Step 5 Serialisation + response ≈ 0.5 ms
──────────────────────────────────────────────
Total (p50 warm cache) ≈ 4.6 ms (well within a 50 ms SLO)
Cold path (cache miss on TL) ≈ 50–200 ms (DB fan-in + rebuild)
The celebrity stitch makes N Redis calls (one per celebrity followed). Issuing them serially multiplies latency by N. Use a Redis PIPELINE or MGET to batch all lookups in a single round-trip. At 3 celebrities this saves ~1 ms; at 10 celebrities it saves ~5 ms — the difference between meeting and missing a 10 ms SLO.
🧠 Quick check
1. A user with 1,000,000 followers posts a tweet on a pure fan-out-on-write system. How many timeline cache writes does that trigger?
Fan-out-on-write copies the post reference into each follower's timeline cache. With 1 M followers, that is exactly 1 M cache inserts per post — the write amplification factor equals the follower count.
2. The hybrid strategy marks an author as "celebrity" (fan-out-on-read) primarily to:
The celebrity threshold exists to cap write amplification. A celebrity's post writes once to a celebrity post cache instead of fanning out to potentially hundreds of millions of follower caches. The stitch cost at read time is small because any given user follows only a handful of celebrities.
3. A timeline cache stores per-user sorted sets of 500 post ids at 16 bytes each. For 50 M active users, the raw data volume before Redis overhead is:
50 M users × 500 entries × 16 bytes = 400 GB raw. Redis overhead (hash structures, pointers, encoding) roughly doubles this to ~800 GB, requiring a modest cluster of ~13 nodes at 64 GB RAM each, plus replicas for availability.
4. Which technique most directly addresses the hot-key problem on a celebrity post cache entry?
An in-process cache on each app server means each server fetches the celebrity post list from Redis at most once per TTL interval (e.g., every few seconds), multiplying available read throughput by the number of app servers. Switching to fan-out-on-write would recreate the write amplification problem the celebrity strategy was designed to avoid.
✍️ Exercise: design the timeline API for the hybrid system
Design the GET /v1/timeline endpoint for the hybrid feed system described in this lesson. Specify: (a) the request parameters needed for cursor pagination, (b) the response schema (showing how normal posts and celebrity posts appear in the same list), (c) the steps your backend takes to build the response, and (d) what changes when the user has no cached timeline (cold path). Identify any consistency trade-offs.
Model answer
(a) Request: GET /v1/timeline?cursor=<opaque_token>&limit=20. The cursor encodes the timestamp and post-id of the last seen item, allowing the server to page forward without offset-based skew. No page number parameter — live feeds use cursor pagination because new posts shift row positions between requests.
(b) Response schema:
{
"posts": [
{ "id": "p_1234", "author_id": "u_9", "body": "...", "created_at": "2025-06-20T10:00:00Z" }
// celebrity and normal posts interleaved, sorted by created_at desc
],
"next_cursor": "eyJ0IjoxNzUwMDE...", // null if end of feed
"freshness": "cached" // "cached" | "live" — transparency hint
}
(c) Warm-cache steps: (1) fetch timeline:{user_id} sorted set (Redis ZREVRANGEBYSCORE with cursor score); (2) for each celebrity the user follows, pipeline-fetch celeb_posts:{celeb_id}; (3) merge and rank by score/timestamp; (4) batch-hydrate post objects; (5) set next_cursor from the lowest timestamp in the result set.
(d) Cold path: On cache miss, fall back to the post store: query followed users' posts from the DB (scatter-gather), rank, write the result back to the timeline cache for subsequent requests, and respond. This path has higher latency (~50–200 ms vs. ~5 ms warm).
Consistency trade-off: The pre-built normal timeline cache can be up to a few seconds stale — a post from a followed account may not appear in the timeline immediately after posting. This is acceptable (eventually consistent) for a social feed where near-real-time is sufficient. If exact ordering or freshness is required, the cache TTL should be shortened or the endpoint should always stitch from the post store.
Rubric: cursor pagination named and justified (vs offset) — 1 pt; celebrity + normal interleaved in same schema — 1 pt; cache lookup → celeb stitch → merge → hydrate steps listed — 2 pts; cold path identified with latency impact — 1 pt; consistency trade-off stated — 1 pt.
Key takeaways
- Fan-out-on-write produces O(follower_count) cache writes per post. At 100 M followers, that is 100 M writes — a wall that cannot be absorbed in real time.
- Fan-out-on-read avoids write amplification but produces O(followed_accounts × recent_posts) reads per timeline request — unsustainable at large follow counts without caching.
- The hybrid strategy routes normal users through fan-out-on-write (cheap reads) and celebrities through a dedicated post cache stitched at read time (cheap writes). The stitch cost is low because any user follows only a handful of celebrities.
- Timeline cache sizing: ~8 KB per user × 50 M active users ≈ 400 GB raw, ~800 GB with Redis overhead — a modest 13-node cluster before replication.
- Celebrity post cache entries are hot keys. Mitigate with in-process caching per app server and/or Redis read replicas to avoid a single node becoming a bottleneck at millions of reads per second.
Sources & further reading
- Twitter Engineering Blog — infrastructure archive — original public posts on timeline and fan-out design decisions.
- Raffi Krikorian, "Timelines at Scale" (InfoQ/QCon 2012) — the seminal public talk on Twitter's fan-out architecture and the hybrid strategy.
- Caching strategies (this course) — cache sizing, eviction, and hot-key mitigations.
- Load balancing (this course) — hot-key sharding and consistent hashing.
- Load balancer simulation (this course) — interactive hot-key demonstration.
- Design Case Study: Social Feed API (this course) — full system design walkthrough with endpoint shapes.
- Cache simulation (this course) — TTL and eviction behaviour.