Production at Scale · Walkthrough 01
From 1 box to 100M requests/day
Every system starts on a single machine. Scaling is not a single decision — it is a staircase of eight architectural moves, each triggered by a specific metric hitting its ceiling. This walkthrough models the arithmetic, names each bottleneck, and shows exactly which lever to pull.
By the end you'll be able to
- Convert a target request volume into average and peak req/s, and size server counts from first principles.
- Name the eight scaling stages in order, state the metric that forces each move, and describe the architectural fix.
- Identify which simulator or lesson corresponds to each stage and explain how they interact.
Step 0: the arithmetic before anything else
Before drawing a single box, do the back-of-envelope math. "100M requests per day" is a volume target, not a rate — you must convert it to understand what you are actually building for.
System-design interviews and production planning both reward the engineer who grounds every architectural decision in an explicit number, not a vague sense of "lots of traffic."
-- The fundamental unit conversion --
100,000,000 req/day
÷ 86,400 seconds/day
= 1,157 req/s ≈ 1,160 req/s average
-- Peak traffic (diurnal + event spikes) --
Peak multiplier: 3×–5× avg for typical consumer workloads
1,160 × 3 = 3,480 req/s (conservative peak)
1,160 × 5 = 5,800 req/s (spike peak, use this for headroom)
-- Single-server benchmark assumption (illustrative) --
⚠️ Modeled, not measured — actual throughput depends on
workload, language, hardware, and request complexity.
A modest 2-vCPU app server handling lightweight JSON reads:
~200–500 req/s before CPU or DB becomes the bottleneck.
-- Servers needed to serve peak (illustrative) --
Target: 5,800 req/s peak
Per-server capacity: 300 req/s (conservative, DB-bound estimate)
Servers = ⌈5,800 ÷ 300⌉ = 20 app servers
Add 30% headroom → round to 26–28 instances
Every throughput figure in this walkthrough is a first-principles model, not a benchmark from a specific system. Real per-server capacity varies enormously: a Node.js gateway serving cached JSON may handle 5,000 req/s; a Python API doing database joins may cap at 50 req/s on the same hardware. Always load-test your own stack. The models here are for reasoning about order of magnitude and architectural shape, not for precise capacity planning.
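The unit conversion above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original walkthrough: the function names `average_rps` and `peak_rps` are hypothetical, and the 3× and 5× multipliers are the same illustrative assumptions used in the text.

```python
SECONDS_PER_DAY = 86_400


def average_rps(requests_per_day: float) -> float:
    """Average requests per second for a given daily request volume."""
    return requests_per_day / SECONDS_PER_DAY


def peak_rps(requests_per_day: float, peak_multiplier: float) -> float:
    """Peak requests per second, modeled as a multiple of the average."""
    return average_rps(requests_per_day) * peak_multiplier


daily_volume = 100_000_000
print(round(average_rps(daily_volume)))   # 1157
print(round(peak_rps(daily_volume, 3)))   # 3472
print(round(peak_rps(daily_volume, 5)))   # 5787
```

Computing from the unrounded average gives 3,472 and 5,787 req/s. The text rounds the average to 1,160 first, which is why it shows 3,480 and 5,800; at this precision the difference does not matter.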
The scaling staircase: eight stages
Think of scaling as climbing a staircase in the dark. Each step is invisible until a metric starts climbing: CPU utilisation, query latency, write QPS, or p99 response time. You climb one step at a time — over-engineering ahead of traffic wastes money and adds fragility. Here is what the staircase looks like end-to-end.
Stage-by-stage detail
Stage 1 — single box: the happy path that ends quickly
A single VM or bare-metal server runs the application and the database on the same host. This is the right starting point: it is simple to operate, easy to debug, and cheap. The bottleneck is resource contention: the app and the DB compete for CPU, RAM, and disk I/O on the same machine. Once throughput climbs past a few hundred requests per second for a typical CRUD workload, CPU utilisation or disk latency begins climbing steeply. This is the hockey-stick inflection you can see in the capacity & autoscale simulator.
Trigger metric: CPU utilisation > 70% sustained, or DB query latency p99 climbing toward 100 ms.
Stage 2 — split app and database onto separate hosts
Move the database to its own server. Immediately the app has more CPU for request processing, and the DB has the disk throughput it needs. This is the cheapest scaling lever — no new architecture, just a topology change. Throughput ceiling roughly doubles because the two resource pools no longer compete.
Trigger metric: CPU breakdown shows > 50% consumed by the database process while the app process is starved.
Stage 3 — add a caching layer (link: cache simulator)
Most web APIs are read-heavy. If 90% of reads can be served from an in-memory cache (Redis, Memcached), the database sees only the remaining 10%: a 10× reduction in DB read load with no change to the DB hardware. That relief is often the highest-leverage single move on the staircase.
-- Cache impact math (illustrative) --
⚠️ Modeled, not measured
Before cache: 1,160 req/s → 1,160 DB reads/s
Cache hit rate: 90%
After cache: 1,160 × (1 − 0.9) = 116 DB reads/s
If DB can handle 1,000 reads/s before latency degrades:
Without cache: headroom = 1,000 − 1,160 = −160 reads/s (already blown)
With cache: headroom = 1,000 − 116 = +884 reads/s spare
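The cache arithmetic above can be expressed as two small functions. This is a sketch under the same illustrative numbers as the text; the names `db_reads_after_cache` and `db_read_headroom` are hypothetical.

```python
def db_reads_after_cache(read_rps: float, hit_rate: float) -> float:
    """Reads per second that still reach the database after cache hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return read_rps * (1.0 - hit_rate)


def db_read_headroom(db_capacity_rps: float, read_rps: float, hit_rate: float) -> float:
    """Spare DB read capacity; a negative value means the DB is overloaded."""
    return db_capacity_rps - db_reads_after_cache(read_rps, hit_rate)


print(round(db_reads_after_cache(1_160, 0.9)))      # 116
print(round(db_read_headroom(1_000, 1_160, 0.0)))   # -160
print(round(db_read_headroom(1_000, 1_160, 0.9)))   # 884
```

Trying a few hit rates shows how sensitive the result is: at 80% the database sees twice the load it sees at 90%.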
Trigger metric: DB CPU or connection pool saturation despite the app being well under its CPU ceiling — reads are the bottleneck, not compute.
A 90% hit rate is only safe if the cached data is fresh enough for your use case. TTLs that are too long serve stale data; TTLs that are too short push the hit rate down and defeat the cache. Write-through and cache-aside are the two dominant patterns; choose based on how much stale data your users can tolerate.
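To make the cache-aside pattern and the TTL trade-off concrete, here is a minimal in-process sketch. It assumes a plain dictionary in place of Redis or Memcached, and the class name `CacheAside`, the `load_from_db` callback, and the injectable `clock` are all hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside: check the cache first, fall back to the DB on a miss."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: read from the DB and repopulate the cache.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key after a write so the next read fetches fresh data."""
        self._entries.pop(key, None)


db_calls = []

def fake_db_load(key: str) -> str:
    db_calls.append(key)  # stands in for a real query
    return f"row-for-{key}"

cache = CacheAside(fake_db_load, ttl_seconds=30.0)
cache.get("user:1")
cache.get("user:1")
print(cache.hits, cache.misses, len(db_calls))  # 1 1 1
```

The TTL bounds how stale a value can get; calling `invalidate` on writes tightens that bound at the cost of extra misses.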
Stage 4 — load balancer + N stateless app servers (link: scaling strategies, autoscale simulator)
A single app server has a throughput ceiling, since there are only so many threads or event-loop cycles per second. The fix is horizontal scaling: put a load balancer in front and add more identical, stateless app servers. "Stateless" is the key word. If session state lives in the server's memory, requests must be routed back to the same server (sticky sessions), which undermines load balancing. Move all session state to a shared store (Redis, the DB, a cookie) first.
-- Server sizing formula --
⚠️ Modeled, not measured
Peak req/s target: 5,800
Per-server throughput (illustrative): 300 req/s
(assumes ~2-vCPU server, mixed read/write, some DB calls)
N = ⌈5,800 ÷ 300⌉ = ⌈19.3⌉ = 20 servers (no headroom)
Apply 30% headroom buffer:
N_final = ⌈20 × 1.3⌉ = 26 servers
With autoscaling: keep 10–12 running at avg load,
scale out to 26 on traffic spikes.
Trigger metric: Single-server CPU exceeds 70% at peak, or p99 latency starts rising linearly with request rate.
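The sizing formula for this stage can be written as a small helper. This is a sketch using the text's illustrative 300 req/s per server; `servers_needed` is a hypothetical name, and headroom is passed as an integer percentage to keep the rounding exact.

```python
import math


def servers_needed(target_rps: float, per_server_rps: float,
                   headroom_percent: int = 0) -> int:
    """Servers required for a target rate, with headroom applied after rounding up."""
    base = math.ceil(target_rps / per_server_rps)
    return math.ceil(base * (100 + headroom_percent) / 100)


print(servers_needed(5_800, 300))        # 20  (peak, no headroom)
print(servers_needed(5_800, 300, 30))    # 26  (peak, 30% headroom)
print(servers_needed(1_160, 300))        # 4   (average load, no headroom)
print(servers_needed(1_160, 300, 30))    # 6   (average load, 30% headroom)
```

Under these assumptions the average load needs only 4 to 6 servers, so the 10–12 baseline quoted above is a conservative floor rather than a minimum.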
Stage 5 — read replicas for read-heavy skew
Most applications have a heavily skewed read:write ratio — often 80–95% reads on a social or content platform. A single DB primary can service this, but not indefinitely. Read replicas receive a streaming copy of the write-ahead log from the primary and serve SELECT queries independently. The primary handles only writes; replicas share the read load.
-- Read/write split math (illustrative) --
⚠️ Modeled, not measured
Total DB queries at peak: 5,800/s (worst case: every request misses the cache)
Read fraction: 90% → 5,220 reads/s
Write fraction: 10% → 580 writes/s
If each replica handles 2,000 reads/s:
Read replicas needed = ⌈5,220 ÷ 2,000⌉ = 3 replicas
Primary handles all writes: 580/s → within typical OLTP limits
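The read/write split above reduces to a ratio and a ceiling division. A minimal sketch with hypothetical function names, using the same illustrative 2,000 reads/s per replica:

```python
import math
from typing import Tuple


def split_reads_writes(total_qps: float, read_fraction: float) -> Tuple[float, float]:
    """Split total DB queries per second into (reads, writes)."""
    reads = total_qps * read_fraction
    return reads, total_qps - reads


def replicas_needed(read_qps: float, per_replica_qps: float) -> int:
    """Minimum number of read replicas to absorb the read load."""
    return math.ceil(read_qps / per_replica_qps)


reads, writes = split_reads_writes(5_800, 0.9)
print(round(reads), round(writes))      # 5220 580
print(replicas_needed(reads, 2_000))    # 3
```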
Caveat: replicas have replication lag, typically milliseconds on a healthy primary but up to seconds under load. If your application needs read-your-own-writes consistency (a user posts a comment and immediately re-fetches the feed), route those reads to the primary. See consistency & CAP for the full story.
Trigger metric: DB CPU or IOPS climbing on the primary; SELECT query throughput approaching the primary's thread-pool limit.
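One common way to handle the read-your-own-writes caveat is to pin a user's reads to the primary for a short window after they write. The sketch below assumes that approach; the `ReadRouter` class, the `pin_seconds` window, and the `"primary"` and `"replica"` labels are hypothetical, and a real system would choose the window from observed replication lag.

```python
import time
from typing import Callable, Dict


class ReadRouter:
    """Route reads to the primary for a short window after a user's write."""

    def __init__(self, pin_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._pin_seconds = pin_seconds
        self._clock = clock
        self._last_write: Dict[str, float] = {}  # user_id -> time of last write

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self._clock()

    def target_for_read(self, user_id: str) -> str:
        last = self._last_write.get(user_id)
        if last is not None and self._clock() - last < self._pin_seconds:
            return "primary"   # replica may not have this user's write yet
        return "replica"


router = ReadRouter(pin_seconds=5.0)
print(router.target_for_read("alice"))   # replica
router.record_write("alice")
print(router.target_for_read("alice"))   # primary
print(router.target_for_read("bob"))     # replica
```

Only the writing user is pinned, so the primary absorbs a small fraction of reads while everyone else continues to hit replicas.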
Stage 6 — shard the database when write QPS exceeds one primary (link: scaling the database)
Read replicas solve read scale. Write scale is harder — all writes still funnel through one primary. Sharding partitions the dataset across multiple independent databases (shards), each owning a key range or consistent-hash bucket. Writes are distributed across shards, so write QPS scales horizontally.
-- Sharding headroom estimate (illustrative) --
⚠️ Modeled, not measured
Single-primary write ceiling (illustrative): ~5,000–10,000 writes/s
(depends on row size, index count, fsync policy)
At 100M req/day with 10% writes:
Peak writes = 5,800 × 10% = 580/s ← still OK on one primary
At 1B req/day with 20% writes:
Peak writes = 58,000 × 20% = 11,600/s → needs sharding
With 4 shards: 11,600 ÷ 4 = 2,900 writes/shard → comfortable
Sharding comes with a steep operational cost: cross-shard queries, resharding when growth demands it, and the loss of atomic transactions that span shards. Only shard when the write QPS data shows you have no alternative.
Trigger metric: Primary write latency climbing; write QPS approaching the single-node ceiling; disk IOPS or WAL generation saturating.
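To illustrate how writes get distributed, here is a minimal sketch of hash-based shard routing. It uses simple hash-modulo placement, which is an assumption made for brevity rather than the consistent-hash scheme mentioned above; `shard_for_key` and `writes_per_shard` are hypothetical names.

```python
import hashlib
from collections import Counter


def shard_for_key(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def writes_per_shard(total_write_qps: float, shard_count: int) -> float:
    """Average write load per shard, assuming keys are evenly distributed."""
    return total_write_qps / shard_count


print(writes_per_shard(11_600, 4))   # 2900.0

# Rough look at how evenly 10,000 synthetic user keys spread over 4 shards.
spread = Counter(shard_for_key(f"user:{i}", 4) for i in range(10_000))
print(sorted(spread.items()))
```

With modulo placement, changing the shard count remaps most keys, which is one reason resharding is costly; consistent hashing reduces how many keys move.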
Stage 7 — async queues for spikes (link: queue & backpressure simulator)
Some work does not need to be synchronous. Email sends, image resizing, ledger reconciliation, audit log writes — these can be offloaded to a message queue. The API returns immediately after enqueueing; background workers drain the queue at their own pace. This converts a synchronous latency problem into a queue depth metric that you can monitor and scale independently.
The math benefit is twofold: the API server handles more requests per second (because each request is cheaper — it just writes to a queue), and spikes do not cascade into database overload because the queue acts as a shock absorber.
-- Queue depth headroom (illustrative) --
⚠️ Modeled, not measured
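Spike: 10,000 email jobs arrive in 60 s (~167 jobs/s if evenly spread)
Workers drain at: 200 emails/s (illustrative SMTP rate)
Evenly spread: arrival rate < drain rate, so queue depth stays near zero
Worst case (all 10,000 land at once): drain time = 10,000 ÷ 200 = 50 s
Peak queue depth = ~10,000 messages (a few MB in Redis/SQS)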
Without queues: 10,000 synchronous SMTP calls in 60 s
→ thread pool exhaustion, API timeouts, user errors
Trigger metric: p99 latency spiking on write-heavy endpoints during traffic bursts; downstream services (email, payment, search-index) becoming the bottleneck for synchronous API paths.
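Queue depth depends on how the jobs arrive, not just on how many there are. The following sketch steps through time one second at a time to compare a single burst with evenly spread arrivals; `simulate_queue` is a hypothetical helper and the rates are the illustrative ones from the text.

```python
from typing import List, Tuple


def simulate_queue(arrivals_per_second: List[int], drain_per_second: int) -> Tuple[int, int]:
    """Return (peak queue depth, seconds until the queue is empty)."""
    depth = 0
    peak = 0
    second = 0
    while second < len(arrivals_per_second) or depth > 0:
        if second < len(arrivals_per_second):
            depth += arrivals_per_second[second]    # jobs enqueued this second
        peak = max(peak, depth)
        depth = max(0, depth - drain_per_second)    # workers drain the queue
        second += 1
    return peak, second


# Worst case: all 10,000 jobs land in the same second.
print(simulate_queue([10_000], 200))                # (10000, 50)

# Same 10,000 jobs spread evenly over 60 s (40 s of 167 plus 20 s of 166).
print(simulate_queue([167] * 40 + [166] * 20, 200)) # (167, 60)
```

The burst case reproduces the 50 s drain time and the ~10,000 peak depth; the evenly spread case never builds a backlog because workers drain faster than jobs arrive.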
Stage 8 — multi-region for latency and disaster recovery (link: high availability)
Serving users from a single region imposes a physics-bound latency floor: a request from Tokyo to a Virginia data centre and back takes ~150–200 ms of round-trip time, bounded below by the speed of light in fibre, even with zero processing time. Deploy to multiple regions to serve users from nearby points of presence, and gain DR failover as a side benefit.
-- Latency reduction from regional placement (illustrative) --
⚠️ Modeled, not measured — actual RTT depends on routing and ISP
Tokyo → us-east-1 (Virginia): ~155 ms base RTT
Tokyo → ap-northeast-1 (Tokyo): ~3 ms base RTT
Improvement: ~150 ms per round-trip
For a UI that makes 5 sequential API calls on page load:
Single-region: 5 × 155 ms = 775 ms wire latency alone
Multi-region: 5 × 3 ms = 15 ms wire latency
Multi-region introduces the hardest problem in distributed systems: data consistency across geographically separated replicas. Active-passive (writes to one region, failover only) is the safest starting point. Active-active (writes accepted anywhere) requires conflict resolution and is operationally complex.
Trigger metric: p50 latency from non-primary regions > 100 ms; RTO/RPO requirements from product or legal that a single region cannot satisfy.
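The latency arithmetic for this stage, plus a rough physical lower bound, can be sketched as follows. The function names are hypothetical; the fibre speed of about 200 km per millisecond (roughly two-thirds of the speed of light in vacuum) and the ~11,000 km Tokyo to Virginia distance are approximations assumed here, not figures from the text.

```python
FIBRE_KM_PER_MS = 200.0  # assumption: light in optical fibre, roughly 2/3 of c


def min_fibre_rtt_ms(distance_km: float) -> float:
    """Theoretical minimum round-trip time over a straight fibre path."""
    return 2.0 * distance_km / FIBRE_KM_PER_MS


def sequential_wire_latency_ms(rtt_ms: float, sequential_calls: int) -> float:
    """Network time for calls that must run one after another."""
    return rtt_ms * sequential_calls


print(min_fibre_rtt_ms(11_000))               # 110.0
print(sequential_wire_latency_ms(155, 5))     # 775
print(sequential_wire_latency_ms(3, 5))       # 15
```

Measured RTTs sit above the theoretical floor because real cable routes are longer than the great-circle path and add switching delay. Issuing the five calls in parallel would also cut wire latency, but only a nearby region lowers the per-call floor.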
The complete stage → bottleneck → fix → headroom table
The req/s ceilings below are illustrative order-of-magnitude estimates for a typical CRUD API on commodity cloud hardware. Your actual numbers depend on request complexity, payload size, database schema, and hardware generation. Use these to reason about shape, not to set precise provisioning targets.
| Stage | Architecture | Bottleneck that forces the move | Fix | Illustrative ceiling (req/s) | Related lesson / sim |
|---|---|---|---|---|---|
| 1 | Single box (app + DB) | None (starting point); app and DB share one host's CPU, RAM, and disk I/O | Nothing to fix until a trigger metric fires | ~100–500 ⚠️ | sim-06 |
| 2 | App server + DB server | DB on same host as app limits both | Separate app & DB hosts | ~500–2,000 ⚠️ | rel-14 |
| 3 | App + DB + Cache | DB saturated by repeated identical reads | Cache with 90%+ hit rate; DB sees 10× less load | ~2,000–20,000 ⚠️ | sim-02 |
| 4 | LB + N stateless app servers | Single app server CPU ceiling | Horizontal scale; N = ⌈peak ÷ per-server⌉ × 1.3 | ~20,000–100,000 ⚠️ | sim-06, rel-14 |
| 5 | + Read replicas | DB read QPS saturates primary | Route reads to replicas; primary handles writes only | Reads scale with replica count ⚠️ | rel-15 |
| 6 | + DB sharding | Write QPS exceeds single-primary ceiling | Partition rows across shards by key range or hash | Writes scale with shard count ⚠️ | rel-15 |
| 7 | + Async queues | Synchronous downstream calls inflate p99 during spikes | Offload non-critical work to queue; decouple throughput from downstream rate | API req/s decoupled from worker rate ⚠️ | sim-04 |
| 8 | + Multi-region | Speed-of-light latency penalty for distant users; single-region DR risk | Deploy to 2–3 regions; serve users from nearest PoP | Latency: regional RTT from ~150 ms to ~3 ms ⚠️ | rel-17 |
Architecture diagram: the full stack after all eight stages
"Design a system to handle 100M requests per day" is one of the most common system-design prompts. Strong candidates begin with the arithmetic (1,160 avg req/s → ~5,800 peak), then walk through the staircase in order — naming the bottleneck metric at each step rather than just listing components. Interviewers want to see you tie architectural decisions to numbers, not architecture for architecture's sake. The phrase "the trigger metric for adding read replicas is DB read QPS approaching the primary's thread-pool limit" is the kind of mechanism-level reasoning that separates good answers from great ones.
🧠 Quick check
1. 100M requests/day converts to approximately what average request rate?
100,000,000 ÷ 86,400 = 1,157 req/s, rounded to ~1,160. Peak is typically 3–5× this — so ~3,500–5,800 req/s. This is the first calculation to do in any capacity-planning exercise.
2. A cache with a 90% hit rate reduces database read load by approximately how much?
With a 90% hit rate, 90% of reads are served from the cache and never touch the database. Only 10% of reads reach the DB — a 10× reduction in DB read load for the same request volume.
3. Read replicas solve read scale-out but do NOT help with which problem?
Read replicas fan out reads across multiple nodes, but each replica still has to apply a copy of every write from the primary, so write throughput remains bounded by what the single primary can sustain. Sharding is the tool for write scale-out.
4. When is DB sharding the right move?
Sharding is operationally expensive: cross-shard queries, resharding events, and the loss of atomic transactions that span shards. Introduce it only when write QPS actually forces it. For 100M req/day with 10% writes, peak writes are ~580/s, well within a healthy single primary. Sharding becomes necessary at roughly 10× that volume.
✍️ Exercise: size a system for a new social feature
A product team tells you: "We're launching a photo feed. We expect 50M active users/day, each user makes 4 feed-read requests and 0.5 photo-upload requests per day. Photos are stored in object storage (S3-equivalent) — not in the database. The feed is pre-computed and stored in cache. Assume a 95% cache hit rate for feed reads."
Work through: (a) total requests/day and avg req/s; (b) peak req/s at 4× avg; (c) DB operations/s after cache (reads only, writes are the uploads); (d) minimum read replicas if each handles 1,500 reads/s; (e) whether sharding is needed at launch.
Model answer:
- (a) Total req/day and avg req/s: Feed reads: 50M × 4 = 200M/day. Uploads: 50M × 0.5 = 25M/day. Total: 225M/day. Avg: 225M ÷ 86,400 = ~2,604 req/s.
- (b) Peak req/s at 4×: 2,604 × 4 = ~10,400 req/s. Size your app-server fleet for this number (with 30% headroom: ~13,500 req/s capacity needed).
- (c) DB reads/s after cache: Feed reads at peak: 10,400 × (200/225) ≈ 9,244 req/s. At a 95% cache hit rate: 9,244 × 5% = ~462 DB reads/s at peak. Uploads (writes): 10,400 × (25/225) ≈ 1,156 writes/s at peak.
- (d) Read replicas: 462 reads/s ÷ 1,500 per replica = 0.3 → 1 replica is enough for reads. (The primary handles writes.)
- (e) Sharding at launch: Write QPS is ~1,156/s at peak. A healthy PostgreSQL or MySQL primary can sustain 5,000–10,000 simple writes/s (illustrative). No sharding needed at launch. Monitor as the product grows.
Rubric: Full marks for correct unit conversions, applying the cache hit formula, and a reasoned sharding answer with the write-QPS threshold as the deciding factor. Partial marks if unit conversions are right but the cache math is skipped. Bonus for noting that writes at 1,156/s warrant monitoring write latency as a leading indicator for when to revisit sharding.
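The model answer can be checked with a short script. This is a sketch of the exercise's arithmetic only: `size_photo_feed` is a hypothetical name, and the 5,000 writes/s sharding threshold is the low end of the illustrative single-primary range quoted above.

```python
import math
from typing import Dict

SECONDS_PER_DAY = 86_400


def size_photo_feed(users_per_day: int, reads_per_user: float, uploads_per_user: float,
                    peak_multiplier: float, cache_hit_rate: float,
                    per_replica_reads: float, primary_write_ceiling: float = 5_000) -> Dict[str, float]:
    """Work through parts (a) to (e) of the exercise."""
    feed_reads_per_day = users_per_day * reads_per_user
    uploads_per_day = users_per_day * uploads_per_user
    avg_rps = (feed_reads_per_day + uploads_per_day) / SECONDS_PER_DAY
    peak_feed_reads = feed_reads_per_day / SECONDS_PER_DAY * peak_multiplier
    peak_db_reads = peak_feed_reads * (1.0 - cache_hit_rate)
    peak_writes = uploads_per_day / SECONDS_PER_DAY * peak_multiplier
    return {
        "avg_rps": avg_rps,                                       # (a)
        "peak_rps": avg_rps * peak_multiplier,                    # (b)
        "peak_db_reads": peak_db_reads,                           # (c)
        "peak_writes": peak_writes,                               # (c)
        "read_replicas": max(1, math.ceil(peak_db_reads / per_replica_reads)),  # (d)
        "needs_sharding": peak_writes > primary_write_ceiling,    # (e)
    }


result = size_photo_feed(50_000_000, 4, 0.5, peak_multiplier=4,
                         cache_hit_rate=0.95, per_replica_reads=1_500)
for name, value in result.items():
    print(name, round(value) if isinstance(value, float) else value)
```

Working from unrounded values gives about 463 DB reads/s and 1,157 writes/s at peak, slightly different from the model answer's 462 and 1,156 because the answer rounds peak to 10,400 before splitting it. The conclusions for (d) and (e) are unchanged.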
Key takeaways
- Start with arithmetic: 100M req/day = ~1,160 avg req/s; peak is 3–5×; size for peak plus 30% headroom.
- Scaling is a staircase of eight stages, each triggered by a specific bottleneck metric — not a choice made upfront.
- Cache before replicating: a 90% hit rate reduces DB load 10×; this is often the highest-leverage single move.
- Read replicas solve read scale; sharding solves write scale. Reach for sharding only when write QPS data shows the single primary is approaching its ceiling.
- Async queues decouple spike absorption from synchronous API latency; they are essential for any path with a slow downstream.
- Multi-region is the last step because it is operationally expensive — but it is the only way to beat speed-of-light latency for geographically distributed users.
- All throughput figures in this lesson are illustrative models. Load-test your own stack before committing to a topology.