API Design

Reliability & Scale · Lesson 14

Scaling from 1 to 1,000,000 users

The architecture graveyard is full of systems that worked perfectly at launch. They weren't killed by bad code — they were killed by the architecture that was right at Stage 1 still running the show at Stage 5.

⏱ 18 min Difficulty: advanced Prereq: Caching, Load Balancing, Event-Driven & Pub/Sub, Capacity Estimation

By the end you'll be able to

Why scale kills systems

The most important insight in this lesson is also the most counter-intuitive: you do not design for one million users on day one. You design a system that can evolve through scale stages without requiring a full rewrite each time. The startup that builds a CQRS + event-sourced + sharded architecture for its first 200 users isn't being forward-thinking — it's writing a system nobody on the team fully understands, that takes three times as long to build, and that has more failure modes than the problem it was built to handle.

Instead, the bottleneck shifts as you grow, and that shift is predictable:

  1. First, you hit CPU limits on the single app server — too many requests, not enough compute.
  2. Then, the database becomes the bottleneck — it's the one component that every app server talks to, and it can't scale by simply adding machines the way app servers can.
  3. Then, the write path saturates — reads can be spread across replicas, but all writes still go through one primary.
  4. Finally, geographic latency dominates — the speed of light becomes your new bottleneck, and no amount of compute fixes a 180 ms round-trip to a data centre on the other side of the planet.

Each of these shifts demands a different architectural response. The skill is recognising which shift you're in — using real metrics, not guesswork — and applying the right technique at the right time.

Vertical vs horizontal scaling

Every scaling decision starts with a fundamental choice: make the machine you have bigger, or add more machines.

Vertical scaling (scale up)

Buy a server with more CPU cores, more RAM, or faster storage. The application needs zero code changes — it simply has more resources. This is the right first move; it's cheap, fast, and introduces no new failure modes. The problem is the ceiling. At some point you run out of machine to buy. AWS's largest general-purpose instance (as of 2025) has 448 vCPUs and 6 TB of RAM. That is genuinely enormous — most systems never need it — but it is a hard ceiling, and the cost curve steepens sharply before you get there.

Horizontal scaling (scale out)

Add more machines and spread the load across them. Instead of one waiter running faster to serve a full restaurant, you hire ten waiters. Each one handles a fraction of the tables, and you can hire more on a busy Friday night and let some go on a slow Tuesday.

Horizontal scaling can extend almost without limit. Cloud providers have enough physical capacity that you will almost certainly exhaust your budget before you exhaust the pool of available instances. The price is architectural: horizontal scaling only works cleanly when each server carries no local state. A server that holds per-user session data in its own memory is "sticky" — users must be routed back to the same server on every request. If that server fails, all of its users lose their sessions. And every new server you add is a potential destination that lacks the session data for half your users.

The enabling move is to externalise state — sessions, caches, uploaded files — to a shared store that every server can reach. Once you do that, every app server becomes interchangeable, and the load balancer is free to send any request to any server. This connects directly to the session-affinity problem covered in the load-balancing lesson and the caching layer discussed in the caching lesson.

Scale Up (Vertical) CPU RAM Disk Simple — no code changes One machine to manage Single point of failure hard ceiling cannot add more Scale Out (Horizontal) Server A stateless Server B stateless Server C stateless + D, E… Shared session store (Redis) Load balancer routes freely Near-linear capacity growth kill any server — users notice nothing
Left: vertical scaling hits a hard hardware ceiling and leaves a single point of failure. Right: stateless horizontal servers can be added or removed at will because no per-user state lives in their local memory.

The architecture progression: seven stages

What follows is a map of the seven stages most web systems pass through on their way from launch to serious scale. Each stage describes what the system looks like, what eventually breaks, and what you add to fix it. The numbers are approximate — your real thresholds depend on request complexity, DB schema, and read/write ratio. Use capacity estimation to know when you are approaching the next boundary.

Stage 1 — 0 to 1,000 users: the single box

Everything lives on one server: the application process, the database, and static file serving. A $20–40/month VPS handles this with room to spare for most CRUD applications. The deploy story is git push and systemctl restart app. The monitoring story is one SSH session.

What breaks: any component failure means a total outage. The app process and DB compete for the same CPU and RAM, so a spike in query load starves the app, and vice versa.

Correct response: nothing. Ship fast, validate your product, and collect real traffic data. Every premature architectural addition at this stage is debt you carry for months before it pays off.

Stage 2 — 1,000 to 10,000 users: split the database

Move the database to its own server. The app server and DB server each get dedicated CPU and RAM. The app can now spike without taking the DB down, and the DB can run autovacuum and index maintenance without starving HTTP handlers. This is also the moment to add automated backups and a basic alerting stack.

What breaks: the app server is still a single point of failure. A deployment takes the whole system down. The DB is starting to feel read pressure from a growing number of users hitting frequently-read rows.

What you add: a separate DB server (the two can talk over a private network interface). Automated nightly backups. A monitoring agent (Datadog, Prometheus + Grafana, or even UptimeRobot for the basics).

Stage 3 — 10,000 to 100,000 users: add a cache

A pattern emerges in the query log: roughly 80% of DB reads are for the same hot rows — user profiles, configuration tables, popular feed items, product catalogue entries. Every one of those reads hits the DB, waits for a disk or buffer-pool lookup, and returns the same bytes it returned a millisecond ago for a different user. A cache absorbs this flood by storing those hot bytes in RAM, where they can be returned in under a millisecond without touching the DB at all. See the caching lesson for the full cache invalidation playbook.

What breaks: write QPS is growing, and the cache adds invalidation complexity — stale reads become a risk. App server CPU begins to peak during traffic spikes because it's handling both web requests and cache-miss DB calls in the same process.

What you add: Redis or Memcached in front of the DB for hot reads. A CDN (Cloudflare, CloudFront, Fastly) for static assets — this alone removes a surprising fraction of origin requests.

Stage 4 — 100,000 users: load balancer + stateless app servers

One app server is now your ceiling for compute and your only failure mode for uptime. The fix is to run multiple app servers behind a load balancer — but this only works if the servers are genuinely stateless. Move session state to Redis (the same instance you likely added in Stage 3). Now the load balancer can route any request to any app server, and you can add more servers without users noticing. See the load-balancing lesson for algorithm choices.

What breaks: the DB is now the single funnel for all data operations. Five stateless app servers can handle five times the request volume, but all of that volume's reads and writes still flow through a single database primary. The DB becomes the new ceiling.

What you add: a load balancer (AWS ALB, NGINX, HAProxy). At least two stateless app servers. Session state moved to Redis. Deployment becomes rolling: take one server out, deploy, put it back, repeat — zero downtime.

Stage 5 — 500,000 users: read replicas

At a typical read:write ratio of 10:1, roughly 90% of the DB queries are reads. Adding 1–3 read replicas — databases that receive a continuous stream of write operations from the primary via replication and can serve read queries independently — lets you distribute 90% of the DB load across multiple machines. Direct all SELECT queries to the replicas; keep all INSERT, UPDATE, and DELETE on the primary. With three replicas, you've effectively multiplied your read capacity by 4× while the primary handles only writes.

What breaks: replication lag. Replicas receive the primary's write stream asynchronously, which means they may be 10–100 ms behind. A user who writes something and immediately reads it back may get stale data if the read hits a replica before the write has propagated. Write QPS is still limited to the single primary — you've solved the read problem but not the write problem.

What you add: read replicas (Postgres streaming replication; MySQL binlog replication). Query routing logic in the application or a proxy (PgBouncer, ProxySQL): writes go to primary host, reads go to replica host(s). Monitor replication lag as a first-class metric — alert at >500 ms.

Stage 6 — 1,000,000+ users: shard writes + async queues

Read replicas cannot help write throughput — all writes still hit one primary. When the primary can no longer absorb write QPS, two approaches run in parallel. First, DB sharding: partition your data across multiple independent primary databases using a shard key (commonly user_id % N → shard N). Each shard handles a fraction of the writes. Second, async queues: identify writes that don't need to be synchronous — sending email, recalculating feed rankings, generating thumbnails — and move them off the request path into a message queue (SQS, Kafka, RabbitMQ). A fleet of background workers drains the queue independently. See the event-driven lesson for queue mechanics.

What breaks: sharding introduces cross-shard complexity. A query that previously joined two tables in one database now potentially joins across shards, which may require an application-level scatter-gather. Analytics queries become expensive. Async queues introduce eventual consistency — callers must accept that some operations complete seconds or minutes after the API call returns.

What you add: horizontal DB sharding, or migrate to a NewSQL database (CockroachDB, PlanetScale, Spanner) that handles sharding transparently. A message queue for async work. Worker processes that consume the queue.

Stage 7 — multi-region

Beyond raw throughput, geographic latency becomes the ceiling. A user in Singapore talking to a data centre in Virginia experiences 180–220 ms of round-trip time before a single byte of application logic runs. The fix is to run a full application stack in each major region and route users to the nearest one via DNS geo-routing (AWS Route 53 latency routing, Cloudflare, etc.). This is the hardest distributed systems problem: how do you keep databases in different continents consistent with each other? CAP theorem guarantees you must trade either consistency or availability when a network partition separates the regions. Most systems choose to write globally and read locally, or use active-active with conflict resolution — a decision that depends deeply on the application's consistency requirements.

What breaks: cross-region writes, split-brain scenarios during network partitions, data sovereignty requirements (GDPR mandates that EU user data stays in the EU), and operational complexity that multiplies with every region you add. This territory is covered in detail in the high-availability lesson.

What you add: DNS-based geo-routing. Per-region app stacks. A global DB layer or a replication strategy (read-local, write-global; or active-active with a CRDT-based conflict resolution strategy).

S1 1K S2 10K S3 100K S4 100K+ S5 500K S6 1M+ S7 multi-region App DB 1 server App DB split App Cache DB + cache LB App App Redis DB primary + LB / stateless LB ×3 App Redis primary ×2 RO + read replicas LB ×n App Queue Shard 0 Shard 1 Shard 2 Workers + shards + queues Region A Full stack DB primary us-east-1 Region B Full stack DB replica ap-southeast-1 DNS geo-route multi-region Cache/Redis Load balancer Read replicas Message queue Workers
Each column is one stage. Components added at that stage appear for the first time with a coloured border. Notice how the DB evolves: a single box → split server → cache in front → primary + read replicas → shards.

Under the hood: why stateless enables horizontal scale

The word "stateless" appears constantly in scaling discussions. Here is exactly what it means mechanically, and why it matters so much.

The stateful problem in detail

A typical stateful app server stores session data in an in-process data structure — a HashMap<userId, SessionData> in the JVM heap, a Python dict in the process's memory, or a module-level variable in a Node.js process. When user 42 logs in, that session object is created and lives in Server A's RAM. The session might look like:

// Server A's in-memory session store (simplified)
sessions = {
  "tok_7f3a9c": {
    userId:    42,
    email:     "ali@example.com",
    role:      "admin",
    loggedInAt:1718000000
  }
}

Request 2 from user 42 arrives. The load balancer uses round-robin and sends it to Server B. Server B has no tok_7f3a9c key in its session map — it was never told about the login. Server B either returns a 401 (user appears logged out) or panics trying to look up a null session.

The "fix" teams reach for is sticky sessions: configure the load balancer to always send user 42 to Server A. This works until Server A crashes or is redeployed. Then all of Server A's sessions are gone, and every user it was serving gets logged out simultaneously. You've also negated the load balancer's ability to actually balance — if half your user base happens to be sticky to one server that's now running a slow query, the other servers sit underutilised.

The stateless solution

A stateless server stores zero per-user data in local memory. The session object lives in Redis — a shared, external store that every app server can read and write. The flow becomes:

  1. User 42 logs in. Server A writes the session to Redis under key sess:tok_7f3a9c. Server A returns the token in a cookie. Server A's own memory holds nothing.
  2. User 42 makes a second request. The load balancer sends it to Server B (round-robin). Server B reads the cookie, extracts tok_7f3a9c, and fetches sess:tok_7f3a9c from Redis in <1 ms. Server B has full session context. User 42 sees no difference.
  3. Server A crashes. Requests that would have gone to A are rerouted to B and C. Redis still has all session data. No users are logged out.
  4. You add Server D. The load balancer routes ~25% of traffic to it. Server D reads sessions from Redis like everyone else. You added a machine with zero application changes.

Worked numeric trace

This is what the difference looks like in concrete numbers at 1,000 requests per second:

--- STATEFUL SERVER A (alone, no horizontal scaling possible) --- Incoming: 1,000 req/s Session size: ~2 KB per active session Active users: 10,000 concurrent Session RAM: 10,000 × 2 KB = 20 MB in heap After 1 hour: session data is still in heap (eviction is manual) Adding Server B: user 42 is sticky to Server A. Server B is empty. Server A fails: all 10,000 active sessions lost simultaneously. --- STATELESS SERVERS A + B (horizontal) --- Incoming: 1,000 req/s split evenly Server A: 500 req/s — zero session RAM Server B: 500 req/s — zero session RAM Redis: 1,000 ops/s at ~0.5 ms each Total throughput: 2× — adding Server C → 3× — Server D → 4× Server A fails: load balancer reroutes to B (+ C, D if present). Active sessions: untouched. Users notice nothing.

The key phrase: statefulness is a scaling ceiling; statelessness is a scaling floor.

Finding and fixing the bottleneck

The cardinal sin in scaling work is adding more of the wrong thing. Adding app servers when the bottleneck is the DB costs money and complexity without improving performance. Adding read replicas when the bottleneck is app-server CPU costs money and introduces replication lag for no gain. The only correct order is: measure first, then fix the component that is actually saturated.

✅ Bottleneck-driven scaling

Only scale the component that metrics show is saturated. Check these four signals in order: (1) app server CPU utilisation, (2) DB connection pool wait time, (3) cache hit rate (target >80%), (4) queue depth (if you have one). The first signal that is red tells you where to act. Scaling everything evenly wastes money and adds failure surface.

Symptom What's saturated Metric to check Fix
High response latency, app CPU near 100% App server compute CPU utilisation, request queue depth Add more stateless app servers (horizontal scale)
Latency spikes, app CPU low DB is the bottleneck DB connection pool wait time, slow query log Add read replicas; cache hot reads; optimise slow queries
Write latency growing as traffic grows DB primary write throughput DB write QPS, replication lag on replicas Shard writes; move expensive writes to async queue
Read latency high despite having replicas Cache miss rate too high Cache hit rate (aim >80%) Increase cache TTL; warm cache on deploy; pre-compute hot keys
Periodic latency spikes at predictable times Batch job contention Job scheduler logs, DB lock wait events Move batch jobs to off-peak hours; use a dedicated read replica for analytics
Response times >200 ms for users in distant regions Network round-trip latency Time-to-first-byte by geographic region CDN for static assets; edge caching for API responses; consider multi-region
🎯 Interview angle

In a system design interview, walk through the architecture progression explicitly rather than jumping straight to the "final" architecture. Say: "At 10,000 users I'd add a cache to absorb hot reads. At 100,000 I'd add a load balancer and make app servers stateless. At 1 million I'd add read replicas and look at moving expensive writes to an async queue. I'd use capacity estimation to know when to move to the next stage." This shows you understand that architecture evolves with scale — not that you over-engineer from day one. Interviewers penalise candidates who immediately propose a fully sharded, multi-region architecture for a system that could run on a $40 VPS.

⚠️ The sticky session trap

A team adds a load balancer but forgets to move session state out of local memory. The load balancer routes requests to different servers, sessions appear to randomly log users out, and the apparent "fix" is to configure sticky sessions — pin each user to one server. Now the load balancer cannot actually balance load. If a popular user's sticky server is under high CPU, that server suffers while others sit idle. One server restart during a deploy logs out every user pinned to it. The correct fix is stateless app servers with a shared session store — not a band-aid on a stateful design.

✅ How to make your app servers stateless in three steps

Step 1: Audit what state your app servers hold locally. Common offenders: session objects in process memory, in-process caches (Node.js module-level Maps, Python module dicts), locally uploaded files stored on the container's filesystem.

Step 2: Move each to a shared external store. Sessions → Redis. In-process caches → Redis with a short TTL. File uploads → S3 or equivalent object storage (never store uploads on the local filesystem of a horizontally scaled server).

Step 3: Verify by deliberately killing one app server mid-session in a staging environment. An active user's next request should complete normally against a different server with zero indication that anything changed. If it doesn't, you have undiscovered local state.

▶ See it live

Play with the capacity / autoscale simulator — drag the load toward 100M req/s and watch this behaviour in real time.

🧠 Quick check

1. What is the primary reason stateless app servers can scale horizontally while stateful ones cannot?

The key constraint with stateful servers is routing: the load balancer must send a user back to the exact server that holds their session. With stateless servers, any server can serve any user by reading shared state (Redis) in milliseconds — the load balancer is free to route however it likes, and new servers join the pool instantly.

2. At which stage does the database typically become the main bottleneck?

At Stage 4 you've multiplied your compute capacity (multiple app servers) but all of that capacity still bottlenecks at one database. Five app servers can generate five times the DB traffic; the DB primary absorbs every write and every cache-miss read from all of them.

3. A system has a read:write ratio of 50:1. Which technique provides the most immediate scaling relief for the read path?

At 50:1, reads are 98% of traffic. Read replicas let you spread those reads across multiple machines. Sharding helps the write path (2% of traffic) — the wrong tool here. gRPC changes protocol overhead, not DB capacity.

4. A startup at 5,000 users has a CTO who wants to implement database sharding "to prepare for scale." What is the most accurate assessment?

At 5,000 users, a well-tuned Postgres instance on a $50/month server handles the load with headroom. Sharding is a Stage 6 solution for Stage 6 problems. Introducing it at Stage 1 means carrying cross-shard join complexity, migration risk, and operational overhead through every stage that follows — for years, with no benefit.

✍️ Exercise: scale a photo-sharing app one stage further

Scenario: A photo-sharing app currently runs at Stage 4 — one load balancer, four stateless app servers, one DB primary, and a Redis cache. Capacity estimation shows it has grown to 800,000 daily active users. Symptoms observed this week:

Your task: (a) Identify the bottleneck. (b) Describe the correct next architectural change. (c) What could go wrong with that change, and how would you mitigate it?

Model answer:

(a) Bottleneck: the DB primary. High DB CPU (90%) with low app server CPU (30%) and a healthy cache hit rate (85%) is a clear signal: the app layer and cache layer have headroom; the DB does not. The slow query log confirming mostly SELECT statements tells us the read path is saturated, not writes.

(b) Correct change: add 2–3 read replicas. Route all SELECT queries to replicas; keep all writes on the primary. With a typical photo-sharing read:write ratio of 20:1 or higher, this distributes roughly 95% of DB query volume across multiple machines, dropping DB CPU to a manageable level per node. This is cheaper and less complex than sharding (a write-path solution, wrong tool here) and more targeted than adding more app servers (which would increase DB load further).

(c) Risk: replication lag. Replicas receive writes asynchronously and may be 10–100 ms behind the primary. For a photo-sharing app, two scenarios matter:

Monitor replication lag as a first-class metric. Set an alert at >500 ms lag. A lagging replica that falls minutes behind can serve severely stale data; at that point the correct action is to stop routing reads to it until it catches up.

Rubric: Full marks for (a) correctly naming DB as bottleneck (not cache or app layer — the metrics rule those out), (b) recommending read replicas rather than sharding (wrong fix for a read-path problem), and (c) specifically addressing replication lag with a read-your-writes mitigation strategy. Partial marks for correct bottleneck identification but wrong fix. Bonus for mentioning a replication lag alert threshold.

Key takeaways

Sources & further reading