Reliability & Scale · Lesson 15
Scaling the database
Every other tier in your stack scales horizontally on day one — but the database is the one node that knows where everything lives. This lesson is about removing that bottleneck without trading away the reliability you need.
By the end you'll be able to
- Route reads and writes correctly with replicas, and explain the stale-read window that replication lag creates.
- Choose a shard key, trace a key through consistent hashing to a physical node, and explain why adding a node with naive modulo hashing causes mass data migration.
- Apply CQRS, denormalization, indexing, and connection pooling as targeted interventions at the right layer of the stack.
The library with one librarian
Imagine a library where a single librarian handles every request: shelving, retrieval, cataloguing, and reference questions — all on one pair of hands. Traffic is fine at first. But as the library grows, that librarian becomes the ceiling on every reader's experience. You can hire more librarians (replicas) for lookups while the head librarian stays in charge of updates. You can split the collection across wings (shards) so each librarian only knows their section. You can pre-build subject indexes (indexes) so nobody reads every shelf. And you can let people queue at a service desk (connection pool) instead of crowding the counter directly. Each of those moves has real costs — this lesson maps all of them.
Read replicas
The simplest lever: keep one primary that accepts writes, and run one or more read replicas that stream a copy of every change from the primary's write-ahead log. Your application routes all SELECT queries to the replicas and all INSERT / UPDATE / DELETE queries to the primary.
The cost is replication lag. The replica applies changes asynchronously, which means there is a window — typically a few milliseconds on a well-tuned setup, but up to seconds under write storms — where a replica returns data that the primary has already overwritten. A user who just submitted a form and immediately reloads the page can see the old data. This is not a bug you can fix in application code without re-routing that specific read to the primary. The section on read-after-write routing below covers the pattern.
Replicas help with read-heavy workloads — reporting queries, dashboards, full-text search — but they do nothing for write throughput. Once your primary can't absorb writes fast enough, you need sharding.
Sharding the data
A single machine has a ceiling: RAM, CPU, and disk IOPS are finite. Sharding horizontally partitions the data across multiple independent nodes so that each node owns a fraction of the rows. A shard is just a database node that holds a contiguous or algorithmically determined slice of the keyspace.
Choosing a shard key
The shard key is the field you use to decide which node owns a given row. Pick badly and you pay for it forever. Good shard keys are:
- High cardinality — enough distinct values to spread rows evenly.
- Stable — changing the shard key of an existing row requires moving it, which is expensive.
- Co-located with access patterns — most queries for a given user should hit one shard, not scatter across all of them (scatter-gather is slow and hard to paginate).
Hash sharding vs range sharding
Hash sharding puts the key through a hash function and maps the result to a shard (e.g. hash(user_id) % N). Distribution is even; range queries are expensive because adjacent IDs land on different shards. Range sharding assigns ordered ranges of keys to shards (e.g. users A–M on shard 1, N–Z on shard 2). Range queries are fast; hot ranges (think: all new users arriving in alphabetical order) create hot shards.
The hot-key / celebrity problem
Hash sharding distributes traffic evenly for uniform access patterns. Real workloads are never uniform. A social network where Beyoncé has 200 million followers will route every one of those followers' timeline reads to the single shard that owns Beyoncé's posts. No amount of horizontal scaling helps if every request still hits one shard. See the callout below for the mitigation patterns.
Why naïve resharding hurts
With modulo sharding (hash(key) % N), adding a fourth shard changes N from 3 to 4 and invalidates the placement of almost every key. If hash(key) % 3 = 2, there is no guarantee that hash(key) % 4 = 2 — in fact roughly 75% of keys land on a different shard after the change, forcing a massive data migration while the system must still serve traffic. Consistent hashing (covered in Lesson 08) solves this: nodes and keys are placed on a ring, and adding a node only moves the keys that sit between the new node and its predecessor — on average 1/N of the total keyspace.
Denormalization, indexing, and CQRS
Denormalization for read-heavy APIs
Normalization eliminates redundancy and makes writes clean, but a fully normalized schema often forces the database to join five tables to answer one API request. At high read volume those joins become the bottleneck. Denormalization strategically duplicates data in a pre-joined shape optimized for the most frequent read pattern. A posts table that redundantly stores author_name alongside author_id means the API can satisfy a timeline request with a single table scan, no join required. The cost: writes must maintain two copies. Acceptable when reads outnumber writes by 10:1 or more; dangerous when write consistency across copies is hard to enforce.
Indexing strategies
An index is a pre-sorted copy of a subset of your data that lets the database skip the full-table scan. Getting indexing wrong costs more than getting it right gains:
- B-tree indexes are the default and cover equality lookups and range queries (
>,BETWEEN,ORDER BY). Almost every primary key and foreign key should have one. - Composite indexes cover multi-column predicates. Column order matters — put the most selective column first, and the index can only be used left-to-right. An index on
(user_id, created_at)acceleratesWHERE user_id = 42 ORDER BY created_at DESCbut notWHERE created_at > Xalone. - Partial indexes index only rows matching a condition (e.g.
WHERE status = 'active'). Narrow, fast, cheap to maintain. - Too many indexes slow writes — every
INSERTorUPDATEmust maintain each index. Audit unused indexes regularly (pg_stat_user_indexesin Postgres tracks actual usage).
CQRS: separating read and write models
Command Query Responsibility Segregation takes the denormalization idea to its logical conclusion: maintain a fully separate data model optimized for reads, updated asynchronously whenever a write (command) is processed. The write side uses a normalized schema tuned for consistency; the read side uses whatever pre-joined, pre-aggregated structure makes each query fast. An event — "post created" — flows from the write model to a background consumer that updates the read model. The tradeoff is explicit eventual consistency: the read model lags behind the write model by the consumer's processing latency. CQRS is overkill until you have genuinely incompatible read and write access patterns that no single schema can serve well.
Connection pooling
Opening a TCP connection to a database is not free: handshake, authentication, and session setup add 5–20 ms and consume a server-side slot. At high concurrency, thousands of threads each trying to open their own direct connection creates two problems: connection setup latency on the hot path, and the database process count ceiling (Postgres, for instance, spawns a process per connection and starts thrashing once you exceed a few hundred).
A connection pool maintains a fixed set of long-lived connections and lends them to application threads on demand. The application checks out a connection, executes a query, and returns the connection to the pool. If the pool is empty, the thread waits. The pool size should be tuned to the database's actual concurrency capacity, not to the application's thread count — a pool of 20 connections serving 500 application threads is normal; a pool of 500 connections to match thread count will overwhelm the database.
Common poolers: PgBouncer and pgpool-II for Postgres; HikariCP for JVM applications; the built-in pool in most ORM frameworks. Transaction-mode pooling (connections returned immediately after each transaction) gives the best multiplexing ratio but breaks session-level features like advisory locks and SET LOCAL.
Under the hood: how it actually works
Trace A — shard routing for user_id=42
Here is the exact arithmetic when your application looks up user_id=42 with three shards.
Trace B — replication lag timeline
A user updates their profile name. Here is what happens across the primary and replica at millisecond resolution.
How to debug & inspect it
Database performance problems rarely announce themselves. Most manifest as slow API response times with no obvious application code to blame. Start with the query planner.
The symptom–cause–fix table for the problems you will encounter most often:
| Symptom | Likely cause | Fix |
|---|---|---|
| User sees stale data immediately after a write | Read routed to replica before WAL applied | Route reads to primary for 1–2 s after any write in that user session; or use synchronous replication for critical paths |
| One shard at 90% CPU while others idle | Hot shard — shard key attracts disproportionate traffic (e.g. celebrity account) | Add virtual nodes for the hot key; cache reads at the application layer; store popular rows on all shards; consider a read-through cache in front |
| Replica falls progressively behind primary | Replica apply speed can't keep pace with primary write throughput | Upgrade replica IO; tune max_wal_size and checkpoint_completion_target; reduce write amplification; consider synchronous commit on critical paths |
| Query suddenly 100× slower than baseline | EXPLAIN shows Seq Scan — index not used or doesn't exist | Create composite index matching the WHERE + ORDER BY pattern; run ANALYZE if statistics are stale; check whether planner is choosing wrong index |
| API latency spikes at high concurrency; threads queue behind DB calls | Connection pool exhausted — threads waiting for a free connection | Reduce pool checkout wait time; right-size pool (not too large); move long-running queries to async workers; add a second read pool hitting replicas |
This is the most common system-design escalation prompt. The answer has a deliberate order: (1) measure first — replica reads, query optimization, and indexes solve most problems without restructuring anything; (2) if write throughput is the ceiling, introduce sharding with a shard key chosen for access-pattern affinity; (3) choose consistent hashing to avoid resharding pain later; (4) add CQRS only if read and write models are genuinely incompatible. State the tradeoffs explicitly: sharding complicates cross-shard queries and transactions; replicas introduce eventual consistency; denormalization requires careful write-path discipline. Interviewers reward candidates who articulate what they are giving up, not just what they are gaining.
Sharding by user_id distributes data evenly in theory. In practice a social network with 10 users who each have 50 million followers has a celebrity problem: every follow-graph traversal, every timeline fanout, and every notification write converges on the shards that own those 10 accounts. The shard is not overloaded because it holds too much data — it is overloaded because it receives too many requests per second. Three mitigations: read caching (put a Memcached or Redis layer in front of the celebrity shard so most reads never reach the DB); write fan-out (pre-write celebrity posts to every follower's timeline at write time, so reads scatter across follower shards); and shard splitting (allocate a dedicated shard to the single hot key). None of these is free — caching introduces staleness, write fan-out is expensive on a 50M-follower post, and a dedicated shard is operationally complex. Pick the mitigation that fits your consistency and latency requirements.
When a user writes data and immediately reads it back, route that read to the primary. The simplest implementation: after any mutating request in a session, set a short-lived cookie or header flag (e.g. X-Read-Primary: 1 with a 2-second TTL). Your load balancer or data-access layer reads that flag and directs the next requests to the primary. After the TTL the flag expires and reads return to replicas. Two seconds is almost always longer than replication lag in a healthy cluster — you pay primary read cost only when it matters.
By the numbers
Make it concrete. The platform is a social-media API that has grown to 200,000 write QPS on its primary database. Each shard is a managed Postgres node tuned to handle 25,000 write QPS before CPU and I/O become the bottleneck (Vitess operational guides use a similar per-shard ceiling). Reads run through replicas at a 4:1 read-to-write ratio.
How many shards do you need?
shards = ⌈total_write_QPS ÷ per_shard_QPS⌉
= ⌈200,000 ÷ 25,000⌉
= ⌈8.0⌉
= 8 shards
Eight shards each at capacity. The moment write QPS creeps to 200,001 you need a ninth, so in practice you provision 10 and keep 80% utilisation as your ceiling — that headroom absorbs burst and gives you time to add shards without an emergency.
Replication-lag staleness window
The primary commits a write at t = 0. The WAL record travels over TCP and the replica applies it at roughly t = +40 ms under normal load (PostgreSQL streaming-replication docs show typical apply lag of 10–100 ms on a busy primary). Any read that arrives at the replica in the window (0, 40 ms) returns stale data.
| t (ms) | Primary state | Replica state | Result of reading replica |
|---|---|---|---|
| 0 | COMMIT — WAL flushed | WAL not yet received | STALE — old value |
| 12 | WAL streamed over TCP | WAL received, not yet applied | STALE — old value |
| 30 | Normal operation | Applying WAL record | STALE — old value |
| 40 | Normal operation | WAL applied | FRESH — new value |
| 60+ | Normal operation | Fully caught up | FRESH — new value |
Staleness probability: if inter-write interval for a given row is 500 ms, then P(stale read) ≈ 40 ms ÷ 500 ms = 8%. For a "post just published" read, the inter-write interval might be seconds, dropping staleness risk to under 1%.
Hot-key imbalance: the celebrity shard
Suppose one key — a celebrity account — attracts 15% of all traffic. With 8 shards and uniform distribution you'd expect each shard to receive 200,000 ÷ 8 = 25,000 QPS. But the hot key concentrates all 15% on shard 0:
Decision math: when to add shards and how many read replicas to provision
Add shards when: current_write_QPS > target_utilisation × per_shard_QPS × shard_count. At 80% target: add when current_write_QPS > 0.80 × 25,000 × 8 = 160,000 QPS. You crossed that threshold at 160,001 QPS — at that point you have time to add shards before hitting the ceiling at 200,000.
Read replicas per shard: reads run at 4× writes = 800,000 read QPS total across 8 shards = 100,000 read QPS per shard. A replica handles the same ~25,000 QPS as a primary, so each shard needs ⌈100,000 ÷ 25,000⌉ = 4 read replicas. Total fleet: 8 primaries + 32 replicas = 40 Postgres nodes. Caching even 50% of reads cuts that to 2 replicas per shard, halving the replica count.
🧠 Quick check
1. Routing writes to the primary and reads to replicas does NOT protect against:
Read-replica routing does exactly nothing about replication lag — it is the cause of stale reads, not the solution. A user who writes and immediately reads can see the old value if the replica hasn't applied the WAL yet. Write contention and network partitions are separate problems that require different tools.
2. In consistent hashing, when you add a new node to a ring with N existing nodes, what fraction of keys must move on average?
Adding one node to a ring of N nodes gives N+1 nodes. The new node takes ownership of the arc between itself and its clockwise predecessor — which contains roughly 1/(N+1) of the keyspace. Every other node's range is unchanged. This is the property that makes consistent hashing worth the extra complexity over modulo sharding.
3. Which pattern separates the read model from the write model at the application layer, maintaining them as two independent data stores?
CQRS (Command Query Responsibility Segregation) is the only one of the three that explicitly separates the data models, not just the query routing. The write model processes commands and emits events; a separate consumer updates the read model optimized for queries. Denormalization and replicas are implementation details that can live inside either model; CQRS is an architectural split above them.
4. Why does connection pool exhaustion cause cascading failures rather than just slower requests?
When all pool slots are taken, application threads block waiting for a free connection. Those threads still hold their HTTP request handlers, consuming thread-pool slots. Once the HTTP thread pool fills too, all incoming requests start queueing even if they don't need the database at all. A single slow query can cascade into a full application outage. The fix is to set a short pool checkout timeout and circuit-break or shed load rather than letting threads pile up.
✍️ Exercise: shard the social network's posts table
You are designing the sharding scheme for a social network's posts table. The system has 500 million posts, 10 million active users, and a long tail of celebrities with up to 100 million followers each. Design the following and justify each choice:
- What shard key do you choose for the
poststable? Why? - How do you handle reads and writes for celebrity posts without overloading a single shard?
- What is your resharding plan when you need to add capacity six months from now?
Model answer
Shard key: Shard by user_id. All posts by a given user land on one shard, which makes fetching a user's own timeline a single-shard query — no scatter-gather. Posts for the home timeline (posts by all users you follow) require scatter-gather anyway, so optimizing for self-profile is the right call. Avoid sharding by post_id: it spreads posts evenly but forces every user-profile read to fan out to all shards.
Celebrity posts: Apply a two-layer strategy. First, put a distributed cache (Redis) in front of celebrity shard reads — a celebrity's most recent 1000 posts live in cache with a 30-second TTL; 99%+ of follower reads hit the cache. Second, use write fan-out for new celebrity posts: when Beyoncé posts, a background job writes the post ID to the "inbox" of her top 1000 most-active followers so their home timelines can be served from the follower shard without touching the celebrity shard. For the remaining millions of followers, use a pull model at read time (the old behavior), relying on the cache to absorb the spike.
Resharding plan: Use consistent hashing from day one. When you add capacity: (1) announce the new node's position on the ring; (2) migrate only the key range between the new node and its predecessor in the background — keep the old shard serving those keys until migration is complete; (3) flip routing once the new shard is confirmed current; (4) remove the migrated range from the old shard. Use double-write during migration (write to both old and new node) to prevent data loss if the process is interrupted. Never do a big-bang modulo reshard — it requires taking a write freeze across the entire cluster.
Rubric: Full marks for all three parts with justification. Key signals: choosing user_id not post_id and explaining why; naming cache + fan-out as separate mitigations for the celebrity problem; specifying gradual migration (not big-bang) and double-write during the resharding window. Bonus: mentioning that fan-out breaks down for celebrities over ~10M followers and describing the hybrid push/pull threshold.
Key takeaways
- Read replicas scale read throughput cheaply but introduce replication lag. Route reads back to the primary for 1–2 s after any write in the same session to prevent stale-read surprises.
- Consistent hashing is the right default for sharding: adding a node moves only 1/N of the keyspace instead of nearly all of it. Modulo sharding is simple but requires a full-cluster migration to resize.
- The hot-key problem is orthogonal to shard count — no amount of horizontal scaling helps if one shard receives all the traffic. Cache hot keys in front of the database, and consider fan-out writes for celebrity content.
- CQRS is the right tool when read and write access patterns are genuinely incompatible — not a default. Reach for it after indexing, denormalization, and replicas have been tried and found insufficient.
- Connection pool exhaustion cascades: a full pool blocks application threads that then block the HTTP thread pool, turning a database bottleneck into a site-wide outage. Size pools for the database's capacity, not the application's thread count, and set short checkout timeouts.