Reliability & Scale · Lesson 16
Consistency, CAP & correctness at scale
When three nodes disagree about the value of a key, your system has to pick a side — and the side you pick defines the contract your API makes with every caller. This lesson gives you the vocabulary, the math, and the practical wiring to make that choice deliberately.
By the end you'll be able to
- State the CAP theorem precisely and explain why a real network partition forces you to choose between consistency and availability — not just balance them.
- Calculate quorum requirements for N nodes and explain why R + W > N guarantees strong consistency.
- Design a read-after-write strategy that prevents users from seeing stale data immediately after their own writes.
When nodes stop agreeing
Imagine a fleet of three database servers spread across two data centres. A fibre cut splits them: one side can reach the other side but messages are dropped. For a few seconds — or minutes — the two halves diverge. A user writes a new profile photo on the left side; a second user immediately requests that profile from the right side. What does your API return?
That is not an edge case. It is the defining question of distributed data systems, and every design decision you make about replication, routing, and caching is an answer to it — whether you realised it at the time or not. The CAP theorem makes the trade-off explicit.
The CAP theorem
In 2000, Eric Brewer conjectured — and in 2002 Gilbert and Lynch proved — that a distributed data store can satisfy at most two of three properties simultaneously:
- Consistency (C): every read receives the most recent write or an error. No node serves stale data.
- Availability (A): every request receives a non-error response, even if it may not reflect the latest write.
- Partition tolerance (P): the system continues operating even when network messages between nodes are lost or delayed arbitrarily.
Because network partitions do happen in any real distributed system — link failures, packet loss, asymmetric routing — partition tolerance is not optional. You must tolerate it. That reduces the real choice to: during a partition, do you keep serving possibly-stale data (AP) or refuse requests until the partition heals (CP)?
CP systems prioritise correctness. During a partition, a CP database returns an error rather than a stale value. The caller gets a clear signal: "I can't give you a reliable answer right now." HBase, ZooKeeper, and Google Spanner are CP. Your API returns 503 rather than wrong data.
AP systems prioritise uptime. During a partition, an AP database serves whatever value it currently holds, even if that value is seconds or minutes old. Cassandra and DynamoDB default to AP. Your API returns a response — potentially a stale one.
Neither is universally better. The choice belongs to the system's requirements.
PACELC: the full picture
CAP only describes behaviour under a partition. Daniel Abadi pointed out in 2012 that this leaves out the 99.9% of the time when there is no partition. Even then, you face a trade-off: latency versus consistency.
PACELC stands for: if there is a Partition, choose between Availability and Consistency; Else (normally), choose between Latency and Consistency.
| System | Under partition (CA) | Normal operation (LC) | PACELC label |
|---|---|---|---|
| DynamoDB (default) | Availability — responds with possibly-stale data | Latency — eventually consistent reads are faster | PA / EL |
| DynamoDB (strong consistency) | Availability | Consistency — extra latency for quorum reads | PA / EC |
| Cassandra (ONE read) | Availability | Latency — reads from one replica | PA / EL |
| Cassandra (QUORUM read) | Availability | Consistency — reads from majority | PA / EC |
| HBase / etcd | Consistency — rejects requests during partition | Consistency — always linearisable | PC / EC |
| MySQL async replication | Availability | Latency — replica reads may lag | PA / EL |
PACELC makes the design conversation more honest. When a colleague says "we need high availability and strong consistency," PACELC forces the question: at what read latency cost? Strong consistency usually means at least one extra network round-trip per read (the quorum acknowledgment), or routing all reads to the primary. That cost is real and must be budgeted.
Strong consistency and eventual consistency
Think of a flight departure board. A strongly consistent board shows the current gate every time — every display, everywhere in the terminal, shows the same value right now, or it shows nothing at all. An eventually consistent board might show Gate 12 on one screen and Gate 14 on another for 30 seconds after a gate change, then converge.
Strong consistency (also called linearisability) means: once a write is acknowledged, any subsequent read — from any node, from any client — sees that write or something newer. There is a single, total order of operations. From a caller's perspective, the distributed system behaves like a single machine.
Eventual consistency means: if no new writes arrive, all replicas will eventually converge to the same value. During the convergence window, different replicas may return different answers. Your API may serve stale data. The system makes a liveness guarantee (it will converge) but not a safety guarantee (it will be current right now).
| Property | Strong consistency | Eventual consistency |
|---|---|---|
| Read reflects latest write? | Always, from any node | Only after propagation delay |
| Read latency | Higher — quorum or primary required | Lower — any replica can respond |
| Availability during partition | Reduced — some nodes refuse | Full — every node responds |
| Good fit for | Bank balances, inventory counts, configuration, tickets | Social feeds, view counts, shopping-cart drafts, DNS |
| Failure mode | Returns an error; never lies | Returns an answer; might be wrong |
Eventual consistency says convergence will happen — it does not say when. Propagation lag can be 50 ms, 5 seconds, or permanent if you hit a split-brain scenario where two primaries accept conflicting writes. Design for the worst realistic delay your replication topology can produce, not the average.
Quorum: the math behind consistency
Quorum is the mechanism that lets you get strong consistency out of an eventually consistent cluster — without routing everything to a single primary. The key insight: if a write touches enough nodes, and a read touches enough nodes, the read set and write set must overlap. Overlap means the read will always hit at least one node that has the latest write.
The formula: R + W > N, where N = total replica count, W = write quorum (nodes that must acknowledge a write), R = read quorum (nodes that must respond to a read). When this holds, the overlap is guaranteed: overlap = R + W − N ≥ 1.
Worked example with N=3:
- W=2, R=2: R+W = 4 > N=3. Overlap = 4−3 = 1 node guaranteed in common. Strong consistency.
- W=3, R=1: R+W = 4 > 3. Writes to all nodes, reads from one. Strong consistency, but writes are slow.
- W=1, R=3: R+W = 4 > 3. Fast writes, reads from all. Strong consistency, but reads are slow.
- W=1, R=1: R+W = 2 = N. No guaranteed overlap. Eventual consistency only.
In practice, the coordinator (the node the client contacts) sends the write to W nodes, waits for W acknowledgments, then returns success to the client. On a read, it contacts R nodes, collects R responses, and returns the value with the highest version number or timestamp. No single point of failure, no single-primary routing required — the math guarantees correctness.
Read-your-writes and monotonic reads
Two consistency guarantees are lower than strong consistency but matter enormously for user experience:
Read-your-writes consistency guarantees that after you write, you see your own write when you read — from the same session, even if other clients might not see it yet. Without this, you submit a form, the page reloads, and your change has vanished. This is the minimum bar for anything user-facing. It does not require all clients to be consistent, only that the session that performed the write sees it.
Monotonic reads guarantee that once you read a value at version V, you will never subsequently read a version earlier than V. Without monotonic reads, a user could refresh a page and see an older version of data than they saw one refresh ago — as if time ran backwards. This happens when consecutive requests land on different replicas at different replication offsets.
Both of these are session-level guarantees. They are weaker than strong consistency (different clients may still see different values) but much stronger than bare eventual consistency.
Handling stale data in practice
Replica lag (covered in the previous lesson) is the primary source of stale reads. When a replica is 200 ms behind the primary, any read routed to that replica may return data that is up to 200 ms old. Here are the techniques engineers use to handle this in API design:
| Technique | How it works | Cost |
|---|---|---|
| Route read-after-write to primary | After a write, route the next N reads (or reads within the next T seconds) from the same session to the primary. Once the window passes, fall back to replicas. | Slightly higher read load on primary; requires session state to track recency. |
| Min-version / replication token | The write response includes a min_version or write_fence token. The client sends this token with subsequent reads. The server rejects the read from any replica that has not yet applied that version and retries from a more up-to-date one. |
Adds a header to the API contract; requires replicas to expose their replication offset. |
| Wait-for-replication option | The write request accepts a parameter like consistency=quorum that blocks until the specified number of replicas acknowledge the write before returning. The client then knows all subsequent quorum reads will see the write. |
Increases write latency proportionally; must be opt-in to avoid penalising all writes. |
| ETag / If-None-Match | Responses carry an ETag representing the data version. A client that holds a recent ETag can send If-None-Match; a replica that hasn't applied the latest version returns 412 Precondition Failed rather than a stale 200. |
Requires the storage layer to version data and expose ETags; adds conditional-request logic to the API. |
| Accept stale, expose it | For non-critical reads, serve stale data but include a header like X-Data-Version or X-Replication-Lag-Ms so callers can reason about freshness themselves. |
Shifts the correctness burden to the caller; appropriate for social feeds, analytics, search results. |
For most web APIs, the simplest read-after-write implementation is: tag the user session with the timestamp of their most recent write. For the next 2 seconds after a write, route all reads from that session to the primary. After 2 seconds, fall back to replicas. This requires one session store key and two routing rules. It handles the user who hits refresh immediately after saving without penalising steady-state read traffic. Include a min_version token in write responses for clients that need a stronger guarantee than the timeout-based window.
Under the hood: how it actually works
Trace A: quorum read returning the correct value (N=3, W=2, R=2)
Client writes user_id=42, balance=100. The coordinator sends the write to nodes A and B. Node C is temporarily slow and misses the write. Client then reads immediately — the coordinator picks B and C as its read quorum.
Trace B: eventual consistency — a second client reads stale data
Same cluster. This time the write uses W=1 (no quorum). A second client reads from a replica that hasn't caught up yet.
How to debug & inspect it
In PostgreSQL, replication lag is directly observable. The primary records each replica's replication state in pg_stat_replication. The replay_lag column tells you how far behind each standby is — combine this with the min_version token from your write response to decide whether a given replica is safe to read from.
| Symptom | Likely cause | Fix |
|---|---|---|
| User wrote data then didn't see their change | Read-after-write routed to a lagging replica | Route post-write reads to primary for a short window; check replay_lag in pg_stat_replication |
| Two users see different values for the same key simultaneously | Requests hitting replicas at different replication offsets; or quorum too low (R+W ≤ N) | Increase read or write quorum so R+W > N; or use sticky session routing |
| Read returns old data even after write was confirmed | Write quorum was met, but read quorum didn't include an up-to-date node; or replica lag exceeded read timeout | Verify R+W > N; add min_version token to read requests; fall back to primary if token check fails |
| Primary refuses write during partition | CP behaviour — system correctly prioritises consistency over availability | Expected in a CP system; return 503 with Retry-After; document this as a known failure mode in API contracts |
| Replica serves stale data after network heals | Replica was partitioned and accumulated a write backlog; replication is catching up | Monitor replay_lag post-healing; exclude the replica from the read pool until lag drops below threshold; alert if lag exceeds SLO |
When a system design question asks you to design a data layer, name the consistency model you are choosing — and justify it. "I'll use eventual consistency" is incomplete. Say: "I'll use eventual consistency for the activity feed because users tolerate a few seconds of delay and the extra replica availability matters more than guaranteed freshness. I'll use strong consistency for the account balance because serving a stale balance and then rejecting a payment is worse than a momentary 503." Examiners are listening for whether you treat consistency as a deliberate architectural trade-off, not a default you accept without thought. What would you choose for a social feed vs a bank balance?
Include a X-Write-Version (or min_version) field in write response bodies. The value is the replication sequence number or logical clock value that the write advanced to. Clients that need read-after-write semantics send this back as a request header; your read path rejects any replica whose replication offset is below that value and retries elsewhere. This gives clients opt-in strong read-after-write without routing all reads to the primary permanently.
By the numbers
Make the math concrete. Scenario: an order-status API backed by a 3-node Cassandra cluster. Writes arrive at 200 writes/s; reads arrive at 800 reads/s. Replication lag under normal conditions is 40 ms.
Quorum overlap: why R+W > N guarantees a fresh read
With N=3, W=2, R=2: every write touches 2 of 3 nodes; every read contacts 2 of 3 nodes. By the pigeonhole principle, at least one node appears in both sets.
P(stale read) under eventual consistency (W=1, R=1)
When W=1, R=1 (no quorum), the write is acknowledged after reaching one node. The other two nodes receive the update asynchronously. A read arriving before replication completes will hit a stale replica.
Reduce write rate to 20/s (inter_write_interval = 50 ms): P(stale) ≈ 40/50 = 80%. Reduce replication lag to 2 ms (fast LAN): P(stale) ≈ 2/50 = 4%. The table shows the staleness surface across lag and write rate combinations:
| Replication lag | 100 writes/s (10 ms apart) | 20 writes/s (50 ms apart) | 1 write/s (1,000 ms apart) |
|---|---|---|---|
| 2 ms | 20% | 4% | 0.2% |
| 10 ms | 100% (always stale) | 20% | 1% |
| 40 ms | 100% | 80% | 4% |
| 200 ms | 100% | 100% | 20% |
Formula source: DeCandia et al. — Dynamo §4 (probabilistic staleness model); the approximation assumes uniform read arrival and Poisson write arrivals.
Decision math: picking R and W for read-heavy vs write-heavy workloads
The 800 read/s vs 200 write/s ratio means the read path is the bottleneck. With N=3:
| R | W | R+W vs N | Consistency | Read latency | Write latency | Best for |
|---|---|---|---|---|---|---|
| 1 | 3 | 4 > 3 ✓ | Strong | Lowest (1 hop) | Highest (all 3 nodes) | Very read-heavy; writes can be slow |
| 2 | 2 | 4 > 3 ✓ | Strong | Medium | Medium | Balanced (default choice) |
| 3 | 1 | 4 > 3 ✓ | Strong | Highest (all 3 nodes) | Lowest (1 hop) | Very write-heavy; reads can be slow |
| 1 | 1 | 2 = 3 ✗ | Eventual only | Lowest | Lowest | Acceptable staleness; maximum throughput |
For this order-status API (read-heavy, correctness matters for order state), R=1, W=3 is optimal: reads are served from any single replica (low latency for the high-volume read path) and writes touch all three nodes, ensuring any subsequent read — even from a cold replica — sees the latest write. Write latency is the cost: at 200 writes/s, the extra ~1 RTT to the third node adds roughly the replica round-trip time (~2–5 ms on LAN).
🧠 Quick check
1. The CAP theorem states that during a network partition, which two properties cannot both be guaranteed?
Partition tolerance is non-negotiable on real networks — you cannot prevent network failures. So when a partition occurs, the live choice is between C (refuse requests rather than return stale data) and A (respond even with potentially stale data). Availability and Partition Tolerance are not in tension — AP systems handle partitions by staying available. Consistency and Partition Tolerance are not in tension either — CP systems handle partitions by prioritising consistency.
2. With N=3, W=2, R=2 quorum, why is R+W > N the key condition?
R+W > N means R+W−N ≥ 1: by the pigeonhole principle, there is at least one node that was in both the write quorum and the read quorum. That node holds the latest write. The coordinator picks the highest-versioned response from the R nodes, which must include that overlap node. This is the entire correctness argument for quorum reads.
3. Which consistency property ensures a user always sees their own most recent write?
Read-your-writes consistency is specifically the guarantee that your own session sees your own writes — even if other clients might not yet. Monotonic reads prevent time-travelling backwards across reads but do not guarantee you see your own latest write. Linearisability gives you strong global consistency (everyone sees the latest), which is stronger than needed for this specific UX guarantee.
4. PACELC adds what insight beyond CAP?
CAP only describes what happens during a partition. PACELC adds the "else" clause: in normal operation (no partition), every additional consistency guarantee costs latency. A strongly consistent read requires contacting a quorum, waiting for responses, and returning the highest-versioned result — slower than reading from the nearest replica. PACELC forces you to budget that latency cost explicitly.
✍️ Exercise: consistency model choices for a collaborative document editor
You are designing the data layer for a collaborative document editor — multiple users can edit simultaneously. Consider three distinct data types in the system: character insertions (the actual content being typed), the document version number (used for optimistic locking and conflict detection), and billing credits (prepaid usage tokens, decremented on each AI suggestion).
For each of the three data types, answer: (1) which consistency model do you choose, (2) what is your justification, and (3) what failure mode does your choice accept?
Model answer:
- Character insertions → eventual consistency (CRDT-based). Simultaneous edits from two users must not block each other. The system uses a CRDT (Conflict-free Replicated Data Type) such as a sequence CRDT, which allows all replicas to accept writes immediately and converge to the same document state without coordination. The failure mode accepted: transient divergence — two users may see slightly different document states for 50–200 ms before convergence. This is acceptable and is the same model used by Google Docs and Figma. Latency on every keystroke must be sub-20 ms; strong consistency would require a round-trip quorum on every character, making the editor feel sluggish.
- Document version number → read-your-writes with optimistic locking. When a user saves a version explicitly, they must read back that version immediately to detect conflicts. Route post-save reads to the primary for a short window (2 s). Use an
If-Match: versionheader on save operations so conflicting concurrent saves return 409 rather than silently overwriting each other. The failure mode accepted: two concurrent saves from different sessions could both read version N, both increment to version N+1, and produce a conflict. The conflict is surfaced as a 409, not silently swallowed. - Billing credits → strong consistency (quorum or primary-only writes/reads). Billing credits are the one category where serving stale data causes real financial harm. If two AI suggestions read a credit balance of 5 simultaneously from different replicas, both proceed and decrement, resulting in the user going to −1 credits. Use W=N for credit decrements (write to all replicas synchronously, or write only to a strongly consistent store such as PostgreSQL with synchronous_commit = on). Reads route to the primary. The failure mode accepted: higher write latency for AI suggestions (50–150 ms extra) and reduced availability during a primary failure (suggestions pause until a replica is promoted).
Rubric: Full marks for correctly identifying that character insertions require AP semantics (not strong consistency), that billing requires CP semantics, and that the document version sits between the two with optimistic locking. Partial marks for two out of three correct with justifications. Bonus for naming CRDTs by type, or for explicitly stating that the billing failure mode (momentary 503) is preferable to the alternative failure mode (serving stale balance and allowing credit overdraft).
Key takeaways
- CAP forces a binary choice during partitions: consistency (refuse requests) or availability (serve possibly stale data). Partition tolerance is mandatory on real networks, so CA is not a real option.
- PACELC extends CAP to normal operation: even without a partition, strong consistency costs latency. Budget that cost explicitly in your design rather than discovering it in production.
- Quorum math gives you tunable consistency: R + W > N guarantees at least one node overlap between write and read sets. With N=3, W=2, R=2 gives strong consistency; W=1, R=1 gives eventual consistency.
- Read-your-writes and monotonic reads are session-level guarantees weaker than strong consistency but sufficient for most user-facing applications. Implement them with a primary-routing window after writes, not by making all reads go to the primary forever.
- Handle stale data explicitly: expose replication lag in monitoring, return
min_versiontokens in write responses, and route the critical read-after-write path to the primary — let the rest of your traffic enjoy the lower latency of replica reads.
Sources & further reading
- Gilbert & Lynch — Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services (the CAP proof)
- Martin Kleppmann — Designing Data-Intensive Applications (the definitive textbook on consistency models)
- DeCandia et al. — Dynamo: Amazon's Highly Available Key-Value Store (quorum and eventual consistency in production)
- Daniel Abadi — Consistency Tradeoffs in Modern Distributed Database System Design (PACELC)
- PostgreSQL Documentation — Monitoring Statistics (pg_stat_replication and replication lag)