Reliability & Scale · Lesson 16

Consistency, CAP & correctness at scale

When three nodes disagree about the value of a key, your system has to pick a side — and the side you pick defines the contract your API makes with every caller. This lesson gives you the vocabulary, the math, and the practical wiring to make that choice deliberately.

⏱ 15 min Difficulty: advanced Prereq: Scaling the database Prereq: Load balancing

By the end you'll be able to

State the CAP theorem precisely and explain why a real network partition forces you to choose between consistency and availability — not just balance them.
Calculate quorum requirements for N nodes and explain why R + W > N guarantees strong consistency.
Design a read-after-write strategy that prevents users from seeing stale data immediately after their own writes.

When nodes stop agreeing

Imagine a fleet of three database servers spread across two data centres. A fibre cut splits them: one side can reach the other side but messages are dropped. For a few seconds — or minutes — the two halves diverge. A user writes a new profile photo on the left side; a second user immediately requests that profile from the right side. What does your API return?

That is not an edge case. It is the defining question of distributed data systems, and every design decision you make about replication, routing, and caching is an answer to it — whether you realised it at the time or not. The CAP theorem makes the trade-off explicit.

The CAP theorem

In 2000, Eric Brewer conjectured — and in 2002 Gilbert and Lynch proved — that a distributed data store can satisfy at most two of three properties simultaneously:

Consistency (C): every read receives the most recent write or an error. No node serves stale data.
Availability (A): every request receives a non-error response, even if it may not reflect the latest write.
Partition tolerance (P): the system continues operating even when network messages between nodes are lost or delayed arbitrarily.

Because network partitions do happen in any real distributed system — link failures, packet loss, asymmetric routing — partition tolerance is not optional. You must tolerate it. That reduces the real choice to: during a partition, do you keep serving possibly-stale data (AP) or refuse requests until the partition heals (CP)?

The CAP triangle: every real distributed system lives on the CP–AP axis. CA is off the table — you cannot avoid partitions on real networks.

CP systems prioritise correctness. During a partition, a CP database returns an error rather than a stale value. The caller gets a clear signal: "I can't give you a reliable answer right now." HBase, ZooKeeper, and Google Spanner are CP. Your API returns 503 rather than wrong data.

AP systems prioritise uptime. During a partition, an AP database serves whatever value it currently holds, even if that value is seconds or minutes old. Cassandra and DynamoDB default to AP. Your API returns a response — potentially a stale one.

Neither is universally better. The choice belongs to the system's requirements.

PACELC: the full picture

CAP only describes behaviour under a partition. Daniel Abadi pointed out in 2012 that this leaves out the 99.9% of the time when there is no partition. Even then, you face a trade-off: latency versus consistency.

PACELC stands for: if there is a Partition, choose between Availability and Consistency; Else (normally), choose between Latency and Consistency.

System	Under partition (CA)	Normal operation (LC)	PACELC label
DynamoDB (default)	Availability — responds with possibly-stale data	Latency — eventually consistent reads are faster	PA / EL
DynamoDB (strong consistency)	Availability	Consistency — extra latency for quorum reads	PA / EC
Cassandra (ONE read)	Availability	Latency — reads from one replica	PA / EL
Cassandra (QUORUM read)	Availability	Consistency — reads from majority	PA / EC
HBase / etcd	Consistency — rejects requests during partition	Consistency — always linearisable	PC / EC
MySQL async replication	Availability	Latency — replica reads may lag	PA / EL

PACELC makes the design conversation more honest. When a colleague says "we need high availability and strong consistency," PACELC forces the question: at what read latency cost? Strong consistency usually means at least one extra network round-trip per read (the quorum acknowledgment), or routing all reads to the primary. That cost is real and must be budgeted.

Strong consistency and eventual consistency

Think of a flight departure board. A strongly consistent board shows the current gate every time — every display, everywhere in the terminal, shows the same value right now, or it shows nothing at all. An eventually consistent board might show Gate 12 on one screen and Gate 14 on another for 30 seconds after a gate change, then converge.

Strong consistency (also called linearisability) means: once a write is acknowledged, any subsequent read — from any node, from any client — sees that write or something newer. There is a single, total order of operations. From a caller's perspective, the distributed system behaves like a single machine.

Eventual consistency means: if no new writes arrive, all replicas will eventually converge to the same value. During the convergence window, different replicas may return different answers. Your API may serve stale data. The system makes a liveness guarantee (it will converge) but not a safety guarantee (it will be current right now).

Property	Strong consistency	Eventual consistency
Read reflects latest write?	Always, from any node	Only after propagation delay
Read latency	Higher — quorum or primary required	Lower — any replica can respond
Availability during partition	Reduced — some nodes refuse	Full — every node responds
Good fit for	Bank balances, inventory counts, configuration, tickets	Social feeds, view counts, shopping-cart drafts, DNS
Failure mode	Returns an error; never lies	Returns an answer; might be wrong

⚠️ "Eventual" does not tell you when

Eventual consistency says convergence will happen — it does not say when. Propagation lag can be 50 ms, 5 seconds, or permanent if you hit a split-brain scenario where two primaries accept conflicting writes. Design for the worst realistic delay your replication topology can produce, not the average.

Quorum: the math behind consistency

Quorum is the mechanism that lets you get strong consistency out of an eventually consistent cluster — without routing everything to a single primary. The key insight: if a write touches enough nodes, and a read touches enough nodes, the read set and write set must overlap. Overlap means the read will always hit at least one node that has the latest write.

The formula: R + W > N, where N = total replica count, W = write quorum (nodes that must acknowledge a write), R = read quorum (nodes that must respond to a read). When this holds, the overlap is guaranteed: overlap = R + W − N ≥ 1.

Worked example with N=3:

W=2, R=2: R+W = 4 > N=3. Overlap = 4−3 = 1 node guaranteed in common. Strong consistency.
W=3, R=1: R+W = 4 > 3. Writes to all nodes, reads from one. Strong consistency, but writes are slow.
W=1, R=3: R+W = 4 > 3. Fast writes, reads from all. Strong consistency, but reads are slow.
W=1, R=1: R+W = 2 = N. No guaranteed overlap. Eventual consistency only.

Write quorum (A+B, teal) and read quorum (B+C, periwinkle) share node B. B always has the latest write — so the read always returns the correct value even though C is stale.

In practice, the coordinator (the node the client contacts) sends the write to W nodes, waits for W acknowledgments, then returns success to the client. On a read, it contacts R nodes, collects R responses, and returns the value with the highest version number or timestamp. No single point of failure, no single-primary routing required — the math guarantees correctness.

Read-your-writes and monotonic reads

Two consistency guarantees are lower than strong consistency but matter enormously for user experience:

Read-your-writes consistency guarantees that after you write, you see your own write when you read — from the same session, even if other clients might not see it yet. Without this, you submit a form, the page reloads, and your change has vanished. This is the minimum bar for anything user-facing. It does not require all clients to be consistent, only that the session that performed the write sees it.

Monotonic reads guarantee that once you read a value at version V, you will never subsequently read a version earlier than V. Without monotonic reads, a user could refresh a page and see an older version of data than they saw one refresh ago — as if time ran backwards. This happens when consecutive requests land on different replicas at different replication offsets.

Both of these are session-level guarantees. They are weaker than strong consistency (different clients may still see different values) but much stronger than bare eventual consistency.

Handling stale data in practice

Replica lag (covered in the previous lesson) is the primary source of stale reads. When a replica is 200 ms behind the primary, any read routed to that replica may return data that is up to 200 ms old. Here are the techniques engineers use to handle this in API design:

Technique	How it works	Cost
Route read-after-write to primary	After a write, route the next N reads (or reads within the next T seconds) from the same session to the primary. Once the window passes, fall back to replicas.	Slightly higher read load on primary; requires session state to track recency.
Min-version / replication token	The write response includes a `min_version` or `write_fence` token. The client sends this token with subsequent reads. The server rejects the read from any replica that has not yet applied that version and retries from a more up-to-date one.	Adds a header to the API contract; requires replicas to expose their replication offset.
Wait-for-replication option	The write request accepts a parameter like `consistency=quorum` that blocks until the specified number of replicas acknowledge the write before returning. The client then knows all subsequent quorum reads will see the write.	Increases write latency proportionally; must be opt-in to avoid penalising all writes.
ETag / If-None-Match	Responses carry an ETag representing the data version. A client that holds a recent ETag can send `If-None-Match`; a replica that hasn't applied the latest version returns `412 Precondition Failed` rather than a stale 200.	Requires the storage layer to version data and expose ETags; adds conditional-request logic to the API.
Accept stale, expose it	For non-critical reads, serve stale data but include a header like `X-Data-Version` or `X-Replication-Lag-Ms` so callers can reason about freshness themselves.	Shifts the correctness burden to the caller; appropriate for social feeds, analytics, search results.

✅ Practical: the primary-fallback window

For most web APIs, the simplest read-after-write implementation is: tag the user session with the timestamp of their most recent write. For the next 2 seconds after a write, route all reads from that session to the primary. After 2 seconds, fall back to replicas. This requires one session store key and two routing rules. It handles the user who hits refresh immediately after saving without penalising steady-state read traffic. Include a min_version token in write responses for clients that need a stronger guarantee than the timeout-based window.

Under the hood: how it actually works

Trace A: quorum read returning the correct value (N=3, W=2, R=2)

Client writes user_id=42, balance=100. The coordinator sends the write to nodes A and B. Node C is temporarily slow and misses the write. Client then reads immediately — the coordinator picks B and C as its read quorum.

# t=0ms — Client sends write to coordinator (node A) WRITE user_id=42 balance=100 version=7 # t=1ms — Coordinator forwards to W=2 nodes → Node A ack version=7 (local write) → Node B ack version=7 (replicated) → Node C ··· (slow, not in write quorum) # t=3ms — W=2 acks received. Write confirmed to client. WRITE OK version=7 # t=5ms — Client sends read. Coordinator picks R=2 nodes: B and C READ user_id=42 # t=6ms — Coordinator collects responses from B and C Node B → balance=100 version=7 Node C → balance=99 version=6 (stale — missed the write) # Coordinator applies conflict resolution: highest version wins READ RESULT balance=100 version=7 (from Node B) # The overlap guarantee: B was in BOTH the write quorum and the read quorum. # R+W-N = 2+2-3 = 1 node overlap guaranteed → correct value always reachable.

Trace B: eventual consistency — a second client reads stale data

Same cluster. This time the write uses W=1 (no quorum). A second client reads from a replica that hasn't caught up yet.

# t=0ms — Client 1 writes to primary (W=1, no quorum wait) WRITE post_id=801 title="New post" → Primary Primary ack immediately version=51 # t=0ms — Primary begins async replication to Replica 1 and Replica 2 # t=+40ms — Replica 1 applies the write (version=51 now on Replica 1) # t=+80ms — Replica 2 applies the write (version=51 now on Replica 2) # t=+50ms — Client 2 reads from Replica 2 (which is still at version=50) READ post_id=801 → Replica 2 Replica 2 → 404 Not Found (version=50, write not yet applied) # Client 2 sees a post that does not exist — yet. 30ms later it would. # This is valid eventual consistency behaviour, not a bug. # t=+80ms — Client 2 retries the same read READ post_id=801 → Replica 2 Replica 2 → title="New post" version=51 (converged)

How to debug & inspect it

In PostgreSQL, replication lag is directly observable. The primary records each replica's replication state in pg_stat_replication. The replay_lag column tells you how far behind each standby is — combine this with the min_version token from your write response to decide whether a given replica is safe to read from.

-- Check current replication lag on the primary SELECT application_name, state, write_lag, flush_lag, replay_lag, sync_state FROM pg_stat_replication; -- Example output: application_name │ state │ write_lag │ flush_lag │ replay_lag │ sync_state ──────────────────┼───────────┼───────────┼───────────┼────────────┼──────────── replica-eu-1 │ streaming │ 00:00:00 │ 00:00:00 │ 00:00:00 │ async replica-us-2 │ streaming │ 00:00:00 │ 00:00:00 │ 00:00:00.2 │ async replica-ap-3 │ streaming │ 00:00:01 │ 00:00:01 │ 00:00:03.8 │ async -- replica-ap-3 is 3.8 seconds behind. Route read-after-write away from it. -- On a replica: check its own lag versus the primary SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay; -- If NULL, this instance is the primary (not a replica)

Symptom	Likely cause	Fix
User wrote data then didn't see their change	Read-after-write routed to a lagging replica	Route post-write reads to primary for a short window; check `replay_lag` in `pg_stat_replication`
Two users see different values for the same key simultaneously	Requests hitting replicas at different replication offsets; or quorum too low (R+W ≤ N)	Increase read or write quorum so R+W > N; or use sticky session routing
Read returns old data even after write was confirmed	Write quorum was met, but read quorum didn't include an up-to-date node; or replica lag exceeded read timeout	Verify R+W > N; add `min_version` token to read requests; fall back to primary if token check fails
Primary refuses write during partition	CP behaviour — system correctly prioritises consistency over availability	Expected in a CP system; return 503 with `Retry-After`; document this as a known failure mode in API contracts
Replica serves stale data after network heals	Replica was partitioned and accumulated a write backlog; replication is catching up	Monitor `replay_lag` post-healing; exclude the replica from the read pool until lag drops below threshold; alert if lag exceeds SLO

🎯 Interview angle

When a system design question asks you to design a data layer, name the consistency model you are choosing — and justify it. "I'll use eventual consistency" is incomplete. Say: "I'll use eventual consistency for the activity feed because users tolerate a few seconds of delay and the extra replica availability matters more than guaranteed freshness. I'll use strong consistency for the account balance because serving a stale balance and then rejecting a payment is worse than a momentary 503." Examiners are listening for whether you treat consistency as a deliberate architectural trade-off, not a default you accept without thought. What would you choose for a social feed vs a bank balance?

✅ Min-version tokens in API responses

Include a X-Write-Version (or min_version) field in write response bodies. The value is the replication sequence number or logical clock value that the write advanced to. Clients that need read-after-write semantics send this back as a request header; your read path rejects any replica whose replication offset is below that value and retries elsewhere. This gives clients opt-in strong read-after-write without routing all reads to the primary permanently.

By the numbers

Make the math concrete. Scenario: an order-status API backed by a 3-node Cassandra cluster. Writes arrive at 200 writes/s; reads arrive at 800 reads/s. Replication lag under normal conditions is 40 ms.

Quorum overlap: why R+W > N guarantees a fresh read

With N=3, W=2, R=2: every write touches 2 of 3 nodes; every read contacts 2 of 3 nodes. By the pigeonhole principle, at least one node appears in both sets.

# N=3, W=2, R=2 → R+W = 4 > 3 = N → overlap = R+W−N = 1 node guaranteed Write touches {A, B} (W=2) Read contacts {B, C} (R=2) ↑ Node B is in BOTH sets — coordinator picks max-version among B and C = correct value # Worst case: write went to {A,B}, read goes to {B,C} — overlap = B ✓ # Another case: write went to {A,C}, read goes to {A,B} — overlap = A ✓ # Both: read always includes at least one node with the latest write.

P(stale read) under eventual consistency (W=1, R=1)

When W=1, R=1 (no quorum), the write is acknowledged after reaching one node. The other two nodes receive the update asynchronously. A read arriving before replication completes will hit a stale replica.

P(stale read) ≈ replication_lag / inter_write_interval replication_lag = 40 ms (time for write to propagate to all replicas) inter_write_interval = 1 / write_rate = 1000 ms / 200 = 5 ms (writes every 5 ms) P(stale read) ≈ 40 / 5 = 8.0 → capped at 1.0 (100%) # 200 writes/s with 40 ms lag → at any instant, the most recent ~8 writes are # still propagating. With R=1 reads, most reads hit a replica not yet converged.

Reduce write rate to 20/s (inter_write_interval = 50 ms): P(stale) ≈ 40/50 = 80%. Reduce replication lag to 2 ms (fast LAN): P(stale) ≈ 2/50 = 4%. The table shows the staleness surface across lag and write rate combinations:

Replication lag	100 writes/s (10 ms apart)	20 writes/s (50 ms apart)	1 write/s (1,000 ms apart)
2 ms	20%	4%	0.2%
10 ms	100% (always stale)	20%	1%
40 ms	100%	80%	4%
200 ms	100%	100%	20%

Formula source: DeCandia et al. — Dynamo §4 (probabilistic staleness model); the approximation assumes uniform read arrival and Poisson write arrivals.

Decision math: picking R and W for read-heavy vs write-heavy workloads

The 800 read/s vs 200 write/s ratio means the read path is the bottleneck. With N=3:

R	W	R+W vs N	Consistency	Read latency	Write latency	Best for
1	3	4 > 3 ✓	Strong	Lowest (1 hop)	Highest (all 3 nodes)	Very read-heavy; writes can be slow
2	2	4 > 3 ✓	Strong	Medium	Medium	Balanced (default choice)
3	1	4 > 3 ✓	Strong	Highest (all 3 nodes)	Lowest (1 hop)	Very write-heavy; reads can be slow
1	1	2 = 3 ✗	Eventual only	Lowest	Lowest	Acceptable staleness; maximum throughput

For this order-status API (read-heavy, correctness matters for order state), R=1, W=3 is optimal: reads are served from any single replica (low latency for the high-volume read path) and writes touch all three nodes, ensuring any subsequent read — even from a cold replica — sees the latest write. Write latency is the cost: at 200 writes/s, the extra ~1 RTT to the third node adds roughly the replica round-trip time (~2–5 ms on LAN).

# Order-status API: 800 reads/s, 200 writes/s, N=3, LAN RTT=2ms # Option A: R=1, W=3 (strong, read-optimised) Read cost: 1 replica × 2 ms = 2 ms latency Write cost: 3 replicas; coordinator waits for slowest = ~4–6 ms P(stale read) = 0% (R+W=4>3, overlap guaranteed) # Option B: R=2, W=2 (strong, balanced) Read cost: 2 replicas; coordinator waits for slower = ~3–4 ms Write cost: 2 replicas = ~3–4 ms P(stale read) = 0% # Option C: R=1, W=1 (eventual, maximum throughput) Read cost: ~2 ms Write cost: ~2 ms P(stale read) ≈ 80% (40ms lag / 5ms inter-write = ~100%, capped per above) # Break-even: use eventual consistency only when staleness probability < your SLO tolerance.

🧠 Quick check

1. The CAP theorem states that during a network partition, which two properties cannot both be guaranteed?

Partition tolerance is non-negotiable on real networks — you cannot prevent network failures. So when a partition occurs, the live choice is between C (refuse requests rather than return stale data) and A (respond even with potentially stale data). Availability and Partition Tolerance are not in tension — AP systems handle partitions by staying available. Consistency and Partition Tolerance are not in tension either — CP systems handle partitions by prioritising consistency.

2. With N=3, W=2, R=2 quorum, why is R+W > N the key condition?

R+W > N means R+W−N ≥ 1: by the pigeonhole principle, there is at least one node that was in both the write quorum and the read quorum. That node holds the latest write. The coordinator picks the highest-versioned response from the R nodes, which must include that overlap node. This is the entire correctness argument for quorum reads.

3. Which consistency property ensures a user always sees their own most recent write?

Read-your-writes consistency is specifically the guarantee that your own session sees your own writes — even if other clients might not yet. Monotonic reads prevent time-travelling backwards across reads but do not guarantee you see your own latest write. Linearisability gives you strong global consistency (everyone sees the latest), which is stronger than needed for this specific UX guarantee.

4. PACELC adds what insight beyond CAP?

CAP only describes what happens during a partition. PACELC adds the "else" clause: in normal operation (no partition), every additional consistency guarantee costs latency. A strongly consistent read requires contacting a quorum, waiting for responses, and returning the highest-versioned result — slower than reading from the nearest replica. PACELC forces you to budget that latency cost explicitly.

✍️ Exercise: consistency model choices for a collaborative document editor

You are designing the data layer for a collaborative document editor — multiple users can edit simultaneously. Consider three distinct data types in the system: character insertions (the actual content being typed), the document version number (used for optimistic locking and conflict detection), and billing credits (prepaid usage tokens, decremented on each AI suggestion).

For each of the three data types, answer: (1) which consistency model do you choose, (2) what is your justification, and (3) what failure mode does your choice accept?

Model answer:

Character insertions → eventual consistency (CRDT-based). Simultaneous edits from two users must not block each other. The system uses a CRDT (Conflict-free Replicated Data Type) such as a sequence CRDT, which allows all replicas to accept writes immediately and converge to the same document state without coordination. The failure mode accepted: transient divergence — two users may see slightly different document states for 50–200 ms before convergence. This is acceptable and is the same model used by Google Docs and Figma. Latency on every keystroke must be sub-20 ms; strong consistency would require a round-trip quorum on every character, making the editor feel sluggish.
Document version number → read-your-writes with optimistic locking. When a user saves a version explicitly, they must read back that version immediately to detect conflicts. Route post-save reads to the primary for a short window (2 s). Use an If-Match: version header on save operations so conflicting concurrent saves return 409 rather than silently overwriting each other. The failure mode accepted: two concurrent saves from different sessions could both read version N, both increment to version N+1, and produce a conflict. The conflict is surfaced as a 409, not silently swallowed.
Billing credits → strong consistency (quorum or primary-only writes/reads). Billing credits are the one category where serving stale data causes real financial harm. If two AI suggestions read a credit balance of 5 simultaneously from different replicas, both proceed and decrement, resulting in the user going to −1 credits. Use W=N for credit decrements (write to all replicas synchronously, or write only to a strongly consistent store such as PostgreSQL with synchronous_commit = on). Reads route to the primary. The failure mode accepted: higher write latency for AI suggestions (50–150 ms extra) and reduced availability during a primary failure (suggestions pause until a replica is promoted).

Rubric: Full marks for correctly identifying that character insertions require AP semantics (not strong consistency), that billing requires CP semantics, and that the document version sits between the two with optimistic locking. Partial marks for two out of three correct with justifications. Bonus for naming CRDTs by type, or for explicitly stating that the billing failure mode (momentary 503) is preferable to the alternative failure mode (serving stale balance and allowing credit overdraft).

Key takeaways

CAP forces a binary choice during partitions: consistency (refuse requests) or availability (serve possibly stale data). Partition tolerance is mandatory on real networks, so CA is not a real option.
PACELC extends CAP to normal operation: even without a partition, strong consistency costs latency. Budget that cost explicitly in your design rather than discovering it in production.
Quorum math gives you tunable consistency: R + W > N guarantees at least one node overlap between write and read sets. With N=3, W=2, R=2 gives strong consistency; W=1, R=1 gives eventual consistency.
Read-your-writes and monotonic reads are session-level guarantees weaker than strong consistency but sufficient for most user-facing applications. Implement them with a primary-routing window after writes, not by making all reads go to the primary forever.
Handle stale data explicitly: expose replication lag in monitoring, return min_version tokens in write responses, and route the critical read-after-write path to the primary — let the rest of your traffic enjoy the lower latency of replica reads.