API Design

Reliability & Scale · Lesson 16

Consistency, CAP & correctness at scale

When three nodes disagree about the value of a key, your system has to pick a side — and the side you pick defines the contract your API makes with every caller. This lesson gives you the vocabulary, the math, and the practical wiring to make that choice deliberately.

⏱ 15 min Difficulty: advanced Prereq: Scaling the database Prereq: Load balancing

By the end you'll be able to

When nodes stop agreeing

Imagine a fleet of three database servers spread across two data centres. A fibre cut splits them: one side can reach the other side but messages are dropped. For a few seconds — or minutes — the two halves diverge. A user writes a new profile photo on the left side; a second user immediately requests that profile from the right side. What does your API return?

That is not an edge case. It is the defining question of distributed data systems, and every design decision you make about replication, routing, and caching is an answer to it — whether you realised it at the time or not. The CAP theorem makes the trade-off explicit.

The CAP theorem

In 2000, Eric Brewer conjectured — and in 2002 Gilbert and Lynch proved — that a distributed data store can satisfy at most two of three properties simultaneously:

Because network partitions do happen in any real distributed system — link failures, packet loss, asymmetric routing — partition tolerance is not optional. You must tolerate it. That reduces the real choice to: during a partition, do you keep serving possibly-stale data (AP) or refuse requests until the partition heals (CP)?

Consistency every read is current Partition Tolerance survives splits Availability always responds CP refuse requests during partition HBase, Zookeeper, etcd, Spanner AP serve stale data, always respond Cassandra, DynamoDB, CouchDB, Riak CA requires zero partitions (not achievable on real networks) network partition
The CAP triangle: every real distributed system lives on the CP–AP axis. CA is off the table — you cannot avoid partitions on real networks.

CP systems prioritise correctness. During a partition, a CP database returns an error rather than a stale value. The caller gets a clear signal: "I can't give you a reliable answer right now." HBase, ZooKeeper, and Google Spanner are CP. Your API returns 503 rather than wrong data.

AP systems prioritise uptime. During a partition, an AP database serves whatever value it currently holds, even if that value is seconds or minutes old. Cassandra and DynamoDB default to AP. Your API returns a response — potentially a stale one.

Neither is universally better. The choice belongs to the system's requirements.

PACELC: the full picture

CAP only describes behaviour under a partition. Daniel Abadi pointed out in 2012 that this leaves out the 99.9% of the time when there is no partition. Even then, you face a trade-off: latency versus consistency.

PACELC stands for: if there is a Partition, choose between Availability and Consistency; Else (normally), choose between Latency and Consistency.

SystemUnder partition (CA)Normal operation (LC)PACELC label
DynamoDB (default) Availability — responds with possibly-stale data Latency — eventually consistent reads are faster PA / EL
DynamoDB (strong consistency) Availability Consistency — extra latency for quorum reads PA / EC
Cassandra (ONE read) Availability Latency — reads from one replica PA / EL
Cassandra (QUORUM read) Availability Consistency — reads from majority PA / EC
HBase / etcd Consistency — rejects requests during partition Consistency — always linearisable PC / EC
MySQL async replication Availability Latency — replica reads may lag PA / EL

PACELC makes the design conversation more honest. When a colleague says "we need high availability and strong consistency," PACELC forces the question: at what read latency cost? Strong consistency usually means at least one extra network round-trip per read (the quorum acknowledgment), or routing all reads to the primary. That cost is real and must be budgeted.

Strong consistency and eventual consistency

Think of a flight departure board. A strongly consistent board shows the current gate every time — every display, everywhere in the terminal, shows the same value right now, or it shows nothing at all. An eventually consistent board might show Gate 12 on one screen and Gate 14 on another for 30 seconds after a gate change, then converge.

Strong consistency (also called linearisability) means: once a write is acknowledged, any subsequent read — from any node, from any client — sees that write or something newer. There is a single, total order of operations. From a caller's perspective, the distributed system behaves like a single machine.

Eventual consistency means: if no new writes arrive, all replicas will eventually converge to the same value. During the convergence window, different replicas may return different answers. Your API may serve stale data. The system makes a liveness guarantee (it will converge) but not a safety guarantee (it will be current right now).

PropertyStrong consistencyEventual consistency
Read reflects latest write? Always, from any node Only after propagation delay
Read latency Higher — quorum or primary required Lower — any replica can respond
Availability during partition Reduced — some nodes refuse Full — every node responds
Good fit for Bank balances, inventory counts, configuration, tickets Social feeds, view counts, shopping-cart drafts, DNS
Failure mode Returns an error; never lies Returns an answer; might be wrong
⚠️ "Eventual" does not tell you when

Eventual consistency says convergence will happen — it does not say when. Propagation lag can be 50 ms, 5 seconds, or permanent if you hit a split-brain scenario where two primaries accept conflicting writes. Design for the worst realistic delay your replication topology can produce, not the average.

Quorum: the math behind consistency

Quorum is the mechanism that lets you get strong consistency out of an eventually consistent cluster — without routing everything to a single primary. The key insight: if a write touches enough nodes, and a read touches enough nodes, the read set and write set must overlap. Overlap means the read will always hit at least one node that has the latest write.

The formula: R + W > N, where N = total replica count, W = write quorum (nodes that must acknowledge a write), R = read quorum (nodes that must respond to a read). When this holds, the overlap is guaranteed: overlap = R + W − N ≥ 1.

Worked example with N=3:

Write Quorum (W=2) Nodes A + B acknowledge write Read Quorum (R=2) Nodes B + C respond to read guaranteed overlap R+W−N = 1 node A v=42 ✓ WRITE ack B v=42 ✓ WRITE ack + READ C v=41 (old) READ only N=3 nodes · W=2 · R=2 · R+W=4 > N=3 · overlap guaranteed
Write quorum (A+B, teal) and read quorum (B+C, periwinkle) share node B. B always has the latest write — so the read always returns the correct value even though C is stale.

In practice, the coordinator (the node the client contacts) sends the write to W nodes, waits for W acknowledgments, then returns success to the client. On a read, it contacts R nodes, collects R responses, and returns the value with the highest version number or timestamp. No single point of failure, no single-primary routing required — the math guarantees correctness.

Read-your-writes and monotonic reads

Two consistency guarantees are lower than strong consistency but matter enormously for user experience:

Read-your-writes consistency guarantees that after you write, you see your own write when you read — from the same session, even if other clients might not see it yet. Without this, you submit a form, the page reloads, and your change has vanished. This is the minimum bar for anything user-facing. It does not require all clients to be consistent, only that the session that performed the write sees it.

Monotonic reads guarantee that once you read a value at version V, you will never subsequently read a version earlier than V. Without monotonic reads, a user could refresh a page and see an older version of data than they saw one refresh ago — as if time ran backwards. This happens when consecutive requests land on different replicas at different replication offsets.

Both of these are session-level guarantees. They are weaker than strong consistency (different clients may still see different values) but much stronger than bare eventual consistency.

Handling stale data in practice

Replica lag (covered in the previous lesson) is the primary source of stale reads. When a replica is 200 ms behind the primary, any read routed to that replica may return data that is up to 200 ms old. Here are the techniques engineers use to handle this in API design:

TechniqueHow it worksCost
Route read-after-write to primary After a write, route the next N reads (or reads within the next T seconds) from the same session to the primary. Once the window passes, fall back to replicas. Slightly higher read load on primary; requires session state to track recency.
Min-version / replication token The write response includes a min_version or write_fence token. The client sends this token with subsequent reads. The server rejects the read from any replica that has not yet applied that version and retries from a more up-to-date one. Adds a header to the API contract; requires replicas to expose their replication offset.
Wait-for-replication option The write request accepts a parameter like consistency=quorum that blocks until the specified number of replicas acknowledge the write before returning. The client then knows all subsequent quorum reads will see the write. Increases write latency proportionally; must be opt-in to avoid penalising all writes.
ETag / If-None-Match Responses carry an ETag representing the data version. A client that holds a recent ETag can send If-None-Match; a replica that hasn't applied the latest version returns 412 Precondition Failed rather than a stale 200. Requires the storage layer to version data and expose ETags; adds conditional-request logic to the API.
Accept stale, expose it For non-critical reads, serve stale data but include a header like X-Data-Version or X-Replication-Lag-Ms so callers can reason about freshness themselves. Shifts the correctness burden to the caller; appropriate for social feeds, analytics, search results.
✅ Practical: the primary-fallback window

For most web APIs, the simplest read-after-write implementation is: tag the user session with the timestamp of their most recent write. For the next 2 seconds after a write, route all reads from that session to the primary. After 2 seconds, fall back to replicas. This requires one session store key and two routing rules. It handles the user who hits refresh immediately after saving without penalising steady-state read traffic. Include a min_version token in write responses for clients that need a stronger guarantee than the timeout-based window.

Under the hood: how it actually works

Trace A: quorum read returning the correct value (N=3, W=2, R=2)

Client writes user_id=42, balance=100. The coordinator sends the write to nodes A and B. Node C is temporarily slow and misses the write. Client then reads immediately — the coordinator picks B and C as its read quorum.

# t=0ms — Client sends write to coordinator (node A) WRITE user_id=42 balance=100 version=7 # t=1ms — Coordinator forwards to W=2 nodes → Node A ack version=7 (local write) → Node B ack version=7 (replicated) → Node C ··· (slow, not in write quorum) # t=3ms — W=2 acks received. Write confirmed to client. WRITE OK version=7 # t=5ms — Client sends read. Coordinator picks R=2 nodes: B and C READ user_id=42 # t=6ms — Coordinator collects responses from B and C Node B → balance=100 version=7 Node C → balance=99 version=6 (stale — missed the write) # Coordinator applies conflict resolution: highest version wins READ RESULT balance=100 version=7 (from Node B) # The overlap guarantee: B was in BOTH the write quorum and the read quorum. # R+W-N = 2+2-3 = 1 node overlap guaranteed → correct value always reachable.

Trace B: eventual consistency — a second client reads stale data

Same cluster. This time the write uses W=1 (no quorum). A second client reads from a replica that hasn't caught up yet.

# t=0ms — Client 1 writes to primary (W=1, no quorum wait) WRITE post_id=801 title="New post" → Primary Primary ack immediately version=51 # t=0ms — Primary begins async replication to Replica 1 and Replica 2 # t=+40ms — Replica 1 applies the write (version=51 now on Replica 1) # t=+80ms — Replica 2 applies the write (version=51 now on Replica 2) # t=+50ms — Client 2 reads from Replica 2 (which is still at version=50) READ post_id=801 → Replica 2 Replica 2 → 404 Not Found (version=50, write not yet applied) # Client 2 sees a post that does not exist — yet. 30ms later it would. # This is valid eventual consistency behaviour, not a bug. # t=+80ms — Client 2 retries the same read READ post_id=801 → Replica 2 Replica 2 → title="New post" version=51 (converged)

How to debug & inspect it

In PostgreSQL, replication lag is directly observable. The primary records each replica's replication state in pg_stat_replication. The replay_lag column tells you how far behind each standby is — combine this with the min_version token from your write response to decide whether a given replica is safe to read from.

-- Check current replication lag on the primary SELECT application_name, state, write_lag, flush_lag, replay_lag, sync_state FROM pg_stat_replication; -- Example output: application_name │ state │ write_lag │ flush_lag │ replay_lag │ sync_state ──────────────────┼───────────┼───────────┼───────────┼────────────┼──────────── replica-eu-1 │ streaming │ 00:00:00 │ 00:00:00 │ 00:00:00 │ async replica-us-2 │ streaming │ 00:00:00 │ 00:00:00 │ 00:00:00.2 │ async replica-ap-3 │ streaming │ 00:00:01 │ 00:00:01 │ 00:00:03.8 │ async -- replica-ap-3 is 3.8 seconds behind. Route read-after-write away from it. -- On a replica: check its own lag versus the primary SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay; -- If NULL, this instance is the primary (not a replica)
SymptomLikely causeFix
User wrote data then didn't see their change Read-after-write routed to a lagging replica Route post-write reads to primary for a short window; check replay_lag in pg_stat_replication
Two users see different values for the same key simultaneously Requests hitting replicas at different replication offsets; or quorum too low (R+W ≤ N) Increase read or write quorum so R+W > N; or use sticky session routing
Read returns old data even after write was confirmed Write quorum was met, but read quorum didn't include an up-to-date node; or replica lag exceeded read timeout Verify R+W > N; add min_version token to read requests; fall back to primary if token check fails
Primary refuses write during partition CP behaviour — system correctly prioritises consistency over availability Expected in a CP system; return 503 with Retry-After; document this as a known failure mode in API contracts
Replica serves stale data after network heals Replica was partitioned and accumulated a write backlog; replication is catching up Monitor replay_lag post-healing; exclude the replica from the read pool until lag drops below threshold; alert if lag exceeds SLO
🎯 Interview angle

When a system design question asks you to design a data layer, name the consistency model you are choosing — and justify it. "I'll use eventual consistency" is incomplete. Say: "I'll use eventual consistency for the activity feed because users tolerate a few seconds of delay and the extra replica availability matters more than guaranteed freshness. I'll use strong consistency for the account balance because serving a stale balance and then rejecting a payment is worse than a momentary 503." Examiners are listening for whether you treat consistency as a deliberate architectural trade-off, not a default you accept without thought. What would you choose for a social feed vs a bank balance?

✅ Min-version tokens in API responses

Include a X-Write-Version (or min_version) field in write response bodies. The value is the replication sequence number or logical clock value that the write advanced to. Clients that need read-after-write semantics send this back as a request header; your read path rejects any replica whose replication offset is below that value and retries elsewhere. This gives clients opt-in strong read-after-write without routing all reads to the primary permanently.

By the numbers

Make the math concrete. Scenario: an order-status API backed by a 3-node Cassandra cluster. Writes arrive at 200 writes/s; reads arrive at 800 reads/s. Replication lag under normal conditions is 40 ms.

Quorum overlap: why R+W > N guarantees a fresh read

With N=3, W=2, R=2: every write touches 2 of 3 nodes; every read contacts 2 of 3 nodes. By the pigeonhole principle, at least one node appears in both sets.

# N=3, W=2, R=2 → R+W = 4 > 3 = N → overlap = R+W−N = 1 node guaranteed Write touches {A, B} (W=2) Read contacts {B, C} (R=2) ↑ Node B is in BOTH sets — coordinator picks max-version among B and C = correct value # Worst case: write went to {A,B}, read goes to {B,C} — overlap = B ✓ # Another case: write went to {A,C}, read goes to {A,B} — overlap = A ✓ # Both: read always includes at least one node with the latest write.

P(stale read) under eventual consistency (W=1, R=1)

When W=1, R=1 (no quorum), the write is acknowledged after reaching one node. The other two nodes receive the update asynchronously. A read arriving before replication completes will hit a stale replica.

P(stale read) ≈ replication_lag / inter_write_interval replication_lag = 40 ms (time for write to propagate to all replicas) inter_write_interval = 1 / write_rate = 1000 ms / 200 = 5 ms (writes every 5 ms) P(stale read) ≈ 40 / 5 = 8.0 → capped at 1.0 (100%) # 200 writes/s with 40 ms lag → at any instant, the most recent ~8 writes are # still propagating. With R=1 reads, most reads hit a replica not yet converged.

Reduce write rate to 20/s (inter_write_interval = 50 ms): P(stale) ≈ 40/50 = 80%. Reduce replication lag to 2 ms (fast LAN): P(stale) ≈ 2/50 = 4%. The table shows the staleness surface across lag and write rate combinations:

Replication lag100 writes/s (10 ms apart)20 writes/s (50 ms apart)1 write/s (1,000 ms apart)
2 ms20%4%0.2%
10 ms100% (always stale)20%1%
40 ms100%80%4%
200 ms100%100%20%

Formula source: DeCandia et al. — Dynamo §4 (probabilistic staleness model); the approximation assumes uniform read arrival and Poisson write arrivals.

Decision math: picking R and W for read-heavy vs write-heavy workloads

The 800 read/s vs 200 write/s ratio means the read path is the bottleneck. With N=3:

RWR+W vs NConsistencyRead latencyWrite latencyBest for
134 > 3 ✓StrongLowest (1 hop)Highest (all 3 nodes)Very read-heavy; writes can be slow
224 > 3 ✓StrongMediumMediumBalanced (default choice)
314 > 3 ✓StrongHighest (all 3 nodes)Lowest (1 hop)Very write-heavy; reads can be slow
112 = 3 ✗Eventual onlyLowestLowestAcceptable staleness; maximum throughput

For this order-status API (read-heavy, correctness matters for order state), R=1, W=3 is optimal: reads are served from any single replica (low latency for the high-volume read path) and writes touch all three nodes, ensuring any subsequent read — even from a cold replica — sees the latest write. Write latency is the cost: at 200 writes/s, the extra ~1 RTT to the third node adds roughly the replica round-trip time (~2–5 ms on LAN).

# Order-status API: 800 reads/s, 200 writes/s, N=3, LAN RTT=2ms # Option A: R=1, W=3 (strong, read-optimised) Read cost: 1 replica × 2 ms = 2 ms latency Write cost: 3 replicas; coordinator waits for slowest = ~4–6 ms P(stale read) = 0% (R+W=4>3, overlap guaranteed) # Option B: R=2, W=2 (strong, balanced) Read cost: 2 replicas; coordinator waits for slower = ~3–4 ms Write cost: 2 replicas = ~3–4 ms P(stale read) = 0% # Option C: R=1, W=1 (eventual, maximum throughput) Read cost: ~2 ms Write cost: ~2 ms P(stale read) ≈ 80% (40ms lag / 5ms inter-write = ~100%, capped per above) # Break-even: use eventual consistency only when staleness probability < your SLO tolerance.

🧠 Quick check

1. The CAP theorem states that during a network partition, which two properties cannot both be guaranteed?

Partition tolerance is non-negotiable on real networks — you cannot prevent network failures. So when a partition occurs, the live choice is between C (refuse requests rather than return stale data) and A (respond even with potentially stale data). Availability and Partition Tolerance are not in tension — AP systems handle partitions by staying available. Consistency and Partition Tolerance are not in tension either — CP systems handle partitions by prioritising consistency.

2. With N=3, W=2, R=2 quorum, why is R+W > N the key condition?

R+W > N means R+W−N ≥ 1: by the pigeonhole principle, there is at least one node that was in both the write quorum and the read quorum. That node holds the latest write. The coordinator picks the highest-versioned response from the R nodes, which must include that overlap node. This is the entire correctness argument for quorum reads.

3. Which consistency property ensures a user always sees their own most recent write?

Read-your-writes consistency is specifically the guarantee that your own session sees your own writes — even if other clients might not yet. Monotonic reads prevent time-travelling backwards across reads but do not guarantee you see your own latest write. Linearisability gives you strong global consistency (everyone sees the latest), which is stronger than needed for this specific UX guarantee.

4. PACELC adds what insight beyond CAP?

CAP only describes what happens during a partition. PACELC adds the "else" clause: in normal operation (no partition), every additional consistency guarantee costs latency. A strongly consistent read requires contacting a quorum, waiting for responses, and returning the highest-versioned result — slower than reading from the nearest replica. PACELC forces you to budget that latency cost explicitly.

✍️ Exercise: consistency model choices for a collaborative document editor

You are designing the data layer for a collaborative document editor — multiple users can edit simultaneously. Consider three distinct data types in the system: character insertions (the actual content being typed), the document version number (used for optimistic locking and conflict detection), and billing credits (prepaid usage tokens, decremented on each AI suggestion).

For each of the three data types, answer: (1) which consistency model do you choose, (2) what is your justification, and (3) what failure mode does your choice accept?

Model answer:

  1. Character insertions → eventual consistency (CRDT-based). Simultaneous edits from two users must not block each other. The system uses a CRDT (Conflict-free Replicated Data Type) such as a sequence CRDT, which allows all replicas to accept writes immediately and converge to the same document state without coordination. The failure mode accepted: transient divergence — two users may see slightly different document states for 50–200 ms before convergence. This is acceptable and is the same model used by Google Docs and Figma. Latency on every keystroke must be sub-20 ms; strong consistency would require a round-trip quorum on every character, making the editor feel sluggish.
  2. Document version number → read-your-writes with optimistic locking. When a user saves a version explicitly, they must read back that version immediately to detect conflicts. Route post-save reads to the primary for a short window (2 s). Use an If-Match: version header on save operations so conflicting concurrent saves return 409 rather than silently overwriting each other. The failure mode accepted: two concurrent saves from different sessions could both read version N, both increment to version N+1, and produce a conflict. The conflict is surfaced as a 409, not silently swallowed.
  3. Billing credits → strong consistency (quorum or primary-only writes/reads). Billing credits are the one category where serving stale data causes real financial harm. If two AI suggestions read a credit balance of 5 simultaneously from different replicas, both proceed and decrement, resulting in the user going to −1 credits. Use W=N for credit decrements (write to all replicas synchronously, or write only to a strongly consistent store such as PostgreSQL with synchronous_commit = on). Reads route to the primary. The failure mode accepted: higher write latency for AI suggestions (50–150 ms extra) and reduced availability during a primary failure (suggestions pause until a replica is promoted).

Rubric: Full marks for correctly identifying that character insertions require AP semantics (not strong consistency), that billing requires CP semantics, and that the document version sits between the two with optimistic locking. Partial marks for two out of three correct with justifications. Bonus for naming CRDTs by type, or for explicitly stating that the billing failure mode (momentary 503) is preferable to the alternative failure mode (serving stale balance and allowing credit overdraft).

Key takeaways

Sources & further reading