API Design

Reliability & Scale · Lesson 17

High availability & redundancy

Availability is not something you bolt on after a system works — it is a structural property determined before you write a line of code. Every tier that has exactly one instance is a ticking clock.

⏱ 18 min Difficulty: advanced Prereq: Load Balancing, SLIs, SLOs & SLAs

By the end you'll be able to

Why one of anything is never enough

A restaurant that employs a single chef can produce excellent food — right up until the moment that chef calls in sick. On an ordinary Tuesday that risk feels theoretical. On a Friday night with a full reservation book it is a catastrophe. The restaurant's entire revenue-generating capability depends on one person who can fail.

The same logic applies to every component in a system. A single application server, a single database instance, a single load balancer — each one is a single point of failure (SPOF): a component whose failure causes the entire system to stop serving users. Redundancy is the practice of running multiple independent copies of each component so that one failing does not stop the others.

The critical insight is that redundancy must be applied at every tier independently. Deploying two application servers behind one load balancer does not make you highly available if the load balancer itself has no redundant peer. Eliminating SPOFs requires a systematic audit: compute, database, cache, load balancer, DNS, and the network paths between them.
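The audit described above can be sketched in a few lines. This is a minimal illustration, not a real inventory tool: the tier names and instance counts are hypothetical, and a real audit would also have to check that the copies sit in different failure domains.

```python
from typing import Dict, List


def find_spofs(instances_per_tier: Dict[str, int]) -> List[str]:
    """Return the tiers that run fewer than two instances, sorted by name."""
    return sorted(tier for tier, count in instances_per_tier.items() if count < 2)


# Hypothetical inventory: two app servers, but one load balancer and one database.
inventory = {
    "dns": 2,
    "load_balancer": 1,
    "compute": 2,
    "cache": 2,
    "database": 1,
}

print(find_spofs(inventory))  # ['database', 'load_balancer']
```

Two redundant application servers do not help here: the load balancer and the database each still take the whole system down on their own.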

Availability Zones vs Regions

Cloud providers organise their infrastructure into two nested scopes that matter for redundancy design.

An Availability Zone (AZ) is one or more physically separate datacenters, with their own power, cooling, and network gear, located within a metropolitan area. AZs within the same region are connected to each other by low-latency private fiber (typically sub-2 ms round-trip). The key phrase is physically separate: a power outage, hardware fire, or network failure in one AZ does not propagate to its siblings, because they share no physical infrastructure.

A Region is a geographically distinct cluster of AZs — for example, AWS us-east-1 contains six AZs spread across northern Virginia. Regions are separated by hundreds or thousands of kilometres. A single Region failing is rare and requires a large-scale event: a natural disaster, a major internet exchange outage, or a catastrophic DNS misconfiguration. When a Region goes down, recovery requires traffic to be routed to an entirely different Region.

| Scope | What it protects against | Typical round-trip latency between peers | Cost multiplier |
| --- | --- | --- | --- |
| Multi-AZ | Single datacenter failure (power, hardware, local network) | < 2 ms | Low: egress between AZs in the same region is cheap |
| Multi-Region | Regional disasters, large-scale internet partition | 50–200 ms (intercontinental) | High: cross-region egress, data replication at scale |

For most services, multi-AZ is the right baseline. Multi-region is reserved for services whose SLO cannot be met by multi-AZ alone, or where latency to global users is a primary concern.

Active-active vs active-passive

Once you have redundant nodes, you need a strategy for how traffic is distributed across them — and what happens when one fails.

Active-active means all nodes serve live traffic simultaneously. The load balancer routes requests across every healthy instance. When one node fails, the load balancer's health check detects the failure and stops sending it new connections. No promotion step is needed: the remaining nodes simply absorb the failed node's share of the traffic, so they must have the spare capacity to do so. The achievable recovery time is roughly the health check detection window (typically 10–30 seconds). The trade-off: for stateful tiers, all active nodes must be capable of handling writes, which requires careful consistency management.

Active-passive means one node (the primary) handles all traffic while a standby node waits idle, continuously receiving replicated data but not serving requests. When the primary fails, a failover controller promotes the standby to primary and redirects traffic to it. This promotion step adds 20–60 seconds to recovery for typical managed databases (e.g., Amazon RDS Multi-AZ). The advantage: only one node ever accepts writes, which makes consistency far simpler to reason about.

| Property | Active-active | Active-passive |
| --- | --- | --- |
| Traffic during normal operation | Spread across all nodes | All on primary; standby idle |
| RTO on failure | Seconds (health check interval) | 30–60 s (detection + promotion + DNS) |
| Write consistency | Complex: concurrent writers require conflict resolution | Simple: single writer at all times |
| Resource utilisation | High: all nodes are productive | Low: standby capacity sits idle |
| Best for | Stateless application servers, read-heavy databases with replicas | Primary databases, services where split-brain is catastrophic |

Automatic failover driven by health checks

Redundancy is inert without a mechanism to detect failure and reroute traffic. That mechanism is the health check. See Load Balancing (Lesson 08) for how load balancers use health checks to remove unhealthy nodes from rotation; this section focuses on what a health check should test.

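Two Kubernetes probes matter here, each with a distinct job: the liveness probe tells the kubelet whether to restart the container (for deadlocks or unrecoverable states), and the readiness probe controls whether the pod stays in the Service's endpoint list and receives traffic, without restarting it.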

Getting these backwards is a common mistake. Configuring a failing readiness condition as a liveness probe causes cascading restarts instead of graceful traffic shedding — every pod that is momentarily overloaded gets killed, making the overload worse.

# Kubernetes pod spec — both probes on a typical API server
livenessProbe:
  httpGet:
    path: /health/live    # returns 200 if process is alive; 503 to trigger restart
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3     # 3 consecutive failures → restart

readinessProbe:
  httpGet:
    path: /health/ready   # checks DB conn, cache, downstream deps
    port: 8080
  periodSeconds: 5
  failureThreshold: 2     # 2 failures → remove from LB rotation
  successThreshold: 1     # 1 success → return to rotation

The health check's failure threshold and period together determine your detection window, roughly periodSeconds × failureThreshold. With periodSeconds: 5 and failureThreshold: 3, a failing node keeps receiving traffic for up to 15 seconds before it is removed; the readiness probe above (failureThreshold: 2) cuts that to about 10 seconds. Tune these against your SLO's error budget: the detection window is unavoidable downtime for users routed to the failing instance.
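As a rough sketch, the detection window for the probes in the pod spec above can be computed directly. The `timeout_seconds` parameter is an assumption added here, not part of the formula in the text: it accounts for the last probe having to time out before it counts as a failure, and it defaults to zero to match the simpler estimate.

```python
def detection_window_seconds(period_seconds: float, failure_threshold: int,
                             timeout_seconds: float = 0.0) -> float:
    """Approximate worst-case time between a failure and the node being marked unhealthy.

    Assumes the node fails just after a successful probe, so every one of the
    failure_threshold probes has to run (one per period) before removal.
    """
    return period_seconds * failure_threshold + timeout_seconds


print(detection_window_seconds(5, 2))   # readiness probe above: 10.0
print(detection_window_seconds(10, 3))  # liveness probe above: 30.0
print(detection_window_seconds(5, 3))   # example from the text: 15.0
```

Shorter periods and lower thresholds shrink the window, but they also make the check more sensitive to a single slow response.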

Multi-region

Multi-AZ protects against a single datacenter failing. For events that affect an entire metropolitan area — a regional internet exchange failure, a natural disaster, a large-scale cloud-provider incident — you need traffic to route to an entirely different region.

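Multi-region deployments provide two distinct benefits: resilience to the loss of an entire region, and lower latency for geographically distributed users, who can be served from the region nearest to them.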

Both benefits come with costs. Data written in one region must be replicated to other regions, introducing replication lag. Depending on consistency requirements, reads from a secondary region may see stale data — the same tension explored in Consistency & the CAP theorem (Lesson 16). Multi-region active-active is the hardest configuration to operate: writes must be globally replicated, conflicts must be detected and resolved, and the system must handle network partitions between regions gracefully.

Mapping the nines to architecture

An availability target of "99.9%" is not a single architectural decision — each additional nine requires a qualitatively different architecture, not just more instances. The nines encode structural requirements:

| Target | Downtime budget / year | Architecture required |
| --- | --- | --- |
| 99% (two nines) | ≈ 3.65 days | Single instance. Maintenance windows, ad-hoc restarts. |
| 99.9% (three nines) | ≈ 8.76 hours | Replicated instances behind a load balancer, still within one AZ. Planned maintenance must be zero-downtime. |
| 99.99% (four nines) | ≈ 52.6 minutes | Multi-AZ with automatic failover. Requires multi-AZ database (e.g. RDS Multi-AZ), AZ-aware load balancer, and health-check-driven failover completing in under a minute. |
| 99.999% (five nines) | ≈ 5.26 minutes | Multi-region active-active. Requires global traffic management (Anycast or GeoDNS), synchronous or near-synchronous cross-region replication, and automated regional failover. Extremely expensive to operate. |

These budgets assume failures are independent. If your database and your application server share the same AZ and the AZ fails, you've consumed both components' downtime simultaneously. This is why multi-AZ must be applied to every tier — a multi-AZ application server cluster backed by a single-AZ database still has a single-AZ SPOF.

Tie availability targets to your error budget framework — see SLIs, SLOs & SLAs (Lesson 10). The error budget is the arithmetic expression of the nines, and it is what your team burns through during an incident.
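The downtime budgets in the table above follow directly from the availability target. A small sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Minutes of allowed downtime over the period for a given availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h)")
```

Passing `period_days=30` gives the monthly budget, which is often the more useful number when the SLO is evaluated over a rolling 30-day window.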

Multi-AZ and multi-region topology

[Figure: topology diagram. Internet / clients connect to a load balancer in each region. Region A (us-east-1): AZ-1 with an app server and the DB primary (writes here), AZ-2 with an app server and the DB standby (sync replication, ready to promote). Region B (eu-west-1): AZ-3 and AZ-4, each with an app server and a DB replica (reads + promote), fed by async replication from Region A. Legend: LB / app traffic, DB replication, primary DB, standby / replica.]
Region A runs a multi-AZ active-passive database (primary in AZ-1, synchronously replicated standby in AZ-2). Region B holds async read replicas. Cross-region async replication provides disaster recovery but may lag by seconds.

Under the hood: a failover traced second by second

Abstract descriptions of failover hide the details that matter operationally. Walk through exactly what happens when AZ-1 fails in a multi-AZ active-passive setup, taking down both its application server and the database primary:

  1. t = 0 s — Instance stops responding. The application server in AZ-1 crashes (kernel panic, OOM kill, power failure). Existing in-flight HTTP connections hang. New connection attempts time out.
  2. t = 0–15 s — Health probes time out. The load balancer sends HTTP GET /health/ready probes at 5-second intervals. All three return no response (connection refused or timeout). Each failure increments the LB's unhealthy threshold counter.
  3. t = 15 s — LB removes the instance from rotation. After three consecutive failures (threshold reached), the LB marks the instance unhealthy and stops routing new connections to it. Existing connections that were mid-flight get a TCP RST from the LB or time out on the client side.
  4. t = 15 s+ — AZ-2 absorbs all traffic. Every new request goes to healthy AZ-2 app servers. Users experience elevated latency (AZ-2 may be slightly more loaded) but requests succeed.
  5. t = 16 s — DB failover initiates. If active-passive: the DB failover controller (or AWS RDS Multi-AZ) detects the primary in AZ-1 is unreachable. It begins promoting the AZ-2 standby to primary. This promotion takes 20–60 seconds for RDS.
  6. t = 45 s — DNS TTL expires; new connections resolve to the promoted standby. RDS updates the CNAME record for the DB endpoint to point to the new primary. Clients whose DNS TTL has expired re-resolve and connect to AZ-2's new primary. Clients that held connections to the old primary receive TCP RST and must reconnect. Connection pool libraries typically handle this transparently with retry logic.

Total RTO in this scenario: roughly 45–60 seconds. The error budget consumed depends on how many requests were in-flight during the 15-second detection window and whether connection pools retried transparently.

# AZ-1 failure: failover event log (truncated timestamps)
10:42:00.001 WARN  lb probe GET /health/ready → TIMEOUT instance=az1-app-01 attempt=1/3
10:42:05.003 WARN  lb probe GET /health/ready → TIMEOUT instance=az1-app-01 attempt=2/3
10:42:10.004 WARN  lb probe GET /health/ready → TIMEOUT instance=az1-app-01 attempt=3/3
10:42:10.005 ERROR lb instance az1-app-01 marked UNHEALTHY - removed from rotation
10:42:10.006 INFO  lb active targets: [az2-app-01, az2-app-02] (was: [az1-app-01, az2-app-01, az2-app-02])
10:42:11.200 WARN  rds failover-controller primary az1-db-01 unreachable - initiating promotion
10:42:11.201 INFO  rds failover-controller promoting standby az2-db-01 to primary
10:42:38.900 INFO  rds failover-controller az2-db-01 promoted to primary elapsed=27.7s
10:42:38.901 INFO  rds dns CNAME mydb.cluster.rds.amazonaws.com → az2-db-01.rds.amazonaws.com TTL=5s
10:42:43.500 INFO  app az2-app-01 connection-pool reconnected to new primary pool_size=20
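The detection steps in the trace come down to a consecutive-failure counter. The sketch below is a simplified model of that counter, not any particular load balancer's implementation; the function name and the convention that probe i runs at (i + 1) × period after the failure are assumptions for illustration.

```python
from typing import Iterable, Optional


def seconds_until_removed(probe_results: Iterable[bool], period_seconds: int,
                          failure_threshold: int) -> Optional[int]:
    """Replay probe results (True = healthy) taken once per period.

    Probe i (0-based) is assumed to run at (i + 1) * period_seconds after t = 0.
    Returns the time at which the consecutive-failure counter reaches the
    threshold, or None if it never does.
    """
    consecutive_failures = 0
    for index, healthy in enumerate(probe_results):
        if healthy:
            consecutive_failures = 0  # any success resets the counter
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            return (index + 1) * period_seconds
    return None


# Instance dies at t = 0 and every later probe fails.
print(seconds_until_removed([False, False, False], 5, 3))        # 15
# A single flaky probe followed by a success does not remove the node.
print(seconds_until_removed([False, True, False, False], 5, 3))  # None
```

Requiring consecutive failures is what keeps one slow response from ejecting a healthy node, at the price of a longer detection window.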

Failover timeline

[Figure: failover timeline. Axis marks: t = 0, 5 s, 10 s, 15 s, 45 s, 60 s. Events in order: instance fails, probe 1 timeout, probe 2 timeout, probe 3 timeout, LB removes node, DB promotion window (~30 s), DNS cutover to new primary, fully recovered. Legend: failure event, health probe, DB promotion, recovery event.]
From instance failure to full recovery: the 15-second detection window is unavoidable with a 5 s probe interval and threshold of 3. DB promotion adds another 30 s; DNS propagation adds up to another 15 s depending on TTL.

How to operate it: symptom to fix

| Symptom | Root cause | Fix |
| --- | --- | --- |
| Failover didn't trigger; traffic stuck on unhealthy node | Health check threshold too high, or check is too shallow (TCP port open only, not HTTP /health) | Reduce failure_threshold to 3; upgrade check from TCP to HTTP and add a test query to the database inside the handler. A node that can accept a TCP connection but cannot reach its database is unhealthy. |
| Split-brain: two nodes both believe they are the primary database | Network partition caused both the primary and standby to independently declare themselves primary (each believed the other was dead) | Use quorum-based fencing (STONITH, Shoot The Other Node In The Head) or a distributed lease backed by etcd or ZooKeeper. Only the node holding the lease may accept writes. The node that loses the lease must fence itself before the other promotes. |
| Health check passes but requests still fail | Shallow check returns HTTP 200 on a static /ping route that does not exercise the dependency chain. The database connection pool is exhausted; the /ping handler doesn't touch it. | Deep health check: open and close a real database connection, assert the cache is reachable with a PING, call critical downstream APIs with a known-good probe request. Return 503 if any dependency is unavailable. |
| Cross-AZ latency spike causing read timeouts | App servers in AZ-2 are sending all read queries to the primary database in AZ-1. Each read crosses the AZ boundary, adding ~2 ms round-trip. Under load, this saturates the primary's network interface. | Deploy a read replica in each AZ. Route read queries to the local replica; reserve cross-AZ traffic for writes to the primary. Co-locate write-heavy services with the primary AZ to minimise cross-AZ write latency. |

# Deep health check endpoint: what the handler should verify
GET /health/ready
200 OK
{"db":"ok","cache":"ok","downstream":"ok","latency_ms":3}

# What a failing deep check looks like:
503 Service Unavailable
{"db":"timeout","cache":"ok","downstream":"ok"}
# → LB removes this pod from rotation; other pods still serve

# Shallow check: gives false confidence
200 OK
{"status":"ok"}   ← returns 200 even when DB pool is exhausted
ERROR app query failed: connection pool exhausted (pool_size=20, waiting=47)
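One way the logic behind such a /health/ready handler could look is sketched below. It is framework-agnostic and uses stand-in check functions: `db_ok`, `db_pool_exhausted`, and `cache_ok` are hypothetical placeholders for a real `SELECT 1`, a real pool timeout, and a real cache `PING`.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], None]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-dependency report).

    Each check is a callable that raises on failure. Any failure yields 503.
    """
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "ok"
        except TimeoutError:
            report[name] = "timeout"
        except Exception:
            report[name] = "error"
    status = 200 if all(state == "ok" for state in report.values()) else 503
    return status, report


def db_ok() -> None:
    """Stand-in for running SELECT 1 on a real connection."""


def db_pool_exhausted() -> None:
    """Stand-in for a database check that cannot obtain a connection in time."""
    raise TimeoutError("connection pool exhausted")


def cache_ok() -> None:
    """Stand-in for sending PING to the cache."""


print(readiness({"db": db_ok, "cache": cache_ok}))
# (200, {'db': 'ok', 'cache': 'ok'})
print(readiness({"db": db_pool_exhausted, "cache": cache_ok}))
# (503, {'db': 'timeout', 'cache': 'ok'})
```

In a real handler each check should carry its own short timeout, so that a hung dependency makes the probe fail fast rather than hang.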

Interview signal

In system design interviews, "add redundancy" is not a complete answer. Interviewers want to see three things: which tier has the SPOF (compute, database, load balancer, DNS), how failover is detected (what is the health check checking, what is the failure threshold, what is the detection window), and what the RTO of each component is. A strong answer maps the claimed availability target directly to the architecture: "We want 99.99%, which means about 52 minutes of downtime budget per year. A 60-second RTO from multi-AZ DB failover consumes about 1 minute of that budget per incident, so we can sustain roughly 50 such incidents per year before breaching the SLO."

Pitfall: shallow health checks

A server returning HTTP 200 on /ping while its database connection pool is exhausted will be kept in rotation, serving errors to every user. Shallow checks — TCP port open, static JSON response, anything that does not exercise the real dependency chain — give false confidence. Your readiness endpoint must open a real database connection, verify cache reachability, and call any downstream APIs your handlers depend on. If any of those fail, return 503 immediately. The LB will do the right thing; you just have to tell it the truth.

Test failover in production, regularly

Test failover in production on a schedule — not just at launch and not only in staging. Scheduled fire drills, deliberately removing one AZ from rotation during low-traffic hours, reliably expose hidden SPOFs before an actual outage does. Common finds: a service that hardcoded a single AZ's internal endpoint rather than using the DNS-based load-balanced address; a DNS record with a 1-hour TTL that means "failover" takes an hour; a database connection string that bypasses the CNAME and points directly to the primary's IP. None of these appear in unit tests. All of them appear in fire drills.

By the numbers

Make it concrete. The service is a B2B SaaS API targeting 99.99% annual availability (four nines). Every architectural decision below is evaluated against that budget.

Series vs. redundancy: the availability math

When components are in series (all must be up for the service to work), their availabilities multiply — and the product is always worse than the weakest component:

A_series = a₁ × a₂ × … × aₙ

3 components each at 99.9%:
A_series = 0.999 × 0.999 × 0.999 = 0.997 = 99.7%   ← WORSE than any individual component

The lesson: a three-tier stack (load balancer + app server + database) that runs only one instance of each tier is limited to 99.7% even if every component is individually 99.9%. Redundancy is the fix.

When components are redundant (the service stays up as long as at least one of k instances is up), the combined availability is:

A_redundant = 1 − (1 − a)^k      (assumes independent failures)

Two 99% instances (k = 2):
A_redundant = 1 − (1 − 0.99)² = 1 − 0.0001 = 0.9999 = 99.99%

Two mediocre 99% instances in redundancy deliver better availability than one excellent 99.9% instance alone. This is why redundancy is the primary lever, not individual component reliability.
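Both formulas are short enough to check numerically. A minimal sketch, assuming independent failures as the formulas do (function names are illustrative):

```python
import math
from typing import Sequence


def series_availability(availabilities: Sequence[float]) -> float:
    """All components must be up: the availabilities multiply."""
    return math.prod(availabilities)


def redundant_availability(a: float, k: int) -> float:
    """At least one of k independent instances must be up."""
    return 1.0 - (1.0 - a) ** k


print(f"{series_availability([0.999, 0.999, 0.999]):.4%}")  # 99.7003%
print(f"{redundant_availability(0.99, 2):.4%}")             # 99.9900%
```

The independence assumption carries most of the weight here: two instances in the same AZ, on the same deploy pipeline, or behind the same misconfiguration fail together, and the real number is lower than the formula suggests.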

Nines → downtime/year table

| Target | Annual downtime budget | Monthly budget (30 d) | Architecture tier required |
| --- | --- | --- | --- |
| 99% (two nines) | 3.65 days | 7.2 hours | Single instance; maintenance windows allowed |
| 99.9% (three nines) | 8.76 hours | 43.2 minutes | Replicated instances behind a load balancer (single AZ) |
| 99.99% (four nines) | 52.6 minutes | 4.3 minutes | Multi-AZ with auto-failover; every tier redundant |
| 99.999% (five nines) | 5.26 minutes | 26 seconds | Multi-region active-active; global traffic management |

Memory aid: each additional nine cuts the annual downtime budget by 10×. Going from three nines to four nines shrinks your budget from 8.76 hours to 52.6 minutes, so a single careless deployment or a 45-minute database failover can consume nearly the entire year's allowance in one incident.

Failover budget: does it fit the SLO?

For a four-nines target (52.6 min/year), trace the exact failover timing for an active-passive multi-AZ setup with periodSeconds: 5 and failureThreshold: 3:

Detection window: 3 failed checks × 5 s = 15 s   (requests to failing node fail during this window)
LB reroute:       ~1 s    (LB marks node unhealthy, stops new connections)
DB promotion:     ~28 s   (RDS Multi-AZ standby promoted; see log above)
DNS TTL:          5 s     (RDS CNAME update propagates)
Pool reconnect:   ~6 s    (connection-pool clients re-resolve and reconnect)
Total RTO:        15 + 1 + 28 + 5 + 6 ≈ 55 s per incident

Does it fit? The annual budget for 99.99% is 52.6 minutes = 3,156 seconds. One incident burns 55 s, so you can sustain at most ⌊3,156 ÷ 55⌋ = 57 incidents per year (about one per week) before breaching the SLO. That seems comfortable — but any incident with manual steps, a slow DNS TTL, or a multi-step runbook that takes 5 minutes instead of 55 seconds would consume 5.5 incidents' worth of budget in a single event.
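The same budget check can be written out so the phases are easy to vary. The phase durations below are the ones used in this section; the dictionary keys and variable names are illustrative.

```python
# Annual budget for 99.99%, using the rounded 52.6 min/year figure from the text.
ANNUAL_BUDGET_SECONDS = 3156

# Phase durations in seconds for one active-passive multi-AZ failover.
phases_seconds = {
    "detection": 15,       # 3 failed checks x 5 s
    "lb_reroute": 1,
    "db_promotion": 28,
    "dns_ttl": 5,
    "pool_reconnect": 6,
}

rto_seconds = sum(phases_seconds.values())
incidents_per_year = ANNUAL_BUDGET_SECONDS // rto_seconds

print(rto_seconds)          # 55
print(incidents_per_year)   # 57

# A 5-minute manual runbook costs several automated failovers' worth of budget.
manual_runbook_seconds = 5 * 60
print(round(manual_runbook_seconds / rto_seconds, 1))  # 5.5
```

Changing a single entry, for example raising `dns_ttl` to 3600 for a one-hour TTL, shows how quickly one slow phase dominates the budget.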

Decision math: how many redundant instances to hit a target

You need each tier to contribute negligible failure probability. Solving A_redundant ≥ A_target for k:

k ≥ log(1 − A_target) ÷ log(1 − a)

Target A = 99.99%, component a = 99.9%:
k ≥ log(1 − 0.9999) ÷ log(1 − 0.999)
  = log(0.0001) ÷ log(0.001)
  = (−4) ÷ (−3)
  ≈ 1.33  →  round up to k = 2 instances per tier

Two 99.9% instances per tier (compute + database + load balancer) combine for: A = 1 − (0.001)² = 99.9999% per tier, well above the 99.99% target. The series product of three tiers is then 0.999999³ ≈ 99.9997% — still above four nines. This is why multi-AZ active-passive with two nodes per tier is the standard four-nines architecture.
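The instance count can also be found by direct search rather than with logarithms. The sketch below uses exact rational arithmetic so that boundary cases (where the redundant availability exactly equals the target) are not thrown off by floating-point rounding; passing the availabilities as decimal strings and the `max_k` cap are choices made for this example.

```python
from fractions import Fraction


def min_instances(target: str, component: str, max_k: int = 64) -> int:
    """Smallest k with 1 - (1 - a)^k >= target, assuming independent failures.

    target and component are availabilities written as decimal strings,
    e.g. "0.9999", so the comparison is exact.
    """
    target_unavailability = 1 - Fraction(target)
    component_unavailability = 1 - Fraction(component)
    if not 0 < component_unavailability < 1:
        raise ValueError("component availability must be strictly between 0 and 1")
    for k in range(1, max_k + 1):
        if component_unavailability ** k <= target_unavailability:
            return k
    raise ValueError("target not reachable within max_k instances")


print(min_instances("0.9999", "0.999"))  # 2
print(min_instances("0.9999", "0.99"))   # 2
print(min_instances("0.99999", "0.99"))  # 3
```

The second call is the boundary case: two 99% instances give exactly 99.99%, which a floating-point version of the log formula can round up to three.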

Check your understanding

Which statement best describes an Availability Zone?

AZs are physically isolated datacenters — separate power, networking, and cooling — but within the same metro area. That physical separation means a hardware failure or power outage in one AZ does not affect others. The low-latency private fiber between AZs is what makes synchronous database replication practical within a region.

Active-passive failover accepts a longer RTO in exchange for simpler consistency. Which of the following is a correct statement about active-active failover?

In active-active, every node is live. When one fails, the load balancer simply stops sending it new connections — no promotion required. The trade-off is that all active nodes must handle writes, which raises consistency complexity. DNS TTL affects recovery speed but is not a prerequisite for active-active.

A service achieves 99.9% availability. Approximately how much downtime is that per year?

99.9% means 0.1% downtime. 0.001 × 365 × 24 = 8.76 hours per year. The memory aid: each additional nine cuts downtime by 10×. Going from 99.9% to 99.99% reduces the annual budget from ~8.76 hours to ~52.6 minutes. 52 minutes is four nines; 26 minutes is 99.995% (four and a half nines).

A readiness probe and a liveness probe serve different purposes in Kubernetes. Which statement is correct?

Readiness controls whether the pod is added to the Service's endpoint list — a failing readiness probe removes it from LB rotation without restarting it. This is the right mechanism during warm-up or when a downstream dependency is temporarily unavailable. Liveness controls whether kubelet should restart the container, used for deadlocks or unrecoverable states. Getting them backwards causes cascading restarts instead of graceful traffic shedding.

Exercise: design a multi-AZ failover for a checkout service

Scenario

A payment checkout service runs in a single AZ (us-east-1a). The database is a single PostgreSQL instance in the same AZ. Last month it suffered a 45-minute outage when a hardware failure took down the AZ. The team's SLO is 99.95% availability, which is approximately 263 minutes (about 4.4 hours) of downtime budget per year.

Design the architecture changes needed, specifying:

  1. How you achieve AZ redundancy for both the application tier and the database tier.
  2. What health checks you add and at what thresholds.
  3. What the expected RTO becomes after your changes.
  4. Whether 99.95% is now achievable for this failure mode.

Model answer:

(a) AZ redundancy. Deploy application servers in both us-east-1a and us-east-1b behind an Application Load Balancer (ALB). The ALB is itself multi-AZ by default — AWS provisions ALB nodes in each AZ. Use Amazon RDS Multi-AZ: the primary runs in us-east-1a and synchronously replicates to a standby in us-east-1b. AWS manages the standby — no application changes required for the DB tier. The standby promotes automatically within 30–60 seconds on primary failure.

(b) Health checks. Configure the ALB target group with an HTTP health check on GET /health/ready (expects HTTP 200). The handler verifies: a real database query (SELECT 1) responds within 500 ms, and any critical downstream APIs (fraud check, payment gateway status) are reachable. Thresholds: HealthyThresholdCount=2, UnhealthyThresholdCount=2, HealthCheckIntervalSeconds=5. This gives a 10-second detection window. RDS failover is managed by AWS automatically.

(c) Expected RTO. Application-tier detection: 10 s. RDS promotion: 30–60 s (a typical figure, not a contractual guarantee; measure it in your own failover drills). DNS propagation: RDS uses a CNAME endpoint; connection pool clients re-resolve within 60 s of the CNAME update (DNS TTL on RDS endpoints is 5 s, but client-side resolver caching varies). Worst-case RTO: approximately 2 minutes. Connection poolers (e.g., PgBouncer, RDS Proxy) can absorb the DB reconnection and reduce application-visible errors during this window.

(d) SLO achievability. The 99.95% budget is about 263 minutes per year, so at roughly 2 minutes per incident it would take around 130 such incidents to breach it. Hardware failures of this type occur at most a few times per year in a well-operated AZ. Yes: this architecture supports 99.95% for AZ-level hardware failures. Additional budget consumers to account for: planned maintenance windows (use multi-AZ rolling restarts), application deployment errors (canary deployments), and RDS minor version upgrades. With those controlled, 99.95% is achievable.

Rubric: Full marks for all four parts with realistic timing numbers. Partial marks for correct topology but no RTO calculation. Bonus for noting RDS Proxy or PgBouncer to absorb the DB reconnection window, or for noting that DNS TTL is a hidden delay in naive connection string configurations.

Key takeaways
