API Design

Reliability & Scale · Lesson 18

Resilience & disaster recovery

A system that either works perfectly or falls over completely is not resilient — it is brittle. The goal is never zero failures; it is ensuring that when a dependency breaks, the rest of the system keeps moving at reduced capacity rather than stopping entirely.

⏱ 20 min · Difficulty: advanced · Prereq: High availability, Retries & backoff, Circuit breaker

By the end you'll be able to

The difference between availability and resilience

High availability is about surviving the failure of an infrastructure component: a crashed server, a failed availability zone (AZ), a misconfigured load balancer. Resilience is a different and complementary goal: surviving the misbehaviour of a dependency that is technically still alive. A downstream service that responds in 8 seconds instead of 80 milliseconds is not down, but a caller that holds a thread open waiting for it is being quietly strangled. Resilience patterns are the bulkheads, pressure relief valves, and blast walls that prevent one struggling component from sinking everything around it.

The underlying principle is graceful degradation: when a dependency is unavailable or overloaded, return a degraded but useful response rather than an error. Serve a cached product list when the inventory service is slow. Show a notification count of "99+" when the count service is down rather than failing the page load. Omit the recommendations widget when the ML service times out. Each of these is worse than the full experience — and each is vastly better than a blank error screen.

Bulkheads: containing blast radius

The name comes from ship design. A bulkhead is a watertight wall between compartments of a hull. When one compartment floods, the walls contain the water. The ship loses some buoyancy but does not sink. In software, a bulkhead is a resource isolation boundary that prevents one slow or failing dependency from consuming every connection, thread, or memory allocation in the process.

The most concrete implementation is separate connection pools per downstream dependency. Suppose a service has three dependencies: a primary database, a recommendations service, and a notification service. Without bulkheads, all outbound I/O shares one system-level connection pool or thread pool. If the recommendations service starts responding in 10 seconds, threads pile up waiting for it. Soon the pool is exhausted — every request, including those that never touch recommendations, queues behind waiting threads. The primary database calls, which were fast, can no longer get a thread. The entire service becomes unresponsive because of a component it relies on for one optional feature.

With bulkheads: a pool of 20 threads handles recommendations. A separate pool of 50 handles primary database calls. The notification pool has 10. When recommendations stalls, it fills its own 20-thread pool and begins rejecting new requests to that dependency. The primary database pool is untouched. The service degrades (no recommendations) but continues serving its core function.

[Figure: two panels. Without bulkheads: a shared pool of 80 threads serves Primary DB, Notifications, and Recommendations; a 10 s Recommendations response leaves all 80 threads stuck waiting and the entire service unresponsive. With bulkheads: DB pool (50 threads), Notif pool (10 threads), Rec pool (20 threads); the 20 Rec threads fill, new calls are rejected and served the fallback, and the DB and Notif pools are unaffected.]
Left: a shared thread pool lets one slow dependency exhaust all threads, bringing the entire service down. Right: separate pools per dependency contain the failure — only the recommendations pool fills; the primary database pool is untouched.

Load shedding: choosing which work to drop

When a service is overloaded it has three options: accept all incoming requests and serve them slowly (adding to the queuing problem), crash, or deliberately reject a fraction of incoming work to protect its ability to serve the rest. Only the third is resilient. Load shedding is the deliberate act of returning a fast error — typically HTTP 503 or 429 — to some requests so that others complete successfully.

Stripe's engineering team has written publicly about their load shedding approach (see sources). The key insight is that not all requests have equal priority. A request to process a payment must be attempted even under extreme load. A request to fetch the account dashboard can be dropped — the user gets a "try again" message, which is frustrating but not catastrophic. Load shedding works by classifying requests by priority at the edge and returning 503 to lower-priority traffic once a configurable concurrency threshold is exceeded. See Rate Limiting Algorithms (Lesson 03) for the algorithmic primitives (token bucket, concurrency limiters) used to implement the concurrency threshold.
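The priority-based shedding described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not Stripe's implementation: the `Priority` levels, the `Shedder` type, and the threshold value are all hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Priority is a hypothetical two-level classification assigned at the edge.
type Priority int

const (
	Low  Priority = iota // e.g. dashboard reads
	High                 // e.g. payment processing
)

// Shedder rejects low-priority work once enough requests are in flight.
type Shedder struct {
	inFlight  atomic.Int64
	threshold int64 // low-priority work is shed when this many requests are already in flight
}

// Admit reports whether the request may proceed. When ok is true the caller
// must invoke release exactly once after the request finishes.
func (s *Shedder) Admit(p Priority) (release func(), ok bool) {
	n := s.inFlight.Add(1)
	if p == Low && n > s.threshold {
		s.inFlight.Add(-1)
		return nil, false // caller answers with a fast 503
	}
	return func() { s.inFlight.Add(-1) }, true
}

func main() {
	s := &Shedder{threshold: 2}
	s.Admit(High)
	s.Admit(High)
	_, lowOK := s.Admit(Low)
	_, highOK := s.Admit(High)
	fmt.Println(lowOK, highOK) // prints "false true"
}
```

High-priority requests are never rejected by this sketch; a production limiter would usually add a second, higher ceiling for them so the process itself stays protected.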

Backpressure: pushing the signal upstream

Load shedding says "I am too busy; go away." Backpressure is a related idea applied to internal queues and streaming systems: rather than accepting work into an unbounded queue and hiding the overload, the system signals congestion to the sender so the sender can slow down, instead of the queue growing without bound.

In a producer-consumer pipeline, the concrete mechanism is a bounded queue with a finite capacity. When the queue is full, the producer's call to enqueue() either blocks or returns an error immediately instead of succeeding. Either way, the producer gets an immediate signal that the consumer is not keeping up. It can then shed that unit of work (drop the message, return an error to its caller) or propagate the backpressure upstream by blocking its own caller until the queue drains. TCP flow control works the same way: when the receive buffer fills, the advertised TCP window shrinks to zero, stopping the sender at the transport layer. Designing application-level queues like this prevents the classic failure mode where a slow consumer causes unbounded memory growth until the process runs out of memory and crashes.

Timeouts everywhere

Every outbound network call must have a timeout. A missing timeout converts a slow dependency into a thread that waits indefinitely, which is the first step toward the pool exhaustion that bulkheads are meant to contain. "Everywhere" is not rhetorical: the HTTP client to the upstream service, the database connection, the cache client, the DNS lookup inside the HTTP client, and the TCP connection establishment all need individual timeouts. Default settings for many HTTP client libraries are either infinite (no timeout at all) or dangerously long (60+ seconds). Set them explicitly at a level consistent with your SLO.

Timeouts should be paired with retries and circuit breakers. Retrying once immediately after a single timeout is often correct for transient network glitches, but retries must be bounded: see Retries & Exponential Backoff (Lesson 05) for why unbounded retries amplify load on a struggling service. When a dependency is persistently slow or failing, the circuit breaker pattern (see Circuit Breaker Pattern, Lesson 06) stops calling it altogether for a recovery window, which both protects the dependency and frees your threads immediately.

Disaster recovery: RTO and RPO

Resilience patterns protect against transient and partial failures. Disaster recovery (DR) addresses a different scenario: the data is gone, or the entire system is destroyed, and you must rebuild. Two numbers define the envelope of that rebuild:

Recovery Time Objective (RTO) is the maximum elapsed time from the moment of a disaster until the service is restored to full operation. RTO is a contractual commitment about speed of recovery. An RTO of 4 hours means users can expect service to resume within 4 hours of a declared disaster — and the company has committed to investing whatever infrastructure is necessary to meet that. An RTO of 15 minutes implies active hot-standby infrastructure; an RTO of 24 hours may allow cold restore from nightly backups.

Recovery Point Objective (RPO) is the maximum amount of data loss the business can tolerate, measured in time. An RPO of 1 hour means you are prepared to lose up to one hour of transactions. If a disaster occurs at 14:37 and your last good backup is from 13:45, you have lost 52 minutes of data — inside your RPO. An RPO of zero means no data loss is acceptable, which requires synchronous replication (writing is not acknowledged until it is confirmed on the replica). An RPO of 24 hours allows nightly backups with no real-time replication.

| RTO | RPO | Typical architecture | Example |
| --- | --- | --- | --- |
| < 1 min | ~0 s | Multi-region active-active; synchronous cross-region replication | Real-time payment processing |
| 1–15 min | < 1 min | Multi-AZ with automatic failover; async replication with low lag | SaaS API serving thousands of tenants |
| 1–4 hours | 1 hour | Warm standby in a second region; hourly snapshot backups | Internal analytics platform |
| 24 hours | 24 hours | Cold restore from nightly backups; manual runbook-driven recovery | Archive or audit-log service |

The lower the RTO and RPO, the higher the infrastructure cost. A zero-RPO system must acknowledge every write only after it has been confirmed on a synchronous replica — which adds the network round-trip to the replica to every write latency. This is the same consistency–availability tension from Consistency & the CAP theorem (Lesson 16), expressed in DR terms.

Backups and restore drills

A backup that has never been restored is a hypothesis, not a guarantee. The most common disaster-recovery failure mode is not the absence of backups; it is discovering during an actual incident that the restore procedure is broken, the backup files are corrupt, or the process takes twice as long as the RTO allows. The only fix is to run restore drills on a schedule — quarterly at minimum — and to treat a successful restore as a measurable metric, not a one-time event.

Backup strategy must match the RPO. Point-in-time recovery (PITR), the continuous archival of write-ahead log segments supported by PostgreSQL and Amazon RDS, allows restoring to any second within the retention window, not just to the last snapshot. For an RPO of 5 minutes, PITR is the right tool: an hourly snapshot can lose up to 60 minutes of data, 55 minutes more than the objective allows.

Runbooks are as important as the infrastructure. A runbook is a step-by-step operating procedure for a specific failure scenario: what to verify, what commands to run, in what order, and how to confirm the restore succeeded. The runbook exists so that an on-call engineer who has never personally performed a full restore can do so at 3 AM without improvising. Keep runbooks version-controlled, link them to the relevant monitoring alerts, and review them after every drill that surfaces a gap.

Chaos engineering: proving resilience by breaking things

Code review and load testing check that a system behaves correctly under expected conditions. Chaos engineering — deliberately injecting failures into a running system — checks that the resilience patterns actually fire as designed. Netflix's Chaos Monkey is the most famous example: it terminates random production instances to verify that the rest of the fleet handles the loss gracefully, and to surface any implicit assumptions about specific instances always being available (see sources).

Chaos experiments are not random destruction. A well-designed experiment has a hypothesis ("if the recommendations service becomes unavailable, the homepage still loads and shows a static fallback"), a defined blast radius (only one of five recommendation servers, or only in a canary deployment group), and measurable criteria for success. The goal is to find failures in a controlled way before a real incident finds them in an uncontrolled way. Common experiments: terminate random instances, inject network latency (e.g., 300 ms added to all DB calls), exhaust a specific connection pool, simulate a slow DNS response, or drop packets between two services.

Start with a hypothesis, run in non-production first, and verify the hypothesis is confirmed before running in production at limited blast radius. Never run chaos experiments during peak traffic hours — the goal is discovery, not self-inflicted incidents.

Graceful degradation decision flow

[Figure: decision flow. Request arrives → circuit breaker open (has the dependency been failing)? Yes: skip the call entirely and go to the fallback check. No: call the dependency → responds within the timeout (or returns success)? Yes: full response, 200 OK. No: fallback available? Yes: return the degraded result. No: fail fast with 503.]
The degradation decision tree. The circuit breaker short-circuits failed calls entirely, immediately routing to the fallback path. When a fallback exists — cached data, a default value, a partial response — the caller receives a degraded but useful answer. Only when no fallback is possible should the service fail fast with 503.

Under the hood: bulkhead execution — a traced example

Here is exactly what happens inside a Java or Go service that uses separate semaphore-based connection pools as bulkheads. The mechanism is a counting semaphore: each pool has a max-concurrency value. Each outbound call must acquire a permit before sending the request. When all permits are held by in-flight calls, new callers either block for a maxWait duration or receive an immediate rejection. This is the moment that contains the blast radius — the caller can immediately return the fallback instead of hanging.

// Conceptual bulkhead using a counting semaphore (Go sketch).
// fetchRecommendations and defaultRecommendations are placeholders for application code.
import (
    "context"
    "errors"
    "time"
)

// ErrBulkheadFull is returned when no permit frees up within maxWait.
var ErrBulkheadFull = errors.New("bulkhead full")

type Bulkhead struct {
    sem     chan struct{} // buffered channel as a counting semaphore
    maxWait time.Duration // longest a caller waits for a permit
}

func NewBulkhead(maxConcurrency int, maxWait time.Duration) *Bulkhead {
    return &Bulkhead{
        sem:     make(chan struct{}, maxConcurrency), // capacity = max permits
        maxWait: maxWait,
    }
}

func (b *Bulkhead) Execute(ctx context.Context, fn func() (any, error)) (any, error) {
    timer := time.NewTimer(b.maxWait)
    defer timer.Stop() // free the timer as soon as we return

    // Attempt to acquire a permit within maxWait
    select {
    case b.sem <- struct{}{}: // permit acquired
        defer func() { <-b.sem }() // release the permit on return
        // fn must enforce its own timeout, or a slow call holds the permit indefinitely
        return fn()
    case <-timer.C:
        // Bulkhead saturated: give up after maxWait instead of hanging
        return nil, ErrBulkheadFull // caller routes to fallback
    case <-ctx.Done():
        return nil, ctx.Err()
    }
}

// Usage: separate bulkheads per dependency
var (
    dbBulkhead  = NewBulkhead(50, 100*time.Millisecond) // primary DB
    recBulkhead = NewBulkhead(20, 50*time.Millisecond)  // recommendations
    ntfBulkhead = NewBulkhead(10, 50*time.Millisecond)  // notifications
)

// In the request handler (fragment; using result and handling other errors is omitted):
result, err := recBulkhead.Execute(ctx, func() (any, error) {
    return fetchRecommendations(userID)
})
if errors.Is(err, ErrBulkheadFull) {
    recommendations = defaultRecommendations() // fallback: does not wait on the dependency
}

The trace of the failure scenario from the diagram above:

14:05:00.001 INFO  bulkhead=recommendations acquired permit in_flight=1/20
14:05:00.150 INFO  bulkhead=recommendations acquired permit in_flight=8/20
14:05:03.200 WARN  bulkhead=recommendations acquired permit in_flight=20/20
14:05:03.251 ERROR bulkhead=recommendations FULL wait timeout exceeded → returning fallback
14:05:03.252 INFO  response 200 OK recommendations=fallback(static) latency_ms=12
14:05:03.253 INFO  bulkhead=db acquired permit in_flight=3/50   # DB unaffected
14:05:03.260 INFO  response 200 OK recommendations=fallback(static) latency_ms=9
14:05:10.000 WARN  circuit-breaker=recommendations open error_rate=95% window=10s
14:05:10.001 INFO  bulkhead=recommendations circuit open: skipping call entirely
14:05:10.002 INFO  response 200 OK recommendations=fallback(static) latency_ms=3

The key observation: the database pool's in_flight counter stays well below its capacity throughout. The degradation is contained entirely within the recommendations pool. Response times for pages that never touch recommendations are unaffected.

How to operate it: cascading-failure triage and RTO/RPO worked example

Cascading-failure triage table

| Symptom | Root cause | Fix |
| --- | --- | --- |
| All endpoints return 503 even though only one dependency is slow | No bulkhead: the slow dependency's thread pool (or connection pool) is shared with all other work. All threads are held waiting. | Isolate per-dependency connection/thread pools. Apply bulkhead pattern. Add per-dependency timeouts so threads are released after a bounded wait. |
| Service crashes under load during a downstream outage | Retries amplify load: every in-flight request retries on timeout, multiplying the request rate at the moment the dependency is most stressed. Combined with no circuit breaker, the retry storm exhausts the caller's own resources. | Cap total retries to 1–2 attempts with exponential backoff and jitter (see Lesson 05). Wire the circuit breaker (see Lesson 06): once the error rate trips the breaker, retries stop entirely. |
| Graceful degradation returns stale cached data from 6 hours ago | Cache TTL is too long; the fallback cache was populated once at service start and never refreshed because the dependency was always available in testing. | Set a maximum staleness TTL on the fallback cache (e.g. 5 minutes). Use a background refresh pattern: proactively refresh the cache before it expires, not reactively when it misses. Test fallback paths explicitly in CI by mocking the dependency as unavailable. |
| Queue grows without bound during a consumer outage; process OOMs and crashes | Queue is unbounded: the producer never applies backpressure. The consumer outage causes the producer to accumulate work indefinitely. | Apply a bounded queue with a defined overflow policy: drop-oldest, drop-newest, or block-with-timeout. Return a backpressure signal (HTTP 503 or an application-level error) to the upstream caller. Alert when queue depth exceeds a threshold before it reaches the limit. |
| RTO is 4 hours on paper but actual restore took 11 hours during a drill | Runbook references infrastructure that no longer exists; restore procedure was never timed; dependencies between restore steps were not documented. | Run restore drills quarterly. Time every step. Update the runbook after every drill. Automate the restore procedure as a script with a dry-run mode so the steps are executable, not just documented. |

RTO / RPO worked example

Scenario

A SaaS invoicing API stores invoices in PostgreSQL. Current state: nightly snapshot backups, no replication, single AZ. The business has accepted the following requirements: no more than 2 hours of data loss; service must be restored within 30 minutes after a full database failure.

RPO = 2 hours. Nightly snapshots (24-hour RPO) do not meet this. The right tool is continuous WAL archival to S3 (PITR on Amazon RDS), which allows restore to any point within the retention window. With PITR, worst-case data loss is the time since the last archived WAL segment, typically under 5 minutes — well inside the 2-hour RPO.

RTO = 30 minutes. A cold restore from an S3 backup to a new RDS instance typically takes 15–45 minutes depending on database size. For a 30-minute RTO, the warm-standby approach is safer: keep a pre-provisioned standby instance that continuously receives replicated data and can be promoted in under 2 minutes (Amazon RDS Multi-AZ achieves this). The standby instance runs in a different AZ and is available for promotion without requiring a full restore from backup.

Architecture decision. Enable RDS Multi-AZ (provides RTO of ~2 minutes, RPO of ~0 seconds for in-region failures) and continuous PITR with a 7-day retention (protects against accidental data deletion, logical corruption, and events that damage both AZ copies). Run a restore drill quarterly: restore from PITR to a new RDS instance, verify row counts and referential integrity, measure the elapsed time, and update the runbook.

# RDS Multi-AZ failover: expected timing
14:11:00.000 ERROR rds primary az1-db unreachable
14:11:00.100 WARN  rds initiating automatic failover to standby az2-db
14:11:28.700 INFO  rds az2-db promoted to primary elapsed=28.6s
14:11:28.701 INFO  rds dns CNAME updated TTL=5s
14:11:35.200 INFO  app connection-pool reconnected to new primary
# Total application-visible downtime: ~35 seconds
# RTO achieved: < 1 minute, well within the 30-minute target

# PITR restore drill from a specific timestamp
aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier invoicing-prod \
    --target-db-instance-identifier invoicing-restore-drill \
    --restore-time 2026-06-19T13:45:00Z
# Restore status: available after ~18 minutes for a 40 GB database
INFO drill row count invoices=1842317 expected=1842317 ✓
INFO drill referential integrity check passed ✓
INFO drill elapsed=18m 24s RTO target=30m ✓
🎯 Interview angle

When asked "how would you handle a dependency going down?", weak answers name the circuit breaker and stop. A strong answer covers the full stack: the timeout that detects the problem, the bulkhead that contains the thread exhaustion, the circuit breaker that stops new calls, and the fallback that returns a degraded response. Then address DR separately: define the RTO and RPO requirements, and show you understand that those numbers determine the backup strategy — not the other way around. Finally, mention chaos engineering as the only way to know the resilience patterns actually work.

⚠️ Common trap: retrying without a circuit breaker

Retries on their own are not resilience — they are load amplification during outages. Consider a service with 1,000 requests/second and a dependency that goes down. Without a circuit breaker, every request retries twice, tripling the load to 3,000 RPS on a service that cannot handle 1,000. The retry storm overwhelms the recovery attempt and extends the outage. Retries must always be paired with a circuit breaker that trips after a threshold of failures and stops calling the dependency entirely for a recovery window. The circuit breaker is what converts "retry amplifies the problem" into "retry handles transient glitches while the circuit protects against sustained failures."

✅ Instrument everything before injecting chaos

Chaos engineering without observability is just outage simulation. Before running any chaos experiment, confirm you have metrics for the thing you are testing: queue depth, bulkhead saturation rate, circuit breaker state transitions, fallback activation rate, and end-to-end error rate by endpoint. Without these, a chaos experiment tells you "something broke" but not whether the resilience pattern fired correctly. With them, you can verify the exact hypothesis: "bulkhead saturation rate for recommendations spiked to 100% at t+3s, primary DB error rate stayed at 0%, fallback activation rate was 100% — pattern worked as designed."

By the numbers

Make it concrete. The service is an e-commerce checkout API. It calls a Recommendations dependency at 500 QPS; each call takes on average 200 ms. The checkout database holds 500 GB and is backed up to S3. Workers are CPU-bound and run at roughly 70% utilisation during normal traffic.

Bulkhead pool sizing: Little's Law

Little's Law (L = λ × W) gives the average number of calls in flight at a given request rate and latency, which is the natural baseline for a dependency's pool size. A much larger pool wastes memory and weakens the isolation; a smaller one rejects calls the dependency could have served:

pool_size = QPS_dep × latency_dep (in seconds)
          = 500 req/s × 0.200 s
          = 100 concurrent connections

Cap the Recommendations pool at 100. When the dependency slows from 200 ms to 2,000 ms under load, the same formula says the pool would need 500 × 2.0 = 1,000 slots. Because the pool is capped at 100, new calls instead fail fast after maxWait = 50 ms and return the fallback. The primary DB pool (capped separately at 50) is never touched. Stripe's public write-up (Stripe Engineering, "Scaling your API with rate limiters") describes a related approach: concurrency limiters and load shedders that cap in-flight work rather than only request rate.

Normal (200 ms latency): 500 QPS × 0.200 s = 100 in-flight → pool at 100 = just enough
Degraded (2,000 ms): 500 QPS × 2.000 s = 1,000 needed → pool capped at 100 = 900 fast-fail → fallback
DB pool: unaffected (separate semaphore, never starved)

RPO: backup interval determines maximum data loss

RPO = backup interval for snapshot-based backups. With hourly snapshots, a disaster at 14:58 hits against the 14:00 snapshot — up to 58 minutes of data loss. To meet an RPO of 5 minutes, you need continuous WAL archival (PITR), where each WAL segment is shipped to S3 every 60–300 seconds, limiting loss to that shipping interval (PostgreSQL PITR documentation).

| Backup method | Backup interval | Maximum data loss (RPO) | Use case |
| --- | --- | --- | --- |
| Hourly snapshots | 60 min | Up to 60 min | Internal analytics, low-churn data |
| 15-min snapshots | 15 min | Up to 15 min | Moderate-value operational data |
| PITR (WAL shipping) | ~5 min WAL segments | Up to ~5 min | SaaS, financial, most production APIs |
| Synchronous replica | 0 (commit-synchronous) | ~0 s | Payments, ledgers, zero-loss systems |

RTO: how long does a restore actually take?

The checkout database is 500 GB. A cold restore from S3 to a new RDS instance at AWS S3-to-RDS throughput of roughly 100 MB/s takes:

restore_time = database_size ÷ restore_throughput
             = 500 GB × 1,024 MB/GB ÷ 100 MB/s
             = 512,000 MB ÷ 100 MB/s
             ≈ 5,120 s  ≈  85 minutes

An 85-minute cold restore cannot meet a 30-minute RTO. The fix is a warm standby (RDS Multi-AZ), where promotion — not full restore — takes 28–60 seconds. Use PITR as the safety net for logical corruption; use the warm standby for RTO. Both run together (AWS RDS PITR docs).

Cold restore (500 GB at 100 MB/s): ≈ 85 min, FAILS a 30-min RTO
Warm standby promotion (RDS Multi-AZ): ≈ 28–60 s, meets a 30-min RTO with headroom
PITR from specific timestamp: ≈ 18 min, safe fallback for corruption events

Load-shedding threshold: when to start dropping work

Workers run at 70% CPU utilisation at normal load. The load-shedding threshold is set at 85% worker utilisation. Below 85%: all requests proceed. At or above 85%: low-priority requests (recommendations, dashboard analytics) receive HTTP 503 immediately; high-priority requests (checkout, payment) continue. This protects the 15% headroom needed for retries and GC pauses (AWS Builders' Library — load shedding).

headroom = 1 − shed_threshold = 1 − 0.85 = 15%
normal_load = 70% utilisation → 15% gap before shedding triggers
at 85% shed begins → high-priority work still completes → system stays stable
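The threshold rule reduces to a one-line predicate. In this Go sketch, `shouldShed` is a hypothetical name, and how utilisation is measured (busy workers divided by total workers, for example) is left as an assumption.

```go
package main

import "fmt"

const shedThreshold = 0.85 // from the text; tune per service

// shouldShed drops only low-priority work once utilisation reaches the threshold.
func shouldShed(utilisation float64, highPriority bool) bool {
	return utilisation >= shedThreshold && !highPriority
}

func main() {
	fmt.Println(shouldShed(0.70, false)) // false: normal load, everything proceeds
	fmt.Println(shouldShed(0.90, false)) // true: low-priority request gets a 503
	fmt.Println(shouldShed(0.90, true))  // false: checkout and payment still proceed
}
```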

Decision math: choosing pool size, backup interval, and RTO feasibility

Given dependency QPS and latency, the formula is: pool_size = QPS × latency_s. Round up to the nearest 10 and add 10% headroom for burst. For RPO: pick the backup method whose interval fits inside the RPO. For RTO: compare database_size ÷ restore_rate against the RTO target — if it exceeds it, you need a warm standby, not cold restore. For load shedding: set the threshold at 1 − GC_headroom − retry_headroom, typically 80–85%.

🧠 Quick check

1. A bulkhead in software most directly prevents:

The bulkhead pattern isolates resource pools (threads, connections) per dependency. When the recommendations service stalls and fills its dedicated pool, the database pool — which belongs to a completely separate bulkhead — is unaffected. The failure is contained within the recommendations compartment.

2. An RPO of 15 minutes means:

RPO (Recovery Point Objective) measures data loss tolerance in time. An RPO of 15 minutes means if disaster strikes, you may need to restore from a backup that is up to 15 minutes old — and the business has accepted that 15 minutes of transactions may need to be replayed or accepted as lost. RTO (Recovery Time Objective) is about speed of recovery, not data loss.

3. Which technique directly applies backpressure to producers?

A bounded queue is the direct implementation of backpressure. When the consumer falls behind and the queue fills to capacity, the next producer call to enqueue() either blocks until space is available or fails immediately — giving the producer a signal to slow down or shed that unit of work. An unbounded queue never sends this signal; the queue grows until memory is exhausted.

4. Why must chaos experiments have a defined hypothesis before running?

Without a hypothesis ("if X fails, Y continues and Z activates"), a chaos experiment produces noise, not signal. The hypothesis defines the success criteria: which metric should spike (bulkhead saturation, fallback activation), which metric should stay flat (primary DB error rate), and what the overall outcome should be. The experiment either confirms the hypothesis — proving the resilience pattern works — or refutes it, revealing a gap to fix before a real incident does.

5. Load shedding differs from rate limiting primarily in that load shedding:

Rate limiting enforces a per-client quota over time (e.g. 100 requests/minute per API key). Load shedding responds to the current load level of the service itself: when the server's concurrency exceeds a threshold, it classifies incoming requests by priority and drops the least critical ones — regardless of which client sent them. The goal is to protect server capacity, not to enforce fairness between clients.

✍️ Exercise: design a resilient checkout service
Scenario

An e-commerce checkout service calls three downstream services on every request: Inventory (checks if items are in stock), Fraud Detection (scores the transaction risk), and Recommendations (shows "you might also like" items). The service currently uses a single shared HTTP client with no timeout, no bulkheads, and no circuit breakers. The Fraud Detection service has been intermittently slow (8–15 second response times) for the past week, causing checkout to become unresponsive during those windows.

Design the resilience improvements, specifying: (a) which patterns to apply to each dependency, (b) what the fallback for each is, (c) the RTO and RPO you would target for the checkout database, and (d) how you would prove the improvements work.

Model answer:

(a) Patterns per dependency:

(b) Fallbacks: Inventory has no safe fallback (overselling inventory is a business problem). Fraud Detection fallback is accept-and-flag. Recommendations fallback is a cached static list.

(c) RTO and RPO: Checkout is revenue-critical. Target RTO = 5 minutes (automated failover; any manual step must complete within 5 minutes), RPO = 30 seconds (synchronous replication within a Multi-AZ database; up to 30 seconds of in-flight transactions may be at risk during a failover window). Implement RDS Multi-AZ for automated promotion and continuous PITR for point-in-time recovery of accidental data corruption.

(d) Proving it works: Write a chaos experiment for each resilience path: (1) introduce 8-second latency to Fraud Detection calls — verify checkout error rate stays at 0%, fraud_detection_fallback_rate metric rises to 100%, checkout latency stays under 2.5 s; (2) terminate the Recommendations service — verify recommendations_fallback_rate is 100% and response time is unaffected; (3) inject a DB failover — verify application-visible 503 errors are contained to under 35 seconds and connection pool reconnects without operator intervention. Run drills quarterly and measure elapsed time to recovery against the RTO target.

Rubric: Full marks for identifying correct patterns per dependency with justified timeouts and clear fallbacks. Partial marks for correct patterns without fallback design. Bonus for noting the asymmetry — Inventory has no graceful degradation because overselling is not a safe fallback; the correct response to Inventory unavailability is a fast 503, not a false confirmation.

Key takeaways

Sources & further reading