Reliability & Scale · Lesson 18
Resilience & disaster recovery
A system that either works perfectly or falls over completely is not resilient — it is brittle. The goal is never zero failures; it is ensuring that when a dependency breaks, the rest of the system keeps moving at reduced capacity rather than stopping entirely.
By the end you'll be able to
- Apply bulkheads, load shedding, and backpressure to prevent a single failing dependency from cascading across a system.
- Define RTO and RPO precisely and use them to specify the right backup and recovery architecture for a given system.
- Explain why chaos engineering proves resilience in ways that documentation and code review cannot.
The difference between availability and resilience
High availability is about surviving the failure of an infrastructure component: a crashed server, a failed availability zone (AZ), a misconfigured load balancer. Resilience is a different and complementary goal: surviving the misbehaviour of a dependency that is technically still alive. A downstream service that responds in 8 seconds instead of 80 milliseconds is not down, but a caller that holds a thread open waiting for it is being quietly strangled. Resilience patterns are the bulkheads, pressure relief valves, and blast walls that prevent one struggling component from sinking everything around it.
The underlying principle is graceful degradation: when a dependency is unavailable or overloaded, return a degraded but useful response rather than an error. Serve a cached product list when the inventory service is slow. Show a notification count of "99+" when the count service is down rather than failing the page load. Omit the recommendations widget when the ML service times out. Each of these is worse than the full experience — and each is vastly better than a blank error screen.
Bulkheads: containing blast radius
The name comes from ship design. A bulkhead is a watertight wall between compartments of a hull. When one compartment floods, the walls contain the water. The ship loses some buoyancy but does not sink. In software, a bulkhead is a resource isolation boundary that prevents one slow or failing dependency from consuming every connection, thread, or memory allocation in the process.
The most concrete implementation is separate connection pools per downstream dependency. Suppose a service has three dependencies: a primary database, a recommendations service, and a notification service. Without bulkheads, all outbound I/O shares one system-level connection pool or thread pool. If the recommendations service starts responding in 10 seconds, threads pile up waiting for it. Soon the pool is exhausted — every request, including those that never touch recommendations, queues behind waiting threads. The primary database calls, which were fast, can no longer get a thread. The entire service becomes unresponsive because of a component it relies on for one optional feature.
With bulkheads: a pool of 20 threads handles recommendations. A separate pool of 50 handles primary database calls. The notification pool has 10. When recommendations stalls, it fills its own 20-thread pool and begins rejecting new requests to that dependency. The primary database pool is untouched. The service degrades (no recommendations) but continues serving its core function.
Load shedding: choosing which work to drop
When a service is overloaded it has three options: accept all incoming requests and serve them slowly (adding to the queuing problem), crash, or deliberately reject a fraction of incoming work to protect its ability to serve the rest. Only the third is resilient. Load shedding is the deliberate act of returning a fast error — typically HTTP 503 or 429 — to some requests so that others complete successfully.
Stripe's engineering team has written publicly about their load shedding approach (see sources). The key insight is that not all requests have equal priority. A request to process a payment must be attempted even under extreme load. A request to fetch the account dashboard can be dropped — the user gets a "try again" message, which is frustrating but not catastrophic. Load shedding works by classifying requests by priority at the edge and returning 503 to lower-priority traffic once a configurable concurrency threshold is exceeded. See Rate Limiting Algorithms (Lesson 03) for the algorithmic primitives (token bucket, concurrency limiters) used to implement the concurrency threshold.
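A priority-aware shedder along those lines might look like the following Go sketch. It is an illustration under stated assumptions, not Stripe's implementation: the two priority classes, the `Shedder` type, and both thresholds are hypothetical.

```go
package main

import "sync/atomic"

type Priority int

const (
	Critical  Priority = iota // e.g. process a payment
	Sheddable                 // e.g. render the account dashboard
)

// Shedder tracks in-flight requests and rejects sheddable traffic first.
type Shedder struct {
	inFlight  atomic.Int64
	shedAbove int64 // sheddable requests are rejected beyond this many in flight
	hardLimit int64 // no request is admitted beyond this many in flight
}

func NewShedder(shedAbove, hardLimit int64) *Shedder {
	return &Shedder{shedAbove: shedAbove, hardLimit: hardLimit}
}

// Admit reports whether the request may proceed. A caller that gets true
// must call Done when the request finishes; a caller that gets false
// should return a fast 503.
func (s *Shedder) Admit(p Priority) bool {
	limit := s.hardLimit
	if p == Sheddable {
		limit = s.shedAbove
	}
	if n := s.inFlight.Add(1); n > limit {
		s.inFlight.Add(-1) // undo the reservation
		return false
	}
	return true
}

// Done releases the slot taken by a successful Admit.
func (s *Shedder) Done() { s.inFlight.Add(-1) }
```

The gap between `shedAbove` and `hardLimit` is the capacity reserved for critical work once lower-priority traffic is being turned away.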
Backpressure: pushing the signal upstream
Load shedding says "I am too busy; go away." Backpressure is a related idea applied to internal queues and streaming systems: rather than accepting work into an unbounded queue and hiding the overload, the system signals congestion to the sender so the sender can slow down, instead of the queue growing without bound.
In a producer-consumer pipeline, the concrete mechanism is a bounded queue with a finite capacity. When the queue is full, the producer's call to enqueue() blocks or returns an error immediately, instead of succeeding. The producer receives an immediate signal that the consumer is not keeping up. It can either shed that unit of work (drop the message, return an error to its caller) or apply backpressure upstream by blocking its own caller until the queue drains. This is how TCP flow control works: when the receive buffer fills, the receiver's advertised window shrinks to zero, stopping the sender at the transport layer. Designing application-level queues the same way prevents the classic failure mode where a slow consumer causes unbounded memory growth until the process runs out of memory (OOM) and crashes.
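In Go, a buffered channel already is a bounded queue, so the fail-fast variant of enqueue() can be sketched in a few lines. `BoundedQueue`, `TryEnqueue`, and `ErrQueueFull` are hypothetical names chosen for this example.

```go
package main

import "errors"

// ErrQueueFull is the backpressure signal handed to the producer.
var ErrQueueFull = errors.New("queue full")

// BoundedQueue wraps a buffered channel with a fixed capacity.
type BoundedQueue struct {
	ch chan string
}

func NewBoundedQueue(capacity int) *BoundedQueue {
	return &BoundedQueue{ch: make(chan string, capacity)}
}

// TryEnqueue never blocks. It returns ErrQueueFull when the queue is at
// capacity, so the producer can shed the message or slow its own caller.
func (q *BoundedQueue) TryEnqueue(msg string) error {
	select {
	case q.ch <- msg:
		return nil
	default:
		return ErrQueueFull
	}
}

// Dequeue blocks until a message is available.
func (q *BoundedQueue) Dequeue() string { return <-q.ch }

// Depth reports the current queue length, useful as an alerting metric.
func (q *BoundedQueue) Depth() int { return len(q.ch) }
```

Replacing the `select` with a plain `q.ch <- msg` gives the blocking variant, which pushes the wait back onto the producer instead of returning an error.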
Timeouts everywhere
Every outbound network call must have a timeout. A missing timeout converts a slow dependency into a thread that waits indefinitely, which is the entry point to the bulkhead failure described above. "Everywhere" is not rhetorical: the HTTP client to the upstream service, the database connection, the cache client, the DNS lookup inside the HTTP client, and the TCP connection establishment all need individual timeouts. Default settings for many HTTP client libraries are either infinite (no timeout at all) or dangerously long (60+ seconds). Set them explicitly at a level consistent with your SLO.
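For an HTTP dependency, setting the timeouts explicitly could look like this Go sketch using the standard `net/http` client. The durations are placeholder assumptions, not recommendations; pick values that fit your own latency budget.

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// newHTTPClient returns a client with every timeout set explicitly.
// The zero value of http.Client has no overall timeout at all.
func newHTTPClient() *http.Client {
	transport := &http.Transport{
		// Bounds connection establishment for each new connection.
		DialContext: (&net.Dialer{
			Timeout: 1 * time.Second,
		}).DialContext,
		// Bounds the TLS handshake after the TCP connection is up.
		TLSHandshakeTimeout: 1 * time.Second,
		// Bounds the wait for response headers once the request is sent.
		ResponseHeaderTimeout: 2 * time.Second,
	}
	return &http.Client{
		Transport: transport,
		// End-to-end budget for the whole exchange, body included.
		Timeout: 3 * time.Second,
	}
}
```

Database and cache clients have their own equivalents of these settings, and each needs the same explicit treatment.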
Timeouts should be paired with retries and circuit breakers. Following a single timeout with one immediate retry is often correct for transient network glitches, but retries must be bounded: see Retries & Exponential Backoff (Lesson 05) for why unbounded retries amplify load on a struggling service. When a dependency is persistently slow or failing, the circuit breaker pattern (see Circuit Breaker Pattern, Lesson 06) stops calling it altogether for a recovery window, which both protects the dependency and frees your threads immediately.
Disaster recovery: RTO and RPO
Resilience patterns protect against transient and partial failures. Disaster recovery (DR) addresses a different scenario: the data is gone, or the entire system is destroyed, and you must rebuild. Two numbers define the envelope of that rebuild:
Recovery Time Objective (RTO) is the maximum elapsed time from the moment of a disaster until the service is restored to full operation. RTO is a contractual commitment about speed of recovery. An RTO of 4 hours means users can expect service to resume within 4 hours of a declared disaster — and the company has committed to investing whatever infrastructure is necessary to meet that. An RTO of 15 minutes implies active hot-standby infrastructure; an RTO of 24 hours may allow cold restore from nightly backups.
Recovery Point Objective (RPO) is the maximum amount of data loss the business can tolerate, measured in time. An RPO of 1 hour means you are prepared to lose up to one hour of transactions. If a disaster occurs at 14:37 and your last good backup is from 13:45, you have lost 52 minutes of data, which is inside your RPO. An RPO of zero means no data loss is acceptable, which requires synchronous replication (a write is not acknowledged until it is confirmed on the replica). An RPO of 24 hours allows nightly backups with no real-time replication.
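The 14:37 example can be checked mechanically. The Go sketch below is illustrative only: `mustClock`, `dataLoss`, and `withinRPO` are hypothetical helpers, and the times are wall-clock values on the same day.

```go
package main

import "time"

// mustClock parses an "HH:MM" wall-clock time; it panics on bad input
// because this sketch only uses literal times.
func mustClock(hhmm string) time.Time {
	t, err := time.Parse("15:04", hhmm)
	if err != nil {
		panic(err)
	}
	return t
}

// dataLoss is the gap between the disaster and the newest point you can
// restore to. Everything written in that gap is gone.
func dataLoss(disasterAt, lastRecoverable time.Time) time.Duration {
	return disasterAt.Sub(lastRecoverable)
}

// withinRPO reports whether the loss fits inside the objective.
func withinRPO(loss time.Duration, rpoMinutes int) bool {
	return loss <= time.Duration(rpoMinutes)*time.Minute
}
```

With these helpers, `dataLoss(mustClock("14:37"), mustClock("13:45"))` is 52 minutes, which passes a 60-minute RPO and fails a 15-minute one.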
| RTO | RPO | Typical architecture | Example |
|---|---|---|---|
| < 1 min | ~0 s | Multi-region active-active; synchronous cross-region replication | Real-time payment processing |
| 1–15 min | < 1 min | Multi-AZ with automatic failover; async replication with low lag | SaaS API serving thousands of tenants |
| 1–4 hours | 1 hour | Warm standby in a second region; hourly snapshot backups | Internal analytics platform |
| 24 hours | 24 hours | Cold restore from nightly backups; manual runbook-driven recovery | Archive or audit-log service |
The lower the RTO and RPO, the higher the infrastructure cost. A zero-RPO system must acknowledge every write only after it has been confirmed on a synchronous replica — which adds the network round-trip to the replica to every write latency. This is the same consistency–availability tension from Consistency & the CAP theorem (Lesson 16), expressed in DR terms.
Backups and restore drills
A backup that has never been restored is a hypothesis, not a guarantee. The most common disaster-recovery failure mode is not the absence of backups; it is discovering during an actual incident that the restore procedure is broken, the backup files are corrupt, or the process takes twice as long as the RTO allows. The only fix is to run restore drills on a schedule — quarterly at minimum — and to treat a successful restore as a measurable metric, not a one-time event.
Backup strategy must match the RPO. Point-in-time recovery (PITR), the continuous archival of write-ahead log segments as supported by PostgreSQL and Amazon RDS, allows restoring to any second within the retention window, not just the last snapshot. For an RPO of 5 minutes, PITR is the right tool; an hourly snapshot can lose up to 60 minutes of data, 55 minutes more than the objective allows.
Runbooks are as important as the infrastructure. A runbook is a step-by-step operating procedure for a specific failure scenario: what to verify, what commands to run, in what order, and how to confirm the restore succeeded. The runbook exists so that an on-call engineer who has never personally performed a full restore can do so at 3 AM without improvising. Keep runbooks version-controlled, link them to the relevant monitoring alerts, and review them after every drill that surfaces a gap.
Chaos engineering: proving resilience by breaking things
Code review and load testing check that a system behaves correctly under expected conditions. Chaos engineering — deliberately injecting failures into a running system — checks that the resilience patterns actually fire as designed. Netflix's Chaos Monkey is the most famous example: it terminates random production instances to verify that the rest of the fleet handles the loss gracefully, and to surface any implicit assumptions about specific instances always being available (see sources).
Chaos experiments are not random destruction. A well-designed experiment has a hypothesis ("if the recommendations service becomes unavailable, the homepage still loads and shows a static fallback"), a defined blast radius (only one of five recommendation servers, or only in a canary deployment group), and measurable criteria for success. The goal is to find failures in a controlled way before a real incident finds them in an uncontrolled way. Common experiments: terminate random instances, inject network latency (e.g., 300 ms added to all DB calls), exhaust a specific connection pool, simulate a slow DNS response, or drop packets between two services.
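Fault injection with a bounded blast radius can be sketched as a wrapper around the real call. This Go example is an assumption-laden illustration rather than any chaos tool's API: `Fault`, `Wrap`, and `ErrInjected` are made-up names, and "one call in N" stands in for a real targeting rule such as one server in five.

```go
package main

import (
	"errors"
	"sync/atomic"
	"time"
)

// ErrInjected marks a failure that the experiment caused on purpose.
var ErrInjected = errors.New("chaos: injected failure")

// Fault describes one experiment: what to inject and how widely.
type Fault struct {
	Latency  time.Duration // extra delay added before an affected call
	Fail     bool          // if true, affected calls return ErrInjected
	EveryNth int64         // blast radius: affect 1 call in N; 0 disables injection
	calls    atomic.Int64
}

// Wrap returns a function that behaves like call, except that every
// Nth invocation is delayed and, if Fail is set, fails.
func (f *Fault) Wrap(call func() (string, error)) func() (string, error) {
	return func() (string, error) {
		n := f.calls.Add(1)
		if f.EveryNth > 0 && n%f.EveryNth == 0 {
			time.Sleep(f.Latency)
			if f.Fail {
				return "", ErrInjected
			}
		}
		return call()
	}
}
```

Setting `EveryNth` to 5 touches 20% of calls, which keeps the experiment inside a defined blast radius while still exercising the fallback path.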
Start with a hypothesis, run in non-production first, and verify the hypothesis is confirmed before running in production at limited blast radius. Never run chaos experiments during peak traffic hours — the goal is discovery, not self-inflicted incidents.
[Figure: Graceful degradation decision flow]
Under the hood: bulkhead execution — a traced example
Here is exactly what happens inside a Java or Go service that uses separate semaphore-based connection pools as bulkheads. The mechanism is a counting semaphore: each pool has a max-concurrency value. Each outbound call must acquire a permit before sending the request. When all permits are held by in-flight calls, new callers either block for up to a maxWait duration (the timeout field in the code below) or receive an immediate rejection. This is the moment that contains the blast radius: the caller returns the fallback after a bounded wait instead of hanging.
// Conceptual bulkhead using a semaphore (Go sketch)
package main

import (
	"context"
	"errors"
	"time"
)

// ErrBulkheadFull is returned when no permit frees up within the wait timeout.
var ErrBulkheadFull = errors.New("bulkhead full")

type Bulkhead struct {
	sem     chan struct{} // buffered channel as a counting semaphore
	timeout time.Duration // maximum wait for a permit (maxWait)
}

func NewBulkhead(maxConcurrency int, timeout time.Duration) *Bulkhead {
	return &Bulkhead{
		sem:     make(chan struct{}, maxConcurrency), // capacity = max permits
		timeout: timeout,
	}
}

func (b *Bulkhead) Execute(ctx context.Context, fn func() (interface{}, error)) (interface{}, error) {
	// Stop the timer on return so it does not outlive a fast acquire
	timer := time.NewTimer(b.timeout)
	defer timer.Stop()
	// Attempt to acquire a permit within the timeout
	select {
	case b.sem <- struct{}{}: // slot acquired
		defer func() { <-b.sem }() // release on return
		return fn()
	case <-timer.C:
		// Bulkhead saturated: give up after the bounded wait, never hang
		return nil, ErrBulkheadFull // caller routes to fallback
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// Usage: separate bulkheads per dependency
var (
	dbBulkhead  = NewBulkhead(50, 100*time.Millisecond) // primary DB
	recBulkhead = NewBulkhead(20, 50*time.Millisecond)  // recommendations
	ntfBulkhead = NewBulkhead(10, 50*time.Millisecond)  // notifications
)

// In the request handler (fetchRecommendations and defaultRecommendations
// are the service's own functions, not shown here):
func recommendationsFor(ctx context.Context, userID string) (interface{}, error) {
	result, err := recBulkhead.Execute(ctx, func() (interface{}, error) {
		return fetchRecommendations(userID)
	})
	if errors.Is(err, ErrBulkheadFull) {
		return defaultRecommendations(), nil // fallback, never blocks
	}
	return result, err
}
In the failure scenario from the diagram above, the key observation is that the database pool's in_flight counter stays well below its capacity throughout. The degradation is contained entirely within the recommendations pool. Response times for pages that never touch recommendations are unaffected.
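One way to see that containment is a deliberately simplified, single-threaded model of the two pools. It is a sketch under strong assumptions, not a reproduction of a production trace: database calls complete instantly, recommendations calls never return, and the `pool` type and `simulate` function are invented for this example.

```go
package main

import "fmt"

// pool models one bulkhead as a permit counter.
type pool struct {
	name     string
	capacity int
	inFlight int
	rejected int
}

// tryAcquire takes a permit if one is free and counts a rejection otherwise.
func (p *pool) tryAcquire() bool {
	if p.inFlight < p.capacity {
		p.inFlight++
		return true
	}
	p.rejected++
	return false
}

func (p *pool) release() { p.inFlight-- }

// simulate replays n requests. Each request makes a fast DB call, then a
// recommendations call against a stalled dependency that never answers.
func simulate(n int) (db, rec *pool, trace []string) {
	db = &pool{name: "db", capacity: 50}
	rec = &pool{name: "recommendations", capacity: 20}
	for i := 1; i <= n; i++ {
		if db.tryAcquire() {
			db.release() // fast call: the permit is held only briefly
		}
		outcome := "waiting on dependency"
		if !rec.tryAcquire() { // stalled call: an acquired permit is never released
			outcome = "rejected, fallback served"
		}
		trace = append(trace, fmt.Sprintf(
			"req %d: %s in_flight=%d/%d, %s in_flight=%d/%d -> %s",
			i, db.name, db.inFlight, db.capacity,
			rec.name, rec.inFlight, rec.capacity, outcome))
	}
	return db, rec, trace
}
```

Replaying 25 requests fills the recommendations pool at 20 and rejects the remaining 5, while the database pool never holds a permit past the end of a request.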
How to operate it: cascading-failure triage and RTO/RPO worked example
Cascading-failure triage table
| Symptom | Root cause | Fix |
|---|---|---|
| All endpoints return 503 even though only one dependency is slow | No bulkhead: the slow dependency's thread pool (or connection pool) is shared with all other work. All threads are held waiting. | Isolate per-dependency connection/thread pools. Apply bulkhead pattern. Add per-dependency timeouts so threads are released after a bounded wait. |
| Service crashes under load during a downstream outage | Retries amplify load: every in-flight request retries on timeout, multiplying the request rate at the moment the dependency is most stressed. Combined with no circuit breaker, the retry storm exhausts the caller's own resources. | Cap total retries to 1–2 attempts with exponential backoff and jitter (see Lesson 05). Wire the circuit breaker (see Lesson 06) — once the error rate trips the breaker, retries stop entirely. |
| Graceful degradation returns stale cached data from 6 hours ago | Cache TTL is too long; the fallback cache was populated once at service start and never refreshed because the dependency was always available in testing. | Set a maximum staleness TTL on the fallback cache (e.g. 5 minutes). Use a background refresh pattern: proactively refresh the cache before it expires, not reactively when it misses. Test fallback paths explicitly in CI by mocking the dependency as unavailable. |
| Queue grows without bound during a consumer outage; process OOMs and crashes | Queue is unbounded: the producer never applies backpressure. The consumer outage causes the producer to accumulate work indefinitely. | Apply a bounded queue with a defined overflow policy: drop-oldest, drop-newest, or block-with-timeout. Return a backpressure signal (HTTP 503 or an application-level error) to the upstream caller. Alert when queue depth exceeds a threshold before it reaches the limit. |
| RTO is 4 hours on paper but actual restore took 11 hours during a drill | Runbook references infrastructure that no longer exists; restore procedure was never timed; dependencies between restore steps were not documented. | Run restore drills quarterly. Time every step. Update the runbook after every drill. Automate the restore procedure as a script with a dry-run mode so the steps are executable, not just documented. |
RTO / RPO worked example
A SaaS invoicing API stores invoices in PostgreSQL. Current state: nightly snapshot backups, no replication, single AZ. The business has accepted the following requirements: no more than 2 hours of data loss; service must be restored within 30 minutes after a full database failure.
RPO = 2 hours. Nightly snapshots (24-hour RPO) do not meet this. The right tool is continuous WAL archival to S3 (PITR on Amazon RDS), which allows restore to any point within the retention window. With PITR, worst-case data loss is the time since the last archived WAL segment, typically under 5 minutes — well inside the 2-hour RPO.
RTO = 30 minutes. A cold restore from an S3 backup to a new RDS instance typically takes 15–45 minutes depending on database size. For a 30-minute RTO, the warm-standby approach is safer: keep a pre-provisioned standby instance that continuously receives replicated data and can be promoted in under 2 minutes (Amazon RDS Multi-AZ achieves this). The standby instance runs in a different AZ and is available for promotion without requiring a full restore from backup.
Architecture decision. Enable RDS Multi-AZ (provides RTO of ~2 minutes, RPO of ~0 seconds for in-region failures) and continuous PITR with a 7-day retention (protects against accidental data deletion, logical corruption, and events that damage both AZ copies). Run a restore drill quarterly: restore from PITR to a new RDS instance, verify row counts and referential integrity, measure the elapsed time, and update the runbook.
When asked "how would you handle a dependency going down?", weak answers name the circuit breaker and stop. A strong answer covers the full stack: the timeout that detects the problem, the bulkhead that contains the thread exhaustion, the circuit breaker that stops new calls, and the fallback that returns a degraded response. Then address DR separately: define the RTO and RPO requirements, and show you understand that those numbers determine the backup strategy — not the other way around. Finally, mention chaos engineering as the only way to know the resilience patterns actually work.
Retries on their own are not resilience — they are load amplification during outages. Consider a service with 1,000 requests/second and a dependency that goes down. Without a circuit breaker, every request retries twice, tripling the load to 3,000 RPS on a service that cannot handle 1,000. The retry storm overwhelms the recovery attempt and extends the outage. Retries must always be paired with a circuit breaker that trips after a threshold of failures and stops calling the dependency entirely for a recovery window. The circuit breaker is what converts "retry amplifies the problem" into "retry handles transient glitches while the circuit protects against sustained failures."
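The amplification arithmetic, and the way a breaker cuts it off, can be sketched in Go. This is a simplified illustration: `circuitOpen` stands in for a real breaker's state check, backoff between attempts is omitted for brevity, and all names are hypothetical.

```go
package main

import "errors"

var errCircuitOpen = errors.New("circuit open")

// offeredLoad returns the request rate a dependency sees when every
// request fails and is retried retriesPerRequest times.
func offeredLoad(baseRPS, retriesPerRequest int) int {
	return baseRPS * (1 + retriesPerRequest)
}

// callWithRetries makes one attempt plus up to maxRetries retries and
// reports how many attempts actually reached the dependency. It checks
// the breaker before every attempt, so an open circuit stops retries.
func callWithRetries(maxRetries int, circuitOpen func() bool, call func() error) (attempts int, err error) {
	for i := 0; i <= maxRetries; i++ {
		if circuitOpen() {
			return attempts, errCircuitOpen
		}
		attempts++
		if err = call(); err == nil {
			return attempts, nil
		}
	}
	return attempts, err
}
```

`offeredLoad(1000, 2)` is 3,000 requests per second, the tripling described above. With the circuit open, `callWithRetries` returns before making a single attempt, so the dependency sees none of that traffic.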
Chaos engineering without observability is just outage simulation. Before running any chaos experiment, confirm you have metrics for the thing you are testing: queue depth, bulkhead saturation rate, circuit breaker state transitions, fallback activation rate, and end-to-end error rate by endpoint. Without these, a chaos experiment tells you "something broke" but not whether the resilience pattern fired correctly. With them, you can verify the exact hypothesis: "bulkhead saturation rate for recommendations spiked to 100% at t+3s, primary DB error rate stayed at 0%, fallback activation rate was 100% — pattern worked as designed."
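Checking a hypothesis like that against collected metrics can be automated. The Go sketch below assumes a metrics snapshot with three illustrative fields; `Snapshot` and `violations` are hypothetical names, and real experiments would read these values from a monitoring system.

```go
package main

// Snapshot holds metrics sampled while the fault was active.
// All values are fractions between 0 and 1.
type Snapshot struct {
	RecBulkheadSaturation float64 // share of recommendations permits in use
	FallbackRate          float64 // share of recommendation requests served by the fallback
	PrimaryDBErrorRate    float64 // error rate on the primary database path
}

// violations compares the snapshot with the hypothesis "the bulkhead
// saturates, every request falls back, and the DB path stays clean".
// An empty result means the hypothesis was confirmed.
func violations(s Snapshot) []string {
	var out []string
	if s.RecBulkheadSaturation < 1.0 {
		out = append(out, "bulkhead never saturated: the fault may not have taken effect")
	}
	if s.FallbackRate < 1.0 {
		out = append(out, "some requests did not receive the fallback")
	}
	if s.PrimaryDBErrorRate > 0 {
		out = append(out, "failure leaked into the primary DB path")
	}
	return out
}
```

An experiment that returns a non-empty list has found a gap, which is exactly the outcome the experiment exists to surface.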
By the numbers
Make it concrete. The service is an e-commerce checkout API. It calls a Recommendations dependency at 500 QPS; each call takes on average 200 ms. The checkout database holds 500 GB and is backed up to S3. Workers are CPU-bound and run at roughly 70% utilisation during normal traffic.
Bulkhead pool sizing: Little's Law
The right pool size for a dependency is the number of calls in flight at peak QPS and typical latency: anything much larger wastes memory; anything smaller starves throughput. From Little's Law (L = λ × W):
pool_size = QPS_dep × latency_dep (in seconds)
          = 500 req/s × 0.200 s
          = 100 concurrent connections
Cap the Recommendations pool at 100. When the dependency slows from 200 ms to 2,000 ms under load, the same formula tells you the pool would need 500 × 2.0 = 1,000 slots. With the cap at 100, calls beyond the cap fail fast after maxWait = 50 ms and return the fallback. The primary DB pool (capped separately at 50) is never touched. Stripe's post (Stripe Engineering – Scaling your API with rate limiters) describes a related mechanism: a concurrent requests limiter that caps the number of requests in flight rather than the request rate.
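The sizing arithmetic is easy to encode. This Go sketch uses integer milliseconds to avoid floating-point rounding; `poolSize` and `sustainableQPS` are hypothetical helper names.

```go
package main

// poolSize applies Little's Law: concurrency = arrival rate × time in
// system. The result is rounded up to a whole slot.
func poolSize(qps, latencyMs int) int {
	return (qps*latencyMs + 999) / 1000
}

// sustainableQPS inverts the formula: with a fixed number of slots and a
// given latency, this is the request rate the pool can carry. Integer
// division rounds down.
func sustainableQPS(slots, latencyMs int) int {
	return slots * 1000 / latencyMs
}
```

`poolSize(500, 200)` gives 100 and `poolSize(500, 2000)` gives 1,000. `sustainableQPS(100, 2000)` gives 50, so at 2,000 ms latency a 100-slot pool carries about 50 of the 500 requests per second and the remainder take the fallback.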
RPO: backup interval determines maximum data loss
For snapshot-based backups, the worst-case data loss equals the backup interval. With hourly snapshots, a disaster at 14:58 forces a restore from the 14:00 snapshot: 58 minutes of data loss. To meet an RPO of 5 minutes, you need continuous WAL archival (PITR), where each WAL segment is shipped to S3 every 60–300 seconds, limiting loss to that shipping interval (PostgreSQL PITR documentation).
| Backup method | Backup interval | Maximum data loss (RPO) | Use case |
|---|---|---|---|
| Hourly snapshots | 60 min | Up to 60 min | Internal analytics, low-churn data |
| 15-min snapshots | 15 min | Up to 15 min | Moderate-value operational data |
| PITR (WAL shipping) | ~5 min WAL segments | Up to ~5 min | SaaS, financial, most production APIs |
| Synchronous replica | 0 (commit-synchronous) | ~0 s | Payments, ledgers, zero-loss systems |
RTO: how long does a restore actually take?
The checkout database is 500 GB. Assuming a cold restore from S3 to a new RDS instance sustains roughly 100 MB/s, it takes:
restore_time = database_size ÷ restore_throughput
             = 500 GB × 1,024 MB/GB ÷ 100 MB/s
             = 512,000 MB ÷ 100 MB/s
             ≈ 5,120 s ≈ 85 minutes
An 85-minute cold restore cannot meet a 30-minute RTO. The fix is a warm standby (RDS Multi-AZ), where promotion, not a full restore, typically completes in one to two minutes. Use PITR as the safety net for logical corruption; use the warm standby for RTO. Both run together (AWS RDS PITR docs).
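The feasibility check can be written as two small Go functions. The throughput figure is the same assumption used above, and `restoreTime` and `meetsRTO` are hypothetical names.

```go
package main

import "time"

// restoreTime estimates a cold restore from database size and sustained
// restore throughput. Integer division drops any fractional second.
func restoreTime(sizeGB, throughputMBps int) time.Duration {
	seconds := sizeGB * 1024 / throughputMBps
	return time.Duration(seconds) * time.Second
}

// meetsRTO reports whether the estimated restore fits inside the objective.
func meetsRTO(restore time.Duration, rtoMinutes int) bool {
	return restore <= time.Duration(rtoMinutes)*time.Minute
}
```

`restoreTime(500, 100)` is 5,120 seconds (1h25m20s), so `meetsRTO` fails for a 30-minute objective. Measured drill times should replace the estimate once you have them.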
Load-shedding threshold: when to start dropping work
Workers run at 70% CPU utilisation at normal load. The load-shedding threshold is set at 85% worker utilisation. Below 85%: all requests proceed. At or above 85%: low-priority requests (recommendations, dashboard analytics) receive HTTP 503 immediately; high-priority requests (checkout, payment) continue. This protects the 15% headroom needed for retries and GC pauses (AWS Builders' Library — load shedding).
headroom = 1 − shed_threshold = 1 − 0.85 = 15%
normal_load = 70% utilisation → 15% gap before shedding triggers
at 85% shed begins → high-priority work still completes → system stays stable
Decision math: choosing pool size, backup interval, and RTO feasibility
Given dependency QPS and latency, the formula is: pool_size = QPS × latency_s. Round up to the nearest 10 and add 10% headroom for burst. For RPO: pick the backup method whose interval fits inside the RPO. For RTO: compare database_size ÷ restore_rate against the RTO target — if it exceeds it, you need a warm standby, not cold restore. For load shedding: set the threshold at 1 − GC_headroom − retry_headroom, typically 80–85%.
🧠 Quick check
1. A bulkhead in software most directly prevents:
The bulkhead pattern isolates resource pools (threads, connections) per dependency. When the recommendations service stalls and fills its dedicated pool, the database pool — which belongs to a completely separate bulkhead — is unaffected. The failure is contained within the recommendations compartment.
2. An RPO of 15 minutes means:
RPO (Recovery Point Objective) measures data loss tolerance in time. An RPO of 15 minutes means if disaster strikes, you may need to restore from a backup that is up to 15 minutes old — and the business has accepted that 15 minutes of transactions may need to be replayed or accepted as lost. RTO (Recovery Time Objective) is about speed of recovery, not data loss.
3. Which technique directly applies backpressure to producers?
A bounded queue is the direct implementation of backpressure. When the consumer falls behind and the queue fills to capacity, the next producer call to enqueue() either blocks until space is available or fails immediately — giving the producer a signal to slow down or shed that unit of work. An unbounded queue never sends this signal; the queue grows until memory is exhausted.
4. Why must chaos experiments have a defined hypothesis before running?
Without a hypothesis ("if X fails, Y continues and Z activates"), a chaos experiment produces noise, not signal. The hypothesis defines the success criteria: which metric should spike (bulkhead saturation, fallback activation), which metric should stay flat (primary DB error rate), and what the overall outcome should be. The experiment either confirms the hypothesis — proving the resilience pattern works — or refutes it, revealing a gap to fix before a real incident does.
5. Load shedding differs from rate limiting primarily in that load shedding:
Rate limiting enforces a per-client quota over time (e.g. 100 requests/minute per API key). Load shedding responds to the current load level of the service itself: when the server's concurrency exceeds a threshold, it classifies incoming requests by priority and drops the least critical ones — regardless of which client sent them. The goal is to protect server capacity, not to enforce fairness between clients.
✍️ Exercise: design a resilient checkout service
An e-commerce checkout service calls three downstream services on every request: Inventory (checks if items are in stock), Fraud Detection (scores the transaction risk), and Recommendations (shows "you might also like" items). The service currently uses a single shared HTTP client with no timeout, no bulkheads, and no circuit breakers. The Fraud Detection service has been intermittently slow (8–15 second response times) for the past week, causing checkout to become unresponsive during those windows.
Design the resilience improvements, specifying: (a) which patterns to apply to each dependency, (b) what the fallback for each is, (c) the RTO and RPO you would target for the checkout database, and (d) how you would prove the improvements work.
Model answer:
(a) Patterns per dependency:
- Inventory: Bulkhead (dedicated connection pool, max 30 concurrent), 500 ms timeout, circuit breaker (trip at 50% error rate over 20-second window). Inventory is critical: without a confirmed stock check you cannot complete a purchase. There is no safe fallback unless you implement a reservation system that supports optimistic checkout.
- Fraud Detection: Bulkhead (dedicated pool, max 20 concurrent), 2 s timeout, circuit breaker (trip at 40% error rate). Fallback: accept the transaction with a low-risk score and flag it for asynchronous manual review. Payment is not blocked; fraud is caught post-hoc. This is the correct trade-off: slow fraud scoring is worse than delayed fraud detection.
- Recommendations: Bulkhead (dedicated pool, max 10 concurrent), 200 ms timeout, circuit breaker. Fallback: return a static list of top-selling items from a local cache refreshed every 5 minutes. This is a non-critical feature — a degraded experience (static recommendations) is fine.
(b) Fallbacks: Inventory has no safe fallback (overselling inventory is a business problem). Fraud Detection fallback is accept-and-flag. Recommendations fallback is a cached static list.
(c) RTO and RPO: Checkout is revenue-critical. Target RTO = 5 minutes (automated failover; any manual step must complete within 5 minutes), RPO = 30 seconds (synchronous replication within a Multi-AZ database; up to 30 seconds of in-flight transactions may be at risk during a failover window). Implement RDS Multi-AZ for automated promotion and continuous PITR for point-in-time recovery of accidental data corruption.
(d) Proving it works: Write a chaos experiment for each resilience path: (1) introduce 8-second latency to Fraud Detection calls — verify checkout error rate stays at 0%, fraud_detection_fallback_rate metric rises to 100%, checkout latency stays under 2.5 s; (2) terminate the Recommendations service — verify recommendations_fallback_rate is 100% and response time is unaffected; (3) inject a DB failover — verify application-visible 503 errors are contained to under 35 seconds and connection pool reconnects without operator intervention. Run drills quarterly and measure elapsed time to recovery against the RTO target.
Rubric: Full marks for identifying correct patterns per dependency with justified timeouts and clear fallbacks. Partial marks for correct patterns without fallback design. Bonus for noting the asymmetry — Inventory has no graceful degradation because overselling is not a safe fallback; the correct response to Inventory unavailability is a fast 503, not a false confirmation.
Key takeaways
- Resilience is not the same as availability. Availability keeps infrastructure running. Resilience keeps the system useful when a dependency misbehaves. Both are necessary; neither substitutes for the other.
- Bulkheads isolate resource pools per dependency so that a slow downstream service can only consume its own allocation of threads or connections — not yours.
- Load shedding protects a service under overload by fast-failing lower-priority requests. Prioritise before shedding — not all requests are equally critical.
- Backpressure through bounded queues gives producers a signal that consumers are falling behind, preventing the silent unbounded growth that ends in OOM crashes.
- RTO and RPO are commitments, not metrics: they define what infrastructure you must build. A 30-minute RTO needs a warm standby; a zero RPO needs synchronous replication. Every step toward lower numbers has a cost you must justify.
- Chaos engineering is the only way to prove resilience patterns actually fire. Documentation and code review describe intent. A chaos experiment under real traffic confirms reality.
Sources & further reading
- AWS Well-Architected Framework – Reliability Pillar
- Google SRE Book – Chapter 22: Addressing Cascading Failures
- Stripe Engineering – Scaling your API with rate limiters (load shedding section)
- Netflix Tech Blog – The Netflix Simian Army (Chaos Monkey)
- Principles of Chaos Engineering
- AWS Docs – Restoring a DB instance to a specified time (PITR)