Failure Case Studies · Lesson 01

What Causes API Failures

API outages are not random acts of the universe. They fall into a short list of repeating failure modes. Learn to name them before you encounter them in production — or in an interview room — and you'll always know where to look first.

⏱ 14 min Difficulty: advanced Prerequisites: retries & backoff, circuit breakers, rate limiting

By the end you'll be able to

Name the seven categories of API failure and describe how each one triggers a degradation or outage.
Trace a cascading failure through a multi-service chain and identify the original root cause.
Map each failure category to the correct mitigation strategy and the lesson that covers it.
Apply a defense-in-depth checklist to any new service you build or review.

A map of failure modes

Post-incident reports from major cloud providers share a quiet pattern: the same handful of causes appear again and again, dressed in different costumes. A payment API that melted in 2014 failed for the same underlying reason as a streaming platform that stumbled in 2022. The details differ; the category is identical.

That's useful news. It means you don't need to memorize every famous outage in history. You need to internalize the taxonomy — the short list of root causes — and then recognize which category you're staring at when something goes wrong. The seven categories below cover the vast majority of real-world API failures. Think of them as a mental checklist you can run through any time you design a new service, review a colleague's PR, or sit in a system design interview.

Overload / traffic spikes — more load than the system was sized to handle.
Dependency failures and cascades — a slow or broken downstream service drags the whole chain down.
Bad deploys and config changes — a human-initiated change that introduces a defect or breaks an assumption.
Data problems — corrupt records, schema drift, or unexpected null values that cause runtime errors or silent corruption.
Retry storms / thundering herd — retries from clients amplify the very overload they're trying to recover from.
Missing timeouts — the absence of a deadline that causes threads or connections to block indefinitely.
Single points of failure — a component with no redundancy whose failure takes the entire system offline.

The sections below group these into four natural clusters, explain the mechanics of each, and point you to the mitigation lessons that go deeper.

Traffic and overload

Imagine a restaurant kitchen that runs smoothly at 80 covers but has never tested 200. On the night a food blogger's review goes viral, every ticket queue backs up, the expediter shouts incoherently, and food quality collapses. The kitchen didn't break — it just received more than it was designed to handle simultaneously.

APIs work the same way. Every service has a capacity ceiling — a rate of requests beyond which latency rises, queues fill, and errors start. Traffic spikes are the most common way that ceiling gets discovered in production rather than in a load test.

What it causes: Elevated error rates, timeout cascades into dependent services, and CPU/memory exhaustion that can crash the process entirely.

The fix: Rate limiting at the edge enforces a contract between callers and your service — it refuses excess load gracefully rather than absorbing it silently. See Lesson rel-03: Rate Limiting for implementation details.
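One common way to enforce such a ceiling is a token bucket. The sketch below is a minimal, single-process illustration rather than a production limiter; the names `TokenBucket` and `handle`, and the injectable `clock` parameter, are assumptions made for this example.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the sketch is easy to test
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle(bucket: TokenBucket) -> int:
    """Return the HTTP status an edge proxy might send for one request."""
    return 200 if bucket.allow() else 429   # 429 Too Many Requests
```

Rejecting with 429 tells well-behaved callers to slow down, which is the "refuse gracefully" half of the contract.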

Retry storms and the thundering herd

Here is a cruel irony: retries, designed to make systems resilient, can become the instrument of their destruction. Picture 50,000 mobile app clients that all experience a timeout simultaneously — perhaps because of a brief network blip — and each one immediately fires off a retry. Now the service, already under stress, receives not 50,000 requests but 100,000 within seconds. The spike they created is worse than the original problem.

The thundering herd is a closely related pattern that appears after a recovery: a cache expires, a server restarts, or a circuit breaker half-opens, and every client that was queued up rushes in at once, instantly recreating the overload.

What it causes: A positive feedback loop (overload causes retries, retries deepen the overload, which causes more retries) that can turn a brief degradation into a sustained outage lasting many times longer than the original fault.

The fix: Exponential backoff with random jitter spreads retries out over time, so the collective retry load arrives as a thin trickle on top of the baseline rather than as synchronized waves. See Lesson rel-05: Retries and Backoff.

⚠️ Common trap

Retry storms: adding retries without jitter or backoff makes overload worse, not better. A retry that fires immediately, or at a fixed interval, is just increased traffic. Never ship a retry loop without sleep(base * 2^attempt + random_jitter) and a hard cap on attempts.
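A minimal sketch of such a loop follows. The function names, default values, and the injectable `sleep` parameter are assumptions for illustration, and a real client would retry only on errors it knows to be transient.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  jitter: float = 0.1) -> float:
    """min(cap, base * 2**attempt) plus a random extra in [0, jitter] seconds."""
    return min(cap, base * 2 ** attempt) + random.uniform(0, jitter)


def call_with_retries(fn, max_attempts: int = 3, sleep=time.sleep):
    """Call fn(); on failure, back off and retry, up to max_attempts total calls."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # hard cap reached: surface the error
            sleep(backoff_delay(attempt))
```

With the defaults, the first retry waits between 0.1 and 0.2 seconds and the second between 0.2 and 0.3 seconds, so clients that failed together do not retry together.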

Slow and missing dependencies

Consider a chain of dominos stood on edge. Nudge the first one and the impulse travels the entire length of the chain. Distributed systems are the same, except the dominos can topple in reverse: when the last service in the chain slows down, the effect propagates backward toward the client.

A slow database causes Service C to degrade. Service B holds connections open waiting for C. Service A fills its thread pool waiting for B. The client sees errors — even though nothing between the client and Service A is broken.

This is a cascading failure: a fault in one layer propagating upstream, often amplified at each hop. The most dangerous ingredient is a missing timeout. When Service B has no deadline on its call to Service C, it holds a thread open indefinitely. As C slows, B accumulates blocked threads until it has none left for new requests.

What it causes: Transitive degradation — a problem in one downstream service takes out everything that depends on it, all the way back to the client.

The fix (timeouts): Every external call must carry an explicit timeout. A timeout converts "blocked forever" into "failed fast," so the caller can return an error quickly and free its resources. See Lesson rel-05: Retries and Backoff.
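As a small illustration of "failed fast", the sketch below puts a deadline on a raw socket call using only the standard library. The function name `fetch_with_deadline` and the silent listener standing in for a hung Service C are assumptions for this example.

```python
import socket


def fetch_with_deadline(host: str, port: int, timeout_s: float) -> bytes:
    """Connect and read one response, giving up once the timeout elapses."""
    # The timeout applies to the connect and to each later send/recv call.
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(b"ping\n")
        return conn.recv(1024)   # raises socket.timeout if the peer stays silent


if __name__ == "__main__":
    # A listener that never replies stands in for a hung dependency.
    silent = socket.socket()
    silent.bind(("127.0.0.1", 0))
    silent.listen()
    try:
        fetch_with_deadline("127.0.0.1", silent.getsockname()[1], timeout_s=0.2)
    except socket.timeout:
        print("failed fast instead of blocking forever")
    finally:
        silent.close()
```

Without the `timeout` argument, the `recv` call would block until the peer answered or the connection dropped, which is exactly how a worker thread gets stuck.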

The fix (cascades): A circuit breaker monitors failure rates on downstream calls and, once a threshold is crossed, immediately returns an error without making the call at all. This stops propagation and gives the downstream service time to recover. See Lesson rel-06: Circuit Breakers.
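The state machine behind a circuit breaker can be sketched in a few lines. This version is a simplification: it counts consecutive failures rather than tracking a failure rate, and the class name, thresholds, and `clock` parameter are assumptions for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0          # consecutive failures seen so far
        self.opened_at = None      # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: half-open, so this one trial call goes through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None                   # a success closes the circuit
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately, so no thread is spent waiting on the unhealthy dependency.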

🎯 Interview angle

In system design interviews, name the failure mode before proposing the mitigation. "This call can cascade because Service B has no timeout on its call to C — so a slowdown in C will exhaust B's thread pool" is a stronger opening than jumping straight to "add a circuit breaker." Naming the mechanism signals that you understand why the pattern exists, not just its name.

Deploy and config changes

The majority of production outages are not caused by sudden external events — they are caused by humans changing something. A bad deploy is the most obvious form: a commit introduces a logic error, the tests didn't catch it, and the error surfaces at production load within minutes of rollout.

Config changes are subtler and often more dangerous. A configuration value (a connection pool size, a feature flag, a timeout threshold) looks small and harmless. It often skips normal code review and automated tests. And because it typically takes effect on all instances at once, there is no gradual ramp. Config changes have caused some of the largest known outages in internet history.

What it causes: Immediate, production-wide degradation or complete outage — often with no signal until user errors spike, because the change itself doesn't throw an exception at deploy time.

The fix: Canary and staged rollouts send new code to a small fraction of traffic first. If error rates rise on the canary slice, the rollout is halted before most users are affected. A kill switch — a feature flag that can revert behavior without a redeploy — is the fastest recovery tool when a config change goes wrong. See Lesson rel-04: API Gateway for gateway-level canary and traffic-shaping patterns.
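A rough sketch of how a canary slice and a kill switch can combine in application code is shown below. The flag names, the in-memory `FLAGS` dictionary, and both function names are hypothetical; a real system would read flags from a flag service or from gateway configuration.

```python
import hashlib

# Hypothetical in-memory flag store, used here only for illustration.
FLAGS = {"new_pricing_enabled": True, "canary_percent": 5}


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def use_new_code_path(user_id: str, flags: dict = FLAGS) -> bool:
    # Kill switch first: turning the flag off reverts everyone without a redeploy.
    if not flags["new_pricing_enabled"]:
        return False
    return in_canary(user_id, flags["canary_percent"])
```

Hashing the user ID keeps each user on the same side of the split across requests, which makes canary error rates easier to compare against the baseline.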

Data and single points of failure

Data problems

A payment system processes millions of records cleanly — until one record arrives where a required field is null, a string is 2,000 characters where the schema assumed 255, or a date format changed in an upstream producer without notice. A single malformed record can crash a queue consumer, corrupt a batch job, or silently drop writes.

Schema drift is the slow-burn version: a producer starts sending a new field without announcing it; the consumer doesn't validate; six months later a downstream query breaks because it was written on the assumption that no such column would ever exist. Data problems are often invisible until they're catastrophic because the system appears healthy while the corruption accumulates.

What it causes: Processing errors that crash consumers, silent data loss from dropped writes, integrity violations that manifest far later than the original defect.

The fix: Strict input validation at ingest time — reject records that don't conform to the schema before they enter your system. Monitoring on data shape (null rates, field distributions, record counts) catches drift as it starts rather than after it has caused a month of silent corruption.
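Validation at ingest can be as simple as a function that returns a list of problems for each record. The field names (`amount_cents`, `payer_name`, `paid_on`) and the 255-character limit below are assumptions chosen to mirror the payment example, not a real schema.

```python
from datetime import date

MAX_NAME_LEN = 255   # assumed column width, mirroring the example above


def validate_payment(record: dict) -> list:
    """Return a list of problems; an empty list means the record is acceptable."""
    errors = []
    amount = record.get("amount_cents")
    if not isinstance(amount, int) or isinstance(amount, bool) or amount <= 0:
        errors.append("amount_cents must be a positive integer")
    name = record.get("payer_name")
    if not isinstance(name, str) or not name:
        errors.append("payer_name is required")
    elif len(name) > MAX_NAME_LEN:
        errors.append(f"payer_name exceeds {MAX_NAME_LEN} characters")
    try:
        date.fromisoformat(record.get("paid_on", ""))
    except (TypeError, ValueError):
        errors.append("paid_on must be an ISO date (YYYY-MM-DD)")
    return errors
```

Records with a non-empty error list can be routed to a quarantine queue instead of crashing the consumer, and the count of rejected records is itself a useful data-shape metric.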

Single points of failure

A single point of failure (SPOF) is any component that, if it fails, takes down the entire system with no fallback. It's the architectural equivalent of a single load-bearing column holding up an entire bridge: perfectly fine in normal conditions, catastrophic when that one thing fails.

SPOFs hide in plain sight. A single database primary with no replica. A single API gateway instance. A single availability zone. A shared authentication service with no redundancy. They're often introduced during a system's early life — when simplicity is the right trade-off — but become liabilities as the system matures and the cost of downtime grows.

What it causes: Total service unavailability for any failure of the single component — hardware failure, zone outage, OOM kill, or planned maintenance.

The fix: Redundancy — running multiple instances of each component so that no single failure causes a complete outage. Cell-based architecture takes this further by partitioning the system so a failure in one cell (serving one subset of users) cannot affect other cells. See Lesson rel-08: Load Balancing for the traffic-distribution patterns that make redundancy work.
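Both ideas can be sketched briefly: failover across redundant instances, and deterministic assignment of users to cells. The function names are hypothetical, and replicas are modeled as plain callables that raise `ConnectionError` when unavailable.

```python
import hashlib


def call_with_failover(replicas, request):
    """Try redundant instances in order; fail only if every one is down."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            last_error = exc      # this instance is unavailable, try the next
    raise RuntimeError("all replicas failed") from last_error


def cell_for(user_id: str, cell_count: int) -> int:
    """Pin each user to one cell so a cell failure affects only that subset."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % cell_count
```

In practice a load balancer with health checks usually performs the failover step, so application code rarely needs its own loop.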

✅ Defense-in-depth checklist

Before shipping any new service, run this checklist:

Rate limits at the edge — enforce a traffic ceiling before it reaches your service.
Timeouts everywhere — every external call, every database query, every queue read.
Circuit breakers on all external dependencies — stop cascade propagation at the source.
Bulkheads — isolate thread pools or connection pools per dependency so one slow service can't exhaust resources for everything.
Monitoring on data shape — not just error rates, but null rates, field distributions, and schema conformance.
Staged rollout + kill switch — canary before full deploy; config revert without a code push.
No SPOFs — at minimum, two instances of every stateful component.
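Of the checklist items, the bulkhead is the one least often seen in code, so here is a minimal sketch using a semaphore per dependency. The class name `Bulkhead` and the choice to reject rather than queue when full are assumptions for illustration.

```python
import threading


class Bulkhead:
    """Cap concurrent calls to one dependency; reject instead of queueing."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        # Non-blocking acquire: a full compartment fails fast.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        try:
            return fn()
        finally:
            self._slots.release()
```

Giving each dependency its own `Bulkhead` means a slow Service C can occupy at most its own slots, leaving worker threads free for requests that never touch C.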

How to read your next outage

When you're handed a post-incident report, or you're live in a war room watching error rates climb, the taxonomy above is your first tool. Before proposing solutions, diagnose the category:

Is traffic elevated? → Overload / retry storm. Look at request rate graphs and retry counts. If retries are spiking, check whether they have backoff and a cap. See rel-03 and rel-05.

Is one service slow while others are healthy? → Cascade from a dependency. Look for missing timeouts in the call chain. Find the root slow service and apply a circuit breaker so the rest of the system stops waiting for it. See rel-06.

Did errors start at a specific timestamp that aligns with a deploy? → Bad deploy or config change. The fix is a rollback, not a hotfix. A staged rollout would have caught this before 100% of traffic was affected. See rel-04.

Are errors confined to specific record IDs or data shapes? → Data problem. Quarantine the bad records, fix the validation gap, then replay from a clean point. Monitoring on data shape would have given early warning.

Did one host or zone go offline and take the whole system with it? → SPOF. The mitigation is architectural — add redundancy. See rel-08.

Naming the category precisely changes how you communicate in an incident. "We have a cascade caused by a missing timeout on the payments call" is actionable and points directly at the fix. "Something is broken" is not. The taxonomy is what turns confusion into a resolution path.

Under the hood: the precise failure mechanism

The cascade diagram above shows the shape of the problem. This section walks through the mechanism — exactly what is happening inside each service's runtime, why each step is inevitable given how thread pools and connection pools work, and what the ops team would see in their dashboards as the failure develops.

The worker-thread model and why it makes cascades transitive

Many HTTP servers (thread-per-request JVM servers, Python WSGI workers) handle each in-flight request by assigning it a worker thread for the lifetime of that request. The thread is occupied, and unavailable for any other request, until the response is returned or an error is thrown. Runtimes built on goroutines or an event loop (Go net/http, Node) do not pin an OS thread per request, but each in-flight request still holds a bounded resource such as a pooled connection, a socket, or memory, so the same arithmetic applies. This is the fundamental constraint that turns a slow downstream dependency into upstream resource starvation.

When Service B calls Service C, Service B's worker thread blocks on the network socket, waiting for bytes from C. It does no useful work while it waits. If C is slow — taking 30 seconds instead of 30 milliseconds — that thread is tied up for 30 seconds. Now multiply by concurrency: if Service B has a thread pool of 200 and Service C is slow for long enough, all 200 threads are waiting on C simultaneously. The 201st incoming request to Service B finds no available thread. It either queues (consuming memory and increasing latency further) or is immediately rejected with a 503. From the caller's perspective, Service B is the slow service — even though nothing inside Service B is wrong.

Step-by-step: how the cascade develops

Slow DB → Service C holds worker threads open. A database query that normally returns in 5 ms starts taking 4 000 ms because of lock contention, a missing index on a hot query, an I/O spike, or a long GC pause on the DB host. Every request that Service C is handling which needs that query now has its worker thread blocked on the DB socket, waiting for rows that take 4 seconds to arrive. Service C is not "down" (it still accepts new connections), but each request takes 800× longer than expected.
Service C's thread pool saturates. Service C's thread pool has a fixed ceiling (say, 100 threads). At normal latency (5 ms per request) and normal load (500 req/s), average in-flight requests = 500 × 0.005 = 2.5 threads in use at any moment, well within the pool. At 4 000 ms per request: 500 × 4.0 = 2 000 threads needed. At 500 req/s the 100-thread pool fills about 0.2 seconds after the slowdown begins, and every later request has to queue. Little's Law makes this precise: L = λW (mean concurrency = arrival rate × mean service time). When W explodes, L explodes proportionally.
Service C starts queuing then rejecting requests. Once the thread pool is full, new incoming requests wait in the server's request queue and, behind it, the OS accept queue. If those queues are bounded (most production servers set a limit), the server starts rejecting requests and the OS starts dropping incoming SYN packets, so callers see 503s, connection timeouts, or connection resets rather than a slow response. If the queue is unbounded, memory grows until the process is OOM-killed. Either way, Service C is now actively failing for new callers.
Service B's connection pool to C fills — all connections waiting. Service B maintains a connection pool to Service C: a set of pre-established TCP connections kept alive to avoid the overhead of a fresh TCP+TLS handshake on every call. That pool has a maximum size (say, 50 connections). Under normal conditions only a few are in use at once. Now, with C slow, each connection that B sends a request on is occupied — waiting for C's response — for 4 seconds instead of 5 ms. After 50 concurrent requests to C, all pool connections are in-flight. The 51st request from B that needs to call C cannot borrow a connection; it must wait in the pool's wait queue for one to become free.
Service B's own worker threads block waiting for connection pool slots. Service B's worker thread that handles a request needing to call C calls pool.borrow(timeout=…). If no connection is available and the borrow-timeout has not been set (or is set very high), that worker thread parks — it does nothing but wait for a pool slot. This is the transitive step: a resource problem inside C (thread pool) has now caused a resource problem inside B (connection pool), which is now causing a resource problem in B's own thread pool.
Service B's thread pool saturates. Exactly the same mechanism as step 2, but one layer upstream. Service B's worker threads are now all parked waiting for connection-pool slots to call C. New requests arriving at B either queue or are rejected. From the outside, Service B looks exactly as broken as Service C, even though B's own code and infrastructure are healthy.
Service A sees Service B timing out or refusing. Service A calls Service B. It either gets a connection-refused (B's accept queue is full), a 503 immediately (B shed the load), or — most dangerously — it gets a connection, sends its request, and then waits… and waits… because B's worker that picked up the request is itself waiting for a pool slot to call C. Service A's worker thread is now blocked on the socket to B.
The cascade continues upstream: Service A saturates. For precisely the same reasons as Service B: Service A's worker threads are blocked waiting on B's slow responses. If Service A has no timeout on its call to B, threads accumulate indefinitely. Service A's own thread pool saturates. New client requests are rejected or queued.
The client sees errors. Clients receive 503s, connection timeouts, or — worst case — connection accepted but response never arrives (hanging request). If clients have retry logic with no backoff, they immediately retry, adding more load to an already-saturated A, deepening the cascade and extending the outage.

Why it is transitive and not contained: Each layer has finite, bounded resources (thread pool, connection pool). When a downstream layer is slow, it holds upstream resources occupied longer than designed. Once arrival rate × latency exceeds pool capacity, a purely arithmetic consequence of Little's Law, the upstream layer also saturates. There is no mechanism that spontaneously absorbs the slowness; it flows upstream unless a timeout, circuit breaker, or bulkhead actively interrupts it.
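The Little's Law arithmetic from the walkthrough is easy to check directly. The helper names below are made up for this sketch, and the inputs are the walkthrough's own numbers for Service C (500 req/s, 5 ms healthy latency, 4 s degraded latency, a 100-thread pool).

```python
def threads_needed(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's Law: mean in-flight requests L = lambda * W."""
    return arrival_rate_per_s * latency_s


def seconds_to_saturation(pool_size: int, arrival_rate_per_s: float) -> float:
    """Time to fill an idle pool if no request completes in the meantime."""
    return pool_size / arrival_rate_per_s


print(threads_needed(500, 0.005))        # healthy Service C: 2.5
print(threads_needed(500, 4.0))          # degraded Service C: 2000.0
print(seconds_to_saturation(100, 500))   # 0.2
```

The same two functions apply one layer up: substitute Service B's arrival rate, its observed latency on calls to C, and its pool size.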

What the ops team sees in logs and metrics

# T+0:00 - baseline, everything healthy
service_c.db_query_latency_p99 = 6 ms
service_c.thread_pool_active = 8 / 100
service_b.connpool_c.in_flight = 3 / 50
service_b.thread_pool_active = 12 / 200
service_a.thread_pool_active = 15 / 200
client_error_rate = 0.1 %

# T+0:45 - DB query latency starts climbing (lock contention begins)
service_c.db_query_latency_p99 = 840 ms ← was 6 ms
service_c.thread_pool_active = 67 / 100 ← climbing fast
service_b.connpool_c.in_flight = 11 / 50 ← starting to back up
service_b.thread_pool_active = 34 / 200
service_a.thread_pool_active = 18 / 200
client_error_rate = 0.3 %

# T+1:30 - Service C thread pool fully saturated; B's conn pool filling
service_c.db_query_latency_p99 = 4200 ms
service_c.thread_pool_active = 100 / 100 ← SATURATED
service_c.request_queue_depth = 312 ← queue growing
service_b.connpool_c.in_flight = 48 / 50 ← almost full
service_b.connpool_c.wait_queue = 89 ← threads waiting for conn slot
service_b.thread_pool_active = 141 / 200
service_a.thread_pool_active = 42 / 200
client_error_rate = 4.1 %

# T+2:15 - Both B's conn pool and thread pool saturated; A starting to fill
service_c.thread_pool_active = 100 / 100 SATURATED
service_b.connpool_c.in_flight = 50 / 50 ← FULL
service_b.connpool_c.wait_queue = 634 ← every B thread waiting
service_b.thread_pool_active = 200 / 200 ← SATURATED
service_a.thread_pool_active = 138 / 200 ← climbing
client_error_rate = 38.0 %

# T+3:00 - Full cascade; A saturated; clients see near-total failure
service_c.thread_pool_active = 100 / 100 SATURATED
service_b.thread_pool_active = 200 / 200 SATURATED
service_a.thread_pool_active = 200 / 200 SATURATED
client_error_rate = 94.7 %

# Root cause is buried in service_c.db_query_latency: start from the
# first metric that spiked, not the last service to saturate.

The diagnostic pattern: the first metric to deviate identifies the layer where the original fault lives. The last service to saturate is furthest upstream. Work from the first spike, not the loudest alarm.
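The "first spike" rule can be applied mechanically. The sketch below orders metrics by the time they first exceed a multiple of their baseline, using four series from the dashboard snapshot above (times in seconds since T+0:00). The function name, the 3× threshold, and the tie-break on ratio are assumptions for illustration.

```python
# Samples copied from the snapshot: (seconds since T+0:00, value).
SERIES = {
    "service_a.thread_pool_active": [(0, 15), (45, 18), (90, 42), (135, 138), (180, 200)],
    "service_b.thread_pool_active": [(0, 12), (45, 34), (90, 141), (135, 200)],
    "service_c.thread_pool_active": [(0, 8), (45, 67), (90, 100)],
    "service_c.db_query_latency_p99": [(0, 6), (45, 840), (90, 4200)],
}


def deviation_order(series: dict, factor: float = 3.0) -> list:
    """Return (time, metric, ratio) per metric for its first sample above
    factor x baseline, earliest first; ties go to the larger ratio."""
    hits = []
    for name, samples in series.items():
        baseline = samples[0][1]          # first sample is taken as healthy
        for t, value in samples[1:]:
            if value > baseline * factor:
                hits.append((t, name, value / baseline))
                break
    return sorted(hits, key=lambda h: (h[0], -h[2]))


for t, name, ratio in deviation_order(SERIES):
    print(f"T+{t:>3}s  {name}  ({ratio:.1f}x baseline)")
```

On this data the database latency comes first and Service A's thread pool last, matching the order in which the cascade developed.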

Symptom → cause → fix: diagnosing which layer is the bottleneck

Symptom in metrics/logs	Which layer is the bottleneck	Immediate fix
`db_query_latency_p99` spiking; Service C `thread_pool_active` near ceiling; C still accepting connections	Database — slow queries holding C's threads	Kill slow query / add index; set a short DB query timeout in C so it fails fast instead of blocking indefinitely
Service C `thread_pool_active` = 100 %; `request_queue_depth` growing; C returning 503 or dropping connections	Service C thread pool saturated	Trip the circuit breaker on B→C calls so B stops waiting; shed load on C's accept queue; scale C horizontally
Service B `connpool_c.wait_queue` non-zero and growing; B's own `thread_pool_active` climbing; B latency rising but B's CPU is low	Service B connection pool to C exhausted — B threads parked waiting for pool slots	Set a short `pool.borrow_timeout` on B's connection pool so B fails fast rather than parking threads; add a bulkhead (separate, smaller pool) for the C dependency
Service B `thread_pool_active` = 100 %; B responding with 503 or queuing; Service A latency climbing	Service B thread pool saturated; cascade has reached B	Trip circuit breaker on A→B calls; A should return a degraded response immediately rather than waiting; shorten A's timeout on calls to B
Service A `thread_pool_active` at ceiling; client error rate spiking; all downstream services appear slow from A's perspective	Full cascade — A saturated, root cause is somewhere downstream	Follow latency histograms downstream from A until you find the first service where latency spiked — that is the origin; fix there, then open circuit breakers top-down to drain thread queues
Latency p99 spiking across all services simultaneously; no single service shows thread saturation first	Likely a shared infrastructure layer (DNS, load balancer, network fabric, shared DB cluster)	Check infrastructure-level metrics (NIC errors, DNS resolution time, LB connection table exhaustion) rather than application code

🧠 Quick check

1. A downstream service starts returning errors after 10 seconds. Client threads are all blocked waiting. What failure mode is this?

When a call has no timeout, threads block indefinitely waiting for a response that either never arrives or arrives far too late. The symptom — all threads occupied, service unresponsive to new requests — is classic resource exhaustion from a missing timeout. A circuit breaker and explicit timeouts are the fix.

2. After a deploy, one server still runs old code because the deployment script failed silently on that host. What category is this?

A deployment that applies inconsistently — leaving some hosts on old code — is a partial deploy, a subtype of the bad deploy category. It can cause subtle bugs that are hard to reproduce because only some fraction of requests hit the affected host. Staged rollout tooling with automated verification catches this before it reaches all traffic.

3. A retry loop with no exponential backoff retries 10× per second per client. There are 50,000 clients. What emerges?

50,000 clients each retrying 10 times per second is 500,000 requests per second — potentially many times the original load. This is a textbook retry storm. The immediate retries amplify the overload that triggered the errors in the first place, making recovery impossible without backoff and jitter to spread the retry wave over time.

✍️ Exercise: PR review guardrails for a retry loop

You're the reviewer. A colleague opens a PR that adds a retry loop calling a payment service. The implementation retries on any 5xx response with no additional logic. What three guardrails would you require before approving? Write your answer before reading the model response.

Model answer:

Exponential backoff with jitter. Each retry attempt should wait min(cap, base * 2^attempt) + random(0, jitter) milliseconds before firing. Jitter prevents synchronized retry bursts when many clients failed at the same moment.
Maximum retry cap. The loop must stop after a bounded number of attempts — typically 3. Unlimited retries turn a transient problem into a prolonged one and can exhaust the caller's own resources if the downstream service is down for an extended period.
Idempotency key. Payment calls must carry a stable unique identifier (e.g., a UUID generated before the first attempt and reused on retries). This ensures that if the server received and processed the request before returning an error, the retry does not create a duplicate charge. Without idempotency, "retry on failure" becomes "double-charge on network hiccup."

Rubric: ✓ backoff + jitter ✓ attempt cap ✓ idempotency key. Hitting all three is a strong answer. Hitting two suggests awareness of retry safety but an incomplete design; hitting one indicates the retry loop will likely cause production issues.
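The idempotency-key guardrail is the least obvious of the three, so here is a toy sketch of both sides. `PaymentServer` and `pay` are hypothetical names, the server keeps its state in memory, and backoff is omitted to keep the focus on the key.

```python
import uuid


class PaymentServer:
    """Toy server that remembers results by idempotency key."""

    def __init__(self):
        self.results = {}    # idempotency key -> stored response
        self.charges = []    # every charge actually applied

    def charge(self, key: str, amount_cents: int) -> dict:
        if key in self.results:
            return self.results[key]          # replay: no second charge
        self.charges.append(amount_cents)
        self.results[key] = {"status": "charged", "amount_cents": amount_cents}
        return self.results[key]


def pay(server: PaymentServer, amount_cents: int, attempts: int = 3) -> dict:
    key = str(uuid.uuid4())       # generated once, before the first attempt
    response = None
    for _ in range(attempts):
        # Worst case modeled here: the client never sees a response, so it
        # resends the same request with the same key on every attempt.
        response = server.charge(key, amount_cents)
    return response
```

Three sends with one key produce one entry in `server.charges`; generating a fresh key inside the loop would produce three.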

Key takeaways

Overload / traffic spikes: rate limiting at the edge prevents cascade triggers by enforcing a traffic ceiling before your service is overwhelmed.
Cascades: circuit breakers stop failure propagation at the source; bulkheads limit blast radius so one slow dependency can't exhaust resources for everything.
Bad deploys: staged and canary rollouts plus kill switches let you catch and revert problems fast, before they reach all users.
Data problems: schema validation at ingest and monitoring on data shape catch drift and corruption early, before they cause downstream failures.
Retry storms: exponential backoff with jitter keeps recovery from amplifying the original problem into a sustained outage.
Missing timeouts: every external call needs an explicit timeout — no exceptions — to convert "blocked forever" into "failed fast."
Single points of failure: redundancy and cell-based architecture eliminate single failure points so no one component can take the whole system offline.

Sources & further reading

These are starting points for going deeper. The explanations above are original, grounded in these public references:

Google SRE Book — the canonical reference for site reliability engineering, covering failure modes, error budgets, and operational patterns at scale.
AWS Builders' Library — Timeouts, retries, and backoff with jitter — practical guidance on making retry loops safe in distributed systems.
AWS Builders' Library — Avoiding overload in distributed systems — how Amazon approaches load shedding and back-pressure to prevent overload cascades.
AWS Builders' Library — Making retries safe with idempotent APIs — the mechanics and patterns for idempotency keys so that retries never create duplicate effects.