API Design

Failure Case Studies · Lesson 03

The 2017 AWS S3 Outage

One mistyped parameter on a routine maintenance command removed far more servers than intended. What should have been a brief intervention turned into a four-hour outage that rippled across the internet — and left the AWS status dashboard too broken to report on itself.

⏱ ~13 min · Difficulty: advanced · Prereq: fail-02

By the end you'll be able to

What happened

The morning of February 28, 2017 (Pacific time) started with a routine debugging session. An engineer on the S3 team was investigating why the S3 billing system was running slowly and needed to remove a small number of servers from one of its supporting subsystems. The removal followed an established playbook and used an internal maintenance tool of the kind operations teams run regularly. One input to the command was entered incorrectly.

  1. Feb 28, 2017, morning: An AWS engineer begins debugging a slowness issue in the billing system. A maintenance command is prepared to remove a small set of servers from a subsystem.
  2. Incorrect parameter entered: Due to a typo or input error, the command receives a larger value than intended. Rather than removing a small subset of servers, it targets a much larger group.
  3. S3 index servers removed: Among the servers taken offline is a substantial portion of the S3 index subsystem — the internal component responsible for tracking where every stored object lives across the fleet.
  4. Two dependent subsystems lose their index: The S3 metadata subsystem (which handles object attributes) and the S3 PUT subsystem (which accepts new object writes) both depend on the index. With the index gone, they cannot function.
  5. Recovery begins, then stalls: Engineers attempt to restart the affected subsystems. The restarts take far longer than expected. These subsystems had not been fully restarted for years and had grown enormously in the meantime, so the restart procedure had never been exercised at this scale.
  6. ~4 hours of degraded and unavailable S3 in US-EAST-1: For roughly four hours, Amazon S3 in the US East (Northern Virginia) region cannot reliably serve reads, writes, or metadata requests.
  7. Blast radius extends across the internet: AWS's own management console, the AWS Service Health Dashboard (which served its assets from S3), Slack, GitHub, Quora, business intelligence tools, IoT sensor fleets, and thousands of other services that depended on US-EAST-1 S3 either degraded or went dark. The very page AWS would use to communicate the outage was itself impaired.
[Diagram: the S3 Index Subsystem (taken offline) feeds the S3 Metadata Subsystem and the S3 PUT Subsystem. Downstream of S3: AWS Console (uses S3), third-party services (Slack, GitHub, etc.), Service Health Dashboard (uses S3!), and IoT devices (config fetched from S3).]
Blast radius of the S3 index failure. Red arrows show direct subsystem dependencies; amber arrows show downstream services that relied on S3. The dashed border on the Health Dashboard signals its circular dependency — it was broken by the very outage it should have reported.

Root cause

No single decision caused the outage. Three separate conditions had to be true simultaneously, and when they aligned, the blast radius was catastrophic:

1. Human error amplified by a missing safety floor. The maintenance tool did not check that a removal would leave the tier above a minimum safe capacity, and it placed no cap on how many servers one command could remove. An operator entering a wrong value could remove an arbitrarily large fraction of a tier in a single command execution. This is an error-permissive interface: it relies entirely on human care rather than encoding safe defaults into the tooling itself. When the care slipped, as it inevitably does, there was nothing to catch it.

2. Restart debt. The S3 index subsystem had grown considerably over the years and had been running continuously for a very long time. Nobody had exercised the restart procedure at its current scale. Because restarts had never been practiced, there were no benchmarks, no known-good runbooks with realistic time estimates, and no optimizations. When the restart finally became necessary, engineers discovered the hard way that it was a multi-hour process, not a multi-minute one. The longer a system runs without a restart, the more "restart debt" accumulates — and debts must eventually be paid.

3. Enormous blast radius from a single-region dependency. US-EAST-1 S3 had become a global dependency not just for AWS's own infrastructure but for a huge swath of the public internet. Many services had been built on the implicit assumption that S3 was always available, with no fallback or degraded-mode behavior. When that assumption broke, those services broke too — including the AWS Service Health Dashboard, which was served from the same S3 it was supposed to be monitoring.

The design lessons

The S3 outage illustrates several interconnected principles that belong in every reliability-focused system design conversation:

Blast radius reduction. The goal is to ensure that a failure in one component affects the smallest possible set of users and services. The S3 outage was so damaging in part because there was essentially no blast radius boundary: a single subsystem's failure could reach every service that had ever stored something in US-EAST-1. Good system design places hard boundaries — network, data, or logical — between failure domains.

Dependency isolation. The Service Health Dashboard depending on S3 is the canonical cautionary example of circular dependency between a service and its monitoring system. When the thing you monitor is also the thing your monitoring system runs on, you lose observability precisely when you need it most. Critical operational infrastructure — status pages, alerting pipelines, runbook repositories — must run on different infrastructure from the services they monitor.

Cell-based architecture. If the S3 index subsystem had been partitioned into smaller, independently deployable and restartable cells, the bad command would have taken down one cell rather than the entire subsystem. Cell architecture trades some resource efficiency for resilience: the blast radius of any single failure is bounded by the cell boundary. It also keeps restart times predictable because each cell is small enough to restart quickly.

Runbook and tooling safety guards. Runbooks are instructions written for humans; humans make errors. Safety guards are constraints encoded into the tools themselves. A cap on the drain command ("refuse to remove more than 5% of a tier without a secondary confirmation") would have stopped this incident at the source. The lesson is that operational tools handling destructive actions must be designed like APIs: with validation, type constraints, and explicit confirmation steps for high-impact operations.

Graceful degradation. Services that depend on a storage layer should have a fallback posture: serve stale cached data, disable non-critical features, or return a graceful error rather than crashing entirely. Every layer of graceful degradation between a dependency failure and the end user is a reduction in blast radius.
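One way to picture the "serve stale cached data" fallback is a read-through cache that remembers the last good value for each key. This is a minimal sketch, not how any particular service does it: `CachedReader`, `StorageUnavailable`, and the `fetch` callable are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class StorageUnavailable(Exception):
    """Raised by the (hypothetical) storage client when the backend is down."""


class CachedReader:
    """Read-through cache that serves the last good value when storage fails."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        # key -> (last good value, time it was fetched)
        self._cache: Dict[str, Tuple[bytes, float]] = {}

    def get(self, key: str) -> Tuple[Optional[bytes], str]:
        """Return (value, status); status is 'fresh', 'stale', or 'unavailable'."""
        try:
            value = self._fetch(key)
        except StorageUnavailable:
            if key in self._cache:
                # Degraded mode: answer with the last value we saw.
                return self._cache[key][0], "stale"
            # Nothing cached: report a structured failure instead of crashing.
            return None, "unavailable"
        self._cache[key] = (value, time.time())
        return value, "fresh"
```

Returning an explicit status lets the caller decide what to do with stale data, for example showing a banner or disabling writes, rather than hiding the degradation.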

🎯 Interview angle

The S3 outage is the canonical "blast radius" example in system design interviews. When asked about reliability for any distributed service, proactively state: "I'd partition this into cells so a failure affects only one cell at a time, and I'd ensure the status page and operational tooling don't depend on the system they monitor." Naming cell architecture and out-of-band observability signals that you've thought beyond the happy path.

⚠️ Common trap

Circular dependency between a service and its own status reporting. When S3 went down, the AWS Service Health Dashboard — the primary tool AWS uses to communicate service disruptions — was itself degraded because it rendered assets stored in S3. The very mechanism meant to inform customers about the outage was a casualty of the outage. This pattern is more common than it seems: logging pipelines that write to the database they monitor, alerting systems that use the same message broker they alert on, dashboards that pull config from the storage tier they track.
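A dependency like this can be caught before an outage by walking the dependency graph. The sketch below assumes you maintain a map of which component depends on which; the component names in `DEPENDENCIES` are hypothetical and stand in for whatever inventory your organization keeps.

```python
from typing import Dict, List, Set


def transitive_dependencies(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every component reachable from `start` by following dependency edges."""
    seen: Set[str] = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def shares_fate(graph: Dict[str, List[str]], monitor: str, monitored: str) -> bool:
    """True if the monitoring component depends, directly or not, on what it monitors."""
    return monitored in transitive_dependencies(graph, monitor)


# Hypothetical dependency map: each key lists what it needs in order to work.
DEPENDENCIES = {
    "status-page": ["static-assets"],
    "static-assets": ["object-store"],
    "object-store": [],
    "pager": ["sms-gateway"],
    "sms-gateway": [],
}

if __name__ == "__main__":
    print(shares_fate(DEPENDENCIES, "status-page", "object-store"))  # True
    print(shares_fate(DEPENDENCIES, "pager", "object-store"))        # False
```

The indirect edge is the important case: the status page never names the object store, but it still fails with it.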

✅ Do this, not that

Maintenance and operational tools that can remove capacity should enforce hard limits in the tool itself, not just in the runbook. A command that drains servers should refuse to remove more than a defined percentage of a tier in a single invocation without a secondary confirmation step and a dry-run preview. Runbooks get skipped under pressure; hard constraints encoded in the tool cannot be skipped.

How to avoid it

Five concrete mitigations directly address the failure modes this incident revealed:

(a) Safety guards on destructive operational commands. Any command that removes capacity from a live system should enforce: a maximum percentage cap (e.g., refuse to remove more than 5% of a tier in one invocation), a dry-run mode that shows exactly which nodes would be affected before execution, and an explicit --confirm flag required to actually execute the destructive path. The guard must live in the tool, not just the procedure.

(b) Regular practice-restarts of critical subsystems. Subsystems should be restarted on a scheduled cadence — even when there is no operational reason to do so — to keep restart procedures exercised, benchmarked, and fast. If a restart takes 20 minutes today and you never practice it, it may take 4 hours when you actually need it. Treat restart time as a metric to be monitored and optimized, just like latency or error rate.

(c) Cell-based architecture. Partition large subsystems into independently deployable, independently restartable cells. Each cell owns a fraction of the keyspace or capacity. A bad command or a bad restart in cell-03 does not affect cells 01, 02, or 04. The blast radius of any single failure becomes bounded by the cell boundary. Cells also make capacity scaling and gradual rollouts far safer.

(d) Out-of-band status communication. The status page, alerting infrastructure, incident chat channels, and runbook hosting must not depend on the service they monitor. Host the status page on a separate CDN, in a different region, with no runtime dependency on the affected service. During an outage is exactly the wrong time to discover that your communications infrastructure is down too.

(e) Design for graceful degradation. Services that use S3 (or any durable storage) as a dependency should have a degraded mode: return the last cached version of data, disable write paths while reads continue, or return a structured error that allows the caller to handle the situation gracefully rather than crashing. The goal is that a storage outage causes a feature to work at reduced capacity, not to fail completely.
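Mitigation (c) can be made concrete with a small routing sketch: each object key is assigned to a cell by a stable hash, so losing one cell affects only the keys that map to it. The hash choice and the function names are assumptions for illustration; real cell routers also have to handle resizing and rebalancing, which this sketch ignores.

```python
import hashlib
from typing import Iterable, Set


def cell_for_key(key: str, num_cells: int) -> int:
    """Map an object key to a cell index in [0, num_cells) with a stable hash."""
    if num_cells <= 0:
        raise ValueError("num_cells must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_fraction(keys: Iterable[str], failed_cells: Set[int], num_cells: int) -> float:
    """Fraction of the given keys whose cell is in the failed set."""
    keys = list(keys)
    if not keys:
        return 0.0
    hit = sum(1 for k in keys if cell_for_key(k, num_cells) in failed_cells)
    return hit / len(keys)


if __name__ == "__main__":
    sample = [f"bucket/object-{i}" for i in range(10_000)]
    # With 20 cells and one cell down, roughly 1/20 of keys are affected.
    print(affected_fraction(sample, failed_cells={3}, num_cells=20))
```

With a single logical unit (`num_cells=1`), any failure touches every key, which is the situation the lesson describes.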

# Example: a drain-nodes admin command with safety guards

# Bad — no guardrails, accepts arbitrary count
drain-nodes --tier s3-index --count 1277   # removes 1277 nodes, no confirmation

# Good — dry-run + percentage cap + confirmation required
drain-nodes --tier s3-index --count 5 --dry-run
# → Would remove: node-042, node-107, node-219, node-304, node-418
# → That is 0.4% of tier (5 / 1277). Limit: 5%. Safe to proceed.
# → Re-run with --confirm to execute.

drain-nodes --tier s3-index --count 1277 --dry-run
# → ERROR: count 1277 is 100% of tier (limit: 5%).
# →        Reduce --count or request an override with --override-reason.
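The checks behind a transcript like the one above fit in a few lines. This is a sketch under stated assumptions: `evaluate_drain`, `DrainDecision`, and the 5% default are hypothetical, and a real tool would also resolve node IDs and record who confirmed the action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrainDecision:
    allowed: bool   # did the request pass validation?
    execute: bool   # should the tool actually remove nodes?
    message: str    # what to show the operator


def evaluate_drain(tier_size: int, count: int, confirm: bool = False,
                   max_fraction: float = 0.05) -> DrainDecision:
    """Validate a drain request; dry-run unless confirm is set."""
    if tier_size <= 0:
        raise ValueError("tier_size must be positive")
    if count <= 0 or count > tier_size:
        return DrainDecision(False, False, f"count must be between 1 and {tier_size}")
    fraction = count / tier_size
    if fraction > max_fraction:
        # Hard cap: refuse regardless of confirm.
        return DrainDecision(
            False, False,
            f"count {count} is {fraction:.1%} of tier (limit: {max_fraction:.0%})",
        )
    if not confirm:
        # Default path is a preview only.
        return DrainDecision(
            True, False,
            f"dry run: would remove {count} nodes ({fraction:.1%} of tier); "
            "re-run with confirm to execute",
        )
    return DrainDecision(True, True, f"removing {count} nodes ({fraction:.1%} of tier)")
```

Note that `confirm=True` cannot bypass the cap: the percentage check runs first, so a confirmed but oversized request is still refused.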

Under the hood: the precise failure mechanism

The outage unfolded through a chain of dependency failures. Where the AWS post-event summary gives detail, the steps below draw on it; the tool flags, server counts, and log lines are illustrative, because the summary does not publish them.

  1. The maintenance tool had no upper bound on removals. The internal drain tool accepted a --count parameter specifying how many servers to remove from a subsystem tier. No cap existed, and nothing checked that the tier would stay above a minimum capacity: the tool would accept any integer value up to the full size of the tier without warning, without a preview, and without a confirmation prompt. The interface gave no indication of what percentage of the tier a given count represented.
  2. A typo sent the wrong count. The engineer intended to remove a small set of servers (roughly 20) from the billing subsystem. The value actually submitted was dramatically larger. AWS's post-event summary states that significantly more servers were removed than intended. Because the tool executed immediately — no dry-run mode, no "you are about to remove X% of the tier, continue?" step — the command completed before the error was recognized.
  3. 1,277 servers were removed from the S3 Index subsystem. The billing subsystem's supporting infrastructure shared capacity with S3's Index subsystem. When the drain command ran, it removed 1,277 servers from the s3-index tier — not a handful of billing nodes. The Index subsystem is the internal component that maps every stored object to its physical location across S3's distributed storage fleet. It is the foundational lookup layer that every other S3 operation depends on.
  4. Index loss cascades to Metadata and PUT via hard dependency. S3's internal architecture has a clear dependency graph: the Index subsystem holds the object → physical-location mapping; the Metadata subsystem tracks object attributes (size, ETag, ACL, storage class); the PUT subsystem accepts new object writes. Both Metadata and PUT must consult Index to function — Metadata to resolve where attribute records live, PUT to record where a new object should be written. When Index lost quorum due to insufficient server capacity, both Metadata and PUT began returning errors. There was no fallback path and no circuit breaker to gracefully degrade.
  5. Read operations also failed. S3 GET requests resolve object location through Index before fetching bytes from the storage fleet. With Index non-functional, even reads returned errors. The result: reads fail, writes fail, metadata operations fail. S3 in US-EAST-1 was effectively non-operational across all object operations.
  6. No cell partitioning meant all index capacity was one logical unit. A cell-based architecture would partition the Index subsystem into independently-operated shards — say, 20 cells each holding 1/20th of the object namespace. A drain command affecting one cell would degrade 5% of traffic while the other 19 cells continued serving requests normally. S3's Index was not partitioned this way in 2017. All index capacity was a single logical unit, so removing a large fraction of servers from the tier removed a large fraction of total index capacity — a blast radius proportional to the entire region's S3 traffic.
  7. Recovery stalled on in-memory state rebuild at massive scale. Engineers began restarting the Index subsystem to restore service. The restart process requires the subsystem to rebuild its in-memory index — the object → location table — from persistent storage before it can begin serving requests. This rebuild time is roughly linear in the number of entries: at S3's 2017 scale (trillions of objects), the table had grown enormous. The subsystem had been running continuously for years without a restart; nobody had benchmarked restart time at current scale, and there were no runbooks with realistic time estimates. What might have taken minutes at the scale when the procedure was last tested now took hours.
  8. Sequencing errors during restart extended the outage further. The correct restart sequence is: Index first, then its dependents (Metadata, then PUT). Attempting to restart Metadata or PUT before Index has fully rebuilt its state causes those subsystems to fail again immediately, requiring another restart cycle. Sequencing errors during the recovery window added additional delay on top of the Index rebuild time.
  9. The AWS Service Health Dashboard became a victim of the outage it was meant to report. The dashboard serving status.aws.amazon.com delivered its static assets — JavaScript, CSS, images — from Amazon S3 in us-east-1. When S3 in that region failed, the dashboard's assets became unretrievable. Customers attempting to check the AWS status page encountered a partially or fully degraded dashboard: the monitoring and communication tool had a circular dependency on the failing service.
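The sequencing problem in step 8 is the kind of thing tooling can enforce instead of operators. The sketch below uses a topological sort over a dependency map and stops as soon as a subsystem fails its health check. The dependency map is an assumption that mirrors the order stated above (Index, then Metadata, then PUT), and `restart` and `is_healthy` are hypothetical callbacks; a real orchestrator would poll health with a timeout rather than check once.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def restart_in_order(depends_on: Dict[str, Set[str]],
                     restart: Callable[[str], None],
                     is_healthy: Callable[[str], bool]) -> List[str]:
    """Restart subsystems dependencies-first, gating each step on a health check."""
    restarted: List[str] = []
    # static_order() yields each node only after all of its dependencies.
    for name in TopologicalSorter(depends_on).static_order():
        restart(name)
        if not is_healthy(name):
            raise RuntimeError(
                f"{name} failed its health check; not restarting its dependents"
            )
        restarted.append(name)
    return restarted


# Assumed dependency map: each subsystem lists what must be up before it starts.
SUBSYSTEMS = {
    "index": set(),
    "metadata": {"index"},
    "put": {"index", "metadata"},
}

if __name__ == "__main__":
    order = restart_in_order(SUBSYSTEMS, restart=lambda n: None, is_healthy=lambda n: True)
    print(order)  # ['index', 'metadata', 'put']
```

Because the gate raises before any dependent is touched, a slow Index rebuild cannot trigger a restart loop in Metadata or PUT.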

Monitoring timeline

$ aws-internal drain-nodes --tier s3-index --count 1277

T+0:00 DRAIN COMPLETE — 1277 servers removed from s3-index tier (100% of tier)
T+0:02 QUORUM LOST — s3-index: insufficient capacity; quorum threshold not met
T+0:02 SUBSYS ERROR — s3-metadata: dependency s3-index unavailable; requests failing
T+0:02 SUBSYS ERROR — s3-put: dependency s3-index unavailable; requests failing
T+0:05 REGION DEGRADED — us-east-1 S3 GET/PUT error rate >99%; console assets unreachable
T+0:05 DASHBOARD — status.aws.amazon.com assets served from affected S3; dashboard degraded
T+0:15 RECOVERY — s3-index restart initiated; in-memory rebuild from persistent storage begun
T+0:30 REBUILD — s3-index rebuild in progress (estimated hours remaining; scale unknown)
T+1:00 SEQUENCING — s3-metadata restart attempted before index ready; restart loop detected
T+4:00 RESTORED — s3-index rebuild complete; metadata + PUT subsystems restarted in order; traffic recovering

Failure chain and the guardrail that would have contained each stage

Failure stage | Technical cause | Guardrail that would have contained it
Drain command removes 1,277 servers | No upper bound or percentage cap on --count; the tool accepted any value up to the full tier size | Hard percentage cap (e.g. 5% of tier per invocation), a dry-run preview showing count and percentage, and an explicit --confirm required to execute
Execution without preview or confirmation | Tool ran destructively on first invocation with no opt-in confirmation step and no dry-run mode | Make dry-run the default behavior; execute destructively only with the --confirm flag after reviewing the preview output
Index quorum lost; entire region's S3 affected | Index subsystem had no cell partitioning; all capacity was one logical unit, so partial removal degraded the whole tier | Cell-based architecture: partition Index into independent cells so one cell's failure degrades only its share of traffic (e.g. 1/20th), not the entire region
Metadata and PUT subsystems fail immediately on Index loss | Hard synchronous dependency on Index with no fallback, no graceful degradation, and no circuit breaker | Circuit breakers on Index calls; degraded-mode behavior (e.g. serve cached metadata, queue PUT operations) when Index is unavailable
Index restart takes hours instead of minutes | In-memory state rebuild is linear in data volume; subsystem had grown for years without anyone benchmarking restart time at current scale | Regular chaos-engineering restarts of subsystems to measure and bound restart time; incremental startup that begins serving a subset of the namespace before the full rebuild completes
Restart sequencing errors extend outage | No enforced dependency ordering in the restart tooling; operators had to sequence manually under pressure | Automated, orchestrated restart that enforces the correct dependency order (Index → Metadata → PUT), with each step gated on a passing health check
Status dashboard degraded during outage | Dashboard static assets served from the same S3 region that was failing: a circular dependency between the monitoring tool and the monitored service | Host status dashboard assets in a separate region or CDN entirely independent of the service being monitored; never route the canary through the cage
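The circuit-breaker guardrail named in the table can be sketched briefly. After a run of consecutive failures the breaker rejects calls immediately for a cool-down period, then lets one trial call through. The class name, thresholds, and injectable clock are assumptions for illustration, not a description of S3 internals.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when calls are rejected without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpen("dependency marked unavailable; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                # Open (or re-open) the circuit and restart the cool-down.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would catch `CircuitOpen` and switch to its degraded path, such as cached metadata or a queued write, instead of waiting on a dependency that is known to be down.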

🧠 Quick check

1. What compounded the AWS S3 outage from minutes to hours?

The S3 index and dependent subsystems had been running continuously for years. Nobody had exercised the restart at their current scale, so what might have taken minutes under normal circumstances turned into a multi-hour ordeal. This is what is sometimes called "restart debt."

2. Which architectural pattern would most directly limit the blast radius of a future similar error?

Cell-based architecture is specifically designed to contain failures. If the index subsystem is partitioned into, say, twenty cells, a bad drain command might affect one cell — not all twenty. The other nineteen cells continue serving traffic while cell-specific recovery happens.

3. Why was the AWS Service Health Dashboard itself affected during the S3 outage?

The AWS Service Health Dashboard served its static assets (JavaScript, CSS, images) from Amazon S3 in the same region that was experiencing the outage. This is the canonical circular-dependency failure: the monitoring and communication tool for a service was itself a dependent of that service.

✍️ Exercise: review a PR that adds a drain-node admin command

You're the reviewer. A pull request adds the following admin command to a storage orchestration service:

# New command: drain-nodes
# Removes the specified nodes from a storage tier.
# Usage: drain-nodes --tier <name> --nodes <comma-separated node IDs>
#
# Example:
drain-nodes --tier s3-index --nodes node-001,node-002,node-003

The command accepts any number of node IDs with no upper bound and no confirmation step. What guardrails would you require before approving this PR?

Model answer:

  1. Maximum percentage cap: The command must refuse to drain more than a defined percentage of the tier (e.g., 5%) in a single invocation. If the operator tries to remove more, the tool should print an error with the current tier size and the number they attempted to remove, then exit without making any change.
  2. Dry-run mode: Make dry-run the default. Without --confirm, the command only prints which nodes would be drained and the percentage of the tier they represent, then tells the operator to re-run with --confirm to execute. Destructive execution must be an explicit opt-in.
  3. Out-of-band verification: The operational dashboard or runbook used to confirm that the drain completed successfully must not itself depend on the storage tier being drained. If you drain the S3 index tier, the dashboard confirming it completed should not be reading from S3.

Rubric: percentage cap + dry-run mode + out-of-band verification = full marks. Any two of the three = partial credit. Zero of the three = the PR should not merge.

Key takeaways

Sources & further reading

These links point to the primary source material for this incident and the AWS reliability guidance that followed it. All explanatory prose above is original.