API Design

Reliability & Scale · Lesson 06

Circuit Breaker Pattern

When a downstream service is drowning, the last thing it needs is your service pouring more requests on top. The circuit breaker is an automatic trip-switch: it detects the failure, cuts the flow, and only restores it once the downstream has had a chance to breathe.

⏱ 11 min · Difficulty: core · Prereq: Retries & Backoff (rel-05)

By the end you'll be able to

The problem: hammering a drowning neighbor

Picture a neighborhood sharing a single water main. One house springs a burst pipe; the water pressure in the whole street drops. Every other house starts running all their taps harder, trying to compensate — which drops the pressure even more. Now the problem has spread from one burst pipe to the entire street.

Microservices fail the same way. Service A calls Service B. B is overloaded and starts responding slowly. A's request threads pile up waiting for B to answer. A's own response times balloon. Services C and D that call A now start timing out. One slow dependency has metastasized into a full cascading failure — not because B was catastrophically broken, but because everyone kept hammering it instead of backing off.

This is the problem the circuit breaker was designed to solve. The name is borrowed directly from electrical engineering: a physical circuit breaker trips when current exceeds a safe threshold, protecting the wiring. In software, a circuit breaker trips when a downstream's error rate exceeds a threshold, protecting threads, connections, and users.

The three states

A circuit breaker wraps every call to an external dependency and maintains a small state machine with three states:

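[Figure: state diagram. States: CLOSED (requests flow normally), OPEN (requests fail fast, no call), HALF-OPEN (one probe request allowed through). Transitions: CLOSED → OPEN when error rate > threshold (e.g., 50% in last 60 s); OPEN → HALF-OPEN when the cool-down expires (e.g., after 30 s); HALF-OPEN → CLOSED when the probe succeeds; HALF-OPEN → OPEN when the probe fails.]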
The circuit breaker state machine. In CLOSED, calls pass through. When failures exceed a threshold, it trips to OPEN and calls fail fast without touching the dependency. After a cool-down period it enters HALF-OPEN and allows a single probe request. Success resets to CLOSED; failure returns to OPEN.

Closed

The normal operating state. All calls pass through to the dependency. The breaker counts successes and failures inside a rolling time window (e.g., the last 60 seconds). As long as the failure rate stays below a configured threshold (e.g., 50%), nothing changes.

Open

The breaker has tripped. Calls are rejected immediately — the dependency is not contacted at all. This "fail fast" behavior is crucial: instead of tying up threads waiting for a service that can't respond, the caller gets an instant error and can apply a fallback. The dependency simultaneously gets zero additional load, giving it room to recover.

Half-Open

After a configured cool-down period (e.g., 30 seconds), the breaker enters a cautious probe state. It lets a single request (or a small fixed number of requests) through to the real dependency. If the probe succeeds, the circuit closes and normal traffic resumes. If the probe fails, the circuit trips back to OPEN and the cool-down restarts. This gradual re-entry avoids flooding a fragile, still-recovering service the moment the cool-down ends.
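The three states and their four transitions can be written down as a small lookup table. This is a minimal sketch: the event names (`THRESHOLD_EXCEEDED`, `COOLDOWN_EXPIRED`, `PROBE_SUCCEEDED`, `PROBE_FAILED`) are labels invented here for illustration, not part of any library.

```javascript
// Transition table for the breaker state machine.
// Event names are illustrative assumptions, not a library API.
const TRANSITIONS = {
  CLOSED:    { THRESHOLD_EXCEEDED: "OPEN" },
  OPEN:      { COOLDOWN_EXPIRED: "HALF_OPEN" },
  HALF_OPEN: { PROBE_SUCCEEDED: "CLOSED", PROBE_FAILED: "OPEN" },
};

// Returns the next state, or the current state when the event
// does not apply (e.g. a probe result while CLOSED).
function nextState(state, event) {
  const row = TRANSITIONS[state];
  if (!row) throw new Error(`Unknown state: ${state}`);
  return row[event] ?? state;
}

// Walk one trip, one failed probe, and one successful recovery.
let state = "CLOSED";
const events = [
  "THRESHOLD_EXCEEDED", "COOLDOWN_EXPIRED", "PROBE_FAILED",
  "COOLDOWN_EXPIRED", "PROBE_SUCCEEDED",
];
for (const event of events) {
  state = nextState(state, event);
  console.log(`${event} -> ${state}`);
}
```

Keeping the transitions in data rather than scattered `if` statements makes it easy to check that every state has a way out.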

Configurable thresholds

A production circuit breaker needs at least four tuneable parameters:

| Parameter | Purpose | Example default |
| --- | --- | --- |
| Window duration | How long to count failures over | 60 s rolling |
| Failure threshold | Error rate % that trips the breaker | 50% |
| Minimum request count | Ignore threshold until N requests seen (avoids tripping on startup noise) | 10 requests |
| Cool-down (sleep window) | How long to stay OPEN before probing | 30 s |

The minimum request count matters: if you've only had 2 requests and both failed, that's a 100% failure rate, but it isn't statistically meaningful. Without a floor, a breaker can trip the moment a service boots up and its first two warmup calls are slow enough to be counted as failures.
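The effect of the floor is easy to see in a few lines. The sketch below is illustrative only: `shouldTrip` is a hypothetical helper, and treating a rate exactly equal to the threshold as tripping is an assumption of this example.

```javascript
// Decide whether a window of call outcomes should trip the breaker.
// outcomes: array of booleans, true = success, false = failure.
function shouldTrip(outcomes, { minRequests = 10, failureThreshold = 0.5 } = {}) {
  if (outcomes.length < minRequests) return false; // too few samples to judge
  const failures = outcomes.filter((ok) => !ok).length;
  return failures / outcomes.length >= failureThreshold;
}

// Two failed warmup calls: 100% failure rate, but only 2 samples.
const twoWarmupFailures = [false, false];
// Ten calls, five of them failed: 50% with the minimum met.
const tenCallsHalfFailed = [true, false, true, false, true,
                            false, true, false, true, false];

console.log(shouldTrip(twoWarmupFailures));                     // false
console.log(shouldTrip(twoWarmupFailures, { minRequests: 1 })); // true
console.log(shouldTrip(tenCallsHalfFailed));                    // true
```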

Fallbacks and graceful degradation

A circuit breaker alone just replaces a slow failure with a fast one. The real value is pairing it with a fallback: an alternative behavior when the dependency is unavailable. Fallback options, roughly from best to worst user experience:

  1. Cached data — return the last known-good response if it's still within an acceptable staleness window.
  2. Default / empty state — return an empty list, a zero count, or a neutral default rather than an error.
  3. Partial response — omit the unavailable section and flag it: { "recommendations": null, "recommendations_available": false }.
  4. Queue for later — accept the write and process it once the dependency recovers (for async operations).
  5. Clear error message — if no fallback is viable, return a user-friendly message rather than a 500.

The goal is graceful degradation: a service that behaves "less well" during a dependency outage is far preferable to one that becomes completely unavailable.
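The first two fallback options (cached data, then a neutral default) combine naturally. The following is a sketch under stated assumptions: the `Map`-based cache, the `maxStaleMs` limit, and the injectable `now` clock are all invented for illustration.

```javascript
// Last-known-good cache with a staleness limit, used as a breaker fallback.
// The Map cache, maxStaleMs and the `now` clock are illustrative assumptions.
function makeCachedFallback({ maxStaleMs = 60_000, now = Date.now } = {}) {
  const cache = new Map(); // key -> { value, storedAt }

  return {
    // Call after every successful response from the dependency.
    remember(key, value) {
      cache.set(key, { value, storedAt: now() });
    },
    // Call when the breaker is open: fresh-enough cache first, then the default.
    fallback(key, defaultValue) {
      const entry = cache.get(key);
      if (entry && now() - entry.storedAt <= maxStaleMs) {
        return { data: entry.value, source: "cache" };
      }
      return { data: defaultValue, source: "default" };
    },
  };
}

// Usage with a fake clock so the staleness behaviour is visible.
let clock = 0;
const recs = makeCachedFallback({ maxStaleMs: 60_000, now: () => clock });
recs.remember("user-1", ["book", "lamp"]);

clock = 30_000;
const fresh = recs.fallback("user-1", []);  // served from cache (30 s old)
clock = 120_000;
const stale = recs.fallback("user-1", []);  // cache too old, default returned
console.log(fresh.source, stale.source);
```

Returning the `source` alongside the data lets the caller flag degraded responses instead of silently passing off stale data as live.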

Pairing with timeouts and bulkheads

A circuit breaker without a timeout is toothless. If calls to the dependency have no deadline, a thread can wait indefinitely, and the circuit never accumulates enough failures to trip because most requests are "pending" rather than "failed". Set an explicit connection timeout and read timeout on every outbound call. The two are configured separately, for example a 500 ms connection timeout and a 5 s read timeout.
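A deadline is what turns a "pending" call into a "failed" one that the breaker can count. The sketch below shows one way to impose a deadline on any promise-returning call; `withTimeout` and `slowCall` are hypothetical names, and real HTTP clients usually offer their own timeout options that should be preferred.

```javascript
// Reject if the wrapped call does not settle within `ms`.
// Note: this stops the caller from waiting; it does not cancel the
// underlying request (that needs AbortController or client support).
function withTimeout(fn, ms) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms} ms`)),
      ms
    );
    Promise.resolve()
      .then(fn)
      .then(resolve, reject)
      .finally(() => clearTimeout(timer));
  });
}

// A hypothetical dependency that takes `delayMs` to answer.
const slowCall = (delayMs) => () =>
  new Promise((resolve) => setTimeout(() => resolve("ok"), delayMs));

async function demo() {
  const fast = await withTimeout(slowCall(10), 100);
  let failure = null;
  try {
    await withTimeout(slowCall(200), 50);
  } catch (err) {
    failure = err.message; // this rejection is what the breaker would count
  }
  return { fast, failure };
}

demo().then((result) => console.log(result));
```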

A bulkhead complements this: it limits the number of concurrent calls to a single dependency using a dedicated thread pool or semaphore. If the pool is exhausted, new requests fail fast immediately rather than queuing — containing the blast radius to calls destined for that one dependency, not the entire caller service.

Worked example: minimal circuit breaker

// Pseudo-code: SimpleCircuitBreaker
// Wraps any callable that can fail (e.g., an HTTP client call).

class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold  = options.failureThreshold  ?? 0.5;  // 50%
    this.minRequests       = options.minRequests       ?? 10;
    this.windowMs          = options.windowMs          ?? 60_000;
    this.cooldownMs        = options.cooldownMs        ?? 30_000;

    this.state         = "CLOSED";
    this.openedAt      = null;
    this.calls         = [];  // rolling window of { ts, success }, oldest first
  }

  // Call this instead of calling the dependency directly.
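  async call(fn, fallback) {
    if (this.state === "OPEN") {
      const elapsed = Date.now() - this.openedAt;
      if (elapsed < this.cooldownMs) {
        // Fail fast: don't touch the dependency.
        // (`throw` is a statement, so it cannot sit inside a ternary.)
        if (fallback) return fallback();
        throw new Error("Circuit OPEN");
      }
      // Cool-down expired → let this call through as the probe.
      // Concurrent callers could also slip through here; a production
      // breaker would allow only one in-flight probe.
      this.state = "HALF_OPEN";
    }

    try {
      const result = await fn();
      if (this.state === "HALF_OPEN") {
        // Probe passed: close the circuit and start a fresh window.
        // Reset before recording so stale failures cannot re-trip it.
        this.state = "CLOSED";
        this.calls = [];
      }
      this._record(true);
      return result;
    } catch (err) {
      this._record(false);
      if (this.state === "HALF_OPEN") {
        this._trip();  // probe failed: back to OPEN, cool-down restarts
      }
      throw err;
    }
  }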

  _record(success) {
    const now = Date.now();
    // Evict entries outside the rolling window
    this.calls = this.calls.filter(c => now - c.ts < this.windowMs);
    this.calls.push({ ts: now, success });

    if (this.calls.length >= this.minRequests) {
      const failures = this.calls.filter(c => !c.success).length;
      // >= so that a rate of exactly 50% trips, matching the worked trace
      if (failures / this.calls.length >= this.failureThreshold) {
        this._trip();
      }
    }
  }

  _trip() {
    this.state    = "OPEN";
    this.openedAt = Date.now();
  }
}

// Usage:
const breaker = new CircuitBreaker({ failureThreshold: 0.5, cooldownMs: 30_000 });

async function getUserProfile(userId) {
  return breaker.call(
    () => profileService.get(userId),          // real call
    () => ({ name: "Unknown", avatar: null })  // fallback
  );
}
🎯 Interview angle

"How would you protect your service from a flaky downstream dependency?" The answer that stands out: (1) set a timeout on every outbound call; (2) wrap calls in a circuit breaker with explicit thresholds; (3) define a fallback for when the circuit is open; (4) monitor the breaker state as a metric so an engineer knows when a dependency has tripped. Tying this to retries (rel-05) and bulkheads shows you understand the full resilience stack — that's senior-level thinking.

⚠️ Common trap

No timeout → thread/connection exhaustion. Without a read timeout, threads accumulate waiting for a dependency that never replies. Your thread pool fills, new requests queue, the queue fills, your service stops processing anything — and your circuit breaker never trips because calls are "pending" rather than "failed". Always set both a connection timeout and a read timeout on every outbound call, making timeout failures count toward the breaker's failure rate.

Cascading failure without a circuit breaker. Service A calls B; B is slow. A's threads wait. A becomes slow. C calls A; C's threads wait. C becomes slow. One overloaded leaf node has taken down the entire call chain. A circuit breaker at each service boundary — with a fast fallback — contains the blast to the single dependency, not the entire graph.

✅ Do this, not that

Do: pair every circuit breaker with an explicit timeout, a meaningful fallback, and an observable metric (state + error rate). Tune thresholds in staging before going to production — a threshold that's too sensitive will flip the circuit on normal traffic spikes. Don't: share one circuit breaker instance across multiple logical dependencies — each dependency gets its own breaker, so a failure in the recommendation engine doesn't trip the breaker protecting the payment service.
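One way to enforce "one breaker per dependency" is a small registry keyed by dependency name. The sketch below is illustrative: `makeBreakerRegistry` and the stand-in breaker objects are hypothetical, and a real factory would construct a full breaker with its own rolling window.

```javascript
// One breaker per logical dependency, created lazily and then reused.
// `createBreaker` is a caller-supplied factory (hypothetical here).
function makeBreakerRegistry(createBreaker) {
  const breakers = new Map();
  return function breakerFor(dependency) {
    if (!breakers.has(dependency)) {
      breakers.set(dependency, createBreaker(dependency));
    }
    return breakers.get(dependency);
  };
}

// Usage with a stand-in breaker object instead of a full implementation.
const breakerFor = makeBreakerRegistry((name) => ({ name, state: "CLOSED" }));

breakerFor("recommendations-svc").state = "OPEN"; // recommendations trip...
console.log(breakerFor("payment-svc").state);      // ...payments stay CLOSED
console.log(breakerFor("recommendations-svc").state); // OPEN, same instance reused
```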

Under the hood: the state machine implemented

The three states are easy to draw in a diagram. What matters for real debugging is knowing exactly which variables drive the transitions and how to trace a breaker through a failure sequence. A breaker needs five pieces of state: a failure counter (or rolling window), a request counter, a threshold ratio, an open-timestamp, and the current state enum.

Worked numeric example. Parameters: failureThreshold = 0.5, minRequests = 10, windowMs = 60_000, cooldownMs = 30_000. The first 8 calls succeed; Service B starts failing at T=6.

| Time (s) | Event | Window calls | Failures | Failure rate | State |
| --- | --- | --- | --- | --- | --- |
| T=0 to T=5 | 8 successful calls, 0 failures | 8 | 0 | 0% (min not met) | CLOSED |
| T=6 | Call 9 fails | 9 | 1 | 11% (min not met) | CLOSED |
| T=7 | Call 10 fails | 10 | 2 | 20% (min met, below 50%) | CLOSED |
| T=8 to T=12 | Calls 11–15 all fail | 15 | 7 | 47% (still below 50%) | CLOSED |
| T=13 | Call 16 fails → 8/16 = 50%, threshold hit | 16 | 8 | 50% ≥ threshold → TRIP | OPEN (openedAt = T=13) |
| T=14 to T=42 | All incoming calls fail fast, upstream gets 0 requests | | | | OPEN |
| T=43 | elapsed = 30 s ≥ cooldownMs → probe allowed | | | | HALF-OPEN |
| T=43 (probe) | Probe succeeds → reset window | 1 | 0 | 0% | CLOSED |

If instead the probe at T=43 fails, openedAt is reset to T=43, and the 30 s cool-down restarts — the next probe is allowed at T=73.
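The trip point in the trace can be checked by replaying it against a rolling window. This is a sketch only: the exact timestamps of the first eight calls are assumed (the trace gives just the range T=0 to T=5), and `record` here is a standalone helper taking an explicit timestamp rather than reading a clock.

```javascript
// Replay the worked trace: 8 successes, then 8 consecutive failures.
const THRESHOLD = 0.5, MIN_REQUESTS = 10, WINDOW_MS = 60_000;

let windowCalls = [];  // { ts, success }
let trippedAt = null;  // timestamp (ms) of the first trip

function record(ts, success) {
  windowCalls = windowCalls.filter((c) => ts - c.ts < WINDOW_MS); // evict stale
  windowCalls.push({ ts, success });
  const failures = windowCalls.filter((c) => !c.success).length;
  const rate = failures / windowCalls.length;
  const minMet = windowCalls.length >= MIN_REQUESTS;
  if (minMet && rate >= THRESHOLD && trippedAt === null) trippedAt = ts;
  return { calls: windowCalls.length, failures, rate, minMet };
}

// [time in seconds, success?] for calls 1..16.
// Timestamps of the first 8 calls are assumed for illustration.
const trace = [
  [0, true], [1, true], [1, true], [2, true],
  [3, true], [4, true], [4, true], [5, true],
  [6, false], [7, false],
  [8, false], [9, false], [10, false], [11, false], [12, false],
  [13, false],
];

for (const [sec, success] of trace) {
  const row = record(sec * 1000, success);
  console.log(
    `T=${sec} calls=${row.calls} failures=${row.failures} ` +
    `rate=${(row.rate * 100).toFixed(0)}% minMet=${row.minMet}`
  );
}
console.log(`tripped at T=${trippedAt / 1000}`); // tripped at T=13
```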

Pseudo-code implementing the transitions with comments on which line drives which row above:

// State variables (per breaker instance, per dependency)
state     = "CLOSED"
openedAt  = null
window    = []  // rolling array of {ts, success}

function call(fn, fallback) {
  if (state === "OPEN") {
    if (Date.now() - openedAt < COOLDOWN_MS) {
      return fallback()   // fail fast: rows T=14..T=42 above
    }
    state = "HALF_OPEN"    // row T=43
  }

  try {
    result = fn()         // actual dependency call
    if (state === "HALF_OPEN") {
      state  = "CLOSED"   // probe passed: row "Probe succeeds"
      window = []         // reset first so old failures cannot re-trip the breaker
    }
    record(success=true)  // after a probe the fresh window holds 1 call, 0 failures
    return result
  } catch (err) {
    record(success=false)
    if (state === "HALF_OPEN") trip()  // probe failed: reset cool-down
    throw err
  }
}

function record(success) {
  now    = Date.now()
  window = window.filter(e => now - e.ts < WINDOW_MS)  // evict stale
  window.push({ts: now, success})

  if (window.length >= MIN_REQUESTS) {             // row T=7 — min-requests gate
    failures = window.filter(e => !e.success).length
    if (failures / window.length >= THRESHOLD) trip() // row T=13
  }
}

function trip() {
  state    = "OPEN"
  openedAt = Date.now()
}

How to debug & inspect it

A circuit breaker that has tripped is silent by default — requests fail fast with no call to the upstream, so the upstream logs show nothing. If you're seeing sudden 5xx/client errors without any corresponding upstream activity, the breaker is the first place to look.

# 1. Check current breaker state via a metrics endpoint
$ curl -s http://localhost:9090/metrics | grep 'circuit_breaker'
circuit_breaker_state{dependency="recommendations-svc"} 1
circuit_breaker_state{dependency="inventory-svc"} 2
# 0=CLOSED 1=HALF_OPEN 2=OPEN
circuit_breaker_failure_rate{dependency="inventory-svc"} 0.72
# 72% failure rate in the rolling window, clearly past the 50% threshold

# 2. Check when it tripped (openedAt) and how long it has been open
$ curl -s http://localhost:8080/actuator/circuitbreakerevents/inventory-svc | jq '.circuitBreakerEvents[-3:]'
[{"type":"STATE_TRANSITION","stateTransition":"CLOSED_TO_OPEN","creationTime":"2025-06-20T10:14:00Z"},
 {"type":"STATE_TRANSITION","stateTransition":"OPEN_TO_HALF_OPEN","creationTime":"2025-06-20T10:14:30Z"},
 {"type":"STATE_TRANSITION","stateTransition":"HALF_OPEN_TO_OPEN","creationTime":"2025-06-20T10:14:31Z"}]
# Breaker tripped at 10:14:00, probe allowed at 10:14:30, probe failed at 10:14:31: service still down

The key Prometheus metrics to expose from a circuit breaker and alert on:

| Metric | What it tells you | Alert condition |
| --- | --- | --- |
| circuit_breaker_state{dep="X"} | Current state (0/1/2 = CLOSED/HALF/OPEN) | Alert if OPEN for > 5 min (dependency stuck down) |
| circuit_breaker_failure_rate{dep="X"} | Rolling failure rate in the window | Alert at >30%, even before tripping, to catch degradation early |
| circuit_breaker_calls_total{outcome="short_circuited"} | Requests that hit the OPEN breaker and returned fallback | Spike = breaker is open; count the user impact |
| circuit_breaker_slow_call_rate{dep="X"} | Fraction of calls exceeding the slow-call threshold | Alert at >20%; slowness often precedes failures |
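To make the first two metrics concrete, the sketch below renders breaker snapshots as Prometheus text-format sample lines by hand. It is an illustration only: the snapshot shape and the `dependency` label name are assumptions, label values are not escaped, and a real service would normally use a Prometheus client library rather than string formatting.

```javascript
// Render breaker snapshots as Prometheus text-format sample lines.
// State encoding follows the lesson: 0=CLOSED, 1=HALF_OPEN, 2=OPEN.
const STATE_CODE = { CLOSED: 0, HALF_OPEN: 1, OPEN: 2 };

// snapshots: [{ dependency, state, failureRate }] (shape assumed for this sketch)
function renderMetrics(snapshots) {
  const lines = [];
  for (const { dependency, state, failureRate } of snapshots) {
    const labels = `{dependency="${dependency}"}`; // no escaping: sketch only
    lines.push(`circuit_breaker_state${labels} ${STATE_CODE[state]}`);
    lines.push(`circuit_breaker_failure_rate${labels} ${failureRate}`);
  }
  return lines.join("\n") + "\n";
}

const output = renderMetrics([
  { dependency: "recommendations-svc", state: "HALF_OPEN", failureRate: 0.4 },
  { dependency: "inventory-svc", state: "OPEN", failureRate: 0.72 },
]);
console.log(output);
```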

Tuning thresholds: common failure modes and their fixes:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Breaker won't close: stays OPEN indefinitely | Dependency is genuinely still down; OR the probe itself is timing out because the timeout is shorter than the recovery time | Verify the downstream is actually healthy (direct health check); increase the probe timeout; lengthen the cool-down to give the service more recovery time |
| Breaker flaps: rapidly oscillates OPEN ↔ HALF-OPEN ↔ OPEN | Cool-down too short; dependency is partially recovered (intermittent failures); or single-probe half-open is too aggressive for a slowly recovering service | Increase cooldownMs; configure the half-open probe to require multiple consecutive successes before closing (e.g. 3 successes) |
| Breaker trips on startup / deployment | minRequests too low; first few warmup calls are slow and counted as failures | Increase minRequests to at least 10–20; exclude health-check calls from the counter |
| Breaker never trips during an obvious outage | Timeout not set, so calls hang as "pending" and aren't recorded as failures; OR errors are caught silently before reaching the breaker | Set an explicit read timeout; ensure all error paths (timeouts, connection resets) are counted as failures in record(success=false) |
| Fallback returns stale/wrong data without any indication | No observability on fallback activations | Log and increment a circuit_breaker_fallback_total counter every time a fallback fires; set an alert on it |
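The anti-flapping fix in the table (require several consecutive probe successes before closing) can be sketched as a small gate. `HalfOpenGate` is a hypothetical helper, and the default of 3 successes simply mirrors the example in the table.

```javascript
// Half-open gate that closes only after N consecutive probe successes.
class HalfOpenGate {
  constructor(requiredSuccesses = 3) {
    this.requiredSuccesses = requiredSuccesses;
    this.consecutive = 0;
  }

  // Feed one probe outcome; returns the state the breaker should move to.
  onProbe(success) {
    if (!success) {
      this.consecutive = 0;
      return "OPEN";        // any failure re-opens and restarts the cool-down
    }
    this.consecutive += 1;
    if (this.consecutive >= this.requiredSuccesses) {
      this.consecutive = 0;
      return "CLOSED";      // enough evidence of recovery
    }
    return "HALF_OPEN";     // keep probing
  }
}

// Two successes, a failure, then three successes in a row.
// (In a real breaker a cool-down would elapse after the OPEN result.)
const gate = new HalfOpenGate(3);
const seen = [true, true, false, true, true, true].map((ok) => gate.onProbe(ok));
console.log(seen.join(", "));
// HALF_OPEN, HALF_OPEN, OPEN, HALF_OPEN, HALF_OPEN, CLOSED
```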

Breaker tuning checklist:

  1. Expose current state and failure rate as Prometheus metrics; set an alert for OPEN > 5 min.
  2. Set a conservative minRequests (≥10) to avoid tripping on startup noise.
  3. Verify every outbound call has both a connect timeout and a read timeout, and that timeouts are counted as failures by the breaker.
  4. Configure a meaningful fallback — not just a bare error, but a degraded response or cached value.
  5. Test the breaker in staging: inject failures (return 500 from a mock upstream) and verify the breaker trips at the expected failure count, stays open for the cool-down period, and closes on probe success.
  6. Each dependency gets its own breaker instance — never share a breaker across two different downstream services.

By the numbers

Scenario: Service A calls a recommendation engine (Service B) at 200 req/s, with a 5 s read timeout on each call. The breaker is configured with a rolling window of the last 20 calls, a failure threshold of 50%, and a 30 s cool-down.

Rolling window trace: when does the breaker trip?

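Service B starts failing at T=0. The breaker window holds the last 20 calls. Each row below shows the window state after the indicated call. The timeline is stretched for readability: at the scenario's 200 req/s, 20 calls would arrive in about 0.1 s rather than 12 s.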

| Time (s) | Event | Window calls | Failures in window | Failure rate | State |
| --- | --- | --- | --- | --- | --- |
| T=0 – T=8 | Calls 1–16: 10 succeed, 6 fail | 16 | 6 | 37.5% (below 50%) | CLOSED |
| T=9 | Call 17 fails → 7/17 = 41.2% | 17 | 7 | 41.2% | CLOSED |
| T=10 | Call 18 fails → 8/18 = 44.4% | 18 | 8 | 44.4% | CLOSED |
| T=11 | Call 19 fails → 9/19 = 47.4% | 19 | 9 | 47.4% | CLOSED |
| T=12 | Call 20 fails → 10/20 = 50% ≥ threshold | 20 | 10 | 50% → TRIP | OPEN (openedAt = T=12) |
| T=12 – T=42 | All calls fail fast (fallback returned, no request to B) | | | | OPEN |
| T=42 | elapsed = 30 s ≥ cooldown → probe allowed | | | | HALF-OPEN |
| T=42 (probe) | Probe succeeds → window reset | 1 | 0 | 0% | CLOSED |

Source for rolling-window mechanics: Martin Fowler — CircuitBreaker; production parameters from Azure Architecture Center — Circuit Breaker pattern.
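Unlike the time-based window used earlier, this scenario counts the last 20 calls regardless of when they happened. A count-based window can be sketched as follows; `CountWindow` is a hypothetical class, the ordering of the first 16 outcomes is assumed, and evaluating the threshold only once the window is full is an assumption of this example.

```javascript
// Count-based rolling window: remembers only the last `size` outcomes.
class CountWindow {
  constructor(size) {
    this.size = size;
    this.outcomes = []; // true = success, false = failure, oldest first
  }

  record(success) {
    this.outcomes.push(success);
    if (this.outcomes.length > this.size) this.outcomes.shift(); // evict oldest
    const failures = this.outcomes.filter((ok) => !ok).length;
    return {
      calls: this.outcomes.length,
      failures,
      rate: failures / this.outcomes.length,
      full: this.outcomes.length === this.size,
    };
  }
}

// Calls 1-16: 10 successes then 6 failures (ordering assumed); calls 17-20 fail.
const outcomes = [...Array(10).fill(true), ...Array(10).fill(false)];

const win = new CountWindow(20);
let trippedOnCall = null;
outcomes.forEach((ok, i) => {
  const snap = win.record(ok);
  // Assumption: the threshold is only evaluated once the window is full.
  if (trippedOnCall === null && snap.full && snap.rate >= 0.5) trippedOnCall = i + 1;
});
console.log(`tripped on call ${trippedOnCall}`); // tripped on call 20

// One more failure pushes out the oldest success: 11 of the last 20 failed.
const afterExtra = win.record(false);
console.log(afterExtra);
```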

Calls and resources saved by failing fast

Once OPEN at T=12, the breaker blocks all requests to Service B for the 30 s cool-down. At 200 req/s, how much waste is avoided?

calls_blocked  = 200 req/s × 30 s = 6 000 hung calls averted
resource_saved = 6 000 calls × 5 s timeout = 30 000 thread-seconds

# Without a circuit breaker, each of those 6 000 calls would hold a thread for up to 5 s.
# A typical app server with a 200-thread pool would exhaust its pool in:
#   200 threads / 200 req/s = 1 second
# → the entire service freezes within 1 s of B becoming slow.
# With the breaker open, all 6 000 calls return instantly (fallback),
# freeing threads to handle other traffic.

In concrete terms: the breaker converts 30 000 thread-seconds of blocked capacity into near-zero thread-time (a fast fallback return costs microseconds). That is the quantified value of failing fast.
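The same arithmetic as a reusable helper, useful for plugging in your own traffic numbers. `failFastSavings` is a hypothetical function, and the result is an upper bound since it assumes every blocked call would otherwise have waited for the full timeout.

```javascript
// Upper bound on work avoided while the breaker is OPEN.
// Assumes every blocked call would otherwise wait the full timeout.
function failFastSavings({ ratePerSec, cooldownSec, timeoutSec }) {
  const callsBlocked = ratePerSec * cooldownSec;
  const threadSecondsSaved = callsBlocked * timeoutSec;
  return { callsBlocked, threadSecondsSaved };
}

const savings = failFastSavings({ ratePerSec: 200, cooldownSec: 30, timeoutSec: 5 });
console.log(savings); // { callsBlocked: 6000, threadSecondsSaved: 30000 }
```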

Decision math: threshold vs. flap trade-off

Choosing the failure threshold and minimum request count involves two opposing risks:

ThresholdMin requestsTrips after…Risk
20%51 failure in 5 callsFalse positives: normal startup noise or a single slow call trips the breaker
50%2010 failures in 20 calls (as traced above)Balanced: statistically meaningful before tripping; 30 s cool-down contains damage
80%5040 failures in 50 callsTrips too late: 40 failed calls × 5 s = 200 thread-seconds wasted before OPEN

Break-even: the breaker should trip before the thread pool is exhausted. With a pool of P threads, an incoming rate of R req/s, and a timeout of T seconds (with T longer than P / R), hung calls fill the pool in P / R seconds, so the breaker must gather enough samples to trip within that time. With P=200 threads, R=200 req/s, T=5 s: the pool fills in 1 s, so the minimum request count must be ≤ R × 1 s = 200 calls. A 20-call window fills in 20 / 200 = 0.1 s at 200 req/s, comfortably inside that budget. This assumes failures are reported quickly; a call that only fails when its 5 s timeout expires is not counted until then.
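The break-even check can be expressed as a short calculation. This is a sketch with a hypothetical `tripBudget` helper; it compares how fast the pool fills with how fast the breaker's window fills, and assumes failures are recorded as soon as calls arrive.

```javascript
// Compare pool exhaustion time (P / R) with the time needed to fill
// the breaker's window. Assumes failures are recorded promptly.
function tripBudget({ poolSize, ratePerSec, windowCalls }) {
  const poolFillsInSec = poolSize / ratePerSec;                 // P / R
  const maxCallsBeforeExhaustion = ratePerSec * poolFillsInSec; // equals P
  const windowFillsInSec = windowCalls / ratePerSec;
  return {
    poolFillsInSec,
    maxCallsBeforeExhaustion,
    windowFillsInSec,
    tripsInTime: windowFillsInSec <= poolFillsInSec,
  };
}

const budget = tripBudget({ poolSize: 200, ratePerSec: 200, windowCalls: 20 });
console.log(budget);
// { poolFillsInSec: 1, maxCallsBeforeExhaustion: 200,
//   windowFillsInSec: 0.1, tripsInTime: true }
```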

🧠 Quick check

1. Why does a circuit breaker enter the HALF-OPEN state instead of going directly from OPEN back to CLOSED?

Going directly from OPEN to CLOSED would send full production traffic to a fragile, recovering service all at once, potentially knocking it over again immediately. HALF-OPEN lets through just one probe request (or a small fixed number) as a controlled test of recovery before reopening the floodgates.

2. Service A calls Service B with no timeout and no circuit breaker. B becomes slow (responds in 45 s). What happens to A's thread pool?

With no timeout, each of A's threads is held for the full 45 s that B takes to respond. A bounded thread pool fills completely, new work queues behind it, and the queue eventually overflows. A goes down, not because A's own code is broken, but because it has no mechanism to bound the damage from B's slowness.

3. You have a minimum request count of 10 and a failure threshold of 60%. After 5 requests, all 5 have failed (100%). Does the circuit trip?

No. The minimum request count is a statistical guard. With only 5 samples, the failure rate is not yet statistically meaningful, and those 5 failures might be startup noise. The breaker only evaluates the threshold once at least 10 requests have been recorded.

4. When a circuit is OPEN, what is the best behavior for a non-critical feature like "related product recommendations"?

Return an empty or hidden recommendations section. That is graceful degradation: the page still loads and the core experience works, while the recommendations section is quietly empty. Returning a 500 degrades the entire page for one optional feature. Blocking until the dependency responds is worse: it ties up resources and defeats the purpose of the circuit breaker.

✍️ Exercise: design circuit breakers for an e-commerce checkout

An e-commerce checkout page calls three downstream services: (1) a Payment service, (2) a Fraud Detection service, and (3) a Loyalty Points service. Payments are critical; Fraud Detection is important but can be skipped with a conservative fallback; Loyalty Points are optional. Design circuit breaker policies — thresholds, cool-down, and fallbacks — for each. What happens during a Fraud Detection outage? What happens during a Payment outage?

Model answer:

Rubric: ✓ different thresholds per dependency criticality ✓ meaningful fallback per service ✓ correct escalation: Payment = hard failure, Fraud = conservative default, Points = async degrade ✓ user experience considered in fallback wording. Four out of four = excellent.

Key takeaways

Sources & further reading