Reliability & Scale · Lesson 06

Circuit Breaker Pattern

When a downstream service is drowning, the last thing it needs is your service pouring more requests on top. The circuit breaker is an automatic trip-switch: it detects the failure, cuts the flow, and only restores it once the downstream has had a chance to breathe.

⏱ 11 min Difficulty: core Prereq: Retries & Backoff (rel-05)

By the end you'll be able to

Describe the three circuit-breaker states and the conditions that transition between them.
Explain why hammering a failing dependency is dangerous and how a circuit breaker prevents it.
Sketch a circuit breaker in code, including thresholds, cool-down, and graceful degradation.

The problem: hammering a drowning neighbor

Picture a neighborhood sharing a single water main. One house springs a burst pipe; the water pressure in the whole street drops. Every other house starts running all their taps harder, trying to compensate — which drops the pressure even more. Now the problem has spread from one burst pipe to the entire street.

Microservices fail the same way. Service A calls Service B. B is overloaded and starts responding slowly. A's request threads pile up waiting for B to answer. A's own response times balloon. Services C and D that call A now start timing out. One slow dependency has metastasized into a full cascading failure — not because B was catastrophically broken, but because everyone kept hammering it instead of backing off.

This is the problem the circuit breaker was designed to solve. The name is borrowed directly from electrical engineering: a physical circuit breaker trips when current exceeds a safe threshold, protecting the wiring. In software, a circuit breaker trips when a downstream's error rate exceeds a threshold, protecting threads, connections, and users.

The three states

A circuit breaker wraps every call to an external dependency and maintains a small state machine with three states:

The circuit breaker state machine. In CLOSED, calls pass through. When failures exceed a threshold, it trips to OPEN and calls fail fast without touching the dependency. After a cool-down period it enters HALF-OPEN and allows a single probe request. Success resets to CLOSED; failure returns to OPEN.

Closed

The normal operating state. All calls pass through to the dependency. The breaker counts successes and failures inside a rolling time window (e.g., the last 60 seconds). As long as the failure rate stays below a configured threshold (e.g., 50%), nothing changes.

Open

The breaker has tripped. Calls are rejected immediately — the dependency is not contacted at all. This "fail fast" behavior is crucial: instead of tying up threads waiting for a service that can't respond, the caller gets an instant error and can apply a fallback. The dependency simultaneously gets zero additional load, giving it room to recover.

Half-Open

After a configured cool-down period (e.g., 30 seconds), the breaker enters a cautious probe state. It lets a single probe request (or a small fixed number of them) through to the real dependency. If the probe succeeds, the circuit closes and normal traffic resumes. If the probe fails, the circuit trips back to OPEN and the cool-down restarts. This gradual re-entry avoids flooding a fragile, recovering service the moment the cool-down ends.
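One way to enforce the probe limit is a small gate that admits a fixed number of probes and only reports CLOSED once all of them have succeeded. This is a minimal sketch, not a reference implementation: `HalfOpenGate`, `maxProbes`, `tryAdmit`, and `report` are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: limits how many probe calls HALF-OPEN admits.
class HalfOpenGate {
  constructor(maxProbes = 1) {
    this.maxProbes = maxProbes; // probes admitted per half-open episode
    this.admitted = 0;          // probes let through so far
    this.succeeded = 0;         // probes that came back healthy
  }

  // Returns true if the caller may contact the real dependency.
  tryAdmit() {
    if (this.admitted >= this.maxProbes) return false; // everyone else fails fast
    this.admitted += 1;
    return true;
  }

  // Report a probe result; returns the state the breaker should move to.
  report(success) {
    if (!success) return "OPEN"; // any failed probe re-opens the circuit
    this.succeeded += 1;
    return this.succeeded >= this.maxProbes ? "CLOSED" : "HALF_OPEN";
  }
}

const gate = new HalfOpenGate(3);
const admitted = [gate.tryAdmit(), gate.tryAdmit(), gate.tryAdmit(), gate.tryAdmit()];
const states = [gate.report(true), gate.report(true), gate.report(true)];
console.log(admitted); // [ true, true, true, false ]
console.log(states);   // [ 'HALF_OPEN', 'HALF_OPEN', 'CLOSED' ]
```

Requiring several successes before closing trades a slightly slower recovery for less risk of re-tripping on a dependency that is only partly healthy.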

Configurable thresholds

A production circuit breaker needs at least four tuneable parameters:

Parameter	Purpose	Example default
Window duration	How long to count failures over	60 s rolling
Failure threshold	Error rate % that trips the breaker	50%
Minimum request count	Ignore threshold until N requests seen (avoids tripping on startup noise)	10 requests
Cool-down (sleep window)	How long to stay OPEN before probing	30 s

The minimum request count matters: if you've only had 2 requests and both failed, that's 100%, but it's not statistically meaningful. Without a floor count, a breaker can trip the moment a service boots up, just because the first two warmup calls timed out.
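The interaction between the failure threshold and the minimum request count fits in a few lines. The function below is an illustrative sketch; `shouldTrip` and its parameter names are assumptions, with defaults taken from the table above.

```javascript
// Decide whether a window of calls should trip the breaker.
// `minRequests` is the floor below which the failure rate is ignored.
function shouldTrip(failures, total, threshold = 0.5, minRequests = 10) {
  if (total < minRequests) return false; // too few samples to judge
  return failures / total >= threshold;
}

console.log(shouldTrip(2, 2));  // false: 100% failures, but only 2 calls
console.log(shouldTrip(4, 10)); // false: 40% is below the 50% threshold
console.log(shouldTrip(5, 10)); // true: 50% with the floor met
```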

Fallbacks and graceful degradation

A circuit breaker alone just replaces a slow failure with a fast one. The real value is pairing it with a fallback: an alternative behavior when the dependency is unavailable. Fallback options, roughly from best to worst user experience:

Cached data — return the last known-good response if it's still within an acceptable staleness window.
Default / empty state — return an empty list, a zero count, or a neutral default rather than an error.
Partial response — omit the unavailable section and flag it: { "recommendations": null, "recommendations_available": false }.
Queue for later — accept the write and process it once the dependency recovers (for async operations).
Clear error message — if no fallback is viable, return a user-friendly message rather than a 500.

The goal is graceful degradation: a service that behaves "less well" during a dependency outage is far preferable to one that becomes completely unavailable.
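The first two fallback options (cached data, then a neutral default) can be combined in one helper. This is a sketch under stated assumptions: `StaleCache`, `getOrDefault`, the `degraded` and `source` fields, and the five-minute staleness limit are all hypothetical.

```javascript
// Hypothetical last-known-good cache with a staleness limit.
class StaleCache {
  constructor(maxAgeMs, now = () => Date.now()) {
    this.maxAgeMs = maxAgeMs;
    this.now = now;           // injectable clock, handy for tests
    this.entries = new Map(); // key -> { value, storedAt }
  }

  // Store the latest good response from the dependency.
  put(key, value) {
    this.entries.set(key, { value, storedAt: this.now() });
  }

  // Returns the cached value if it is fresh enough, otherwise the default.
  getOrDefault(key, defaultValue) {
    const entry = this.entries.get(key);
    if (entry && this.now() - entry.storedAt <= this.maxAgeMs) {
      return { value: entry.value, degraded: true, source: "cache" };
    }
    return { value: defaultValue, degraded: true, source: "default" };
  }
}

let fakeNow = 0; // simulated clock in milliseconds
const cache = new StaleCache(5 * 60_000, () => fakeNow);
cache.put("recs:42", ["sku-1", "sku-2"]);

fakeNow = 60_000; // 1 minute later: still fresh
console.log(cache.getOrDefault("recs:42", []).source); // cache

fakeNow = 10 * 60_000; // 10 minutes later: too stale
console.log(cache.getOrDefault("recs:42", []).source); // default
```

Marking the result as degraded lets the caller flag the response, in the spirit of the `recommendations_available` field shown above.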

Pairing with timeouts and bulkheads

A circuit breaker without a timeout is toothless. If calls to the dependency have no deadline, a thread can wait indefinitely, and the circuit never accumulates enough failures to trip because most requests are "pending" rather than "failed". Set an explicit connection timeout and read timeout on every outbound call. These are usually configured separately, for example a 500 ms connection timeout and a 5 s read timeout.
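If the client library offers no timeout option of its own, a deadline can be imposed from the outside so that a hung call turns into a failure the breaker can count. The sketch below assumes a promise-returning call; `withTimeout` and `TimeoutError` are hypothetical names. Note that it only stops waiting: it does not cancel the underlying request.

```javascript
// Error type used to mark deadline expiry (hypothetical name).
class TimeoutError extends Error {}

// Rejects with a TimeoutError if `fn` has not settled within `ms`.
function withTimeout(fn, ms) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new TimeoutError(`no response within ${ms} ms`)),
      ms
    );
    Promise.resolve()
      .then(fn)                             // run the call; sync throws become rejections
      .then(resolve, reject)                // first settlement wins
      .finally(() => clearTimeout(timer));  // don't leave the timer running
  });
}

// Example: a call that takes 200 ms against a 20 ms deadline.
const slowCall = () => new Promise((resolve) => setTimeout(() => resolve("late"), 200));
withTimeout(slowCall, 20).catch((err) => {
  console.log(err instanceof TimeoutError); // true
});
```

Wrapping the dependency call this way before handing it to the breaker means a timeout surfaces as a rejection and is recorded like any other failure.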

A bulkhead complements this: it limits the number of concurrent calls to a single dependency using a dedicated thread pool or semaphore. If the pool is exhausted, new requests fail fast immediately rather than queuing — containing the blast radius to calls destined for that one dependency, not the entire caller service.
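A semaphore-style bulkhead can be sketched as a counter of in-flight calls. This is illustrative only: `Bulkhead`, `maxConcurrent`, and the "Bulkhead full" error are assumptions, and a production version would usually expose metrics as well.

```javascript
// Hypothetical semaphore-style bulkhead: caps concurrent calls to one dependency.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.inFlight = 0;
  }

  // Runs `fn` if a slot is free; otherwise rejects immediately instead of queuing.
  async run(fn) {
    if (this.inFlight >= this.maxConcurrent) {
      throw new Error("Bulkhead full");
    }
    this.inFlight += 1;
    try {
      return await fn();
    } finally {
      this.inFlight -= 1; // always free the slot, even on failure
    }
  }
}

const bulkhead = new Bulkhead(1);
let finishFirst; // resolver for the first (still pending) call
const firstCall = bulkhead.run(() => new Promise((resolve) => { finishFirst = resolve; }));

// The only slot is taken, so a second call is rejected straight away.
bulkhead.run(() => "second").catch((err) => {
  console.log(err.message); // Bulkhead full
  finishFirst("done");      // let the first call complete
});
```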

Worked example: minimal circuit breaker

// Pseudo-code: SimpleCircuitBreaker
// Wraps any callable that can fail (e.g., an HTTP client call).

class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold  = options.failureThreshold  ?? 0.5;  // 50%
    this.minRequests       = options.minRequests       ?? 10;
    this.windowMs          = options.windowMs          ?? 60_000;
    this.cooldownMs        = options.cooldownMs        ?? 30_000;

    this.state         = "CLOSED";
    this.openedAt      = null;
    this.calls         = [];  // rolling window of { ts, success }
  }

  // Call this instead of calling the dependency directly.
  async call(fn, fallback) {
    if (this.state === "OPEN") {
      const elapsed = Date.now() - this.openedAt;
      if (elapsed < this.cooldownMs) {
        // Fail fast: don't touch the dependency.
        // (`throw` is a statement, so it cannot sit inside a ternary.)
        if (fallback) return fallback();
        throw new Error("Circuit OPEN");
      }
      // Cool-down expired → probe
      this.state = "HALF_OPEN";
    }

    try {
      const result = await fn();
      if (this.state === "HALF_OPEN") {
        this.state = "CLOSED";  // probe passed: resume normal traffic
        this.calls = [];        // drop pre-trip failures so they can't re-trip the breaker
      }
      this._record(true);
      return result;
    } catch (err) {
      if (this.state === "HALF_OPEN") {
        this._trip();  // probe failed: back to OPEN, cool-down restarts
      } else {
        this._record(false);
      }
      throw err;
    }
  }

  _record(success) {
    const now = Date.now();
    // Evict entries outside the rolling window
    this.calls = this.calls.filter(c => now - c.ts < this.windowMs);
    this.calls.push({ ts: now, success });

    if (this.calls.length >= this.minRequests) {
      const failures = this.calls.filter(c => !c.success).length;
      if (failures / this.calls.length >= this.failureThreshold) {  // trip at or above the threshold
        this._trip();
      }
    }
  }

  _trip() {
    this.state    = "OPEN";
    this.openedAt = Date.now();
  }
}

// Usage:
const breaker = new CircuitBreaker({ failureThreshold: 0.5, cooldownMs: 30_000 });

async function getUserProfile(userId) {
  return breaker.call(
    () => profileService.get(userId),          // real call
    () => ({ name: "Unknown", avatar: null })  // fallback
  );
}

🎯 Interview angle

"How would you protect your service from a flaky downstream dependency?" The answer that stands out: (1) set a timeout on every outbound call; (2) wrap calls in a circuit breaker with explicit thresholds; (3) define a fallback for when the circuit is open; (4) monitor the breaker state as a metric so an engineer knows when a dependency has tripped. Tying this to retries (rel-05) and bulkheads shows you understand the full resilience stack — that's senior-level thinking.

⚠️ Common trap

No timeout → thread/connection exhaustion. Without a read timeout, threads accumulate waiting for a dependency that never replies. Your thread pool fills, new requests queue, the queue fills, your service stops processing anything — and your circuit breaker never trips because calls are "pending" rather than "failed". Always set both a connection timeout and a read timeout on every outbound call, making timeout failures count toward the breaker's failure rate.

Cascading failure without a circuit breaker. Service A calls B; B is slow. A's threads wait. A becomes slow. C calls A; C's threads wait. C becomes slow. One overloaded leaf node has taken down the entire call chain. A circuit breaker at each service boundary — with a fast fallback — contains the blast to the single dependency, not the entire graph.

✅ Do this, not that

Do: pair every circuit breaker with an explicit timeout, a meaningful fallback, and an observable metric (state + error rate). Tune thresholds in staging before going to production; a threshold that's too sensitive will flip the circuit on normal traffic spikes.

Don't: share one circuit breaker instance across multiple logical dependencies. Each dependency gets its own breaker, so a failure in the recommendation engine doesn't trip the breaker protecting the payment service.
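A simple way to keep breakers separate is a registry keyed by dependency name. The sketch below is an assumption-laden illustration: `BreakerRegistry` and the service names are hypothetical, and the factory returns a plain stand-in object where real code would construct a circuit breaker.

```javascript
// Hypothetical registry: hands out one breaker per dependency name.
class BreakerRegistry {
  constructor(createBreaker) {
    this.createBreaker = createBreaker; // factory: (name) => breaker instance
    this.breakers = new Map();
  }

  // Returns the existing breaker for `name`, creating it on first use.
  get(name) {
    if (!this.breakers.has(name)) {
      this.breakers.set(name, this.createBreaker(name));
    }
    return this.breakers.get(name);
  }
}

// Stand-in factory for the example; a real one would build a circuit breaker.
const registry = new BreakerRegistry((name) => ({ name, state: "CLOSED" }));
const payments = registry.get("payment-svc");
const recs = registry.get("recommendations-svc");

recs.state = "OPEN";         // a recommendations outage...
console.log(payments.state); // CLOSED: the payments breaker is unaffected
```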

Under the hood: the state machine implemented

The three states are easy to draw in a diagram. What matters for real debugging is knowing exactly which variables drive the transitions and how to trace a breaker through a failure sequence. A breaker needs five pieces of state: a failure counter (or rolling window), a request counter, a threshold ratio, an open-timestamp, and the current state enum.

Worked numeric example. Parameters: failureThreshold = 0.5, minRequests = 10, windowMs = 60_000, cooldownMs = 30_000. Service B is healthy from T=0 to T=5 and starts failing at T=6.

Time (s)	Event	Window calls	Failures	Failure rate	State
T=0 to T=5	8 successful calls, 0 failures	8	0	0% (min not met)	CLOSED
T=6	Call 9 fails	9	1	11% (min not met)	CLOSED
T=7	Call 10 fails	10	2	20% — min met, below 50%	CLOSED
T=8 to T=12	Calls 11–15 all fail	15	7	47% — still below 50%	CLOSED
T=13	Call 16 fails → 8/16 = 50%, threshold hit	16	8	50% ≥ threshold → TRIP	OPEN (openedAt = T=13)
T=14 to T=42	All incoming calls fail fast, upstream gets 0 requests	—	—	—	OPEN
T=43	elapsed = 30 s ≥ cooldownMs → probe allowed	—	—	—	HALF-OPEN
T=43 (probe)	Probe succeeds → reset window	1	0	0%	CLOSED

If instead the probe at T=43 fails, openedAt is reset to T=43, and the 30 s cool-down restarts — the next probe is allowed at T=73.

Pseudo-code implementing the transitions with comments on which line drives which row above:

// State variables (per breaker instance, per dependency)
const THRESHOLD = 0.5, MIN_REQUESTS = 10, WINDOW_MS = 60_000, COOLDOWN_MS = 30_000
let state    = "CLOSED"
let openedAt = null
let calls    = []  // rolling window of {ts, success}

async function call(fn, fallback) {
  if (state === "OPEN") {
    if (Date.now() - openedAt < COOLDOWN_MS) {
      return fallback()   // fail fast: rows T=14..T=42 above
    }
    state = "HALF_OPEN"    // row T=43
  }

  try {
    const result = await fn()  // actual dependency call
    if (state === "HALF_OPEN") {
      state = "CLOSED"         // probe passed: row "Probe succeeds"
      calls = []               // reset first, so old failures can't re-trip the breaker
    }
    record(true)               // after a probe, the window holds just this 1 success
    return result
  } catch (err) {
    if (state === "HALF_OPEN") trip()  // probe failed: cool-down restarts
    else record(false)
    throw err
  }
}

function record(success) {
  const now = Date.now()
  calls = calls.filter(e => now - e.ts < WINDOW_MS)  // evict stale
  calls.push({ts: now, success})

  if (calls.length >= MIN_REQUESTS) {                // row T=7: min-requests gate
    const failures = calls.filter(e => !e.success).length
    if (failures / calls.length >= THRESHOLD) trip() // row T=13
  }
}

function trip() {
  state    = "OPEN"
  openedAt = Date.now()
}
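To check the trace table against the transition logic, the same sequence of calls can be replayed with a simulated clock. The sketch below is a compact, self-contained variant written for this purpose: `makeBreaker` and its injectable `now` function are assumptions, and the exact timestamps of the eight successful calls within T=0 to T=5 are made up, since the table only gives the range.

```javascript
// Compact breaker with an injectable clock so the trace can be replayed.
function makeBreaker({ threshold, minRequests, windowMs, cooldownMs, now }) {
  let state = "CLOSED";
  let openedAt = null;
  let calls = []; // rolling window of { ts, success }

  const trip = () => { state = "OPEN"; openedAt = now(); };

  return {
    // State as seen by the next incoming call (OPEN becomes HALF_OPEN after the cool-down).
    state() {
      if (state === "OPEN" && now() - openedAt >= cooldownMs) state = "HALF_OPEN";
      return state;
    },
    // Record the outcome of a call that was allowed through.
    record(success) {
      if (state === "HALF_OPEN") {
        if (!success) { trip(); return; } // probe failed: cool-down restarts
        state = "CLOSED";                 // probe passed: start a fresh window
        calls = [];
      }
      calls = calls.filter((c) => now() - c.ts < windowMs);
      calls.push({ ts: now(), success });
      const failures = calls.filter((c) => !c.success).length;
      if (calls.length >= minRequests && failures / calls.length >= threshold) trip();
    },
  };
}

let t = 0; // simulated time in seconds
const traceBreaker = makeBreaker({
  threshold: 0.5, minRequests: 10, windowMs: 60_000, cooldownMs: 30_000,
  now: () => t * 1000,
});

// Calls 1-8 succeed (T=0..5), calls 9-16 fail (T=6..13), as in the table.
const outcomes = [
  ...[0, 1, 1, 2, 3, 4, 4, 5].map((ts) => [ts, true]),
  ...[6, 7, 8, 9, 10, 11, 12, 13].map((ts) => [ts, false]),
];
let trippedAt = null;
for (const [ts, success] of outcomes) {
  t = ts;
  traceBreaker.record(success);
  if (trippedAt === null && traceBreaker.state() === "OPEN") trippedAt = ts;
}

t = 42; const stateAt42 = traceBreaker.state(); // 29 s since the trip
t = 43; const stateAt43 = traceBreaker.state(); // 30 s since the trip
traceBreaker.record(true);                      // the probe succeeds
const stateAfterProbe = traceBreaker.state();

console.log(trippedAt, stateAt42, stateAt43, stateAfterProbe);
// 13 OPEN HALF_OPEN CLOSED
```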

How to debug & inspect it

A circuit breaker that has tripped is silent by default: requests fail fast without ever reaching the dependency, so the dependency's logs show nothing. If you're seeing sudden 5xx or client-facing errors with no corresponding activity in the dependency's logs, the breaker is the first place to look.

# 1. Check current breaker state via a metrics endpoint
$ curl -s http://localhost:9090/metrics | grep 'circuit_breaker'
circuit_breaker_state{dependency="recommendations-svc"} 1
circuit_breaker_state{dependency="inventory-svc"} 2
# 0=CLOSED 1=HALF_OPEN 2=OPEN
circuit_breaker_failure_rate{dependency="inventory-svc"} 0.72
# 72% failure rate in the rolling window, clearly past the 50% threshold

# 2. Check when it tripped (openedAt) and how long it has been open
$ curl -s http://localhost:8080/actuator/circuitbreakerevents/inventory-svc | jq '.circuitBreakerEvents[-3:]'
[{"type":"STATE_TRANSITION","stateTransition":"CLOSED_TO_OPEN","creationTime":"2025-06-20T10:14:00Z"},
 {"type":"STATE_TRANSITION","stateTransition":"OPEN_TO_HALF_OPEN","creationTime":"2025-06-20T10:14:30Z"},
 {"type":"STATE_TRANSITION","stateTransition":"HALF_OPEN_TO_OPEN","creationTime":"2025-06-20T10:14:31Z"}]
# Breaker tripped at 10:14:00, probe allowed at 10:14:30, probe failed at 10:14:31: service still down

The key Prometheus metrics to expose from a circuit breaker and alert on:

Metric	What it tells you	Alert condition
`circuit_breaker_state{dep="X"}`	Current state (0/1/2 = CLOSED/HALF/OPEN)	Alert if OPEN for > 5 min (dependency stuck down)
`circuit_breaker_failure_rate{dep="X"}`	Rolling failure rate in the window	Alert at >30% even before tripping — catch degradation early
`circuit_breaker_calls_total{outcome="short_circuited"}`	Requests that hit the OPEN breaker and returned fallback	Spike = breaker is open; count the user impact
`circuit_breaker_slow_call_rate{dep="X"}`	Fraction of calls exceeding the slow-call threshold	Alert at >20% — slowness often precedes failures
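As a rough illustration of how the first two metrics could be exposed, the function below renders breaker snapshots as Prometheus text-format sample lines. It is a sketch, not a client library: `renderMetrics` and the snapshot shape are assumptions, and the `dependency` label follows the sample curl output earlier in this section rather than the `dep` shorthand in the table.

```javascript
// Numeric encoding from the table above: 0=CLOSED, 1=HALF_OPEN, 2=OPEN.
const STATE_CODES = { CLOSED: 0, HALF_OPEN: 1, OPEN: 2 };

// Renders breaker snapshots as Prometheus text-format sample lines.
function renderMetrics(snapshots) {
  const lines = [];
  for (const { dependency, state, failureRate } of snapshots) {
    lines.push(`circuit_breaker_state{dependency="${dependency}"} ${STATE_CODES[state]}`);
    lines.push(`circuit_breaker_failure_rate{dependency="${dependency}"} ${failureRate}`);
  }
  return lines.join("\n") + "\n";
}

const metricsText = renderMetrics([
  { dependency: "inventory-svc", state: "OPEN", failureRate: 0.72 },
]);
console.log(metricsText);
// circuit_breaker_state{dependency="inventory-svc"} 2
// circuit_breaker_failure_rate{dependency="inventory-svc"} 0.72
```

In practice an established Prometheus client library would normally handle registration and formatting; the point here is only the shape of the data.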

Tuning thresholds — the two failure modes:

Threshold too sensitive (e.g. 20% with minRequests=5): the breaker trips on brief traffic spikes or normal startup noise, causing false-positive outages. Symptoms: breaker flips open and closed rapidly ("flapping"), logs show OPEN→HALF_OPEN→OPEN cycles within seconds.
Threshold too permissive (e.g. 90%): the breaker doesn't trip until the dependency is almost completely down, so your service keeps hammering a failing upstream and the cascading failure isn't contained early enough.

Symptom	Likely cause	Fix
Breaker won't close — stays OPEN indefinitely	Dependency is genuinely still down; OR the probe itself is timing out because the timeout is shorter than the recovery time	Verify the downstream is actually healthy (direct health check); increase the probe timeout; lengthen the cool-down to give the service more recovery time
Breaker flaps — rapidly oscillates OPEN ↔ HALF-OPEN ↔ OPEN	Cool-down too short; dependency is partially recovered (intermittent failures); or single-probe half-open is too aggressive for a slowly recovering service	Increase `cooldownMs`; configure the half-open probe to require multiple consecutive successes before closing (e.g. 3 successes)
Breaker trips on startup / deployment	`minRequests` too low; first few warmup calls are slow and counted as failures	Increase `minRequests` to at least 10–20; exclude health-check calls from the counter
Breaker never trips during an obvious outage	Timeout not set — calls hang as "pending" and aren't recorded as failures; OR errors are caught silently before reaching the breaker	Set an explicit read timeout; ensure all error paths (timeouts, connection resets) are counted as failures in `record(success=false)`
Fallback returns stale/wrong data without any indication	No observability on fallback activations	Log and increment a `circuit_breaker_fallback_total` counter every time a fallback fires; set an alert on it

Breaker tuning checklist:

Expose current state and failure rate as Prometheus metrics; set an alert for OPEN > 5 min.
Set a conservative minRequests (≥10) to avoid tripping on startup noise.
Verify every outbound call has both a connect timeout and a read timeout, and that timeouts are counted as failures by the breaker.
Configure a meaningful fallback — not just a bare error, but a degraded response or cached value.
Test the breaker in staging: inject failures (return 500 from a mock upstream) and verify the breaker trips at the expected failure count, stays open for the cool-down period, and closes on probe success.
Each dependency gets its own breaker instance — never share a breaker across two different downstream services.

By the numbers

Scenario: Service A calls a recommendation engine (Service B) at 200 req/s, with a 5 s read timeout on each call. The breaker uses a rolling window of 20 calls, a failure threshold of 50%, and a 30 s cool-down.

Rolling window trace: when does the breaker trip?

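Service B starts failing at T=0. The breaker window holds the last 20 calls. Each row below shows the window state after the indicated call. The time axis is stretched for readability: at the full 200 req/s, 20 calls would arrive in about 0.1 s.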

Time (s)	Event	Window calls	Failures in window	Failure rate	State
T=0 – T=8	Calls 1–16: 10 succeed, 6 fail	16	6	37.5% — below 50%	CLOSED
T=9	Call 17 fails → 7/17 = 41.2%	17	7	41.2%	CLOSED
T=10	Call 18 fails → 8/18 = 44.4%	18	8	44.4%	CLOSED
T=11	Call 19 fails → 9/19 = 47.4%	19	9	47.4%	CLOSED
T=12	Call 20 fails → 10/20 = 50% ≥ threshold	20	10	50% → TRIP	OPEN (openedAt = T=12)
T=12 – T=42	All calls fail fast (fallback returned, no request to B)	—	—	—	OPEN
T=42	elapsed = 30 s ≥ cooldown → probe allowed	—	—	—	HALF-OPEN
T=42 (probe)	Probe succeeds → window reset	1	0	0%	CLOSED

Source for rolling-window mechanics: Martin Fowler — CircuitBreaker; production parameters from Azure Architecture Center — Circuit Breaker pattern.

Calls and resources saved by failing fast

Once OPEN at T=12, the breaker blocks all requests to Service B for the 30 s cool-down. At 200 req/s, how much waste is avoided?

calls_blocked = 200 req/s × 30 s = 6 000 hung calls averted
resource_saved = 6 000 calls × 5 s timeout = 30 000 thread-seconds

# Without a circuit breaker, 6 000 threads sit blocked for up to 5 s each.
# A typical app server with a 200-thread pool would exhaust its pool in:
#   200 threads / 200 req/s = 1 second
# → the entire service freezes within 1 s of B becoming slow.
# With the breaker open, all 6 000 calls return instantly (fallback),
# freeing threads to handle other traffic.

In concrete terms: the breaker converts 30 000 thread-seconds of blocked capacity into near-zero thread-time (a fast fallback return costs microseconds). That is the quantified value of failing fast.
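The same arithmetic as a small function, useful for plugging in your own traffic numbers. `failFastSavings` and its field names are hypothetical; the figures are the ones from the scenario.

```javascript
// Upper bound on the work avoided while the breaker is OPEN for one cool-down.
function failFastSavings({ requestsPerSec, cooldownSec, timeoutSec }) {
  const callsBlocked = requestsPerSec * cooldownSec;     // calls that never reach B
  const threadSecondsSaved = callsBlocked * timeoutSec;  // each could have hung for the full timeout
  return { callsBlocked, threadSecondsSaved };
}

const savings = failFastSavings({ requestsPerSec: 200, cooldownSec: 30, timeoutSec: 5 });
console.log(savings); // { callsBlocked: 6000, threadSecondsSaved: 30000 }
```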

Decision math: threshold vs. flap trade-off

Choosing the failure threshold and minimum request count involves two opposing risks:

Threshold	Min requests	Trips after…	Risk
20%	5	1 failure in 5 calls	False positives: normal startup noise or a single slow call trips the breaker
50%	20	10 failures in 20 calls (as traced above)	Balanced: statistically meaningful before tripping; 30 s cool-down contains damage
80%	50	40 failures in 50 calls	Trips too late: 40 failed calls × 5 s = 200 thread-seconds wasted before OPEN

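Break-even: the breaker should trip before the thread pool is exhausted. With a pool of P threads, an incoming rate of R req/s, and a timeout of T seconds, the pool fills in P / R seconds if every call hangs, so the breaker must trip within that time. With P=200 threads, R=200 req/s, T=5 s, the pool fills in 1 s, so the breaker must be able to decide within R × 1 s = 200 calls. A 20-call window fits easily: at 200 req/s it fills in 0.1 s. This assumes failures surface quickly as errors. If B hangs instead, nothing is recorded as a failure until the 5 s timeout fires, which is after the pool has filled, so either T must be shorter than P / R or a bulkhead must cap concurrent calls to B.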
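The break-even check can be written down directly. This is a sketch with hypothetical function names, and it assumes failures are reported as soon as each call is made (fast errors rather than hangs).

```javascript
// Seconds until a pool of `poolSize` threads is exhausted if every call hangs.
function poolFillSeconds(poolSize, requestsPerSec) {
  return poolSize / requestsPerSec;
}

// Seconds for `windowCalls` calls to arrive at the given rate.
function windowFillSeconds(windowCalls, requestsPerSec) {
  return windowCalls / requestsPerSec;
}

const budgetSec = poolFillSeconds(200, 200);   // P / R
const neededSec = windowFillSeconds(20, 200);  // window size / R
console.log(budgetSec, neededSec, neededSec <= budgetSec); // 1 0.1 true
```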

🧠 Quick check

1. Why does a circuit breaker enter the HALF-OPEN state instead of going directly from OPEN back to CLOSED?

Going directly from OPEN to CLOSED would send full production traffic to a fragile, recovering service all at once, potentially knocking it over again immediately. HALF-OPEN lets through just one probe request (or a small fixed number) as a controlled test of recovery before reopening the floodgates.

2. Service A calls Service B with no timeout and no circuit breaker. B becomes slow (responds in 45 s). What happens to A's thread pool?

Without a timeout, each of A's threads is held for the full 45 s that B takes to respond. A bounded thread pool fills completely, new work queues behind it, and the queue eventually overflows. A goes down, not because A's own code is broken, but because it has no mechanism to bound the damage from B's slowness.

3. You have a minimum request count of 10 and a failure threshold of 60%. After 5 requests, all 5 have failed (100%). Does the circuit trip?

No. The minimum request count is a statistical guard. With only 5 samples, the failure rate is not yet statistically meaningful; those 5 failures might be startup noise. The breaker only evaluates the threshold once at least 10 requests have been recorded.

4. When a circuit is OPEN, what is the best behavior for a non-critical feature like "related product recommendations"?

Degrade gracefully: return an empty (or cached) recommendations list so the page still loads and the core experience works, with the recommendations section quietly empty. Returning a 500 would degrade the entire page for one optional feature. Blocking until the dependency recovers would be worse still: it ties up resources and defeats the purpose of the circuit breaker.

✍️ Exercise: design circuit breakers for an e-commerce checkout

An e-commerce checkout page calls three downstream services: (1) a Payment service, (2) a Fraud Detection service, and (3) a Loyalty Points service. Payments are critical; Fraud Detection is important but can be skipped with a conservative fallback; Loyalty Points are optional. Design circuit breaker policies — thresholds, cool-down, and fallbacks — for each. What happens during a Fraud Detection outage? What happens during a Payment outage?

Model answer:

Payment service: Breaker with tight thresholds (30% failure rate, 10 s window, 15 s cool-down). Fallback: none — fail the checkout with a user-facing message "Payment processing is temporarily unavailable, please try again shortly." You cannot process money without the payment service; degrading silently would be worse.
Fraud Detection: Breaker with moderate thresholds (50% failure, 60 s window, 30 s cool-down). Fallback: allow the transaction but flag it for manual review and cap it at a low-risk amount (e.g., £150). The business accepts slightly elevated fraud risk for a short window rather than blocking all checkouts.
Loyalty Points: Breaker with relaxed thresholds (70% failure, 60 s window, 60 s cool-down). Fallback: skip points entirely, show "Loyalty points will be credited within 24 hours." Queue the credit write for later processing.
Fraud Detection outage: Payment proceeds with the conservative fallback. Checkout works; risk team gets notified.
Payment outage: Checkout fails with a clear message. No fraudulent approvals, no phantom charges.
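The model answer could be captured as configuration plus one decision function. This is only a sketch: the thresholds, windows, and cool-downs come from the answer above, while `checkoutBreakerPolicies`, `checkoutOutcome`, and the field and fallback names are hypothetical.

```javascript
// Values taken from the model answer; field names are illustrative.
const checkoutBreakerPolicies = {
  payment: { failureThreshold: 0.3, windowMs: 10_000, cooldownMs: 15_000, fallback: "fail-checkout" },
  fraud:   { failureThreshold: 0.5, windowMs: 60_000, cooldownMs: 30_000, fallback: "allow-capped-and-flag" },
  loyalty: { failureThreshold: 0.7, windowMs: 60_000, cooldownMs: 60_000, fallback: "queue-credit" },
};

// Decides how checkout behaves given which breakers are currently OPEN.
function checkoutOutcome(openDependencies) {
  if (openDependencies.includes("payment")) {
    // Critical dependency: hard failure with a clear message.
    return {
      ok: false,
      message: "Payment processing is temporarily unavailable, please try again shortly.",
    };
  }
  return {
    ok: true,
    manualReview: openDependencies.includes("fraud"),     // conservative default
    pointsDeferred: openDependencies.includes("loyalty"), // async degrade
  };
}

console.log(checkoutOutcome(["fraud"]));
// { ok: true, manualReview: true, pointsDeferred: false }
console.log(checkoutOutcome(["payment"]).ok); // false
```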

Rubric: ✓ different thresholds per dependency criticality ✓ meaningful fallback per service ✓ correct escalation: Payment = hard failure, Fraud = conservative default, Points = async degrade ✓ user experience considered in fallback wording. Four out of four = excellent.

Key takeaways

A circuit breaker has three states: CLOSED (normal), OPEN (fail fast), HALF-OPEN (probe).
It trips when a dependency's error rate exceeds a threshold in a rolling window, protecting threads and giving the dependency recovery time.
Fail fast in OPEN state is the feature — it prevents cascading failure by containing the blast radius.
Always pair a circuit breaker with a meaningful fallback for graceful degradation.
Without timeouts, a circuit breaker may never trip — pending requests don't count as failures.
One circuit breaker per dependency: isolate failure domains so a broken recommendation engine can't trip the payments breaker.