Reliability & Scale · Lesson 06
Circuit Breaker Pattern
When a downstream service is drowning, the last thing it needs is your service pouring more requests on top. The circuit breaker is an automatic trip-switch: it detects the failure, cuts the flow, and only restores it once the downstream has had a chance to breathe.
By the end you'll be able to
- Describe the three circuit-breaker states and the conditions that transition between them.
- Explain why hammering a failing dependency is dangerous and how a circuit breaker prevents it.
- Sketch a circuit breaker in code, including thresholds, cool-down, and graceful degradation.
The problem: hammering a drowning neighbor
Picture a neighborhood sharing a single water main. One house springs a burst pipe; the water pressure in the whole street drops. Every other house starts running all their taps harder, trying to compensate — which drops the pressure even more. Now the problem has spread from one burst pipe to the entire street.
Microservices fail the same way. Service A calls Service B. B is overloaded and starts responding slowly. A's request threads pile up waiting for B to answer. A's own response times balloon. Services C and D that call A now start timing out. One slow dependency has metastasized into a full cascading failure — not because B was catastrophically broken, but because everyone kept hammering it instead of backing off.
This is the problem the circuit breaker was designed to solve. The name is borrowed directly from electrical engineering: a physical circuit breaker trips when current exceeds a safe threshold, protecting the wiring. In software, a circuit breaker trips when a downstream's error rate exceeds a threshold, protecting threads, connections, and users.
The three states
A circuit breaker wraps every call to an external dependency and maintains a small state machine with three states:
Closed
The normal operating state. All calls pass through to the dependency. The breaker counts successes and failures inside a rolling time window (e.g., the last 60 seconds). As long as the failure rate stays below a configured threshold (e.g., 50%), nothing changes.
Open
The breaker has tripped. Calls are rejected immediately — the dependency is not contacted at all. This "fail fast" behavior is crucial: instead of tying up threads waiting for a service that can't respond, the caller gets an instant error and can apply a fallback. The dependency simultaneously gets zero additional load, giving it room to recover.
Half-Open
After a configured cool-down period (e.g., 30 seconds), the breaker enters a cautious probe state. It allows exactly one (or a small fixed number of) request(s) through to the real dependency. If the probe succeeds, the circuit closes and normal traffic resumes. If the probe fails, the circuit trips back to OPEN and the cool-down restarts. This gradual re-entry prevents immediately flooding a fragile but recovering service.
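The three states and their transitions can be written down as a small lookup table. This is a minimal sketch: the event names (`failure_rate_exceeded`, `cooldown_elapsed`, `probe_succeeded`, `probe_failed`) are labels invented for illustration, not part of any library.

```javascript
// Transition table for the three breaker states.
// Event names are hypothetical labels chosen for this sketch.
const TRANSITIONS = {
  CLOSED: { failure_rate_exceeded: "OPEN" },
  OPEN: { cooldown_elapsed: "HALF_OPEN" },
  HALF_OPEN: { probe_succeeded: "CLOSED", probe_failed: "OPEN" },
};

// Returns the next state, or the current state if the event does not apply.
function nextState(state, event) {
  return TRANSITIONS[state]?.[event] ?? state;
}

console.log(nextState("CLOSED", "failure_rate_exceeded")); // OPEN
console.log(nextState("OPEN", "cooldown_elapsed"));        // HALF_OPEN
console.log(nextState("HALF_OPEN", "probe_failed"));       // OPEN
console.log(nextState("OPEN", "probe_succeeded"));         // OPEN (event ignored)
```

Note that there is no direct OPEN to CLOSED edge: the only way back to normal traffic is through a successful probe.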
Configurable thresholds
A production circuit breaker needs at least four tuneable parameters:
| Parameter | Purpose | Example default |
|---|---|---|
| Window duration | How long to count failures over | 60 s rolling |
| Failure threshold | Error rate % that trips the breaker | 50% |
| Minimum request count | Ignore threshold until N requests seen (avoids tripping on startup noise) | 10 requests |
| Cool-down (sleep window) | How long to stay OPEN before probing | 30 s |
The minimum request count matters: if you've only had 2 requests and both failed, that's 100% — but it's not statistically meaningful. Without a floor count, a breaker trips the moment a service boots up and the first two warmup calls are slow.
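The effect of the floor count is easy to see in isolation. The helper below is a sketch of the trip decision only (no timing, no state), using the example defaults from the table; the function name and the boolean-array input are assumptions made for this example.

```javascript
// Decide whether the breaker should trip, given the calls in the current window.
// `calls` is an array of booleans: true = success, false = failure.
function shouldTrip(calls, { minRequests = 10, failureThreshold = 0.5 } = {}) {
  if (calls.length < minRequests) return false; // too few samples to judge
  const failures = calls.filter((ok) => !ok).length;
  return failures / calls.length >= failureThreshold;
}

console.log(shouldTrip([false, false]));        // false: 100% failures, but only 2 samples
console.log(shouldTrip(Array(10).fill(false))); // true: 10 samples, all failed
```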
Fallbacks and graceful degradation
A circuit breaker alone just replaces a slow failure with a fast one. The real value is pairing it with a fallback: an alternative behavior when the dependency is unavailable. Fallback options, roughly from best to worst user experience:
- Cached data: return the last known-good response if it's still within an acceptable staleness window.
- Default / empty state: return an empty list, a zero count, or a neutral default rather than an error.
- Partial response: omit the unavailable section and flag it, e.g. { "recommendations": null, "recommendations_available": false }.
- Queue for later: accept the write and process it once the dependency recovers (for async operations).
- Clear error message: if no fallback is viable, return a user-friendly message rather than a 500.
The goal is graceful degradation: a service that behaves "less well" during a dependency outage is far preferable to one that becomes completely unavailable.
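One way to express that ordering in code is a fallback chain: try the real call, then walk the alternatives from best to worst. The sketch below is illustrative only; the in-memory `cache`, the five-minute staleness limit, and the function names are all assumptions made for this example.

```javascript
// Try the primary call; on failure, walk the fallbacks in order of preference.
// Each fallback returns a value, or undefined to defer to the next one.
async function withFallbacks(primary, fallbacks) {
  try {
    return await primary();
  } catch (err) {
    for (const fallback of fallbacks) {
      const value = await fallback(err);
      if (value !== undefined) return value;
    }
    throw err; // nothing viable: surface the original error
  }
}

// Hypothetical in-memory cache with a staleness limit.
const cache = new Map(); // userId -> { value, storedAt }
const MAX_STALENESS_MS = 5 * 60_000;

function cachedRecommendations(userId) {
  const entry = cache.get(userId);
  if (entry && Date.now() - entry.storedAt < MAX_STALENESS_MS) return entry.value;
  return undefined; // missing or too stale
}

async function getRecommendations(userId, fetchFromService) {
  return withFallbacks(() => fetchFromService(userId), [
    () => cachedRecommendations(userId), // 1. cached data
    () => ({ recommendations: null, recommendations_available: false }), // 2. partial response
  ]);
}
```

The last entry in the chain always returns a value, so callers of `getRecommendations` never see an error for this optional feature.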
Pairing with timeouts and bulkheads
A circuit breaker without a timeout is toothless. If calls to the dependency have no deadline, a thread can wait indefinitely, and the circuit never accumulates enough failures to trip because most requests are "pending" rather than "failed". Set an explicit connection timeout and read timeout on every outbound call. The two are configured separately, for example a 500 ms connection timeout and a 5 s read timeout.
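If your HTTP client has no built-in deadline, a generic wrapper can turn a hanging call into a rejection the breaker can count. This is a minimal sketch using `Promise.race`; the helper name `withTimeout` is an assumption, and most real clients offer their own timeout options that are preferable when available.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
// This bounds how long the caller waits; it does not cancel the underlying work.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Wrapping the dependency call this way means a hung request becomes a thrown error after the deadline, so it is recorded as a failure instead of staying "pending" forever.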
A bulkhead complements this: it limits the number of concurrent calls to a single dependency using a dedicated thread pool or semaphore. If the pool is exhausted, new requests fail fast immediately rather than queuing — containing the blast radius to calls destined for that one dependency, not the entire caller service.
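The semaphore variant of a bulkhead fits in a few lines. The sketch below is an assumption-laden illustration (the class name and the `maxConcurrent` parameter are made up here); it rejects immediately when the limit is reached instead of queuing.

```javascript
// Semaphore-style bulkhead: at most `maxConcurrent` calls in flight per dependency.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.inFlight = 0;
  }

  async run(fn) {
    if (this.inFlight >= this.maxConcurrent) {
      throw new Error("Bulkhead full"); // fail fast instead of queuing
    }
    this.inFlight++;
    try {
      return await fn();
    } finally {
      this.inFlight--; // release the slot on success or failure
    }
  }
}
```

Give each dependency its own `Bulkhead` instance, so exhausting the slots for one dependency leaves calls to the others unaffected.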
Worked example: minimal circuit breaker
// Pseudo-code: SimpleCircuitBreaker
// Wraps any callable that can fail (e.g., an HTTP client call).
class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold ?? 0.5; // 50%
    this.minRequests = options.minRequests ?? 10;
    this.windowMs = options.windowMs ?? 60_000;
    this.cooldownMs = options.cooldownMs ?? 30_000;
    this.state = "CLOSED";
    this.openedAt = null;
    this.calls = []; // rolling window of { ts, success }
  }

  // Call this instead of calling the dependency directly.
  async call(fn, fallback) {
    if (this.state === "OPEN") {
      const elapsed = Date.now() - this.openedAt;
      if (elapsed < this.cooldownMs) {
        // Fail fast: don't touch the dependency.
        // (`throw` is a statement, so it cannot sit inside a ternary.)
        if (fallback) return fallback();
        throw new Error("Circuit OPEN");
      }
      // Cool-down expired: let this call through as the probe.
      this.state = "HALF_OPEN";
    }
    try {
      const result = await fn();
      if (this.state === "HALF_OPEN") {
        // Probe passed: close the circuit and start a fresh window first, so
        // old failures still inside the window cannot re-trip the breaker.
        this.state = "CLOSED";
        this.calls = [];
      }
      this._record(true);
      return result;
    } catch (err) {
      if (this.state === "HALF_OPEN") {
        this._trip(); // probe failed: back to OPEN, cool-down restarts
      } else {
        this._record(false);
      }
      throw err;
    }
  }

  _record(success) {
    const now = Date.now();
    // Evict entries outside the rolling window
    this.calls = this.calls.filter((c) => now - c.ts < this.windowMs);
    this.calls.push({ ts: now, success });
    if (this.calls.length >= this.minRequests) {
      const failures = this.calls.filter((c) => !c.success).length;
      // Trip at or above the threshold (50% trips when the threshold is 0.5)
      if (failures / this.calls.length >= this.failureThreshold) {
        this._trip();
      }
    }
  }

  _trip() {
    this.state = "OPEN";
    this.openedAt = Date.now();
  }
}

// Usage:
const breaker = new CircuitBreaker({ failureThreshold: 0.5, cooldownMs: 30_000 });

async function getUserProfile(userId) {
  return breaker.call(
    () => profileService.get(userId), // real call
    () => ({ name: "Unknown", avatar: null }) // fallback while the circuit is OPEN
  );
}
"How would you protect your service from a flaky downstream dependency?" The answer that stands out: (1) set a timeout on every outbound call; (2) wrap calls in a circuit breaker with explicit thresholds; (3) define a fallback for when the circuit is open; (4) monitor the breaker state as a metric so an engineer knows when a dependency has tripped. Tying this to retries (rel-05) and bulkheads shows you understand the full resilience stack — that's senior-level thinking.
No timeout → thread/connection exhaustion. Without a read timeout, threads accumulate waiting for a dependency that never replies. Your thread pool fills, new requests queue, the queue fills, your service stops processing anything — and your circuit breaker never trips because calls are "pending" rather than "failed". Always set both a connection timeout and a read timeout on every outbound call, making timeout failures count toward the breaker's failure rate.
Cascading failure without a circuit breaker. Service A calls B; B is slow. A's threads wait. A becomes slow. C calls A; C's threads wait. C becomes slow. One overloaded leaf node has taken down the entire call chain. A circuit breaker at each service boundary — with a fast fallback — contains the blast to the single dependency, not the entire graph.
Do: pair every circuit breaker with an explicit timeout, a meaningful fallback, and an observable metric (state + error rate). Tune thresholds in staging before going to production — a threshold that's too sensitive will flip the circuit on normal traffic spikes. Don't: share one circuit breaker instance across multiple logical dependencies — each dependency gets its own breaker, so a failure in the recommendation engine doesn't trip the breaker protecting the payment service.
Under the hood: the state machine implemented
The three states are easy to draw in a diagram. What matters for real debugging is knowing exactly which variables drive the transitions and how to trace a breaker through a failure sequence. A breaker needs five pieces of state: a failure counter (or rolling window), a request counter, a threshold ratio, an open-timestamp, and the current state enum.
Worked numeric example. Parameters: failureThreshold = 0.5, minRequests = 10, windowMs = 60_000, cooldownMs = 30_000. Service B is healthy from T=0 to T=5 and starts failing at T=6.
| Time (s) | Event | Window calls | Failures | Failure rate | State |
|---|---|---|---|---|---|
| T=0 to T=5 | 8 successful calls, 0 failures | 8 | 0 | 0% (min not met) | CLOSED |
| T=6 | Call 9 fails | 9 | 1 | 11% (min not met) | CLOSED |
| T=7 | Call 10 fails | 10 | 2 | 20% — min met, below 50% | CLOSED |
| T=8 to T=12 | Calls 11–15 all fail | 15 | 7 | 47% — still below 50% | CLOSED |
| T=13 | Call 16 fails → 8/16 = 50%, threshold hit | 16 | 8 | 50% ≥ threshold → TRIP | OPEN (openedAt = 13 s) |
| T=14 to T=42 | All incoming calls fail fast, upstream gets 0 requests | — | — | — | OPEN |
| T=43 | elapsed = 30 s ≥ cooldownMs → probe allowed | — | — | — | HALF-OPEN |
| T=43 (probe) | Probe succeeds → reset window | 1 | 0 | 0% | CLOSED |
If instead the probe at T=43 fails, openedAt is reset to T=43, and the 30 s cool-down restarts — the next probe is allowed at T=73.
Pseudo-code implementing the transitions with comments on which line drives which row above:
// Parameters from the worked example above
const THRESHOLD = 0.5, MIN_REQUESTS = 10, WINDOW_MS = 60_000, COOLDOWN_MS = 30_000;

// State variables (per breaker instance, per dependency)
let state = "CLOSED";
let openedAt = null;
let calls = []; // rolling window of {ts, success}

async function call(fn, fallback) {
  if (state === "OPEN") {
    if (Date.now() - openedAt < COOLDOWN_MS) {
      return fallback(); // fail fast: rows T=14..T=42 above
    }
    state = "HALF_OPEN"; // row T=43
  }
  try {
    const result = await fn(); // actual dependency call
    if (state === "HALF_OPEN") {
      state = "CLOSED"; // probe passed: row "Probe succeeds"
      calls = []; // reset first, so stale failures cannot re-trip the breaker
    }
    record(true); // after a probe, this is the single entry in the fresh window
    return result;
  } catch (err) {
    if (state === "HALF_OPEN") trip(); // probe failed: reset cool-down
    else record(false);
    throw err;
  }
}

function record(success) {
  const now = Date.now();
  calls = calls.filter((e) => now - e.ts < WINDOW_MS); // evict stale
  calls.push({ ts: now, success });
  if (calls.length >= MIN_REQUESTS) { // row T=7: min-requests gate
    const failures = calls.filter((e) => !e.success).length;
    if (failures / calls.length >= THRESHOLD) trip(); // row T=13
  }
}

function trip() {
  state = "OPEN";
  openedAt = Date.now();
}
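To check the numbers in the trace table, it helps to replay the sequence against a breaker with an injectable clock. The sketch below is a compact stand-in written for this purpose, not the lesson's class: `createBreaker`, `report`, and the per-call timestamps inside each time range are assumptions made so the replay is deterministic.

```javascript
// Compact breaker with an injectable clock so the trace can be replayed deterministically.
function createBreaker({ threshold = 0.5, minRequests = 10, windowMs = 60_000, cooldownMs = 30_000, now }) {
  let state = "CLOSED";
  let openedAt = null;
  let calls = [];

  function trip() { state = "OPEN"; openedAt = now(); }

  function record(success) {
    const t = now();
    calls = calls.filter((e) => t - e.ts < windowMs);
    calls.push({ ts: t, success });
    if (calls.length < minRequests) return;
    const failures = calls.filter((e) => !e.success).length;
    if (failures / calls.length >= threshold) trip();
  }

  return {
    get state() { return state; },
    // Report the outcome of one call; returns false if it was short-circuited.
    report(success) {
      if (state === "OPEN") {
        if (now() - openedAt < cooldownMs) return false;
        state = "HALF_OPEN";
      }
      if (state === "HALF_OPEN") {
        if (success) { state = "CLOSED"; calls = []; record(true); }
        else trip();
        return true;
      }
      record(success);
      return true;
    },
  };
}

// Replay: 8 successes (T=0..5), then 8 failures (T=6..13).
let clock = 0;
const breaker = createBreaker({ now: () => clock });
const outcomes = [...Array(8).fill(true), ...Array(8).fill(false)];
const times = [0, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13]; // seconds
let trippedAtCall = null;
outcomes.forEach((ok, i) => {
  clock = times[i] * 1000;
  breaker.report(ok);
  if (breaker.state === "OPEN" && trippedAtCall === null) trippedAtCall = i + 1;
});
console.log(trippedAtCall);        // 16

clock = 20_000;
console.log(breaker.report(true)); // false: still inside the cool-down

clock = 43_000;
breaker.report(true);              // probe succeeds
console.log(breaker.state);        // CLOSED
```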
How to debug & inspect it
A circuit breaker that has tripped is silent by default — requests fail fast with no call to the upstream, so the upstream logs show nothing. If you're seeing sudden 5xx/client errors without any corresponding upstream activity, the breaker is the first place to look.
The key Prometheus metrics to expose from a circuit breaker and alert on:
| Metric | What it tells you | Alert condition |
|---|---|---|
| circuit_breaker_state{dep="X"} | Current state (0/1/2 = CLOSED/HALF_OPEN/OPEN) | Alert if OPEN for > 5 min (dependency stuck down) |
| circuit_breaker_failure_rate{dep="X"} | Rolling failure rate in the window | Alert at >30% even before tripping, to catch degradation early |
| circuit_breaker_calls_total{outcome="short_circuited"} | Requests that hit the OPEN breaker and returned fallback | Spike = breaker is open; count the user impact |
| circuit_breaker_slow_call_rate{dep="X"} | Fraction of calls exceeding the slow-call threshold | Alert at >20%, since slowness often precedes failures |
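To make the table concrete, here is a sketch of how a breaker snapshot might be rendered as sample lines in the Prometheus text format. In practice a Prometheus client library would handle registration and formatting for you; the `snapshot` shape and the `renderMetrics` helper are assumptions made for this example.

```javascript
// Numeric encoding of the state, matching the table above.
const STATE_CODES = { CLOSED: 0, HALF_OPEN: 1, OPEN: 2 };

// Render one breaker's snapshot as Prometheus text-format sample lines.
// The snapshot fields (state, failureRate, shortCircuited) are hypothetical.
function renderMetrics(dep, snapshot) {
  const labels = `dep="${dep}"`;
  return [
    `circuit_breaker_state{${labels}} ${STATE_CODES[snapshot.state]}`,
    `circuit_breaker_failure_rate{${labels}} ${snapshot.failureRate}`,
    `circuit_breaker_calls_total{${labels},outcome="short_circuited"} ${snapshot.shortCircuited}`,
  ].join("\n");
}

console.log(renderMetrics("recommendations", { state: "OPEN", failureRate: 0.62, shortCircuited: 1200 }));
```

Encoding the state as a number lets you alert with a simple comparison, such as the state gauge staying at 2 for more than five minutes.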
Tuning thresholds — the two failure modes:
- Threshold too sensitive (e.g. 20% with minRequests=5): the breaker trips on brief traffic spikes or normal startup noise, causing false-positive outages. Symptoms: breaker flips open and closed rapidly ("flapping"), logs show OPEN→HALF_OPEN→OPEN cycles within seconds.
- Threshold too permissive (e.g. 90%): the breaker doesn't trip until the dependency is almost completely down, so your service keeps hammering a failing upstream and the cascading failure isn't contained early enough.
| Symptom | Likely cause | Fix |
|---|---|---|
| Breaker won't close (stays OPEN indefinitely) | Dependency is genuinely still down; OR the probe itself keeps timing out because the probe timeout is shorter than the dependency's response time while it recovers | Verify the downstream is actually healthy (direct health check); increase the probe timeout; lengthen the cool-down to give the service more recovery time |
| Breaker flaps — rapidly oscillates OPEN ↔ HALF-OPEN ↔ OPEN | Cool-down too short; dependency is partially recovered (intermittent failures); or single-probe half-open is too aggressive for a slowly recovering service | Increase cooldownMs; configure the half-open probe to require multiple consecutive successes before closing (e.g. 3 successes) |
| Breaker trips on startup / deployment | minRequests too low; first few warmup calls are slow and counted as failures | Increase minRequests to at least 10–20; exclude health-check calls from the counter |
| Breaker never trips during an obvious outage | Timeout not set — calls hang as "pending" and aren't recorded as failures; OR errors are caught silently before reaching the breaker | Set an explicit read timeout; ensure all error paths (timeouts, connection resets) are counted as failures in record(success=false) |
| Fallback returns stale/wrong data without any indication | No observability on fallback activations | Log and increment a circuit_breaker_fallback_total counter every time a fallback fires; set an alert on it |
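The anti-flapping fix from the table, requiring several consecutive probe successes before closing, can be sketched as a small helper that the breaker consults while HALF-OPEN. The class name and the default of 3 successes are assumptions for illustration.

```javascript
// Tracks probe results while HALF_OPEN and decides when to close or re-open.
class HalfOpenGate {
  constructor(requiredSuccesses = 3) {
    this.requiredSuccesses = requiredSuccesses;
    this.consecutiveSuccesses = 0;
  }

  // Returns the state the breaker should move to after this probe:
  // "CLOSED", "OPEN", or "HALF_OPEN" (keep probing).
  onProbeResult(success) {
    if (!success) {
      this.consecutiveSuccesses = 0; // any failure resets the streak
      return "OPEN";
    }
    this.consecutiveSuccesses++;
    if (this.consecutiveSuccesses >= this.requiredSuccesses) {
      this.consecutiveSuccesses = 0;
      return "CLOSED";
    }
    return "HALF_OPEN";
  }
}
```

A single lucky response from a half-recovered dependency no longer closes the circuit; it takes an unbroken streak.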
Breaker tuning checklist:
- Expose current state and failure rate as Prometheus metrics; set an alert for OPEN > 5 min.
- Set a conservative minRequests (≥10) to avoid tripping on startup noise.
- Verify every outbound call has both a connect timeout and a read timeout, and that timeouts are counted as failures by the breaker.
- Configure a meaningful fallback — not just a bare error, but a degraded response or cached value.
- Test the breaker in staging: inject failures (return 500 from a mock upstream) and verify the breaker trips at the expected failure count, stays open for the cool-down period, and closes on probe success.
- Each dependency gets its own breaker instance — never share a breaker across two different downstream services.
By the numbers
Scenario: Service A calls a recommendation engine (Service B) at 200 req/s. Calls to Service B use a 5 s read timeout. A rolling window of 20 calls and a failure threshold of 50% are configured, with a 30 s cool-down.
Rolling window trace: when does the breaker trip?
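Service B starts failing at T=0. The breaker window holds the last 20 calls. The timestamps are illustrative and stretched out for readability; at a real 200 req/s, 20 calls would arrive in about 0.1 s. Each row below shows the window state after the indicated call: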
| Time (s) | Event | Window calls | Failures in window | Failure rate | State |
|---|---|---|---|---|---|
| T=0 – T=8 | Calls 1–16: 10 succeed, 6 fail | 16 | 6 | 37.5% — below 50% | CLOSED |
| T=9 | Call 17 fails → 7/17 = 41.2% | 17 | 7 | 41.2% | CLOSED |
| T=10 | Call 18 fails → 8/18 = 44.4% | 18 | 8 | 44.4% | CLOSED |
| T=11 | Call 19 fails → 9/19 = 47.4% | 19 | 9 | 47.4% | CLOSED |
| T=12 | Call 20 fails → 10/20 = 50% ≥ threshold | 20 | 10 | 50% → TRIP | OPEN (openedAt = 12 s) |
| T=12 – T=42 | All calls fail fast (fallback returned, no request to B) | — | — | — | OPEN |
| T=42 | elapsed = 30 s ≥ cooldown → probe allowed | — | — | — | HALF-OPEN |
| T=42 (probe) | Probe succeeds → window reset | 1 | 0 | 0% | CLOSED |
Source for rolling-window mechanics: Martin Fowler — CircuitBreaker; production parameters from Azure Architecture Center — Circuit Breaker pattern.
Calls and resources saved by failing fast
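Once OPEN at T=12, the breaker blocks all requests to Service B for the 30 s cool-down. At 200 req/s, that is 200 × 30 = 6 000 calls that never reach B. Without the breaker, each of those calls could hold a thread for the full 5 s read timeout: 6 000 × 5 s = 30 000 thread-seconds.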
In concrete terms: the breaker converts 30 000 thread-seconds of blocked capacity into near-zero thread-time (a fast fallback return costs microseconds). That is the quantified value of failing fast.
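The same arithmetic as a small helper, so you can plug in your own traffic numbers. It assumes the worst case where every blocked call would otherwise have waited for the full timeout; the function and parameter names are made up for this sketch.

```javascript
// Thread-time that would be spent waiting on timeouts if the breaker were not OPEN.
// Worst-case assumption: every call hangs for the full timeout.
function threadSecondsAvoided({ requestsPerSecond, cooldownSeconds, timeoutSeconds }) {
  const callsShortCircuited = requestsPerSecond * cooldownSeconds;
  return {
    callsShortCircuited,
    threadSeconds: callsShortCircuited * timeoutSeconds,
  };
}

console.log(threadSecondsAvoided({ requestsPerSecond: 200, cooldownSeconds: 30, timeoutSeconds: 5 }));
// { callsShortCircuited: 6000, threadSeconds: 30000 }
```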
Decision math: threshold vs. flap trade-off
Choosing the failure threshold and minimum request count involves two opposing risks:
| Threshold | Min requests | Trips after… | Risk |
|---|---|---|---|
| 20% | 5 | 1 failure in 5 calls | False positives: normal startup noise or a single slow call trips the breaker |
| 50% | 20 | 10 failures in 20 calls (as traced above) | Balanced: statistically meaningful before tripping; 30 s cool-down contains damage |
| 80% | 50 | 40 failures in 50 calls | Trips too late: 40 failed calls × 5 s = 200 thread-seconds wasted before OPEN |
Break-even: the threshold should trip the breaker before thread-pool exhaustion. With a pool of P threads, an incoming rate of R req/s, and a timeout of T seconds, the pool fills in P / R seconds (provided P / R ≤ T, i.e. calls hang long enough for the pool to fill). The breaker must trip within that window. With P=200 threads, R=200 req/s, T=5 s: the pool fills in 1 s, so the minimum-requests window must be ≤ R × 1 s = 200 calls. A 20-call window satisfies this: at 200 req/s, 20 calls arrive in about 0.1 s. One caveat: a failure that surfaces as a timeout is only recorded after the full T seconds, so fast detection also depends on keeping timeouts short.
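The break-even rule can be checked with a few lines of code. This sketch assumes the worst case in which every call hangs for the full timeout, so in-flight calls grow at R per second until they plateau at R × T; the function names are invented for the example.

```javascript
// Seconds until a pool of `poolSize` threads is exhausted when every call
// hangs for `timeoutSeconds` and arrives at `requestsPerSecond`.
function secondsUntilPoolExhausted({ poolSize, requestsPerSecond, timeoutSeconds }) {
  const peakInFlight = requestsPerSecond * timeoutSeconds; // plateau of concurrent calls
  if (peakInFlight < poolSize) return Infinity; // the pool never fills
  return poolSize / requestsPerSecond;
}

// Largest window (in calls) that can still trip before the pool is exhausted.
function maxCallsBeforeTrip(params) {
  return Math.floor(secondsUntilPoolExhausted(params) * params.requestsPerSecond);
}

const scenario = { poolSize: 200, requestsPerSecond: 200, timeoutSeconds: 5 };
console.log(secondsUntilPoolExhausted(scenario)); // 1
console.log(maxCallsBeforeTrip(scenario));        // 200
```

If the first function returns Infinity, the timeout alone already keeps concurrency below the pool size, and the breaker's job is to shed wasted work rather than to prevent exhaustion.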
🧠 Quick check
1. Why does a circuit breaker enter the HALF-OPEN state instead of going directly from OPEN back to CLOSED?
Going directly from OPEN to CLOSED would send full production traffic to a fragile, recovering service all at once, potentially overloading it again immediately. HALF-OPEN lets through just one probe request (or a small fixed number) as a controlled test of recovery before reopening the floodgates.
2. Service A calls Service B with no timeout and no circuit breaker. B becomes slow (responds in 45 s). What happens to A's thread pool?
Without a timeout, each thread blocks for the full 45 s that B takes to respond. A bounded thread pool fills completely, new work queues behind it, and the queue eventually overflows. A goes down, not because A's own code is broken, but because it has no mechanism to bound the damage from B's slowness.
3. You have a minimum request count of 10 and a failure threshold of 60%. After 5 requests, all 5 have failed (100%). Does the circuit trip?
No. The minimum request count is a statistical guard. With only 5 samples, the failure rate is not yet statistically meaningful; those 5 failures might be startup noise. The breaker only evaluates the threshold once at least 10 requests have been recorded.
4. When a circuit is OPEN, what is the best behavior for a non-critical feature like "related product recommendations"?
Graceful degradation: the page still loads and the core experience works; the recommendations section is quietly empty. Returning 500 degrades the entire page for one optional feature. Blocking is worse — it ties up resources and defeats the purpose of the circuit breaker.
✍️ Exercise: design circuit breakers for an e-commerce checkout
An e-commerce checkout page calls three downstream services: (1) a Payment service, (2) a Fraud Detection service, and (3) a Loyalty Points service. Payments are critical; Fraud Detection is important but can be skipped with a conservative fallback; Loyalty Points are optional. Design circuit breaker policies — thresholds, cool-down, and fallbacks — for each. What happens during a Fraud Detection outage? What happens during a Payment outage?
Model answer:
- Payment service: Breaker with tight thresholds (30% failure rate, 10 s window, 15 s cool-down). Fallback: none — fail the checkout with a user-facing message "Payment processing is temporarily unavailable, please try again shortly." You cannot process money without the payment service; degrading silently would be worse.
- Fraud Detection: Breaker with moderate thresholds (50% failure, 60 s window, 30 s cool-down). Fallback: allow the transaction but flag it for manual review and cap it at a low-risk amount (e.g., £150). The business accepts slightly elevated fraud risk for a short window rather than blocking all checkouts.
- Loyalty Points: Breaker with relaxed thresholds (70% failure, 60 s window, 60 s cool-down). Fallback: skip points entirely, show "Loyalty points will be credited within 24 hours." Queue the credit write for later processing.
- Fraud Detection outage: Payment proceeds with the conservative fallback. Checkout works; risk team gets notified.
- Payment outage: Checkout fails with a clear message. No fraudulent approvals, no phantom charges.
Rubric: ✓ different thresholds per dependency criticality ✓ meaningful fallback per service ✓ correct escalation: Payment = hard failure, Fraud = conservative default, Points = async degrade ✓ user experience considered in fallback wording. Four out of four = excellent.
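The model answer could be captured as per-dependency policy data. The sketch below only encodes the numbers and fallbacks from the model answer; the object shape, the `amountGbp` field, and the return values of the fallbacks are assumptions made for illustration.

```javascript
// One policy per dependency, mirroring the model answer above.
const CHECKOUT_BREAKER_POLICIES = {
  payment: {
    failureThreshold: 0.3, windowMs: 10_000, cooldownMs: 15_000,
    // Critical: no silent degradation, fail the checkout with a clear message.
    fallback: () => {
      throw new Error("Payment processing is temporarily unavailable, please try again shortly.");
    },
  },
  fraudDetection: {
    failureThreshold: 0.5, windowMs: 60_000, cooldownMs: 30_000,
    // Conservative default: allow only low-value orders and flag them for review.
    fallback: (order) => ({ approved: order.amountGbp <= 150, manualReview: true }),
  },
  loyaltyPoints: {
    failureThreshold: 0.7, windowMs: 60_000, cooldownMs: 60_000,
    // Optional: skip now, queue the credit for later.
    fallback: () => ({
      pointsCredited: false,
      queuedForLater: true,
      message: "Loyalty points will be credited within 24 hours.",
    }),
  },
};

console.log(CHECKOUT_BREAKER_POLICIES.fraudDetection.fallback({ amountGbp: 90 }));
// { approved: true, manualReview: true }
```

Keeping the policies as data makes the criticality ordering visible at a glance: thresholds loosen and fallbacks soften as the dependency becomes less essential.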
Key takeaways
- A circuit breaker has three states: CLOSED (normal), OPEN (fail fast), HALF-OPEN (probe).
- It trips when a dependency's error rate exceeds a threshold in a rolling window, protecting threads and giving the dependency recovery time.
- Fail fast in OPEN state is the feature — it prevents cascading failure by containing the blast radius.
- Always pair a circuit breaker with a meaningful fallback for graceful degradation.
- Without timeouts, a circuit breaker may never trip — pending requests don't count as failures.
- One circuit breaker per dependency: isolate failure domains so a broken recommendation engine can't trip the payments breaker.