API Design

Production at Scale · Simulator 05

Retry amplification storm

When a service starts failing, clients retry — and those retries generate more load on an already struggling service, causing more failures, which cause more retries. Drag fail rate and max retries together and watch the traffic multiplier explode. This is the mechanism behind the "thundering herd on failure" that takes down services during incidents.

Bars: baseline vs amplified QPS. Dashed line = system capacity. Curve = multiplier across all fail rates for the chosen retry count, with the current point marked (●).

What's happening — the math

If every request has an independent failure probability f, and a client retries up to R more times on failure, the expected number of attempts per original request is the truncated geometric series:

# Probability of still failing after k attempts = f^k
# Expected attempts = sum_{k=0}^{R} f^k  (geometric series)

attempts     = (1 − f^(R+1)) / (1 − f)      # when f ≠ 1
             = R + 1                          # when f = 1 (all attempts fail)

amplifiedQPS = baseQPS × attempts
extraLoad    = amplifiedQPS − baseQPS

# The vicious cycle: amplified load → higher f → more retries → …
# You need: amplifiedQPS ≤ capacity at ALL points in the cycle
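
The formula above can be sketched in Python. This is a minimal illustration, not the simulator's own code; the function names `expected_attempts` and `amplified_qps` are hypothetical, chosen here to mirror the variables in the formula.

```python
def expected_attempts(f: float, max_retries: int) -> float:
    """Expected attempts per original request: sum_{k=0}^{R} f^k.

    f           -- independent failure probability of each attempt (0..1)
    max_retries -- R, the number of extra attempts allowed after the first
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    if f == 1.0:
        # Every attempt fails, so all R + 1 attempts are always used.
        return float(max_retries + 1)
    # Closed form of the truncated geometric series.
    return (1.0 - f ** (max_retries + 1)) / (1.0 - f)


def amplified_qps(base_qps: float, f: float, max_retries: int) -> float:
    """Total attempt rate hitting the service, retries included."""
    return base_qps * expected_attempts(f, max_retries)
```

For example, `expected_attempts(0.05, 3)` evaluates to roughly 1.05, and `expected_attempts(1.0, 3)` returns 4.0, the R + 1 ceiling.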

At low fail rates the series converges quickly: 5% failure with 3 retries gives ≈1.05× amplification. As f → 1, amplification approaches its ceiling of R + 1. The danger zone is the middle: a 50% outage with 3 retries gives ≈1.88× (1 + 0.5 + 0.25 + 0.125 = 1.875), potentially pushing an already loaded system past capacity and making the failure rate worse.
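
The feedback step (amplified load raising the fail rate) is not part of the formula itself. One way to explore it is a fixed-point iteration under an assumed overload model: whenever offered load exceeds capacity, the excess is rejected, so only a capacity/load fraction of attempts can be served. That overload rule, and the names `simulate_cascade` and `rounds`, are assumptions made for this sketch, not something the simulator is known to implement.

```python
def expected_attempts(f, max_retries):
    """Truncated geometric series: expected attempts per original request."""
    if f >= 1.0:
        return float(max_retries + 1)
    return (1.0 - f ** (max_retries + 1)) / (1.0 - f)


def simulate_cascade(base_qps, capacity, base_fail_rate, max_retries, rounds=20):
    """Iterate load -> fail rate -> load and return [(fail_rate, load), ...].

    ASSUMPTION: when load exceeds capacity, only capacity/load of the
    attempts are served; served attempts still fail at base_fail_rate.
    """
    history = []
    f = base_fail_rate
    for _ in range(rounds):
        load = base_qps * expected_attempts(f, max_retries)
        history.append((f, load))
        served_fraction = min(1.0, capacity / load)
        # An attempt succeeds only if it is served AND does not fail.
        f = 1.0 - (1.0 - base_fail_rate) * served_fraction
    return history
```

Under this assumed model, base 500 k/s against 1 M/s capacity with f = 0.5 and R = 3 stays under capacity, so the fail rate never moves. At f = 0.6 with R = 5 the first round already exceeds capacity, and the fail rate climbs each round toward a higher fixed point.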

✅ Try this

1. Set base QPS to 500 k/s, capacity 1 M/s, fail rate 50%, retries 3 → amplified load ≈ 938 k/s, still under capacity.
2. Raise retries to 5 → amplified ≈ 984 k/s. At a 50% fail rate the multiplier can never exceed 2×, so this stays just under capacity. Now raise the fail rate to 60% → amplified ≈ 1.19 M/s → over capacity → real fail rate rises → even more retries → cascade.
3. Set retries to 0 → no amplification; the fail rate has no multiplier effect.
4. Keep retries at 2 and sweep fail rate from 10% to 90%, watching the curve against the capacity line.
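
Step 4 amounts to finding the fail rate at which the amplified curve crosses the capacity line. Because the multiplier grows monotonically with f, a bisection search can locate that crossing. The sketch below assumes the same independent-failure formula; `break_even_fail_rate` is a hypothetical helper name.

```python
from typing import Optional


def expected_attempts(f: float, max_retries: int) -> float:
    """Truncated geometric series: expected attempts per original request."""
    if f >= 1.0:
        return float(max_retries + 1)
    return (1.0 - f ** (max_retries + 1)) / (1.0 - f)


def break_even_fail_rate(base_qps: float, capacity: float,
                         max_retries: int, tol: float = 1e-9) -> Optional[float]:
    """Smallest fail rate at which amplified QPS exceeds capacity.

    Returns 0.0 if the base load alone is over capacity, and None if even
    the worst case (f = 1, multiplier R + 1) still fits under capacity.
    """
    if base_qps > capacity:
        return 0.0
    if base_qps * (max_retries + 1) <= capacity:
        return None
    lo, hi = 0.0, 1.0  # invariant: under capacity at lo, over capacity at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if base_qps * expected_attempts(mid, max_retries) > capacity:
            hi = mid
        else:
            lo = mid
    return hi
```

With base 500 k/s, capacity 1 M/s and 2 retries, this returns about 0.618: below roughly 62% failure the amplified load fits under capacity, and above it the load spills over. With 0 or 1 retries it returns `None`, because the worst-case multiplier R + 1 cannot exceed 2×.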

⚠️ Modeled, not measured

This is a first-principles model using the geometric series for independent retry attempts. Real retry storms are more complex: retries may be correlated, exponential backoff with jitter changes timing, circuit breakers may open before all retries fire, and retry budgets cap total attempts. The model shows the structural amplification — treat numbers as illustrative and use them to build intuition for retry budget sizing.
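
A retry budget, one of the mitigations mentioned above, can be sketched as a simple counter that permits a retry only while total retries stay within a fixed percentage of original requests. The class and function names below (`RetryBudget`, `attempts_when_all_fail`) are hypothetical, and the loop models the worst case where every attempt fails. Real implementations often track the budget over a sliding time window, which this sketch omits.

```python
class RetryBudget:
    """Permit retries only while retries <= percent% of original requests."""

    def __init__(self, percent: int):
        self.percent = percent  # e.g. 10 means retries may add at most 10% load
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_acquire(self) -> bool:
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if (self.retries + 1) * 100 <= self.percent * self.requests:
            self.retries += 1
            return True
        return False


def attempts_when_all_fail(n_requests: int, max_retries: int,
                           budget: RetryBudget) -> int:
    """Total attempts sent when every attempt fails (worst case, f = 1)."""
    attempts = 0
    for _ in range(n_requests):
        budget.record_request()
        attempts += 1  # the original attempt is always sent
        for _ in range(max_retries):
            if not budget.try_acquire():
                break  # budget exhausted: give up instead of retrying
            attempts += 1
    return attempts
```

With a 10% budget, 1,000 failing requests with up to 3 retries each produce 1,100 attempts instead of 4,000, so worst-case amplification is capped at 1.1× rather than R + 1 = 4×.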

Sources & further reading