Reliability & Scale · Lesson 05
Retries & Exponential Backoff
A failed request is not always a dead end — but blindly retrying is like jabbing an elevator button fifty times expecting a different result. The difference between a resilient system and a cascading disaster is knowing when to retry, how long to wait, and when to give up entirely.
By the end you'll be able to
- Classify HTTP status codes as "safe to retry" or "never retry" and explain why.
- Implement exponential backoff with full jitter and a retry cap in any language.
- Explain how retry storms amplify outages and how jitter and retry budgets prevent them.
Why retries exist: the transient-failure problem
Networks are physical. Packets traverse routers, cross undersea cables, and land on servers that share CPUs with hundreds of other tenants. Occasionally a packet gets dropped, a TCP connection times out, or a server momentarily runs out of file descriptors. These faults are transient — they go away on their own within milliseconds to a few seconds. If you waited half a second and tried again, the request would have succeeded.
The analogy: imagine calling a friend. You hear three rings and then silence — not voicemail, just silence. That's a network glitch, not your friend refusing to talk to you. You redial. You don't interpret it as a permanent rejection and delete their number.
Retrying transient failures is one of the cheapest reliability gains available to a client. But the logic must be precise, because retrying the wrong thing at the wrong time destroys the system you are trying to protect.
The retry decision tree: what is safe?
The first gate is the HTTP status code.
| Status | Meaning | Retry? | Reason |
|---|---|---|---|
| 408 | Request Timeout | ✅ Yes | Server never processed it; transient. |
| 429 | Too Many Requests | ✅ Yes, with delay | Rate-limited; honor Retry-After. |
| 500 | Internal Server Error | ✅ Yes (idempotent ops) | Server-side fault; often transient. |
| 502 | Bad Gateway | ✅ Yes | Upstream not reachable yet. |
| 503 | Service Unavailable | ✅ Yes, with delay | Overload; honor Retry-After. |
| 504 | Gateway Timeout | ✅ Yes | Upstream too slow; may self-heal. |
| 400 | Bad Request | 🚫 Never | Your payload is malformed; retrying is pointless. |
| 401 | Unauthorized | 🚫 Never | Fix your credentials first. |
| 403 | Forbidden | 🚫 Never | Permissions issue; server won't change its mind. |
| 404 | Not Found | 🚫 Never | Resource doesn't exist; retrying wastes bandwidth. |
| 422 | Unprocessable Entity | 🚫 Never | Semantic validation failure; fix the data. |
The pattern: 4xx errors are the client's fault (except 408 and 429). The server understood the request and rejected it. Retrying the exact same bad request is futile. 5xx errors are the server's fault — the server failed to process a valid request, so retrying with an identical request can succeed once the server recovers.
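The first gate can be sketched as a small classifier. This is a minimal illustration, not a definitive implementation: the names RETRYABLE_STATUSES and classifyStatus are hypothetical, and treating every status outside the table as non-retryable is an assumption made here to keep the allow-list explicit.

```javascript
// Hypothetical helper: the retryable set mirrors the table above.
const RETRYABLE_STATUSES = new Set([408, 429, 500, 502, 503, 504]);

function classifyStatus(status) {
  // 2xx/3xx: nothing to retry.
  if (status >= 200 && status < 400) return "success";
  // Transient failures: retry, provided the operation is idempotent.
  if (RETRYABLE_STATUSES.has(status)) return "retry";
  // Everything else (400, 401, 403, 404, 422, ...) is treated as permanent (assumption).
  return "fail";
}

console.log([200, 404, 429, 503].map(classifyStatus));
```

An explicit allow-list means a status the team never considered defaults to "do not retry", which is the safer failure mode.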
The idempotency prerequisite
The table above says "✅ Yes (idempotent ops)" for 5xx. That parenthetical is load-bearing. Before retrying, you must know whether the operation is idempotent — doing it twice produces the same result as doing it once. A GET is idempotent. A PUT that replaces a resource is idempotent. A bare POST /orders that creates a new order is not — retrying it could charge a customer twice.
The solution is to make the server deduplicate using an idempotency key sent in a request header. The server stores the key and returns the cached response if it sees the same key again. See Lesson rel-02 (Idempotency) for the full pattern. The rule: if you cannot guarantee idempotency, do not retry.
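A rough sketch of the server-side deduplication described above. The names handleCreateOrder, responsesByKey, and createOrder are hypothetical, and the in-memory Map is an assumption for illustration; a real service would likely need a durable store with expiry.

```javascript
// In-memory store for illustration only (assumption): idempotency key -> cached response.
const responsesByKey = new Map();

// createOrder is a hypothetical handler that performs the side effect.
function handleCreateOrder(idempotencyKey, payload, createOrder) {
  if (responsesByKey.has(idempotencyKey)) {
    // Duplicate request: replay the cached response, do not repeat the side effect.
    return responsesByKey.get(idempotencyKey);
  }
  const response = createOrder(payload); // first time this key is seen: do the work
  responsesByKey.set(idempotencyKey, response);
  return response;
}

let ordersCreated = 0;
const createOrder = (payload) => ({ status: 201, orderId: ++ordersCreated, item: payload.item });

const first = handleCreateOrder("key-123", { item: "book" }, createOrder);
const retry = handleCreateOrder("key-123", { item: "book" }, createOrder);
console.log(first.orderId === retry.orderId, ordersCreated); // true 1
```

The client's only job is to reuse the same key on every retry of the same logical action.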
Exponential backoff: waiting smarter
Once you've confirmed a retry is safe, the next question is when. Retrying immediately puts the same load back on a server that just failed. You need to wait. But how long?
Exponential backoff doubles the wait on each attempt:
- Attempt 1 fails → wait 1 s
- Attempt 2 fails → wait 2 s
- Attempt 3 fails → wait 4 s
- Attempt 4 fails → wait 8 s
- …up to a configured cap (e.g., 30 s)
The exponential growth gives a temporarily overloaded server room to breathe. But there is a hidden danger: if thousands of clients all failed at the same instant (say, during a brief hiccup), they will all wait the same amount and then all retry at exactly the same instant. This is the thundering herd problem, and it can turn a 2-second blip into a 20-minute outage.
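The schedule above can be written as a one-line function. This is a sketch under the example's own numbers (1 s base, 30 s cap); the name backoffDelay is hypothetical. Because the function is deterministic, every client that failed at the same moment computes exactly the same waits, which is the root of the thundering herd.

```javascript
// Delay in ms before the retry that follows failed attempt k (1-indexed).
function backoffDelay(k, baseMs = 1000, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** (k - 1));
}

const schedule = [1, 2, 3, 4, 5, 6, 7].map((k) => backoffDelay(k));
console.log(schedule); // [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```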
Jitter: breaking the thundering herd
The fix is jitter — random noise added to the backoff. Instead of waiting exactly 4 seconds on attempt 3, each client waits a random value drawn uniformly from the range [0, 4 s]. The clients spread themselves across a 4-second window instead of spiking simultaneously. The server sees a smooth drizzle of requests instead of a hammer blow.
AWS's Builders' Library calls this "full jitter" and recommends it over "equal jitter" (which only randomizes half the interval) for most workloads, because it maximises the spread under load.
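The two variants differ only in the range they draw from. A minimal sketch, with hypothetical function names and an injectable rng parameter (an assumption added so the result can be checked deterministically):

```javascript
// rng defaults to Math.random; pass a fixed function to make the draw repeatable.
function fullJitter(ceilingMs, rng = Math.random) {
  return rng() * ceilingMs; // uniform in [0, ceiling)
}

function equalJitter(ceilingMs, rng = Math.random) {
  return ceilingMs / 2 + rng() * (ceilingMs / 2); // uniform in [ceiling/2, ceiling)
}

const ceiling = 4000; // the 4 s window from attempt 3
console.log(fullJitter(ceiling, () => 0.25)); // 1000
console.log(equalJitter(ceiling, () => 0.25)); // 2500
```

Full jitter spreads clients over the whole window, while equal jitter packs them into its second half.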
Retry caps and budgets
Exponential backoff must have a maximum delay cap (e.g., 30 s) and a maximum attempt count. Without the cap, a client waiting 2²⁰ seconds (~12 days) is just a broken client. Without a max count, a client that never gives up ties up a thread, a connection, and potentially memory, indefinitely.
For services with many concurrent clients, a retry budget adds a second layer: a percentage ceiling on the total fraction of requests that may be retries at any given moment. If more than 10% of your outgoing traffic is retries, something is systemically wrong and further retrying is making it worse. Circuit breakers (see Lesson rel-06) handle this at a higher level.
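One way to picture a retry budget is a pair of counters per time window. The RetryBudget class below is a hypothetical sketch, not a standard API; real implementations often use token buckets or sliding windows instead.

```javascript
// Hypothetical RetryBudget: counts requests and retries within the current window.
class RetryBudget {
  constructor(maxRetryFraction = 0.1) {
    this.maxRetryFraction = maxRetryFraction;
    this.total = 0; // all outgoing requests, including retries
    this.retries = 0; // outgoing requests that were retries
  }
  recordRequest() {
    this.total += 1;
  }
  // Returns true (and records the retry) only if retries stay within the budget.
  tryAcquireRetry() {
    const projected = (this.retries + 1) / (this.total + 1);
    if (projected > this.maxRetryFraction) return false; // fail fast instead of retrying
    this.retries += 1;
    this.total += 1;
    return true;
  }
  reset() {
    // Call at the start of each window.
    this.total = 0;
    this.retries = 0;
  }
}

const budget = new RetryBudget(0.1);
for (let i = 0; i < 90; i++) budget.recordRequest();
let granted = 0;
for (let i = 0; i < 20; i++) if (budget.tryAcquireRetry()) granted += 1;
console.log(granted); // 10
```

With 90 first-try requests, only 10 retries fit before retries would exceed 10% of outgoing traffic; the remaining 10 are refused.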
Honoring Retry-After
When a server returns 429 Too Many Requests or 503 Service Unavailable, it often includes a Retry-After header whose value is either an integer number of seconds (for example, Retry-After: 120) or an HTTP-date (for example, Retry-After: Wed, 21 Oct 2015 07:28:00 GMT).
Ignoring Retry-After and retrying immediately is the fastest way to get your IP banned and escalate a rate-limit into a permanent block. Your backoff logic must check for this header and, if present, use its value as the floor for the wait, regardless of what your exponential schedule says.
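A sketch of reading both header forms and using the result as a floor. The helper names parseRetryAfter and nextWaitMs are hypothetical, and falling back to the computed backoff when the header is unparseable is an assumption.

```javascript
// Returns the requested wait in ms, or null if the header is missing or unparseable.
function parseRetryAfter(headerValue, nowMs = Date.now()) {
  if (headerValue == null || headerValue.trim() === "") return null;
  const value = headerValue.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000; // integer-seconds form
  const dateMs = Date.parse(value); // HTTP-date form
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs); // a date in the past means "no extra wait"
}

// Retry-After is a floor: never wait less than the server asked for.
function nextWaitMs(headerValue, backoffMs, nowMs = Date.now()) {
  const retryAfterMs = parseRetryAfter(headerValue, nowMs);
  return retryAfterMs === null ? backoffMs : Math.max(retryAfterMs, backoffMs);
}

console.log(nextWaitMs("45", 3000)); // 45000
const now = Date.parse("Wed, 21 Oct 2015 07:27:00 GMT");
console.log(nextWaitMs("Wed, 21 Oct 2015 07:28:00 GMT", 3000, now)); // 60000
```

Taking the maximum of the two values means a short Retry-After never shortens the client's own backoff.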
Worked example: exponential backoff with full jitter
// retryWithBackoff (JavaScript; assumes a global fetch, as in browsers and Node 18+)
// Suitable for idempotent HTTP calls only.
// Note: fetch() rejects on network errors; those propagate to the caller here.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(request, options = {}) {
  const {
    maxAttempts = 4, // stop after 4 tries (1 original + 3 retries)
    baseDelay = 500, // ms, first backoff window
    maxDelay = 30_000, // ms, cap at 30 s regardless of exponent
  } = options;
  const RETRYABLE = new Set([408, 429, 500, 502, 503, 504]);

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Clone Request objects so a body can be sent again on the next attempt
    const response = await fetch(request instanceof Request ? request.clone() : request);

    // Success or permanent failure: return immediately
    if (!RETRYABLE.has(response.status)) return response;

    // Last attempt: don't sleep, just surface the error
    if (attempt === maxAttempts - 1) {
      throw new Error(`Failed after ${maxAttempts} attempts: ${response.status}`);
    }

    // Exponential backoff with full jitter:
    // wait = random(0, min(cap, base * 2^attempt))
    const ceiling = Math.min(maxDelay, baseDelay * 2 ** attempt);
    const jitteredDelay = Math.random() * ceiling; // uniform in [0, ceiling)

    // Honor Retry-After (seconds or HTTP-date) as a floor for the wait
    const retryAfter = response.headers.get("Retry-After");
    let retryAfterMs = 0;
    if (retryAfter !== null) {
      const seconds = Number(retryAfter);
      retryAfterMs = Number.isFinite(seconds)
        ? seconds * 1000
        : Date.parse(retryAfter) - Date.now();
      if (!Number.isFinite(retryAfterMs)) retryAfterMs = 0; // unparseable header
    }
    await sleep(Math.max(retryAfterMs, jitteredDelay));
  }
}
Walk through the attempts on a server that returns 503 for about half a second and then recovers (no Retry-After header, jitter draws near the midpoint of each window):
- Attempt 0: immediate. Gets 503. Ceiling = min(30 000, 500 × 1) = 500 ms. Waits ~0–500 ms.
- Attempt 1: ~250 ms later. Gets 503. Ceiling = min(30 000, 500 × 2) = 1 000 ms. Waits ~0–1 s.
- Attempt 2: ~0.5 s later. Server has recovered. Gets 200. Returns.
Total elapsed: roughly 0.75 s of waiting with these draws, and never more than 1.5 s, plus request latency. Without retries, the caller would have surfaced an error to the user and abandoned the request.
"Your service calls a flaky downstream API. How do you make it resilient?" The expected answer hits four notes: (1) classify retryable status codes; (2) require idempotency on retried ops; (3) exponential backoff with full jitter to avoid thundering-herd; (4) cap attempts and honor Retry-After. Mentioning a retry budget or pairing with a circuit breaker (see rel-06) elevates the answer to senior level.
Retrying a non-idempotent operation. A checkout flow that calls POST /payments without an idempotency key and retries on 500 can charge a customer multiple times. The server may have successfully processed the first request and then failed writing the response. Always send an idempotency key and let the server deduplicate; never rely solely on status codes for payment-style mutations.
Retry storms. A misconfigured fleet of 500 clients, each making up to 10 attempts (1 original + 9 retries) with no jitter, can multiply traffic by 10× during the exact window when the downstream is already struggling. This turns a 10-second hiccup into a 5-minute outage. Jitter and retry budgets are not nice-to-haves; they are production safety devices.
Do: randomize your backoff, cap your delay, cap your attempt count, honor Retry-After, and only retry idempotent (or idempotency-keyed) requests. Don't: retry immediately on failure, retry 4xx errors (except 408/429), or omit a maximum — a retry loop without a ceiling runs forever.
Under the hood: the exact backoff math
The worked example above uses baseDelay * 2 ** attempt. Let's trace the arithmetic precisely for a baseDelay of 500 ms and a maxDelay cap of 30 000 ms, over seven attempts (n = 0 to 6).
Step 1 — uncapped exponential ceiling for each attempt n (0-indexed):
ceiling(n) = base × 2ⁿ
attempt 0: ceiling = 500 × 2⁰ = 500 ms
attempt 1: ceiling = 500 × 2¹ = 1 000 ms
attempt 2: ceiling = 500 × 2² = 2 000 ms
attempt 3: ceiling = 500 × 2³ = 4 000 ms
attempt 4: ceiling = 500 × 2⁴ = 8 000 ms
attempt 5: ceiling = 500 × 2⁵ = 16 000 ms
attempt 6: ceiling = 500 × 2⁶ = 32 000 ms → capped at 30 000 ms
Step 2 — apply the cap: capped_ceiling = min(maxDelay, base × 2ⁿ)
Step 3 — full jitter: draw a uniform random number in [0, capped_ceiling):
wait(n) = random_uniform(0, min(maxDelay, base × 2ⁿ))
A concrete five-attempt sequence with illustrative random draws:
| Attempt | Uncapped ceiling (ms) | Capped ceiling (ms) | Jitter draw (ms) | Actual wait (ms) |
|---|---|---|---|---|
| 0 (original) | — | — | — | 0 (immediate) |
| 1 (retry 1) | 500 | 500 | 0.74 × 500 | 370 |
| 2 (retry 2) | 1 000 | 1 000 | 0.22 × 1000 | 220 |
| 3 (retry 3) | 2 000 | 2 000 | 0.88 × 2000 | 1 760 |
| 4 (retry 4) | 4 000 | 4 000 | 0.41 × 4000 | 1 640 |
| 5 (retry 5) | 8 000 | 8 000 | 0.06 × 8000 | 480 |
Total elapsed above: ~4.5 s for 5 retries. Notice that some retries happen faster than a naive schedule would dictate (retry 5 = 480 ms) — this is intentional: individual clients may recover quickly while the population spreads out. Without jitter, every client would see exactly {0, 500, 1000, 2000, 4000, 8000} ms — perfectly synchronized spikes.
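The table's arithmetic can be reproduced in a few lines. A small sketch using the same base, cap, and illustrative draws; the variable names are hypothetical.

```javascript
const baseMs = 500;
const capMs = 30000;
const draws = [0.74, 0.22, 0.88, 0.41, 0.06]; // the illustrative draws from the table

// The wait before retry i+1 uses ceiling min(cap, base * 2^i); rounding removes float noise.
const waits = draws.map((u, i) => Math.round(u * Math.min(capMs, baseMs * 2 ** i)));
const total = waits.reduce((sum, w) => sum + w, 0);

console.log(waits); // [370, 220, 1760, 1640, 480]
console.log(total); // 4470
```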
Why jitter prevents synchronized retry storms. Consider 1 000 clients all failing at time T=0:
- No jitter: at T+500 ms, all 1 000 send retry 1 simultaneously. The server, which is recovering, receives a 1 000-request spike, potentially re-triggering the failure. At T+1 500 ms (500 ms plus the 1 000 ms second wait) the same spike arrives again, and the server has to absorb a second wave while still recovering from the first.
- Full jitter: between T+0 and T+500 ms, retries trickle in at roughly 2 requests/ms (1 000 spread over 500 ms). The server sees a smooth drizzle instead of 1 000 requests in the same millisecond. A recovering server copes far better with an even 2 requests/ms than with the whole fleet arriving at once.
The math: with full jitter, the expected wait for a single client on attempt n is min(maxDelay, base × 2ⁿ) / 2, exactly half the ceiling. On average that is shorter than the no-jitter fixed wait, yet the population-level spreading still makes it better for systems under stress.
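A tiny simulation makes the contrast concrete. This is a sketch under stated assumptions: 1 000 clients, a 500 ms first-retry ceiling, 10 ms buckets, and a simple seeded generator (makeRng, hypothetical) used only so the run is repeatable.

```javascript
// Tiny deterministic generator (a linear congruential generator) for repeatable runs.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // in [0, 1)
  };
}

// Largest number of retries that land in any single bucket of bucketMs.
function peakPerBucket(retryTimesMs, bucketMs = 10) {
  const counts = new Map();
  for (const t of retryTimesMs) {
    const bucket = Math.floor(t / bucketMs);
    counts.set(bucket, (counts.get(bucket) || 0) + 1);
  }
  return Math.max(...counts.values());
}

const clients = 1000;
const ceilingMs = 500;
const rng = makeRng(42);

const noJitterTimes = Array.from({ length: clients }, () => ceilingMs);
const fullJitterTimes = Array.from({ length: clients }, () => rng() * ceilingMs);

console.log(peakPerBucket(noJitterTimes)); // 1000: every client lands in the same bucket
console.log(peakPerBucket(fullJitterTimes)); // far below 1000; the 50 buckets average 20 each
```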
Some implementations use decorrelated jitter: sleep = random(base, prev_sleep × 3). This produces a sequence uncorrelated across attempts (no client waits exactly the same times twice), which is good for certain distributed scenarios. AWS's Builders' Library explicitly prefers full jitter (random(0, min(cap, base × 2ⁿ))) for most API clients because it is simpler to reason about, produces a known average, and achieves equivalent spread. Use full jitter unless you have a specific reason for decorrelated.
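For comparison, a sketch of the decorrelated variant described above. The function name is hypothetical, and clamping the result to a cap is an assumption added here so the sequence stays bounded.

```javascript
// sleep = random(base, prev_sleep × 3), clamped to capMs (the clamp is an added assumption).
function decorrelatedJitter(prevSleepMs, baseMs, capMs, rng = Math.random) {
  const upper = prevSleepMs * 3;
  const drawn = baseMs + rng() * (upper - baseMs);
  return Math.min(capMs, drawn);
}

let sleepMs = 500; // start from base
const fixedRng = () => 0.5; // fixed draw so the sequence is reproducible
const sequence = [];
for (let i = 0; i < 4; i++) {
  sleepMs = decorrelatedJitter(sleepMs, 500, 30000, fixedRng);
  sequence.push(sleepMs);
}
console.log(sequence); // [1000, 1750, 2875, 4562.5]
```

Each wait depends on the previous wait, not on the attempt number, which is the main structural difference from full jitter.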
How to debug & inspect it
A retry storm looks deceptively like a sudden traffic surge. The key diagnostic is the request multiplier: retries make your outgoing request volume larger than your incoming request volume. If you're receiving 100 RPS from users but sending 300 RPS to the downstream, you have a 3× multiplier — a strong indicator of aggressive retrying.
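The multiplier check is simple enough to put on a dashboard or in an alert. A minimal sketch, assuming one downstream call per incoming request; the function names and the 1.5× default threshold (the rule of thumb this lesson uses) are illustrative.

```javascript
// Ratio of outgoing to incoming request rates; null when there is no incoming traffic.
function requestMultiplier(incomingRps, outgoingRps) {
  if (incomingRps <= 0) return null;
  return outgoingRps / incomingRps;
}

function looksLikeRetryStorm(incomingRps, outgoingRps, threshold = 1.5) {
  const multiplier = requestMultiplier(incomingRps, outgoingRps);
  return multiplier !== null && multiplier > threshold;
}

console.log(requestMultiplier(100, 300)); // 3
console.log(looksLikeRetryStorm(100, 300)); // true
console.log(looksLikeRetryStorm(100, 110)); // false
```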
| Symptom | Likely cause | Fix |
|---|---|---|
| Outgoing RPS ≫ incoming RPS (multiplier >1.5×) | Retry storm — too many retries per failure, no jitter, or no backoff cap | Add full jitter; reduce maxAttempts; add a retry budget (max 10% of traffic may be retries) |
| Downstream sees a traffic spike exactly N seconds after an outage starts | No jitter — all clients retry at the same exponential interval | Add random(0, ceiling) jitter; verify the jitter is applied before the sleep, not after |
| Customer charged twice / order created twice | Non-idempotent POST retried without an idempotency key | Generate a stable idempotency key per user action (UUID stored client-side); reuse it on every retry |
| Client immediately banned after hitting 429 | Retry-After header ignored; client retried immediately | Read Retry-After; treat it as the floor for the next wait, overriding the computed backoff if shorter |
| Retry logic runs forever, blocking a thread | No maxAttempts cap or no maxDelay cap | Always set both; surface the error to the caller after exhausting attempts |
| "Infinite retry" — service restarts but retries resume from attempt 0 | Attempt counter is in-memory, lost on restart | For long-running retries, persist the attempt count (e.g., in a job queue); use a dead-letter queue after N attempts |
Retry-config review checklist:
- Is the retryable status-code set explicit and correct? Confirm 4xx codes (except 408/429) are excluded.
- Is every retried endpoint idempotent — either by HTTP semantics (GET/PUT/DELETE) or by an idempotency-key header?
- Is full jitter applied? Run the backoff formula 10 times and verify the outputs are not identical.
- Is there a maxAttempts cap (≤5 for most APIs) and a maxDelay cap (≤60 s)?
- Does the code read Retry-After from 429/503 responses and honor it as a floor?
- Is there a fleet-level retry budget (reject retries if >X% of outgoing traffic is already retries)?
By the numbers
Scenario: a payment microservice calls an external processor at a baseline of 500 req/s. Each call is configured with up to 3 retries (max 4 attempts total). During a 10-second partial outage the processor returns 503 on every attempt.
Backoff schedule: delay_n = min(cap, base · 2ⁿ)
With base = 100 ms and cap = 2 000 ms (2 s), the per-attempt ceiling and expected wait under full jitter (random(0, delay_n), so expected = delay_n / 2) are:
| Attempt n | Uncapped (ms) | delay_n = min(cap, base·2ⁿ) (ms) | Full jitter: expected wait (ms) | Result (all 503) |
|---|---|---|---|---|
| 0 — original | — | — | 0 (immediate) | 503 → retry |
| 1 — retry 1 | 200 | 200 | 100 | 503 → retry |
| 2 — retry 2 | 400 | 400 | 200 | 503 → retry |
| 3 — retry 3 | 800 | 800 | 400 | 503 → give up |
Expected elapsed per call: 0 + 100 + 200 + 400 = 700 ms average before the final failure surfaces. (Without jitter, the fixed sequence is 0 + 200 + 400 + 800 = 1 400 ms but all clients synchronize — the jittered version is faster on average and spreads load.)
If more attempts were allowed, the schedule would plateau at the cap. With base = 100 ms and cap = 2 s, 100 × 2⁴ = 1 600 ms is still under the cap, but 100 × 2⁵ = 3 200 ms is not, so attempts n ≥ 5 all have delay_n = 2 000 ms and expected wait = 1 000 ms. Formula: delay_n = min(2000, 100 × 2ⁿ) hits the cap at n = log₂(2000/100) = log₂(20) ≈ 4.3, i.e. from attempt 5 onwards. See: AWS Builders’ Library, "Timeouts, retries, and backoff with jitter".
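The numbers in this scenario can be checked with a short script. A sketch using the scenario's base and cap; the variable names are hypothetical.

```javascript
const baseMs = 100;
const capMs = 2000;
const delay = (n) => Math.min(capMs, baseMs * 2 ** n);

// Expected total wait across retries 1..3 under full jitter (half of each ceiling).
const expectedTotal = [1, 2, 3].reduce((sum, n) => sum + delay(n) / 2, 0);
console.log(expectedTotal); // 700

// First attempt index whose uncapped delay reaches the cap.
const firstCapped = Math.ceil(Math.log2(capMs / baseMs));
console.log(firstCapped, delay(4), delay(5)); // 5 1600 2000
```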
Retry amplification: worst-case traffic multiplier
When the processor is overloaded, every request fails and every client retries up to 3 times. The outgoing request rate seen by the processor becomes:
amplified QPS = baseline QPS × (1 + max retries) = 500 × (1 + 3) = 2 000 req/s
The table below shows how the multiplier scales with retry count and baseline load. The processor was already struggling at 500 req/s; it now receives 2 000 req/s, ensuring the outage extends far longer than the original trigger:
| Baseline QPS | Max retries | Multiplier (1 + retries) | Amplified QPS | Effect |
|---|---|---|---|---|
| 500 | 1 | 2× | 1 000 | Manageable surge |
| 500 | 3 | 4× | 2 000 | Processor overwhelmed |
| 500 | 9 | 10× | 5 000 | Catastrophic; deepens outage |
| 1 000 | 3 | 4× | 4 000 | Cascading failure territory |
This is why aggressive retry counts turn a brief blip into a prolonged outage: each failing request spawns N clones, all hitting the same struggling system simultaneously.
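The amplification table follows from a single multiplication. A sketch of the worst case, in which every attempt fails; the function name is hypothetical.

```javascript
// Worst case: every attempt fails, so each original request becomes 1 + maxRetries requests.
function amplifiedQps(baselineQps, maxRetries) {
  return baselineQps * (1 + maxRetries);
}

const rows = [
  [500, 1],
  [500, 3],
  [500, 9],
  [1000, 3],
];
for (const [baseline, retries] of rows) {
  console.log(baseline, retries, amplifiedQps(baseline, retries));
}
// 500 1 1000
// 500 3 2000
// 500 9 5000
// 1000 3 4000
```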
Decision math: retry budget — keeping the multiplier ≤ X
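A retry budget caps the fraction of outgoing traffic that may be retries at any instant. If you want to limit the amplification multiplier to at most 1.10× (i.e. retries add no more than 10% overhead):
retries ≤ 0.10 × original requests
total outgoing = original + retries ≤ 1.10 × original
retry fraction of outgoing traffic ≤ 0.10 / 1.10 ≈ 9.1%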
Equivalently, for a target multiplier M, the maximum retry fraction is (M - 1) / M. For M = 1.5 that is 33%; for M = 2 it is 50%; for M = 4 it is 75% — already a red flag. Pair this with a circuit breaker (Lesson rel-06) so the breaker opens before the budget is exhausted.
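The conversion between a target multiplier and a retry fraction works in both directions. A small sketch; the function names are hypothetical.

```javascript
// Maximum share of outgoing traffic that may be retries for a target multiplier M.
const maxRetryFraction = (m) => (m - 1) / m;
// Inverse: the multiplier implied by a retry fraction f of outgoing traffic.
const multiplierFor = (f) => 1 / (1 - f);

console.log([1.1, 1.5, 2, 4].map((m) => (maxRetryFraction(m) * 100).toFixed(1)));
// ["9.1", "33.3", "50.0", "75.0"]
console.log(multiplierFor(0.5)); // 2
```

Note that a budget phrased as "10% of outgoing traffic may be retries" corresponds to a multiplier of about 1.11×, slightly above 1.10×.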
🧠 Quick check
1. A client receives HTTP 404 Not Found. What should it do?
404 is a permanent, client-side error: the server understood the request and the resource simply isn't there. Retrying won't create the resource. Fix the URL or handle the missing resource in application logic.
2. Why is "full jitter" (random value in [0, ceiling]) preferred over retrying at the exact backoff interval?
Full jitter is not mainly about any one client, whose wait simply becomes a random value no longer than the fixed interval. Its value is collective: when many clients all fail at the same instant, jitter stops them from all retrying at the same instant, distributing load across the window instead of creating a synchronized spike.
3. A server returns HTTP 429 with header Retry-After: 45. The backoff formula computes a 3-second delay. How long should the client wait?
At least 45 seconds. The server is the authority on its own rate limits. Retry-After communicates when the server will accept another request. Ignoring it and retrying sooner achieves nothing and may escalate the block. Always treat Retry-After as a floor, not a suggestion.
4. Which condition makes retrying a POST /transfers call safe?
A network timeout doesn't mean the server didn't receive the request — it may have processed it and failed sending the response. The only safe retry of a state-changing operation is when the server deduplicates using an idempotency key, so a duplicate request is a no-op.
✍️ Exercise: design a retry policy for a payment service
You're building a microservice that calls an external payment processor. The processor can return 500, 503, 429, 400, and 402 Payment Required. Design the retry policy: which statuses retry, what are the backoff parameters, and what safeguards prevent a fleet of 200 service instances from amplifying an outage?
Model answer:
- Retryable: 500, 503, 429 (server-side and transient). Not retryable: 400 (bad request: fix the payload), 402 (insufficient funds: retrying won't conjure money).
- Idempotency key: generate a UUID per payment attempt, send as Idempotency-Key: <uuid>. Reuse the same key on retries so the processor deduplicates.
- Backoff: base 500 ms, full jitter, cap 20 s. Max 4 attempts (1 original + 3 retries).
- Retry-After: if 429 or 503 includes the header, sleep for max(Retry-After, jitteredDelay).
- Fleet protection: configure a retry budget (e.g., no more than 10% of outgoing payment calls may be retries at any moment). If the budget is exhausted, fail fast and let the circuit breaker handle the rest.
Rubric: ✓ correctly excludes 400/402 ✓ idempotency key with reuse on retry ✓ full jitter mentioned ✓ Retry-After honored ✓ fleet-level safeguard (budget or circuit breaker) addressed. Four out of five = solid; five = excellent.
Key takeaways
- Retry 5xx and transient errors; never retry 4xx (except 408/429) — those are the client's fault.
- Idempotency is a prerequisite for retrying any state-changing operation; use idempotency keys.
- Exponential backoff gives a recovering server breathing room; full jitter prevents the thundering herd.
- Always set a max-attempts cap and a max-delay cap; unbounded retries are production bugs.
- Always honor Retry-After; ignoring it is the fastest path to a permanent ban.
- Retry budgets protect the fleet: if retries dominate traffic, you're amplifying the outage, not recovering from it.