Reliability & Scale · Lesson 05

Retries & Exponential Backoff

A failed request is not always a dead end — but blindly retrying is like jabbing an elevator button fifty times expecting a different result. The difference between a resilient system and a cascading disaster is knowing when to retry, how long to wait, and when to give up entirely.

⏱ 12 min Difficulty: core Prereq: Idempotency (rel-02), API Gateway (rel-04)

By the end you'll be able to

Classify HTTP status codes as "safe to retry" or "never retry" and explain why.
Implement exponential backoff with full jitter and a retry cap in any language.
Explain how retry storms amplify outages and how jitter and retry budgets prevent them.

Why retries exist: the transient-failure problem

Networks are physical. Packets traverse routers, cross undersea cables, and land on servers that share CPUs with hundreds of other tenants. Occasionally a packet gets dropped, a TCP connection times out, or a server momentarily runs out of file descriptors. These faults are transient — they go away on their own within milliseconds to a few seconds. If you waited half a second and tried again, the request would have succeeded.

The analogy: imagine calling a friend. You hear three rings and then silence — not voicemail, just silence. That's a network glitch, not your friend refusing to talk to you. You redial. You don't interpret it as a permanent rejection and delete their number.

Retrying transient failures is one of the cheapest reliability gains available to a client. But the logic must be precise, because retrying the wrong thing at the wrong time destroys the system you are trying to protect.

The retry decision tree: what is safe?

The first gate is the HTTP status code.

Status	Meaning	Retry?	Reason
`408`	Request Timeout	✅ Yes	Server timed out waiting for the full request; transient.
`429`	Too Many Requests	✅ Yes, with delay	Rate-limited; honor `Retry-After`.
`500`	Internal Server Error	✅ Yes (idempotent ops)	Server-side fault; often transient.
`502`	Bad Gateway	✅ Yes	Gateway got an invalid response from upstream; often transient.
`503`	Service Unavailable	✅ Yes, with delay	Overload; honor `Retry-After`.
`504`	Gateway Timeout	✅ Yes	Upstream too slow; may self-heal.
`400`	Bad Request	🚫 Never	Your payload is malformed; retrying is pointless.
`401`	Unauthorized	🚫 Never	Fix your credentials first.
`403`	Forbidden	🚫 Never	Permissions issue; server won't change its mind.
`404`	Not Found	🚫 Never	Resource doesn't exist; retrying wastes bandwidth.
`422`	Unprocessable Entity	🚫 Never	Semantic validation failure; fix the data.

The pattern: 4xx errors are the client's fault (except 408 and 429). The server understood the request and rejected it. Retrying the exact same bad request is futile. 5xx errors are the server's fault — the server failed to process a valid request, so retrying with an identical request can succeed once the server recovers.
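The classification above can be sketched as a small lookup. This is a minimal TypeScript sketch rather than any library's API: the names `classifyStatus` and `RetryDecision` are hypothetical, and the set simply mirrors the table in this lesson.

```typescript
// Status codes this lesson's table marks as retryable.
const RETRYABLE_STATUSES: ReadonlySet<number> = new Set([408, 429, 500, 502, 503, 504]);

type RetryDecision = "retry" | "retry-after-delay" | "never";

// Hypothetical helper: maps a status code to the table's "Retry?" column.
// Anything outside the set (including 2xx successes) returns "never",
// because there is nothing useful to retry.
function classifyStatus(status: number): RetryDecision {
  if (!RETRYABLE_STATUSES.has(status)) return "never";
  // 429 and 503 are the rows where the table says to honor Retry-After.
  return status === 429 || status === 503 ? "retry-after-delay" : "retry";
}

console.log(classifyStatus(503)); // retry-after-delay
console.log(classifyStatus(404)); // never
```

Keeping the set explicit, instead of testing `status >= 500`, makes the policy easy to review and keeps codes such as 501 out of the retry path by default.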

The idempotency prerequisite

The table above says "✅ Yes (idempotent ops)" for 500, and the same caveat applies to every retryable row: a 502, 503, or 504 can also arrive after the upstream has already done the work. That parenthetical is load-bearing. Before retrying, you must know whether the operation is idempotent, meaning that doing it twice produces the same result as doing it once. A GET is idempotent. A PUT that replaces a resource is idempotent. A bare POST /orders that creates a new order is not: retrying it could create a duplicate order and charge a customer twice.

The solution is to make the server deduplicate using an idempotency key sent in a request header. The server stores the key and returns the cached response if it sees the same key again. See Lesson rel-02 (Idempotency) for the full pattern. The rule: if you cannot guarantee idempotency, do not retry.
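To make the deduplication idea concrete, here is a minimal server-side sketch. It assumes an in-memory `Map` as the key store and a hypothetical `handleCreateOrder` handler; a real service would need shared, durable storage with expiry, which this sketch does not attempt.

```typescript
interface StoredResponse {
  status: number;
  body: string;
}

// Illustration only: an in-memory store keyed by idempotency key.
const responsesByKey = new Map<string, StoredResponse>();
let ordersCreated = 0;

// Hypothetical handler: performs the side effect at most once per key.
function handleCreateOrder(idempotencyKey: string, payload: string): StoredResponse {
  const cached = responsesByKey.get(idempotencyKey);
  if (cached !== undefined) return cached; // duplicate: replay, do not re-execute

  ordersCreated += 1; // the side effect happens once per key
  const response: StoredResponse = {
    status: 201,
    body: `order-${ordersCreated} for ${payload}`,
  };
  responsesByKey.set(idempotencyKey, response);
  return response;
}

const first = handleCreateOrder("key-123", "2 widgets");
const retry = handleCreateOrder("key-123", "2 widgets"); // same key on retry
console.log(first.body === retry.body, ordersCreated); // true 1
```

The client's only job is to reuse the same key on every retry of the same user action; a fresh key per retry would defeat the deduplication.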

Exponential backoff: waiting smarter

Once you've confirmed a retry is safe, the next question is when. Retrying immediately puts the same load back on a server that just failed. You need to wait. But how long?

Exponential backoff doubles the wait on each attempt:

Attempt 1 fails → wait 1 s
Attempt 2 fails → wait 2 s
Attempt 3 fails → wait 4 s
Attempt 4 fails → wait 8 s
…up to a configured cap (e.g., 30 s)
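The doubling schedule above can be sketched in one function. This is a minimal sketch assuming a 1 s base and a 30 s cap as in the list; `backoffDelayMs` is a hypothetical name, and retries are numbered from 1.

```typescript
// Delay before retry number `retry` (1-based): doubles from baseMs, never exceeds capMs.
function backoffDelayMs(retry: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** (retry - 1));
}

const schedule = [1, 2, 3, 4, 5, 6, 7].map((retry) => backoffDelayMs(retry, 1_000, 30_000));
console.log(schedule.join(", ")); // 1000, 2000, 4000, 8000, 16000, 30000, 30000
```

The sixth retry would be 32 s uncapped, so the cap takes over from there.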

The exponential growth gives a temporarily overloaded server room to breathe. But there is a hidden danger: if thousands of clients all failed at the same instant (say, during a brief hiccup), they will all wait the same amount and then all retry at exactly the same instant. This is the thundering herd problem, and it can turn a 2-second blip into a 20-minute outage.

Jitter: breaking the thundering herd

The fix is jitter — random noise added to the backoff. Instead of waiting exactly 4 seconds on attempt 3, each client waits a random value drawn uniformly from the range [0, 4 s]. The clients spread themselves across a 4-second window instead of spiking simultaneously. The server sees a smooth drizzle of requests instead of a hammer blow.

AWS's "Exponential Backoff and Jitter" article (AWS Architecture Blog) calls this "full jitter" and found it preferable to "equal jitter" (which keeps half the interval fixed and randomizes only the other half), because full jitter spreads retries across the whole window.
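The two variants differ only in how much of the interval is randomized. The following is a minimal sketch; `fullJitterMs`, `equalJitterMs`, and the injectable `Rng` type are hypothetical names introduced so the behaviour can be checked with a fixed random source.

```typescript
type Rng = () => number; // returns a value in [0, 1), like Math.random

// Full jitter: anywhere in [0, ceiling).
function fullJitterMs(ceilingMs: number, rng: Rng = Math.random): number {
  return rng() * ceilingMs;
}

// Equal jitter: half the ceiling is fixed, the other half is random,
// so the result lies in [ceiling / 2, ceiling).
function equalJitterMs(ceilingMs: number, rng: Rng = Math.random): number {
  const half = ceilingMs / 2;
  return half + rng() * half;
}

const fixedRng: Rng = () => 0.25; // deterministic stand-in for Math.random
console.log(fullJitterMs(4_000, fixedRng));  // 1000
console.log(equalJitterMs(4_000, fixedRng)); // 2500
```

With the same random draw, equal jitter always lands in the upper half of the window, which is why it spreads clients over only half as much time.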

Figure (three panels). Top: without jitter, retries arrive as synchronized spikes that overwhelm a recovering server. Middle: jitter spreads load smoothly. Bottom: retry storms, where each wave of retries generates more failures and more retries, can extend an outage far beyond its original cause.

Retry caps and budgets

Exponential backoff must have a maximum delay cap (e.g., 30 s) and a maximum attempt count. Without the cap, a client waiting 2²⁰ seconds (~12 days) is just a broken client. Without a max count, a client that never gives up ties up a thread, a connection, and potentially memory — indefinitely.

For services with many concurrent clients, a retry budget adds a second layer: a percentage ceiling on the total fraction of requests that may be retries at any given moment. If more than 10% of your outgoing traffic is retries, something is systemically wrong and further retrying is making it worse. Circuit breakers (see Lesson rel-06) handle this at a higher level.
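A retry budget can be sketched as a counter that refuses a retry once retries would exceed a fixed share of outgoing calls. This is a simplified illustration: the `RetryBudget` class is hypothetical and counts over the process lifetime, whereas a production budget would typically use a sliding time window.

```typescript
class RetryBudget {
  private originals = 0;
  private retries = 0;
  private readonly maxRetryFraction: number;

  constructor(maxRetryFraction: number) {
    this.maxRetryFraction = maxRetryFraction;
  }

  // Call once for every first-attempt request sent.
  recordOriginal(): void {
    this.originals += 1;
  }

  // Grants (and counts) a retry only if retries stay within the budget.
  tryAcquireRetry(): boolean {
    const totalIfGranted = this.originals + this.retries + 1;
    if ((this.retries + 1) / totalIfGranted > this.maxRetryFraction) return false;
    this.retries += 1;
    return true;
  }
}

const budget = new RetryBudget(0.1); // at most 10% of outgoing calls may be retries
for (let i = 0; i < 100; i++) budget.recordOriginal();

let granted = 0;
for (let i = 0; i < 20; i++) {
  if (budget.tryAcquireRetry()) granted += 1;
}
console.log(granted); // 11 (11 of 111 outgoing calls is about 9.9%)
```

Once the budget is spent the caller fails fast instead of adding load, which is the behaviour the paragraph above describes.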

Honoring Retry-After

When a server returns 429 Too Many Requests or 503 Service Unavailable, it often includes a Retry-After header whose value is either an integer number of seconds or an HTTP-date:

HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719523200

Ignoring Retry-After and retrying immediately is the fastest way to get your IP banned and escalate a rate-limit into a permanent block. Your backoff logic must check for this header and, if present, use its value as the floor for the wait, regardless of what your exponential schedule says.
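Both header forms and the "floor" rule can be sketched as two small functions. This is a minimal sketch under stated assumptions: `parseRetryAfterMs` and `nextWaitMs` are hypothetical helper names, and HTTP-date parsing relies on the runtime's `Date.parse` accepting that format (which common JavaScript engines do).

```typescript
// Returns the Retry-After delay in ms, or null if the header is absent or unparseable.
function parseRetryAfterMs(header: string | null, nowMs: number = Date.now()): number | null {
  if (header === null) return null;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000; // integer-seconds form
  const dateMs = Date.parse(trimmed); // HTTP-date form
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Retry-After acts as a floor under the computed backoff.
function nextWaitMs(
  backoffMs: number,
  retryAfterHeader: string | null,
  nowMs: number = Date.now(),
): number {
  const retryAfterMs = parseRetryAfterMs(retryAfterHeader, nowMs);
  return retryAfterMs === null ? backoffMs : Math.max(backoffMs, retryAfterMs);
}

console.log(nextWaitMs(3_000, "45"));  // 45000: the header wins over a 3 s backoff
console.log(nextWaitMs(3_000, null));  // 3000: no header, use the backoff
```

Taking the maximum, not the header value alone, means a long computed backoff is never shortened by a small `Retry-After`.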

Worked example: exponential backoff with full jitter

// retryWithBackoff (JavaScript)
// Suitable for idempotent HTTP calls only.

// Promise-based sleep helper
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(request, options = {}) {
  const {
    maxAttempts = 4,       // stop after 4 tries (1 original + 3 retries)
    baseDelay   = 500,     // ms: first backoff window
    maxDelay    = 30_000,  // ms: cap at 30 s regardless of exponent
  } = options;

  const RETRYABLE = new Set([408, 429, 500, 502, 503, 504]);

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Assumes `request` is a URL or a body-less Request; a Request with a
    // body is consumed by fetch and would need request.clone() per attempt.
    const response = await fetch(request);

    // Success or permanent failure: return immediately
    if (!RETRYABLE.has(response.status)) return response;

    // Last attempt: don't sleep, just surface the error
    if (attempt === maxAttempts - 1) {
      throw new Error(`Failed after ${maxAttempts} attempts: ${response.status}`);
    }

    // Exponential backoff with full jitter:
    // wait = random(0, min(cap, base * 2^attempt))
    const ceiling = Math.min(maxDelay, baseDelay * 2 ** attempt);
    const jitteredDelay = Math.random() * ceiling;  // uniform in [0, ceiling)

    // Retry-After (seconds or HTTP-date) is a floor under the backoff
    const retryAfter = response.headers.get("Retry-After");
    let floorMs = 0;
    if (retryAfter !== null) {
      const seconds = Number(retryAfter);
      const parsedMs = Number.isNaN(seconds)
        ? Date.parse(retryAfter) - Date.now()  // HTTP-date form
        : seconds * 1000;                      // integer-seconds form
      if (!Number.isNaN(parsedMs)) floorMs = Math.max(0, parsedMs);
    }
    await sleep(Math.max(floorMs, jitteredDelay));
  }
}

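Walk through the attempts on a server that returns 503 for about 0.7 seconds and then recovers, assuming no Retry-After header and taking the average jitter draw at each step:

Attempt 0: immediate. Gets 503. Ceiling = min(30 000, 500 × 1) = 500 ms. Waits 0–500 ms (about 250 ms on average).
Attempt 1: ~250 ms in. Gets 503. Ceiling = min(30 000, 500 × 2) = 1 000 ms. Waits 0–1 s (about 500 ms on average).
Attempt 2: ~750 ms in. Server has recovered. Gets 200. Returns.

Total elapsed: roughly 0.75 s in this trace. Without retries, the caller would have surfaced an error to the user and abandoned the request. Note the limit of this configuration: the three waits total at most 3.5 s (1.75 s on average), so an outage lasting several seconds can outlast all four attempts.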

🎯 Interview angle

"Your service calls a flaky downstream API. How do you make it resilient?" The expected answer hits four notes: (1) classify retryable status codes; (2) require idempotency on retried ops; (3) exponential backoff with full jitter to avoid thundering-herd; (4) cap attempts and honor Retry-After. Mentioning a retry budget or pairing with a circuit breaker (see rel-06) elevates the answer to senior level.

⚠️ Common trap

Retrying a non-idempotent operation. A checkout flow that calls POST /payments without an idempotency key and retries on 500 can charge a customer multiple times. The server may have successfully processed the first request and then failed writing the response. Always send an idempotency key and let the server deduplicate; never rely solely on status codes for payment-style mutations.

Retry storms. A misconfigured fleet of 500 clients, each retrying up to 10 times with no jitter, can multiply traffic by up to 11× (1 original + 10 retries) during the exact window when the downstream is already struggling. This can turn a 10-second hiccup into a 5-minute outage. Jitter and retry budgets are not nice-to-haves; they are production safety devices.

✅ Do this, not that

Do: randomize your backoff, cap your delay, cap your attempt count, honor Retry-After, and only retry idempotent (or idempotency-keyed) requests. Don't: retry immediately on failure, retry 4xx errors (except 408/429), or omit a maximum — a retry loop without a ceiling runs forever.

Under the hood: the exact backoff math

The pseudocode in the worked example above uses baseDelay * 2 ** attempt. Let's trace the arithmetic precisely for a baseDelay of 500 ms and a maxDelay cap of 30 000 ms, for attempts 0 through 6 (the first point at which the cap applies).

Step 1 — uncapped exponential ceiling for each attempt n (0-indexed):

ceiling(n) = base × 2ⁿ

attempt 0:  ceiling = 500 × 2⁰ =    500 ms
attempt 1:  ceiling = 500 × 2¹ =  1 000 ms
attempt 2:  ceiling = 500 × 2² =  2 000 ms
attempt 3:  ceiling = 500 × 2³ =  4 000 ms
attempt 4:  ceiling = 500 × 2⁴ =  8 000 ms
attempt 5:  ceiling = 500 × 2⁵ = 16 000 ms
attempt 6:  ceiling = 500 × 2⁶ = 32 000 ms  → capped at 30 000 ms

Step 2 — apply the cap: capped_ceiling = min(maxDelay, base × 2ⁿ)

Step 3 — full jitter: draw a uniform random number in [0, capped_ceiling):

wait(n) = random_uniform(0, min(maxDelay, base × 2ⁿ))

A concrete sequence of five retries with illustrative random draws:

Attempt	Uncapped ceiling (ms)	Capped ceiling (ms)	Jitter draw (ms)	Actual wait (ms)
0 (original)	—	—	—	0 (immediate)
1 (retry 1)	500	500	0.74 × 500	370
2 (retry 2)	1 000	1 000	0.22 × 1000	220
3 (retry 3)	2 000	2 000	0.88 × 2000	1 760
4 (retry 4)	4 000	4 000	0.41 × 4000	1 640
5 (retry 5)	8 000	8 000	0.06 × 8000	480

Total elapsed above: ~4.5 s for 5 retries. Notice that some retries happen faster than a naive schedule would dictate (retry 5 = 480 ms) — this is intentional: individual clients may recover quickly while the population spreads out. Without jitter, every client would see exactly {0, 500, 1000, 2000, 4000, 8000} ms — perfectly synchronized spikes.
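The table's arithmetic can be reproduced in a few lines. This sketch hard-codes the illustrative draws from the table instead of calling a random number generator, so the numbers match exactly; the variable names are placeholders.

```typescript
const tableBaseMs = 500;
const tableCapMs = 30_000;
const draws = [0.74, 0.22, 0.88, 0.41, 0.06]; // the illustrative draws from the table

// The wait before retry k (1-based) uses ceiling min(cap, base * 2^(k - 1)).
const waits = draws.map((u, i) => Math.round(u * Math.min(tableCapMs, tableBaseMs * 2 ** i)));
const totalWaitMs = waits.reduce((sum, w) => sum + w, 0);

console.log(waits.join(", ")); // 370, 220, 1760, 1640, 480
console.log(totalWaitMs);      // 4470
```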

Why jitter prevents synchronized retry storms. Consider 1 000 clients all failing at time T=0:

No jitter: at T+500 ms, all 1 000 send retry 1 simultaneously. The server, which is recovering, receives a 1 000-request spike, potentially re-triggering the failure. At T+1 500 ms (after the next 1 000 ms wait) the same spike arrives again, so the server has to absorb a second wave while still recovering from the first.
Full jitter: between T+0 and T+500 ms, retries trickle in at roughly 2 requests per millisecond (1 000 spread over 500 ms). The server sees a smooth drizzle instead of 1 000 requests landing in the same instant, and each later round spreads across a window twice as wide. A recovering server can work through a steady trickle; it cannot absorb 1 000 synchronized requests at once.

The math: with full jitter, the expected wait for a single client on attempt n is min(maxDelay, base × 2ⁿ) / 2, exactly half the ceiling on average. That is shorter than the fixed no-jitter wait, so an individual client retries sooner on average; the benefit comes from the population-level spreading, which is what protects a system under stress.
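The 1 000-client comparison above can be checked with a small simulation that counts how many first retries land in the same 10 ms bucket. This is a sketch under stated assumptions: `makeRng` is a simple seeded linear congruential generator used only to make the run reproducible, and `peakPerBucket` is a hypothetical helper.

```typescript
// Small seeded generator so the run is reproducible (not for production use).
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

// Largest number of retries that land in any single bucket of `bucketMs`.
function peakPerBucket(waitsMs: number[], bucketMs: number): number {
  const counts = new Map<number, number>();
  for (const wait of waitsMs) {
    const bucket = Math.floor(wait / bucketMs);
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return Math.max(...Array.from(counts.values()));
}

const clientCount = 1000;
const firstCeilingMs = 500;
const simRng = makeRng(42);

const noJitterWaits = Array.from({ length: clientCount }, () => firstCeilingMs);
const fullJitterWaits = Array.from({ length: clientCount }, () => simRng() * firstCeilingMs);

console.log(peakPerBucket(noJitterWaits, 10));   // 1000: every client in the same bucket
console.log(peakPerBucket(fullJitterWaits, 10)); // far lower: about 20 per bucket on average
```

The total number of retries is identical in both runs; only the peak changes, which is the quantity a recovering server cares about.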

⚠️ "Decorrelated jitter" vs. "full jitter"

Some implementations use decorrelated jitter: sleep = min(cap, random(base, prev_sleep × 3)). Each wait is derived from the previous wait instead of the attempt number, so clients that start in step drift apart quickly. AWS's "Exponential Backoff and Jitter" analysis found full jitter (random(0, min(cap, base × 2ⁿ))) and decorrelated jitter to perform similarly, with full jitter doing slightly less work; full jitter is also simpler to reason about and has a known average. Use full jitter unless you have a specific reason for decorrelated.
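For comparison, decorrelated jitter can be sketched as a function of the previous sleep. This is a minimal sketch: `decorrelatedJitterMs` is a hypothetical name, the cap is applied explicitly, and the random source is injectable so the sequence can be reproduced.

```typescript
// Next sleep drawn uniformly from [base, max(base, prev * 3)), then capped.
function decorrelatedJitterMs(
  prevSleepMs: number,
  baseMs: number,
  capMs: number,
  rng: () => number = Math.random,
): number {
  const upper = Math.max(baseMs, prevSleepMs * 3);
  return Math.min(capMs, baseMs + rng() * (upper - baseMs));
}

const halfRng = () => 0.5; // deterministic stand-in for Math.random
let sleepMs = 500;         // start from the base delay
const sequence: number[] = [];
for (let i = 0; i < 3; i++) {
  sleepMs = decorrelatedJitterMs(sleepMs, 500, 30_000, halfRng);
  sequence.push(sleepMs);
}
console.log(sequence.join(", ")); // 1000, 1750, 2875
```

Unlike full jitter, the function needs no attempt counter, only the previous sleep, which is one reason some clients find it convenient.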

How to debug & inspect it

A retry storm looks deceptively like a sudden traffic surge. The key diagnostic is the request multiplier: retries make your outgoing request volume larger than your incoming request volume. If you're receiving 100 RPS from users but sending 300 RPS to the downstream, you have a 3× multiplier — a strong indicator of aggressive retrying.

# Spot a retry storm: compare incoming vs. outgoing request rate
$ curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total%7Bdirection%3D%22outgoing%22%7D%5B1m%5D)' | jq '.data.result[0].value[1]'
"312.4"
$ curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total%7Bdirection%3D%22incoming%22%7D%5B1m%5D)' | jq '.data.result[0].value[1]'
"98.2"
# Multiplier = 312 / 98 ≈ 3.2× → clients retrying ~2 times per original request
# Healthy multiplier is ≈1.0–1.1 (almost no retries)
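The multiplier itself is a one-line calculation. The sketch below uses the two sample rates from the query output above; `retryMultiplier` is a hypothetical helper name.

```typescript
// Ratio of outgoing to incoming request rate; about 1.0 means almost no retries.
function retryMultiplier(outgoingRps: number, incomingRps: number): number {
  if (incomingRps <= 0) throw new RangeError("incomingRps must be positive");
  return outgoingRps / incomingRps;
}

const observedMultiplier = retryMultiplier(312.4, 98.2);
console.log(observedMultiplier.toFixed(2));       // 3.18
console.log((observedMultiplier - 1).toFixed(2)); // 2.18 extra requests per original
```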

# Confirm only-retry-idempotent: inspect your retry config
$ grep -n 'RETRYABLE\|retryable\|retry_on\|retry_status' src/http_client.ts
42: const RETRYABLE = new Set([408, 429, 500, 502, 503, 504]);
# Verify: 400, 401, 403, 404, 422 are NOT in the set; if they are, remove them
# Also check POST/PATCH routes: are idempotency keys sent on all state-changing calls?

Symptom	Likely cause	Fix
Outgoing RPS ≫ incoming RPS (multiplier >1.5×)	Retry storm — too many retries per failure, no jitter, or no backoff cap	Add full jitter; reduce `maxAttempts`; add a retry budget (max 10% of traffic may be retries)
Downstream sees a traffic spike exactly N seconds after an outage starts	No jitter — all clients retry at the same exponential interval	Add `random(0, ceiling)` jitter; verify the jitter is applied before the sleep, not after
Customer charged twice / order created twice	Non-idempotent `POST` retried without an idempotency key	Generate a stable idempotency key per user action (UUID stored client-side); reuse it on every retry
Client immediately banned after hitting 429	`Retry-After` header ignored; client retried immediately	Read `Retry-After`; treat it as the floor for the next wait, overriding the computed backoff if shorter
Retry logic runs forever, blocking a thread	No `maxAttempts` cap or no `maxDelay` cap	Always set both; surface the error to the caller after exhausting attempts
"Infinite retry" — service restarts but retries resume from attempt 0	Attempt counter is in-memory, lost on restart	For long-running retries, persist the attempt count (e.g., in a job queue); use a dead-letter queue after N attempts

Retry-config review checklist:

Is the retryable status-code set explicit and correct? Confirm 4xx codes (except 408/429) are excluded.
Is every retried endpoint idempotent — either by HTTP semantics (GET/PUT/DELETE) or by an idempotency-key header?
Is full jitter applied? Run the backoff formula 10 times and verify the outputs are not identical.
Is there a maxAttempts cap (≤5 for most APIs) and a maxDelay cap (≤60 s)?
Does the code read Retry-After from 429/503 responses and honor it as a floor?
Is there a fleet-level retry budget (reject retries if >X% of outgoing traffic is already retries)?
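Several items on the checklist above are mechanical enough to lint. The following is a minimal sketch of such a check; the `RetryConfig` shape and `reviewRetryConfig` function are hypothetical, and the thresholds (5 attempts, 60 s) are the ones the checklist suggests.

```typescript
// Hypothetical config shape, for illustration only.
interface RetryConfig {
  retryableStatuses: number[];
  maxAttempts: number;
  maxDelayMs: number;
  fullJitter: boolean;
  honorsRetryAfter: boolean;
}

// Returns a list of human-readable findings; an empty list means no issues found.
function reviewRetryConfig(config: RetryConfig): string[] {
  const issues: string[] = [];
  const allowed4xx = new Set([408, 429]);
  for (const status of config.retryableStatuses) {
    if (status >= 400 && status < 500 && !allowed4xx.has(status)) {
      issues.push(`status ${status} is a non-retryable 4xx`);
    }
  }
  if (!Number.isFinite(config.maxAttempts) || config.maxAttempts > 5) {
    issues.push("maxAttempts should be at most 5");
  }
  if (!Number.isFinite(config.maxDelayMs) || config.maxDelayMs > 60_000) {
    issues.push("maxDelayMs should be at most 60 s");
  }
  if (!config.fullJitter) issues.push("full jitter is not enabled");
  if (!config.honorsRetryAfter) issues.push("Retry-After is not honored");
  return issues;
}

const findings = reviewRetryConfig({
  retryableStatuses: [404, 429, 503],
  maxAttempts: 10,
  maxDelayMs: 30_000,
  fullJitter: true,
  honorsRetryAfter: true,
});
console.log(findings.length); // 2
```

Idempotency and the fleet-level budget still need a human review, since they depend on how each endpoint behaves and not on the client configuration alone.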

By the numbers

Scenario: a payment microservice calls an external processor at a baseline of 500 req/s. Each call is configured with up to 3 retries (max 4 attempts total). During a 10-second partial outage the processor returns 503 on every attempt.

Backoff schedule: delay_n = min(cap, base · 2ⁿ)

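Here n counts retries starting at 1, so the first wait has a ceiling of 100 × 2¹ = 200 ms (the worked example's code starts its exponent at 0 instead). With base = 100 ms and cap = 2 000 ms (2 s), the per-attempt ceiling and expected wait under full jitter (random(0, delay_n), so expected = delay_n / 2) are: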

Attempt n	Uncapped (ms)	delay_n = min(cap, base·2ⁿ) (ms)	Full jitter: expected wait (ms)	Result (all 503)
0 — original	—	—	0 (immediate)	503 → retry
1 — retry 1	200	200	100	503 → retry
2 — retry 2	400	400	200	503 → retry
3 — retry 3	800	800	400	503 → give up

Expected elapsed per call: 0 + 100 + 200 + 400 = 700 ms average before the final failure surfaces. (Without jitter, the fixed sequence is 0 + 200 + 400 + 800 = 1 400 ms but all clients synchronize — the jittered version is faster on average and spreads load.)

If more attempts were allowed, the schedule would plateau once the cap is hit. With base = 100 ms and cap = 2 s, delay_n = min(2000, 100 × 2ⁿ): n = 4 gives 1 600 ms, still under the cap, while n = 5 gives 3 200 ms and is capped. The crossover is at n = log₂(2000/100) = log₂(20) ≈ 4.3, so from attempt 5 onwards delay_n = 2 000 ms and the expected wait is 1 000 ms. See: AWS Builders’ Library, "Timeouts, retries, and backoff with jitter".
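The schedule figures in this section can be recomputed directly. The sketch below assumes the section's indexing (retries numbered from n = 1) and its parameters of base = 100 ms and cap = 2 000 ms; the variable names are placeholders.

```typescript
const scheduleBaseMs = 100;
const scheduleCapMs = 2_000;
const delayMs = (n: number): number => Math.min(scheduleCapMs, scheduleBaseMs * 2 ** n);

const retryNumbers = [1, 2, 3];
const fixedTotalMs = retryNumbers.reduce((sum, n) => sum + delayMs(n), 0);
const expectedJitteredTotalMs = fixedTotalMs / 2; // mean of uniform(0, d) is d / 2

console.log(fixedTotalMs);            // 1400
console.log(expectedJitteredTotalMs); // 700
console.log(delayMs(4), delayMs(5));  // 1600 2000
```

The last line shows where the plateau starts: n = 4 is still uncapped at 1 600 ms, and n = 5 is the first delay held at the 2 000 ms cap.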

Retry amplification: worst-case traffic multiplier

When the processor is overloaded, every request fails and every client retries up to 3 times. The outgoing request rate seen by the processor becomes:

amplified_QPS = baseline_QPS × (1 + max_retries)
              = 500 × (1 + 3)
              = 2 000 req/s
# That is 4× the original load, hitting a processor that is already failing.

The table below shows how the multiplier scales with retry count and baseline load. The processor was already struggling at 500 req/s; it now receives 2 000 req/s, ensuring the outage extends far longer than the original trigger:

Baseline QPS	Max retries	Multiplier (1 + retries)	Amplified QPS	Effect
500	1	2×	1 000	Manageable surge
500	3	4×	2 000	Processor overwhelmed
500	9	10×	5 000	Catastrophic; deepens outage
1 000	3	4×	4 000	Cascading failure territory

This is why aggressive retry counts turn a brief blip into a prolonged outage: each failing request spawns N clones, all hitting the same struggling system simultaneously.
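The worst-case multiplier in the table can be reproduced with a single function. This is a minimal sketch; `amplifiedQps` is a hypothetical name and the scenarios are the four rows of the table.

```typescript
// Worst case: every request fails and is retried max_retries times.
function amplifiedQps(baselineQps: number, maxRetries: number): number {
  return baselineQps * (1 + maxRetries);
}

const scenarios: Array<[number, number]> = [
  [500, 1],
  [500, 3],
  [500, 9],
  [1000, 3],
];
for (const [qps, maxRetries] of scenarios) {
  console.log(`${qps} req/s, ${maxRetries} retries -> ${amplifiedQps(qps, maxRetries)} req/s`);
}
// 500 req/s, 1 retries -> 1000 req/s
// 500 req/s, 3 retries -> 2000 req/s
// 500 req/s, 9 retries -> 5000 req/s
// 1000 req/s, 3 retries -> 4000 req/s
```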

Decision math: retry budget — keeping the multiplier ≤ X

A retry budget caps the fraction of outgoing traffic that may be retries at any instant. If you want to limit the amplification multiplier to at most 1.10× (i.e. retries add no more than 10% overhead):

budget_fraction = (multiplier_target - 1) / multiplier_target
                = (1.10 - 1) / 1.10
                = 9.1%
# At most 9.1% of outgoing calls may be retries at any given moment.
# If retries / total_outgoing > 9.1%, stop retrying and fail fast (or open the circuit breaker).
# Example: 500 req/s baseline → 550 req/s total outgoing, so at most ~50 retry req/s across the fleet.

Equivalently, for a target multiplier M, the maximum retry fraction is (M - 1) / M. For M = 1.5 that is 33%; for M = 2 it is 50%; for M = 4 it is 75% — already a red flag. Pair this with a circuit breaker (Lesson rel-06) so the breaker opens before the budget is exhausted.
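The budget arithmetic can be sketched as two helpers. The names `maxRetryFraction` and `allowedRetryQps` are hypothetical; the second one uses the fact that a multiplier M on a baseline B means B × M total outgoing calls, of which B × (M − 1) are retries.

```typescript
// Largest share of outgoing traffic that may be retries for a target multiplier M.
function maxRetryFraction(targetMultiplier: number): number {
  if (targetMultiplier < 1) throw new RangeError("multiplier must be at least 1");
  return (targetMultiplier - 1) / targetMultiplier;
}

// Retry requests per second allowed on top of a given baseline.
function allowedRetryQps(baselineQps: number, targetMultiplier: number): number {
  return baselineQps * (targetMultiplier - 1);
}

console.log((maxRetryFraction(1.1) * 100).toFixed(1)); // 9.1
console.log(allowedRetryQps(500, 1.1).toFixed(0));     // 50
console.log(maxRetryFraction(4));                      // 0.75
```

At a 1.10× target the fleet sends 550 req/s in total, and 50 of those (9.1% of 550) may be retries.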

🧠 Quick check

1. A client receives HTTP 404 Not Found. What should it do?

Don't retry. 404 is a permanent, client-side error: the server understood the request and the resource simply isn't there. Retrying won't create the resource. Fix the URL or handle the missing resource in application logic.

2. Why is "full jitter" (random value in [0, ceiling]) preferred over retrying at the exact backoff interval?

Full jitter does little for one client in isolation; it mainly makes that client's wait unpredictable. Its value is collective: when many clients all fail at the same instant, jitter stops them from all retrying at the same instant, distributing load across the window instead of creating a synchronized spike.

3. A server returns HTTP 429 with header Retry-After: 45. The backoff formula computes a 3-second delay. How long should the client wait?

At least 45 seconds. The server is the authority on its own rate limits. Retry-After communicates when the server will accept another request. Ignoring it and retrying sooner achieves nothing and may escalate the block. Always treat Retry-After as a floor, not a suggestion.

4. Which condition makes retrying a POST /transfers call safe?

Only when the request carries an idempotency key that the server deduplicates on. A network timeout doesn't mean the server didn't receive the request; it may have processed it and failed sending the response. The only safe retry of a state-changing operation is when the server deduplicates using an idempotency key, so a duplicate request is a no-op.

✍️ Exercise: design a retry policy for a payment service

You're building a microservice that calls an external payment processor. The processor can return 500, 503, 429, 400, and 402 Payment Required. Design the retry policy: which statuses retry, what are the backoff parameters, and what safeguards prevent a fleet of 200 service instances from amplifying an outage?

Model answer:

Retryable: 500, 503, 429 (transient server faults and rate limiting). Not retryable: 400 (bad request; fix the payload), 402 (insufficient funds; retrying won't conjure money).
Idempotency key: generate a UUID per payment attempt, send as Idempotency-Key: <uuid>. Reuse the same key on retries so the processor deduplicates.
Backoff: base 500 ms, full jitter, cap 20 s. Max 4 attempts (1 original + 3 retries).
Retry-After: if 429 or 503 includes the header, sleep for max(Retry-After, jitteredDelay).
Fleet protection: configure a retry budget (e.g., no more than 10% of outgoing payment calls may be retries at any moment). If the budget is exhausted, fail fast and let the circuit breaker handle the rest.

Rubric: ✓ correctly excludes 400/402 ✓ idempotency key with reuse on retry ✓ full jitter mentioned ✓ Retry-After honored ✓ fleet-level safeguard (budget or circuit breaker) addressed. Four out of five = solid; five = excellent.

Key takeaways

Retry 5xx and transient errors; never retry 4xx (except 408/429) — those are the client's fault.
Idempotency is a prerequisite for retrying any state-changing operation; use idempotency keys.
Exponential backoff gives a recovering server breathing room; full jitter prevents the thundering herd.
Always set a max-attempts cap and a max-delay cap; unbounded retries are production bugs.
Always honor Retry-After; ignoring it is the fastest path to a permanent ban.
Retry budgets protect the fleet: if retries dominate traffic, you're amplifying the outage, not recovering from it.