API Design

Reliability & Scale · Lesson 05

Retries & Exponential Backoff

A failed request is not always a dead end — but blindly retrying is like jabbing an elevator button fifty times expecting a different result. The difference between a resilient system and a cascading disaster is knowing when to retry, how long to wait, and when to give up entirely.

⏱ 12 min Difficulty: core Prereq: Idempotency (rel-02), API Gateway (rel-04)

By the end you'll be able to

Why retries exist: the transient-failure problem

Networks are physical. Packets traverse routers, cross undersea cables, and land on servers that share CPUs with hundreds of other tenants. Occasionally a packet gets dropped, a TCP connection times out, or a server momentarily runs out of file descriptors. These faults are transient — they go away on their own within milliseconds to a few seconds. If you waited half a second and tried again, the request would have succeeded.

The analogy: imagine calling a friend. You hear three rings and then silence — not voicemail, just silence. That's a network glitch, not your friend refusing to talk to you. You redial. You don't interpret it as a permanent rejection and delete their number.

Retrying transient failures is one of the cheapest reliability gains available to a client. But the logic must be precise, because retrying the wrong thing at the wrong time destroys the system you are trying to protect.

The retry decision tree: what is safe?

The first gate is the HTTP status code.

| Status | Meaning | Retry? | Reason |
| --- | --- | --- | --- |
| 408 | Request Timeout | ✅ Yes | Server never processed it; transient. |
| 429 | Too Many Requests | ✅ Yes, with delay | Rate-limited; honor Retry-After. |
| 500 | Internal Server Error | ✅ Yes (idempotent ops) | Server-side fault; often transient. |
| 502 | Bad Gateway | ✅ Yes | Upstream not reachable yet. |
| 503 | Service Unavailable | ✅ Yes, with delay | Overload; honor Retry-After. |
| 504 | Gateway Timeout | ✅ Yes | Upstream too slow; may self-heal. |
| 400 | Bad Request | 🚫 Never | Your payload is malformed; retrying is pointless. |
| 401 | Unauthorized | 🚫 Never | Fix your credentials first. |
| 403 | Forbidden | 🚫 Never | Permissions issue; server won't change its mind. |
| 404 | Not Found | 🚫 Never | Resource doesn't exist; retrying wastes bandwidth. |
| 422 | Unprocessable Entity | 🚫 Never | Semantic validation failure; fix the data. |

The pattern: 4xx errors are the client's fault (except 408 and 429). The server understood the request and rejected it. Retrying the exact same bad request is futile. 5xx errors are the server's fault — the server failed to process a valid request, so retrying with an identical request can succeed once the server recovers.
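The table and rule above can be sketched as a small classifier. This is a minimal illustration rather than a definitive policy: the function name `classifyStatus` and its three labels are hypothetical, and treating every unlisted status as non-retryable is an assumption of the sketch.

```javascript
// Hypothetical helper: classify an HTTP status according to the table above.
const RETRY_WITH_DELAY = new Set([429, 503]);   // honor Retry-After before retrying
const RETRY = new Set([408, 500, 502, 504]);    // transient; 500 only for idempotent ops

function classifyStatus(status) {
  if (RETRY_WITH_DELAY.has(status)) return "retry-with-delay";
  if (RETRY.has(status)) return "retry";
  // Everything else: success needs no retry, and other 4xx are permanent client errors.
  // Unlisted statuses default to no retry (an assumption of this sketch).
  return "do-not-retry";
}

console.log(classifyStatus(503)); // retry-with-delay
console.log(classifyStatus(504)); // retry
console.log(classifyStatus(404)); // do-not-retry
```

Keeping the retryable set explicit makes it easy to review: anything not named in it is never retried by accident.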

The idempotency prerequisite

The table above says "✅ Yes (idempotent ops)" for 5xx. That parenthetical is load-bearing. Before retrying, you must know whether the operation is idempotent — doing it twice produces the same result as doing it once. A GET is idempotent. A PUT that replaces a resource is idempotent. A bare POST /orders that creates a new order is not — retrying it could charge a customer twice.

The solution is to make the server deduplicate using an idempotency key sent in a request header. The server stores the key and returns the cached response if it sees the same key again. See Lesson rel-02 (Idempotency) for the full pattern. The rule: if you cannot guarantee idempotency, do not retry.
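A minimal sketch of the server-side deduplication described above, assuming an in-memory `Map` as the key store. The names `handleOnce`, `seen`, and `create` are hypothetical, and the sketch ignores concurrent in-flight duplicates and key expiry, which a real service would need to handle.

```javascript
// Hypothetical in-memory store: idempotency key -> cached response.
// A real deployment would use a shared store so every instance sees the same keys.
const seen = new Map();

function handleOnce(idempotencyKey, createOrder) {
  if (seen.has(idempotencyKey)) {
    return seen.get(idempotencyKey); // replay: return cached response, no side effect
  }
  const response = createOrder();    // the side effect runs once per key
  seen.set(idempotencyKey, response);
  return response;
}

let ordersCreated = 0;
const create = () => ({ status: 201, orderId: ++ordersCreated });

const first = handleOnce("key-123", create);
const retry = handleOnce("key-123", create); // client retry carrying the same key
console.log(first.orderId === retry.orderId, ordersCreated); // true 1
```

The client's only obligation is to generate the key once per user action and send the same key on every retry of that action.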

Exponential backoff: waiting smarter

Once you've confirmed a retry is safe, the next question is when. Retrying immediately puts the same load back on a server that just failed. You need to wait. But how long?

Exponential backoff doubles the wait on each attempt: with a 1 s base, the waits are 1 s, 2 s, 4 s, 8 s, and so on.

The exponential growth gives a temporarily overloaded server room to breathe. But there is a hidden danger: if thousands of clients all failed at the same instant (say, during a brief hiccup), they will all wait the same amount and then all retry at exactly the same instant. This is the thundering herd problem, and it can turn a 2-second blip into a 20-minute outage.

Jitter: breaking the thundering herd

The fix is jitter — random noise added to the backoff. Instead of waiting exactly 4 seconds on attempt 3, each client waits a random value drawn uniformly from the range [0, 4 s]. The clients spread themselves across a 4-second window instead of spiking simultaneously. The server sees a smooth drizzle of requests instead of a hammer blow.

AWS's Builders' Library calls this "full jitter" and recommends it over "equal jitter" (which randomizes only half of the interval) for most workloads, because it maximizes the spread under load.
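To make the difference between the two strategies concrete, here is a hedged sketch of both. The function names and the injectable `rand` parameter are assumptions added for illustration; attempts are numbered from 0, so attempt 2 with a 1 s base has a 4 s ceiling.

```javascript
// Shared ceiling: capped exponential growth.
function ceilingFor(attempt, baseMs, capMs) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Full jitter: uniform over the whole window [0, ceiling).
function fullJitter(attempt, baseMs, capMs, rand = Math.random) {
  return rand() * ceilingFor(attempt, baseMs, capMs);
}

// Equal jitter: keep half the ceiling fixed, randomize the other half,
// so the wait falls in [ceiling / 2, ceiling).
function equalJitter(attempt, baseMs, capMs, rand = Math.random) {
  const half = ceilingFor(attempt, baseMs, capMs) / 2;
  return half + rand() * half;
}

// With a fixed draw of 0.5 the two strategies are easy to compare:
console.log(fullJitter(2, 1000, 30_000, () => 0.5));  // 2000
console.log(equalJitter(2, 1000, 30_000, () => 0.5)); // 3000
```

Equal jitter never waits less than half the ceiling, so clients occupy only the upper half of the window; full jitter uses all of it.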

[Figure: three panels on a 0 s to 8 s timeline. (1) Without jitter: all N clients retry together at t = 1, 2, 4, 8 s, producing ever bigger spikes (thundering herd). (2) With full jitter: clients spread across [0, max_delay], so the server sees a manageable drizzle, not a wave. (3) Retry storm: N real requests fail during the outage, come back as N×3 requests, and while the server is still failing grow to N×9 requests, extending the outage.]
Top: without jitter, retries arrive as synchronized spikes that overwhelm a recovering server. Middle: jitter spreads load smoothly. Bottom: retry storms, where each wave of retries generates more failures and more retries, can extend an outage far beyond its original cause.

Retry caps and budgets

Exponential backoff must have a maximum delay cap (e.g., 30 s) and a maximum attempt count. Without the cap, a client waiting 2²⁰ seconds (~12 days) is just a broken client. Without a max count, a client that never gives up ties up a thread, a connection, and potentially memory, indefinitely.
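The arithmetic behind the uncapped figure can be checked in a few lines. The 1 s base and the 0-indexed attempt numbering are assumptions of this sketch; the paragraph above states only the 30 s cap.

```javascript
// Assumed parameters: 1 s base, 30 s cap, attempts numbered from 0.
const baseSec = 1;
const capSec = 30;

// Uncapped delay after 20 doublings.
const uncappedSec = baseSec * 2 ** 20;          // 1 048 576 s
const uncappedDays = uncappedSec / 86_400;      // 86 400 seconds per day
console.log(uncappedDays.toFixed(1));           // prints 12.1

// The same attempt with the cap applied.
console.log(Math.min(capSec, uncappedSec));     // prints 30

// First attempt whose uncapped delay exceeds the cap: 2^5 = 32 > 30.
const firstCappedAttempt = Math.ceil(Math.log2(capSec / baseSec));
console.log(firstCappedAttempt);                // prints 5
```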

For services with many concurrent clients, a retry budget adds a second layer: a percentage ceiling on the total fraction of requests that may be retries at any given moment. If more than 10% of your outgoing traffic is retries, something is systemically wrong and further retrying is making it worse. Circuit breakers (see Lesson rel-06) handle this at a higher level.
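One way to picture a retry budget is a pair of counters consulted before every retry. The sketch below is a simplified illustration, not a production design: the class name `RetryBudget` and its methods are hypothetical, and it uses cumulative counters where a real implementation would use a sliding time window.

```javascript
// Hypothetical retry budget: allow a retry only if retries would remain
// at or below a fixed fraction of all outgoing requests.
class RetryBudget {
  constructor(maxRetryFraction = 0.1) {
    this.maxRetryFraction = maxRetryFraction;
    this.total = 0;   // every outgoing request, first attempts and retries alike
    this.retries = 0; // outgoing requests that were retries
  }

  recordFirstAttempt() {
    this.total += 1;
  }

  // Returns true (and records the retry) if it fits the budget.
  // A false result means: do not retry, fail fast instead.
  tryRetry() {
    const fractionIfSent = (this.retries + 1) / (this.total + 1);
    if (fractionIfSent > this.maxRetryFraction) return false;
    this.retries += 1;
    this.total += 1;
    return true;
  }
}

const budget = new RetryBudget(0.1);
for (let i = 0; i < 100; i++) budget.recordFirstAttempt();

let allowed = 0;
for (let i = 0; i < 50; i++) {
  if (budget.tryRetry()) allowed += 1;
}
console.log(allowed); // 11: the 12th retry would push the share above 10%
```

Once the budget is spent, callers surface the failure immediately, which keeps a struggling downstream from being hit with additional retry traffic.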

Honoring Retry-After

When a server returns 429 Too Many Requests or 503 Service Unavailable, it often includes a Retry-After header whose value is either an integer number of seconds or an HTTP-date:

HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719523200

Ignoring Retry-After and retrying immediately is the fastest way to get your IP banned and escalate a rate-limit into a permanent block. Your backoff logic must check for this header and, if present, use its value as the floor for the wait, regardless of what your exponential schedule says.

Worked example: exponential backoff with full jitter

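// retryWithBackoff: capped exponential backoff with full jitter.
// Suitable for idempotent (or idempotency-keyed) HTTP calls only.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry-After is either a number of seconds or an HTTP-date.
// Returns the wait in ms, or 0 if the header is absent or unparseable.
function retryAfterMs(response) {
  const value = response.headers.get("Retry-After");
  if (!value) return 0;
  const seconds = Number(value);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(value);
  return Number.isNaN(date) ? 0 : Math.max(0, date - Date.now());
}

async function retryWithBackoff(request, options = {}) {
  const {
    maxAttempts = 4,       // stop after 4 tries (1 original + 3 retries)
    baseDelay   = 500,     // ms: first backoff window
    maxDelay    = 30_000,  // ms: cap at 30 s regardless of exponent
  } = options;

  const RETRYABLE = new Set([408, 429, 500, 502, 503, 504]);

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Pass a URL, or a Request without a body: a Request body can be sent only once.
    // Network-level failures make fetch() reject; this sketch lets them propagate.
    const response = await fetch(request);

    // Success or permanent failure: return immediately
    if (!RETRYABLE.has(response.status)) return response;

    // Last attempt: don't sleep, just surface the error
    if (attempt === maxAttempts - 1) throw new Error(`Failed after ${maxAttempts} attempts: ${response.status}`);

    // Exponential backoff with full jitter:
    // wait = random(0, min(cap, base * 2^attempt))
    const ceiling = Math.min(maxDelay, baseDelay * 2 ** attempt);
    const jitteredDelay = Math.random() * ceiling;  // uniform in [0, ceiling)

    // Honor Retry-After as a floor: never retry sooner than the server asked
    await sleep(Math.max(jitteredDelay, retryAfterMs(response)));
  }
}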

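Walk through the attempts on a server that returns 503 for about half a second and then recovers, taking the expected wait (half of each ceiling) at every step:

  1. Attempt 0 at t = 0. Gets 503. Ceiling = min(30 000, 500 × 1) = 500 ms. Waits 0–500 ms, ~250 ms on average.
  2. Attempt 1 at t ≈ 250 ms. Gets 503. Ceiling = min(30 000, 500 × 2) = 1 000 ms. Waits 0–1 s, ~500 ms on average.
  3. Attempt 2 at t ≈ 750 ms. Server has recovered. Gets 200. Returns.

Total elapsed: roughly 0.75 s on average and at most 1.5 s. Without retries, the caller would have surfaced an error to the user and abandoned the request. With these defaults the three waits total at most 3.5 s (0.5 + 1 + 2), so an outage that outlasts them exhausts all four attempts and the error is surfaced anyway.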

🎯 Interview angle

"Your service calls a flaky downstream API. How do you make it resilient?" The expected answer hits four notes: (1) classify retryable status codes; (2) require idempotency on retried ops; (3) exponential backoff with full jitter to avoid thundering-herd; (4) cap attempts and honor Retry-After. Mentioning a retry budget or pairing with a circuit breaker (see rel-06) elevates the answer to senior level.

⚠️ Common trap

Retrying a non-idempotent operation. A checkout flow that calls POST /payments without an idempotency key and retries on 500 can charge a customer multiple times. The server may have successfully processed the first request and then failed writing the response. Always send an idempotency key and let the server deduplicate; never rely solely on status codes for payment-style mutations.

Retry storms. A misconfigured fleet of 500 clients, each retrying up to 10 times with no jitter, can multiply traffic by up to 11× (1 original + 10 retries) during the exact window when the downstream is already struggling. This turns a 10-second hiccup into a 5-minute outage. Jitter and retry budgets are not nice-to-haves; they are production safety devices.

✅ Do this, not that

Do: randomize your backoff, cap your delay, cap your attempt count, honor Retry-After, and only retry idempotent (or idempotency-keyed) requests. Don't: retry immediately on failure, retry 4xx errors (except 408/429), or omit a maximum — a retry loop without a ceiling runs forever.

Under the hood: the exact backoff math

The pseudocode in the worked example above uses baseDelay * 2 ** attempt. Let's trace the arithmetic precisely for a baseDelay of 500 ms and a maxDelay cap of 30 000 ms: first the ceilings for attempts 0 through 6 (where the cap first applies), then a concrete sequence of five retries.

Step 1 — uncapped exponential ceiling for each attempt n (0-indexed):

ceiling(n) = base × 2ⁿ

attempt 0:  ceiling = 500 × 2⁰ =    500 ms
attempt 1:  ceiling = 500 × 2¹ =  1 000 ms
attempt 2:  ceiling = 500 × 2² =  2 000 ms
attempt 3:  ceiling = 500 × 2³ =  4 000 ms
attempt 4:  ceiling = 500 × 2⁴ =  8 000 ms
attempt 5:  ceiling = 500 × 2⁵ = 16 000 ms
attempt 6:  ceiling = 500 × 2⁶ = 32 000 ms  → capped at 30 000 ms

Step 2 — apply the cap: capped_ceiling = min(maxDelay, base × 2ⁿ)

Step 3 — full jitter: draw a uniform random number in [0, capped_ceiling):

wait(n) = random_uniform(0, min(maxDelay, base × 2ⁿ))

A concrete five-attempt sequence with illustrative random draws:

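| Attempt | Uncapped ceiling (ms) | Capped ceiling (ms) | Jitter draw (ms) | Actual wait (ms) |
| --- | --- | --- | --- | --- |
| 0 (original) | | | | 0 (immediate) |
| 1 (retry 1) | 500 | 500 | 0.74 × 500 | 370 |
| 2 (retry 2) | 1 000 | 1 000 | 0.22 × 1 000 | 220 |
| 3 (retry 3) | 2 000 | 2 000 | 0.88 × 2 000 | 1 760 |
| 4 (retry 4) | 4 000 | 4 000 | 0.41 × 4 000 | 1 640 |
| 5 (retry 5) | 8 000 | 8 000 | 0.06 × 8 000 | 480 |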

Total elapsed above: ~4.5 s for 5 retries. Notice that some retries happen faster than a naive schedule would dictate (retry 5 = 480 ms) — this is intentional: individual clients may recover quickly while the population spreads out. Without jitter, every client would see exactly {0, 500, 1000, 2000, 4000, 8000} ms — perfectly synchronized spikes.

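Why jitter prevents synchronized retry storms. Consider 1 000 clients all failing at time T=0 with a 500 ms first ceiling. Without jitter, all 1 000 retry at exactly T=500 ms and the server takes the whole load in one instant. With full jitter, the same 1 000 retries are spread uniformly across [0, 500 ms), about 2 per millisecond.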

The math: with full jitter, the expected wait for a single client on attempt n is min(maxDelay, base × 2ⁿ) / 2, exactly half the ceiling. That is shorter than the fixed no-jitter wait, so an individual client retries sooner on average. The gain is at the population level: retries arrive spread across the window instead of as a single spike, which is what matters for a system under stress.
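The population-level effect can be seen in a rough simulation of the first retry. This is an illustrative sketch under stated assumptions: 1 000 clients, a 500 ms ceiling, and 10 ms buckets standing in for "the same instant". The helper name `peakPerBucket` is hypothetical.

```javascript
// Count how many retries land in the busiest time bucket.
function peakPerBucket(delaysMs, bucketMs = 10) {
  const counts = new Map();
  for (const delay of delaysMs) {
    const bucket = Math.floor(delay / bucketMs);
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return Math.max(...counts.values());
}

const clients = 1000;
const ceilingMs = 500;

// No jitter: every client waits exactly the ceiling.
const fixed = Array.from({ length: clients }, () => ceilingMs);
// Full jitter: every client draws uniformly from [0, ceiling).
const jittered = Array.from({ length: clients }, () => Math.random() * ceilingMs);

console.log(peakPerBucket(fixed));    // 1000: all clients land in one bucket
console.log(peakPerBucket(jittered)); // varies per run; the average bucket holds 20
```

With 50 buckets across the window, the jittered peak stays close to the average of 20 per bucket, a small fraction of the 1 000-request spike.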

⚠️ "Decorrelated jitter" vs. "full jitter"

Some implementations use decorrelated jitter: sleep = min(cap, random(base, prev_sleep × 3)). This produces a sequence uncorrelated across attempts (no client waits exactly the same times twice), which is good for certain distributed scenarios. AWS's Builders' Library explicitly prefers full jitter (random(0, min(cap, base × 2ⁿ))) for most API clients because it is simpler to reason about, produces a known average, and achieves equivalent spread. Use full jitter unless you have a specific reason for decorrelated.
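For comparison, a minimal sketch of decorrelated jitter. The function name and the injectable `rand` parameter are hypothetical; the `min(cap, ...)` is applied here on the assumption that the same delay cap used throughout this lesson should also bound this variant.

```javascript
// Decorrelated jitter: the next wait is drawn from [base, 3 × previous wait),
// then capped. It depends on the previous wait, not on the attempt number.
function decorrelatedJitter(prevSleepMs, baseMs, capMs, rand = Math.random) {
  const upper = prevSleepMs * 3;
  return Math.min(capMs, baseMs + rand() * (upper - baseMs));
}

// Each client carries its previous wait forward from one retry to the next.
let sleepMs = 500; // start at the base delay
for (let retry = 1; retry <= 5; retry++) {
  sleepMs = decorrelatedJitter(sleepMs, 500, 30_000);
  console.log(`retry ${retry}: wait ${Math.round(sleepMs)} ms`);
}
```

Unlike full jitter, the client has to remember its previous wait, which is part of why the stateless formula is simpler to reason about.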

How to debug & inspect it

A retry storm looks deceptively like a sudden traffic surge. The key diagnostic is the request multiplier: retries make your outgoing request volume larger than your incoming request volume. If you're receiving 100 RPS from users but sending 300 RPS to the downstream, you have a 3× multiplier — a strong indicator of aggressive retrying.

# Spot a retry storm: compare incoming vs. outgoing request rate
$ curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total%7Bdirection%3D%22outgoing%22%7D%5B1m%5D)' | jq '.data.result[0].value[1]'
"312.4"
$ curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total%7Bdirection%3D%22incoming%22%7D%5B1m%5D)' | jq '.data.result[0].value[1]'
"98.2"
# Multiplier = 312 / 98 ≈ 3.2× → clients retrying ~2 times per original request
# Healthy multiplier is ≈1.0–1.1 (almost no retries)

# Confirm only-retry-idempotent: inspect your retry config
$ grep -n 'RETRYABLE\|retryable\|retry_on\|retry_status' src/http_client.ts
42: const RETRYABLE = new Set([408, 429, 500, 502, 503, 504]);
# Verify: 400, 401, 403, 404, 422 are NOT in the set; if they are, remove them
# Also check: POST/PATCH routes: are idempotency keys sent on all state-changing calls?

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Outgoing RPS ≫ incoming RPS (multiplier >1.5×) | Retry storm: too many retries per failure, no jitter, or no backoff cap | Add full jitter; reduce maxAttempts; add a retry budget (max 10% of traffic may be retries) |
| Downstream sees a traffic spike exactly N seconds after an outage starts | No jitter: all clients retry at the same exponential interval | Add random(0, ceiling) jitter; verify the jitter is applied before the sleep, not after |
| Customer charged twice / order created twice | Non-idempotent POST retried without an idempotency key | Generate a stable idempotency key per user action (UUID stored client-side); reuse it on every retry |
| Client immediately banned after hitting 429 | Retry-After header ignored; client retried immediately | Read Retry-After; treat it as the floor for the next wait, overriding the computed backoff when that is shorter |
| Retry logic runs forever, blocking a thread | No maxAttempts cap or no maxDelay cap | Always set both; surface the error to the caller after exhausting attempts |
| "Infinite retry": service restarts but retries resume from attempt 0 | Attempt counter is in-memory, lost on restart | For long-running retries, persist the attempt count (e.g., in a job queue); use a dead-letter queue after N attempts |

Retry-config review checklist:

  1. Is the retryable status-code set explicit and correct? Confirm 4xx codes (except 408/429) are excluded.
  2. Is every retried endpoint idempotent — either by HTTP semantics (GET/PUT/DELETE) or by an idempotency-key header?
  3. Is full jitter applied? Run the backoff formula 10 times and verify the outputs are not identical.
  4. Is there a maxAttempts cap (≤5 for most APIs) and a maxDelay cap (≤60 s)?
  5. Does the code read Retry-After from 429/503 responses and honor it as a floor?
  6. Is there a fleet-level retry budget (reject retries if >X% of outgoing traffic is already retries)?
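Checklist item 3 can be automated with a few lines. The sketch below assumes the full-jitter formula from the worked example; the function name `fullJitterDelay` and its default parameters are illustrative.

```javascript
// Hypothetical self-check: draw the same attempt's delay several times
// and confirm the values are not all identical.
function fullJitterDelay(attempt, baseMs = 500, capMs = 30_000) {
  return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
}

const draws = Array.from({ length: 10 }, () => fullJitterDelay(3));
const distinct = new Set(draws).size;

console.log(draws.map((ms) => Math.round(ms)));
console.log(distinct > 1 ? "jitter applied" : "no jitter: all delays identical");
```

If the ten values come out identical, the jitter is missing or is being computed once and reused across attempts.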

By the numbers

Scenario: a payment microservice calls an external processor at a baseline of 500 req/s. Each call is configured with up to 3 retries (max 4 attempts total). During a 10-second partial outage the processor returns 503 on every attempt.

Backoff schedule: delay_n = min(cap, base · 2ⁿ)

With base = 100 ms and cap = 2 000 ms (2 s), the per-attempt ceiling and expected wait under full jitter (random(0, delay_n), so expected = delay_n / 2) are:

| Attempt n | Uncapped (ms) | delay_n = min(cap, base·2ⁿ) (ms) | Full jitter: expected wait (ms) | Result (all 503) |
| --- | --- | --- | --- | --- |
| 0 (original) | | | 0 (immediate) | 503 → retry |
| 1 (retry 1) | 200 | 200 | 100 | 503 → retry |
| 2 (retry 2) | 400 | 400 | 200 | 503 → retry |
| 3 (retry 3) | 800 | 800 | 400 | 503 → give up |

Expected elapsed per call: 0 + 100 + 200 + 400 = 700 ms average before the final failure surfaces. (Without jitter, the fixed sequence is 0 + 200 + 400 + 800 = 1 400 ms but all clients synchronize — the jittered version is faster on average and spreads load.)

With more attempts the cap eventually applies (base = 100 ms, cap = 2 s): n = 4 still gives 100 × 2⁴ = 1 600 ms, but attempts n ≥ 5 all have delay_n = 2 000 ms and expected wait = 1 000 ms, so the schedule plateaus. The crossover is where 100 × 2ⁿ = 2 000, i.e. n = log₂(2000/100) = log₂(20) ≈ 4.3, so the cap holds from attempt 5 onwards. See: AWS Builders' Library, "Timeouts, retries, and backoff with jitter".
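The schedule and the point where it plateaus can be reproduced with a short script. It uses the scenario's base = 100 ms and cap = 2 000 ms; the variable and function names are illustrative.

```javascript
const baseMs = 100;
const capMs = 2000;

// delay_n = min(cap, base · 2^n); the expected wait under full jitter is half of it.
const delay = (n) => Math.min(capMs, baseMs * 2 ** n);

for (let n = 1; n <= 6; n++) {
  console.log(n, delay(n), delay(n) / 2);
}
// 1 200 100
// 2 400 200
// 3 800 400
// 4 1600 800
// 5 2000 1000
// 6 2000 1000

// Expected elapsed time across the three retries in the scenario.
const expectedElapsedMs = [1, 2, 3].reduce((sum, n) => sum + delay(n) / 2, 0);
console.log(expectedElapsedMs); // 700

// First attempt at which the cap applies: ceil(log2(cap / base)).
const firstCapped = Math.ceil(Math.log2(capMs / baseMs));
console.log(firstCapped); // 5
```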

Retry amplification: worst-case traffic multiplier

When the processor is overloaded, every request fails and every client retries up to 3 times. The outgoing request rate seen by the processor becomes:

amplified_QPS = baseline_QPS × (1 + max_retries) = 500 × (1 + 3) = 2 000 req/s
# That is 4× the original load, hitting the processor that is already failing.

The table below shows how the multiplier scales with retry count and baseline load. The processor was already struggling at 500 req/s; it now receives 2 000 req/s, ensuring the outage extends far longer than the original trigger:

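| Baseline QPS | Max retries | Multiplier (1 + retries) | Amplified QPS | Effect |
| --- | --- | --- | --- | --- |
| 500 | 1 | 2× | 1 000 | Manageable surge |
| 500 | 3 | 4× | 2 000 | Processor overwhelmed |
| 500 | 9 | 10× | 5 000 | Catastrophic; deepens outage |
| 1 000 | 3 | 4× | 4 000 | Cascading failure territory |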

This is why aggressive retry counts turn a brief blip into a prolonged outage: each failing request spawns N clones, all hitting the same struggling system simultaneously.
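The worst-case multiplier is simple enough to compute for your own configuration. A small sketch, with the function name `amplifiedQps` chosen for illustration, that reproduces the four scenarios above:

```javascript
// Worst case: every request fails and every client uses all of its retries.
function amplifiedQps(baselineQps, maxRetries) {
  return baselineQps * (1 + maxRetries);
}

const scenarios = [[500, 1], [500, 3], [500, 9], [1000, 3]];
for (const [baseline, retries] of scenarios) {
  console.log(baseline, retries, `${1 + retries}x`, amplifiedQps(baseline, retries));
}
// 500 1 2x 1000
// 500 3 4x 2000
// 500 9 10x 5000
// 1000 3 4x 4000
```

This is an upper bound: backoff delays stretch the retries over time, but the total number of requests sent per original call is still 1 + max_retries while the outage lasts.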

Decision math: retry budget — keeping the multiplier ≤ X

A retry budget caps the fraction of outgoing traffic that may be retries at any instant. If you want to limit the amplification multiplier to at most 1.10× (i.e. retries add no more than 10% overhead):

budget_fraction = (multiplier_target - 1) / multiplier_target = (1.10 - 1) / 1.10 = 9.1%
# At most 9.1% of outgoing calls may be retries at any given moment.
# If retries / total_outgoing > 9.1%, stop retrying and fail fast (or open the circuit breaker).
# Example: 500 req/s baseline → at most 550 req/s outgoing, so allow at most ~50 retry req/s across the fleet.

Equivalently, for a target multiplier M, the maximum retry fraction is (M - 1) / M. For M = 1.5 that is 33%; for M = 2 it is 50%; for M = 4 it is 75% — already a red flag. Pair this with a circuit breaker (Lesson rel-06) so the breaker opens before the budget is exhausted.
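The budget arithmetic can be checked directly. In this sketch the function names are illustrative; note that the retry share is measured against total outgoing traffic (baseline plus retries), so a 1.10× multiplier on a 500 req/s baseline means 550 req/s outgoing, of which 50 req/s are retries.

```javascript
// Maximum share of outgoing traffic that may be retries for a target multiplier M.
const maxRetryFraction = (m) => (m - 1) / m;

// Maximum retry rate for a given baseline: outgoing = baseline × M,
// so retries = baseline × (M - 1).
const maxRetryQps = (baselineQps, m) => baselineQps * (m - 1);

console.log((maxRetryFraction(1.1) * 100).toFixed(1)); // prints 9.1
console.log((maxRetryFraction(1.5) * 100).toFixed(0)); // prints 33
console.log(maxRetryFraction(2));                      // prints 0.5
console.log(maxRetryFraction(4));                      // prints 0.75
console.log(Math.round(maxRetryQps(500, 1.1)));        // prints 50
```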

🧠 Quick check

1. A client receives HTTP 404 Not Found. What should it do?

404 is a permanent, client-side error: the server understood the request and the resource simply isn't there. Retrying won't create the resource. Fix the URL or handle the missing resource in application logic.

2. Why is "full jitter" (random value in [0, ceiling]) preferred over retrying at the exact backoff interval?

Full jitter does not make any single retry more likely to succeed; it only randomizes when that client retries. Its value is collective: when many clients all fail at the same instant, jitter stops them from all retrying at the same instant, distributing load across the window instead of creating a synchronized spike.

3. A server returns HTTP 429 with header Retry-After: 45. The backoff formula computes a 3-second delay. How long should the client wait?

The server is the authority on its own rate limits. Retry-After communicates exactly when the server will accept another request. Ignoring it and retrying sooner achieves nothing and may escalate the block. Always treat Retry-After as a floor, not a suggestion.

4. Which condition makes retrying a POST /transfers call safe?

A network timeout doesn't mean the server didn't receive the request — it may have processed it and failed sending the response. The only safe retry of a state-changing operation is when the server deduplicates using an idempotency key, so a duplicate request is a no-op.

✍️ Exercise: design a retry policy for a payment service

You're building a microservice that calls an external payment processor. The processor can return 500, 503, 429, 400, and 402 Payment Required. Design the retry policy: which statuses retry, what are the backoff parameters, and what safeguards prevent a fleet of 200 service instances from amplifying an outage?

Model answer:

Rubric: ✓ correctly excludes 400/402 ✓ idempotency key with reuse on retry ✓ full jitter mentioned ✓ Retry-After honored ✓ fleet-level safeguard (budget or circuit breaker) addressed. Four out of five = solid; five = excellent.
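One possible shape for such a policy, sketched against the rubric above. Every name and value here is an illustrative assumption rather than a reference answer: the numbers reuse this lesson's defaults, and the delay itself would come from full jitter with Retry-After as a floor, as in the worked example.

```javascript
// Hypothetical policy object for the payment exercise.
const paymentRetryPolicy = {
  retryStatuses: new Set([429, 500, 503]), // 400 and 402 are deliberately absent
  maxAttempts: 4,                          // 1 original + 3 retries
  baseDelayMs: 500,
  maxDelayMs: 30_000,
  retryBudgetFraction: 0.1,                // fleet-level safeguard
};

// Decide whether one more attempt is allowed. `attempt` is 0-indexed and
// `retryFraction` is the current share of outgoing traffic that is retries.
function shouldRetry(policy, { status, attempt, hasIdempotencyKey, retryFraction }) {
  if (!policy.retryStatuses.has(status)) return false;          // permanent failure
  if (!hasIdempotencyKey) return false;                          // payments must deduplicate
  if (attempt >= policy.maxAttempts - 1) return false;           // attempts exhausted
  if (retryFraction > policy.retryBudgetFraction) return false;  // budget spent: fail fast
  return true;
}

const base = { attempt: 0, hasIdempotencyKey: true, retryFraction: 0.02 };
console.log(shouldRetry(paymentRetryPolicy, { ...base, status: 503 }));                     // true
console.log(shouldRetry(paymentRetryPolicy, { ...base, status: 402 }));                     // false
console.log(shouldRetry(paymentRetryPolicy, { ...base, status: 503, retryFraction: 0.5 })); // false
```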

Key takeaways

Sources & further reading