Debugging & Real-World · Lesson 06
Handling 429s & throttling
A 429 Too Many Requests response is not a bug — it's a boundary marker. The API is telling you exactly how much it can absorb. Getting 429s means you've hit that boundary; handling them correctly means you stay on the right side of it permanently.
By the end you'll be able to
- Read the rate-limit headers in a 429 response and calculate when to retry.
- Implement exponential backoff with jitter so retries spread out rather than stampede.
- Apply batching, caching, and client-side limiting to reduce call volume before hitting the server's limit.
Why you're getting 429s
Rate limiting is the API provider's mechanism for distributing capacity fairly across all callers and protecting the service from overload. See Lesson rel-03 for the full taxonomy of rate-limiting algorithms. For debugging purposes, the most important question is: which limit did you hit? APIs commonly enforce limits at multiple granularities simultaneously:
- Per-second or per-minute limit — a burst limit that prevents a single caller from flooding the service in a short window. You're sending too many requests too quickly.
- Per-day or per-month limit — a quota limit based on a billing tier. You've used up your allocation for the current period.
- Per-endpoint limit — some endpoints (e.g., a search endpoint that's expensive to serve) have lower limits than others.
- Per-user or per-key limit — your API key may have a different limit than the global limit; or a specific user's actions may be throttled independently of your overall quota.
The rate-limit headers in the response tell you which limit was hit and how to calculate the wait time.
Reading the rate-limit headers
Your data-sync job runs every hour and fetches updated records from an external CRM API. Starting at 03:00, the job begins hitting 429s after about 90 requests. The errors stop at 03:01. The pattern repeats every hour.
The first step is to read the response headers from the 429:
| Header | What it means | How to use it |
|---|---|---|
| X-RateLimit-Limit | The total number of requests allowed in the window (here: 100 per minute) | Your upper bound; never send more than this in one window |
| X-RateLimit-Remaining | Requests remaining in the current window | Slow down proactively when this gets low; don't wait for 0 |
| X-RateLimit-Reset | Unix timestamp when the current window resets (here: 47 seconds from now) | The earliest safe time to resume at full rate |
| Retry-After | Seconds to wait before retrying (or an HTTP date). Some APIs send only this; some send both | Wait at least this long before retrying; prefer Retry-After over a calculated wait if both are present |
In the scenario above: 100 requests per minute, you're sending ~90 per minute from the sync job. Why are you hitting the limit? The job sends all 90 requests in the first 5 seconds of each minute, not spread over 60 seconds. From the API's perspective, that's 90 requests in 5 seconds — potentially triggering a per-second burst limit even if the per-minute quota wouldn't be exceeded. The fix is not to reduce total volume, but to spread the requests over time.
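To make "spread the requests over time" concrete, here is a minimal sketch that computes evenly spaced send times for the scenario's 90 requests across a 60-second window. The function name `paced_offsets` is hypothetical, and even spacing is only one possible pacing policy.

```python
def paced_offsets(request_count, window_seconds):
    """Return the send time (seconds from window start) for each request,
    spaced evenly across the window instead of bunched at the start."""
    interval = window_seconds / request_count
    return [i * interval for i in range(request_count)]


offsets = paced_offsets(90, 60)
print(len(offsets))           # 90 requests
print(round(offsets[1], 3))   # 0.667 s between consecutive requests
print(round(offsets[-1], 2))  # last request leaves at 59.33 s
```

The total volume is unchanged; only the shape of the traffic differs, which is what a per-second burst limit reacts to.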
Exponential backoff with jitter
When you receive a 429, the worst response is to retry immediately. If multiple instances of your service all hit the rate limit at the same moment and all retry immediately, they'll all hit it again at the same moment — a thundering herd that extends the outage. Exponential backoff spreads retries out; jitter breaks the synchrony between instances.
An immediate retry on a 429 sends another request while the rate-limit window is still active — it will also get a 429. Worse, if you have 10 service instances and all of them immediately retry after receiving a 429, you've sent 10x the load into an already-overloaded window. Always wait at least the Retry-After duration. Backoff is not optional.
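A minimal sketch of a backoff calculation with jitter follows. The function name, the 1-second base, the 60-second cap, and the choice to add up to 50% jitter on top of the wait are illustrative assumptions, not a prescribed formula.

```python
import random


def backoff_delay(attempt, retry_after=0.0, base=1.0, cap=60.0, rng=random):
    """Seconds to sleep before retry number `attempt` (0-based)."""
    exponential = min(cap, base * (2 ** attempt))  # 1, 2, 4, 8, ... capped
    floor = max(retry_after, exponential)          # never below the server's hint
    return floor + rng.uniform(0, 0.5 * floor)     # jitter is added on top


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt, retry_after=5), 2))
```

Because the jitter is added rather than sampled from zero, no instance ever sleeps for less than the Retry-After value, while separate instances still wake at different moments.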
Reduce call volume: batch and cache
Backoff handles the "what to do when you've already hit the limit" case. The better strategy is to not hit the limit in the first place. Two techniques cut call volume significantly:
Batch requests
Many APIs support fetching multiple resources in a single call. Instead of calling GET /users/{id} once per user ID, check whether the API offers a batch endpoint like GET /users?ids=1,2,3,4,5 or POST /batch/users. One batch request that fetches 50 records typically costs 1 rate-limit unit, not 50, though some providers weight calls by size, so confirm this in the provider's documentation.
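As a rough sketch of the client side of batching, the helper below groups IDs into chunks and builds one URL per chunk. The `/users?ids=` shape is the illustrative endpoint from the paragraph above, and the batch size of 50 is an assumption; a real API documents its own batch format and maximum size.

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids`, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def batch_urls(ids, size=50):
    """Build one hypothetical batch URL per chunk of IDs."""
    return [
        "/users?ids=" + ",".join(str(i) for i in chunk)
        for chunk in chunked(ids, size)
    ]


urls = batch_urls(list(range(1, 501)))
print(len(urls))      # 10 requests instead of 500
print(urls[0][:20])   # /users?ids=1,2,3,4,5
```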
Cache responses
If you're calling the same endpoint for the same resource repeatedly, cache the response. Use the Cache-Control and ETag headers from the API response to determine how long the cache is valid and when to revalidate. Even a 60-second in-process cache on a high-frequency operation can reduce call volume by orders of magnitude.
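The in-process cache mentioned above can be sketched as a small time-to-live map. This version uses a fixed TTL for simplicity and does not read Cache-Control or ETag; the class name `TTLCache` and the `fetch` callback are hypothetical.

```python
import time


class TTLCache:
    """Cache values for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]       # fresh hit: no API call
        value = fetch(key)        # miss or expired: one API call
        self._entries[key] = (now + self.ttl, value)
        return value
```

With `TTLCache(60)`, any number of lookups for the same key within a minute cost a single API call. The `clock` parameter exists so the expiry logic can be exercised without real waiting.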
Add a client-side rate limiter
A client-side limiter prevents your code from ever sending more than N requests per second, regardless of how much traffic the rest of your application generates. This is proactive throttling: you enforce the limit yourself before the server has to. It's especially valuable in bulk-processing jobs where you know the total volume will exceed the quota.
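One common way to build such a limiter is a token bucket on the client. The sketch below is single-threaded and illustrative only; the class name and parameters are assumptions, and a multi-threaded job would need a lock around `acquire`.

```python
import time


class TokenBucketLimiter:
    """Allow bursts up to `capacity`, then at most `rate_per_second` calls."""

    def __init__(self, rate_per_second, capacity,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def acquire(self):
        """Block until one token is available, then consume it."""
        self._refill()
        if self.tokens < 1:
            self.sleep((1 - self.tokens) / self.rate)  # wait for the shortfall
            self._refill()
        self.tokens -= 1
```

Calling `limiter.acquire()` immediately before each outgoing request keeps the job under the configured rate no matter how fast the rest of the code produces work.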
When to request a quota increase
Sometimes the right answer is not to optimise your code — it's to ask the provider for a higher quota. Signs you've done the engineering work and genuinely need more:
- You've batched everything batchable.
- You've added caching where the data allows it.
- You've added a client-side limiter and you're still hitting the limit at the rate your business requires.
- The limit is a billing-tier limit, not a technical one.
When you contact the provider, give them specifics: current limit, current usage, required throughput, and what you've already done to reduce volume. Providers are more likely to grant increases to callers who demonstrate they've already optimised.
Putting it together: handling a 429 in production
- Receive the 429. Read Retry-After (preferred) or X-RateLimit-Reset. Calculate the wait time.
- Do not retry immediately. Log the 429 with the wait time, rate-limit headers, and the endpoint that was called. This log is essential for diagnosing which limit was hit.
- Wait with jitter. Sleep for Retry-After + random(0, 0.5 × Retry-After). The jitter prevents a stampede if multiple instances hit the limit at the same time.
- Retry with backoff. If the first retry also returns 429, double the wait. Cap the maximum wait at a sensible ceiling (60–120 s).
- After N retries, surface the error. Don't retry forever — eventually return the 429 error to the caller so they can handle it. Alert if 429s are sustained (more than X per minute for more than Y minutes) — that's a sign your traffic pattern has changed.
- Review the access patterns that generated the 429s. Were they from a bulk job? Add batching and a client-side limiter to that job. Were they from real-time user traffic? Consider caching upstream or request coalescing.
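The steps above can be sketched as a bounded retry loop. Here `send` is a hypothetical callable that performs the request and returns `(status, headers, body)`; the sketch assumes Retry-After arrives in its delay-seconds form, and the retry count and cap are illustrative defaults.

```python
import random
import time


def call_with_retries(send, max_retries=5, cap=60.0,
                      sleep=time.sleep, log=print):
    """Call `send()` until it stops returning 429 or retries run out."""
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break  # give up: surface the 429 to the caller
        # Assumption: Retry-After is in seconds, not an HTTP date
        retry_after = float(headers.get("Retry-After", 0))
        wait = max(retry_after, min(cap, 2 ** attempt))  # doubles per attempt
        wait += random.uniform(0, 0.5 * wait)            # jitter on top
        log(f"429 received: attempt={attempt} wait={wait:.1f}s headers={headers}")
        sleep(wait)
    return 429, None
```

Returning the 429 after the final attempt, rather than raising or looping forever, leaves the decision about what to do next with the caller; alerting on sustained 429s would sit outside this function.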
Rate-limit handling is a favourite system-design follow-up: "Your service is getting 429s from a third-party API. What do you do?" The complete answer covers four things: (1) read the headers to know the limit type; (2) implement backoff with jitter; (3) reduce call volume with batching and caching; (4) add a client-side limiter. Candidates who only say "add a retry with backoff" are giving a partial answer: the retry handles recovery; the other three prevent you from needing to recover. See Lesson rel-03 for the full rate-limiting algorithm discussion.
Many clients, including otherwise well-designed ones, only check rate-limit headers on 429 responses. A better pattern: read X-RateLimit-Remaining on every 200 response and slow down proactively when the remaining count gets low (e.g., below 10% of the limit). This prevents the 429 from happening in the first place and gives you a smoother request pattern.
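One possible way to act on a low remaining count is to spread the requests that are left evenly over the time until the window resets. The sketch below assumes X-RateLimit-Reset is a Unix timestamp; the function name and the even-spreading policy are illustrative choices, not the only reasonable ones.

```python
def proactive_delay(headers, now, threshold=0.10):
    """Seconds to pause before the next request, based on a 200's headers."""
    try:
        limit = int(headers["X-RateLimit-Limit"])
        remaining = int(headers["X-RateLimit-Remaining"])
        reset = float(headers["X-RateLimit-Reset"])
    except (KeyError, ValueError):
        return 0.0  # headers missing or malformed: no basis for slowing down
    if remaining >= limit * threshold:
        return 0.0  # plenty of headroom left
    seconds_left = max(0.0, reset - now)
    return seconds_left / max(remaining, 1)  # spread what is left over the window


headers = {
    "X-RateLimit-Limit": "100",
    "X-RateLimit-Remaining": "3",
    "X-RateLimit-Reset": "1030",
}
print(proactive_delay(headers, now=1000.0))  # 10.0
```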
Under the hood: how it actually works
The 429 headers are not guesses — the server derives them from its internal rate-limiting state. Understanding that derivation tells you exactly what each number means and how a correct client should read and act on it.
How the server computes remaining quota: token bucket and fixed window
Most APIs use one of two algorithms to track quota. The values in the response headers directly reflect the algorithm's state at the moment of your request.
Token bucket. A bucket holds up to capacity tokens. One token is consumed per request. Tokens refill continuously at a rate of capacity / window tokens per second. If the bucket is empty, the request is rejected with 429. This permits short bursts up to capacity while enforcing a steady-state throughput ceiling. The server computes X-RateLimit-Remaining as floor(current_tokens).
Fixed window. A counter resets to 0 at each window boundary (e.g., the start of each minute). Each request increments the counter. When counter == limit, subsequent requests in that window are rejected. X-RateLimit-Remaining = limit - counter. X-RateLimit-Reset is the Unix timestamp of the next window boundary. This is simpler but allows a burst of 2× the limit if requests straddle two windows (the last second of window N + the first second of window N+1).
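The fixed-window boundary burst is easy to reproduce with a small simulation. The class below is a simplified sketch of the counter described above, not any particular provider's implementation; timestamps are plain seconds and windows are aligned to multiples of the window length.

```python
class FixedWindowCounter:
    """Minimal fixed-window rate limiter, as a server might keep per key."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.window_start = None
        self.count = 0

    def allow(self, now):
        """Return (allowed, remaining, reset_timestamp) for a request at `now`."""
        start = now - (now % self.window)  # boundary of the window containing now
        if start != self.window_start:     # new window: counter resets to 0
            self.window_start = start
            self.count = 0
        reset = start + self.window
        if self.count >= self.limit:
            return False, 0, reset         # this request would receive a 429
        self.count += 1
        return True, self.limit - self.count, reset


counter = FixedWindowCounter(limit=100, window_seconds=60)
late = sum(counter.allow(59)[0] for _ in range(100))   # last second of window N
early = sum(counter.allow(60)[0] for _ in range(100))  # first second of window N+1
print(late + early)       # 200 accepted within about one second
print(counter.allow(60))  # (False, 0, 120)
```

The returned tuple mirrors what the server would write into X-RateLimit-Remaining and X-RateLimit-Reset for that request.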
Exact header semantics: where each value comes from
| Header | Algorithm source | What it encodes |
|---|---|---|
| X-RateLimit-Limit | Configured capacity | The maximum tokens (bucket) or requests per window (fixed) for this key/tier |
| X-RateLimit-Remaining | Live counter/bucket level at request time | How many more requests are safe in the current window without a 429 |
| X-RateLimit-Reset | Window boundary timestamp | Unix seconds when the counter resets to 0 (fixed window) or bucket fills to capacity (token bucket) |
| Retry-After | Derived: Reset − now() or explicit | Seconds until the server will accept new requests again; may be an HTTP-date string instead |
On a 429 the server still writes these headers from its internal state — so Remaining will be 0, and Reset/Retry-After tells you exactly how many seconds until the counter resets. On a 200 the same headers reflect remaining capacity after deducting the current request.
How a correct client reads Reset/Retry-After and backs off with jitter
The algorithm a well-behaved client should follow on every response:
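# Sketch: run after every HTTP response (assumes a requests-style response object)
import random
import time
from email.utils import parsedate_to_datetime

def seconds_until_reset(resp):
    # X-RateLimit-Reset is treated as a Unix timestamp; a missing header means no wait
    reset_ts = float(resp.headers.get("X-RateLimit-Reset", 0))
    return max(0.0, reset_ts - time.time())

def handle_response(resp, attempt):
    remaining = int(resp.headers.get("X-RateLimit-Remaining", 999))
    limit = int(resp.headers.get("X-RateLimit-Limit", 999))

    # Proactive slow-down on successful responses (before hitting 0)
    if resp.status_code == 200 and remaining < limit * 0.10:
        time.sleep(seconds_until_reset(resp))  # or reduce the request rate instead

    if resp.status_code != 429:
        return resp  # success path (other errors are left to the caller)

    # 429: compute the mandatory floor from server hints
    retry_after = resp.headers.get("Retry-After")
    if retry_after is None:
        mandatory_floor = seconds_until_reset(resp)
    elif retry_after.strip().isdigit():
        mandatory_floor = float(retry_after)  # delay-seconds form
    else:
        # HTTP-date form, e.g. "Fri, 14 Jun 2024 03:02:00 GMT"
        retry_at = parsedate_to_datetime(retry_after).timestamp()
        mandatory_floor = max(0.0, retry_at - time.time())

    # Exponential backoff: 1, 2, 4, 8... capped at 60 s, and never below the floor
    base_backoff = min(60, 2 ** attempt)
    wait_seconds = max(mandatory_floor, base_backoff)

    # Jitter is added on top so the sleep never undercuts the server's floor
    jitter = random.uniform(0, 0.5 * wait_seconds)
    time.sleep(wait_seconds + jitter)
    return None  # signal caller to retry

The key points: Retry-After (when present) is the server's authoritative floor, so never wait less. Exponential backoff raises the wait above that floor on repeated 429s. Jitter is added on top of the wait rather than sampled from below it, so instances de-synchronise without any of them retrying early.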
Worked trace: requests crossing the limit and client recovery
X-RateLimit-Reset carries a Unix timestamp (seconds since epoch), not a countdown, so you must subtract the current time to get a wait. A common bug is treating the header as seconds-to-wait and sleeping for 1,718,388,060 seconds (roughly 54 years). Always compute wait = max(0, reset_timestamp - time.time()). These X-RateLimit-* headers are not standardised, and some providers do send a seconds-remaining value instead, so confirm the format in the provider's documentation. Also note: some APIs send Retry-After as an HTTP-date string (Fri, 14 Jun 2024 03:02:00 GMT) rather than an integer, so parse accordingly.
- RFC 6585 §4 — 429 Too Many Requests (original specification)
- MDN — Retry-After header
- AWS Builder's Library — Timeouts, retries, and backoff with jitter
🧠 Quick check
1. You receive a 429 with Retry-After: 30. Your retry logic retries immediately because the request is "urgent." What happens?
The rate-limit window resets at the time indicated by Retry-After (30 seconds). An immediate retry is within the same window and will also be rejected. With multiple instances, you've sent N×2 requests into an already-rate-limited window. Always honour the Retry-After delay.
2. Why add jitter to the backoff wait time?
If 10 instances all hit the rate limit at the same time, they'll all sleep for the same calculated duration. When the sleep ends, all 10 fire simultaneously — creating a new burst at the rate-limit boundary. Jitter randomises each instance's wait time so they spread out naturally over the reset window.
3. Your sync job fetches 500 user profiles by calling GET /users/{id} in a loop. The API allows 100 requests per minute. The job takes 5+ minutes and hits 429s. What is the fastest fix that doesn't require contacting the provider?
A batch endpoint is the highest-leverage fix: if the API supports fetching 50 users in one call, you reduce 500 requests to 10, which is well under any per-minute limit. A per-request sleep still uses 500 requests but spreads them; it works but wastes capacity. Running the job from multiple servers doesn't help, because rate limits are typically per-API-key, so servers sharing a key share the same limit.
4. X-RateLimit-Remaining: 3 appears in a 200 response. What should a well-designed client do?
Reading rate-limit headers on successful responses lets you act before hitting the limit rather than after. With only 3 requests remaining, the next burst of traffic will trigger a 429. Throttling proactively — sleeping briefly, queuing requests, or prioritising urgent calls — keeps you under the limit without the disruption of a 429 and retry cycle.
✍️ Exercise: design a rate-limit-aware data pipeline
You're building a nightly data-sync pipeline that fetches ~10,000 records from a third-party CRM API. The CRM limits you to 500 requests per minute and 50,000 requests per day. Each record requires one API call. Design the pipeline to avoid 429s and handle them gracefully if they occur anyway. Consider: batching, scheduling, client-side limiting, backoff, and quota tracking.
Model answer:
- Check for a batch endpoint. If the CRM API supports fetching 50 records per call, 10,000 records = 200 requests — well within both limits. Batching is the single most impactful change. Always check before building any other rate-limit machinery.
- Spread the job over time. Even without batching, 10,000 requests at 500/min takes 20 minutes. Schedule the job to start with enough runway — don't start at 23:40 if it takes 20 minutes; start at 22:00. This also leaves capacity for other processes using the same API key.
- Add a client-side limiter at 450/min (90% of the limit). Leave 10% headroom for other code paths and measurement uncertainty. The limiter ensures the pipeline never sends more than 450 requests per minute, regardless of how fast the job processes records.
- Track daily quota usage. With a 50,000/day limit, the pipeline consumes 10,000 (or 200, if batching). Track total daily usage in a counter (Redis, a database) and alert when usage exceeds 80% of the daily quota. This prevents other processes from exhausting the quota before the nightly job runs.
- Implement backoff for unexpected 429s. Even with proactive limiting, 429s can occur (e.g., the CRM temporarily lowers limits during maintenance). Implement exponential backoff with jitter (base 1 s, max 60 s, respecting Retry-After). Failed records should be queued for retry at the end of the run rather than blocking the main pipeline.
Rubric: ✓ Batching mentioned first ✓ Client-side limiter below 100% of the API limit ✓ Daily quota tracking ✓ Backoff with jitter ✓ Failed-request queue for retries (not blocking the pipeline).
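The arithmetic behind the model answer can be checked with a short planning sketch. The function name `plan` is hypothetical, and the batch size of 50 is the same "if the API supports it" assumption the model answer makes.

```python
import math


def plan(records, per_call, per_minute_limit, daily_quota, headroom_percent=90):
    """Estimate call count, paced rate, run time, and daily quota share."""
    calls = math.ceil(records / per_call)
    paced_rate = per_minute_limit * headroom_percent // 100  # 450 of 500
    return {
        "calls": calls,
        "rate_per_min": paced_rate,
        "minutes": round(calls / paced_rate, 1),
        "daily_quota_used": calls / daily_quota,
    }


print(plan(10_000, 1, 500, 50_000))
# {'calls': 10000, 'rate_per_min': 450, 'minutes': 22.2, 'daily_quota_used': 0.2}
print(plan(10_000, 50, 500, 50_000))
# {'calls': 200, 'rate_per_min': 450, 'minutes': 0.4, 'daily_quota_used': 0.004}
```

Without batching, the run takes about 22 minutes at the paced rate and uses a fifth of the daily quota; with batches of 50 it finishes in under a minute.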
Key takeaways
- Read Retry-After and X-RateLimit-* headers before writing any retry logic: the server is telling you exactly how long to wait.
- Never retry a 429 immediately. Wait at least the Retry-After duration; use exponential backoff + jitter for subsequent retries.
- Jitter breaks the thundering-herd problem: without it, all your instances retry at the same instant and hit the limit again.
- Reduce call volume before the 429 arrives: use batch endpoints, cache responses, and read rate-limit headers on successful responses to slow down proactively.
- A client-side rate limiter is proactive — it enforces the limit yourself, before the server has to.
- If you've done the engineering work and genuinely need more throughput, request a quota increase from the provider with specifics: current limit, usage, and what you've already optimised.