API Design

Debugging & Real-World · Lesson 06

Handling 429s & throttling

A 429 Too Many Requests response is not a bug — it's a boundary marker. The API is telling you exactly how much it can absorb. Getting 429s means you've hit that boundary; handling them correctly means you stay on the right side of it permanently.

⏱ 12 min · Difficulty: core · Prereq: dbg-01, dbg-02, rel-03

By the end you'll be able to

Why you're getting 429s

Rate limiting is the API provider's mechanism for distributing capacity fairly across all callers and protecting the service from overload. See Lesson rel-03 for the full taxonomy of rate-limiting algorithms. For debugging purposes, the most important question is: which limit did you hit? APIs commonly enforce limits at multiple granularities simultaneously:

The rate-limit headers in the response tell you which limit was hit and how to calculate the wait time.

Reading the rate-limit headers

Scenario

Your data-sync job runs every hour and fetches updated records from an external CRM API. Starting at 03:00, the job begins hitting 429s after about 90 requests. The errors stop at 03:01. The pattern repeats every hour.

The first step is to read the response headers from the 429:

HTTP/2 429
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1718388060
Retry-After: 47
Content-Type: application/json

{"error":"rate_limit_exceeded","code":"too_many_requests"}

| Header | What it means | How to use it |
| --- | --- | --- |
| X-RateLimit-Limit | The total number of requests allowed in the window (here: 100 per minute) | Your upper bound: never send more than this in one window |
| X-RateLimit-Remaining | Requests remaining in the current window | Slow down proactively when this gets low; don't wait for 0 |
| X-RateLimit-Reset | Unix timestamp when the current window resets (here: 47 seconds from now) | The earliest safe time to resume at full rate |
| Retry-After | Seconds to wait before retrying (or an HTTP date). Some APIs send only this; some send both | Wait at least this long before retrying; prefer Retry-After over a calculated wait if both are present |

In the scenario above, the limit is 100 requests per minute and the sync job sends only about 90 requests per run. Why is it hitting the limit? The job sends all 90 requests in the first 5 seconds of the run rather than spreading them over 60 seconds. From the API's perspective, that is 90 requests in 5 seconds, which can trigger a per-second burst limit even when the per-minute quota would not be exceeded. The fix is not to reduce total volume, but to spread the requests over time.
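One way to spread a burst over the window is to pause a fixed interval between calls. The sketch below is a minimal illustration under stated assumptions: `paced`, `per_minute`, and the injectable `sleep` parameter are hypothetical names, not part of any CRM client.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

def paced(items: Iterable[T], per_minute: int,
          sleep: Callable[[float], None] = time.sleep) -> Iterator[T]:
    """Yield items no faster than per_minute, evenly spaced."""
    interval = 60.0 / per_minute
    first = True
    for item in items:
        if not first:
            sleep(interval)  # pause before every item except the first
        first = False
        yield item

# 90 record IDs at 90 per minute: one request roughly every 0.67 s
# instead of all 90 in the first 5 seconds. The sleep function is
# replaced with a recorder here so the example runs instantly.
waits = []
ids = list(paced(range(90), per_minute=90, sleep=waits.append))
```

In a real job the default `time.sleep` would be used, and the request itself would be issued inside the loop that consumes `paced(...)`.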

Exponential backoff with jitter

When you receive a 429, the worst response is to retry immediately. If multiple instances of your service all hit the rate limit at the same moment and all retry immediately, they'll all hit it again at the same moment — a thundering herd that extends the outage. Exponential backoff spreads retries out; jitter breaks the synchrony between instances.

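Figure: retry timeline. The first request gets a 429; retries follow after waits of 1 s, 2–4 s, 4–8 s, and 8–16 s (each plus jitter). The first three retries get a 429 and the fourth gets a 200. Wait = min(cap, base × 2ⁿ) + random(0, wait × 0.5). Note that this additive form is not the same as "full jitter", which draws the whole wait from random(0, min(cap, base × 2ⁿ)).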
Each retry waits exponentially longer. Jitter — a random fraction of the wait — spreads retries from different clients so they don't all fire at the same instant when the window resets.
# Python: exponential backoff with jitter for 429s
import random
import time

import requests

def call_with_backoff(method, url, **kwargs):
    base_wait = 1.0    # seconds before first retry
    max_wait = 60.0    # never back off longer than this
    max_retries = 6

    for attempt in range(max_retries + 1):
        resp = requests.request(method, url, **kwargs)
        if resp.status_code != 429:
            return resp    # success (or a non-429 error; don't retry those here)
        if attempt == max_retries:
            return resp    # give up, return the 429 to the caller

        # Respect Retry-After if present (always take the server's word)
        retry_after = resp.headers.get("Retry-After")
        try:
            floor = float(retry_after) if retry_after else None
        except ValueError:
            floor = None   # HTTP-date form; not parsed in this example

        if floor is not None:
            # Retry-After is a floor: never sleep less, so jitter goes on top
            wait = floor + random.uniform(0, 0.5 * floor)
        else:
            # Exponential backoff: 1, 2, 4, 8, 16, 32 seconds
            backoff = min(max_wait, base_wait * (2 ** attempt))
            # Full jitter: randomize between 0 and the calculated backoff.
            # This prevents N clients from all retrying at the same moment.
            wait = random.uniform(0, backoff)
        time.sleep(wait)

# Usage:
resp = call_with_backoff("GET", "https://api.example.com/v1/records")
⚠️ Never retry a 429 immediately

An immediate retry on a 429 sends another request while the rate-limit window is still active — it will also get a 429. Worse, if you have 10 service instances and all of them immediately retry after receiving a 429, you've sent 10x the load into an already-overloaded window. Always wait at least the Retry-After duration. Backoff is not optional.

Reduce call volume: batch and cache

Backoff handles the "what to do when you've already hit the limit" case. The better strategy is to not hit the limit in the first place. Two techniques cut call volume significantly:

Batch requests

Many APIs support fetching multiple resources in a single call. Instead of calling GET /users/{id} once per user ID, check whether the API offers a batch endpoint like GET /users?ids=1,2,3,4,5 or POST /batch/users. On APIs that count requests rather than records, one batch request that fetches 50 records costs 1 rate-limit unit, not 50.

import requests

# Inefficient: one request per ID; burns 50 rate-limit units
for user_id in user_ids[:50]:
    resp = requests.get(f"https://api.example.com/v1/users/{user_id}", headers=auth)
    process(resp.json())

# Efficient: batch endpoint; burns 1 rate-limit unit
ids_param = ",".join(str(i) for i in user_ids[:50])
resp = requests.get(
    "https://api.example.com/v1/users",
    params={"ids": ids_param},
    headers=auth,
)
for user in resp.json()["users"]:
    process(user)
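When there are more IDs than one batch call accepts, the list has to be split first. The helper below is a sketch: the batch size of 50 is an assumption (check the provider's documented maximum), and `chunked` and `to_ids_param` are hypothetical names.

```python
from typing import Iterator, List, Sequence

def chunked(ids: Sequence[int], size: int) -> Iterator[List[int]]:
    """Split ids into consecutive lists of at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])

def to_ids_param(chunk: Sequence[int]) -> str:
    """Build the comma-separated value for an ids query parameter."""
    return ",".join(str(i) for i in chunk)

# 120 IDs at an assumed maximum of 50 per call: 3 requests instead of 120.
batches = list(chunked(list(range(1, 121)), size=50))
params = [to_ids_param(batch) for batch in batches]
```

Each entry in `params` would be sent as one request, so the rate-limit cost scales with the number of batches rather than the number of records.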

Cache responses

If you're calling the same endpoint for the same resource repeatedly, cache the response. Use the Cache-Control and ETag headers from the API response to determine how long the cache is valid and when to revalidate. Even a 60-second in-process cache on a high-frequency operation can reduce call volume by orders of magnitude.

# Respect Cache-Control and ETag for conditional GET requests
import requests

resp = requests.get("https://api.example.com/v1/config", headers=auth)
etag = resp.headers.get("ETag")                    # e.g. "abc123"
cache_control = resp.headers.get("Cache-Control")  # e.g. "max-age=300"

# ... 300 seconds later, check if the resource changed
resp = requests.get(
    "https://api.example.com/v1/config",
    headers={**auth, "If-None-Match": etag},
)

# Response:
# HTTP/2 304 Not Modified

# 304 means the resource hasn't changed; use your cached version.
# Depending on the provider, this may still count against your rate limit,
# but it costs the server almost nothing and avoids transferring the full response body.
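A short-lived in-process cache can be sketched in a few lines. This is an illustration only: `TTLCache` and `max_age_seconds` are hypothetical names, the parser reads only the max-age directive (it ignores no-store, no-cache, and the rest), and the clock is injectable so the example runs without waiting.

```python
import re
import time
from typing import Any, Callable, Dict, Optional, Tuple

def max_age_seconds(cache_control: Optional[str]) -> int:
    """Extract max-age from a Cache-Control value; 0 if absent."""
    if not cache_control:
        return 0
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else 0

class TTLCache:
    """Keeps values until their time-to-live expires."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any, ttl: float) -> None:
        self._entries[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: caller must refetch or revalidate
            return None
        return value

# Fake clock so the example is deterministic.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.put("/v1/config", {"flag": True}, ttl=max_age_seconds("public, max-age=300"))
hit = cache.get("/v1/config")    # served from cache, no API call
now[0] = 301.0
miss = cache.get("/v1/config")   # expired after 300 s
```

On a miss, a client would typically revalidate with If-None-Match rather than refetch unconditionally.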

Add a client-side rate limiter

A client-side limiter prevents your code from ever sending more than N requests per second, regardless of how much traffic the rest of your application generates. This is proactive throttling: you enforce the limit yourself before the server has to. It's especially valuable in bulk-processing jobs where you know the total volume will exceed the quota.

# Simple token-bucket client-side limiter (Python)
import threading
import time

import requests

class RateLimiter:
    """Allows max_calls per period seconds."""

    def __init__(self, max_calls: int, period: float = 1.0):
        self.max_calls = max_calls
        self.period = period
        self.tokens = float(max_calls)
        self.last_check = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                elapsed = now - self.last_check
                # Refill tokens proportional to elapsed time
                self.tokens = min(
                    self.max_calls,
                    self.tokens + elapsed * (self.max_calls / self.period),
                )
                self.last_check = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return  # proceed immediately
                # No token available: work out how long until one refills
                sleep_for = (1 - self.tokens) * (self.period / self.max_calls)
            # Sleep outside the lock so other threads are not blocked, then retry
            time.sleep(sleep_for)

# Usage: limit to 80 calls per minute (leaving headroom below the server's 100)
limiter = RateLimiter(max_calls=80, period=60.0)
for record_id in records_to_sync:
    limiter.acquire()  # blocks if needed to stay under 80/min
    resp = requests.get(f"https://api.example.com/v1/records/{record_id}", headers=auth)
    process(resp.json())

When to request a quota increase

Sometimes the right answer is not to optimise your code — it's to ask the provider for a higher quota. Signs you've done the engineering work and genuinely need more:

When you contact the provider, give them specifics: current limit, current usage, required throughput, and what you've already done to reduce volume. Providers are more likely to grant increases to callers who demonstrate they've already optimised.

Putting it together: handling a 429 in production

  1. Receive the 429. Read Retry-After (preferred) or X-RateLimit-Reset. Calculate the wait time.
  2. Do not retry immediately. Log the 429 with the wait time, rate-limit headers, and the endpoint that was called. This log is essential for diagnosing which limit was hit.
  3. Wait with jitter. Sleep for Retry-After + random(0, 0.5 × Retry-After). The jitter prevents a stampede if multiple instances hit the limit at the same time.
  4. Retry with backoff. If the first retry also returns 429, double the wait. Cap the maximum wait at a sensible ceiling (60–120 s).
  5. After N retries, surface the error. Don't retry forever — eventually return the 429 error to the caller so they can handle it. Alert if 429s are sustained (more than X per minute for more than Y minutes) — that's a sign your traffic pattern has changed.
  6. Review the access patterns that generated the 429s. Were they from a bulk job? Add batching and a client-side limiter to that job. Were they from real-time user traffic? Consider caching upstream or request coalescing.
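Steps 1 to 5 above can be sketched as a single retry loop. This is a minimal illustration, not a production client: `call_with_recovery`, `wait_floor`, and the `send` callable (which returns a status code and a header mapping) are hypothetical, the 120 s cap is taken from the range in step 4, and Retry-After is assumed to be in delta-seconds form.

```python
import logging
import random
import time
from typing import Callable, Mapping, Tuple

log = logging.getLogger("ratelimit")
Response = Tuple[int, Mapping[str, str]]  # (status code, headers)

def wait_floor(headers: Mapping[str, str], now: float) -> float:
    """Step 1: prefer Retry-After, fall back to X-RateLimit-Reset."""
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():
        return float(retry_after)
    if "X-RateLimit-Reset" in headers:
        return max(0.0, float(headers["X-RateLimit-Reset"]) - now)
    return 0.0

def call_with_recovery(send: Callable[[], Response], max_retries: int = 5,
                       cap: float = 120.0, sleep=time.sleep,
                       rng=random.random, clock=time.time) -> Response:
    status, headers = send()
    for attempt in range(max_retries):
        if status != 429:
            break
        # Step 4: backoff doubles per attempt but never undercuts the server's floor
        wait = max(wait_floor(headers, clock()), min(cap, 2.0 ** attempt))
        # Step 2: log the 429 with the wait and the rate-limit headers
        log.warning("429: waiting %.1f s, headers=%s", wait, dict(headers))
        # Step 3: jitter of up to 50% goes on top of the wait
        sleep(wait + rng() * 0.5 * wait)
        status, headers = send()
    return status, headers  # Step 5: after max_retries the 429 is surfaced

# Deterministic demo: two 429s, then success; sleep and rng are stubbed.
responses = iter([(429, {"Retry-After": "2"}), (429, {}), (200, {})])
slept = []
result = call_with_recovery(lambda: next(responses), sleep=slept.append,
                            rng=lambda: 0.0, clock=lambda: 0.0)
```

Step 6, reviewing the access pattern, is a design activity rather than code; the log lines written in step 2 are its input.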
🎯 Interview angle

Rate-limit handling is a favourite system-design follow-up: "Your service is getting 429s from a third-party API. What do you do?" The complete answer covers four things: (1) read the headers to know the limit type; (2) implement backoff with jitter; (3) reduce call volume with batching and caching; (4) add a client-side limiter. Candidates who only say "add a retry with backoff" are giving a partial answer: the retry handles recovery; the other three prevent you from needing to recover. See rel-03 for the full rate-limiting algorithm discussion.

✅ Read X-RateLimit-Remaining on every response, not just 429s

Many clients only check rate-limit headers on 429 responses. A better pattern: read X-RateLimit-Remaining on every successful response and slow down proactively when the remaining count gets low (e.g., below 10% of the limit). This prevents the 429 from happening in the first place and gives you a smoother request pattern.
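The proactive slow-down can be sketched as a function that turns the headers of a successful response into a pause. The 10% threshold comes from the text above; the name `proactive_delay` and the policy of spreading the remaining calls evenly over the time left are assumptions made for illustration.

```python
def proactive_delay(remaining: int, limit: int, reset_ts: float, now: float,
                    threshold: float = 0.10) -> float:
    """Seconds to pause before the next request, from a successful response's headers."""
    seconds_left = max(0.0, reset_ts - now)
    if remaining <= 0:
        return seconds_left          # nothing left: wait for the window to reset
    if remaining >= limit * threshold:
        return 0.0                   # plenty of headroom, no pause
    return seconds_left / remaining  # spread the last few calls over the time left

# Limit 100, window resets at t=1060, current time t=1000.
plenty = proactive_delay(remaining=50, limit=100, reset_ts=1060, now=1000)
low = proactive_delay(remaining=3, limit=100, reset_ts=1060, now=1000)
empty = proactive_delay(remaining=0, limit=100, reset_ts=1060, now=1000)
```

With 3 requests left and 60 seconds until the reset, the client pauses 20 seconds between calls instead of spending all three immediately.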

Under the hood: how it actually works

The 429 headers are not guesses — the server derives them from its internal rate-limiting state. Understanding that derivation tells you exactly what each number means and how a correct client should read and act on it.

How the server computes remaining quota: token bucket and fixed window

Most APIs use one of two algorithms to track quota. The values in the response headers directly reflect the algorithm's state at the moment of your request.

Token bucket. A bucket holds up to capacity tokens. One token is consumed per request. Tokens refill continuously at a rate of capacity / window tokens per second. If the bucket is empty, the request is rejected with 429. This permits short bursts up to capacity while enforcing a steady-state throughput ceiling. The server computes X-RateLimit-Remaining as floor(current_tokens).
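The token-bucket bookkeeping described above can be simulated in a few lines. This is a sketch of the general algorithm with explicit timestamps, not any provider's implementation; `TokenBucket` and `allow` are hypothetical names.

```python
import math
from typing import Tuple

class TokenBucket:
    def __init__(self, capacity: int, window: float, now: float = 0.0):
        self.capacity = capacity
        self.rate = capacity / window   # tokens refilled per second
        self.tokens = float(capacity)   # bucket starts full
        self.updated = now

    def allow(self, now: float) -> Tuple[bool, int]:
        """Return (allowed, remaining) for a request arriving at time now."""
        # Continuous refill, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, math.floor(self.tokens)
        return False, math.floor(self.tokens)   # would be answered with a 429

# 10 requests per 60 s: a burst of 11 at t=0, then one more at t=9.
bucket = TokenBucket(capacity=10, window=60.0)
burst = [bucket.allow(0.0)[0] for _ in range(11)]
later = bucket.allow(9.0)   # 9 s of refill yields 1.5 tokens
```

The burst of 10 succeeds at once, the 11th request is rejected, and a request 9 seconds later succeeds because the bucket has refilled past one token.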

Fixed window. A counter resets to 0 at each window boundary (e.g., the start of each minute). Each request increments the counter. When counter == limit, subsequent requests in that window are rejected. X-RateLimit-Remaining = limit - counter. X-RateLimit-Reset is the Unix timestamp of the next window boundary. This is simpler but allows a burst of 2× the limit if requests straddle two windows (the last second of window N + the first second of window N+1).
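The boundary burst is easy to reproduce with a small simulation. The class below is a sketch of a generic fixed-window counter with explicit timestamps; `FixedWindow` and `allow` are hypothetical names.

```python
class FixedWindow:
    def __init__(self, limit: int, window: int):
        self.limit = limit
        self.window = window
        self.window_start = 0
        self.counter = 0

    def allow(self, now: float) -> bool:
        # Find the window this request falls in; reset the counter on a new one
        start = int(now // self.window) * self.window
        if start != self.window_start:
            self.window_start = start
            self.counter = 0
        if self.counter >= self.limit:
            return False   # would be answered with a 429
        self.counter += 1
        return True

# Limit 10 per 60 s. Send 12 requests just before the boundary and 12 just after.
limiter = FixedWindow(limit=10, window=60)
late = sum(limiter.allow(59.5) for _ in range(12))
early = sum(limiter.allow(60.5) for _ in range(12))
```

Ten requests are accepted at 59.5 s and ten more at 60.5 s, so 20 requests pass within about one second even though the limit is 10 per minute.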

Exact header semantics: where each value comes from

| Header | Algorithm source | What it encodes |
| --- | --- | --- |
| X-RateLimit-Limit | Configured capacity | The maximum tokens (bucket) or requests per window (fixed) for this key/tier |
| X-RateLimit-Remaining | Live counter/bucket level at request time | How many more requests are safe in the current window without a 429 |
| X-RateLimit-Reset | Window boundary timestamp | Unix seconds when the counter resets to 0 (fixed window) or bucket fills to capacity (token bucket) |
| Retry-After | Derived: Reset − now() or explicit | Seconds until the server will accept new requests again; may be an HTTP-date string instead |

On a 429 the server still writes these headers from its internal state — so Remaining will be 0, and Reset/Retry-After tells you exactly how many seconds until the counter resets. On a 200 the same headers reflect remaining capacity after deducting the current request.
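How a fixed-window server might turn its state into these headers can be sketched as follows. This is an assumption-laden illustration: `rate_limit_headers` is a hypothetical function, and rounding Retry-After up to a whole second is one reasonable choice, not a documented rule.

```python
import math
from typing import Dict

def rate_limit_headers(limit: int, counter: int, window_end: float,
                       now: float, rejected: bool) -> Dict[str, str]:
    """Build rate-limit headers from fixed-window state (counter includes this request)."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, limit - counter)),
        "X-RateLimit-Reset": str(int(window_end)),
    }
    if rejected:
        # Reset minus now, rounded up so the client never retries early
        headers["Retry-After"] = str(max(1, math.ceil(window_end - now)))
    return headers

# 47 seconds before the window ends at 1718388060.
ok = rate_limit_headers(100, 10, 1718388060, 1718388013.0, rejected=False)
throttled = rate_limit_headers(100, 100, 1718388060, 1718388013.0, rejected=True)
```

The `throttled` result reproduces the values from the scenario response: Remaining 0, Reset 1718388060, Retry-After 47.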

How a correct client reads Reset/Retry-After and backs off with jitter

The algorithm a well-behaved client should follow on every response:

# Pseudocode: run after every HTTP response
import random
import time

def handle_response(resp, attempt):
    remaining = int(resp.headers.get("X-RateLimit-Remaining", 999))
    limit     = int(resp.headers.get("X-RateLimit-Limit", 999))

    # Proactive slow-down on successful responses (before hitting 0)
    if resp.status_code == 200 and remaining < limit * 0.10:
        wait_until_reset(resp)   # or reduce request rate

    if resp.status_code != 429:
        return resp              # success path

    # 429: compute the mandatory floor from server hints
    retry_after = resp.headers.get("Retry-After")
    if retry_after:
        mandatory_floor = float(retry_after)   # assumes seconds, not an HTTP-date
    else:
        reset_ts = int(resp.headers.get("X-RateLimit-Reset", 0))
        mandatory_floor = max(0, reset_ts - time.time())

    # Exponential backoff raises the wait above the floor on repeated 429s
    base_backoff = min(60, 1 * (2 ** attempt))   # 1, 2, 4, 8... capped at 60 s
    wait_seconds = max(mandatory_floor, base_backoff)

    # Jitter goes on top of the wait so the sleep never undercuts the floor,
    # while still de-synchronising instances
    jitter = random.uniform(0, 0.5 * wait_seconds)
    time.sleep(wait_seconds + jitter)
    return None   # signal caller to retry

The key points: Retry-After (when present) is the server's authoritative floor; never wait less. Exponential backoff raises the wait above that floor on repeated 429s. Jitter is added on top and randomises the retry time across instances.

Worked trace: requests crossing the limit and client recovery

# API limit: 10 requests per minute (fixed window), resets at :00 of each minute
# Scenario: sync job fires 12 requests in quick succession at 03:00:58

T=03:00:58.100  GET /v1/records/1
  200  X-RateLimit-Limit:10  X-RateLimit-Remaining:9  X-RateLimit-Reset:1718388060
T=03:00:58.210  GET /v1/records/2
  200  X-RateLimit-Remaining:8  X-RateLimit-Reset:1718388060
... (requests 3–10 succeed, Remaining counts down 7→6→…→0) ...
T=03:00:58.900  GET /v1/records/10
  200  X-RateLimit-Remaining:0  X-RateLimit-Reset:1718388060

# Window is now exhausted. Two more requests remain in the burst:
T=03:00:58.920  GET /v1/records/11
  429  X-RateLimit-Limit:10  X-RateLimit-Remaining:0  X-RateLimit-Reset:1718388060  Retry-After:2

# Client reads Retry-After=2. mandatory_floor=2, base_backoff=1 (attempt 0)
# wait = max(2, 1) = 2 s; jitter = random(0, 1) = e.g. 0.6 s
# → sleep 2.6 seconds

T=03:01:01.520  GET /v1/records/11 (retry, attempt 1)
  200  X-RateLimit-Remaining:9  X-RateLimit-Reset:1718388120   ← new window

# Window reset at 03:01:00 (Reset timestamp 1718388060 passed). Client resumes.
T=03:01:01.600  GET /v1/records/12
  200  X-RateLimit-Remaining:8

# Note: if TWO instances had hit the 429 simultaneously without jitter,
# both would have waited exactly 2 s and fired at the same instant,
# causing another burst just after the window boundary.
# Jitter (0.6 s vs, say, 0.9 s) spreads them out.
⚠️ X-RateLimit-Reset is a timestamp, not a duration — compute the difference yourself

X-RateLimit-Reset carries a Unix timestamp (seconds since epoch), not a countdown, so the wait is the difference between that value and the current time. A common bug is treating the header as seconds-to-wait and sleeping for 1,718,388,060 seconds (about 54 years). Always compute wait = max(0, reset_timestamp - time.time()). Also note: some APIs send Retry-After as an HTTP-date string (Fri, 14 Jun 2024 03:02:00 GMT) rather than an integer, so parse accordingly.
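Both header forms can be normalised to "seconds to wait" with the standard library. The sketch below uses `email.utils.parsedate_to_datetime` for the HTTP-date form; the function names `retry_after_seconds` and `reset_wait_seconds` are hypothetical, and malformed values are left to raise rather than being silently ignored.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(value: str, now: float) -> float:
    """Convert a Retry-After value (delta-seconds or HTTP-date) to seconds to wait."""
    value = value.strip()
    if value.isdigit():
        return float(value)                 # delta-seconds form, e.g. "47"
    when = parsedate_to_datetime(value)     # HTTP-date form
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, when.timestamp() - now)

def reset_wait_seconds(reset_header: str, now: float) -> float:
    """Convert an X-RateLimit-Reset Unix timestamp to seconds to wait."""
    return max(0.0, float(reset_header) - now)

from_seconds = retry_after_seconds("47", now=0.0)
from_date = retry_after_seconds("Fri, 14 Jun 2024 03:02:00 GMT", now=1718334060.0)
from_reset = reset_wait_seconds("1718388060", now=1718388013.0)
```

Passing `now` explicitly keeps the functions testable; production code would pass `time.time()`.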

🧠 Quick check

1. You receive a 429 with Retry-After: 30. Your retry logic retries immediately because the request is "urgent." What happens?

The rate-limit window resets at the time indicated by Retry-After (30 seconds). An immediate retry is within the same window and will also be rejected. With multiple instances, you've sent N×2 requests into an already-rate-limited window. Always honour the Retry-After delay.

2. Why add jitter to the backoff wait time?

If 10 instances all hit the rate limit at the same time, they'll all sleep for the same calculated duration. When the sleep ends, all 10 fire simultaneously — creating a new burst at the rate-limit boundary. Jitter randomises each instance's wait time so they spread out naturally over the reset window.

3. Your sync job fetches 500 user profiles by calling GET /users/{id} in a loop. The API allows 100 requests per minute. The job takes 5+ minutes and hits 429s. What is the fastest fix that doesn't require contacting the provider?

A batch endpoint is the highest-leverage fix: if the API supports fetching 50 users in one call, you reduce 500 requests to 10, which is well under the 100-per-minute limit. Adding a sleep between requests still uses 500 requests but spreads them out; it works, but the job stays slow. Running the job from several servers doesn't work, because rate limits are typically per API key, so servers sharing a key share the same limit.

4. X-RateLimit-Remaining: 3 appears in a 200 response. What should a well-designed client do?

Reading rate-limit headers on successful responses lets you act before hitting the limit rather than after. With only 3 requests remaining, the next burst of traffic will trigger a 429. Throttling proactively — sleeping briefly, queuing requests, or prioritising urgent calls — keeps you under the limit without the disruption of a 429 and retry cycle.

✍️ Exercise: design a rate-limit-aware data pipeline

You're building a nightly data-sync pipeline that fetches ~10,000 records from a third-party CRM API. The CRM limits you to 500 requests per minute and 50,000 requests per day. Each record requires one API call. Design the pipeline to avoid 429s and handle them gracefully if they occur anyway. Consider: batching, scheduling, client-side limiting, backoff, and quota tracking.

Model answer:

  1. Check for a batch endpoint. If the CRM API supports fetching 50 records per call, 10,000 records = 200 requests — well within both limits. Batching is the single most impactful change. Always check before building any other rate-limit machinery.
  2. Spread the job over time. Even without batching, 10,000 requests at 500/min takes 20 minutes. Schedule the job to start with enough runway — don't start at 23:40 if it takes 20 minutes; start at 22:00. This also leaves capacity for other processes using the same API key.
  3. Add a client-side limiter at 450/min (90% of the limit). Leave 10% headroom for other code paths and measurement uncertainty. The limiter ensures the pipeline never sends more than 450 requests per minute, regardless of how fast the job processes records.
  4. Track daily quota usage. With a 50,000/day limit, the pipeline consumes 10,000 (or 200, if batching). Track total daily usage in a counter (Redis, a database) and alert when usage exceeds 80% of the daily quota. This prevents other processes from exhausting the quota before the nightly job runs.
  5. Implement backoff for unexpected 429s. Even with proactive limiting, 429s can occur (e.g., the CRM temporarily lowers limits during maintenance). Implement exponential backoff with jitter (base 1 s, max 60 s, respecting Retry-After). Failed records should be queued for retry at the end of the run rather than blocking the main pipeline.

Rubric: ✓ Batching mentioned first ✓ Client-side limiter below 100% of the API limit ✓ Daily quota tracking ✓ Backoff with jitter ✓ Failed-request queue for retries (not blocking the pipeline).
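The arithmetic behind the model answer can be checked with a short calculation. This is a sketch under the exercise's own numbers; the 50-records-per-call batch size is the model answer's hypothetical, and `plan` is an illustrative name.

```python
import math
from typing import Dict

def plan(records: int, per_call: int, per_minute: int, daily_quota: int) -> Dict[str, float]:
    """Estimate request count, run time, and share of the daily quota."""
    requests_needed = math.ceil(records / per_call)
    return {
        "requests": requests_needed,
        "minutes": requests_needed / per_minute,
        "daily_share": requests_needed / daily_quota,
    }

# Client-side limiter at 450/min, daily quota 50,000.
unbatched = plan(10_000, per_call=1, per_minute=450, daily_quota=50_000)
batched = plan(10_000, per_call=50, per_minute=450, daily_quota=50_000)
```

Without batching the run needs 10,000 requests, about 22 minutes at 450 per minute, and 20% of the daily quota; with 50 records per call it needs 200 requests and finishes in under a minute.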

Key takeaways
