Reliability & Scale · Lesson 03
Rate Limiting Algorithms
Every public API eventually gets hammered — by a runaway client loop, a DDoS, or just unexpected success. Rate limiting is how you protect capacity and enforce fair use. The algorithm you pick shapes whether bursts are allowed, how much memory you spend, and how precisely the limit holds.
By the end you'll be able to
- Explain all five canonical rate limiting algorithms, their trade-offs, and when to use each.
- Design a distributed rate limiter using a shared store (Redis) that handles clock skew and atomic updates.
- Define the correct HTTP response contract for rate limit errors, including the headers clients must respect.
Why rate limiting is hard to get right
A rate limit sounds simple: "no more than 100 requests per minute." But the moment you ask "per minute starting when?" or "what if the same client fires 99 requests in the last millisecond of minute one and 99 more in the first millisecond of minute two?" you discover the problem is genuinely subtle. Different algorithms make different trade-offs between burst tolerance, memory cost, and accuracy. Picking the wrong one can mean either blocking legitimate clients or letting a determined abuser slip through.
Think of rate limiting like a nightclub bouncer. A simple bouncer counts tickets for the hour and stops at capacity — but two groups can each cram in at 11:59 and 12:00, filling the place twice over in two minutes. A smarter bouncer has a sliding memory of who came in over the last 60 minutes, not the current clock-aligned block. Even smarter: a token dispenser that gives out admission tokens at a steady drip, letting crowds absorb small bursts without overwhelming the dance floor.
Algorithm 1 — Fixed window counter
How it works
Divide time into fixed, non-overlapping buckets (e.g., "the minute 14:07:00–14:07:59"). Keep one counter per (client, window). On each request: compute the current window key, increment the counter, and reject if it exceeds the limit. At the start of a new window, the counter resets to zero.
The window key is typically floor(now / window_seconds). This is a single integer operation, which is why it's trivially cheap: one integer stored per client.
# Fixed window counter — pseudo-code
function is_allowed(client_id, limit, window_seconds):
now = current_unix_timestamp()
window_key = floor(now / window_seconds)
redis_key = "rl:" + client_id + ":" + window_key
count = INCR(redis_key) # atomic; returns new value
if count == 1:
EXPIRE(redis_key, window_seconds * 2) # clean up old keys
return count <= limit
Pros and cons
| Pro | Con |
|---|---|
| Trivially simple — one counter per window | Boundary burst: 2× effective limit at window edge |
| O(1) memory per client | Inaccurate at short time scales |
| One atomic INCR on Redis | Counter resets abruptly — sawtooth traffic |
| Easy to reason about and explain | Not suitable when precise enforcement is critical |
Best for: coarse-grained quota enforcement where a 2× burst at boundaries is acceptable — e.g., "no more than 10,000 API calls per day per account" for billing purposes.
Algorithm 2 — Sliding window log
How it works
Instead of a single counter, keep a log (sorted set) of every request's timestamp for each client. On every request, evict entries older than the window, count the remaining entries, and reject if over the limit. The window is always the last N seconds ending right now — it slides with real time.
# Sliding window log — pseudo-code (Redis sorted set)
function is_allowed(client_id, limit, window_ms):
now = current_unix_ms()
key = "rl:" + client_id
cutoff = now - window_ms
# Atomic Lua script (all-or-nothing, no race condition)
lua_script = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local cutoff = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local window = tonumber(ARGV[4])
redis.call('ZREMRANGEBYSCORE', key, '-inf', cutoff) -- evict old
local count = redis.call('ZCARD', key) -- count current
if count < limit then
redis.call('ZADD', key, now, now) -- log this request
redis.call('EXPIRE', key, window / 1000)
return 1 -- allowed
end
return 0 -- rejected
"""
return EVAL(lua_script, key, now, cutoff, limit, window_ms) == 1
Pros and cons
| Pro | Con |
|---|---|
| Perfectly accurate — no boundary burst possible | Memory grows with traffic: O(limit) entries per client |
| Works at any granularity | Eviction scan on every request (ZREMRANGEBYSCORE) |
| Easy to inspect (the log is readable) | For high-limit accounts (e.g., 100k/min) memory is prohibitive |
Best for: low-to-medium volume APIs where precision is critical — security-sensitive endpoints, payment flows — and per-client limits are not enormous.
Algorithm 3 — Sliding window counter
How it works
This is the sweet spot between fixed window (cheap but imprecise) and log (precise but heavy). Keep two fixed-window counters: the current window and the previous window. When evaluating a request at time t, compute a weighted estimate of how much of the previous window's traffic still falls inside the last N seconds:
estimated_count = prev_count × (1 − elapsed_fraction) + curr_count
If elapsed_fraction = 0.3 (30% through the current window), then 70% of the previous window's requests are still "recent." The approximation error is at most ~1/limit of the true rate — negligible for limits above ~10.
# Sliding window counter — pseudo-code
function is_allowed(client_id, limit, window_seconds):
now = current_unix_timestamp()
curr_window = floor(now / window_seconds)
prev_window = curr_window - 1
elapsed_fraction = (now % window_seconds) / window_seconds
curr_key = "rl:" + client_id + ":" + curr_window
prev_key = "rl:" + client_id + ":" + prev_window
curr_count = GET(curr_key) OR 0
prev_count = GET(prev_key) OR 0
# Weighted estimate: how many requests in the last window_seconds?
estimate = prev_count × (1.0 - elapsed_fraction) + curr_count
if estimate >= limit:
return REJECTED
INCR(curr_key)
if curr_count == 0:
EXPIRE(curr_key, window_seconds * 2)
return ALLOWED
Imagine a 60-second window. You're 18 seconds into the current window (30% elapsed). If the previous window had 80 requests, we assume 70% of them (≈56) still "count" toward the last 60 seconds. Add the current window's 20 requests: an estimated 76 out of a limit of 100. The error is bounded because the linear interpolation underestimates real traffic by at most limit/2 under pathological distributions — entirely acceptable for soft quotas.
Pros and cons
| Pro | Con |
|---|---|
| O(1) memory — just two counters per client | Approximation: ~0–3% error near window boundary |
| Eliminates most of the boundary burst problem | Slightly more complex to implement than fixed window |
| Works well for large per-client limits | Error grows if prev window traffic was extremely bursty |
Best for: general-purpose API rate limiting at scale. This is the algorithm used by most production API gateways because it balances accuracy and storage cost well. Cloudflare uses a variant of this approach.
Algorithm 4 — Token bucket
How it works
Imagine a physical bucket that holds up to B tokens. Tokens drip in at a constant refill rate r (tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected. If you haven't made any requests, tokens accumulate — up to the bucket's maximum capacity — which means you can legitimately absorb a short burst whenever you've been idle.
The key insight: the bucket capacity B controls the maximum burst size; the refill rate r controls the sustained rate. They're independently configurable.
# Token bucket — pseudo-code (lazy refill, no background thread needed)
function is_allowed(client_id, capacity, refill_rate):
# Load persisted state {tokens, last_refill_time}
state = LOAD(client_id) OR { tokens: capacity, last_refill: now() }
now_ts = current_unix_timestamp()
elapsed = now_ts - state.last_refill
new_tokens = elapsed × refill_rate
# Refill up to capacity (lazy — computed at request time, not on a timer)
state.tokens = MIN(capacity, state.tokens + new_tokens)
state.last_refill = now_ts
if state.tokens < 1:
SAVE(client_id, state)
return REJECTED # bucket empty
state.tokens -= 1
SAVE(client_id, state)
return ALLOWED
Pros and cons
| Pro | Con |
|---|---|
| Naturally allows controlled bursts — a well-behaved client earns burst credit by being quiet | Two parameters to tune (capacity and refill rate) — can be confusing to expose to users |
| Smooth interaction with bursty legitimate traffic patterns | Requires storing {tokens, last_refill} atomically — slightly harder than a counter |
| Works well for traffic shaping at the individual request level | In distributed settings, floating-point rounding of tokens can drift |
Best for: APIs with naturally bursty clients — mobile apps, SDKs with retry logic, batch importers. The AWS API Gateway, Stripe, and many payment APIs use token-bucket semantics because clients doing legitimate bulk operations earn burst capacity through prior quiet periods.
Algorithm 5 — Leaky bucket
How it works
Flip the mental model: instead of tokens dripping in and requests consuming them, requests drip in and a drain processes them at a constant rate. Imagine a physical bucket with a hole at the bottom. Water (requests) pours in at any rate; it drips out at a fixed rate. If the bucket overflows (queue is full), excess requests are dropped. The key property: the output is always at a constant, predictable rate regardless of the input shape.
# Leaky bucket — pseudo-code (queue-based)
function handle_request(client_id, request, queue_capacity, drain_rate):
queue = GET_QUEUE(client_id) # in-memory or Redis list
if len(queue) >= queue_capacity:
return REJECTED # bucket overflowed
ENQUEUE(queue, request)
return QUEUED # will be drained at drain_rate
# Background drain loop (runs at drain_rate)
function drain_loop(client_id, drain_rate):
interval = 1.0 / drain_rate # seconds between each drain
while True:
request = DEQUEUE(GET_QUEUE(client_id))
if request:
FORWARD_TO_BACKEND(request)
SLEEP(interval)
Many engineers confuse "leaky bucket" with "token bucket" because both have a fixed rate. The difference: leaky bucket smooths output — it queues and drains at a constant rate, so downstream services see metered traffic even if clients burst. Token bucket passes bursts through immediately up to the bucket capacity. In network engineering the leaky bucket is used for traffic shaping (e.g., policing egress bandwidth); token bucket is used for admission control.
Pros and cons
| Pro | Con |
|---|---|
| Absolutely smooth output rate — protects downstream services from spikes | Adds latency: requests sit in the queue before being forwarded |
| Great for protecting fragile backends with low variance tolerance | Queue requires bounded memory; overflow silently drops requests |
| Conceptually simple drain model | Does not allow legitimate bursts — idle time is not "banked" |
Best for: network egress shaping or protecting downstream services (databases, third-party APIs) that have strict throughput limits and cannot tolerate any burstiness. Less common for public API rate limiting, more common at the infrastructure level.
Algorithm comparison
| Algorithm | Burst handling | Memory cost | Accuracy | Implementation complexity | Best use case |
|---|---|---|---|---|---|
| Fixed window counter | 2× burst at boundary | O(1) per client | Low near boundaries | Trivial | Coarse daily/monthly billing quotas |
| Sliding window log | No burst allowed beyond limit | O(limit) per client | Perfect | Moderate | Low-volume, high-precision (e.g. auth endpoints) |
| Sliding window counter | Minimal (<1% boundary burst) | O(1) per client (2 counters) | Very high (~99%+) | Moderate | General-purpose API rate limiting at scale |
| Token bucket | Allows bursts up to bucket size B | O(1) per client | Exact (token math) | Moderate | Bursty API clients; AWS/Stripe/mobile SDKs |
| Leaky bucket | No burst — queue drains at constant rate | O(queue_capacity) | N/A (rate shaper, not limiter) | Complex (requires drain loop) | Traffic shaping; protecting fragile downstream services |
Where to enforce rate limits
Rate limiting can live at several layers. Each layer has a different view of the traffic and different failure modes.
| Layer | What it sees | When to use | Blind spot |
|---|---|---|---|
| CDN / edge (Cloudflare, Fastly) | Raw IP before TLS termination | DDoS protection, bot mitigation | Shared IPs (NAT, offices) count as one client |
| API gateway (Kong, AWS API GW) | Authenticated client identity, API key | Per-key quotas, monetization tiers | Adds one network hop; single point of failure if not HA |
| Application server | Full request context (user ID, endpoint) | Fine-grained per-endpoint limits | Each instance enforces independently without shared store |
| Service mesh (Envoy, Istio) | Service-to-service traffic | Internal quotas between microservices | Adds complexity; overkill for simple APIs |
Most production systems stack layers: CDN for IP-based abuse, API gateway for authenticated quota enforcement, and application-level logic for fine-grained per-operation limits.
Distributed rate limiting
In a single-server deployment, an in-memory counter is sufficient. In a horizontally scaled fleet, every instance needs to see the same counter — otherwise "limit 100 req/min" becomes "limit 100 per node" and your actual limit is 100 × number of nodes. The canonical solution is a shared external store, almost universally Redis in practice.
Atomic operations: INCR and Lua scripts
Redis is single-threaded for command execution. A simple INCR is guaranteed atomic: two simultaneous calls from different nodes will both succeed, and the counter will reach exactly 2 — no lost increments. For operations that need to read and conditionally write (like token bucket's "check tokens, decrement if allowed"), a Lua script is atomic as a unit because Redis executes the entire script without yielding:
-- Redis Lua: atomic fixed-window check-and-increment
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local count = redis.call('INCR', key)
if count == 1 then
redis.call('EXPIRE', key, window) -- set TTL only on first write
end
if count > limit then
return 0 -- rejected
end
return 1 -- allowed
Clock skew
In a distributed fleet, different nodes have slightly different wall-clock readings — usually milliseconds to tens of milliseconds apart due to NTP drift. For fixed-window counters this means a request arriving at 14:07:59.998 on one node and 14:08:00.003 on another (due to network latency + clock drift) will be counted in different windows. The practical impact is small — it's essentially an additional source of the boundary-burst error — but it means you should not rely on rate limiters for exact second-level enforcement. If you need millisecond precision, use a centralized Redis with timestamps from the Redis server itself (TIME command) rather than trusting the application node's clock.
-- Use Redis server time, not application node time
local time = redis.call('TIME') -- {seconds, microseconds}
local now_ms = tonumber(time[1]) * 1000 + math.floor(tonumber(time[2]) / 1000)
The HTTP response contract for rate limits
When a request is rejected for exceeding the rate limit, the server must return HTTP 429 Too Many Requests. Simply returning 429 is not enough — clients need machine-readable information to implement proper backoff. The standard headers are:
| Header | Value | Purpose |
|---|---|---|
X-RateLimit-Limit | The maximum allowed (e.g., 100) | Tells the client what the ceiling is |
X-RateLimit-Remaining | Calls left in this window (0 on rejection) | Lets well-behaved clients self-throttle before hitting 429 |
X-RateLimit-Reset | Unix timestamp when the window resets | Client knows exactly when to retry; also send on every 2xx response |
Retry-After | Seconds to wait (or an HTTP-date) | IETF standard; clients must respect this for 429 and 503 |
Proactively include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset on every successful 2xx response. A well-built client SDK reads these on every response and proactively slows down when Remaining drops near zero, avoiding the 429 entirely. Headers only on the error response means clients have already burned the quota before they know about it.
Client behavior: the right way to handle 429
# Client-side rate limit handling — pseudo-code
function api_call(endpoint, max_retries=5):
for attempt in range(max_retries):
response = HTTP_GET(endpoint)
# On every response: track remaining quota
remaining = response.headers['X-RateLimit-Remaining']
reset_at = response.headers['X-RateLimit-Reset']
if response.status == 200:
return response.body
if response.status == 429:
retry_after = response.headers.get('Retry-After', 60)
wait_seconds = float(retry_after) + jitter() # add jitter!
LOG("Rate limited. Waiting " + wait_seconds + "s")
SLEEP(wait_seconds)
continue
raise UnexpectedError(response.status)
raise MaxRetriesExceeded()
When thousands of clients all receive 429 simultaneously and immediately retry, they arrive at the server as a synchronized wave — a thundering herd. Each retry triggers another 429, which triggers another immediate retry. Always add a wait (honoring Retry-After) plus jitter (a random additional delay) to desynchronize retries across the client population. Without jitter, exponential backoff still causes synchronized bursts at the backoff intervals.
This is one of the most common system design interview questions. A complete answer covers: (1) clarify requirements — per-user or per-IP? which algorithm? what's the limit? (2) choose an algorithm — justify token bucket or sliding window counter for most cases; (3) distributed implementation — Redis with atomic Lua for shared state; (4) where to place it — gateway layer, not application; (5) the response contract — HTTP 429, Retry-After, X-RateLimit-* headers; (6) failure mode — what happens if Redis is down? (fail open vs. fail closed). Interviewers are looking for all six. Most candidates only cover the algorithm part and miss the distributed and failure-mode angles.
How to debug & inspect it
Rate limit bugs come in two flavours: clients hitting 429 they did not expect, and limits that are not enforcing at all (clients sending far more than the limit without being stopped). Both are diagnosable with a handful of curl commands and a careful read of the response headers.
Read the headers on every response
A well-behaved API sends X-RateLimit-* on every 2xx, not just on 429. The fastest diagnostic is to add -I or -v and read the state of the window before you hit the wall:
Reproduce a 429 with a tight request loop
When debugging whether a client is the source of its own 429s, fire a controlled burst and read the Retry-After value exactly as a client would:
Diagnose a distributed-counter race
If your API fleet uses a shared Redis counter but clients consistently hit more than the stated limit, the likely cause is a non-atomic read-modify-write instead of an atomic Lua script or INCR:
Diagnose clock skew
In a distributed fleet, windows computed from wall-clock time on different nodes will not align perfectly. The symptom is a fixed-window counter that resets a few milliseconds earlier on some nodes, allowing a small extra burst. Verify by comparing the Redis TIME against the application node's clock:
Worked numeric token-bucket trace
Capacity B = 10 tokens, refill rate r = 2 tokens/second. Starting state: full bucket (10 tokens). Five events over 6 seconds:
| t (s) | Event | Elapsed since last | Tokens added (r × elapsed) | Tokens before consume | Tokens after | Result |
|---|---|---|---|---|---|---|
| 0.0 | Request A | — | 0 | 10.0 | 9.0 | Allowed |
| 0.2 | Request B | 0.2 s | 0.4 | 9.4 | 8.4 | Allowed |
| 0.3 | Burst: 9 reqs | 0.1 s | 0.2 | 8.6 | 0 (−8.6, rounds to 0) | 8 allowed, 1 rejected (bucket empty) |
| 2.8 | Request C | 2.5 s | 5.0 | 5.0 | 4.0 | Allowed (idle time refilled 5 tokens) |
| 5.8 | Request D | 3.0 s | 6.0 | min(10, 4+6)=10 | 9.0 | Allowed (capped at capacity) |
Key observations: the burst at t=0.3 drained the bucket; the 2.5-second idle at t=2.8 earned 5 tokens back; capping at capacity means idle time never accumulates past the burst maximum, so a client cannot "save up" an unbounded burst by staying quiet for hours.
Symptom → cause → fix
| Symptom | Likely cause | Fix |
|---|---|---|
| Client hits 429 far below the stated limit | Per-node counter not shared — each node enforces the limit independently, so the effective limit is limit / nodes | Switch to shared Redis counter with atomic Lua script |
| Limit seems to reset twice per window (e.g. at :00 and :30 in a 60-second window) | Clock skew between nodes: different nodes compute different window keys for the same real time | Use Redis TIME inside the Lua script for all window-key calculations |
429 headers show Retry-After: 0 or a past timestamp | Reset timestamp computed on application node whose clock is behind Redis/truth | Derive reset time from Redis TIME + window length |
| Limit not enforced at all — unlimited traffic passes | Redis is down and the code path fails open (returns "allowed" on error) | Decide on fail-open vs. fail-closed policy; log and alert on Redis errors regardless |
| Well-behaved client occasionally sees 429 spikes without sending a burst | Fixed-window boundary burst from a different client sharing the same Redis key (key is too coarse — e.g. by IP behind NAT) | Scope the key more narrowly (API key, user ID); or switch to sliding window counter to eliminate boundary bursts |
- On every 429, read
Retry-Afterand wait that many seconds plus jitter before retrying — never retry immediately. - Add the rate-limit headers (
X-RateLimit-Remaining,X-RateLimit-Reset) to your client wrapper so your code can self-throttle before hitting 429. - In staging, force-expire the Redis key and re-run the request to confirm the counter starts fresh:
redis-cli DEL "rl:user_42:28333". - Verify atomicity: grep your Lua script or check that you use
INCR(not GET + SET) — a non-atomic read-modify-write will under-count under concurrency. - Check that your key includes both the client identifier and the window key — a key that omits the window component never resets.
In production: how leading APIs do it
Rate limiting is not a configuration toggle — it is an architectural decision that shapes how clients build integrations and how support teams debug outages. Five platforms that handle billions of API calls daily each made different algorithm choices, and those choices are reflected in the headers they return and the client behaviours they encourage.
| Company | Algorithm | Typical limit | Client signal |
|---|---|---|---|
| Stripe | Token bucket per account, backed by Redis; FOUR cooperating limiters in parallel (see deep-dive below) | ~100 read req/s and ~100 write req/s for live-mode keys | 429 + Retry-After (seconds until the bucket refills enough for one request) |
| Shopify | Leaky bucket for REST (bucket of ~40 requests leaking at ~2 req/s); calculated query-cost limiter for GraphQL (each field has a cost weight; queries that exceed the remaining bucket cost are rejected) | REST: 40-unit bucket refilling at 2/s; GraphQL: 1,000-point bucket refilling at 50 points/s | X-Shopify-Shop-Api-Call-Limit (e.g. 32/40) on REST; X-GraphQL-Cost-Include-Fields and throttle extensions in the GraphQL response body |
| GitHub | Fixed window (hourly) for primary limits; separate secondary/abuse heuristics for burst detection | 5,000 req/hour for authenticated REST; secondary limits triggered by patterns (many writes in a short window, many concurrent requests) | X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used on every response; Retry-After on secondary-limit 429 |
| AWS API Gateway | Token bucket (steady-state rate + burst capacity) at the stage and method level; usage plans layer per-key quotas (daily/monthly fixed windows) on top | Default: 10,000 req/s steady-state, 5,000 req burst; configurable per usage plan | 429 Too Many Requests; no standard rate-limit headers by default (clients must implement exponential backoff; Lambda integrations can add headers manually) |
| HubSpot | Rolling per-account window (10-second bucket) | ~100–190 req/10s depending on subscription tier; daily cap of 250,000–1,000,000 requests | X-HubSpot-RateLimit-Remaining, X-HubSpot-RateLimit-Reset, X-HubSpot-RateLimit-Daily, X-HubSpot-RateLimit-Daily-Remaining |
Deep-dive: Stripe's four cooperating limiters
Stripe's publicly documented architecture runs not one but four rate limiters simultaneously on every inbound request. Each limiter targets a different failure mode, and a request is rejected if any single one of them trips.
The first is a request-rate limiter: a token bucket per account key that enforces the per-second ceiling most developers think of as "the" rate limit. The second is a concurrent-request limiter: it counts in-flight requests for a given key at any instant, regardless of rate. This catches the pattern where a client opens hundreds of connections simultaneously — each individually slow, but collectively exhausting connection pool capacity. The third is a fleet usage load shedder: when the overall fleet utilisation crosses a threshold, a portion of non-critical requests is shed to protect capacity for write operations and webhook delivery. This is not per-account — it is a global signal that fires during traffic spikes or partial outages. The fourth is a worker utilisation load shedder: individual compute workers monitor their own CPU and memory saturation and shed requests locally before they queue deeply. The practical consequence for integrators is that a well-behaved account observing the published req/s limits can still see a 429 during a platform-wide traffic event — and Retry-After is the correct response in all four cases.
This layered approach is worth understanding when designing your own API: a single rate limit rarely covers all the ways a client can overwhelm a system. Concurrent connections, burst depth, and fleet-level health are distinct axes.
By the numbers: what Redis stores, the write, and the delay
Make it concrete. The limiter is capacity B = 100 tokens, refill r = 50 tokens/second (so a sustained 50 req/s, with bursts up to 100). Per account, Redis stores just two fields in a hash rl:{account}:
Every request runs this read → compute → write as ONE atomic Lua script (atomic so two app servers can't both spend the last token):
The key formula is the delay a throttled client must wait. When the bucket holds t tokens (0 ≤ t < 1), the next token arrives in (1 − t) / r seconds, so:
Trace a burst that drains the bucket — B = 100, r = 50/s, starting full, a client fires 130 requests at t = 0 then keeps polling:
| t (s) | Event | Tokens before | Decision | Tokens after | Client told |
|---|---|---|---|---|---|
| 0.000 | reqs 1–100 (burst) | 100 → 1 | all 100 ALLOW | 0 | 200 OK |
| 0.000 | reqs 101–130 | ~0 | 30 DENY | 0 | 429, Retry-After: 1 |
| 0.020 | 1 retry (too early) | 0.02·50 = 1.0 | 1 ALLOW | 0 | 200 OK |
| 1.000 | steady polling | refilled to 50 over 1 s | ALLOW ≤ 50/s | drains to 0 | 200 then 429 |
The decision math (sizing the bucket): capacity B is the maximum burst you tolerate; refill r is the sustained rate you guarantee. To let a client briefly burst to 2× their steady rate for up to 2 seconds, set B = r + (2r − r)·2 = 3r. And the worst-case extra load a full fleet of C clients can dump in one instant is C × B requests — size your backend so a synchronized burst (C × B) doesn't tip it over, or add the concurrency limiter and load shedders from the Stripe section below.
🧠 Quick check
1. A client sends 100 requests in the last 2 ms of minute 14:07 and 100 more in the first 2 ms of minute 14:08. Both sets are accepted. Which algorithm is in use?
This is the classic boundary burst problem of the fixed window counter. The window resets at the minute boundary, so both sets of 100 requests start fresh counters. Sliding window log would have caught this. Token bucket only passes bursts if there are tokens saved, not due to clock boundaries.
2. You need to protect a payment API endpoint. Accuracy is critical — you cannot allow even a 2× burst. Memory cost is not a concern (very few premium clients). Which algorithm fits best?
Sliding window log is the only algorithm with perfect accuracy — it stores every request's timestamp and evicts based on a real sliding window. For a low-volume, high-stakes endpoint like payment processing, the memory overhead is acceptable and precision is worth it.
3. In a distributed three-node API fleet, each node uses its own in-memory counter with a limit of 100 requests/min. What effective limit does a client experience?
Without a shared store, each node tracks the counter independently. If the load balancer distributes evenly, the client can send 100 req/min to each of the 3 nodes for a de-facto 300 req/min limit — 3× the intended cap. The load balancer does not enforce application-level quotas; it only routes TCP connections.
4. A server responds with HTTP 429 and a Retry-After: 30 header. What must a correctly implemented client do?
Retry-After is an instruction to the client: do not retry before this many seconds have elapsed. Immediately retrying a 429 is the textbook thundering herd failure mode. Jitter (random additional delay) prevents synchronized retries across the client population.
🏗️ Exercise 1 — Classic interview: design a rate limiter for an API (try before reading)
Model answer:
- Choose the algorithm: Token bucket with capacity 200 (allows the stated burst) and refill rate 1000/60 ≈ 16.7 tokens/sec. This handles bursty clients and is configurable per tier.
- Shared store: Redis for the token counter. Use a Lua script that atomically reads {tokens, last_refill}, computes the refill based on elapsed time, checks if a token is available, and writes back. A single Lua script is an atomic unit in Redis — no race conditions.
- Key schema:
rl:{api_key}→ hash with fieldstokens(float) andlast_refill(Unix ms). TTL set to 2× the refill-to-full time. - Where: At the API gateway layer, before requests reach application servers, so upstream services are protected and the limiter is a single enforcement point. Run the gateway in HA (multiple instances with health checks).
- Response contract: 429 + Retry-After + X-RateLimit-* on every response including 2xx.
- Redis failure mode: Decision point — fail open (allow traffic if Redis is unreachable, losing rate limit protection temporarily) or fail closed (return 503, protecting the API at the cost of availability). For a public monetized API, fail closed is safer.
- Observability: Emit metrics for 429 rate by API key and endpoint; alert if a single key accounts for >10% of all traffic; log rejected requests with client info for abuse investigation.
Rubric: ✓ Algorithm choice with justification ✓ Distributed shared store ✓ Atomic operation (no race condition) ✓ Gateway placement ✓ HTTP response contract ✓ Redis failure mode decision ✓ Observability hook. Hitting 5+ = strong answer.
🔍 Exercise 2 — Pick the right algorithm for a scenario
- A data export API allows customers to export their entire dataset. Jobs are CPU-intensive. You want to guarantee the backend is never hit with more than 5 concurrent export jobs per second regardless of how customers batch requests.
- A login endpoint must allow no more than 10 failed attempts per user per 15 minutes. Security matters more than performance; the limit must be exact.
- A mobile SDK calls a recommendations API. Users send bursts when they open the app (10 requests at once) then nothing for minutes. You want to allow those opening bursts but limit sustained throughput to 60 req/min.
- A billing API counts API calls for invoicing purposes. You need a counter per customer per calendar month. A small boundary burst is acceptable; you need minimal memory.
Model answers:
- Leaky bucket. You need smoothed output — no matter how requests arrive, the backend sees a constant 5/sec drain rate. Bursts are queued and forwarded at the controlled rate; overflow is dropped with 429.
- Sliding window log. Security demands exact enforcement — the fixed window's boundary burst would allow 20 attempts in near-zero time (the classic account lockout bypass). Sliding window log counts precisely and the limit is low (10), so O(limit) memory is fine.
- Token bucket. The 10-request opening burst earns token credit during idle time between app opens. The refill rate caps sustained throughput at 60/min (1 token/sec) while the bucket capacity of 10 allows the opening burst to drain immediately.
- Fixed window counter. Monthly billing is the canonical use case for fixed windows — the billing period is itself a fixed window. A small boundary burst is explicitly acceptable, and O(1) memory per customer keeps costs down at scale.
Rubric: Each answer should name the algorithm and give one concrete reason tied to the scenario's constraints. Answers that justify by elimination (e.g., "sliding window log is too memory-heavy for monthly billing") are equally valid.
Key takeaways
- Fixed window counter is cheapest but allows a 2× boundary burst — fine for coarse billing quotas, not for security-sensitive limits.
- Sliding window log is perfectly accurate but stores O(limit) entries per client — best for low-volume, high-stakes endpoints.
- Sliding window counter (weighted approximation of two fixed windows) gives near-perfect accuracy at O(1) memory — the general-purpose default for most APIs.
- Token bucket allows controlled bursts up to the bucket capacity — ideal for bursty clients that earn burst credit through idle time.
- Leaky bucket smooths output to a constant rate — it's a traffic shaper, not just an admission controller; adds latency via the queue.
- In distributed systems, use a shared Redis store with atomic Lua scripts to avoid per-node over-counting. Use Redis server time to avoid clock skew.
- Always return HTTP 429 with
Retry-After,X-RateLimit-Limit,X-RateLimit-Remaining, andX-RateLimit-Reset— on every response, not just on rejection. - Clients must honor
Retry-Afterand add jitter to prevent the thundering-herd retry storm.
Sources & further reading
- Cloudflare — Rate Limiting Rules documentation
- Stripe — API Rate Limits (token bucket semantics, headers)
- MDN — HTTP 429 Too Many Requests
- RFC 6585 §4 — Additional HTTP Status Codes: 429 Too Many Requests
- Redis — Lua scripting for atomic operations
- AWS Builders' Library — Using load shedding to avoid overload