API Design

Reliability & Scale · Lesson 03

Rate Limiting Algorithms

Every public API eventually gets hammered — by a runaway client loop, a DDoS, or just unexpected success. Rate limiting is how you protect capacity and enforce fair use. The algorithm you pick shapes whether bursts are allowed, how much memory you spend, and how precisely the limit holds.

⏱ 22 min Difficulty: advanced Prereq: HTTP basics, idempotency

By the end you'll be able to

Why rate limiting is hard to get right

A rate limit sounds simple: "no more than 100 requests per minute." But the moment you ask "per minute starting when?" or "what if the same client fires 99 requests in the last millisecond of minute one and 99 more in the first millisecond of minute two?" you discover the problem is genuinely subtle. Different algorithms make different trade-offs between burst tolerance, memory cost, and accuracy. Picking the wrong one can mean either blocking legitimate clients or letting a determined abuser slip through.

Think of rate limiting like a nightclub bouncer. A simple bouncer counts tickets for the hour and stops at capacity — but two groups can each cram in at 11:59 and 12:00, filling the place twice over in two minutes. A smarter bouncer has a sliding memory of who came in over the last 60 minutes, not the current clock-aligned block. Even smarter: a token dispenser that gives out admission tokens at a steady drip, letting crowds absorb small bursts without overwhelming the dance floor.

Algorithm 1 — Fixed window counter

How it works

Divide time into fixed, non-overlapping buckets (e.g., "the minute 14:07:00–14:07:59"). Keep one counter per (client, window). On each request: compute the current window key, increment the counter, and reject if it exceeds the limit. At the start of a new window, the counter resets to zero.

The window key is typically floor(now / window_seconds). This is a single integer operation, which is why it's trivially cheap: one integer stored per client.

# Fixed window counter — pseudo-code
function is_allowed(client_id, limit, window_seconds):
  now         = current_unix_timestamp()
  window_key  = floor(now / window_seconds)
  redis_key   = "rl:" + client_id + ":" + window_key

  count = INCR(redis_key)          # atomic; returns new value
  if count == 1:
    EXPIRE(redis_key, window_seconds * 2)  # clean up old keys

  return count <= limit
Window 1 (00:00 – 00:59) counter = 100 ✓ Window 2 (01:00 – 01:59) counter = 100 ✓ window boundary 100 req ↑ last ms 100 req ↑ first ms ⚠ 200 requests pass in <2 ms — the boundary burst problem
Fig 1 — The fixed-window boundary burst: a client can legally send 2× the intended limit in near-zero time by straddling the window edge.

Pros and cons

ProCon
Trivially simple — one counter per windowBoundary burst: 2× effective limit at window edge
O(1) memory per clientInaccurate at short time scales
One atomic INCR on RedisCounter resets abruptly — sawtooth traffic
Easy to reason about and explainNot suitable when precise enforcement is critical

Best for: coarse-grained quota enforcement where a 2× burst at boundaries is acceptable — e.g., "no more than 10,000 API calls per day per account" for billing purposes.

Algorithm 2 — Sliding window log

How it works

Instead of a single counter, keep a log (sorted set) of every request's timestamp for each client. On every request, evict entries older than the window, count the remaining entries, and reject if over the limit. The window is always the last N seconds ending right now — it slides with real time.

# Sliding window log — pseudo-code (Redis sorted set)
function is_allowed(client_id, limit, window_ms):
  now      = current_unix_ms()
  key      = "rl:" + client_id
  cutoff   = now - window_ms

  # Atomic Lua script (all-or-nothing, no race condition)
  lua_script = """
    local key     = KEYS[1]
    local now     = tonumber(ARGV[1])
    local cutoff  = tonumber(ARGV[2])
    local limit   = tonumber(ARGV[3])
    local window  = tonumber(ARGV[4])

    redis.call('ZREMRANGEBYSCORE', key, '-inf', cutoff)  -- evict old
    local count = redis.call('ZCARD', key)               -- count current
    if count < limit then
      redis.call('ZADD', key, now, now)                  -- log this request
      redis.call('EXPIRE', key, window / 1000)
      return 1   -- allowed
    end
    return 0     -- rejected
  """

  return EVAL(lua_script, key, now, cutoff, limit, window_ms) == 1
active window — last 60 s now evicted (stale) Each ● = a timestamped entry in the sorted set ZREMRANGEBYSCORE purges entries left of the window boundary on every request
Fig 2 — Sliding window log. The window is anchored to "now" and rolls forward continuously. No boundary-burst problem exists, but every request stores a timestamp entry in memory.

Pros and cons

ProCon
Perfectly accurate — no boundary burst possibleMemory grows with traffic: O(limit) entries per client
Works at any granularityEviction scan on every request (ZREMRANGEBYSCORE)
Easy to inspect (the log is readable)For high-limit accounts (e.g., 100k/min) memory is prohibitive

Best for: low-to-medium volume APIs where precision is critical — security-sensitive endpoints, payment flows — and per-client limits are not enormous.

Algorithm 3 — Sliding window counter

How it works

This is the sweet spot between fixed window (cheap but imprecise) and log (precise but heavy). Keep two fixed-window counters: the current window and the previous window. When evaluating a request at time t, compute a weighted estimate of how much of the previous window's traffic still falls inside the last N seconds:

estimated_count = prev_count × (1 − elapsed_fraction) + curr_count

If elapsed_fraction = 0.3 (30% through the current window), then 70% of the previous window's requests are still "recent." The approximation error is at most ~1/limit of the true rate — negligible for limits above ~10.

# Sliding window counter — pseudo-code
function is_allowed(client_id, limit, window_seconds):
  now              = current_unix_timestamp()
  curr_window      = floor(now / window_seconds)
  prev_window      = curr_window - 1
  elapsed_fraction = (now % window_seconds) / window_seconds

  curr_key = "rl:" + client_id + ":" + curr_window
  prev_key = "rl:" + client_id + ":" + prev_window

  curr_count = GET(curr_key) OR 0
  prev_count = GET(prev_key) OR 0

  # Weighted estimate: how many requests in the last window_seconds?
  estimate = prev_count × (1.0 - elapsed_fraction) + curr_count

  if estimate >= limit:
    return REJECTED

  INCR(curr_key)
  if curr_count == 0:
    EXPIRE(curr_key, window_seconds * 2)
  return ALLOWED
✅ Why the weighted formula works

Imagine a 60-second window. You're 18 seconds into the current window (30% elapsed). If the previous window had 80 requests, we assume 70% of them (≈56) still "count" toward the last 60 seconds. Add the current window's 20 requests: an estimated 76 out of a limit of 100. The error is bounded because the linear interpolation underestimates real traffic by at most limit/2 under pathological distributions — entirely acceptable for soft quotas.

Pros and cons

ProCon
O(1) memory — just two counters per clientApproximation: ~0–3% error near window boundary
Eliminates most of the boundary burst problemSlightly more complex to implement than fixed window
Works well for large per-client limitsError grows if prev window traffic was extremely bursty

Best for: general-purpose API rate limiting at scale. This is the algorithm used by most production API gateways because it balances accuracy and storage cost well. Cloudflare uses a variant of this approach.

Algorithm 4 — Token bucket

How it works

Imagine a physical bucket that holds up to B tokens. Tokens drip in at a constant refill rate r (tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected. If you haven't made any requests, tokens accumulate — up to the bucket's maximum capacity — which means you can legitimately absorb a short burst whenever you've been idle.

The key insight: the bucket capacity B controls the maximum burst size; the refill rate r controls the sustained rate. They're independently configurable.

# Token bucket — pseudo-code (lazy refill, no background thread needed)
function is_allowed(client_id, capacity, refill_rate):
  # Load persisted state {tokens, last_refill_time}
  state = LOAD(client_id) OR { tokens: capacity, last_refill: now() }

  now_ts       = current_unix_timestamp()
  elapsed      = now_ts - state.last_refill
  new_tokens   = elapsed × refill_rate

  # Refill up to capacity (lazy — computed at request time, not on a timer)
  state.tokens = MIN(capacity, state.tokens + new_tokens)
  state.last_refill = now_ts

  if state.tokens < 1:
    SAVE(client_id, state)
    return REJECTED  # bucket empty

  state.tokens -= 1
  SAVE(client_id, state)
  return ALLOWED
Token Bucket (capacity B) tokens = 4 of 8 +r tokens/sec request −1 token allowed ✓ bucket empty → 429 idle time = stored burst
Fig 3 — Token bucket. Idle time accumulates tokens; burst requests drain them. The refill rate r caps sustained throughput; the bucket size B caps burst size.

Pros and cons

ProCon
Naturally allows controlled bursts — a well-behaved client earns burst credit by being quietTwo parameters to tune (capacity and refill rate) — can be confusing to expose to users
Smooth interaction with bursty legitimate traffic patternsRequires storing {tokens, last_refill} atomically — slightly harder than a counter
Works well for traffic shaping at the individual request levelIn distributed settings, floating-point rounding of tokens can drift

Best for: APIs with naturally bursty clients — mobile apps, SDKs with retry logic, batch importers. The AWS API Gateway, Stripe, and many payment APIs use token-bucket semantics because clients doing legitimate bulk operations earn burst capacity through prior quiet periods.

Algorithm 5 — Leaky bucket

How it works

Flip the mental model: instead of tokens dripping in and requests consuming them, requests drip in and a drain processes them at a constant rate. Imagine a physical bucket with a hole at the bottom. Water (requests) pours in at any rate; it drips out at a fixed rate. If the bucket overflows (queue is full), excess requests are dropped. The key property: the output is always at a constant, predictable rate regardless of the input shape.

# Leaky bucket — pseudo-code (queue-based)
function handle_request(client_id, request, queue_capacity, drain_rate):
  queue = GET_QUEUE(client_id)  # in-memory or Redis list

  if len(queue) >= queue_capacity:
    return REJECTED  # bucket overflowed

  ENQUEUE(queue, request)
  return QUEUED  # will be drained at drain_rate

# Background drain loop (runs at drain_rate)
function drain_loop(client_id, drain_rate):
  interval = 1.0 / drain_rate  # seconds between each drain
  while True:
    request = DEQUEUE(GET_QUEUE(client_id))
    if request:
      FORWARD_TO_BACKEND(request)
    SLEEP(interval)
⚠️ Leaky bucket is not quite what you expect

Many engineers confuse "leaky bucket" with "token bucket" because both have a fixed rate. The difference: leaky bucket smooths output — it queues and drains at a constant rate, so downstream services see metered traffic even if clients burst. Token bucket passes bursts through immediately up to the bucket capacity. In network engineering the leaky bucket is used for traffic shaping (e.g., policing egress bandwidth); token bucket is used for admission control.

Pros and cons

ProCon
Absolutely smooth output rate — protects downstream services from spikesAdds latency: requests sit in the queue before being forwarded
Great for protecting fragile backends with low variance toleranceQueue requires bounded memory; overflow silently drops requests
Conceptually simple drain modelDoes not allow legitimate bursts — idle time is not "banked"

Best for: network egress shaping or protecting downstream services (databases, third-party APIs) that have strict throughput limits and cannot tolerate any burstiness. Less common for public API rate limiting, more common at the infrastructure level.

Algorithm comparison

Algorithm Burst handling Memory cost Accuracy Implementation complexity Best use case
Fixed window counter 2× burst at boundary O(1) per client Low near boundaries Trivial Coarse daily/monthly billing quotas
Sliding window log No burst allowed beyond limit O(limit) per client Perfect Moderate Low-volume, high-precision (e.g. auth endpoints)
Sliding window counter Minimal (<1% boundary burst) O(1) per client (2 counters) Very high (~99%+) Moderate General-purpose API rate limiting at scale
Token bucket Allows bursts up to bucket size B O(1) per client Exact (token math) Moderate Bursty API clients; AWS/Stripe/mobile SDKs
Leaky bucket No burst — queue drains at constant rate O(queue_capacity) N/A (rate shaper, not limiter) Complex (requires drain loop) Traffic shaping; protecting fragile downstream services

Where to enforce rate limits

Rate limiting can live at several layers. Each layer has a different view of the traffic and different failure modes.

LayerWhat it seesWhen to useBlind spot
CDN / edge (Cloudflare, Fastly)Raw IP before TLS terminationDDoS protection, bot mitigationShared IPs (NAT, offices) count as one client
API gateway (Kong, AWS API GW)Authenticated client identity, API keyPer-key quotas, monetization tiersAdds one network hop; single point of failure if not HA
Application serverFull request context (user ID, endpoint)Fine-grained per-endpoint limitsEach instance enforces independently without shared store
Service mesh (Envoy, Istio)Service-to-service trafficInternal quotas between microservicesAdds complexity; overkill for simple APIs

Most production systems stack layers: CDN for IP-based abuse, API gateway for authenticated quota enforcement, and application-level logic for fine-grained per-operation limits.

Distributed rate limiting

In a single-server deployment, an in-memory counter is sufficient. In a horizontally scaled fleet, every instance needs to see the same counter — otherwise "limit 100 req/min" becomes "limit 100 per node" and your actual limit is 100 × number of nodes. The canonical solution is a shared external store, almost universally Redis in practice.

Atomic operations: INCR and Lua scripts

Redis is single-threaded for command execution. A simple INCR is guaranteed atomic: two simultaneous calls from different nodes will both succeed, and the counter will reach exactly 2 — no lost increments. For operations that need to read and conditionally write (like token bucket's "check tokens, decrement if allowed"), a Lua script is atomic as a unit because Redis executes the entire script without yielding:

-- Redis Lua: atomic fixed-window check-and-increment
local key     = KEYS[1]
local limit   = tonumber(ARGV[1])
local window  = tonumber(ARGV[2])

local count   = redis.call('INCR', key)
if count == 1 then
  redis.call('EXPIRE', key, window)  -- set TTL only on first write
end
if count > limit then
  return 0   -- rejected
end
return 1     -- allowed

Clock skew

In a distributed fleet, different nodes have slightly different wall-clock readings — usually milliseconds to tens of milliseconds apart due to NTP drift. For fixed-window counters this means a request arriving at 14:07:59.998 on one node and 14:08:00.003 on another (due to network latency + clock drift) will be counted in different windows. The practical impact is small — it's essentially an additional source of the boundary-burst error — but it means you should not rely on rate limiters for exact second-level enforcement. If you need millisecond precision, use a centralized Redis with timestamps from the Redis server itself (TIME command) rather than trusting the application node's clock.

-- Use Redis server time, not application node time
local time   = redis.call('TIME')         -- {seconds, microseconds}
local now_ms = tonumber(time[1]) * 1000 + math.floor(tonumber(time[2]) / 1000)
Clients Load balancer API Node 1 API Node 2 API Node 3 Redis shared counter All nodes share the same counter — atomic INCR prevents double-counting
Fig 4 — Distributed rate limiting. Without a shared store, each API node enforces the limit independently, multiplying it by the fleet size. Redis atomic operations collapse the view to a single consistent counter.

The HTTP response contract for rate limits

When a request is rejected for exceeding the rate limit, the server must return HTTP 429 Too Many Requests. Simply returning 429 is not enough — clients need machine-readable information to implement proper backoff. The standard headers are:

HTTP/1.1 429 Too Many Requests Content-Type: application/json X-RateLimit-Limit: 100 X-RateLimit-Remaining: 0 X-RateLimit-Reset: 1718880060 Retry-After: 23 {"error": "rate_limit_exceeded", "message": "100 requests per minute. Retry after 23 seconds."}
HeaderValuePurpose
X-RateLimit-LimitThe maximum allowed (e.g., 100)Tells the client what the ceiling is
X-RateLimit-RemainingCalls left in this window (0 on rejection)Lets well-behaved clients self-throttle before hitting 429
X-RateLimit-ResetUnix timestamp when the window resetsClient knows exactly when to retry; also send on every 2xx response
Retry-AfterSeconds to wait (or an HTTP-date)IETF standard; clients must respect this for 429 and 503
✅ Send rate limit headers on every response, not just 429

Proactively include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset on every successful 2xx response. A well-built client SDK reads these on every response and proactively slows down when Remaining drops near zero, avoiding the 429 entirely. Headers only on the error response means clients have already burned the quota before they know about it.

Client behavior: the right way to handle 429

# Client-side rate limit handling — pseudo-code
function api_call(endpoint, max_retries=5):
  for attempt in range(max_retries):
    response = HTTP_GET(endpoint)

    # On every response: track remaining quota
    remaining = response.headers['X-RateLimit-Remaining']
    reset_at  = response.headers['X-RateLimit-Reset']

    if response.status == 200:
      return response.body

    if response.status == 429:
      retry_after = response.headers.get('Retry-After', 60)
      wait_seconds = float(retry_after) + jitter()  # add jitter!
      LOG("Rate limited. Waiting " + wait_seconds + "s")
      SLEEP(wait_seconds)
      continue

    raise UnexpectedError(response.status)

  raise MaxRetriesExceeded()
⚠️ Never retry 429 without a wait — that's the thundering herd problem

When thousands of clients all receive 429 simultaneously and immediately retry, they arrive at the server as a synchronized wave — a thundering herd. Each retry triggers another 429, which triggers another immediate retry. Always add a wait (honoring Retry-After) plus jitter (a random additional delay) to desynchronize retries across the client population. Without jitter, exponential backoff still causes synchronized bursts at the backoff intervals.

🎯 Interview angle — "Design a rate limiter"

This is one of the most common system design interview questions. A complete answer covers: (1) clarify requirements — per-user or per-IP? which algorithm? what's the limit? (2) choose an algorithm — justify token bucket or sliding window counter for most cases; (3) distributed implementation — Redis with atomic Lua for shared state; (4) where to place it — gateway layer, not application; (5) the response contract — HTTP 429, Retry-After, X-RateLimit-* headers; (6) failure mode — what happens if Redis is down? (fail open vs. fail closed). Interviewers are looking for all six. Most candidates only cover the algorithm part and miss the distributed and failure-mode angles.

How to debug & inspect it

Rate limit bugs come in two flavours: clients hitting 429 they did not expect, and limits that are not enforcing at all (clients sending far more than the limit without being stopped). Both are diagnosable with a handful of curl commands and a careful read of the response headers.

Read the headers on every response

A well-behaved API sends X-RateLimit-* on every 2xx, not just on 429. The fastest diagnostic is to add -I or -v and read the state of the window before you hit the wall:

$ curl -sI https://api.example.com/v1/data \ -H "Authorization: Bearer sk_live_abc" HTTP/2 200 x-ratelimit-limit: 100 x-ratelimit-remaining: 73 x-ratelimit-reset: 1750003600 # Unix timestamp when window resets # Convert reset timestamp to human time: $ date -r 1750003600 Fri Jun 14 15:06:40 2025 # Hit the limit intentionally: $ for i in $(seq 1 110); do curl -so /dev/null -w "%{http_code} " \ -H "Authorization: Bearer sk_live_abc" \ https://api.example.com/v1/data; done 200 200 200 200 200 ... 429 429 429 429 429 # limit hit at request 101

Reproduce a 429 with a tight request loop

When debugging whether a client is the source of its own 429s, fire a controlled burst and read the Retry-After value exactly as a client would:

$ curl -sv -X GET https://api.example.com/v1/search \ -H "Authorization: Bearer sk_live_abc" \ 2>&1 | grep -E "(HTTP|x-ratelimit|retry-after)" < HTTP/2 429 < retry-after: 37 < x-ratelimit-limit: 60 < x-ratelimit-remaining: 0 < x-ratelimit-reset: 1750003637 # Wait 37 seconds (+ jitter), then retry. Never retry immediately.

Diagnose a distributed-counter race

If your API fleet uses a shared Redis counter but clients consistently hit more than the stated limit, the likely cause is a non-atomic read-modify-write instead of an atomic Lua script or INCR:

# Symptom: 3-node fleet, limit 100/min, but a single client can send ~285/min # Cause: each node uses a local counter, not shared Redis $ redis-cli GET "rl:user_42:28333" (nil) # key is missing — this node never wrote to Redis # Correct: shared Redis key should exist and have a value $ redis-cli GET "rl:user_42:28333" "87" # 87 requests counted across all nodes this window $ redis-cli TTL "rl:user_42:28333" 43 # 43 seconds until window resets — matches X-RateLimit-Reset

Diagnose clock skew

In a distributed fleet, windows computed from wall-clock time on different nodes will not align perfectly. The symptom is a fixed-window counter that resets a few milliseconds earlier on some nodes, allowing a small extra burst. Verify by comparing the Redis TIME against the application node's clock:

$ redis-cli TIME 1) "1750003560" # Unix seconds on Redis server 2) "123456" # microseconds $ date +%s 1750003561 # 1 second ahead — acceptable NTP drift 1750003575 # 15 seconds ahead — clock skew large enough to cause window misalignment # Fix: use Redis TIME inside the Lua script rather than passing # the application node's timestamp as ARGV.

Worked numeric token-bucket trace

Capacity B = 10 tokens, refill rate r = 2 tokens/second. Starting state: full bucket (10 tokens). Five events over 6 seconds:

t (s)EventElapsed since lastTokens added (r × elapsed)Tokens before consumeTokens afterResult
0.0Request A010.09.0Allowed
0.2Request B0.2 s0.49.48.4Allowed
0.3Burst: 9 reqs0.1 s0.28.60 (−8.6, rounds to 0)8 allowed, 1 rejected (bucket empty)
2.8Request C2.5 s5.05.04.0Allowed (idle time refilled 5 tokens)
5.8Request D3.0 s6.0min(10, 4+6)=109.0Allowed (capped at capacity)

Key observations: the burst at t=0.3 drained the bucket; the 2.5-second idle at t=2.8 earned 5 tokens back; capping at capacity means idle time never accumulates past the burst maximum, so a client cannot "save up" an unbounded burst by staying quiet for hours.

Symptom → cause → fix

SymptomLikely causeFix
Client hits 429 far below the stated limitPer-node counter not shared — each node enforces the limit independently, so the effective limit is limit / nodesSwitch to shared Redis counter with atomic Lua script
Limit seems to reset twice per window (e.g. at :00 and :30 in a 60-second window)Clock skew between nodes: different nodes compute different window keys for the same real timeUse Redis TIME inside the Lua script for all window-key calculations
429 headers show Retry-After: 0 or a past timestampReset timestamp computed on application node whose clock is behind Redis/truthDerive reset time from Redis TIME + window length
Limit not enforced at all — unlimited traffic passesRedis is down and the code path fails open (returns "allowed" on error)Decide on fail-open vs. fail-closed policy; log and alert on Redis errors regardless
Well-behaved client occasionally sees 429 spikes without sending a burstFixed-window boundary burst from a different client sharing the same Redis key (key is too coarse — e.g. by IP behind NAT)Scope the key more narrowly (API key, user ID); or switch to sliding window counter to eliminate boundary bursts
  1. On every 429, read Retry-After and wait that many seconds plus jitter before retrying — never retry immediately.
  2. Add the rate-limit headers (X-RateLimit-Remaining, X-RateLimit-Reset) to your client wrapper so your code can self-throttle before hitting 429.
  3. In staging, force-expire the Redis key and re-run the request to confirm the counter starts fresh: redis-cli DEL "rl:user_42:28333".
  4. Verify atomicity: grep your Lua script or check that you use INCR (not GET + SET) — a non-atomic read-modify-write will under-count under concurrency.
  5. Check that your key includes both the client identifier and the window key — a key that omits the window component never resets.

In production: how leading APIs do it

Rate limiting is not a configuration toggle — it is an architectural decision that shapes how clients build integrations and how support teams debug outages. Five platforms that handle billions of API calls daily each made different algorithm choices, and those choices are reflected in the headers they return and the client behaviours they encourage.

Company Algorithm Typical limit Client signal
Stripe Token bucket per account, backed by Redis; FOUR cooperating limiters in parallel (see deep-dive below) ~100 read req/s and ~100 write req/s for live-mode keys 429 + Retry-After (seconds until the bucket refills enough for one request)
Shopify Leaky bucket for REST (bucket of ~40 requests leaking at ~2 req/s); calculated query-cost limiter for GraphQL (each field has a cost weight; queries that exceed the remaining bucket cost are rejected) REST: 40-unit bucket refilling at 2/s; GraphQL: 1,000-point bucket refilling at 50 points/s X-Shopify-Shop-Api-Call-Limit (e.g. 32/40) on REST; X-GraphQL-Cost-Include-Fields and throttle extensions in the GraphQL response body
GitHub Fixed window (hourly) for primary limits; separate secondary/abuse heuristics for burst detection 5,000 req/hour for authenticated REST; secondary limits triggered by patterns (many writes in a short window, many concurrent requests) X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used on every response; Retry-After on secondary-limit 429
AWS API Gateway Token bucket (steady-state rate + burst capacity) at the stage and method level; usage plans layer per-key quotas (daily/monthly fixed windows) on top Default: 10,000 req/s steady-state, 5,000 req burst; configurable per usage plan 429 Too Many Requests; no standard rate-limit headers by default (clients must implement exponential backoff; Lambda integrations can add headers manually)
HubSpot Rolling per-account window (10-second bucket) ~100–190 req/10s depending on subscription tier; daily cap of 250,000–1,000,000 requests X-HubSpot-RateLimit-Remaining, X-HubSpot-RateLimit-Reset, X-HubSpot-RateLimit-Daily, X-HubSpot-RateLimit-Daily-Remaining

Deep-dive: Stripe's four cooperating limiters

Stripe's publicly documented architecture runs not one but four rate limiters simultaneously on every inbound request. Each limiter targets a different failure mode, and a request is rejected if any single one of them trips.

The first is a request-rate limiter: a token bucket per account key that enforces the per-second ceiling most developers think of as "the" rate limit. The second is a concurrent-request limiter: it counts in-flight requests for a given key at any instant, regardless of rate. This catches the pattern where a client opens hundreds of connections simultaneously — each individually slow, but collectively exhausting connection pool capacity. The third is a fleet usage load shedder: when the overall fleet utilisation crosses a threshold, a portion of non-critical requests is shed to protect capacity for write operations and webhook delivery. This is not per-account — it is a global signal that fires during traffic spikes or partial outages. The fourth is a worker utilisation load shedder: individual compute workers monitor their own CPU and memory saturation and shed requests locally before they queue deeply. The practical consequence for integrators is that a well-behaved account observing the published req/s limits can still see a 429 during a platform-wide traffic event — and Retry-After is the correct response in all four cases.

This layered approach is worth understanding when designing your own API: a single rate limit rarely covers all the ways a client can overwhelm a system. Concurrent connections, burst depth, and fleet-level health are distinct axes.

How leading APIs do it

By the numbers: what Redis stores, the write, and the delay

Make it concrete. The limiter is capacity B = 100 tokens, refill r = 50 tokens/second (so a sustained 50 req/s, with bursts up to 100). Per account, Redis stores just two fields in a hash rl:{account}:

# rl:acct_42 → { tokens: 100.0, ts: 1718900000.000 } (ts = last refill time, seconds)

Every request runs this read → compute → write as ONE atomic Lua script (atomic so two app servers can't both spend the last token):

now = 1718900000.700 -- request arrives tokens, ts = HGET rl:acct_42 tokens ts -- READ → 100.0, 1718900000.000 elapsed = now - ts -- 0.700 s refilled = min(B, tokens + elapsed * r) -- min(100, 100 + 0.7*50) = 100 (capped) if refilled >= 1: -- ALLOW HSET rl:acct_42 tokens (refilled-1) ts now -- WRITE → 99.0, now return ALLOW, X-RateLimit-Remaining: 99 else: -- DENY retry_after = ceil((1 - refilled) / r) -- seconds until 1 token exists return 429, Retry-After: retry_after

The key formula is the delay a throttled client must wait. When the bucket holds t tokens (0 ≤ t < 1), the next token arrives in (1 − t) / r seconds, so:

Retry-After = ceil( (1 − tokens) / r ) seconds

Trace a burst that drains the bucket — B = 100, r = 50/s, starting full, a client fires 130 requests at t = 0 then keeps polling:

t (s)EventTokens beforeDecisionTokens afterClient told
0.000reqs 1–100 (burst)100 → 1all 100 ALLOW0200 OK
0.000reqs 101–130~030 DENY0429, Retry-After: 1
0.0201 retry (too early)0.02·50 = 1.01 ALLOW0200 OK
1.000steady pollingrefilled to 50 over 1 sALLOW ≤ 50/sdrains to 0200 then 429

The decision math (sizing the bucket): capacity B is the maximum burst you tolerate; refill r is the sustained rate you guarantee. To let a client briefly burst to 2× their steady rate for up to 2 seconds, set B = r + (2r − r)·2 = 3r. And the worst-case extra load a full fleet of C clients can dump in one instant is C × B requests — size your backend so a synchronized burst (C × B) doesn't tip it over, or add the concurrency limiter and load shedders from the Stripe section below.

🧠 Quick check

1. A client sends 100 requests in the last 2 ms of minute 14:07 and 100 more in the first 2 ms of minute 14:08. Both sets are accepted. Which algorithm is in use?

This is the classic boundary burst problem of the fixed window counter. The window resets at the minute boundary, so both sets of 100 requests start fresh counters. Sliding window log would have caught this. Token bucket only passes bursts if there are tokens saved, not due to clock boundaries.

2. You need to protect a payment API endpoint. Accuracy is critical — you cannot allow even a 2× burst. Memory cost is not a concern (very few premium clients). Which algorithm fits best?

Sliding window log is the only algorithm with perfect accuracy — it stores every request's timestamp and evicts based on a real sliding window. For a low-volume, high-stakes endpoint like payment processing, the memory overhead is acceptable and precision is worth it.

3. In a distributed three-node API fleet, each node uses its own in-memory counter with a limit of 100 requests/min. What effective limit does a client experience?

Without a shared store, each node tracks the counter independently. If the load balancer distributes evenly, the client can send 100 req/min to each of the 3 nodes for a de-facto 300 req/min limit — 3× the intended cap. The load balancer does not enforce application-level quotas; it only routes TCP connections.

4. A server responds with HTTP 429 and a Retry-After: 30 header. What must a correctly implemented client do?

Retry-After is an instruction to the client: do not retry before this many seconds have elapsed. Immediately retrying a 429 is the textbook thundering herd failure mode. Jitter (random additional delay) prevents synchronized retries across the client population.

🏗️ Exercise 1 — Classic interview: design a rate limiter for an API (try before reading)
Scenario You are asked to design a rate limiter for a REST API that serves 500 million requests per day across a fleet of 20 application servers. Requirements: 1,000 requests/min per API key; bursts of up to 200 requests in a short window are acceptable; the solution must be correct under concurrent traffic from all 20 nodes.

Model answer:

  1. Choose the algorithm: Token bucket with capacity 200 (allows the stated burst) and refill rate 1000/60 ≈ 16.7 tokens/sec. This handles bursty clients and is configurable per tier.
  2. Shared store: Redis for the token counter. Use a Lua script that atomically reads {tokens, last_refill}, computes the refill based on elapsed time, checks if a token is available, and writes back. A single Lua script is an atomic unit in Redis — no race conditions.
  3. Key schema: rl:{api_key} → hash with fields tokens (float) and last_refill (Unix ms). TTL set to 2× the refill-to-full time.
  4. Where: At the API gateway layer, before requests reach application servers, so upstream services are protected and the limiter is a single enforcement point. Run the gateway in HA (multiple instances with health checks).
  5. Response contract: 429 + Retry-After + X-RateLimit-* on every response including 2xx.
  6. Redis failure mode: Decision point — fail open (allow traffic if Redis is unreachable, losing rate limit protection temporarily) or fail closed (return 503, protecting the API at the cost of availability). For a public monetized API, fail closed is safer.
  7. Observability: Emit metrics for 429 rate by API key and endpoint; alert if a single key accounts for >10% of all traffic; log rejected requests with client info for abuse investigation.

Rubric: ✓ Algorithm choice with justification ✓ Distributed shared store ✓ Atomic operation (no race condition) ✓ Gateway placement ✓ HTTP response contract ✓ Redis failure mode decision ✓ Observability hook. Hitting 5+ = strong answer.

🔍 Exercise 2 — Pick the right algorithm for a scenario
Scenario Match each of the following use cases to the most appropriate rate limiting algorithm and briefly justify your choice.
  1. A data export API allows customers to export their entire dataset. Jobs are CPU-intensive. You want to guarantee the backend is never hit with more than 5 concurrent export jobs per second regardless of how customers batch requests.
  2. A login endpoint must allow no more than 10 failed attempts per user per 15 minutes. Security matters more than performance; the limit must be exact.
  3. A mobile SDK calls a recommendations API. Users send bursts when they open the app (10 requests at once) then nothing for minutes. You want to allow those opening bursts but limit sustained throughput to 60 req/min.
  4. A billing API counts API calls for invoicing purposes. You need a counter per customer per calendar month. A small boundary burst is acceptable; you need minimal memory.

Model answers:

  1. Leaky bucket. You need smoothed output — no matter how requests arrive, the backend sees a constant 5/sec drain rate. Bursts are queued and forwarded at the controlled rate; overflow is dropped with 429.
  2. Sliding window log. Security demands exact enforcement — the fixed window's boundary burst would allow 20 attempts in near-zero time (the classic account lockout bypass). Sliding window log counts precisely and the limit is low (10), so O(limit) memory is fine.
  3. Token bucket. The 10-request opening burst earns token credit during idle time between app opens. The refill rate caps sustained throughput at 60/min (1 token/sec) while the bucket capacity of 10 allows the opening burst to drain immediately.
  4. Fixed window counter. Monthly billing is the canonical use case for fixed windows — the billing period is itself a fixed window. A small boundary burst is explicitly acceptable, and O(1) memory per customer keeps costs down at scale.

Rubric: Each answer should name the algorithm and give one concrete reason tied to the scenario's constraints. Answers that justify by elimination (e.g., "sliding window log is too memory-heavy for monthly billing") are equally valid.

Key takeaways

Sources & further reading