Reliability & Scale · Lesson 03

Rate Limiting Algorithms

Every public API eventually gets hammered — by a runaway client loop, a DDoS, or just unexpected success. Rate limiting is how you protect capacity and enforce fair use. The algorithm you pick shapes whether bursts are allowed, how much memory you spend, and how precisely the limit holds.

⏱ 22 min Difficulty: advanced Prereq: HTTP basics, idempotency

By the end you'll be able to

Explain all five canonical rate limiting algorithms, their trade-offs, and when to use each.
Design a distributed rate limiter using a shared store (Redis) that handles clock skew and atomic updates.
Define the correct HTTP response contract for rate limit errors, including the headers clients must respect.

Why rate limiting is hard to get right

A rate limit sounds simple: "no more than 100 requests per minute." But the moment you ask "per minute starting when?" or "what if the same client fires 99 requests in the last millisecond of minute one and 99 more in the first millisecond of minute two?" you discover the problem is genuinely subtle. Different algorithms make different trade-offs between burst tolerance, memory cost, and accuracy. Picking the wrong one can mean either blocking legitimate clients or letting a determined abuser slip through.

Think of rate limiting like a nightclub bouncer. A simple bouncer counts tickets for the hour and stops at capacity — but two groups can each cram in at 11:59 and 12:00, filling the place twice over in two minutes. A smarter bouncer has a sliding memory of who came in over the last 60 minutes, not the current clock-aligned block. Even smarter: a token dispenser that gives out admission tokens at a steady drip, letting crowds absorb small bursts without overwhelming the dance floor.

Algorithm 1 — Fixed window counter

How it works

Divide time into fixed, non-overlapping buckets (e.g., "the minute 14:07:00–14:07:59"). Keep one counter per (client, window). On each request: compute the current window key, increment the counter, and reject if it exceeds the limit. At the start of a new window, the counter resets to zero.

The window key is typically floor(now / window_seconds). This is a single integer operation, which is why it's trivially cheap: one integer stored per client.

# Fixed window counter — pseudo-code
function is_allowed(client_id, limit, window_seconds):
  now         = current_unix_timestamp()
  window_key  = floor(now / window_seconds)
  redis_key   = "rl:" + client_id + ":" + window_key

  count = INCR(redis_key)          # atomic; returns new value
  if count == 1:
    EXPIRE(redis_key, window_seconds * 2)  # clean up old keys

  return count <= limit

Fig 1 — The fixed-window boundary burst: a client can legally send 2× the intended limit in near-zero time by straddling the window edge.

Pros and cons

Pro	Con
Trivially simple — one counter per window	Boundary burst: 2× effective limit at window edge
O(1) memory per client	Inaccurate at short time scales
One atomic INCR on Redis	Counter resets abruptly — sawtooth traffic
Easy to reason about and explain	Not suitable when precise enforcement is critical

Best for: coarse-grained quota enforcement where a 2× burst at boundaries is acceptable — e.g., "no more than 10,000 API calls per day per account" for billing purposes.

Algorithm 2 — Sliding window log

How it works

Instead of a single counter, keep a log (sorted set) of every request's timestamp for each client. On every request, evict entries older than the window, count the remaining entries, and reject if over the limit. The window is always the last N seconds ending right now — it slides with real time.

# Sliding window log — pseudo-code (Redis sorted set)
function is_allowed(client_id, limit, window_ms):
  now      = current_unix_ms()
  key      = "rl:" + client_id
  cutoff   = now - window_ms

  # Atomic Lua script (all-or-nothing, no race condition)
  lua_script = """
    local key     = KEYS[1]
    local now     = tonumber(ARGV[1])
    local cutoff  = tonumber(ARGV[2])
    local limit   = tonumber(ARGV[3])
    local window  = tonumber(ARGV[4])

    redis.call('ZREMRANGEBYSCORE', key, '-inf', cutoff)  -- evict old
    local count = redis.call('ZCARD', key)               -- count current
    if count < limit then
      redis.call('ZADD', key, now, now)                  -- log this request
      redis.call('EXPIRE', key, window / 1000)
      return 1   -- allowed
    end
    return 0     -- rejected
  """

  return EVAL(lua_script, key, now, cutoff, limit, window_ms) == 1

Fig 2 — Sliding window log. The window is anchored to "now" and rolls forward continuously. No boundary-burst problem exists, but every request stores a timestamp entry in memory.

Pros and cons

Pro	Con
Perfectly accurate — no boundary burst possible	Memory grows with traffic: O(limit) entries per client
Works at any granularity	Eviction scan on every request (ZREMRANGEBYSCORE)
Easy to inspect (the log is readable)	For high-limit accounts (e.g., 100k/min) memory is prohibitive

Best for: low-to-medium volume APIs where precision is critical — security-sensitive endpoints, payment flows — and per-client limits are not enormous.

Algorithm 3 — Sliding window counter

How it works

This is the sweet spot between fixed window (cheap but imprecise) and log (precise but heavy). Keep two fixed-window counters: the current window and the previous window. When evaluating a request at time t, compute a weighted estimate of how much of the previous window's traffic still falls inside the last N seconds:

estimated_count = prev_count × (1 − elapsed_fraction) + curr_count

If elapsed_fraction = 0.3 (30% through the current window), then 70% of the previous window's requests are still "recent." The approximation error is at most ~1/limit of the true rate — negligible for limits above ~10.

# Sliding window counter — pseudo-code
function is_allowed(client_id, limit, window_seconds):
  now              = current_unix_timestamp()
  curr_window      = floor(now / window_seconds)
  prev_window      = curr_window - 1
  elapsed_fraction = (now % window_seconds) / window_seconds

  curr_key = "rl:" + client_id + ":" + curr_window
  prev_key = "rl:" + client_id + ":" + prev_window

  curr_count = GET(curr_key) OR 0
  prev_count = GET(prev_key) OR 0

  # Weighted estimate: how many requests in the last window_seconds?
  estimate = prev_count × (1.0 - elapsed_fraction) + curr_count

  if estimate >= limit:
    return REJECTED

  INCR(curr_key)
  if curr_count == 0:
    EXPIRE(curr_key, window_seconds * 2)
  return ALLOWED

✅ Why the weighted formula works

Imagine a 60-second window. You're 18 seconds into the current window (30% elapsed). If the previous window had 80 requests, we assume 70% of them (≈56) still "count" toward the last 60 seconds. Add the current window's 20 requests: an estimated 76 out of a limit of 100. The error is bounded because the linear interpolation underestimates real traffic by at most limit/2 under pathological distributions — entirely acceptable for soft quotas.

Pros and cons

Pro	Con
O(1) memory — just two counters per client	Approximation: ~0–3% error near window boundary
Eliminates most of the boundary burst problem	Slightly more complex to implement than fixed window
Works well for large per-client limits	Error grows if prev window traffic was extremely bursty

Best for: general-purpose API rate limiting at scale. This is the algorithm used by most production API gateways because it balances accuracy and storage cost well. Cloudflare uses a variant of this approach.

Algorithm 4 — Token bucket

How it works

Imagine a physical bucket that holds up to B tokens. Tokens drip in at a constant refill rate r (tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected. If you haven't made any requests, tokens accumulate — up to the bucket's maximum capacity — which means you can legitimately absorb a short burst whenever you've been idle.

The key insight: the bucket capacity B controls the maximum burst size; the refill rate r controls the sustained rate. They're independently configurable.

# Token bucket — pseudo-code (lazy refill, no background thread needed)
function is_allowed(client_id, capacity, refill_rate):
  # Load persisted state {tokens, last_refill_time}
  state = LOAD(client_id) OR { tokens: capacity, last_refill: now() }

  now_ts       = current_unix_timestamp()
  elapsed      = now_ts - state.last_refill
  new_tokens   = elapsed × refill_rate

  # Refill up to capacity (lazy — computed at request time, not on a timer)
  state.tokens = MIN(capacity, state.tokens + new_tokens)
  state.last_refill = now_ts

  if state.tokens < 1:
    SAVE(client_id, state)
    return REJECTED  # bucket empty

  state.tokens -= 1
  SAVE(client_id, state)
  return ALLOWED

Fig 3 — Token bucket. Idle time accumulates tokens; burst requests drain them. The refill rate r caps sustained throughput; the bucket size B caps burst size.

Pros and cons

Pro	Con
Naturally allows controlled bursts — a well-behaved client earns burst credit by being quiet	Two parameters to tune (capacity and refill rate) — can be confusing to expose to users
Smooth interaction with bursty legitimate traffic patterns	Requires storing {tokens, last_refill} atomically — slightly harder than a counter
Works well for traffic shaping at the individual request level	In distributed settings, floating-point rounding of tokens can drift

Best for: APIs with naturally bursty clients — mobile apps, SDKs with retry logic, batch importers. The AWS API Gateway, Stripe, and many payment APIs use token-bucket semantics because clients doing legitimate bulk operations earn burst capacity through prior quiet periods.

Algorithm 5 — Leaky bucket

How it works

Flip the mental model: instead of tokens dripping in and requests consuming them, requests drip in and a drain processes them at a constant rate. Imagine a physical bucket with a hole at the bottom. Water (requests) pours in at any rate; it drips out at a fixed rate. If the bucket overflows (queue is full), excess requests are dropped. The key property: the output is always at a constant, predictable rate regardless of the input shape.

# Leaky bucket — pseudo-code (queue-based)
function handle_request(client_id, request, queue_capacity, drain_rate):
  queue = GET_QUEUE(client_id)  # in-memory or Redis list

  if len(queue) >= queue_capacity:
    return REJECTED  # bucket overflowed

  ENQUEUE(queue, request)
  return QUEUED  # will be drained at drain_rate

# Background drain loop (runs at drain_rate)
function drain_loop(client_id, drain_rate):
  interval = 1.0 / drain_rate  # seconds between each drain
  while True:
    request = DEQUEUE(GET_QUEUE(client_id))
    if request:
      FORWARD_TO_BACKEND(request)
    SLEEP(interval)

⚠️ Leaky bucket is not quite what you expect

Many engineers confuse "leaky bucket" with "token bucket" because both have a fixed rate. The difference: leaky bucket smooths output — it queues and drains at a constant rate, so downstream services see metered traffic even if clients burst. Token bucket passes bursts through immediately up to the bucket capacity. In network engineering the leaky bucket is used for traffic shaping (e.g., policing egress bandwidth); token bucket is used for admission control.

Pros and cons

Pro	Con
Absolutely smooth output rate — protects downstream services from spikes	Adds latency: requests sit in the queue before being forwarded
Great for protecting fragile backends with low variance tolerance	Queue requires bounded memory; overflow silently drops requests
Conceptually simple drain model	Does not allow legitimate bursts — idle time is not "banked"

Best for: network egress shaping or protecting downstream services (databases, third-party APIs) that have strict throughput limits and cannot tolerate any burstiness. Less common for public API rate limiting, more common at the infrastructure level.

Algorithm comparison

Algorithm	Burst handling	Memory cost	Accuracy	Implementation complexity	Best use case
Fixed window counter	2× burst at boundary	O(1) per client	Low near boundaries	Trivial	Coarse daily/monthly billing quotas
Sliding window log	No burst allowed beyond limit	O(limit) per client	Perfect	Moderate	Low-volume, high-precision (e.g. auth endpoints)
Sliding window counter	Minimal (<1% boundary burst)	O(1) per client (2 counters)	Very high (~99%+)	Moderate	General-purpose API rate limiting at scale
Token bucket	Allows bursts up to bucket size B	O(1) per client	Exact (token math)	Moderate	Bursty API clients; AWS/Stripe/mobile SDKs
Leaky bucket	No burst — queue drains at constant rate	O(queue_capacity)	N/A (rate shaper, not limiter)	Complex (requires drain loop)	Traffic shaping; protecting fragile downstream services

Where to enforce rate limits

Rate limiting can live at several layers. Each layer has a different view of the traffic and different failure modes.

Layer	What it sees	When to use	Blind spot
CDN / edge (Cloudflare, Fastly)	Raw IP before TLS termination	DDoS protection, bot mitigation	Shared IPs (NAT, offices) count as one client
API gateway (Kong, AWS API GW)	Authenticated client identity, API key	Per-key quotas, monetization tiers	Adds one network hop; single point of failure if not HA
Application server	Full request context (user ID, endpoint)	Fine-grained per-endpoint limits	Each instance enforces independently without shared store
Service mesh (Envoy, Istio)	Service-to-service traffic	Internal quotas between microservices	Adds complexity; overkill for simple APIs

Most production systems stack layers: CDN for IP-based abuse, API gateway for authenticated quota enforcement, and application-level logic for fine-grained per-operation limits.

Distributed rate limiting

In a single-server deployment, an in-memory counter is sufficient. In a horizontally scaled fleet, every instance needs to see the same counter — otherwise "limit 100 req/min" becomes "limit 100 per node" and your actual limit is 100 × number of nodes. The canonical solution is a shared external store, almost universally Redis in practice.

Atomic operations: INCR and Lua scripts

Redis is single-threaded for command execution. A simple INCR is guaranteed atomic: two simultaneous calls from different nodes will both succeed, and the counter will reach exactly 2 — no lost increments. For operations that need to read and conditionally write (like token bucket's "check tokens, decrement if allowed"), a Lua script is atomic as a unit because Redis executes the entire script without yielding:

-- Redis Lua: atomic fixed-window check-and-increment
local key     = KEYS[1]
local limit   = tonumber(ARGV[1])
local window  = tonumber(ARGV[2])

local count   = redis.call('INCR', key)
if count == 1 then
  redis.call('EXPIRE', key, window)  -- set TTL only on first write
end
if count > limit then
  return 0   -- rejected
end
return 1     -- allowed

Clock skew

In a distributed fleet, different nodes have slightly different wall-clock readings — usually milliseconds to tens of milliseconds apart due to NTP drift. For fixed-window counters this means a request arriving at 14:07:59.998 on one node and 14:08:00.003 on another (due to network latency + clock drift) will be counted in different windows. The practical impact is small — it's essentially an additional source of the boundary-burst error — but it means you should not rely on rate limiters for exact second-level enforcement. If you need millisecond precision, use a centralized Redis with timestamps from the Redis server itself (TIME command) rather than trusting the application node's clock.

-- Use Redis server time, not application node time
local time   = redis.call('TIME')         -- {seconds, microseconds}
local now_ms = tonumber(time[1]) * 1000 + math.floor(tonumber(time[2]) / 1000)

Fig 4 — Distributed rate limiting. Without a shared store, each API node enforces the limit independently, multiplying it by the fleet size. Redis atomic operations collapse the view to a single consistent counter.

The HTTP response contract for rate limits

When a request is rejected for exceeding the rate limit, the server must return HTTP 429 Too Many Requests. Simply returning 429 is not enough — clients need machine-readable information to implement proper backoff. The standard headers are:

HTTP/1.1 429 Too Many Requests Content-Type: application/json X-RateLimit-Limit: 100 X-RateLimit-Remaining: 0 X-RateLimit-Reset: 1718880060 Retry-After: 23 {"error": "rate_limit_exceeded", "message": "100 requests per minute. Retry after 23 seconds."}

Header	Value	Purpose
`X-RateLimit-Limit`	The maximum allowed (e.g., 100)	Tells the client what the ceiling is
`X-RateLimit-Remaining`	Calls left in this window (0 on rejection)	Lets well-behaved clients self-throttle before hitting 429
`X-RateLimit-Reset`	Unix timestamp when the window resets	Client knows exactly when to retry; also send on every 2xx response
`Retry-After`	Seconds to wait (or an HTTP-date)	IETF standard; clients must respect this for 429 and 503

✅ Send rate limit headers on every response, not just 429

Proactively include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset on every successful 2xx response. A well-built client SDK reads these on every response and proactively slows down when Remaining drops near zero, avoiding the 429 entirely. Headers only on the error response means clients have already burned the quota before they know about it.

Client behavior: the right way to handle 429

# Client-side rate limit handling — pseudo-code
function api_call(endpoint, max_retries=5):
  for attempt in range(max_retries):
    response = HTTP_GET(endpoint)

    # On every response: track remaining quota
    remaining = response.headers['X-RateLimit-Remaining']
    reset_at  = response.headers['X-RateLimit-Reset']

    if response.status == 200:
      return response.body

    if response.status == 429:
      retry_after = response.headers.get('Retry-After', 60)
      wait_seconds = float(retry_after) + jitter()  # add jitter!
      LOG("Rate limited. Waiting " + wait_seconds + "s")
      SLEEP(wait_seconds)
      continue

    raise UnexpectedError(response.status)

  raise MaxRetriesExceeded()

⚠️ Never retry 429 without a wait — that's the thundering herd problem

When thousands of clients all receive 429 simultaneously and immediately retry, they arrive at the server as a synchronized wave — a thundering herd. Each retry triggers another 429, which triggers another immediate retry. Always add a wait (honoring Retry-After) plus jitter (a random additional delay) to desynchronize retries across the client population. Without jitter, exponential backoff still causes synchronized bursts at the backoff intervals.

🎯 Interview angle — "Design a rate limiter"

This is one of the most common system design interview questions. A complete answer covers: (1) clarify requirements — per-user or per-IP? which algorithm? what's the limit? (2) choose an algorithm — justify token bucket or sliding window counter for most cases; (3) distributed implementation — Redis with atomic Lua for shared state; (4) where to place it — gateway layer, not application; (5) the response contract — HTTP 429, Retry-After, X-RateLimit-* headers; (6) failure mode — what happens if Redis is down? (fail open vs. fail closed). Interviewers are looking for all six. Most candidates only cover the algorithm part and miss the distributed and failure-mode angles.

How to debug & inspect it

Rate limit bugs come in two flavours: clients hitting 429 they did not expect, and limits that are not enforcing at all (clients sending far more than the limit without being stopped). Both are diagnosable with a handful of curl commands and a careful read of the response headers.

Read the headers on every response

A well-behaved API sends X-RateLimit-* on every 2xx, not just on 429. The fastest diagnostic is to add -I or -v and read the state of the window before you hit the wall:

$ curl -sI https://api.example.com/v1/data \ -H "Authorization: Bearer sk_live_abc" HTTP/2 200 x-ratelimit-limit: 100 x-ratelimit-remaining: 73 x-ratelimit-reset: 1750003600 # Unix timestamp when window resets # Convert reset timestamp to human time: $ date -r 1750003600 Fri Jun 14 15:06:40 2025 # Hit the limit intentionally: $ for i in $(seq 1 110); do curl -so /dev/null -w "%{http_code} " \ -H "Authorization: Bearer sk_live_abc" \ https://api.example.com/v1/data; done 200 200 200 200 200 ... 429 429 429 429 429 # limit hit at request 101

Reproduce a 429 with a tight request loop

When debugging whether a client is the source of its own 429s, fire a controlled burst and read the Retry-After value exactly as a client would:

$ curl -sv -X GET https://api.example.com/v1/search \ -H "Authorization: Bearer sk_live_abc" \ 2>&1 | grep -E "(HTTP|x-ratelimit|retry-after)" < HTTP/2 429 < retry-after: 37 < x-ratelimit-limit: 60 < x-ratelimit-remaining: 0 < x-ratelimit-reset: 1750003637 # Wait 37 seconds (+ jitter), then retry. Never retry immediately.

Diagnose a distributed-counter race

If your API fleet uses a shared Redis counter but clients consistently hit more than the stated limit, the likely cause is a non-atomic read-modify-write instead of an atomic Lua script or INCR:

# Symptom: 3-node fleet, limit 100/min, but a single client can send ~285/min # Cause: each node uses a local counter, not shared Redis $ redis-cli GET "rl:user_42:28333" (nil) # key is missing — this node never wrote to Redis # Correct: shared Redis key should exist and have a value $ redis-cli GET "rl:user_42:28333" "87" # 87 requests counted across all nodes this window $ redis-cli TTL "rl:user_42:28333" 43 # 43 seconds until window resets — matches X-RateLimit-Reset

Diagnose clock skew

In a distributed fleet, windows computed from wall-clock time on different nodes will not align perfectly. The symptom is a fixed-window counter that resets a few milliseconds earlier on some nodes, allowing a small extra burst. Verify by comparing the Redis TIME against the application node's clock:

$ redis-cli TIME 1) "1750003560" # Unix seconds on Redis server 2) "123456" # microseconds $ date +%s 1750003561 # 1 second ahead — acceptable NTP drift 1750003575 # 15 seconds ahead — clock skew large enough to cause window misalignment # Fix: use Redis TIME inside the Lua script rather than passing # the application node's timestamp as ARGV.

Worked numeric token-bucket trace

Capacity B = 10 tokens, refill rate r = 2 tokens/second. Starting state: full bucket (10 tokens). Five events over 6 seconds:

t (s)	Event	Elapsed since last	Tokens added (r × elapsed)	Tokens before consume	Tokens after	Result
0.0	Request A	—	0	10.0	9.0	Allowed
0.2	Request B	0.2 s	0.4	9.4	8.4	Allowed
0.3	Burst: 9 reqs	0.1 s	0.2	8.6	0 (−8.6, rounds to 0)	8 allowed, 1 rejected (bucket empty)
2.8	Request C	2.5 s	5.0	5.0	4.0	Allowed (idle time refilled 5 tokens)
5.8	Request D	3.0 s	6.0	min(10, 4+6)=10	9.0	Allowed (capped at capacity)

Key observations: the burst at t=0.3 drained the bucket; the 2.5-second idle at t=2.8 earned 5 tokens back; capping at capacity means idle time never accumulates past the burst maximum, so a client cannot "save up" an unbounded burst by staying quiet for hours.

Symptom → cause → fix

Symptom	Likely cause	Fix
Client hits 429 far below the stated limit	Per-node counter not shared — each node enforces the limit independently, so the effective limit is `limit / nodes`	Switch to shared Redis counter with atomic Lua script
Limit seems to reset twice per window (e.g. at :00 and :30 in a 60-second window)	Clock skew between nodes: different nodes compute different window keys for the same real time	Use Redis `TIME` inside the Lua script for all window-key calculations
429 headers show `Retry-After: 0` or a past timestamp	Reset timestamp computed on application node whose clock is behind Redis/truth	Derive reset time from Redis `TIME` + window length
Limit not enforced at all — unlimited traffic passes	Redis is down and the code path fails open (returns "allowed" on error)	Decide on fail-open vs. fail-closed policy; log and alert on Redis errors regardless
Well-behaved client occasionally sees 429 spikes without sending a burst	Fixed-window boundary burst from a different client sharing the same Redis key (key is too coarse — e.g. by IP behind NAT)	Scope the key more narrowly (API key, user ID); or switch to sliding window counter to eliminate boundary bursts

On every 429, read Retry-After and wait that many seconds plus jitter before retrying — never retry immediately.
Add the rate-limit headers (X-RateLimit-Remaining, X-RateLimit-Reset) to your client wrapper so your code can self-throttle before hitting 429.
In staging, force-expire the Redis key and re-run the request to confirm the counter starts fresh: redis-cli DEL "rl:user_42:28333".
Verify atomicity: grep your Lua script or check that you use INCR (not GET + SET) — a non-atomic read-modify-write will under-count under concurrency.
Check that your key includes both the client identifier and the window key — a key that omits the window component never resets.

In production: how leading APIs do it

Rate limiting is not a configuration toggle — it is an architectural decision that shapes how clients build integrations and how support teams debug outages. Five platforms that handle billions of API calls daily each made different algorithm choices, and those choices are reflected in the headers they return and the client behaviours they encourage.

Company	Algorithm	Typical limit	Client signal
Stripe	Token bucket per account, backed by Redis; FOUR cooperating limiters in parallel (see deep-dive below)	~100 read req/s and ~100 write req/s for live-mode keys	`429` + `Retry-After` (seconds until the bucket refills enough for one request)
Shopify	Leaky bucket for REST (bucket of ~40 requests leaking at ~2 req/s); calculated query-cost limiter for GraphQL (each field has a cost weight; queries that exceed the remaining bucket cost are rejected)	REST: 40-unit bucket refilling at 2/s; GraphQL: 1,000-point bucket refilling at 50 points/s	`X-Shopify-Shop-Api-Call-Limit` (e.g. `32/40`) on REST; `X-GraphQL-Cost-Include-Fields` and throttle extensions in the GraphQL response body
GitHub	Fixed window (hourly) for primary limits; separate secondary/abuse heuristics for burst detection	5,000 req/hour for authenticated REST; secondary limits triggered by patterns (many writes in a short window, many concurrent requests)	`X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`, `X-RateLimit-Used` on every response; `Retry-After` on secondary-limit 429
AWS API Gateway	Token bucket (steady-state rate + burst capacity) at the stage and method level; usage plans layer per-key quotas (daily/monthly fixed windows) on top	Default: 10,000 req/s steady-state, 5,000 req burst; configurable per usage plan	`429 Too Many Requests`; no standard rate-limit headers by default (clients must implement exponential backoff; Lambda integrations can add headers manually)
HubSpot	Rolling per-account window (10-second bucket)	~100–190 req/10s depending on subscription tier; daily cap of 250,000–1,000,000 requests	`X-HubSpot-RateLimit-Remaining`, `X-HubSpot-RateLimit-Reset`, `X-HubSpot-RateLimit-Daily`, `X-HubSpot-RateLimit-Daily-Remaining`

Deep-dive: Stripe's four cooperating limiters

Stripe's publicly documented architecture runs not one but four rate limiters simultaneously on every inbound request. Each limiter targets a different failure mode, and a request is rejected if any single one of them trips.

The first is a request-rate limiter: a token bucket per account key that enforces the per-second ceiling most developers think of as "the" rate limit. The second is a concurrent-request limiter: it counts in-flight requests for a given key at any instant, regardless of rate. This catches the pattern where a client opens hundreds of connections simultaneously — each individually slow, but collectively exhausting connection pool capacity. The third is a fleet usage load shedder: when the overall fleet utilisation crosses a threshold, a portion of non-critical requests is shed to protect capacity for write operations and webhook delivery. This is not per-account — it is a global signal that fires during traffic spikes or partial outages. The fourth is a worker utilisation load shedder: individual compute workers monitor their own CPU and memory saturation and shed requests locally before they queue deeply. The practical consequence for integrators is that a well-behaved account observing the published req/s limits can still see a 429 during a platform-wide traffic event — and Retry-After is the correct response in all four cases.

This layered approach is worth understanding when designing your own API: a single rate limit rarely covers all the ways a client can overwhelm a system. Concurrent connections, burst depth, and fleet-level health are distinct axes.

How leading APIs do it

By the numbers: what Redis stores, the write, and the delay

Make it concrete. The limiter is capacity B = 100 tokens, refill r = 50 tokens/second (so a sustained 50 req/s, with bursts up to 100). Per account, Redis stores just two fields in a hash rl:{account}:

# rl:acct_42 → { tokens: 100.0, ts: 1718900000.000 } (ts = last refill time, seconds)

Every request runs this read → compute → write as ONE atomic Lua script (atomic so two app servers can't both spend the last token):

now = 1718900000.700 -- request arrives tokens, ts = HGET rl:acct_42 tokens ts -- READ → 100.0, 1718900000.000 elapsed = now - ts -- 0.700 s refilled = min(B, tokens + elapsed * r) -- min(100, 100 + 0.7*50) = 100 (capped) if refilled >= 1: -- ALLOW HSET rl:acct_42 tokens (refilled-1) ts now -- WRITE → 99.0, now return ALLOW, X-RateLimit-Remaining: 99 else: -- DENY retry_after = ceil((1 - refilled) / r) -- seconds until 1 token exists return 429, Retry-After: retry_after

The key formula is the delay a throttled client must wait. When the bucket holds t tokens (0 ≤ t < 1), the next token arrives in (1 − t) / r seconds, so:

Retry-After = ceil( (1 − tokens) / r ) seconds

Trace a burst that drains the bucket — B = 100, r = 50/s, starting full, a client fires 130 requests at t = 0 then keeps polling:

t (s)	Event	Tokens before	Decision	Tokens after	Client told
0.000	reqs 1–100 (burst)	100 → 1	all 100 ALLOW	0	200 OK
0.000	reqs 101–130	~0	30 DENY	0	429, Retry-After: 1
0.020	1 retry (too early)	0.02·50 = 1.0	1 ALLOW	0	200 OK
1.000	steady polling	refilled to 50 over 1 s	ALLOW ≤ 50/s	drains to 0	200 then 429

The decision math (sizing the bucket): capacity B is the maximum burst you tolerate; refill r is the sustained rate you guarantee. To let a client briefly burst to 2× their steady rate for up to 2 seconds, set B = r + (2r − r)·2 = 3r. And the worst-case extra load a full fleet of C clients can dump in one instant is C × B requests — size your backend so a synchronized burst (C × B) doesn't tip it over, or add the concurrency limiter and load shedders from the Stripe section below.

🧠 Quick check

1. A client sends 100 requests in the last 2 ms of minute 14:07 and 100 more in the first 2 ms of minute 14:08. Both sets are accepted. Which algorithm is in use?

This is the classic boundary burst problem of the fixed window counter. The window resets at the minute boundary, so both sets of 100 requests start fresh counters. Sliding window log would have caught this. Token bucket only passes bursts if there are tokens saved, not due to clock boundaries.

2. You need to protect a payment API endpoint. Accuracy is critical — you cannot allow even a 2× burst. Memory cost is not a concern (very few premium clients). Which algorithm fits best?

Sliding window log is the only algorithm with perfect accuracy — it stores every request's timestamp and evicts based on a real sliding window. For a low-volume, high-stakes endpoint like payment processing, the memory overhead is acceptable and precision is worth it.

3. In a distributed three-node API fleet, each node uses its own in-memory counter with a limit of 100 requests/min. What effective limit does a client experience?

Without a shared store, each node tracks the counter independently. If the load balancer distributes evenly, the client can send 100 req/min to each of the 3 nodes for a de-facto 300 req/min limit — 3× the intended cap. The load balancer does not enforce application-level quotas; it only routes TCP connections.

4. A server responds with HTTP 429 and a Retry-After: 30 header. What must a correctly implemented client do?

Retry-After is an instruction to the client: do not retry before this many seconds have elapsed. Immediately retrying a 429 is the textbook thundering herd failure mode. Jitter (random additional delay) prevents synchronized retries across the client population.

🏗️ Exercise 1 — Classic interview: design a rate limiter for an API (try before reading)

Scenario You are asked to design a rate limiter for a REST API that serves 500 million requests per day across a fleet of 20 application servers. Requirements: 1,000 requests/min per API key; bursts of up to 200 requests in a short window are acceptable; the solution must be correct under concurrent traffic from all 20 nodes.

Model answer:

Choose the algorithm: Token bucket with capacity 200 (allows the stated burst) and refill rate 1000/60 ≈ 16.7 tokens/sec. This handles bursty clients and is configurable per tier.
Shared store: Redis for the token counter. Use a Lua script that atomically reads {tokens, last_refill}, computes the refill based on elapsed time, checks if a token is available, and writes back. A single Lua script is an atomic unit in Redis — no race conditions.
Key schema: rl:{api_key} → hash with fields tokens (float) and last_refill (Unix ms). TTL set to 2× the refill-to-full time.
Where: At the API gateway layer, before requests reach application servers, so upstream services are protected and the limiter is a single enforcement point. Run the gateway in HA (multiple instances with health checks).
Response contract: 429 + Retry-After + X-RateLimit-* on every response including 2xx.
Redis failure mode: Decision point — fail open (allow traffic if Redis is unreachable, losing rate limit protection temporarily) or fail closed (return 503, protecting the API at the cost of availability). For a public monetized API, fail closed is safer.
Observability: Emit metrics for 429 rate by API key and endpoint; alert if a single key accounts for >10% of all traffic; log rejected requests with client info for abuse investigation.

Rubric: ✓ Algorithm choice with justification ✓ Distributed shared store ✓ Atomic operation (no race condition) ✓ Gateway placement ✓ HTTP response contract ✓ Redis failure mode decision ✓ Observability hook. Hitting 5+ = strong answer.

🔍 Exercise 2 — Pick the right algorithm for a scenario

Scenario Match each of the following use cases to the most appropriate rate limiting algorithm and briefly justify your choice.

A data export API allows customers to export their entire dataset. Jobs are CPU-intensive. You want to guarantee the backend is never hit with more than 5 concurrent export jobs per second regardless of how customers batch requests.
A login endpoint must allow no more than 10 failed attempts per user per 15 minutes. Security matters more than performance; the limit must be exact.
A mobile SDK calls a recommendations API. Users send bursts when they open the app (10 requests at once) then nothing for minutes. You want to allow those opening bursts but limit sustained throughput to 60 req/min.
A billing API counts API calls for invoicing purposes. You need a counter per customer per calendar month. A small boundary burst is acceptable; you need minimal memory.

Model answers:

Leaky bucket. You need smoothed output — no matter how requests arrive, the backend sees a constant 5/sec drain rate. Bursts are queued and forwarded at the controlled rate; overflow is dropped with 429.
Sliding window log. Security demands exact enforcement — the fixed window's boundary burst would allow 20 attempts in near-zero time (the classic account lockout bypass). Sliding window log counts precisely and the limit is low (10), so O(limit) memory is fine.
Token bucket. The 10-request opening burst earns token credit during idle time between app opens. The refill rate caps sustained throughput at 60/min (1 token/sec) while the bucket capacity of 10 allows the opening burst to drain immediately.
Fixed window counter. Monthly billing is the canonical use case for fixed windows — the billing period is itself a fixed window. A small boundary burst is explicitly acceptable, and O(1) memory per customer keeps costs down at scale.

Rubric: Each answer should name the algorithm and give one concrete reason tied to the scenario's constraints. Answers that justify by elimination (e.g., "sliding window log is too memory-heavy for monthly billing") are equally valid.

Key takeaways

Fixed window counter is cheapest but allows a 2× boundary burst — fine for coarse billing quotas, not for security-sensitive limits.
Sliding window log is perfectly accurate but stores O(limit) entries per client — best for low-volume, high-stakes endpoints.
Sliding window counter (weighted approximation of two fixed windows) gives near-perfect accuracy at O(1) memory — the general-purpose default for most APIs.
Token bucket allows controlled bursts up to the bucket capacity — ideal for bursty clients that earn burst credit through idle time.
Leaky bucket smooths output to a constant rate — it's a traffic shaper, not just an admission controller; adds latency via the queue.
In distributed systems, use a shared Redis store with atomic Lua scripts to avoid per-node over-counting. Use Redis server time to avoid clock skew.
Always return HTTP 429 with Retry-After, X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset — on every response, not just on rejection.
Clients must honor Retry-After and add jitter to prevent the thundering-herd retry storm.

Rate Limiting Algorithms

By the end you'll be able to

Why rate limiting is hard to get right

Algorithm 1 — Fixed window counter

How it works

Pros and cons

Algorithm 2 — Sliding window log

How it works

Pros and cons

Algorithm 3 — Sliding window counter

How it works

Pros and cons

Algorithm 4 — Token bucket

How it works

Pros and cons

Algorithm 5 — Leaky bucket

How it works

Pros and cons

Algorithm comparison

Where to enforce rate limits

Distributed rate limiting

Atomic operations: INCR and Lua scripts

Clock skew

The HTTP response contract for rate limits

Client behavior: the right way to handle 429

How to debug & inspect it

Read the headers on every response

Reproduce a 429 with a tight request loop

Diagnose a distributed-counter race

Diagnose clock skew

Worked numeric token-bucket trace

Symptom → cause → fix

In production: how leading APIs do it

Deep-dive: Stripe's four cooperating limiters

By the numbers: what Redis stores, the write, and the delay

🧠 Quick check

Key takeaways

Sources & further reading