API Design

Platform & API Product Engineering · Lesson 01

Nested & Tiered Rate Limits

A single inbound request does not face one rate limit — it faces a stack of them, evaluated simultaneously. A per-key burst limit, a per-app sustained rate, and a per-org daily quota all fire in parallel, and a request is rejected the moment any one of them trips. Getting this composition right is harder than any individual algorithm.

⏱ 28 min Difficulty: advanced Prereq: Rate Limiting Algorithms (rel-03)

By the end you'll be able to

Why a single rate limit is never enough

A developer integrating with your API might issue a burst of 200 requests in a second testing their new feature — that is a per-key burst concern. Their app, overall, sustains 80 requests per second across all keys — that is a per-app throughput concern. Their whole organization, across 20 apps, consumes 900,000 API calls today — that is a billing and capacity concern. These are three distinct failure modes at three distinct time scales. A single token bucket conflates them.

Think of it like airport security. Your boarding pass (API key) is checked at the gate — that is the burst check, closest to the individual. The airline's total load on a given aircraft (per-app rate) limits how many passengers they can board per minute regardless of individual keys. And the airport has a daily throughput limit before ground operations degrade (the org daily cap). Passing one check does not mean you pass the others.

Real platforms formalize this with a quota hierarchy: an organization (org) owns one or more apps; each app has one or more API keys. Limits cascade downward — an org's daily cap is shared across all its apps, each app's sustained rate is shared across all its keys, and each key has its own burst envelope.

The layered gate model

Visualize a request passing through three gates in sequence. All three gates evaluate simultaneously (in one atomic Redis operation, as you will see), but conceptually each answers a different question.

[Figure 1: Request (api_key=kA) → Gate 1, per-key burst (token bucket, rl:key:{kA}, e.g. 50 req/s burst) → Gate 2, per-app rate (token bucket, rl:app:{appId}, e.g. 100 req/s sustained) → Gate 3, per-org daily (fixed-window counter, quota:acct:{org}:{date}, e.g. 1M calls/day) → 200 OK. A failing gate returns 429 with scope key, app, or org. All three gates are evaluated in one atomic Lua script: pass all or reject at the first failure.]
Fig 1 — The three-gate quota model. A request passes per-key burst, per-app sustained rate, and per-org daily cap in a single atomic Redis evaluation. Any failing gate produces a 429 labelled with the scope of the limit that tripped.

The data model: three types of Redis state

Each layer uses a different data structure because each operates at a different time scale and with a different eviction strategy.

Layer 1 — Per-key token bucket (burst, per-second)

A Redis hash per API key holding two fields: the current token count and the timestamp of the last refill. This is the same lazy-refill token bucket from Lesson rel-03. The key uses the API key string directly so each key is isolated.

# Redis key schema: per-key burst bucket
rl:key:{api_key} → HASH { tokens: float, ts: unix_ms }

# Example: API key "kA_lv_abc123"
rl:key:kA_lv_abc123 → { tokens: 47.0, ts: 1720000000123 }

# A TTL auto-expires inactive keys. It must be at least the time to refill from
# empty (2× that time is a safe floor); this example uses a generous 24 hours.
TTL rl:key:kA_lv_abc123 → 86400 (seconds)
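The lazy-refill logic behind this hash can be sketched in a few lines of Python. This is an in-memory illustration only, not the production path: the `Bucket` class and the `refill` and `try_consume` helpers are hypothetical names standing in for the Redis hash and the script logic, and the rate is expressed in tokens per millisecond to match the `ts` field.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """In-memory stand-in for the Redis hash {tokens, ts}."""
    tokens: float
    ts_ms: int


def refill(bucket: Bucket, capacity: float, rate_per_ms: float, now_ms: int) -> float:
    """Return the token count after lazy refill, capped at capacity."""
    elapsed = max(0, now_ms - bucket.ts_ms)  # ignore a clock that moved backwards
    return min(capacity, bucket.tokens + elapsed * rate_per_ms)


def try_consume(bucket: Bucket, capacity: float, rate_per_ms: float, now_ms: int) -> bool:
    """Consume one token if available; state changes only on success."""
    tokens = refill(bucket, capacity, rate_per_ms, now_ms)
    if tokens < 1:
        return False
    bucket.tokens = tokens - 1
    bucket.ts_ms = now_ms
    return True
```

Nothing is written on a rejection, so the next call still measures elapsed time from the last successful consume and no refill credit is lost.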

Layer 2 — Per-app token bucket (sustained rate, also per-second)

An identical hash structure, but keyed on the app ID rather than the individual API key. All keys belonging to the same app share this single bucket. If app appX has 10 API keys and each key fires at 20 req/s, they collectively demand 200 req/s from a bucket that refills at 100 req/s, so roughly half of those requests are rejected at the app gate even though no single key is near its own burst limit.

# Per-app sustained bucket, shared across all keys of the app
rl:app:{app_id} → HASH { tokens: float, ts: unix_ms }
rl:app:appX_7f2e → { tokens: 12.3, ts: 1720000000456 }

# Multiple API keys of appX all write to this one hash

Layer 3 — Per-org daily quota counter (calendar day)

A plain Redis integer counter keyed on org ID and the calendar date. This is a fixed-window counter whose TTL is a deadline (the next UTC midnight, calculated from the current time) rather than a fixed duration from creation. The distinction matters: a key created at 23:00 should expire in about 1 hour, not 24 hours.

# Per-org daily quota: integer counter with midnight TTL
quota:acct:{org_id}:{YYYYMMDD} → INTEGER (call count today)
quota:acct:org_9k1:20240714 → 847392

# Computing seconds-until-midnight for the TTL (inside the Lua script):
local t = redis.call('TIME')                              -- {unix_seconds, microseconds}
local now_s = tonumber(t[1])
local secs_since_midnight = now_s % 86400
local secs_until_midnight = 86400 - secs_since_midnight   -- UTC midnight
-- Set TTL to secs_until_midnight + 300 (5-min buffer for clock skew)
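If the caller computes the key name and TTL instead of the script, the same arithmetic might look like the Python sketch below. The helper names are illustrative, and the 300-second buffer is the one mentioned in the comment above.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400
TTL_BUFFER_S = 300  # 5-minute buffer for clock skew, as in the text


def org_quota_key(org_id: str, now_s: int) -> str:
    """Build quota:acct:{org_id}:{YYYYMMDD} using the UTC calendar date."""
    day = datetime.fromtimestamp(now_s, tz=timezone.utc).strftime("%Y%m%d")
    return f"quota:acct:{org_id}:{day}"


def secs_until_utc_midnight(now_s: int) -> int:
    """Seconds from now_s to the next UTC midnight (86400 exactly at midnight)."""
    return SECONDS_PER_DAY - (now_s % SECONDS_PER_DAY)


def org_quota_ttl(now_s: int) -> int:
    """Deadline TTL for the daily counter, including the skew buffer."""
    return secs_until_utc_midnight(now_s) + TTL_BUFFER_S
```

For example, at 2024-07-14 23:00:00 UTC (Unix time 1720998000) the key is `quota:acct:org_9k1:20240714` and the TTL is 3,600 + 300 = 3,900 seconds.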
⚠️ Calendar day vs rolling 24 hours — they are not the same limit

A "daily" cap with a calendar-day window resets at a fixed clock time (usually UTC midnight or the org's billing timezone). A rolling 24-hour window never resets all at once: each call stops counting 24 hours after it was made, so the limit always covers the trailing 24 hours. The calendar window is simpler to explain to customers ("you get 1 million calls per day") and easier to reason about for billing. But it creates the same boundary-burst problem as a fixed-window rate limiter: an org that spends most of its cap late in the day gets a fresh cap at 00:00 UTC, so it can send up to 2× the daily cap in a short span straddling midnight, bounded only by the per-key and per-app rate limits. For billing quotas this is typically acceptable, and calendar windows are what HubSpot, Salesforce, and most SaaS platforms use. Understand the trade-off and document your choice in your API reference.

Timezone: UTC vs account timezone

The date key in quota:acct:{org}:{YYYYMMDD} should almost always use UTC for the date string, even if your platform shows "resets daily" to customers in their local timezone. Using account timezone for the Redis key creates a fragmented keyspace (the same org has a different daily key in different time zones) and makes the TTL calculation per-org rather than global. The practical approach: store quota in UTC, but display reset times to customers in their configured timezone.

Under the hood: the atomic multi-limit check

The composition problem has a non-obvious failure mode: if you check each limit in a separate Redis command, you can partially consume one limit before discovering another is exhausted. Imagine checking the key burst bucket first (it passes and you deduct a token), then checking the app bucket (it is empty, so you must reject). You have now consumed a token from the key bucket for a request that was never served. Repeated partial consumes slowly bleed quota from the layers checked first without the client receiving any successful responses.

The fix is to run all checks in a single Lua script. Redis executes the entire Lua script atomically — no other command runs between steps. The script reads all three counter states, checks all three limits, and only decrements any of them if all three pass.
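The difference between the two orderings is easy to see in a toy model. The sketch below is a simplification under stated assumptions: each layer is a plain usage counter with no refill, `state` and `limits` are hypothetical dicts, and Python itself provides no atomicity (in production, the Lua script does).

```python
SCOPES = ("key", "app", "org")


def sequential_check(state: dict, limits: dict) -> str:
    """Anti-pattern: charge each layer as soon as it passes."""
    for scope in SCOPES:
        if state[scope] >= limits[scope]:
            return scope      # rejected, but earlier layers were already charged
        state[scope] += 1
    return "ok"


def check_all_then_commit(state: dict, limits: dict) -> str:
    """Check every layer first; charge all of them only if all pass."""
    for scope in SCOPES:
        if state[scope] >= limits[scope]:
            return scope      # nothing has been written yet
    for scope in SCOPES:
        state[scope] += 1
    return "ok"
```

With the app layer already at its limit, `sequential_check` leaves the key counter one higher after the rejection, while `check_all_then_commit` leaves every counter untouched.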

[Figure 2: Lua script starts → READ key bucket, app bucket, org counter → Gate 1: key tokens ≥ 1? (NO: 429, X-RateLimit-Scope: key) → Gate 2: app tokens ≥ 1? (NO: 429, X-RateLimit-Scope: app) → Gate 3: org_count < daily_cap? (NO: 429, X-RateLimit-Scope: org) → WRITE: decrement key, app, org → return ALLOW]
Fig 2 — The atomic Lua check-all-then-commit flow. All three counter reads happen first; writes only occur when all three gates pass. No partial consumption is possible.

The Lua script: check-all-then-commit

-- Atomic multi-limit check: per-key burst + per-app sustained + per-org daily
-- KEYS[1] = rl:key:{api_key}
-- KEYS[2] = rl:app:{app_id}
-- KEYS[3] = quota:acct:{org_id}:{YYYYMMDD}
-- ARGV[1] = key_capacity (tokens), ARGV[2] = key_refill_rate (tokens/ms)
-- ARGV[3] = app_capacity,          ARGV[4] = app_refill_rate (tokens/ms)
-- ARGV[5] = org_daily_cap,         ARGV[6] = now_ms (supplied by the caller)
-- ARGV[7] = secs_until_midnight (for TTL on org key)

local key_bkt  = KEYS[1]
local app_bkt  = KEYS[2]
local org_ctr  = KEYS[3]

local now      = tonumber(ARGV[6])

-- ① READ all three states (no writes yet)
local kt  = redis.call('HMGET', key_bkt, 'tokens', 'ts')
local at  = redis.call('HMGET', app_bkt, 'tokens', 'ts')
local org = tonumber(redis.call('GET', org_ctr)) or 0

-- ② Compute token-bucket state for key layer
local k_tokens    = math.min(tonumber(ARGV[1]),
    (tonumber(kt[1]) or tonumber(ARGV[1])) +
    (now - (tonumber(kt[2]) or now)) * tonumber(ARGV[2]))

-- ③ Compute token-bucket state for app layer
local a_tokens    = math.min(tonumber(ARGV[3]),
    (tonumber(at[1]) or tonumber(ARGV[3])) +
    (now - (tonumber(at[2]) or now)) * tonumber(ARGV[4]))

-- ④ CHECK all three limits — NO writes yet
if k_tokens < 1 then
  return {0, 'key', k_tokens}   -- rejected at key layer
end
if a_tokens < 1 then
  return {0, 'app', a_tokens}   -- rejected at app layer
end
if org >= tonumber(ARGV[5]) then
  return {0, 'org', org}        -- rejected at org daily cap
end

-- ⑤ ALL checks passed — WRITE (commit all three decrements atomically)
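redis.call('HSET', key_bkt, 'tokens', k_tokens - 1, 'ts', now)
redis.call('HSET', app_bkt, 'tokens', a_tokens - 1, 'ts', now)
-- Expire idle buckets after 2× their full-refill time (in ms). An expired
-- bucket is read back as full, which is what it would have refilled to anyway.
redis.call('PEXPIRE', key_bkt, math.ceil(2 * tonumber(ARGV[1]) / tonumber(ARGV[2])))
redis.call('PEXPIRE', app_bkt, math.ceil(2 * tonumber(ARGV[3]) / tonumber(ARGV[4])))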
local new_org = redis.call('INCR', org_ctr)
if new_org == 1 then                         -- first call today: set TTL
  redis.call('EXPIRE', org_ctr, tonumber(ARGV[7]) + 300)
end

return {1, 'ok', k_tokens - 1, a_tokens - 1, tonumber(ARGV[5]) - new_org}
✅ Why the script returns the tripping layer

The script returns which scope rejected — 'key', 'app', or 'org'. The application layer translates this into the 429 response: the X-RateLimit-Scope header tells the client which limit tripped so they can act correctly. A key-level rejection means only that key's traffic should slow down. An org-level rejection means all traffic from that organization must pause until the daily counter resets. Without this distinction, clients cannot self-throttle intelligently.
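On the client side, the scope can drive which traffic gets paused. The following is a minimal sketch, assuming the client keeps its own map of org → app → keys; the `HIERARCHY` contents and the `keys_to_throttle` helper are hypothetical.

```python
# Hypothetical client-side view of the quota hierarchy: org -> app -> API keys.
HIERARCHY = {
    "org_9k1": {
        "appX": ["kA", "kB"],
        "appY": ["kC"],
    },
}


def keys_to_throttle(scope: str, api_key: str, hierarchy: dict) -> list:
    """Return every API key that should back off for a 429 with this scope."""
    for apps in hierarchy.values():
        for keys in apps.values():
            if api_key in keys:
                if scope == "key":
                    return [api_key]
                if scope == "app":
                    return list(keys)           # all keys share the app bucket
                if scope == "org":
                    return [k for ks in apps.values() for k in ks]
                raise ValueError(f"unknown scope: {scope}")
    raise KeyError(api_key)
```

A key-scope rejection for `kA` pauses only `kA`; an app-scope rejection pauses `kA` and `kB`; an org-scope rejection pauses all three keys.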

Calendar-day window: the TTL arithmetic

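The org daily counter's TTL is a deadline at UTC midnight, not N hours after creation. A fixed-TTL EXPIRE would let a key created at 18:00 linger until 18:00 the next day, well past its billing day. The correct approach is to compute the remaining seconds until the next UTC midnight and use that as the TTL. Because the key name already embeds the date, the next day's requests land on a fresh key either way; the deadline TTL keeps stale counters from accumulating, and it becomes essential for correctness if the key is not date-stamped.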

[Figure 3: timeline from 00:00 UTC (day N) to 00:00 UTC (day N+1), where all keys expire. A key created at 09:00 gets TTL = 54,000 s (15 h remaining); a key created at 18:00 gets TTL = 21,600 s (6 h remaining). Wrong: a fixed EXPIRE 86400 expires at 18:00 the next day, not at midnight.]
Fig 3 — Calendar-day TTL arithmetic. Keys created at 09:00 get a 15-hour TTL; keys created at 18:00 get a 6-hour TTL. Both expire at the same UTC midnight. A naive EXPIRE 86400 (24 h) would let a key persist into the next billing day, so stale counters outlive the window they describe.

Quota inheritance: org → app → key

The three limits are not independent figures pulled from three separate config files. They form an inheritance tree. An org has a total daily cap; each app is provisioned a fraction of the org's sustained rate; each key under an app gets a personal burst envelope. The numbers should be consistent: work out how quickly 20 apps running at full sustained rate would exhaust the org's daily cap, so that whichever limit binds first does so by design rather than by accident.

| Level | Limit type | Redis structure | Typical config location | Sharing model |
|---|---|---|---|---|
| API key | Burst (token bucket) | rl:key:{key} hash | Key metadata table | Isolated: not shared with other keys |
| App | Sustained rate (token bucket) | rl:app:{appId} hash | App config / plan tier | Shared across all keys of the app |
| Org | Daily quota (fixed window) | quota:acct:{orgId}:{date} integer | Billing / subscription record | Shared across all apps and keys of the org |

GraphQL query cost weights: a fourth dimension

REST endpoints are roughly uniform in cost: one request = one token consumed. GraphQL mutations and queries vary enormously — a deeply nested query that joins four resolver trees is worth far more than a single field fetch. Shopify and GitHub both expose this as a query cost model: each field has a cost weight, the resolver sums the weights before execution, and the rate limiter deducts that many tokens from the bucket rather than one. This is a fourth layer that applies only to the GraphQL surface.

# GraphQL cost-weighted token deduction (pseudo-code)
function graphql_cost(document):
  cost = 0
  for each field in document:
    cost += COST_TABLE.get(field, 1)         # default: 1 point per field
    if field is a list:
      cost += estimated_items × child_cost   # pagination multiplier
  return cost

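# Instead of deducting 1 token, deduct the query's cost.
# In the Lua script, pass the cost as an extra ARGV and replace the hardcoded 1
# in both the checks (tokens < cost) and the decrements (tokens - cost).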
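A runnable version of the pseudo-code might look like the sketch below. It assumes the query has already been parsed into a nested dict, where a list field carries a `first` page size and `children` holds sub-selections. That representation, the `COST_TABLE` weights, and the `query_cost` name are all illustrative; they are not any platform's actual cost model.

```python
# Hypothetical per-field weights; any field not listed costs 1 point.
COST_TABLE = {"orders": 2, "lineItems": 2}


def query_cost(fields: dict) -> int:
    """Sum field costs; a list field multiplies its children's cost by the page size."""
    total = 0
    for name, spec in fields.items():
        total += COST_TABLE.get(name, 1)
        child_cost = query_cost(spec.get("children", {}))
        page_size = spec.get("first")          # present only on list fields
        if page_size is not None:
            total += page_size * child_cost    # pagination multiplier
        else:
            total += child_cost
    return total


query = {
    "shop": {
        "children": {
            "name": {},
            "orders": {"first": 10, "children": {"id": {}, "total": {}}},
        }
    }
}
```

Here `query_cost(query)` is 24: 1 for `shop`, 1 for `name`, 2 for `orders`, plus 10 × 2 for the two child fields of each order.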

The 429 contract: which limit tripped?

Returning a bare 429 is not enough when multiple limits coexist. A client needs to know three things: which scope rejected them, how long to wait, and what the state of the other limits is. This informs how they should throttle — a key-level rejection might mean only slowing that one integration thread, while an org-level rejection means all API traffic from the organization must pause until the daily counter resets at midnight.

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 3600                      # seconds until the tripping limit resets

# Which scope tripped (one of: key | app | org):
X-RateLimit-Scope: org

# State of each layer at rejection time:
X-RateLimit-Key-Limit: 50
X-RateLimit-Key-Remaining: 38
X-RateLimit-App-Limit: 100
X-RateLimit-App-Remaining: 71
X-RateLimit-Org-Daily-Limit: 1000000
X-RateLimit-Org-Daily-Remaining: 0
X-RateLimit-Org-Reset: 1721001600      # Unix timestamp of next UTC midnight (2024-07-15T00:00:00Z)

{
  "error": "quota_exceeded",
  "scope": "org",
  "message": "Daily API quota for your organization has been reached. Resets at 2024-07-15T00:00:00Z.",
  "retry_after": 3600
}
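Assembling these headers in the application layer could look like the following sketch. The function name and the `(limit, remaining)` tuple convention are assumptions for illustration; the header names are the ones used in the example response above.

```python
def rate_limit_headers(scope: str, retry_after_s: int,
                       key: tuple, app: tuple, org: tuple,
                       org_reset_unix: int) -> dict:
    """Build 429 headers; key, app and org are (limit, remaining) pairs."""
    if scope not in ("key", "app", "org"):
        raise ValueError(f"unknown scope: {scope}")
    return {
        "Retry-After": str(retry_after_s),
        "X-RateLimit-Scope": scope,
        "X-RateLimit-Key-Limit": str(key[0]),
        "X-RateLimit-Key-Remaining": str(key[1]),
        "X-RateLimit-App-Limit": str(app[0]),
        "X-RateLimit-App-Remaining": str(app[1]),
        "X-RateLimit-Org-Daily-Limit": str(org[0]),
        "X-RateLimit-Org-Daily-Remaining": str(org[1]),
        "X-RateLimit-Org-Reset": str(org_reset_unix),
    }
```

Header values are strings because that is what most HTTP frameworks expect; the per-layer state comes from the script's reply plus the configured limits.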
🎯 Interview angle — composing multiple rate limits

"How would you design an API platform with per-key, per-app, and per-org rate limits?" Interviewers are probing whether you recognize the partial-consume problem. Lead with: "Three independent Redis keys; one atomic Lua script that reads all three, checks all three, and only writes if all pass — returning which scope blocked." Then add: "The 429 response must name the scope so clients can route throttling correctly." Candidates who propose three sequential Redis checks (with separate INCR calls) have the logic right but are missing the race condition — point it out yourself to show you understand the subtlety.

By the numbers

The scenario

An organization has 20 apps, each configured for 100 req/s sustained and a per-key burst of 50 req/s. The org daily cap is 1,000,000 calls/day. All apps run their workers continuously at full tilt from 08:00 UTC.

The governing formula: which limit binds first?

At full sustained rate, the aggregate throughput across all apps is:

aggregate_rate  = num_apps × app_sustained_rate
                = 20 × 100 req/s
                = 2,000 req/s                    (modeled)

At that rate, the org's daily cap is exhausted in:

time_to_cap     = org_daily_cap / aggregate_rate
                = 1,000,000 / 2,000
                = 500 s ≈ 8 min 20 s              (modeled)

The daily cap is the binding constraint — it trips after about 8 minutes and 20 seconds of all apps running at full speed. From that point every request receives 429 scope: org and must wait until midnight UTC for the counter to reset.

Timestamped trace: which limit fires and when

| t (min:s) | Event | Key tokens | App tokens | Org calls used | Decision |
|---|---|---|---|---|---|
| 0:00 | All 20 apps start firing at 100 req/s each (2,000 req/s total) | 50 (full) | 100 (full) | 0 | ALLOW · all gates pass |
| 1:00 | First 120,000 calls consumed (2,000/s × 60 s) | ~50 (refilling continuously) | ~100 | 120,000 | ALLOW · org gate still open |
| 4:10 | 500,000 calls consumed | normal | normal | 500,000 | ALLOW · halfway to org cap |
| 8:20 | 1,000,000 calls consumed; org cap exhausted | normal | normal | 1,000,000 | 429 · scope: org |
| 8:20 → next 00:00 UTC | All requests from this org rejected at org gate | fills up (no consumption) | fills up | 1,000,000 (frozen) | 429 · Retry-After: 86400 − (now % 86400) |
| next 00:00 UTC | Org counter TTL expires; Redis key deleted | normal | normal | 0 (new day) | ALLOW again |

Sizing the daily cap correctly

Work backwards from the business intent. If the goal is "20 apps can each run at full speed for an 8-hour business day," the needed daily cap is:

needed_cap = num_apps × app_rate × business_hours × 3600
           = 20 × 100 × 8 × 3600
           = 57,600,000 calls/day                (modeled)

At 1,000,000/day the org is undersized by a factor of 57. In practice, platforms set the daily cap at a level that prevents runaway loops and billing abuse, not at the theoretical maximum burst. They also tier it: a free org might get 10,000/day while an enterprise org gets 50,000,000/day.
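Both formulas in this section are simple enough to keep as helpers next to your plan configuration. This is a small sketch with illustrative function names, using the scenario's numbers.

```python
def time_to_cap_s(num_apps: int, app_rate: float, org_daily_cap: int) -> float:
    """Seconds until the org daily cap is exhausted at full aggregate rate."""
    return org_daily_cap / (num_apps * app_rate)


def needed_daily_cap(num_apps: int, app_rate: int, business_hours: int) -> int:
    """Daily cap required for every app to run at full rate for the given hours."""
    return num_apps * app_rate * business_hours * 3600


seconds = time_to_cap_s(20, 100, 1_000_000)   # 500.0 s, i.e. 8 min 20 s
cap = needed_daily_cap(20, 100, 8)            # 57,600,000 calls/day
shortfall = cap / 1_000_000                   # 57.6
```

Running the same two functions for each plan tier shows quickly whether the daily cap or the sustained rate is the limit a customer will actually hit.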

Retry-After math for org-level rejections

retry_after = seconds_per_day − secs_since_00:00_UTC
            = (24 × 3600) − (current_unix % 86400)
            = 86400 − (now % 86400)              (formula)

# At 08:20 UTC (30,000 s into the day):
retry_after = 86400 − 30000 = 56,400 s ≈ 15.7 hours
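Retry-After differs by scope: an org-level rejection waits for the calendar boundary, while a key or app rejection only waits for the bucket to regain one token. A sketch of both follows, with hypothetical helper names and the bucket rate given in tokens per second.

```python
import math


def org_retry_after(now_s: int) -> int:
    """Seconds until the next UTC midnight, when the daily counter resets."""
    return 86_400 - (now_s % 86_400)


def bucket_retry_after(tokens: float, rate_per_s: float) -> int:
    """Whole seconds until a token bucket holds at least one token."""
    return max(0, math.ceil((1 - tokens) / rate_per_s))
```

At 30,000 seconds into the UTC day, `org_retry_after` gives 56,400 seconds, matching the worked figure above. An empty bucket refilling at 50 tokens per second gives a one-second wait, because Retry-After is expressed in whole seconds.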

Trade-offs: design decisions you will face

| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Daily window type | Calendar day (UTC midnight): simple, matches billing cycles | Rolling 24 hours: no boundary burst; each call ages out 24 h after it was made | Calendar day for billing quotas (customers expect it); rolling 24 h for fairness-sensitive API surfaces |
| Multi-limit check: check-then-commit vs commit-then-rollback | Check-all-then-commit (Lua): atomic; no partial consume; correct | Commit-then-rollback: decrement one counter, then check the next; roll back on failure | Always check-all-then-commit. Rollbacks add complexity, race conditions, and can leave counters in inconsistent states under failure |
| Timezone for daily window | UTC for all orgs: single keyspace, simple TTL math | Account timezone: resets at local midnight; feels natural to customers | Store and compute in UTC; translate to account timezone only for display. Mixing timezones in Redis keys creates correctness bugs |
| Limit granularity | Three-layer (key/app/org): precise isolation; correct scoped 429 | Single global per-org limit: simple; one Redis key per org | Three layers if you have multiple apps per org; single layer is sufficient for single-app integrations |
| Cost-weighted counting (GraphQL) | Query cost model: accurate cost attribution; incentivizes efficient queries | Flat count: simple; one token per HTTP request regardless of complexity | Use query cost for GraphQL APIs where query complexity varies significantly; flat count is correct for REST |

How real platforms do it

Every major API platform publishes its rate limit model in its developer documentation. The patterns here are drawn from those public references.

| Platform | Per-request window | Per-app window | Daily cap | GraphQL cost? | Scope in 429? |
|---|---|---|---|---|---|
| HubSpot | No per-key burst separate from app | 10-second sliding window, ~100–190 req / 10 s depending on tier | 250,000–1,000,000 req/day per portal (org); headers: X-HubSpot-RateLimit-Daily-Remaining | No (REST only) | Partial: X-HubSpot-RateLimit-Daily vs burst differ in response body message |
| GitHub | No per-key burst layer | 5,000 req/hour (primary, authenticated REST) | No explicit daily cap separate from hourly; secondary limits apply on burst patterns | Yes: GitHub GraphQL exposes a point cost per query with a 5,000-point/hour budget | Yes: X-RateLimit-Remaining goes to 0; secondary 429 includes Retry-After |
| Shopify | No per-key layer (one key per store) | REST: leaky bucket 40 units, refill 2/s; GraphQL: 1,000-point bucket, refill 50 pts/s | No hard daily cap on REST; GraphQL cost resets as bucket refills | Yes: each GraphQL field has a cost weight; mutations cost more than queries | Yes: REST returns X-Shopify-Shop-Api-Call-Limit: 32/40; GraphQL returns cost data in extensions |
| Stripe | Token bucket per API key (live/test mode separate); ~100 read req/s, 100 write req/s | Concurrent-request limiter (not rate-per-second) as second layer; fleet-wide load shedder as third | No published daily cap; abuse detected via anomaly heuristics | No (REST only) | Yes: 429 response body distinguishes per-account rate from load-shedding events |
✅ Link to related lessons

The single-layer token-bucket algorithm behind all these platforms is covered in detail in rel-03 Rate Limiting Algorithms. For an interactive demo of per-key isolation, open the rate limiter simulator (sim-01) and add a second tenant to observe noisy-neighbour behaviour.

⚠️ The thundering-herd org reset

When an org's daily cap resets at UTC midnight, every paused client resumes simultaneously. If the org has 20 apps each backed-up with queued requests, the burst into the platform at 00:00:00 UTC can dwarf the sustained rate. Mitigate with: (1) staggered retry-after values — add a random 0–300 second jitter to the exact midnight reset; (2) a startup burst cap on the org counter that phases in over the first 60 seconds of a new day; (3) monitoring midnight UTC on your dashboards as a predictable spike.
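Mitigation (1) can be sketched as a small server-side helper that adds jitter to the Retry-After value before it is sent. The function name and the injectable `rng` parameter are illustrative; the 0–300 second range is the one suggested above.

```python
import random


def jittered_retry_after(secs_until_reset: int, max_jitter_s: int = 300,
                         rng=None) -> int:
    """Add a uniform random delay so paused clients do not all resume at once."""
    rng = rng or random.Random()
    return secs_until_reset + rng.randint(0, max_jitter_s)
```

Each rejected client then sees a reset time somewhere in the five minutes after midnight instead of exactly at 00:00:00 UTC. Passing a seeded `random.Random` makes the behaviour reproducible in tests.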

🧠 Quick check

1. A request passes the per-key burst check (key tokens = 14) and the per-app sustained rate check (app tokens = 37), then hits the per-org daily cap. The Lua script has already decremented both the key and app token buckets. What is the correct behaviour?

Check-all-then-commit means the Lua script reads all three states, checks all three conditions, and only writes (decrements) if all three pass. If the org cap fails, no write to any layer occurs. The described scenario — partial decrements followed by an org rejection — is the commit-then-check anti-pattern. Using a Lua script makes this impossible because the script is atomic.

2. An org daily cap uses a calendar-day window with TTL set to expire at UTC midnight. The org's first API call today arrives at 23:55 UTC. What TTL should the Redis key receive?

The TTL must be the remaining seconds until the next UTC midnight from the moment the key is created. A call at 23:55 UTC gets roughly 300 seconds until 00:00 (plus the 5-minute buffer if you use one). A fixed 86,400-second TTL would keep this key alive until 23:55 the following day, almost a full day past the window it counts; if the key were not date-stamped, the stale count would also carry over into the next day's quota.

3. A 429 response includes the header X-RateLimit-Scope: app. Which traffic pattern should the client use to recover?

An app-scope rejection means the per-app sustained-rate bucket is empty. All keys belonging to that app share the same rl:app:{appId} bucket — so any of them consuming a token will drain it further. The correct response is to throttle all keys of that app until the bucket refills. Switching to another key of the same app just hits the same exhausted bucket.

4. An org runs 20 apps, each allowed 100 req/s sustained. The org daily cap is 1,000,000 calls. All 20 apps fire at full speed from 08:00 UTC. At what time does the org hit its daily cap?

Aggregate rate = 20 apps × 100 req/s = 2,000 req/s. Time to exhaust 1,000,000 calls = 1,000,000 / 2,000 = 500 seconds = 8 minutes and 20 seconds after 08:00 UTC, so the cap trips at 08:08:20 UTC. This illustrates why the daily cap is often the binding constraint for large organizations, not the per-key burst limit.

🏗️ Exercise — Design the quota system for a multi-tenant SaaS API
Scenario You are building the rate limiting layer for a SaaS platform. Requirements: each org has a daily call quota enforced at UTC midnight. Each org has multiple apps; each app has a sustained req/s limit. Individual API keys have a burst limit. The 429 response must tell the client which limit tripped. The system must handle 500 concurrent apps without partial-consume bugs.

Model answer:

  1. Data model: Three Redis structures per request. (1) rl:key:{api_key} hash with tokens and ts for the per-key token bucket. (2) rl:app:{app_id} hash with same fields for per-app bucket. (3) quota:acct:{org_id}:{YYYYMMDD} integer counter with a TTL calculated as 86400 − (now_unix % 86400) seconds (plus a 5-minute buffer).
  2. Atomic multi-limit check: A single Redis Lua script. Read all three states with HMGET and GET. Compute refilled token counts for both buckets. Check all three limits (key tokens ≥ 1, app tokens ≥ 1, org count < daily_cap). If all pass, write all three (HSET key bucket, HSET app bucket, INCR org counter). Return the tripping scope on failure, the remaining quotas on success.
  3. 429 response: Include X-RateLimit-Scope (key / app / org), Retry-After, per-layer remaining headers, and the reset timestamp for the tripping layer. For org-level: Retry-After = 86400 − (now_unix % 86400). For key/app: Retry-After = ceil((1 − tokens) / refill_rate), with refill_rate in tokens per second so that the result is in seconds.
  4. Configuration resolution: Before calling the Lua script, look up the key's burst params (from key metadata store), the app's sustained rate (from app config), and the org's daily cap (from subscription record). Pass these as Lua ARGV to avoid hardcoding limits in the script.
  5. Observability: Emit a metric tagged with scope (key/app/org) for each 429. Alert if org-level rejections start before noon UTC — it means the org's daily cap is undersized for their usage pattern.
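Step 4 of the model answer, configuration resolution, can be sketched as a function that turns resolved limits into the KEYS and ARGV lists the Lua script expects. The `Limits` dataclass and `build_script_args` are hypothetical names; rates are stored per second here and converted to tokens per millisecond because that is the unit the script's ARGV uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Limits:
    """Hypothetical container for limits resolved from key, app and org config."""
    key_capacity: int
    key_rate_per_s: float
    app_capacity: int
    app_rate_per_s: float
    org_daily_cap: int


def build_script_args(api_key: str, app_id: str, org_id: str,
                      limits: Limits, now_ms: int):
    """Return (keys, argv) in the order documented at the top of the Lua script."""
    now_s = now_ms // 1000
    day = datetime.fromtimestamp(now_s, tz=timezone.utc).strftime("%Y%m%d")
    keys = [
        f"rl:key:{api_key}",
        f"rl:app:{app_id}",
        f"quota:acct:{org_id}:{day}",
    ]
    argv = [
        limits.key_capacity, limits.key_rate_per_s / 1000,   # tokens per ms
        limits.app_capacity, limits.app_rate_per_s / 1000,   # tokens per ms
        limits.org_daily_cap,
        now_ms,
        86_400 - (now_s % 86_400),                           # secs until UTC midnight
    ]
    return keys, argv
```

Keeping this logic outside the script means plan changes only touch configuration, and the function can be unit-tested without a Redis instance.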

Rubric: ✓ Three distinct Redis structures named ✓ Atomic Lua check-all-then-commit (not three sequential INCR calls) ✓ Calendar-day TTL arithmetic (not fixed 86400 TTL) ✓ Scoped 429 headers ✓ Configuration lookup before script call ✓ Observability hook. Five or more = strong answer. The most common omission is the TTL arithmetic and the scope header.

Key takeaways

Sources & further reading