Production at Scale · Walkthrough 02

Payments at 100M/day (Stripe-style, modeled)

Payments are the system design problem where "mostly correct" is not acceptable. At 100M charges per day, every architectural decision must preserve the invariant that money never disappears and is never double-counted — even when servers crash, networks partition, and retries storm the database.

⏱ 20 min · Difficulty: advanced · Prereq: Scaling to 100M req/day, Idempotency

By the end you'll be able to

Size an idempotency-key store for 100M charges/day and explain why it must be durable, not cached.
Describe the charge state machine and identify the transitions where double-charge risk is highest.
Explain why payments demand strong consistency, name the four rate-limiter layers, and work through webhook fan-out queue math.

The baseline arithmetic

We carry the same numbers over from Walkthrough 01 and examine them through a payments-specific lens. The request rate is the same; what changes is the correctness constraint on every single one of those requests.

⚠️ Modeled, not measured — all figures in this lesson

This walkthrough models a hypothetical Stripe-style payments system from first principles. Stripe's actual internal architecture, throughput numbers, storage estimates, and operational parameters are proprietary and not public. Every figure below is derived from publicly available data (memory costs, API docs, the Stripe engineering blog) and labeled as an illustrative model. Do not treat any number here as a benchmark for Stripe or any other payment processor.

-- Baseline rates --
100,000,000 charges/day
  ÷ 86,400 s/day
  = 1,157 charges/s average  ≈ 1,160/s

Peak (3–5× average for payment spikes — e-commerce flash sales):
  1,160 × 5 = 5,800 charges/s peak  ⚠️ illustrative

-- Breakdown of API calls per charge (illustrative) --
Each charge attempt may involve:
  1 × POST /v1/charges     (or PaymentIntents equivalent)
  1 × GET  /v1/charges/:id  (polling for status)
  ~0.3 retries on average   (network errors, idempotent retries)
Total API calls ≈ 100M × 2.3 = ~230M API calls/day  ⚠️ illustrative
  = 2,660 API calls/s avg, ~13,000/s peak
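
The baseline arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the illustrative inputs from this section; the variable and function names are chosen here for readability and are not part of any real API.

```python
SECONDS_PER_DAY = 86_400

def per_second(per_day: float) -> float:
    """Convert a daily count into an average per-second rate."""
    return per_day / SECONDS_PER_DAY

# Inputs taken from the model above (all illustrative).
charges_per_day = 100_000_000
peak_multiplier = 5                 # upper end of the 3-5x range
calls_per_charge = 1 + 1 + 0.3      # POST + GET poll + average retries

avg_charges = per_second(charges_per_day)
peak_charges = avg_charges * peak_multiplier
avg_api_calls = per_second(charges_per_day * calls_per_charge)
peak_api_calls = avg_api_calls * peak_multiplier

print(f"charges/s:   avg {avg_charges:,.0f}, peak {peak_charges:,.0f}")
print(f"API calls/s: avg {avg_api_calls:,.0f}, peak {peak_api_calls:,.0f}")
```

The script does not round intermediate values, so its output differs slightly from the rounded figures in the text (which round 1,157 up to 1,160 before multiplying).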

Correctness at scale: why payments are not just another API

A typical read-heavy API can tolerate occasional cache staleness — a user's follower count being 15 seconds old is imperceptible. A payments API cannot. Consider what happens if you relax correctness:

Duplicate charges: A retry after a network timeout without idempotency protection charges the customer twice. At 100M charges/day with ~0.3% network-level retries (illustrative), that is ~300,000 potential duplicates per day — a catastrophic customer-trust and regulatory problem.
Stale balance reads: If an authorization check reads a merchant's balance from a lagging replica that reports $5,000 when the actual balance on the primary is $0, it can approve a fraudulent charge or one that leaves the balance overdrawn.
Lost writes: If a charge write is accepted by the app tier but lost before reaching the DB (crash, network partition), the customer was charged (the card network processed it) but the merchant has no record of receiving money. This is a financial reconciliation disaster.

These failure modes define the architectural constraints: idempotency for every mutation, strong consistency for balance reads and writes, and durable state for every charge record before confirming to the caller.

The charge state machine

A charge is not a single atomic event. It is a progression through states, and some transitions are points of no return. The state machine defines where you can safely retry and where you cannot.

The charge state machine. The PROCESSING window is where double-charge risk peaks: the card network has received the authorization request, but the response has not yet been written back to the database. A crash or timeout here requires careful reconciliation — not a naive retry.

The critical insight is the PROCESSING window. Once the authorization request has been sent to the card network, you cannot simply retry it — the network may have already approved and charged the card. Retrying would create a second charge. Payments systems handle this by writing the PROCESSING state to the database before sending to the network, so that on crash-recovery the system knows to query the network for the outcome rather than re-sending.
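
The retry-versus-reconcile rule can be sketched as a small state machine. This is an illustration under stated assumptions, not a production design: the state names follow the ones used in this lesson (PENDING, PROCESSING, SUCCEEDED, FAILED), while `transition` and `recovery_action` are hypothetical helpers introduced here.

```python
from enum import Enum

class ChargeState(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

# Transitions named in this lesson; terminal states allow none.
ALLOWED = {
    ChargeState.PENDING: {ChargeState.PROCESSING},
    ChargeState.PROCESSING: {ChargeState.SUCCEEDED, ChargeState.FAILED},
    ChargeState.SUCCEEDED: set(),
    ChargeState.FAILED: set(),
}

def transition(current: ChargeState, new: ChargeState) -> ChargeState:
    """Return the new state, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new

def recovery_action(state: ChargeState) -> str:
    """What a crash-recovery job should do with a charge found in `state`."""
    if state is ChargeState.PENDING:
        return "retry"       # nothing has been sent to the card network yet
    if state is ChargeState.PROCESSING:
        return "reconcile"   # ask the network for the outcome; never re-send
    return "none"            # terminal states need no recovery

print(recovery_action(ChargeState.PROCESSING))  # reconcile
```

The key design choice is that the PROCESSING state is persisted before the network call, so a recovery job that finds a charge in that state knows a request may already be in flight.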

Idempotency keys: the cornerstone of safe retries

An idempotency key is a caller-supplied identifier that the server uses to deduplicate requests. If the same key arrives twice, the server returns the stored result of the first request — no second charge. This is the mechanism that makes retries safe. See Lesson rel-02 for the full pattern.
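
A minimal sketch of the dedup check follows. It assumes an in-memory dictionary as a stand-in for the durable store, and the class and exception names (`IdempotencyStore`, `IdempotencyConflict`, `RequestInProgress`) are hypothetical. It is single-threaded; a real store would need an atomic insert-if-absent.

```python
import hashlib
import json

class IdempotencyConflict(Exception):
    """The same key was reused with a different request body."""

class RequestInProgress(Exception):
    """The first request with this key has not finished yet."""

class IdempotencyStore:
    """In-memory stand-in; a real store must be durable."""

    def __init__(self):
        self._records = {}

    def execute(self, key, request, handler):
        body_hash = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode()
        ).hexdigest()
        record = self._records.get(key)
        if record is not None:
            if record["request_hash"] != body_hash:
                raise IdempotencyConflict(key)
            if not record["done"]:
                # A crashed first attempt also lands here; that case needs
                # reconciliation rather than a blind re-run.
                raise RequestInProgress(key)
            return record["response"]  # replay the stored result
        # Record the key on arrival, before doing any work.
        self._records[key] = {"request_hash": body_hash, "done": False, "response": None}
        response = handler(request)
        self._records[key].update(done=True, response=response)
        return response

store = IdempotencyStore()
charge = lambda req: {"status": "succeeded", "amount": req["amount"]}
first = store.execute("key-123", {"amount": 2000}, charge)
retry = store.execute("key-123", {"amount": 2000}, charge)
print(first == retry)  # True: the retry replays the stored response
```

Storing a hash of the request body lets the server reject a key that is reused with different parameters instead of silently replaying an unrelated response.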

Idempotency key store sizing

-- Key store sizing (illustrative) --
⚠️  Modeled, not measured

Volume: 100,000,000 charges/day
TTL:    24 hours (typical; some systems use 7 days)

-- Per-key storage estimate --
Key components (illustrative):
  idempotency_key string: ~64 bytes  (UUID or caller-supplied, URL-safe)
  request_hash:           ~32 bytes  (SHA-256 of request body for safety check)
  response_body:          ~800 bytes (serialized charge object JSON)
  metadata:               ~104 bytes (timestamps, status, pointers)
Total per key:            ~1 KB

-- Total storage at any point in time --
Keys live for 24 h TTL: 100M keys × 1 KB = 100 GB  ⚠️ illustrative

If TTL extends to 7 days: 700M keys × 1 KB = 700 GB

-- Lookup throughput required --
Every incoming POST requires 1 idempotency lookup; sizing for all API calls
  (GET polls included) gives a conservative upper bound:
  13,000 API calls/s peak → at most 13,000 key lookups/s
  Redis handles ~100,000–500,000 ops/s on commodity hardware ✓
  But: keys must be in DURABLE storage (Redis + AOF or a DB), not
  ephemeral cache — a cache eviction that loses a key could re-enable
  a duplicate charge.
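
The sizing above generalizes to any TTL. The short sketch below recomputes it from the per-key components listed in this section; the function name is hypothetical and the byte counts are the same illustrative estimates.

```python
def key_store_bytes(charges_per_day: int, ttl_days: int, bytes_per_key: int) -> int:
    """Steady-state size: one live key per charge for the whole TTL window."""
    return charges_per_day * ttl_days * bytes_per_key

# Per-key components from the estimate above (illustrative).
BYTES_PER_KEY = 64 + 32 + 800 + 104  # key + request hash + response + metadata

size_24h_gb = key_store_bytes(100_000_000, 1, BYTES_PER_KEY) / 1e9
size_7d_gb = key_store_bytes(100_000_000, 7, BYTES_PER_KEY) / 1e9

print(f"24 h TTL: {size_24h_gb:.0f} GB")  # 100 GB
print(f"7 d TTL:  {size_7d_gb:.0f} GB")   # 700 GB
```

This counts payload bytes only; per-key overhead in the chosen store and replication copies would add to the total.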

⚠️ Never store idempotency keys in a pure LRU cache

An LRU cache may evict a key under memory pressure — and an evicted key is invisible to the dedup check, so the next retry creates a second charge. The idempotency store must be durable: either a database, or Redis with append-only-file (AOF) persistence and replication. The ~100 GB (at 24h TTL, illustrative) is entirely feasible on modern Redis — but it must survive a server restart.

Strong consistency: why eventual reads are prohibited on balances

In Walkthrough 01, we observed that read replicas with their replication lag are acceptable for most workloads. Payments are the exception. Consider a merchant balance read for an authorization decision:

Scenario

Merchant balance on DB primary: $0 (last payout drained it).
A replication-lagged replica still shows: $5,000 (the previous day's balance).
An authorization check reads from the replica → approves a $200 charge.
Result: merchant is overdrawn; payment processor absorbs the loss.

The fix is simple to state and expensive to implement: every balance read that informs an authorization decision must go to the primary. This is why the bottleneck in a payments system is almost always the database primary write path, not the application tier. The app servers are stateless and scale horizontally; the DB primary does not.
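
The read-then-decide step can be sketched as a single transaction on the primary. This example uses an in-memory SQLite database purely as a stand-in; SQLite has no `SELECT FOR UPDATE`, so `BEGIN IMMEDIATE` is used here to take the write lock before reading. The table layout and the `authorize_debit` function are assumptions made for illustration.

```python
import sqlite3

def open_primary() -> sqlite3.Connection:
    """In-memory SQLite standing in for the DB primary (assumption)."""
    conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
    conn.execute(
        "CREATE TABLE balances (merchant_id TEXT PRIMARY KEY, cents INTEGER NOT NULL)"
    )
    return conn

def authorize_debit(conn: sqlite3.Connection, merchant_id: str, amount_cents: int) -> bool:
    """Read the balance and apply the debit inside one locked transaction."""
    conn.execute("BEGIN IMMEDIATE")  # lock first, so the read cannot go stale
    try:
        row = conn.execute(
            "SELECT cents FROM balances WHERE merchant_id = ?", (merchant_id,)
        ).fetchone()
        approved = row is not None and row[0] >= amount_cents
        if approved:
            conn.execute(
                "UPDATE balances SET cents = cents - ? WHERE merchant_id = ?",
                (amount_cents, merchant_id),
            )
        conn.execute("COMMIT")
        return approved
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise

primary = open_primary()
primary.execute("INSERT INTO balances VALUES ('m_1', 0)")  # payout drained it
print(authorize_debit(primary, "m_1", 20_000))  # False: $200 is declined
```

Because the read and the write share one transaction on the authoritative copy, the decision cannot be based on the stale $5,000 figure from the scenario.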

-- Primary DB load from payments (illustrative) --
⚠️  Modeled, not measured

Per charge, primary DB operations (illustrative):
  1 × idempotency key write     (INSERT on arrival)
  1 × charge record INSERT      (PENDING state)
  1 × balance read              (SELECT FOR UPDATE — strong read)
  1 × charge state UPDATE       (PENDING → PROCESSING)
  1 × charge state UPDATE       (PROCESSING → SUCCEEDED/FAILED)
  1 × balance UPDATE            (credit or rollback)
  1 × idempotency key UPDATE    (store response for dedup)
  ~7 primary DB operations per charge (illustrative)

At 1,160 charges/s avg:
  1,160 × 7 = 8,120 primary DB ops/s (illustrative)

At 5,800 charges/s peak:
  5,800 × 7 = 40,600 primary DB ops/s (illustrative)
  → Well into the territory where vertical scaling alone is insufficient.
  → Requires: fast NVMe SSD, connection pooling (PgBouncer), and
    possibly partitioning by merchant_id to reduce lock contention.

🎯 Interview angle

"Where is the bottleneck in a payments system at scale?" The answer interviewers want is the database primary, not the app tier. The app tier is stateless and scales horizontally with little effort. The DB primary is a single node that must absorb every write, and payments prohibit routing reads to replicas for authorization decisions. The follow-up question is almost always "how do you handle that?", and the correct answer involves a combination of connection pooling, vertical scaling of the primary, ledger partitioning, and eventually a custom ledger service rather than a general-purpose RDBMS.

Four cooperating rate limiters

A payments API faces a distinctive rate-limiting challenge: some callers are legitimate merchants with genuinely high volume; others are fraudulent actors probing for valid card numbers. The Stripe rate-limiter engineering blog post describes four layered controls (two rate limiters and two load shedders) that together handle both concerns. The four layers below are a model inspired by that post; the layer definitions and parameters are illustrative, not Stripe's.

Four cooperating rate limiters. Each guards against a different failure mode. Together they protect the fleet from both malicious actors and accidental runaway clients, without a single limiter needing to solve all problems. Inspired by the Stripe rate-limiters post — specific parameters are illustrative.

The four layers are:

Request-rate limiter (per API key, per second): The outer gate. Counts requests per key in a sliding window or token bucket. Blocks brute-force card scanning where an attacker submits thousands of charge attempts per second with different card numbers.
Concurrent-request limiter (per API key, in-flight count): Caps the number of unresolved requests a single key may have open simultaneously. A retry storm, such as a client that retries every in-flight request on a 500 ms timer, can generate 10× normal DB load even when the per-second request rate looks acceptable. Capping concurrency stops this class of problem.
Fleet-protection circuit breaker (global signal): When the overall system is under dangerous load (DB connection pool saturation, p99 latency climbing toward timeout thresholds), this layer starts shedding a percentage of lower-priority requests. It is the last line of defense before a total system outage.
User-defined limits (per API key, set by merchant): Merchants can configure caps on their own keys — for example, a merchant may cap their key at 100 charges/minute to prevent an application bug from running up unexpected costs. This is a trust and safety feature as much as a technical one.
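
The first two layers in the list above can be sketched as a token bucket plus an in-flight counter. This is a single-process illustration under stated assumptions: the class names and parameters are hypothetical, nothing here is thread-safe, and a real deployment would keep this state in a shared store such as Redis.

```python
import time

class TokenBucket:
    """Request-rate limiter: refills `rate_per_s` tokens per second up to `capacity`."""

    def __init__(self, rate_per_s: float, capacity: int, now=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self._now = now
        self.updated = now()

    def allow(self) -> bool:
        now = self._now()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class ConcurrencyLimiter:
    """Concurrent-request limiter: caps unresolved requests per API key."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def acquire(self) -> bool:
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight = max(0, self.in_flight - 1)

def admit(bucket: TokenBucket, limiter: ConcurrencyLimiter) -> bool:
    """Apply the rate check first, then the in-flight cap."""
    return bucket.allow() and limiter.acquire()
```

A caller would invoke `admit` when a request arrives and `limiter.release()` when it completes. Note that a request rejected by the concurrency cap has already consumed a rate token, which is a simplification of this sketch.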

✅ Why four limiters rather than one?

A single global rate limiter must set a threshold that is both high enough for legitimate high-volume merchants and low enough to stop attackers — an impossible combination. Layering allows each limiter to be simple and well-targeted. The request-rate limiter can be generous because the concurrent-request limiter stops runaway storms; the fleet protection circuit breaker handles global overload without the per-key limiters needing to anticipate it.

Webhook fan-out: every charge generates downstream work

Each completed charge emits webhook events — charge.succeeded, payment_intent.succeeded, balance.updated — to every URL the merchant has registered. This fan-out is a significant workload in its own right, often exceeding the charge API throughput. See Lesson rel-12 for the full webhook architecture.

-- Webhook fan-out math (illustrative) --
⚠️  Modeled, not measured

Charges/day: 100,000,000
Events per charge (illustrative): ~3 distinct event types
Webhook endpoints per merchant (avg, illustrative): ~1.5
  (some merchants register multiple endpoints for
   different environments or services)

Total webhook deliveries/day:
  100M × 3 events × 1.5 endpoints = 450M deliveries/day  ⚠️ illustrative
  = 5,208 deliveries/s avg, ~26,000/s peak

-- Retry load from delivery failures --
HTTP delivery failure rate (illustrative): ~5%
  (merchant endpoint down, slow, timeout)
Retries per failed delivery: ~5 (with exponential backoff)
Additional retry deliveries/day:
  450M × 5% × 5 retries = 112.5M retry attempts/day
Total queue throughput including retries: ~560M operations/day  ⚠️ illustrative
  = 6,480 queue ops/s avg, ~32,400/s peak

-- Queue depth under failure (illustrative) --
Worst case: delivery stalls fleet-wide for 5 minutes
  (a single merchant's outage queues only that merchant's share):
  Undelivered webhooks during outage:
    5,208 /s avg × 300 s = ~1,562,400 queued deliveries
  Time to drain with 5,208/s of spare worker capacity
  (capacity beyond what new deliveries consume):
    1,562,400 ÷ 5,208 ≈ 300 s = 5 minutes additional drain time
  Peak queue depth: ~1.5M messages  (manageable for SQS/Kafka)
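
The fan-out and queue-depth figures can be recomputed with the sketch below. All inputs are the illustrative values from this section, and the helper names are hypothetical. The drain calculation assumes the stated rate is spare worker capacity, meaning capacity left over after new deliveries are served.

```python
SECONDS_PER_DAY = 86_400

# Illustrative inputs from the model above.
charges_per_day = 100_000_000
events_per_charge = 3
endpoints_per_merchant = 1.5
failure_rate = 0.05
retries_per_failure = 5

deliveries_per_day = charges_per_day * events_per_charge * endpoints_per_merchant
retries_per_day = deliveries_per_day * failure_rate * retries_per_failure
queue_ops_per_day = deliveries_per_day + retries_per_day
avg_delivery_rate = deliveries_per_day / SECONDS_PER_DAY

def backlog_after_outage(arrival_rate: float, outage_s: float) -> float:
    """Messages that pile up while nothing is delivered."""
    return arrival_rate * outage_s

def drain_seconds(backlog: float, spare_capacity: float) -> float:
    """Time to clear a backlog using only capacity beyond the arrival rate."""
    if spare_capacity <= 0:
        raise ValueError("backlog never drains without spare capacity")
    return backlog / spare_capacity

backlog = backlog_after_outage(avg_delivery_rate, outage_s=300)
print(f"deliveries/day: {deliveries_per_day:,.0f}")
print(f"queue ops/day:  {queue_ops_per_day:,.0f}")
print(f"backlog after 5 min: {backlog:,.0f}")
print(f"drain time: {drain_seconds(backlog, avg_delivery_rate):,.0f} s")
```

The backlog comes out near 1.56M because the script uses the unrounded average rate. Doubling the spare capacity halves the drain time, which is why worker autoscaling matters more than queue size here.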

⚠️ Webhook storms after merchant outages

When a merchant's endpoint comes back online after a multi-hour outage, the retry queue for that merchant may contain millions of pending deliveries that all become eligible at once. Without per-merchant rate limiting on delivery, this creates a thundering herd that overwhelms both the queue workers and the merchant's server. Effective systems maintain per-merchant delivery concurrency limits and back-off curves that don't fully open on reconnect.
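
One way to keep delivery from fully opening on reconnect is to combine jittered exponential backoff with a per-merchant concurrency limit that ramps up gradually. The sketch below is an assumption-laden illustration: `backoff_delay`, `MerchantDeliveryGate`, and every default value are hypothetical, and the doubling ramp is one possible policy among many.

```python
import random

def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 3600.0,
                  rng=random.random) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap_s, base_s * 2 ** attempt)

class MerchantDeliveryGate:
    """Per-merchant concurrency limit that reopens gradually after failures."""

    def __init__(self, max_concurrency: int = 32):
        self.max_concurrency = max_concurrency
        self.limit = 1  # start cautiously

    def on_success(self) -> None:
        # Double the allowed concurrency, up to the ceiling.
        self.limit = min(self.max_concurrency, self.limit * 2)

    def on_failure(self) -> None:
        # Collapse to a single probe delivery until the endpoint recovers.
        self.limit = 1

gate = MerchantDeliveryGate()
for _ in range(3):
    gate.on_success()
print(gate.limit)  # 8 after three consecutive successes
```

Jitter spreads retries that would otherwise fire at the same instant, and the gate ensures a recovering endpoint sees 1, 2, 4, ... concurrent deliveries rather than the whole backlog at once.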

Where the real bottleneck is: the ledger, not the app tier

Stack all the constraints together and the architectural picture becomes clear:

Layer	Scaling approach	Bottleneck?	Why / why not
App servers (charge API)	Horizontal — stateless, add nodes	No	Stateless; each request is independent; LB distributes evenly
Rate-limiter store (Redis)	Redis cluster (sharded by key)	Rarely	Redis handles 100k+ ops/s; horizontal sharding distributes key space
Idempotency key store	Durable Redis (or DB) + TTL eviction	Rarely	Reads and writes are single-key lookups; ~100 GB at 24h TTL is feasible
Charge DB primary (ledger)	Vertical first; then partitioning by merchant_id; eventually custom ledger	Yes — the primary bottleneck	Strong-consistency reads + writes; 7 ops/charge × 5,800/s peak = ~40k DB ops/s; cannot freely add read replicas for auth decisions
Webhook queue & workers	Queue (SQS/Kafka) + autoscaled worker fleet	No	Queue absorbs spikes; workers scale independently; failure retries managed by queue
Card network calls	Not controlled — gateway capacity purchased	External dependency	Timeouts and circuit breakers protect the internal system from a slow network

🎯 Interview angle

When asked "how does Stripe scale?", the conceptual answer interviewers look for is: the app tier is trivially scaled horizontally; the hard problem is the DB primary / ledger, which must remain consistent and handle all writes. The standard progression is: vertical scaling → connection pooling → merchant-id partitioning → purpose-built ledger (possibly eventually-consistent for aggregation views, with strong consistency only at the authoritative charge table). The observation that "the app tier is not the bottleneck" is a key differentiator in payment system design discussions.

The complete payments architecture

The full payments architecture. The DB primary is the central chokepoint: it must handle strong-consistency reads for authorization and all charge-state writes. Every other tier is either stateless or horizontally scalable. Webhooks fan out asynchronously to avoid blocking the charge path.

✅ Separate the write path from the fan-out path

The charge write path (idempotency check → DB write → card network → DB update) must be synchronous and strongly consistent, because the caller needs a definitive answer. The fan-out path (webhook delivery, accounting aggregations, analytics) should be asynchronous. Mixing them in the same DB transaction adds latency and failure surface to the critical path for no benefit.

🧠 Quick check

1. What is the approximate storage requirement for an idempotency key store at 100M charges/day with a 24-hour TTL, assuming ~1 KB per key?

100M keys × 1 KB = 100 GB. This is entirely feasible on modern Redis with AOF persistence — and it must be durable, not an LRU cache, because an evicted key can allow a duplicate charge.

2. Why must balance reads for authorization decisions go to the DB primary rather than a read replica?

Replication lag — even of milliseconds — is enough to return a balance that has already been fully spent. Any authorization decision made on a stale replica read risks approving a charge the merchant cannot cover. Strong consistency requires reading from the authoritative source: the primary.

3. In a payments system at 100M charges/day, which layer is the primary scaling bottleneck?

The DB primary. App servers are stateless and scale horizontally with little effort. Redis handles 100k+ ops/s and can be sharded. The DB primary must process all charge writes and all strong-consistency reads; at ~7 ops per charge × peak rate, this is ~40k DB ops/s (illustrative). This is the hard constraint that determines the architecture.

4. The concurrent-request rate limiter (layer 2) specifically addresses which failure mode?

Retry storms. A retry storm can produce a 10× load spike even when the per-second request rate looks normal, because many requests are in flight simultaneously and all get retried together. Capping concurrent in-flight requests per key stops this class of problem without restricting the request rate itself.

✍️ Exercise: model a payments system for a smaller scale

A new payment processor expects to handle 10M charges/day. Using the same per-charge assumptions as this lesson (7 primary DB ops/charge, 5× peak multiplier, 3 webhook events per charge, 1.5 endpoints per merchant), calculate: (a) average and peak charge req/s; (b) primary DB ops/s at average and peak; (c) total webhook deliveries/day and peak delivery rate/s.

Then answer: is DB sharding necessary at this scale? At what scale (charges/day) would you begin evaluating it?

Model answer:

(a) Charge rates: 10M ÷ 86,400 = 116 charges/s avg. Peak: 116 × 5 = 580 charges/s.
(b) Primary DB ops/s: Avg: 116 × 7 = 812 ops/s. Peak: 580 × 7 = 4,060 ops/s. A well-tuned PostgreSQL primary on NVMe SSD can handle 10,000–50,000 simple ops/s (illustrative). No sharding required yet — but connection pooling (PgBouncer) and query optimization are important at 4,060 ops/s peak.
(c) Webhook deliveries: 10M × 3 × 1.5 = 45M deliveries/day = 521/s avg. Peak: 521 × 5 = 2,605/s. Including 5% failure × 5 retries: ~56M total ops/day.
Sharding decision: At 10M charges/day the primary is comfortably within limits. Sharding becomes worth evaluating when write QPS approaches the single-primary ceiling, roughly at 100M–500M charges/day (illustrative), or when write latency begins climbing under load testing. A better early investment is merchant-id partitioning within a single DB to reduce hot-row contention.

Rubric: Full marks for correct arithmetic in all three parts and a reasoned sharding answer tied to write-QPS data rather than an arbitrary threshold. Partial marks if the arithmetic is correct but the sharding answer lacks a trigger metric. Bonus for noting that connection pooling (not more hardware) is the first fix when primary ops/s is elevated.

Key takeaways

100M charges/day = ~1,160 avg req/s, ~5,800 peak. Payments add a correctness requirement: every one of those requests must be deduplicated, strongly consistent, and auditable.
The idempotency key store enables safe retries. At 100M charges/day with a 24h TTL and ~1 KB/key, it requires ~100 GB of durable storage — not an LRU cache.
The charge state machine defines where retries are safe (before network send) and where they must become reconciliation (after the card network has been contacted).
Strong consistency is non-negotiable for balance reads used in authorization decisions. This is why the DB primary is the primary bottleneck — read replicas cannot be freely used for this workload.
Four cooperating rate limiters — request-rate, concurrent-request, fleet-protection, and user-defined — each guard against a distinct failure mode. No single limiter can do all four jobs.
Webhook fan-out can exceed API throughput significantly. At 100M charges/day, modeling shows ~450M–560M webhook deliveries/day (including retries). This is an async workload decoupled from the synchronous charge path.
All figures are illustrative models. Validate with load tests on your own stack before making provisioning decisions.