API Design

Platform & API Product Engineering · Lesson 02

Designing a Webhook Delivery System

A webhook is not just "POST to a URL when something happens." It is a delivery subsystem with fan-out, per-endpoint queuing, HMAC-signed payloads, exponential-backoff retries, dead-letter queues, circuit-breaking for failing endpoints, and a replay API. Each of those pieces exists to solve a specific failure mode — skip one and your platform will eventually wake engineers at 3 AM.

⏱ 30 min Difficulty: advanced Prereq: Webhooks intro (rel-12), Event-driven pub-sub (rel-10)

By the end you'll be able to

The problem a webhook system must solve

When your platform has ten thousand apps each subscribed to order.created events, and you receive 1,000 orders per second, your delivery system must POST to ten thousand distinct URLs at a sustained rate without any single slow or unresponsive endpoint holding up the other 9,999. That sentence describes the hard part: isolation. If your system processes deliveries with a single shared queue and worker pool, one endpoint that takes 30 seconds to respond will occupy a worker thread for 30 seconds, starving deliveries to healthy endpoints.

Think of it like postal sorting. A single queue where every package, from every sender to every recipient, must pass through one conveyor belt means a stuck package blocks the entire line. Real postal systems sort by destination and run separate belts per route. Webhooks need the same: a separate queue per endpoint, so that a dead endpoint accumulates its undeliverable backlog without touching anyone else's.

There is a second hard problem: at-least-once delivery with idempotent consumers. Your delivery system will retry failed deliveries. Retries mean duplicate deliveries are possible. Consumers must handle receiving the same event payload twice without double-processing — an order event delivered twice should not charge a customer twice.

The full delivery pipeline

[Figure 1 diagram: Event produced (order.created) → Subscription Matcher (lookup by event_type) → per-endpoint queues (ep_A: 3 pending, ep_B: 1 pending, ep_C: 847 pending) → POST + HMAC signature. ep_A and ep_B return 200 OK; ep_C times out or returns 5xx → retry queue → DLQ (after N attempts). The replay API re-enqueues from the DLQ on demand.]
Fig 1 — Full webhook delivery pipeline. A single event fans out to one queue per subscribed endpoint. ep_C's mounting backlog (847 items) does not affect ep_A or ep_B — per-endpoint queue isolation. Failed deliveries retry with backoff; after N attempts they move to the DLQ. The replay API can re-enqueue DLQ items on demand.

Step 1 — Event production and durable storage

Every event must be written to durable storage before anything downstream happens. If you enqueue directly to an in-memory queue and the process crashes, the event is lost. The correct order is: write the event to a durable store (a database table or an event log like Kafka), then return 200 to the caller who triggered it, then asynchronously fan-out to subscriptions.

# Event creation — pseudo-code for the API server that produces events
function create_order(customer_id, items):
  order = DB.insert_order(customer_id, items)        # primary write

  event = {
    "id":         generate_uuid(),                    # stable deduplication key
    "type":       "order.created",
    "api_version": "2024-07-01",
    "created_at": now_iso8601(),
    "data":       { "order_id": order.id, ... }
  }

  DB.insert_event(event)                             # durable event record
  QUEUE.push("events:fan_out", event.id)             # fan-out job (by id, not full payload)
  return order

Notice the fan-out queue stores the event ID, not the full payload. When the fan-out worker runs, it reads the event from the durable store. This means requeuing (on failure) does not create duplicate records, and the fan-out worker always reads the canonical event state.
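
A minimal runnable sketch of the durable-write-then-enqueue order, using SQLite and an in-memory deque as stand-ins for the real store and broker. The table layout, the `fan_out_queue` name, and the choice to commit the order and its event in one transaction are assumptions made for illustration; the pseudo-code above does not specify them.

```python
import json
import sqlite3
import uuid
from collections import deque
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, customer_id TEXT NOT NULL);
    CREATE TABLE events (id TEXT PRIMARY KEY, type TEXT NOT NULL, payload TEXT NOT NULL);
""")
fan_out_queue = deque()  # hypothetical stand-in for a broker; holds event IDs only


def create_order(customer_id: str) -> str:
    order_id = str(uuid.uuid4())
    event = {
        "id": str(uuid.uuid4()),  # stable deduplication key
        "type": "order.created",
        "created_at": datetime.now(timezone.utc).isoformat(),
        "data": {"order_id": order_id},
    }
    # Assumption: one transaction, so the order and its event commit together or not at all.
    with db:
        db.execute("INSERT INTO orders (id, customer_id) VALUES (?, ?)",
                   (order_id, customer_id))
        db.execute("INSERT INTO events (id, type, payload) VALUES (?, ?, ?)",
                   (event["id"], event["type"], json.dumps(event)))
    # Enqueue only after the commit, and only the ID: the worker re-reads the stored event.
    fan_out_queue.append(event["id"])
    return order_id
```

If the process dies between the commit and the enqueue, the event is still in the table, so a periodic sweep for events without a fan-out job can recover it.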

Step 2 — Subscription matching and fan-out

The fan-out worker reads each event ID, fetches the event from the store, then queries for all active subscriptions matching that event type. For each matching subscription it writes one delivery job to that endpoint's dedicated queue.

# Fan-out worker
function fan_out_event(event_id):
  event         = DB.get_event(event_id)
  subscriptions = DB.find_subscriptions(
                    event_type=event.type,
                    status='active'
                  )                                  # index on (event_type, status)

  for sub in subscriptions:
    delivery = DB.insert_delivery({
      "id":          generate_uuid(),
      "event_id":    event.id,
      "endpoint_id": sub.endpoint_id,
      "attempt":     1,
      "status":      "pending",
      "next_attempt_at": now()
    })
    QUEUE.push("webhook:deliver:" + sub.endpoint_id, delivery.id)

The key naming convention — webhook:deliver:{endpoint_id} — means each endpoint gets its own queue. A worker process consumes from one endpoint's queue at a time, so a slow endpoint only blocks that endpoint's worker, not the global pool.
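
To make the isolation property concrete, here is a small in-memory sketch. The `subscriptions` mapping and the `queues` dictionary are hypothetical stand-ins for the subscription table and the broker; only the `webhook:deliver:{endpoint_id}` naming comes from the text above.

```python
from collections import defaultdict, deque

# Hypothetical in-memory stand-ins for the subscription table and the broker.
subscriptions = {
    "order.created": ["ep_A", "ep_B", "ep_C"],
    "order.refunded": ["ep_A"],
}
queues = defaultdict(deque)


def queue_name(endpoint_id: str) -> str:
    return "webhook:deliver:" + endpoint_id


def fan_out(event_id: str, event_type: str) -> int:
    """Write one delivery job per subscribed endpoint; return how many were written."""
    endpoints = subscriptions.get(event_type, [])
    for endpoint_id in endpoints:
        queues[queue_name(endpoint_id)].append(event_id)
    return len(endpoints)


def drain(endpoint_id: str, limit: int) -> list:
    """Pop up to `limit` jobs from one endpoint's queue; other queues are untouched."""
    q = queues[queue_name(endpoint_id)]
    return [q.popleft() for _ in range(min(limit, len(q)))]
```

Draining `ep_A` never reads or blocks on `ep_C`'s queue, so a backlog on one endpoint stays local to that endpoint.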

Per-endpoint isolation and circuit-breaking

This is the most important architectural property of a robust webhook system. One slow or unreachable endpoint must not stall deliveries to healthy endpoints. Per-endpoint queues provide isolation; circuit-breaking provides relief for both the platform and the endpoint owner.

[Figure 2 diagram: three per-endpoint queues. ep_A (depth 2, HEALTHY) and ep_B (depth 5, HEALTHY) deliver in under 1 s with 200 OK. ep_C (depth 847, CIRCUIT OPEN) receives no delivery attempts and its owner is notified by email.]
Fig 2: Per-endpoint queue isolation. ep_C has accumulated 847 undeliverable items because the endpoint is timing out. The circuit breaker has opened: no further delivery attempts are made, new events keep queueing for ep_C within the retention window, and the endpoint owner is notified by email. ep_A and ep_B are completely unaffected.

Circuit-breaker rules for endpoints

A circuit breaker for an endpoint is simpler than a service mesh circuit breaker — you are tracking one consumer's reliability over time, not a backend service. A practical implementation:

# Circuit breaker state — per endpoint
endpoint_state = {
  "status":            "closed",  # closed = active, open = disabled
  "consecutive_failures": 0,
  "opened_at":         null,
  "last_attempt_at":   null,
  "success_rate_24h":  1.0
}

# After each delivery attempt:
if delivery succeeded (2xx):
  consecutive_failures = 0
  if status == "half-open":
    status = "closed"           # probe succeeded → close the circuit, resume delivery

elif delivery failed (non-2xx or timeout):
  consecutive_failures += 1
  if status == "half-open" OR consecutive_failures >= 5:
    if status == "closed":
      NOTIFY_OWNER(endpoint_id)  # email / dashboard alert, once per closed → open transition
    status = "open"              # a failed probe sends the circuit straight back to open
    opened_at = now()

# Probing: after circuit has been open for T minutes, try one request
if status == "open" AND now() - opened_at > PROBE_INTERVAL:
  status = "half-open"           # allow one probe attempt
✅ Auto-disable vs silently drop

When a circuit opens, do not silently drop deliveries. Continue queueing items (up to a max retention window, e.g., 72 hours) but stop attempting delivery. When the circuit half-opens for a probe and succeeds, replay the queued backlog in controlled bursts. If the endpoint owner fixes their server, they get their events — they do not lose them. Platforms that silently drop events when an endpoint is down generate intense support escalations when the endpoint comes back online and customers discover missing data.
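
The closed, open, and half-open transitions can be sketched as a small state machine. This is an illustration under stated assumptions, not a production breaker: the 30-minute `PROBE_INTERVAL_S`, the class name, and passing time in explicitly as `now` are all choices made here so the example is deterministic.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_THRESHOLD = 5      # consecutive failures before opening
PROBE_INTERVAL_S = 1800.0  # assumed 30-minute probe interval


@dataclass
class EndpointBreaker:
    status: str = "closed"            # "closed" | "open" | "half-open"
    consecutive_failures: int = 0
    opened_at: Optional[float] = None

    def allow_attempt(self, now: float) -> bool:
        """Return True if a delivery attempt may be made at time `now` (seconds)."""
        if self.status == "open" and now - self.opened_at >= PROBE_INTERVAL_S:
            self.status = "half-open"   # let exactly one probe through
            return True
        return self.status == "closed"

    def record(self, success: bool, now: float) -> None:
        """Update breaker state after an attempt finishes."""
        if success:
            self.consecutive_failures = 0
            self.status = "closed"
            self.opened_at = None
            return
        self.consecutive_failures += 1
        if self.status == "half-open" or self.consecutive_failures >= FAILURE_THRESHOLD:
            self.status = "open"
            self.opened_at = now
```

While the breaker is open, `allow_attempt` returns False but nothing is dequeued or dropped, which matches the keep-queueing rule described above.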

Step 3 — HTTP delivery with HMAC signatures

Every webhook POST must be signed so the receiver can verify the payload came from your platform and has not been tampered with in transit. The standard mechanism is HMAC-SHA256 over the timestamp and raw request body, with the signature included in the request headers.

[Figure 3 diagram: Sender (webhook delivery worker): get the endpoint's signing secret from the store; t = unix timestamp (seconds); sig = HMAC_SHA256(secret, t + "." + body); POST /webhook over HTTPS with headers Webhook-Timestamp: {t}, Webhook-Signature: sha256={sig}, Content-Type: application/json. Receiver (customer's server): parse Webhook-Timestamp and body; check |now − t| ≤ 300 s; expected = HMAC_SHA256(secret, t + "." + body); constant_time_equals(expected, provided) → 200; mismatch → 401 (reject).]
Fig 3 — HMAC signature generation and verification. The sender signs timestamp + "." + raw_body with the endpoint's secret. The receiver recomputes independently and uses a constant-time comparison to prevent timing-oracle attacks. The timestamp check (within 300 seconds) prevents replay attacks.

The signing algorithm in full

# Sender: compute the signature before POSTing
function sign_payload(secret, body):
  t               = str(current_unix_timestamp())
  signing_input   = t + "." + body              # byte-level concat of timestamp + "." + raw JSON
  sig             = hmac_sha256(key=secret, msg=signing_input)
  sig_hex         = hex(sig)
  return {
    "Webhook-Timestamp": t,
    "Webhook-Signature": "sha256=" + sig_hex
  }

# Receiver: verify before processing
function verify_signature(secret, timestamp, body, provided_sig):
  # 1. Replay defense: reject if timestamp is too old or in the future
  age = abs(current_unix_timestamp() - int(timestamp))
  if age > 300:                                # 5-minute tolerance window
    return REJECT, "timestamp too old or too far in future"

  # 2. Recompute expected signature using same input as sender
  signing_input   = timestamp + "." + body
  expected_sig    = "sha256=" + hex(hmac_sha256(key=secret, msg=signing_input))

  # 3. Constant-time compare: never use == for signatures (timing oracle)
  if hmac_compare_digest(expected_sig, provided_sig):
    return ACCEPT
  return REJECT, "signature mismatch"
⚠️ Verify against the raw request body, not a re-serialized object

The signature is computed over the exact bytes of the request body as received. If you parse the JSON body into an object and then re-serialize it before computing the verification HMAC, key ordering or whitespace differences will produce a different byte string and the signature check will fail. Always compute the HMAC over the raw body bytes, before JSON parsing. Stripe explicitly documents this: the signature must be verified against the raw Stripe request data.
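
A runnable sketch of the scheme using Python's standard `hmac` module, which also demonstrates the raw-body pitfall. The secret value, the function names, and the explicit `now` parameter (added so the example is deterministic) are assumptions for illustration.

```python
import hashlib
import hmac
import json
import time
from typing import Optional

TOLERANCE_S = 300  # 5-minute window


def sign(secret: bytes, body: bytes, timestamp: int) -> str:
    """Signature over timestamp + "." + raw body bytes."""
    signing_input = str(timestamp).encode() + b"." + body
    return "sha256=" + hmac.new(secret, signing_input, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, timestamp: int, provided: str,
           now: Optional[int] = None) -> bool:
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > TOLERANCE_S:      # replay defense
        return False
    expected = sign(secret, body, timestamp)
    return hmac.compare_digest(expected, provided)  # constant-time comparison


secret = b"whsec_example"                              # hypothetical secret
raw = b'{"id": "evt_1",  "type": "order.created"}'     # note the double space
sig = sign(secret, raw, 1_700_000_000)

# Parsing and re-encoding normalizes whitespace, so the bytes no longer match.
reserialized = json.dumps(json.loads(raw)).encode()
print(verify(secret, raw, 1_700_000_000, sig, now=1_700_000_100))           # True
print(verify(secret, reserialized, 1_700_000_000, sig, now=1_700_000_100))  # False
```

In a web framework this means reading the request body as bytes before any JSON middleware touches it.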

Secret rotation: multiple active secrets

Webhook secrets need to be rotatable without breaking in-flight deliveries. The pattern is to allow multiple active secrets per endpoint simultaneously for a short overlap window:

  1. The endpoint owner generates a new secret in the dashboard. Both old and new secrets are now active for this endpoint.
  2. The delivery worker signs with the newest secret but includes the key ID in the header (e.g. Webhook-Signature: sha256={sig},keyId={kid}).
  3. The receiver verifies against all currently active secrets for the endpoint. If any matches, the request is accepted.
  4. After a configurable grace period (e.g., 24 hours), the old secret is retired. The receiver should by then have deployed code using the new secret.
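
The receiver side of the rotation steps above can be sketched as follows. The function names and secret values are hypothetical; the only behavior taken from the list is that a signature matching any currently active secret is accepted.

```python
import hashlib
import hmac


def compute_sig(secret: bytes, timestamp: str, body: bytes) -> str:
    msg = timestamp.encode() + b"." + body
    return "sha256=" + hmac.new(secret, msg, hashlib.sha256).hexdigest()


def verify_any(active_secrets: list, timestamp: str, body: bytes, provided: str) -> bool:
    """Accept the request if the signature matches any currently active secret."""
    matched = False
    for secret in active_secrets:
        # Check every secret rather than returning on the first match.
        if hmac.compare_digest(compute_sig(secret, timestamp, body), provided):
            matched = True
    return matched


old, new = b"old-secret", b"new-secret"   # hypothetical secrets during the overlap window
body = b'{"id":"evt_1"}'
sig_new = compute_sig(new, "1700000000", body)
print(verify_any([old, new], "1700000000", body, sig_new))  # True: new secret is active
print(verify_any([old], "1700000000", body, sig_new))       # False: receiver not yet updated
```

The second call shows why the overlap window matters: a receiver that only knows the old secret rejects deliveries signed with the new one.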

Step 4 — Retry strategy with exponential backoff

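Any non-2xx response (4xx or 5xx) or a connection timeout counts as a delivery failure. The one 4xx commonly handled differently is 429, which signals rate limiting rather than a broken endpoint. A failed delivery must be rescheduled with an exponentially growing delay so that a struggling endpoint is not hammered into the ground while it recovers.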

# Retry schedule: exponential backoff with jitter, per delivery attempt
# `attempt` is the 1-based number of the attempt that just failed.
function next_attempt_delay(attempt, base=5, cap=3600):
  delay_s = min(cap, base * (2 ** (attempt - 1)))  # exponential: 5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 3600, 3600...
  jitter  = random(0, delay_s * 0.2)               # adds 0 to 20% jitter to desynchronize retries
  return delay_s + jitter

# On delivery failure:
if delivery.attempt < MAX_ATTEMPTS:              # MAX_ATTEMPTS = typically 10–17
  delivery.status          = "retrying"
  delivery.next_attempt_at = now() + next_attempt_delay(delivery.attempt)
  delivery.attempt        += 1                   # the rescheduled job is the next attempt
  QUEUE.schedule_at(delivery.id, delivery.next_attempt_at)
else:
  delivery.status = "failed"
  DLQ.push(delivery.id)                          # dead-letter queue
  UPDATE_ENDPOINT_FAILURE_STATS(delivery.endpoint_id)

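With base=5, cap=3600, and MAX_ATTEMPTS=17, the retry schedule (modeled, jitter excluded) is approximately:

Attempt | Delay before attempt (s) | Cumulative time elapsed | Notes
1 (initial) | 0 | 0 s | immediate
2 | 5 | 5 s | ~5 s after first fail
3 | 10 | 15 s |
4 | 20 | 35 s |
5 | 40 | 75 s (~1.25 min) |
7 | 160 | ~5.2 min |
9 | 640 | ~21 min |
11 | 2,560 | ~1.4 h |
12 | 3,600 (cap) | ~2.4 h |
17 (max) | 3,600 (cap) | ~7.4 h | → DLQ after this attempt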

Stripe retries over approximately 3 days with increasing delays. GitHub, by contrast, does not automatically redeliver failed deliveries; redelivery is triggered manually or through its API. The specific schedule is a product decision: short windows mean customers find out about problems quickly; long windows tolerate infrastructure hiccups but delay failure notification.
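
A schedule like this is easy to tabulate before committing to it. The sketch below assumes the first retry waits `base` seconds and each later retry doubles up to `cap`; the function names are hypothetical, and jitter is kept in a separate helper so the table itself is deterministic.

```python
import random


def base_delay(failed_attempt: int, base: int = 5, cap: int = 3600) -> int:
    """Seconds to wait after `failed_attempt` (1-based) fails, without jitter."""
    return min(cap, base * 2 ** (failed_attempt - 1))


def jittered_delay(failed_attempt: int, rng: random.Random) -> float:
    """Base delay plus 0 to 20% random jitter."""
    d = base_delay(failed_attempt)
    return d + rng.uniform(0, 0.2 * d)


def schedule(max_attempts: int) -> list:
    """Rows of (attempt number, delay before it, cumulative seconds since attempt 1)."""
    rows, elapsed = [(1, 0, 0)], 0
    for attempt in range(2, max_attempts + 1):
        delay = base_delay(attempt - 1)
        elapsed += delay
        rows.append((attempt, delay, elapsed))
    return rows


for row in schedule(5):
    print(row)   # (1, 0, 0), (2, 5, 5), (3, 10, 15), (4, 20, 35), (5, 40, 75)
```

Changing `base`, `cap`, or the attempt limit and re-running `schedule` shows immediately how long a delivery can stay in the retry loop before it reaches the DLQ.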

Step 5 — Dead-letter queue and replay API

After the final retry, deliveries that could not be confirmed successful move to a dead-letter queue (DLQ). The DLQ is not a dustbin — it is a durable log of deliveries that require human attention. The replay API allows endpoint owners to inspect and re-trigger deliveries from the DLQ after they have fixed whatever caused the failures.

# Replay API — allows endpoint owner to re-enqueue failed deliveries
# POST /v1/webhooks/endpoints/{endpoint_id}/deliveries/{delivery_id}/replay

function replay_delivery(endpoint_id, delivery_id, caller_auth):
  delivery = DB.get_delivery(delivery_id)
  assert delivery.endpoint_id == endpoint_id        # ownership check
  assert caller_auth.has_permission("webhooks:write")

  # Create a new delivery attempt linked to the original event
  new_delivery = DB.insert_delivery({
    "event_id":    delivery.event_id,               # same event, new attempt
    "endpoint_id": delivery.endpoint_id,
    "attempt":     1,                               # reset attempt counter
    "status":      "pending",
    "replayed_from": delivery.id
  })
  QUEUE.push("webhook:deliver:" + delivery.endpoint_id, new_delivery.id)
  return new_delivery
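
A runnable version of the same replay logic, as a sketch. The in-memory `deliveries` and `endpoint_queues` stores and the `permissions` set are hypothetical stand-ins. One deliberate difference from the pseudo-code: the checks raise exceptions instead of using `assert`, since Python strips assertions when run with `-O`.

```python
import uuid

deliveries = {}        # hypothetical in-memory delivery store, keyed by delivery ID
endpoint_queues = {}   # hypothetical per-endpoint queues, keyed by queue name


def replay_delivery(endpoint_id: str, delivery_id: str, permissions: set) -> dict:
    if "webhooks:write" not in permissions:
        raise PermissionError("webhooks:write required")
    original = deliveries.get(delivery_id)
    if original is None or original["endpoint_id"] != endpoint_id:
        # Same error for "missing" and "belongs to another endpoint" to avoid leaking IDs.
        raise LookupError("delivery not found for this endpoint")
    new = {
        "id": str(uuid.uuid4()),
        "event_id": original["event_id"],   # same event, so consumers can still deduplicate
        "endpoint_id": endpoint_id,
        "attempt": 1,                       # reset attempt counter
        "status": "pending",
        "replayed_from": original["id"],
    }
    deliveries[new["id"]] = new
    endpoint_queues.setdefault("webhook:deliver:" + endpoint_id, []).append(new["id"])
    return new
```

Because the replay reuses the original `event_id`, an idempotent consumer that already processed the event will acknowledge the replay without acting on it twice.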

Under the hood: idempotency and duplicate handling

At-least-once delivery means that some deliveries will reach the customer more than once. This is unavoidable without distributed transactions between your delivery confirmation and the customer's processing confirmation. The receiver must handle duplicates without re-processing them.

The mechanism: every event carries a stable id (a UUID). The consumer's handler checks whether it has already processed this ID before acting:

# Consumer: idempotent event handler
function handle_order_created(payload):
  event_id = payload["id"]

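  # Attempt insert into a processed-events table with a UNIQUE constraint on event_id.
  # Caveat: if processing below fails after this insert succeeds, the retry will be
  # skipped as a duplicate. Commit the key together with the work (same transaction
  # or same enqueued job) so a failure rolls both back.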
  try:
    DB.insert_idempotency_key(event_id, processed_at=now())
  except UniqueViolation:
    return HTTP_200  # already processed — acknowledge without re-running

  # Safe to process — won't run twice for the same event_id
  create_order_in_crm(payload["data"])
  return HTTP_200
⚠️ Respond 2xx fast, then process asynchronously

Your HTTP handler should return 2xx within 5 seconds (some platforms use a 3-second timeout). If your processing is slow — writing to a database, calling another API — acknowledge the webhook immediately and process the payload in a background job. A slow handler causes the delivery worker to wait, retry, and eventually duplicate the event to your endpoint. The pattern: receive webhook → write payload to your own queue → return 200 → background worker processes queue. This decouples webhook receipt from processing latency.
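
The receive, deduplicate, enqueue, acknowledge pattern can be sketched with SQLite's primary-key constraint doing the duplicate detection. The table name, the `work_queue` deque, and returning the status code as an integer are assumptions for illustration; a real handler would sit behind a web framework and a durable queue.

```python
import sqlite3
from collections import deque

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
work_queue = deque()  # hypothetical stand-in for the receiver's own background queue


def handle_webhook(payload: dict) -> int:
    """Record the event ID, enqueue the work, and acknowledge. Returns the HTTP status."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO processed_events (event_id) VALUES (?)",
                       (payload["id"],))
    except sqlite3.IntegrityError:
        return 200  # duplicate delivery: acknowledge without enqueuing again
    work_queue.append(payload)  # slow processing happens later, off the request path
    return 200


evt = {"id": "evt_1", "type": "order.created", "data": {"order_id": "o_1"}}
print(handle_webhook(evt), handle_webhook(evt), len(work_queue))  # 200 200 1
```

In this sketch the ID insert and the enqueue are separate steps; with a durable queue stored in the same database, both would go in one transaction so a crash between them cannot lose the event.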

By the numbers

Delivery throughput and worker sizing

Scenario: 1,000,000 events/day across 20 subscribed apps = up to 20,000,000 deliveries/day. Average HTTP delivery latency to a healthy endpoint: 150 ms.

delivery_rate  = 20,000,000 / 86,400 ≈ 231 deliveries/s   (modeled)

# Little's Law: L = λ × W
# L = average in-flight deliveries (concurrency needed)
# λ = arrival rate (deliveries/s), W = average service time (s)

concurrency    = 231 deliveries/s × 0.150 s = 34.7 → 35 concurrent workers  (modeled)

Each worker handles one delivery at a time (one open HTTP connection). You need approximately 35 worker threads to sustain the base delivery rate at 150 ms average latency. At a 500 ms average (slower endpoints), you need 231 × 0.5 ≈ 116 workers. This is why per-endpoint concurrency caps matter: one endpoint out of 20 receives about 11.6 deliveries/s, so at 30-second latency it alone would tie up roughly 11.6 × 30 ≈ 347 workers if uncapped, ten times the pool sized for the healthy case. If every endpoint were that slow, the figure would be 231 × 30 = 6,930.
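
The sizing arithmetic is a direct application of Little's Law and is worth keeping as a small helper so it can be re-run with measured latencies. The function name is hypothetical; the inputs are the scenario's figures.

```python
import math


def workers_needed(deliveries_per_s: float, latency_s: float) -> int:
    """Little's Law: in-flight deliveries = arrival rate x service time, rounded up."""
    return math.ceil(deliveries_per_s * latency_s)


rate = 20_000_000 / 86_400            # about 231.5 deliveries/s across all endpoints
per_endpoint_rate = rate / 20         # about 11.6 deliveries/s for one of 20 endpoints

print(workers_needed(rate, 0.150))             # 35
print(workers_needed(rate, 0.500))             # 116
print(workers_needed(per_endpoint_rate, 30))   # 348
```

The last line uses the unrounded rate, which is why it lands one above a hand calculation with 11.6 deliveries/s.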

Retry amplification

When 10% of endpoints are unhealthy (a realistic scenario during a downstream incident), retry load amplifies the delivery volume:

base_deliveries        = 231/s
unhealthy_endpoints    = 10%
retried_deliveries     = 231 × 0.10 × (MAX_ATTEMPTS - 1)
                       = 23.1 × 16 retries  ≈  370/s additional  (modeled)

total_peak_delivery_rate = 231 + 370 ≈ 600/s   (2.6× amplification factor)
workers_needed_at_peak   = 600 × 0.150  ≈  90 workers

This is the retry storm: an incident affecting 10% of endpoints more than doubles the delivery load on your infrastructure. The 90-worker figure is a lower bound, because it assumes failed attempts also finish in 150 ms; an attempt that runs to its timeout holds a worker for the full timeout. Mitigations: (1) per-endpoint concurrency caps to prevent one bad endpoint consuming disproportionate workers; (2) circuit-breaking to stop retrying open-circuit endpoints; (3) exponential backoff to spread retries over time rather than concentrating them.
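
The amplification model above reduces to a one-line formula, sketched here so the effect of a lower attempt limit can be compared. The function name is hypothetical, and the model is the same steady-state simplification used in the text: every delivery to an unhealthy endpoint eventually consumes all of its retries.

```python
def retry_amplification(base_rate: float, unhealthy_fraction: float, max_attempts: int):
    """Return (total attempts per second, amplification factor) in steady state."""
    retries = base_rate * unhealthy_fraction * (max_attempts - 1)
    total = base_rate + retries
    return total, total / base_rate


total_17, factor_17 = retry_amplification(231, 0.10, 17)
total_10, factor_10 = retry_amplification(231, 0.10, 10)
print(round(total_17), round(factor_17, 1))   # 601 2.6
print(round(total_10), round(factor_10, 1))   # 439 1.9
```

Dropping the limit from 17 to 10 attempts cuts the modeled amplification from 2.6× to 1.9×, at the cost of a shorter window for endpoints to recover before events reach the DLQ.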

Queue depth as a health signal

# Alert thresholds — model these for your own SLA
warn  when queue_depth > 100     AND p99_delivery_age > 30 s    (modeled)
page  when queue_depth > 1000    OR  p99_delivery_age > 300 s   (modeled)

# Per-endpoint success rate over a rolling window (here, the last 20 attempts):
success_rate = successful_deliveries / total_attempts
open_circuit when success_rate < 0.50 over last 20 attempts      (modeled)
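
These thresholds translate directly into two small checks. The sketch below uses the modeled numbers as defaults; the function and class names are hypothetical, and waiting for a full window before opening the circuit is an added assumption to avoid tripping on the first few attempts.

```python
from collections import deque


def alert_level(queue_depth: int, p99_age_s: float) -> str:
    """Map queue depth and p99 delivery age to 'ok', 'warn', or 'page'."""
    if queue_depth > 1000 or p99_age_s > 300:
        return "page"
    if queue_depth > 100 and p99_age_s > 30:
        return "warn"
    return "ok"


class RollingSuccessRate:
    """Success rate over the most recent `window` attempts for one endpoint."""

    def __init__(self, window: int = 20):
        self.results = deque(maxlen=window)

    def record(self, success: bool) -> None:
        self.results.append(success)

    def should_open(self, threshold: float = 0.50) -> bool:
        # Assumption: require a full window before judging the endpoint.
        if len(self.results) < self.results.maxlen:
            return False
        return sum(self.results) / len(self.results) < threshold
```

Note the asymmetry in `alert_level`: the warning needs both conditions, while either condition alone is enough to page.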

Trade-offs: design decisions you will face

Decision | Option A | Option B | Recommendation
Delivery model | Push (webhooks): server POSTs to client on event | Pull (polling): client GETs /events periodically | Push for low-latency event notification at scale; pull for simple integrations where latency does not matter and clients do not want to run a server
Queue topology | Per-endpoint queues: one queue per registered endpoint URL | Shared queue with priority: one queue, with the endpoint carried as job metadata | Per-endpoint queues: one slow endpoint cannot starve others. Essential at scale. Shared queue is simpler for very small platforms (< 10 endpoints total)
Handler contract | Respond 2xx fast then process async: acknowledge immediately, queue internally | Respond 2xx only after processing: synchronous processing before response | Respond fast + process async. Synchronous processing risks timeouts and forced retries. Your handler's internal latency should never affect delivery confirmation
Ordering guarantee | No ordering guarantee: events may arrive out of sequence | Per-resource sequence IDs: events for a given resource carry a monotonic sequence number; consumer can detect and buffer out-of-order arrivals | No ordering is simpler and correct for most use cases. Add per-resource sequence IDs only if consumers need strict ordering (e.g., state-machine transitions where out-of-order is a logic error)
Delivery guarantee | At-least-once: some duplicates possible; consumers must be idempotent | Exactly-once: requires distributed transactions (two-phase commit) or idempotency keys at the delivery store layer | At-least-once with idempotent consumers is the industry standard. Exactly-once is possible but complex; Stripe, GitHub, and Shopify all document "at-least-once" and require consumers to use the event ID for deduplication

How real platforms do it

Platform | Signature scheme | Retry schedule | DLQ / replay? | Circuit-break?
Stripe | HMAC-SHA256 over timestamp.body; secret is endpoint-specific; Stripe-Signature header with multiple scheme prefixes (supports multiple active secrets) | Up to ~3 days; uses exponential backoff with increasing delays; documented in Stripe webhook retries docs | Yes: dashboard shows each event's delivery attempts; any event can be manually replayed via dashboard or Events API | Yes: endpoints that fail repeatedly are automatically disabled; Stripe notifies via email and the dashboard; endpoint owner must manually re-enable
GitHub | HMAC-SHA256 over request body; X-Hub-Signature-256 header; single secret per webhook; SHA-1 (X-Hub-Signature) still supported for legacy compatibility | No automatic redelivery of failed deliveries; redelivery is manual or scripted via the API; GitHub UI shows recent deliveries with request/response for debugging | Yes: GitHub provides a webhook redeliver API endpoint for replaying any past delivery | Yes: endpoint auto-disabled after repeated timeouts; notification email sent to org admin
Svix (webhook-as-a-service) | HMAC-SHA256 with message-ID-based signing input; supports multiple active secrets for rotation; Svix signature docs | Exponential backoff over ~5 days (28 retry attempts); jitter applied to desynchronize | Full replay API; event log retained per configured retention window; replay individual or bulk events | Yes: automatic endpoint disabling with configurable thresholds; owner notification via portal and email
Shopify | HMAC-SHA256 over raw request body; base64-encoded signature in X-Shopify-Hmac-Sha256 header | Up to 19 delivery attempts over 48 hours | Dashboard shows recent deliveries; manual replay via Partner Dashboard; no public API for bulk replay | Yes: app webhook subscription auto-paused after 19 consecutive failures
🎯 Interview angle — designing a webhook delivery system

"Design a reliable webhook delivery system for a platform with thousands of subscribers." A strong answer covers all six components in order: (1) durable event storage before fan-out; (2) subscription matching by event type; (3) per-endpoint queues for isolation; (4) HMAC-signed HTTP POST; (5) exponential-backoff retries with DLQ after N failures; (6) circuit-breaking for dead endpoints. Then add the harder parts: at-least-once with idempotent consumers; the secret-rotation pattern; the thundering-herd on circuit-close (drain backlog in controlled bursts, not all at once). Most candidates cover the happy path and miss the isolation and failure-mode components.

✅ Related lessons and simulators

The pub-sub event model underlying the fan-out step is covered in rel-10 Event-Driven Pub-Sub. Webhook debugging workflows (inspecting signatures, replaying events in development) are in dbg-04 Debugging Webhooks. For queue backpressure under delivery-worker load, open sim-04 Queue Backpressure and increase the slow-endpoint percentage.

🧠 Quick check

1. A webhook delivery system uses a single shared queue for all endpoints. One endpoint starts taking 30 seconds to respond. What failure mode does this cause?

A shared queue with a fixed worker pool means slow endpoints consume workers while they wait for HTTP responses. With 10 workers and one endpoint taking 30 s per response, that endpoint needs only one delivery every 3 seconds (10 workers ÷ 30 s ≈ 0.33 deliveries/s) to keep all 10 workers occupied, and deliveries to every other endpoint wait behind it. Per-endpoint queues solve this by giving each endpoint its own worker budget so one slow endpoint cannot starve others.

2. A customer's webhook handler returns HTTP 200 but does so after 45 seconds of processing. Your delivery worker has a 30-second timeout. What sequence of events follows?

The delivery worker's 30-second timeout fires before the customer's 200 arrives. From the delivery system's perspective, this is a failure — it never received confirmation. The delivery is rescheduled for retry. When the retry arrives, the customer's handler runs again and returns 200 in 45 s — but by then the retry has already timed out too. Eventually a retry might arrive when the endpoint is faster. Meanwhile the customer processes the event multiple times if their handler is not idempotent.

3. A customer reports that webhook signature verification is failing on their server, even though they are using the correct secret. The most likely cause is:

Re-serializing a parsed JSON body is the most common signature verification failure. JSON serialization is not canonical: key ordering, whitespace, and number formatting can differ between parse-and-re-encode vs the original bytes. Always compute the HMAC over the raw body bytes as received, before any parsing. The other options are possible but far less common in practice.

4. Which statement correctly describes the at-least-once delivery guarantee offered by webhook systems?

At-least-once means the delivery system retries until it confirms a 2xx response from the endpoint. If the endpoint processes the event but its 2xx never reaches the sender (the connection drops, or the response arrives after the timeout), the system retries and the endpoint receives a duplicate. Consumers must use the event's stable ID to detect and skip duplicate processing; the system itself does not suppress duplicates.

🏗️ Exercise — Design the delivery worker for a webhook system
Scenario You are building the delivery worker component of a webhook system for a SaaS platform. Requirements: sign every delivery with HMAC-SHA256; respect a 5-second response timeout; retry with exponential backoff up to 10 attempts; move to DLQ on failure; circuit-break after 5 consecutive failures; support secret rotation without downtime.

Model answer:

  1. Signature: Compute HMAC_SHA256(secret, unix_timestamp + "." + raw_body_bytes). Include Webhook-Timestamp and Webhook-Signature: sha256={hex_sig} in headers. For rotation, sign with the newest active secret; receivers should verify against all currently active secrets for the endpoint until old ones expire.
  2. HTTP delivery: POST to the endpoint URL with a 5-second connection + response timeout. Accept any 2xx status as success. Treat all non-2xx responses and all timeouts as failures. Log the HTTP status, response body (truncated to 1 KB), and latency for every attempt.
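  3. Retry schedule: On failure, compute delay = min(3600, 5 × 2^(attempt − 1)) seconds plus up to 20% jitter, where attempt is the number of the attempt that just failed, so the first retry waits about 5 s. Schedule the next attempt at now() + delay. After attempt 10 fails, write to DLQ and set delivery status to failed.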
  4. Circuit breaker: Maintain per-endpoint state: consecutive failure count. After 5 consecutive failures, set status to open and stop attempting delivery. After a probe interval (e.g., 30 minutes), try one request in half-open state. On success, reset to closed. On failure, return to open and extend probe interval. Notify endpoint owner via email when circuit opens.
  5. Idempotency contract: Document that each event carries a stable id; consumers should insert it into a processed-events table with a UNIQUE constraint before acting. Respond 2xx immediately; process asynchronously.

Rubric: ✓ HMAC-SHA256 over timestamp+body (not just body) ✓ Raw body bytes for signing (not re-serialized) ✓ 5-second timeout ✓ Exponential backoff formula ✓ DLQ after max attempts ✓ Circuit breaker with owner notification ✓ Secret rotation described (multiple active secrets). Six or more = strong answer. Most candidates miss the raw-bytes signing detail and the rotation mechanism.

Key takeaways

Sources & further reading