Reliability & Scale · Lesson 12
Designing webhooks
REST lets clients ask questions; webhooks let servers push answers nobody asked for yet. Done well, a webhook system is a first-class outbound API — with versioned payload contracts, tamper-proof signatures, and reliable delivery guarantees that consumers can build production systems on top of.
By the end you'll be able to
- Design a webhook payload with the fields that make it verifiable, replayable, and idempotent.
- Implement HMAC-SHA256 signature verification to prove a delivery is genuine.
- Describe the full delivery pipeline: at-least-once semantics, exponential backoff, dead-letter queue, and why consumers must respond 2xx immediately.
Webhooks are an outbound API
An inbound (REST) API is pull: the client decides when to ask, and the server answers. A webhook is push: the server decides when something happened and calls the client's URL with the news. Think of the difference between checking your mailbox every hour versus having the postal service ring your doorbell the moment a package arrives.
Because the server is now the caller and the consumer's HTTPS endpoint is the "API", all the design discipline we apply to inbound APIs applies here in reverse: the webhook has a contract (payload shape), a versioning story, and a reliability envelope — you just have to build it on the producer side, because your consumers can't version-lock your webhooks themselves.
Payload design: what every event must carry
A webhook payload should be self-describing. Any consumer receiving it in isolation — perhaps hours after the event, or replaying from a dead-letter queue — should be able to process it without making follow-up API calls. The minimum required fields:
{
  "id": "evt_01HZK7N5V8XTQP3MJ4GQYK29W",   // stable, unique per event (identical on every retry)
  "type": "order.paid",                     // dot-namespaced event type
  "api_version": "2025-06-01",              // date-based payload schema version
  "created_at": "2025-06-20T14:32:00Z",     // when the event occurred (ISO-8601 UTC)
  "livemode": true,                         // distinguish production from test events
  "data": {
    "object": {
      "id": "ord_9821",
      "customer_id": "cus_4412",
      "total_cents": 4999,
      "currency": "USD",
      "status": "paid"
    }
  }
}
The id field is the idempotency key — consumers store it and skip reprocessing if they see the same event twice (see rel-02 on idempotency). The api_version field lets you evolve the payload shape without surprise: consumers can pin to a schema date and receive that version's shape, while newer registrations get a newer shape.
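A minimal sketch of how a producer might assemble this envelope. The helper names `build_event` and `serialize` are hypothetical, and the id here is a plain UUID rather than the sortable format shown in the example above:

```python
import json
import uuid
from datetime import datetime, timezone

API_VERSION = "2025-06-01"  # assumed: the schema date this endpoint is pinned to


def build_event(event_type: str, obj: dict, livemode: bool = True) -> dict:
    """Wrap a domain object in the event envelope described above."""
    return {
        "id": f"evt_{uuid.uuid4().hex}",  # generated once, reused on every retry
        "type": event_type,
        "api_version": API_VERSION,
        "created_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "livemode": livemode,
        "data": {"object": obj},
    }


def serialize(event: dict) -> bytes:
    """Serialize once; these exact bytes are what gets signed and sent."""
    return json.dumps(event, separators=(",", ":")).encode("utf-8")


event = build_event(
    "order.paid",
    {"id": "ord_9821", "total_cents": 4999, "currency": "USD", "status": "paid"},
)
body = serialize(event)
```

Serializing once and storing the bytes alongside the event keeps every retry byte-identical, which matters as soon as the body is signed.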
Signing payloads with HMAC
Anyone on the internet can POST JSON to your consumer's endpoint. Without a signature, a consumer has no way to tell a genuine delivery from a forged one. The standard approach is HMAC-SHA256: the producer signs the payload body with a shared secret and includes the signature in a request header. The consumer recomputes it and rejects anything that doesn't match.
Signature verification — worked example
Below is a concrete webhook delivery with a signature header, followed by verification pseudo-code that a consumer would run on every inbound request.
# Delivery — headers sent by producer
POST /webhooks/orders HTTP/1.1
Host: consumer.example.com
Content-Type: application/json
X-Hook-Signature: sha256=3a7f2d9c1b8e4f0a6d5c2b9e8f7a1d3c4b5e6f7a8b9c0d1e2f3a4b5c6d7e8f9
X-Hook-Timestamp: 1750430045
X-Hook-Id: evt_01HZK7N5V8XTQP3MJ4GQYK29W
{ /* ... payload body ... */ }
# Consumer verification (Python-style pseudo-code)
# event_store, job_queue and process_event stand in for the application's
# own dedup store, work queue and worker function.
import hmac, hashlib, time

WEBHOOK_SECRET = "whsec_super_random_64_byte_hex_string"
MAX_AGE_SECONDS = 300  # reject deliveries older than 5 minutes

def verify_webhook(headers, raw_body):
    sig_header = headers["X-Hook-Signature"]  # "sha256=abc123..."
    timestamp = int(headers["X-Hook-Timestamp"])
    event_id = headers["X-Hook-Id"]

    # 1. Reject stale deliveries (replay attack guard)
    if abs(time.time() - timestamp) > MAX_AGE_SECONDS:
        raise ValueError("Delivery too old: possible replay")

    # 2. Recompute signature: HMAC-SHA256(secret, timestamp + "." + body)
    #    Work on the raw bytes so the body is never decoded and re-encoded.
    signed_payload = f"{timestamp}.".encode() + raw_body
    expected_sig = "sha256=" + hmac.new(
        WEBHOOK_SECRET.encode(), signed_payload, hashlib.sha256
    ).hexdigest()

    # 3. Constant-time comparison (prevents timing attacks)
    if not hmac.compare_digest(expected_sig, sig_header):
        raise ValueError("Signature mismatch: reject")

    # 4. Idempotency check: have we processed this event id before?
    if event_store.exists(event_id):
        return 200  # already seen, safe to ack again

    # 5. Record the id, then enqueue for async processing and return 200 FAST
    event_store.mark_seen(event_id)
    job_queue.enqueue(process_event, raw_body)
    return 200
Delivery semantics: at-least-once
Webhook systems almost universally guarantee at-least-once delivery: the producer will keep retrying until the consumer acknowledges with a 2xx response. This means the same event can arrive twice — on a retry after a network blip, even if the consumer processed it fine the first time. Consumers must therefore be idempotent: processing the same event twice must produce the same result as processing it once. The id field in the payload is the idempotency key — store it in a seen-events table and skip duplicate processing (see rel-02).
Exactly-once delivery cannot be guaranteed across a network boundary: the consumer's 2xx might get lost, causing the producer to retry even though the consumer already did the work. Design for idempotency instead of hoping for exactly-once.
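One way to make a consumer idempotent is to record the event id and apply the side effect in the same database transaction. The sketch below uses an in-memory SQLite database; the table names `seen_events` and `orders` and the function `handle_once` are illustrative assumptions, not part of any particular framework:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE seen_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")


def handle_once(event: dict) -> bool:
    """Apply the event's effect at most once; return True if it was applied."""
    with db:  # one transaction: the dedup row and the effect commit together
        cur = db.execute(
            "INSERT OR IGNORE INTO seen_events (event_id) VALUES (?)",
            (event["id"],),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: id already recorded, skip the work
        obj = event["data"]["object"]
        db.execute(
            "INSERT OR REPLACE INTO orders (order_id, status) VALUES (?, ?)",
            (obj["id"], obj["status"]),
        )
        return True


evt = {"id": "evt_1", "data": {"object": {"id": "ord_9821", "status": "paid"}}}
first = handle_once(evt)   # True: effect applied
second = handle_once(evt)  # False: simulated retry of the same event is skipped
```

Because the primary key rejects the second insert, two concurrent deliveries of the same event cannot both pass the check.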
Retries, backoff, and dead-letter queues
When a consumer returns 5xx, times out, or fails to respond, the producer should not hammer it. A well-designed retry schedule uses exponential backoff with jitter (see rel-05): first retry after ~30 seconds, then ~2 minutes, then ~10 minutes, up to a maximum of perhaps 72 hours. After exhausting retries, the event moves to a dead-letter queue (DLQ) — a persistent store of permanently undelivered events that operations teams can inspect and replay manually once the consumer is back up.
Key retry rules:
- Retry on 5xx, timeouts, and connection errors. Do not retry on 4xx — a 400/401/403 indicates the consumer rejected the message intentionally; retrying won't help.
- Provide a dashboard or API for consumers to view and replay events from the DLQ.
- Expose the full delivery attempt log (timestamp, status code, response body) so consumers can diagnose why deliveries are failing.
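The retry rules above can be sketched as a small decision function for a delivery worker. The name `next_action` and the three outcome strings are hypothetical, and sending non-retryable failures (4xx, unexpected 3xx) straight to the dead-letter queue is an assumption about where such events should end up:

```python
from typing import Optional


def next_action(status_code: Optional[int], attempt: int, max_attempts: int = 8) -> str:
    """Decide what to do after one delivery attempt.

    status_code is None when the request timed out or the connection failed.
    attempt is 1 for the initial delivery, 2 for the first retry, and so on.
    """
    if status_code is not None and 200 <= status_code < 300:
        return "delivered"
    # Only transient failures are worth retrying: 5xx, timeouts, connection errors.
    retryable = status_code is None or status_code >= 500
    if retryable and attempt < max_attempts:
        return "retry"
    # 4xx, or retries exhausted: park the event for inspection and manual replay.
    return "dead_letter"
```

For example, `next_action(503, 1)` yields `"retry"`, while `next_action(401, 1)` yields `"dead_letter"` without spending any retries.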
Ordering not guaranteed
Webhook delivery systems do not generally preserve event ordering. A subscription.renewed event that fires one second after a subscription.created may arrive first due to retry timing or parallelism. Consumers must treat each event as independent and use the created_at timestamp plus their own state to resolve ordering — for example, comparing the event's object state against what they already have in their database, and discarding stale events using a "last-write-wins" strategy keyed on timestamp.
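A minimal last-write-wins sketch for the consumer side, assuming local state is a dictionary keyed by object id. The names `state`, `parse_ts`, and `apply_if_newer` are illustrative:

```python
from datetime import datetime

# Hypothetical local state: object id -> (status, timestamp of the event that set it)
state: dict = {}


def parse_ts(value: str) -> datetime:
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def apply_if_newer(event: dict) -> bool:
    """Apply the event only if it is newer than what is already stored."""
    obj = event["data"]["object"]
    event_time = parse_ts(event["created_at"])
    current = state.get(obj["id"])
    if current is not None and current[1] >= event_time:
        return False  # stale (or same-age) event: keep the existing state
    state[obj["id"]] = (obj["status"], event_time)
    return True


renewed = {"created_at": "2025-06-20T14:32:01Z",
           "data": {"object": {"id": "sub_1", "status": "renewed"}}}
created = {"created_at": "2025-06-20T14:32:00Z",
           "data": {"object": {"id": "sub_1", "status": "created"}}}

applied_renewed = apply_if_newer(renewed)  # the later event arrives first
applied_created = apply_if_newer(created)  # the earlier event arrives late and is discarded
```

With second-resolution timestamps, two events in the same second tie; a producer-side sequence number or object version, if one is available, gives a more reliable tiebreaker.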
Endpoint registration and management
Consumers need a self-service way to register and manage their webhook endpoints. The standard surface area for a webhook management API:
- POST /v1/webhooks: register an endpoint URL and select event types to subscribe to.
- GET /v1/webhooks/:id: inspect a registration (masked secret, enabled events, status).
- PATCH /v1/webhooks/:id: update the URL or event subscriptions.
- DELETE /v1/webhooks/:id: deregister.
- POST /v1/webhooks/:id/test: send a test event immediately so the consumer can verify their endpoint is working.
Return the signing secret only once at registration time. Never return it again in subsequent GET calls — treat it like a password. Provide a rotation endpoint if consumers need to change their secret without downtime.
The canonical system design question is "design a webhook delivery system." The answer hits: (1) event queue (Kafka/SQS/etc.) decouples event generation from delivery workers; (2) delivery workers POST with HMAC signature + retry logic; (3) at-least-once semantics — consumers must be idempotent; (4) exponential backoff + dead-letter queue for permanent failures; (5) a delivery log so consumers can diagnose issues; (6) respond 2xx fast and process asynchronously. Mentioning ordering not being guaranteed and how consumers should handle it is a senior-level signal.
Three traps in one webhook handler: (1) skipping signature verification — anyone can POST to your endpoint; (2) doing slow work before returning 2xx — if the handler times out, the producer retries even though you may have already processed the event, and your work runs twice; (3) assuming exactly-once delivery — retries will happen; not storing the event id and checking for duplicates leads to double-charges, double-emails, and corrupted state.
Do include a timestamp in the signed payload string (not just the body), and reject events older than 5 minutes. This prevents replay attacks where an attacker captures a valid delivery and resends it later. Don't sign only the body — a signature over a static body is replayable indefinitely.
Under the hood: the signing and delivery mechanism
The phrase "sign the payload with HMAC" hides three concrete choices: what exactly gets signed, how the signature is transmitted, and how the consumer verifies it safely. This section traces the exact bytes from producer to consumer.
What the producer computes
A signature over only the raw body is replayable — a valid request captured today would pass verification forever. To prevent replay attacks, the producer includes a Unix timestamp in the signed material. The signing input is:
# Signing input = timestamp (seconds since epoch) + "." + raw request body
signed_payload = "1750430045.{\"id\":\"evt_01HZK7N5V8XTQP3MJ4GQYK29W\",\"type\":\"order.paid\",...}"
# HMAC-SHA256 over the signing input with the per-endpoint shared secret
raw_hmac = HMAC-SHA256(key=WEBHOOK_SECRET, msg=signed_payload)
signature = "sha256=" + hex_encode(raw_hmac)
# e.g. "sha256=3a7f2d9c1b8e4f0a6d5c2b9e8f7a1d3c4b5e6f7a8b9c0d1e2f3a4b5c6d7e8f9"
The webhook secret (WEBHOOK_SECRET) is generated once at registration time as a cryptographically random string (typically 32–64 bytes of entropy), stored encrypted at rest in the producer's database (not hashed, because the producer needs the original value to sign every delivery), and shown in plaintext only once to the consumer at registration. Each registered endpoint gets its own distinct secret, so a security breach at one consumer does not compromise others.
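The producer side of this scheme might look like the following sketch. The function names are hypothetical; the header names and the `whsec_` prefix follow the examples in this lesson:

```python
import hashlib
import hmac
import secrets


def new_webhook_secret() -> str:
    """Generate a per-endpoint secret: 32 random bytes, hex-encoded."""
    return "whsec_" + secrets.token_hex(32)


def sign(secret: str, timestamp: int, raw_body: bytes) -> str:
    """HMAC-SHA256 over timestamp + "." + raw body, hex-encoded."""
    signed_payload = f"{timestamp}.".encode("utf-8") + raw_body
    digest = hmac.new(secret.encode("utf-8"), signed_payload, hashlib.sha256)
    return "sha256=" + digest.hexdigest()


def delivery_headers(secret: str, event_id: str, timestamp: int, raw_body: bytes) -> dict:
    """Headers for one delivery attempt; the body must be sent byte-for-byte as signed."""
    return {
        "Content-Type": "application/json",
        "X-Hook-Timestamp": str(timestamp),
        "X-Hook-Signature": sign(secret, timestamp, raw_body),
        "X-Hook-Id": event_id,
    }


secret = new_webhook_secret()
body = b'{"id":"evt_01HZK7N5V8XTQP3MJ4GQYK29W","type":"order.paid"}'
headers = delivery_headers(secret, "evt_01HZK7N5V8XTQP3MJ4GQYK29W", 1750430045, body)
```

Signing a fresh timestamp on each retry keeps redelivered events inside the consumer's freshness window, while the event id stays the same for deduplication.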
The delivery: headers the producer sends
POST /webhooks/orders HTTP/1.1
Host: consumer.example.com
Content-Type: application/json
X-Hook-Timestamp: 1750430045 // Unix seconds — included in signed material
X-Hook-Signature: sha256=3a7f2d9c1b8... // HMAC-SHA256 over timestamp + "." + body
X-Hook-Id: evt_01HZK7N5V8XTQP3MJ4GQYK29W // event id for dedup
{
  "id": "evt_01HZK7N5V8XTQP3MJ4GQYK29W",
  "type": "order.paid",
  "created_at": "2025-06-20T14:32:00Z",
  "data": { ... }
}
What the consumer does — step by step
The consumer's handler must do five things, in this order, before touching the business logic:
import hmac, hashlib, time

# event_store, job_queue and process_event stand in for the application's
# own dedup store, work queue and worker function.
WEBHOOK_SECRET = "whsec_super_random_64_byte_hex_string"
MAX_AGE_SECONDS = 300  # 5 minutes: reject deliveries older than this

def verify_webhook(headers, raw_body: bytes):
    # Step 1: Reject stale requests (replay defence)
    # Signature alone doesn't expire. The timestamp check does.
    ts = int(headers["X-Hook-Timestamp"])
    if abs(time.time() - ts) > MAX_AGE_SECONDS:
        raise ValueError("Delivery too old: possible replay attack")

    # Step 2: Recompute the signature the producer would have made,
    # over the raw bytes exactly as received
    signed_payload = f"{ts}.".encode("utf-8") + raw_body
    expected = "sha256=" + hmac.new(
        WEBHOOK_SECRET.encode(), signed_payload, hashlib.sha256
    ).hexdigest()

    # Step 3: Constant-time comparison (prevents timing attacks)
    # A byte-by-byte compare leaks how many bytes matched via timing.
    # hmac.compare_digest does not short-circuit on the first mismatch.
    provided = headers["X-Hook-Signature"]
    if not hmac.compare_digest(expected, provided):
        raise ValueError("Signature mismatch: reject")

    # Step 4: Idempotency dedup: have we seen this event id?
    event_id = headers["X-Hook-Id"]
    if event_store.exists(event_id):
        return 200  # already seen: acknowledge without reprocessing

    # Step 5: Record the id, enqueue the work, return 200 immediately
    # NEVER do slow work here (DB writes, HTTP calls, email sends).
    # If this handler blocks and times out, the producer retries
    # and your work runs twice.
    event_store.mark_seen(event_id, ttl_days=7)
    job_queue.enqueue(process_event, raw_body)
    return 200
Why constant-time comparison matters
A naive string equality check (expected == provided) returns early as soon as a byte differs. An attacker making millions of requests can measure response time to learn how many leading bytes of their forged signature match the real one, which gives them a timing oracle. Python's hmac.compare_digest (and its equivalents in other languages) is designed so that its running time does not depend on where the inputs differ, so response time carries no information about which bytes matched. This is not theoretical: timing attacks against HMAC have been demonstrated in practice on unhardened web frameworks.
The retry schedule and DLQ
When the consumer returns a 5xx, times out, or is unreachable, the producer must retry. A well-designed retry schedule uses exponential backoff with jitter to avoid thundering-herd effects when a consumer recovers after an outage. A concrete example:
| Attempt | Delay after previous | Cumulative elapsed |
|---|---|---|
| 1 (initial) | — | 0s |
| 2 | ~30s | ~30s |
| 3 | ~2 min | ~2.5 min |
| 4 | ~10 min | ~12.5 min |
| 5 | ~30 min | ~43 min |
| 6 | ~2 h | ~2h 43 min |
| 7 | ~6 h | ~8h 43 min |
| 8 (final) | ~24 h | ~33 h |
| → DLQ | Event moved to dead-letter queue; no more automatic retries | |
Jitter means the actual delay is random within ±20% of the base value, so the retries for many events that failed during the same outage don't all hit the recovering consumer at the exact same instant. Never retry on 4xx: a 400 Bad Request or 401 Unauthorized from the consumer means the payload is wrong or the secret is wrong, so retrying will produce the same result forever.
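The schedule in the table, with ±20% jitter applied, can be sketched as follows. The constant and function names are illustrative, and a uniform distribution is one reasonable choice among several:

```python
import random
from typing import Optional

# Base delays from the table above, in seconds, for retries 1 through 7
BASE_DELAYS_SECONDS = [30, 120, 600, 1800, 7200, 21600, 86400]
JITTER = 0.20  # actual delay falls within ±20% of the base value


def retry_delay(retry_number: int) -> Optional[float]:
    """Delay before the given retry (1-based), or None once the schedule is exhausted."""
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    if retry_number > len(BASE_DELAYS_SECONDS):
        return None  # no retries left: the caller moves the event to the DLQ
    base = BASE_DELAYS_SECONDS[retry_number - 1]
    return base * random.uniform(1 - JITTER, 1 + JITTER)


total_hours = sum(BASE_DELAYS_SECONDS) / 3600  # nominal retry window, before jitter
```

The base delays sum to 117,750 seconds, or roughly 33 hours, which matches the cumulative column of the table.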
How to debug & inspect it
Webhook failures surface in one of three ways: the producer's delivery dashboard shows failed attempts, the consumer's logs show rejection errors, or a business process silently stops working (order not fulfilled, invoice not sent). Start at the producer's delivery log, not the consumer's handler.
Verify a signature locally
If you suspect a signature mismatch, reproduce the producer's computation locally using the raw body and the timestamp from the delivery log:
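A minimal local reproduction, assuming the signing scheme described in this lesson (HMAC-SHA256 over timestamp + "." + raw body). The secret, timestamp, and body below are placeholders to be replaced with the values from the delivery log:

```python
import hashlib
import hmac
import json


def expected_signature(secret: str, timestamp: str, raw_body: bytes) -> str:
    """Recompute the signature header value the producer should have sent."""
    signed_payload = timestamp.encode("utf-8") + b"." + raw_body
    digest = hmac.new(secret.encode("utf-8"), signed_payload, hashlib.sha256)
    return "sha256=" + digest.hexdigest()


# Placeholder values: substitute the real ones from the delivery log
secret = "whsec_super_random_64_byte_hex_string"
timestamp = "1750430045"
raw_body = b'{"id":"evt_01HZK7N5V8XTQP3MJ4GQYK29W","type":"order.paid"}'

from_raw = expected_signature(secret, timestamp, raw_body)

# A common cause of mismatches: the handler parsed the JSON and re-serialized it,
# which changes whitespace and therefore the signed bytes.
reserialized_body = json.dumps(json.loads(raw_body)).encode("utf-8")
from_reserialized = expected_signature(secret, timestamp, reserialized_body)

print(from_raw == from_reserialized)  # False: the bytes differ, so the HMAC differs
```

If the value computed from the logged raw body matches the header the producer sent but the consumer still rejects the delivery, the consumer is most likely verifying against something other than the raw bytes.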
Inspect delivery attempts from the producer API
Replay an event from the DLQ
| Symptom | Likely cause | Fix |
|---|---|---|
| Consumer returns 400 to every delivery | Signature verification is rejecting the request — wrong secret, or body was modified (e.g. by a reverse proxy that re-compresses JSON) | Reproduce the HMAC locally (see above); verify the consumer reads the raw unmodified request body, not a parsed-and-re-serialised version |
| Consumer returns 403 to every delivery | Timestamp check is failing: consumer clock skewed, or MAX_AGE too tight | Check date on the consumer server; NTP sync; widen MAX_AGE temporarily to diagnose; Stripe's default tolerance is 5 min, while GitHub's scheme has no timestamp check at all, so pick what suits your threat model |
| Delivery log shows 200 but the business effect ran twice | Consumer is doing slow work before returning 200: an earlier attempt timed out, was retried, and was processed twice | Move all work to an async queue; return 200 within 5 seconds of the signature check passing |
| Events arrive out of order | Retry timing: an earlier event that failed on attempt 1 arrives after a later event that succeeded on attempt 1 | Consumer must use created_at + object state comparison to detect stale events; last-write-wins on timestamp |
| DLQ depth growing; no errors logged by consumer | Consumer endpoint is unreachable (DNS failure, firewall rule, expired TLS cert) | Test the endpoint URL directly with curl; check TLS cert expiry; check firewall/ingress rules |
| Same event processed multiple times despite idempotency check | Dedup store (Redis/DB) has a TTL shorter than the retry window, or the mark_seen write happens after processing instead of before | Set dedup TTL ≥ max retry window (~72 h); write the event id to the store before enqueuing (not after) |
Debug checklist:
- Check the producer's delivery log for the event — what HTTP status did the consumer return?
- If signature mismatch (400/403): reproduce the HMAC locally with the raw body and timestamp; check that no middleware is rewriting the body.
- If timestamp failure: verify the consumer's system clock is NTP-synced; check MAX_AGE_SECONDS.
- If 200 returned but work ran twice: ensure slow processing is fully deferred to an async worker before the 200 is returned.
- If DLQ has entries: replay manually after fixing the underlying consumer bug; add a DLQ depth alarm for early warning.
- If duplicate effects despite dedup: confirm the event ID is stored before processing begins, and that the TTL of the dedup store exceeds the full retry window.
In production: how leading APIs do it
Stripe, GitHub, and HubSpot each ship a publicly documented webhook signing scheme. Comparing them side by side reveals a common core — HMAC-SHA256 over the raw body — and meaningful differences in what gets mixed into the signing input and how replay attacks are blocked.
| Provider | Signature header | Signing input | Replay defence | Docs |
|---|---|---|---|---|
| Stripe | Stripe-Signature: t=<unix>,v1=<hex> | HMAC-SHA256 over "{timestamp}.{raw_body}" using the endpoint signing secret | Reject if now and t differ by more than the tolerance window (default 5 min); timestamp is in the signed material so it cannot be stripped | Stripe webhook signatures |
| GitHub | X-Hub-Signature-256: sha256=<hex> | HMAC-SHA256 over the raw request body only, using the webhook secret | No timestamp in the signature; replay defence relies on HTTPS and the consumer's own idempotency checks | GitHub validating deliveries |
| HubSpot | X-HubSpot-Signature (v3: HMAC-SHA256) | HMAC-SHA256 over method + URI + body + timestamp using the app's client secret | Timestamp included in signed material; recommended to reject requests outside a tolerance window | HubSpot validating requests |
Deep dive: Stripe's timestamp-in-signature replay defence
Stripe's design is worth examining as a pattern to follow when you build your own webhook producer. The key insight is that the timestamp is not just sent as a separate header — it is concatenated into the signing input itself: "{t}.{raw_body}". This means an attacker who captures a valid delivery cannot strip the timestamp or substitute a newer one without invalidating the signature, because any change to the signed string produces a completely different HMAC output.
The consumer receives two pieces from the Stripe-Signature header: the timestamp (t=) and the HMAC (v1=). Verification proceeds in two independent steps: (1) recompute HMAC-SHA256(secret, "{t}.{body}") and compare with v1= using a constant-time function — this proves the payload is authentic and that the specific timestamp was present when the producer signed it; (2) check that |now − t| ≤ tolerance — this rejects a genuine signature that was captured and replayed after the window expires. Neither check alone is sufficient: signature without timestamp check allows replay; timestamp check without signature check allows forgery. GitHub's scheme, by contrast, omits the timestamp from the signed material, which means a captured delivery with a valid signature is replayable indefinitely unless the consumer implements its own event-ID deduplication.
The practical implication for your own webhook design: always include a timestamp in the signed string, not merely as a header, and reject deliveries outside a tolerance window. This is the pattern both Stripe and HubSpot (v3) converge on.
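The two checks can be sketched for a header in the `t=<unix>,v1=<hex>` shape. This is modelled on the format described above and is not Stripe's library code; the function names, the `now` parameter (included so the check is testable), and the sample secret are assumptions:

```python
import hashlib
import hmac
import time
from typing import List, Optional, Tuple


def parse_signature_header(header: str) -> Tuple[int, List[str]]:
    """Split "t=<unix>,v1=<hex>[,v1=<hex>...]" into a timestamp and candidate signatures."""
    timestamp = None
    signatures = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = int(value)
        elif key == "v1":
            signatures.append(value)
    if timestamp is None or not signatures:
        raise ValueError("Malformed signature header")
    return timestamp, signatures


def verify(header: str, raw_body: bytes, secret: str,
           tolerance: int = 300, now: Optional[float] = None) -> bool:
    timestamp, signatures = parse_signature_header(header)
    signed = f"{timestamp}.".encode("utf-8") + raw_body
    expected = hmac.new(secret.encode("utf-8"), signed, hashlib.sha256).hexdigest()
    # Check 1: authenticity, compared in constant time
    authentic = any(hmac.compare_digest(expected, s) for s in signatures)
    # Check 2: freshness, which rejects a genuine but replayed delivery
    current = time.time() if now is None else now
    fresh = abs(current - timestamp) <= tolerance
    return authentic and fresh


# Usage with a self-made signature (sample values only)
demo_secret = "whsec_example"
demo_body = b'{"id":"evt_1","type":"order.paid"}'
demo_t = 1750430045
demo_sig = hmac.new(demo_secret.encode(), f"{demo_t}.".encode() + demo_body,
                    hashlib.sha256).hexdigest()
demo_header = f"t={demo_t},v1={demo_sig}"
ok = verify(demo_header, demo_body, demo_secret, now=demo_t + 10)
```

Accepting several `v1` entries lets a producer sign with both the old and the new secret while a rotation is in progress.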
🧠 Quick check
1. Why must a webhook consumer return a 2xx response immediately and process the event asynchronously?
The producer's retry logic fires whenever a 2xx does not arrive before its delivery timeout, so a 200 sent after a 30-second timeout still counts as a failure. If the consumer already did the work before timing out, the retry causes double processing. The fix: acknowledge fast, enqueue the slow work for an async worker.
2. A consumer receives the same webhook event twice (same id). What should it do on the second delivery?
At-least-once delivery means duplicates happen. The correct response is to acknowledge with 2xx (so the producer stops retrying) but skip processing because the event id was already stored as seen. Returning 400 would trigger retries on some systems and not others — unreliable. Always ack, never reprocess a duplicate.
3. What is the purpose of including a timestamp in the HMAC signature computation (rather than signing only the body)?
If the signature covers only the body, an attacker who captures a legitimate delivery can replay it forever — the signature will always verify. Including a timestamp (and rejecting events older than a few minutes) means replayed packets fail the age check, even if the signature itself is valid.
4. After exhausting all retries on a failed webhook delivery, what should the producer do with the event?
Discarding means the consumer permanently misses the event. Retrying forever wastes resources and could violate rate limits. A dead-letter queue is the right answer: preserve the event, surface it in a dashboard, and let operations replay it once the consumer recovers.
✍️ Exercise: design a webhook system for a payment platform
You are designing the webhook system for a payment platform. Merchants register HTTPS endpoints and receive events like payment.succeeded, payment.failed, and refund.created. Sketch the following:
- The full JSON payload shape for a payment.succeeded event.
- The headers you would send with each delivery and why.
- The retry schedule and what happens after all retries are exhausted.
- What a merchant's endpoint handler should do in the first 500ms of receiving the delivery, and what it should defer.
Model answer:
1. Payload: Minimum required: id (globally unique event id, used as idempotency key), type ("payment.succeeded"), api_version (date string), created_at (ISO-8601 UTC), livemode (boolean), data.object containing the payment object (id, amount, currency, merchant_id, customer_id, status). Including the full object snapshot makes the event self-contained — the merchant doesn't need a follow-up API call.
2. Headers: Content-Type: application/json; X-Hook-Id (event id for logging); X-Hook-Timestamp (Unix timestamp of delivery attempt); X-Hook-Signature: sha256=… (HMAC-SHA256 of timestamp + "." + body using the merchant's webhook secret).
3. Retry schedule: Retry on 5xx and timeouts; never retry on 4xx. Schedule: 30s → 2m → 10m → 30m → 2h → 6h → 24h (7 retries after the initial attempt, spanning ~33 hours). After exhausting retries, move to DLQ. Notify the merchant via email/dashboard. Provide a replay API.
4. Handler first 500ms: Verify HMAC signature (reject with 401/403 if invalid); check timestamp freshness (reject if older than 5 minutes); look up event id in seen-events store (return 200 immediately if duplicate); enqueue event id and raw body to an async job queue; return 200 OK. Deferred (async worker): parse payload, run business logic (update order status, send receipt email, release inventory).
Rubric: ✓ Payload includes all six required fields ✓ Signature header with timestamp anti-replay ✓ Correct retry targets (5xx/timeout, not 4xx) ✓ DLQ after exhaustion ✓ Handler returns 2xx before doing slow work ✓ Idempotency check by event id ✓ Constant-time signature comparison mentioned or implied.
Key takeaways
- Webhooks are an outbound API: design the payload contract with the same rigour as an inbound REST endpoint — event id, type, timestamp, api_version.
- Sign every delivery with HMAC-SHA256 over timestamp + body; consumers must verify before trusting any payload.
- At-least-once semantics: the same event will arrive more than once; consumers must store the event id and skip reprocessing duplicates.
- Return 2xx fast — under 5 seconds — and do all slow work in an async worker. Slow handlers cause retries that cause double-processing.
- Exponential backoff + dead-letter queue: retry smartly; never discard permanently undelivered events — put them in a DLQ for replay.
- Ordering is not guaranteed: consumers must use timestamps and object-state comparisons to handle out-of-order delivery.
Sources & further reading
- Stripe — Webhooks documentation (payload shape, HMAC signing, best practices)
- GitHub — Webhooks documentation (event types, securing, delivery validation)
- GitHub — Validating webhook deliveries
- Lesson rel-02 — Idempotency (at-least-once handling)
- Lesson rel-05 — Retries and backoff