Debugging & Real-World · Lesson 04
Debugging webhooks
Webhooks flip the direction of an API call — instead of you asking the provider for data, the provider sends it to you when something happens. That reversal creates a whole new class of bugs: your endpoint goes down, the provider retries, events arrive out of order, or a malicious actor sends fake payloads. Knowing how to debug each failure mode is essential for any production integration.
By the end you'll be able to
- Describe the webhook delivery lifecycle — event, POST, response, retry — and where each failure mode lives.
- Verify a webhook payload's HMAC signature before processing it.
- Design an idempotent webhook handler that survives duplicate deliveries and out-of-order events.
What a webhook is (and why it's harder to debug than a regular API call)
When you call an API endpoint, you control both sides of the conversation: you send the request, you wait for the response, you handle the error. Webhooks break that symmetry. The provider initiates the call at a time you didn't choose, to a URL you registered in advance. Your endpoint must be publicly reachable, must respond quickly, and must handle the same event being delivered more than once.
That asymmetry creates three debugging challenges that don't exist for outbound API calls:
- You can't reproduce a delivery on demand (you have to wait for the provider to fire the event, or use a replay feature if they have one).
- You can't intercept the delivery in a browser DevTools panel — you need server-side logging.
- You can't easily test locally without a tunnel (the provider's server can't reach localhost:3000).
The delivery lifecycle
Missed events: your endpoint was down
The most common webhook problem: your endpoint was down (a deploy, a crash, an expired TLS certificate, a firewall change) and events arrived while it was unreachable. Reputable providers retry failed deliveries with exponential backoff — typically for 24–72 hours before giving up. When your endpoint comes back up, it will receive a burst of retried deliveries.
To debug missed events:
- Check the provider's webhook delivery dashboard. Most providers (Stripe, GitHub, Twilio, etc.) show a log of every delivery attempt, the HTTP status your endpoint returned, and the exact timestamp. This tells you exactly which events were missed and whether they were retried.
- If the provider retried them and your endpoint was up by then, the events should have been processed. Check your database or job queue for those event IDs.
- If the provider has a "replay" feature, use it to resend specific events. Do this only after your endpoint is back up and healthy — replaying into a broken endpoint just adds more failures to the log.
- If the retry window has passed, you'll need to reconcile by calling the provider's REST API to fetch the current state of affected resources (e.g., query all payments with status succeeded in the past 24 hours).
Verifying signatures: HMAC to the rescue
Your webhook endpoint is a public URL. Anyone can POST to it with a crafted payload that looks like a real event. If you trust the payload without verification, an attacker can trigger arbitrary actions in your system — crediting accounts, fulfilling orders, changing user data — by sending fake webhook events.
The standard defence is an HMAC (Hash-based Message Authentication Code) signature. The provider sends a header (e.g., Stripe-Signature, X-Hub-Signature-256) containing a hash of the raw request body using a shared secret. Your handler computes the same hash using the same secret and compares. If they match, the payload came from the provider. If they don't, reject it with 400.
JSON parsers may reformat the payload — changing whitespace, reordering keys, normalizing numbers. The signature was computed over the exact bytes the provider sent. If you parse the JSON first and then re-serialize it for signature verification, the bytes will differ and the signature will never match. Always capture the raw request body as a byte string before parsing, and use that for verification.
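The effect is easy to see in a few lines. The following is a minimal sketch, assuming a made-up secret (`whsec_example`) and a tiny made-up payload; it signs the raw bytes the way a provider would, then shows that a parse-and-re-serialize round trip produces different bytes and therefore a different HMAC.

```python
import hashlib
import hmac
import json

SECRET = b"whsec_example"  # hypothetical signing secret, for illustration only


def sign(body: bytes) -> str:
    """HMAC-SHA256 over the given bytes, hex-encoded."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


# Bytes exactly as a provider might send them (compact, no spaces).
raw_body = b'{"id":"evt_1","amount":4900}'
provider_signature = sign(raw_body)

# Same data, different bytes: json.dumps adds a space after ':' and ','
# by default, so the re-serialized body no longer matches what was signed.
reserialized = json.dumps(json.loads(raw_body)).encode()

raw_matches = hmac.compare_digest(sign(raw_body), provider_signature)
reserialized_matches = hmac.compare_digest(sign(reserialized), provider_signature)

print(raw_matches, reserialized_matches)
```

In most web frameworks this means reading the body as bytes from the request object before any JSON middleware touches it.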
Idempotent handling: events arrive more than once
Webhook delivery is "at-least-once": the provider will retry until it receives a 2xx response. If your endpoint processes the event and then crashes before returning 200, the provider will retry and your handler will process the same event again. Your handler must be idempotent — processing the same event twice must produce the same result as processing it once.
The standard technique is to record each event ID in a deduplication store (a database table, a Redis set) before processing it. On receipt, check: have we seen this event ID before? If yes, return 200 immediately without reprocessing. This is called idempotent consumption. See Lesson 05 for the full pattern with examples.
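As a preview, here is a minimal sketch of a deduplication store, assuming an in-memory SQLite database and a hypothetical table name (`seen_events`). It folds the "have we seen this?" check and the "record it" step into one atomic insert, so two concurrent deliveries of the same event cannot both pass the check.

```python
import sqlite3

# In-memory database for illustration; a real handler would use a shared,
# durable store so that all workers see the same set of event IDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen_events (event_id TEXT PRIMARY KEY)")


def claim_event(event_id: str) -> bool:
    """Return True the first time an event ID is seen, False on duplicates."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen_events (event_id) VALUES (?)",
        (event_id,),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the primary key
    # already existed and the insert was ignored.
    return cur.rowcount == 1
```

A handler would call `claim_event(event["id"])` and return 200 without further work whenever it returns `False`.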
Ordering is not guaranteed
Events can arrive out of order. A payment_intent.updated event may arrive after a payment_intent.succeeded event that was created later, because the provider's delivery infrastructure doesn't guarantee ordering across events of different types (or sometimes even the same type).
Design your handlers to be order-agnostic. Use the event's created timestamp to decide whether a newer event should overwrite an older state. Don't assume that the first event you receive is the earliest event chronologically.
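One way to make a handler order-agnostic is to store the created timestamp of the newest event applied to each resource and ignore anything older. The sketch below assumes a simplified event shape (a dict with `created` and `status` keys) and a hypothetical `ResourceState` record; real payloads are richer.

```python
from dataclasses import dataclass


@dataclass
class ResourceState:
    status: str
    last_event_created: int  # Unix seconds of the newest event applied so far


def apply_event(state: ResourceState, event: dict) -> ResourceState:
    """Apply an event only if it is not older than the state already stored."""
    if event["created"] < state.last_event_created:
        # Stale event: a newer one was already applied, so keep current state.
        return state
    # Events with an equal timestamp are applied in arrival order; whether
    # that is acceptable depends on the provider's timestamp resolution.
    return ResourceState(status=event["status"], last_event_created=event["created"])
```

For example, if a succeeded event created at t=200 arrives first and a processing event created at t=100 arrives second, the second call returns the state unchanged.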
Return 200 fast — process asynchronously
Most providers time out a webhook delivery after 5–30 seconds. If your handler tries to do everything synchronously (write to the database, send an email, update an external API), it may not finish in time, and the provider will retry, assuming the delivery failed. The correct pattern:
- Receive the POST request.
- Verify the signature.
- Write the raw event payload to a queue or job table (this should take under 100 ms).
- Return 200 immediately.
- A separate background worker reads the queue and processes the event — takes as long as it needs.
This pattern also means your endpoint is resilient to downstream slowness. If the email service is having a slow moment, it doesn't make your webhook endpoint time out and trigger a retry storm.
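The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: `handle_delivery` and `worker` are hypothetical names, the signature result is passed in as a boolean, and the in-memory `queue.Queue` stands in for a durable queue or job table (an in-memory queue loses pending jobs if the process crashes).

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()
processed = []  # stand-in for the side effects of real business logic


def handle_delivery(raw_body: bytes, signature_ok: bool) -> int:
    """Fast path: reject bad signatures, enqueue everything else, return a status."""
    if not signature_ok:
        return 400
    jobs.put(raw_body)  # cheap; no downstream calls happen here
    return 200


def worker() -> None:
    """Slow path: runs outside the request and can take as long as it needs."""
    while True:
        raw_body = jobs.get()
        try:
            processed.append(raw_body)  # replace with DB writes, emails, etc.
        finally:
            jobs.task_done()


threading.Thread(target=worker, daemon=True).start()
```

The handler's response time now depends only on the enqueue step, not on how long the worker takes.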
Testing locally with a tunnel
The provider can't POST to localhost:3000. To test locally, you need a tunnel: a tool that creates a public URL that forwards traffic to your local port. Common options: ngrok, Cloudflare Tunnel, and the Stripe CLI's built-in webhook proxy (stripe listen --forward-to localhost:3000/webhooks).
Interviewers asking about webhooks frequently probe three specifics: signature verification (HMAC), idempotency (dedup by event ID), and the "return 200 fast, process async" pattern. Knowing all three — and being able to explain why each is necessary — signals production experience. The most common gap is candidates who understand signatures but haven't thought about ordering or the delivery-before-processing pattern.
Every major webhook provider has a way to trigger test events — Stripe's CLI, GitHub's "redeliver" button, Twilio's webhook test tool. Use it. Don't wait for a real production event to test a code change. Fake events with known payloads are faster, reproducible, and don't have real-world side effects (no charges, no emails sent to real users).
Under the hood: how it actually works
Every part of the phrase "verify the HMAC signature" hides a concrete mechanism. Here is exactly what both sides do.
How the provider computes the signature
When an event fires, the provider constructs the signing input by concatenating the Unix timestamp and the raw JSON body with a dot separator, then computes HMAC-SHA256 over that string using the webhook signing secret you configured. The result is hex-encoded and placed in the delivery header alongside the timestamp. Stripe's format is representative:
```python
import hashlib
import hmac

# The raw payload bytes the provider will sign:
raw_body = b'{"id":"evt_1NkLmXY","type":"payment_intent.succeeded","data":{"object":{"id":"pi_3N1abc","amount":4900}}}'

# Timestamp included in the signature (Unix seconds, as a string):
timestamp = "1718380860"

# Signing input: timestamp + "." + raw_body
signing_input = timestamp.encode() + b"." + raw_body

# HMAC-SHA256(key=signing_secret, msg=signing_input) → hex digest
signature = hmac.new(
    b"whsec_test_abc123",  # your signing secret
    signing_input,
    hashlib.sha256,
).hexdigest()
# → e.g. "7d3e4c2a1f8b0e9d6c5a4b3f2e1d0c9a8b7e6f5d4c3b2a1..."

# Header sent on every delivery:
# Stripe-Signature: t=1718380860,v1=7d3e4c2a1f8b0e9d6c5a4b3f2e1d0c9a8b7e6f5d4c3b2a1...
```
How your endpoint recomputes and compares
Your handler parses the header into its components, reconstructs the same signing input, computes the same HMAC, and compares the two hex digests. The comparison must be constant-time — using a regular string equality (==) leaks how many leading bytes matched via timing, which an attacker on the same network can measure to forge signatures byte-by-byte. Python's hmac.compare_digest, Node's crypto.timingSafeEqual, and Go's hmac.Equal are all constant-time.
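A minimal sketch of the receiving side, assuming the Stripe-style `t=...,v1=...` header format described above. The function names are hypothetical, and in production you would normally prefer the provider's official library, which implements the same check.

```python
import hashlib
import hmac


def parse_signature_header(header: str) -> tuple[str, list[str]]:
    """Split 't=...,v1=...' into the timestamp and the list of v1 signatures."""
    timestamp = ""
    signatures = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            signatures.append(value)
    return timestamp, signatures


def verify(raw_body: bytes, header: str, secret: bytes) -> bool:
    """Recompute the HMAC over 't.raw_body' and compare in constant time."""
    timestamp, signatures = parse_signature_header(header)
    if not timestamp or not signatures:
        return False
    signing_input = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret, signing_input, hashlib.sha256).hexdigest()
    # Compare as bytes: compare_digest raises TypeError on non-ASCII str input,
    # and the header is attacker-controlled.
    return any(
        hmac.compare_digest(expected.encode(), candidate.encode())
        for candidate in signatures
    )
```

The header can carry more than one `v1` value (for example while a secret is being rotated), which is why the sketch accepts a match against any of them.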
The timestamp window that blocks replay attacks
The provider includes the timestamp so your handler can reject replays. If an attacker captures a valid delivery (correct signature), they can re-POST it minutes or hours later and get the same effect (a second charge, a second email). The timestamp window closes this gap: reject any delivery whose t= value is more than N seconds old. Five minutes (300 s) is the Stripe default; tighter is safer but requires your clocks to be in sync (NTP). This does not protect against replay within the window — that is idempotency's job (dedupe by event ID).
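The window check itself is only a few lines. The sketch below assumes the timestamp has already been extracted from a header whose signature was verified (an unverified timestamp proves nothing, since an attacker can set it freely). The name `is_fresh` and the `now` parameter are illustrative, and rejecting timestamps too far in the future as well is an extra choice made here, not something the text above requires.

```python
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # five minutes, matching the default mentioned above


def is_fresh(
    timestamp: str,
    now: Optional[float] = None,
    tolerance: int = TOLERANCE_SECONDS,
) -> bool:
    """Return True if the signed timestamp is within the tolerance window."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False  # malformed timestamp: treat as stale
    current = time.time() if now is None else now
    # abs() also rejects timestamps far in the future (clock skew or tampering).
    return abs(current - sent_at) <= tolerance
```

Passing `now` explicitly makes the check easy to unit-test without patching the clock.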
The retry schedule and why idempotency is mandatory
Delivery uses exponential backoff. An illustrative schedule: first failure → retry after ~5 s; second failure → ~30 s; third → ~5 min; then ~30 min, ~2 h, ~5 h, ~10 h, and so on, with the gaps growing until the provider's retry window (72 hours in this example) closes and the event is marked failed. Exact intervals and attempt counts vary by provider, so check the provider's documentation. Because the provider retries until it sees a 2xx, and because your endpoint may crash after processing but before returning 200, the same event can be processed more than once. Idempotency (dedupe by event ID) is not optional.
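To get a feel for how quickly the gaps grow, here is a small sketch of capped exponential backoff. The base, factor, and cap are assumptions chosen for illustration; they do not reproduce any particular provider's schedule.

```python
def backoff_delays(
    base_seconds: float,
    factor: float,
    cap_seconds: float,
    attempts: int,
) -> list[float]:
    """Delay before each retry: base * factor**n, never exceeding the cap."""
    return [min(base_seconds * factor**n, cap_seconds) for n in range(attempts)]


# Illustrative values: start at 5 s, multiply by 6 each time, cap at 24 h.
delays = backoff_delays(base_seconds=5, factor=6, cap_seconds=86_400, attempts=8)
```

Real providers often add jitter and use hand-tuned intervals rather than a pure geometric series.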
Traced delivery: from event fire to retried delivery
The timestamp window (reject deliveries older than 5 min) stops a replay that arrives after the window. But the provider itself can legitimately retry the same event within seconds — that is not a replay attack, that is normal delivery. Your endpoint will see the same event ID more than once within the window, and the timestamp check will pass both times. Only the dedupe-by-event-ID step prevents the double processing. Both defences are required; neither replaces the other.
- Stripe — Webhook signature verification (exact header format & algorithm)
- RFC 2104 — HMAC specification
In production: how leading APIs do it
When your webhook handler fails signature verification, the debugging question is always the same: are you reproducing the exact bytes and signing algorithm the provider used? Stripe, GitHub, and HubSpot all use HMAC-SHA256 but differ in what goes into the signing input — and getting that wrong is the most common cause of persistent 400 errors in the delivery log.
| Provider | Signature header | What is signed | Key debug gotcha | Docs |
|---|---|---|---|---|
| Stripe | Stripe-Signature: t=<unix>,v1=<hex> | HMAC-SHA256 over "{t}.{raw_body}"; the timestamp is part of the signed material | If you sign only raw_body without prepending "{t}.", the computed HMAC will never match, even with the correct secret | Stripe webhook signatures |
| GitHub | X-Hub-Signature-256: sha256=<hex> | HMAC-SHA256 over the raw request body only, using the webhook secret | No timestamp in the signed material; if your reverse proxy modifies or re-encodes the body (e.g., re-serialises JSON), the bytes change and verification fails | GitHub validating deliveries |
| HubSpot | X-HubSpot-Signature-v3 | HMAC-SHA256 over method + URI + body + timestamp using the app client secret | The URI must exactly match the registered endpoint URL including scheme and path; a redirect (HTTP → HTTPS) changes the URI and breaks the signature | HubSpot validating requests |
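For contrast with the Stripe format, here is a minimal sketch of the GitHub-style check from the table: the signing input is the raw body alone, and the header value carries a `sha256=` prefix. The function name is hypothetical.

```python
import hashlib
import hmac


def verify_github_style(raw_body: bytes, header: str, secret: bytes) -> bool:
    """Check an 'X-Hub-Signature-256: sha256=<hex>' value against the raw body."""
    prefix = "sha256="
    if not header.startswith(prefix):
        return False
    received = header[len(prefix):]
    # No timestamp here: the HMAC covers only the raw request bytes.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), received.encode())
```

Because nothing but the body is signed, any middleware that rewrites the body is the first thing to rule out when this check fails.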
Deep dive: reproducing Stripe's HMAC locally to verify a failed delivery
Stripe's Stripe-Signature header has the form t=1718380860,v1=7d3e4c2a.... To reproduce the signature, you need three things from the delivery log: the raw body bytes exactly as received, the timestamp value (t=), and the endpoint signing secret. The signing input is "{t}.{raw_body}" — a string concatenation, not JSON wrapping. A common mistake is to use the parsed-and-re-serialised body instead of the raw bytes. JSON parsers frequently reorder keys or normalise whitespace; even a single added space produces a completely different HMAC and a permanent mismatch.
To reproduce the signature from a shell, recompute the HMAC over the exact bytes of a failed delivery and compare the result with the v1 value in the header.
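The same check can be scripted. The following is a minimal Python sketch, runnable from a shell, assuming you have saved the raw body bytes, the full header value, and the signing secret from the failed delivery; the `diagnose` helper and its output keys are hypothetical.

```python
import hashlib
import hmac


def diagnose(raw_body: bytes, header: str, secret: str) -> dict:
    """Recompute a Stripe-style signature and report what was compared."""
    # If the header repeats a key (e.g. several v1 values), the last one wins.
    fields = dict(
        part.strip().split("=", 1) for part in header.split(",") if "=" in part
    )
    timestamp = fields.get("t", "")
    received = fields.get("v1", "")
    signing_input = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signing_input, hashlib.sha256).hexdigest()
    return {
        "body_length": len(raw_body),  # compare with Content-Length
        "body_sha256": hashlib.sha256(raw_body).hexdigest(),
        "expected_v1": expected,
        "received_v1": received,
        "match": hmac.compare_digest(expected.encode(), received.encode()),
    }
```

If `match` is false while `body_length` agrees with the Content-Length header, the secret or the signing-input construction is the more likely culprit than the body.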
The two most reliable ways to confirm the raw bytes are correct: (1) log len(raw_body) and compare it with the Content-Length header; any mismatch means a middleware modified the body; (2) compute sha256(raw_body) in your handler and compare it with the sha256 of the payload as the provider sent it (for example, the body shown in the provider's delivery log, where one is available). If the hashes match, the body is intact and the secret or the signing input is the likely culprit. If the hashes differ, the body was modified in transit.
🧠 Quick check
1. Your webhook endpoint was down for 20 minutes. When it came back up, it received a burst of events. What is the most important property your handler must have?
During the outage, some events may have been partially processed before the crash. The retried deliveries will re-deliver those same event IDs. An idempotent handler records event IDs it has seen and skips reprocessing them — preventing double charges, duplicate emails, etc. Ordering guarantees are about a different problem; 5-minute rejection would incorrectly drop retried legitimate events.
2. You're verifying a webhook HMAC signature. You parse the JSON payload, re-serialize it, and compute the HMAC over the result. The signature never matches. Why?
The provider signs the exact raw bytes of the request body. Any parsing and re-serialization can change the bytes — even a single extra space or a reordered key makes the HMAC mismatch. Always capture the raw body as bytes before any parsing, and compute the HMAC over those raw bytes.
3. Why should a webhook handler return 200 immediately and process the event in a background worker?
Providers have a delivery timeout (typically 5–30 seconds). If your synchronous handler takes 10 seconds to send an email and the timeout is 5 seconds, the provider will mark the delivery as failed and retry. You'd process the event twice — once from the original delivery that timed out, once from the retry. Returning 200 fast and deferring work prevents this.
4. You receive two webhook events for the same resource, and the second-received event has an earlier created timestamp than the first-received event. What should you do?
Webhook delivery order is not guaranteed. The second-received event may represent an earlier state of the resource. Use the event's created timestamp (which the provider sets) to determine chronological order, and guard against applying stale state on top of fresh state.
✍️ Exercise: design a robust webhook handler
You're building a webhook handler for a payment provider. The provider sends a payment.completed event when a payment succeeds. Your handler should: verify the signature, deduplicate by event ID, return 200 fast, and process asynchronously. Sketch the handler in pseudocode and explain each design decision.
Model answer:
```python
# POST /webhooks/payments
def handle_payment_webhook(request):
    # 1. Capture raw body BEFORE any parsing
    raw_body = request.raw_body_bytes
    sig_header = request.headers["X-Payment-Signature"]

    # 2. Verify signature — reject immediately if invalid
    if not verify_hmac(raw_body, sig_header, WEBHOOK_SECRET):
        return Response("invalid signature", status=400)

    # 3. Parse JSON only after verification
    event = json.loads(raw_body)

    # 4. Deduplicate by event ID
    if db.seen_event(event["id"]):
        return Response("already processed", status=200)  # 200, not 409

    # 5. Write to job queue (fast — milliseconds)
    job_queue.enqueue("process_payment_event", payload=event)
    db.record_event(event["id"])

    # 6. Return 200 immediately
    return Response("accepted", status=200)


# Background worker (separate process)
def process_payment_event(event):
    if event["type"] == "payment.completed":
        order = db.find_order(event["metadata"]["order_id"])
        order.mark_paid()
        email.send_confirmation(order.customer_email)
```
Design decisions to explain:
- Raw body captured before json.loads(): the signature is computed over the exact bytes.
- 400 on invalid signature: don't process unauthenticated payloads. Many providers retry on any non-2xx response, so the 400 rejects the request but doesn't necessarily stop retries.
- 200 on a duplicate event ID — the provider already delivered this event successfully; returning 200 prevents unnecessary retries.
- Job queue for async processing — handler returns in under 100 ms regardless of downstream latency.
- Background worker handles business logic — if it fails, the event is already in the queue and can be retried independently.
Rubric: ✓ Raw body used for signature verification ✓ Signature check before JSON parse ✓ Deduplication returns 200 (not 409) ✓ Enqueue before return ✓ Business logic in worker, not handler ✓ At least 3 design decisions explained.
Key takeaways
- Webhooks are provider-initiated: your endpoint must be public, fast, and idempotent.
- Verify the HMAC signature using the raw request bytes, before parsing JSON, with a constant-time comparison.
- Events can arrive more than once — deduplicate by event ID before processing.
- Events can arrive out of order: use the event's created timestamp, not delivery order, to determine current state.
- Return 200 fast: enqueue the payload and let a background worker do the actual work.
- Missed events: check the provider's delivery dashboard, use the replay feature, or reconcile via the REST API for events whose retry window has passed.
- Test locally with a tunnel (ngrok, Stripe CLI, Cloudflare Tunnel) so you can receive real provider deliveries against your local server.