API Design

Debugging & Real-World · Lesson 04

Debugging webhooks

Webhooks flip the direction of an API call — instead of you asking the provider for data, the provider sends it to you when something happens. That reversal creates a whole new class of bugs: your endpoint goes down, the provider retries, events arrive out of order, or a malicious actor sends fake payloads. Knowing how to debug each failure mode is essential for any production integration.

⏱ 14 min Difficulty: core Prereq: Lesson 07 (HTTP), dbg-01, dbg-02

By the end you'll be able to

What a webhook is (and why it's harder to debug than a regular API call)

When you call an API endpoint, you control both sides of the conversation: you send the request, you wait for the response, you handle the error. Webhooks break that symmetry. The provider initiates the call at a time you didn't choose, to a URL you registered in advance. Your endpoint must be publicly reachable, must respond quickly, and must handle the same event being delivered more than once.

That asymmetry creates three debugging challenges that don't exist for outbound API calls:

  1. You can't reproduce a delivery on demand (you have to wait for the provider to fire the event, or use a replay feature if they have one).
  2. You can't intercept the delivery in a browser DevTools panel — you need server-side logging.
  3. You can't easily test locally without a tunnel (the provider's server can't reach localhost:3000).

The delivery lifecycle

[Diagram: the provider's event fires (e.g. payment.succeeded) and is sent as POST /webhooks with a JSON body and signature header. Your endpoint verifies the signature, enqueues the event, and returns 200 OK (within ~5 s). A background worker processes the event after the response is sent. If the provider sees no 2xx within the timeout, it retries with exponential backoff.]
A webhook handler's only job is to verify the signature and return 200 fast. Real processing happens in a background worker after the response is sent.

Missed events: your endpoint was down

The most common webhook problem: your endpoint was down (a deploy, a crash, an expired TLS certificate, a firewall change) and events arrived while it was unreachable. Reputable providers retry failed deliveries with exponential backoff — typically for 24–72 hours before giving up. When your endpoint comes back up, it will receive a burst of retried deliveries.

To debug missed events:

  1. Check the provider's webhook delivery dashboard. Most providers (Stripe, GitHub, Twilio, etc.) show a log of every delivery attempt, the HTTP status your endpoint returned, and the exact timestamp. This tells you exactly which events were missed and whether they were retried.
  2. If the provider retried them and your endpoint was up by then, the events should have been processed. Check your database or job queue for those event IDs.
  3. If the provider has a "replay" feature, use it to resend specific events. Do this only after your endpoint is back up and healthy — replaying into a broken endpoint just adds more failures to the log.
  4. If the retry window has passed, you'll need to reconcile by calling the provider's REST API to fetch the current state of affected resources (e.g., query all payments with status succeeded in the past 24 hours).
# Stripe delivery log (from the dashboard or webhook CLI)
# Event: payment_intent.succeeded | ID: pi_3N1abcXYZ
# Attempt 1: 2024-06-20T14:32:01Z → 503 (your server was restarting)
# Attempt 2: 2024-06-20T14:32:31Z → 503
# Attempt 3: 2024-06-20T14:33:31Z → 200 OK ← your server came back up
# Event was eventually delivered on the 3rd attempt. Your handler must be idempotent
# because the first two attempts may have already been partially processed.
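
Once the retry window has passed, the reconciliation in step 4 is essentially a set difference between what the provider says happened and what you recorded. A minimal sketch, assuming the provider's succeeded payment IDs have already been fetched through its REST API (the fetch itself is omitted; `find_missing_payments` and both ID lists are hypothetical):

```python
def find_missing_payments(provider_ids, processed_ids):
    """Return provider payment IDs we never processed, in provider order, without repeats."""
    processed = set(processed_ids)
    already_listed = set()
    missing = []
    for payment_id in provider_ids:
        if payment_id not in processed and payment_id not in already_listed:
            missing.append(payment_id)
            already_listed.add(payment_id)
    return missing


# Hypothetical data: IDs the provider reports as succeeded vs. IDs in our database.
provider_succeeded = ["pi_001", "pi_002", "pi_003", "pi_004"]
locally_processed = ["pi_001", "pi_003"]

missing = find_missing_payments(provider_succeeded, locally_processed)
print(missing)  # ['pi_002', 'pi_004']
```

Each missing ID can then be pushed through the same processing path a webhook would have used, which is one more reason that path needs to be idempotent.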

Verifying signatures: HMAC to the rescue

Your webhook endpoint is a public URL. Anyone can POST to it with a crafted payload that looks like a real event. If you trust the payload without verification, an attacker can trigger arbitrary actions in your system — crediting accounts, fulfilling orders, changing user data — by sending fake webhook events.

The standard defence is an HMAC (Hash-based Message Authentication Code) signature. The provider sends a header (e.g., Stripe-Signature, X-Hub-Signature-256) containing a keyed hash of the raw request body (some providers also mix in a timestamp), computed with a secret that only you and the provider know. Your handler computes the same hash using the same secret and compares. If they match, the payload came from someone who holds the secret, which should only be the provider. If they don't, reject it with 400.

⚠️ Critical: verify the raw body, not the parsed body

JSON parsers may reformat the payload — changing whitespace, reordering keys, normalizing numbers. The signature was computed over the exact bytes the provider sent. If you parse the JSON first and then re-serialize it for signature verification, the bytes will differ and the signature will never match. Always capture the raw request body as a byte string before parsing, and use that for verification.
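
The effect is easy to reproduce. The sketch below uses a made-up secret and payload to show that parsing and re-serializing produces equivalent JSON but different bytes, and therefore a different HMAC:

```python
import hashlib
import hmac
import json

SECRET = b"whsec_example"  # hypothetical secret, for illustration only


def sign(body: bytes) -> str:
    """HMAC-SHA256 over the given bytes, hex-encoded."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


# Bytes as the provider sent them (note the uneven spacing).
raw_body = b'{"id": "evt_1","amount":4900}'

# Parse, then re-serialize: same data, different bytes.
reserialized = json.dumps(json.loads(raw_body)).encode()

print(json.loads(raw_body) == json.loads(reserialized))  # True
print(raw_body == reserialized)                          # False
print(sign(raw_body) == sign(reserialized))              # False
```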

# Python: verifying a Stripe-style HMAC-SHA256 webhook signature
# Header the provider sends:
#   Stripe-Signature: t=1718380860,v1=abc123def456...
import hmac, hashlib, time

def verify_signature(raw_body: bytes, sig_header: str, secret: str) -> bool:
    """Returns True if the payload is authentic."""
    # Parse "t=...,v1=..." into a dict; skip items that have no "="
    parts = dict(item.split("=", 1) for item in sig_header.split(",") if "=" in item)
    timestamp = parts.get("t", "")
    expected = parts.get("v1", "")
    # A missing or non-numeric timestamp is a failed verification, not a crash
    try:
        age = abs(time.time() - int(timestamp))
    except ValueError:
        return False
    # Reject if the timestamp is more than 5 minutes from now (replay attack protection)
    if age > 300:
        return False
    # Compute: HMAC-SHA256(secret, timestamp + "." + raw_body)
    payload = f"{timestamp}.".encode() + raw_body
    computed = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    # Constant-time comparison; comparing bytes avoids a TypeError on non-ASCII header input
    return hmac.compare_digest(computed.encode(), expected.encode())

# In your endpoint handler:
raw_body = request.get_data()  # raw bytes, BEFORE json.loads()
sig_header = request.headers.get("Stripe-Signature", "")
if not verify_signature(raw_body, sig_header, WEBHOOK_SECRET):
    return "Signature verification failed", 400
# Signature valid: now safe to process the event

Idempotent handling: events arrive more than once

Webhook delivery is "at-least-once": the provider will retry until it receives a 2xx response. If your endpoint processes the event and then crashes before returning 200, the provider will retry and your handler will process the same event again. Your handler must be idempotent — processing the same event twice must produce the same result as processing it once.

The standard technique is to record each event ID in a deduplication store (a database table, a Redis set) before processing it. On receipt, check: have we seen this event ID before? If yes, return 200 immediately without reprocessing. This is called idempotent consumption. See Lesson 05 for the full pattern with examples.
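
A minimal sketch of such a deduplication store, assuming SQLite as the backing table (the table name and `claim_event` helper are hypothetical). Making the event ID a primary key lets a single INSERT both check and record the ID, so two concurrent deliveries of the same event cannot both pass the check:

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create the dedup table; in-memory here, a real database in practice."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seen_events (event_id TEXT PRIMARY KEY)")
    return conn


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True the first time event_id is seen, False for any duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen_events (event_id) VALUES (?)", (event_id,)
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows inserted means the ID was already there


store = make_store()
print(claim_event(store, "evt_1NkLmXY"))  # True  (first delivery: process it)
print(claim_event(store, "evt_1NkLmXY"))  # False (retry: return 200 and skip)
```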

Ordering is not guaranteed

Events can arrive out of order. A payment_intent.updated event may arrive after a payment_intent.succeeded event that was created later, because the provider's delivery infrastructure doesn't guarantee ordering across events of different types (or sometimes even the same type).

Design your handlers to be order-agnostic. Use the event's created timestamp to decide whether a newer event should overwrite an older state. Don't assume that the first event you receive is the earliest event chronologically.

# Example: out-of-order events for the same payment_intent
14:32:01  received: payment_intent.succeeded (created: 14:31:58)
14:32:02  received: payment_intent.updated   (created: 14:31:55)
# The .updated event arrived 1 second later but has an earlier created time.
# A handler that blindly overwrites state would mark the payment as "updated"
# after it was already marked as "succeeded", which is wrong.
# Fix: only apply the event if its created time is newer than the current state.
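
The "only apply if newer" guard can be sketched as follows. The event shape is simplified for illustration: `created` is a Unix timestamp and `status` stands in for whatever state the event carries; both field names and `apply_event` are assumptions, not a provider's schema.

```python
def apply_event(state: dict, event: dict) -> dict:
    """Return the new state, ignoring events that are not newer than what we hold."""
    if event["created"] <= state.get("last_event_created", 0):
        return state  # stale (or tied) event: keep the current state
    return {
        "status": event["status"],
        "last_event_created": event["created"],
    }


state = {}
# "succeeded" was created later but is received first.
state = apply_event(state, {"created": 1718980318, "status": "succeeded"})
# "updated" was created earlier but is received second.
state = apply_event(state, {"created": 1718980315, "status": "updated"})

print(state["status"])  # succeeded
```

Timestamps with one-second resolution can tie; this sketch keeps whichever tied event arrived first. When that matters, fetching the resource's current state from the provider's API is the safer tiebreaker.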

Return 200 fast — process asynchronously

Most providers time out a webhook delivery after 5–30 seconds. If your handler tries to do everything synchronously (write to the database, send an email, update an external API), it may not finish in time, and the provider will retry, assuming the delivery failed. The correct pattern:

  1. Receive the POST request.
  2. Verify the signature.
  3. Write the raw event payload to a queue or job table (this should take under 100 ms).
  4. Return 200 immediately.
  5. A separate background worker reads the queue and processes the event — takes as long as it needs.

This pattern also means your endpoint is resilient to downstream slowness. If the email service is having a slow moment, it doesn't make your webhook endpoint time out and trigger a retry storm.
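
The split between a fast handler and a slow worker can be sketched with the standard library alone. This is an illustration under simplifying assumptions: signature verification is omitted, the in-memory `queue.Queue` stands in for a durable queue or job table (an in-memory queue loses events if the process crashes), and all names are hypothetical.

```python
import queue
import threading

events = queue.Queue()
processed = []


def handle_webhook(event: dict) -> int:
    """Request handler: enqueue the event, then acknowledge straight away."""
    events.put(event)
    return 200


def worker() -> None:
    """Background worker: drains the queue at its own pace."""
    while True:
        event = events.get()
        if event is None:  # sentinel: shut the worker down
            events.task_done()
            break
        processed.append(event["id"])  # stand-in for slow business logic
        events.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

status = handle_webhook({"id": "evt_1", "type": "payment_intent.succeeded"})

events.join()     # demo only: wait until the worker has caught up
events.put(None)  # stop the worker
thread.join()

print(status, processed)  # 200 ['evt_1']
```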

Testing locally with a tunnel

The provider can't POST to localhost:3000. To test locally, you need a tunnel: a tool that creates a public URL that forwards traffic to your local port. Common options: ngrok, Cloudflare Tunnel, and the Stripe CLI's built-in webhook proxy (stripe listen --forward-to localhost:3000/webhooks).

# Option 1: ngrok creates a public HTTPS URL for localhost:3000
ngrok http 3000
Forwarding https://a4bc-1234.ngrok-free.app → http://localhost:3000
# Register https://a4bc-1234.ngrok-free.app/webhooks in the provider's dashboard
# ngrok also gives you a local dashboard at http://127.0.0.1:4040 to inspect deliveries

# Option 2: Stripe CLI forwards Stripe events directly, handles signature for you
stripe listen --forward-to localhost:3000/webhooks
Ready! Your webhook signing secret is whsec_test_abc123...
Listening for Stripe events, forwarding to localhost:3000/webhooks

# In another terminal, trigger a test event:
stripe trigger payment_intent.succeeded

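
For unit tests that never leave your machine, it can also help to sign a test payload yourself and feed it to your handler. A minimal sketch, assuming the Stripe-style scheme this lesson describes (HMAC-SHA256 over timestamp + "." + body); `build_signed_delivery` and the secret are hypothetical, and this does not replace the provider's own test tooling:

```python
import hashlib
import hmac
import json
import time
from typing import Optional, Tuple


def build_signed_delivery(
    event: dict, secret: str, timestamp: Optional[int] = None
) -> Tuple[bytes, str]:
    """Return (raw_body, signature_header) for a Stripe-style signed test delivery."""
    ts = int(time.time()) if timestamp is None else timestamp
    raw_body = json.dumps(event, separators=(",", ":")).encode()
    signing_input = f"{ts}.".encode() + raw_body
    digest = hmac.new(secret.encode(), signing_input, hashlib.sha256).hexdigest()
    return raw_body, f"t={ts},v1={digest}"


body, header = build_signed_delivery(
    {"id": "evt_test_1", "type": "payment_intent.succeeded"},
    secret="whsec_local_test",  # hypothetical local-only secret
    timestamp=1718380860,
)
print(header.split(",")[0])  # t=1718380860
```

Post `body` with `header` as the signature header to your local endpoint, configured with the same test secret, and the handler's verification path runs exactly as it would for a real delivery.
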
🎯 Interview angle

Interviewers asking about webhooks frequently probe three specifics: signature verification (HMAC), idempotency (dedup by event ID), and the "return 200 fast, process async" pattern. Knowing all three — and being able to explain why each is necessary — signals production experience. The most common gap is candidates who understand signatures but haven't thought about ordering or the delivery-before-processing pattern.

✅ Use the provider's test event tooling before touching your production webhook

Every major webhook provider has a way to trigger test events — Stripe's CLI, GitHub's "redeliver" button, Twilio's webhook test tool. Use it. Don't wait for a real production event to test a code change. Fake events with known payloads are faster, reproducible, and don't have real-world side effects (no charges, no emails sent to real users).

Under the hood: how it actually works

Every part of the phrase "verify the HMAC signature" hides a concrete mechanism. Here is exactly what both sides do.

How the provider computes the signature

When an event fires, the provider constructs the signing input by concatenating the Unix timestamp and the raw JSON body with a dot separator, then computes HMAC-SHA256 over that string using the webhook signing secret you configured. The result is hex-encoded and placed in the delivery header alongside the timestamp. Stripe's format is representative:

# The raw payload bytes the provider will sign:
raw_body = b'{"id":"evt_1NkLmXY","type":"payment_intent.succeeded","data":{"object":{"id":"pi_3N1abc","amount":4900}}}'

# Timestamp included in the signature (UNIX seconds, as a string):
timestamp = "1718380860"

# Signing input: timestamp + "." + raw_body
signing_input = b"1718380860." + raw_body

# HMAC-SHA256(key=signing_secret, msg=signing_input) → hex digest
import hmac, hashlib
signature = hmac.new(
    b"whsec_test_abc123",   # your signing secret
    signing_input,
    hashlib.sha256
).hexdigest()
# → e.g. "7d3e4c2a1f8b0e9d6c5a4b3f2e1d0c9a8b7e6f5d4c3b2a1..."

# Header sent on every delivery:
# Stripe-Signature: t=1718380860,v1=7d3e4c2a1f8b0e9d6c5a4b3f2e1d0c9a8b7e6f5d4c3b2a1...

How your endpoint recomputes and compares

Your handler parses the header into its components, reconstructs the same signing input, computes the same HMAC, and compares the two hex digests. The comparison must be constant-time — using a regular string equality (==) leaks how many leading bytes matched via timing, which an attacker on the same network can measure to forge signatures byte-by-byte. Python's hmac.compare_digest, Node's crypto.timingSafeEqual, and Go's hmac.Equal are all constant-time.

The timestamp window that blocks replay attacks

The provider includes the timestamp so your handler can reject replays. If an attacker captures a valid delivery (correct signature), they can re-POST it minutes or hours later and get the same effect (a second charge, a second email). The timestamp window closes this gap: reject any delivery whose t= value is more than N seconds old. Five minutes (300 s) is the Stripe default; tighter is safer but requires your clocks to be in sync (NTP). This does not protect against replay within the window — that is idempotency's job (dedupe by event ID).

The retry schedule and why idempotency is mandatory

Delivery uses exponential backoff. An illustrative schedule: first failure → retry after ~5 s; second failure → ~30 s; third → ~5 min; then ~30 min, ~2 h, ~5 h, ~10 h, and so on at widening intervals until the retry window closes and the event is marked failed. Exact intervals and attempt counts vary by provider (Stripe, for example, documents retrying live-mode deliveries for up to three days), so check your provider's documentation rather than relying on these numbers. Because the provider retries until it sees a 2xx, and because your endpoint may crash after processing but before returning 200, the same event can be processed more than once. Idempotency (dedupe by event ID) is not optional.
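
To get a feel for how quickly exponential backoff spreads retries out, here is a small sketch of a capped schedule. The base delay, growth factor, and cap are made-up parameters chosen for illustration, not any provider's actual values, and real implementations often add random jitter.

```python
def backoff_schedule(base_s, factor, max_delay_s, attempts):
    """Delay before each retry: base_s * factor**n, capped at max_delay_s."""
    return [min(base_s * factor ** n, max_delay_s) for n in range(attempts)]


delays = backoff_schedule(base_s=5, factor=6, max_delay_s=24 * 3600, attempts=6)
print(delays)  # [5, 30, 180, 1080, 6480, 38880]
```

Six retries already span more than half a day, which is why a short outage produces a burst of retried deliveries minutes later while a long one leaves events trickling in for hours.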

Traced delivery: from event fire to retried delivery

# ── Provider side ──────────────────────────────────────────────────────────
# T+0.000  payment_intent.succeeded fires; provider builds delivery:
#   event id  : evt_1NkLmXY
#   timestamp : t=1718380860
#   signing   : HMAC-SHA256(key=secret, msg="1718380860." + raw_body) → v1=7d3e...
#   POST to   : https://api.yourapp.com/webhooks
#
# T+0.120  YOUR ENDPOINT receives the POST
POST /webhooks HTTP/1.1
Host: api.yourapp.com
Content-Type: application/json
Stripe-Signature: t=1718380860,v1=7d3e4c2a1f8b0e9d6c5a4b3f2e1d0c9a8b7e6f5d4c3b2a1...
Content-Length: 105

{"id":"evt_1NkLmXY","type":"payment_intent.succeeded","data":{"object":{"id":"pi_3N1abc","amount":4900}}}

# T+0.121  Handler: parse Stripe-Signature → t=1718380860, v1=7d3e...
# T+0.122  Replay check: abs(now() - 1718380860) = 0.122 s < 300 s ✓
# T+0.123  Reconstruct signing input: b"1718380860." + raw_body
# T+0.124  HMAC-SHA256(key=secret, msg=signing_input) → computed = 7d3e...
# T+0.125  hmac.compare_digest(computed, "7d3e...") → True ✓
# T+0.126  Dedupe check: db.seen("evt_1NkLmXY") → False ✓ (first delivery)
# T+0.127  job_queue.enqueue(event); db.record("evt_1NkLmXY")
HTTP/1.1 200 OK   ← provider marks delivery "succeeded" in its log

# ── Provider retries if it never saw that 2xx (e.g. the response was lost) ──
# Attempt 2  T+5 s   → 503 (your server restarting)
# Attempt 3  T+35 s  → 503
# Attempt 4  T+5 m   → 200 ← same evt_1NkLmXY; dedupe fires
db.seen("evt_1NkLmXY") → True → return 200 immediately, no second processing
⚠️ The window is one-sided — use it with idempotency, not instead of it

The timestamp window (reject deliveries older than 5 min) stops a replay that arrives after the window. But the provider itself can legitimately retry the same event within seconds — that is not a replay attack, that is normal delivery. Your endpoint will see the same event ID more than once within the window, and the timestamp check will pass both times. Only the dedupe-by-event-ID step prevents the double processing. Both defences are required; neither replaces the other.

In production: how leading APIs do it

When your webhook handler fails signature verification, the debugging question is always the same: are you reproducing the exact bytes and signing algorithm the provider used? Stripe, GitHub, and HubSpot all use HMAC-SHA256 but differ in what goes into the signing input — and getting that wrong is the most common cause of persistent 400 errors in the delivery log.

Provider | Signature header | What is signed | Key debug gotcha | Docs
--- | --- | --- | --- | ---
Stripe | Stripe-Signature: t=<unix>,v1=<hex> | HMAC-SHA256 over "{t}.{raw_body}"; the timestamp is part of the signed material | If you sign only raw_body without prepending "{t}.", the computed HMAC will never match, even with the correct secret | Stripe webhook signatures
GitHub | X-Hub-Signature-256: sha256=<hex> | HMAC-SHA256 over the raw request body only, using the webhook secret | No timestamp in the signed material; if your reverse proxy modifies or re-encodes the body (e.g., re-serialises JSON), the bytes change and verification fails | GitHub validating deliveries
HubSpot | X-HubSpot-Signature-v3 | HMAC-SHA256 over method + URI + body + timestamp using the app client secret | The URI must exactly match the registered endpoint URL including scheme and path; a redirect (HTTP → HTTPS) changes the URI and breaks the signature | HubSpot validating requests
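
For contrast with the Stripe-style check, a GitHub-style verification signs the raw body alone and carries the digest behind a `sha256=` prefix. A minimal sketch; the function name, secret, and payload are placeholders:

```python
import hashlib
import hmac


def verify_github_signature(raw_body: bytes, header: str, secret: str) -> bool:
    """Check an X-Hub-Signature-256 value of the form 'sha256=<hex>'."""
    prefix = "sha256="
    if not header.startswith(prefix):
        return False
    expected = header[len(prefix):]
    computed = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes
    return hmac.compare_digest(computed.encode(), expected.encode())


secret = "example-secret"  # hypothetical
body = b'{"action":"opened"}'
good_header = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

print(verify_github_signature(body, good_header, secret))         # True
print(verify_github_signature(body + b" ", good_header, secret))  # False
```

The second call fails because a single extra byte in the body changes the digest, which is the same failure a re-encoding proxy causes.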

Deep dive: reproducing Stripe's HMAC locally to verify a failed delivery

Stripe's Stripe-Signature header has the form t=1718380860,v1=7d3e4c2a.... To reproduce the signature, you need three things from the delivery log: the raw body bytes exactly as received, the timestamp value (t=), and the endpoint signing secret. The signing input is "{t}.{raw_body}" — a string concatenation, not JSON wrapping. A common mistake is to use the parsed-and-re-serialised body instead of the raw bytes. JSON parsers frequently reorder keys or normalise whitespace; even a single added space produces a completely different HMAC and a permanent mismatch.

To reproduce the signature in a shell using the exact bytes from a failed delivery:

$ TIMESTAMP=1718380860
$ BODY='{"id":"evt_1NkLmXY","type":"payment_intent.succeeded","data":{"object":{"id":"pi_3N1abc","amount":4900}}}'
$ SECRET="whsec_test_abc123"
# Stripe's signing input: timestamp + "." + raw body (as a single string)
$ SIGNING_INPUT="${TIMESTAMP}.${BODY}"
# Compute HMAC-SHA256; printf '%s' (rather than echo) avoids adding a trailing newline
$ printf '%s' "$SIGNING_INPUT" | openssl dgst -sha256 -hmac "$SECRET" | sed 's/^.*= //'
7d3e4c2a1f8b0e9d6c5a4b3f2e1d0c9a8b7e6f5d4c3b2a1...   ← compare with v1= from the delivery log
# If these differ: the bytes you are signing differ from what Stripe signed.
# Check: are you reading raw_body from the request before any JSON parsing?
# Check: is a middleware (e.g., body-parser, nginx) re-encoding the request body?

Two checks help confirm the raw bytes are correct: (1) log len(raw_body) and compare it with the Content-Length header; a mismatch suggests that something between the provider and your handler changed the body (for example, middleware re-encoding it or a proxy decompressing it); (2) compute sha256(raw_body) in your handler and compare it with the same hash computed over the payload as the provider recorded it, where the delivery log exposes the exact bytes sent. If the bytes match but the signature still fails, suspect the secret (for example, the secret of a different endpoint or mode). If the bytes differ, the body was modified in transit.
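
Both checks fit in a small logging helper. A sketch with hypothetical names; it assumes the Content-Length header describes the body as your handler sees it, which may not hold if a proxy decompresses or re-chunks the request:

```python
import hashlib


def body_fingerprint(raw_body: bytes, content_length_header: str) -> dict:
    """Summarise the received body so it can be compared with what the provider sent."""
    declared = (
        int(content_length_header) if content_length_header.isdecimal() else None
    )
    return {
        "received_len": len(raw_body),
        "declared_len": declared,
        "length_matches": declared == len(raw_body),
        "sha256": hashlib.sha256(raw_body).hexdigest(),
    }


sent = b'{"id":"evt_1"}'        # 14 bytes, as the provider sent them
reencoded = b'{"id": "evt_1"}'  # 15 bytes after a hypothetical middleware re-serialised them

print(body_fingerprint(sent, "14")["length_matches"])       # True
print(body_fingerprint(reencoded, "14")["length_matches"])  # False
```

Logging the fingerprint for every failed verification gives you something concrete to compare against the provider's delivery log.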


🧠 Quick check

1. Your webhook endpoint was down for 20 minutes. When it came back up, it received a burst of events. What is the most important property your handler must have?

During the outage, some events may have been partially processed before the crash. The retried deliveries will re-deliver those same event IDs. An idempotent handler records event IDs it has seen and skips reprocessing them — preventing double charges, duplicate emails, etc. Ordering guarantees are about a different problem; 5-minute rejection would incorrectly drop retried legitimate events.

2. You're verifying a webhook HMAC signature. You parse the JSON payload, re-serialize it, and compute the HMAC over the result. The signature never matches. Why?

The provider signs the exact raw bytes of the request body. Any parsing and re-serialization can change the bytes — even a single extra space or a reordered key makes the HMAC mismatch. Always capture the raw body as bytes before any parsing, and compute the HMAC over those raw bytes.

3. Why should a webhook handler return 200 immediately and process the event in a background worker?

Providers have a delivery timeout (typically 5–30 seconds). If your synchronous handler takes 10 seconds to send an email and the timeout is 5 seconds, the provider will mark the delivery as failed and retry. You'd process the event twice — once from the original delivery that timed out, once from the retry. Returning 200 fast and deferring work prevents this.

4. You receive two webhook events for the same resource, and the second-received event has an earlier created timestamp than the first-received event. What should you do?

Webhook delivery order is not guaranteed. The second-received event may represent an earlier state of the resource. Use the event's created timestamp (which the provider sets) to determine chronological order, and guard against applying stale state on top of fresh state.

✍️ Exercise: design a robust webhook handler

You're building a webhook handler for a payment provider. The provider sends a payment.completed event when a payment succeeds. Your handler should: verify the signature, deduplicate by event ID, return 200 fast, and process asynchronously. Sketch the handler in pseudocode and explain each design decision.

Model answer:

# POST /webhooks/payments
def handle_payment_webhook(request):
    # 1. Capture raw body BEFORE any parsing
    raw_body = request.raw_body_bytes
    # .get() so a missing header fails verification instead of raising KeyError
    sig_header = request.headers.get("X-Payment-Signature", "")

    # 2. Verify signature — reject immediately if invalid
    if not verify_hmac(raw_body, sig_header, WEBHOOK_SECRET):
        return Response("invalid signature", status=400)

    # 3. Parse JSON only after verification
    event = json.loads(raw_body)

    # 4. Deduplicate by event ID
    if db.seen_event(event["id"]):
        return Response("already processed", status=200)  # 200, not 409

    # 5. Write to job queue (fast — milliseconds)
    job_queue.enqueue("process_payment_event", payload=event)
    db.record_event(event["id"])

    # 6. Return 200 immediately
    return Response("accepted", status=200)

# Background worker (separate process)
def process_payment_event(event):
    if event["type"] == "payment.completed":
        order = db.find_order(event["metadata"]["order_id"])
        order.mark_paid()
        email.send_confirmation(order.customer_email)

Design decisions to explain:

Rubric: ✓ Raw body used for signature verification ✓ Signature check before JSON parse ✓ Deduplication returns 200 (not 409) ✓ Enqueue before return ✓ Business logic in worker, not handler ✓ At least 3 design decisions explained.

Key takeaways

Sources & further reading