API Design

Debugging & Real-World · Lesson 01

The API debugging mindset

Every API bug looks unique until you realise you're always answering the same question: which layer lied? A systematic loop — reproduce, isolate, hypothesise, verify, fix, protect — turns a wall of mystery into a series of answerable questions.

⏱ 11 min Difficulty: core Prereq: Lessons 01–03, 06–07

By the end you'll be able to

Why systematic beats intuitive

Intuition is experience compressed into a shortcut. It's valuable — but it only works when the bug fits a pattern you've already seen. The first time you hit a problem, you have no pattern. Worse, acting on a wrong hunch first can add confusion: you change something, the symptom shifts slightly, and now you're debugging your fix instead of the original failure.

The systematic loop below is designed to work on bugs you've never encountered. It doesn't require you to know the answer in advance; it requires you to know where to look next. That's a learnable skill, and it transfers to every API, every language, and every company's stack.

The debugging loop

Think of debugging as a scientific method running under time pressure. Each step produces a testable claim; the next step tests it.

  1. Reproduce: confirm it's real
  2. Isolate: which layer?
  3. Hypothesise: one cause, testable
  4. Evidence: logs / metrics / trace
  5. Fix: smallest change
  6. Regress: lock it in a test

Next bug: start clean.

The loop is linear when you're lucky; when step 4 disproves your hypothesis, return to step 3 with new information rather than step 1 from scratch.

Step 1 — Reproduce: confirm it's real before you touch anything

Never start debugging without a reproduction. If you can't make the failure happen on demand, you can't tell whether your fix worked. Write the smallest possible reproduction: the exact curl command, the exact request body, the exact sequence of events that triggers the symptom. If the report is "sometimes it fails," find the condition that makes "sometimes" into "always."

    # Reproduce the exact reported failure before changing anything
    curl -i -X POST https://api.example.com/v1/payments \
      -H "Authorization: Bearer sk_live_abc123" \
      -H "Content-Type: application/json" \
      -d '{"amount":5000,"currency":"usd","source":"tok_visa"}'

    HTTP/2 422
    {"error":{"code":"amount_too_small","message":"Amount must be at least 5000 cents"}}

    # Good: we can reproduce it reliably. Now we know exactly what to explain.
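
When the report is "sometimes it fails", it can help to put a number on "sometimes" before changing anything. The sketch below is one hypothetical way to do that; `measure_failure_rate` and the `send` callable are illustrative names, not part of any API described in this lesson. `send` stands in for whatever issues the request once (for example, a wrapper around the curl command above) and returns the HTTP status code.

```python
from collections import Counter
from typing import Callable


def measure_failure_rate(send: Callable[[], int], attempts: int) -> tuple[float, Counter]:
    """Call `send` repeatedly and report the share of 4xx/5xx responses.

    `send` is a hypothetical callable that performs one request and
    returns its HTTP status code.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    statuses = Counter(send() for _ in range(attempts))
    failures = sum(count for status, count in statuses.items() if status >= 400)
    return failures / attempts, statuses


if __name__ == "__main__":
    # Stand-in for a real request: replays a fixed sequence of status codes.
    canned = iter([201, 201, 201, 500, 201])
    rate, seen = measure_failure_rate(lambda: next(canned), attempts=5)
    print(f"failure rate: {rate:.0%}, statuses: {dict(seen)}")
```

Re-running the same measurement after each change of conditions (payload, concurrency, time of day) is one way to find the condition that turns "sometimes" into "always".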

Step 2 — Isolate by layer

An API call crosses many layers on its way from your code to the database and back. A symptom in one layer is often caused by a different layer. The fastest way to narrow down is to move your test point closer to the source of truth until the symptom disappears — that boundary is where the bug lives.

  1. Client: with the SDK/library removed, does raw curl still reproduce the error?
  2. Network / DNS / TLS: does curl time out, hang, or fail TLS? → network / DNS
  3. Gateway / LB: is it a 502/504 with no server log? → gateway dropped it
  4. Server (app logic): is there an error in the application log? → server-side
  5. Database / Cache: slow queries, lock waits, or connection-pool exhaustion?

See Lesson 03 for the full five-layer architecture.

Work top-down: eliminate the client layer first by reproducing with a raw curl. Each layer you rule out removes a whole class of possible causes from the search.

The layers are described in detail in Lesson 03 (Layers & the Narrow Waist). For debugging purposes, what matters is that each layer has a distinct signature: client bugs appear through the client library but vanish with a clean curl; network bugs show up even when the server is healthy; gateway bugs produce 502/504 with no corresponding server-side log entry; server bugs have stack traces; database bugs have slow-query logs or connection errors.
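
The layer signatures above can be written down as a small decision function. This is a sketch for illustration only: `Observation`, its field names, and `suspect_layer` are hypothetical, and a real investigation will have murkier evidence than five booleans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What you saw while probing; every field name here is illustrative."""
    raw_curl_reproduces: bool   # does the failure survive bypassing the SDK?
    transport_error: bool       # timeout, hang, DNS or TLS failure from curl
    status: Optional[int]       # HTTP status, or None if no response arrived
    in_server_log: bool         # is there a matching entry in the app log?
    db_symptoms: bool           # slow queries, lock waits, pool exhaustion


def suspect_layer(obs: Observation) -> str:
    """Walk the layers top-down and return the first one the evidence points to."""
    if not obs.raw_curl_reproduces:
        return "client"
    if obs.transport_error or obs.status is None:
        return "network"
    if obs.status in (502, 504) and not obs.in_server_log:
        return "gateway"
    if obs.db_symptoms:
        return "database"
    return "server"
```

The order of the checks mirrors the top-down order of the layers: each test only makes sense once the layers above it have been ruled out.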

Step 3 — Hypothesise: one testable claim at a time

A hypothesis is not a guess — it's a claim that predicts observable evidence. "I think the token is expired" is a hypothesis because it predicts: checking the token's exp claim will show a past timestamp, and refreshing the token will make the 401 go away. A hypothesis that doesn't predict anything specific is not a hypothesis; it's a hunch, and hunches lead to random changes.

Write the hypothesis down — even a one-liner in a scratch file. This forces precision and stops you from drifting mid-investigation.
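
To make the "token is expired" example concrete, here is a minimal sketch of checking the prediction directly. It assumes the token is a JWT (three dot-separated base64url segments), which the lesson does not state; `jwt_expiry` and `is_expired` are hypothetical helper names. The code only reads the payload for inspection and does not verify the signature, so it is a debugging aid, not an authentication check.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> Optional[int]:
    """Return the `exp` claim (seconds since epoch) without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # base64url segments in JWTs omit padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims.get("exp")


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """True when the token carries an `exp` claim that is not in the future."""
    exp = jwt_expiry(token)
    current = time.time() if now is None else now
    return exp is not None and exp <= current
```

If `is_expired` returns False for the failing token, the hypothesis is disproved in seconds and you return to step 3 with one cause eliminated.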

Step 4 — Check evidence: logs, metrics, traces

This is where most of the work happens. Collect only the evidence that tests your current hypothesis; don't dump every log line hoping something jumps out. The three instruments complement each other: logs show what happened to a single request, metrics show how often it happens and since when, and traces show where the time went across services.

If the API is external, check the provider's status page before anything else. Spending an hour debugging a provider outage is pure waste.

Step 5 — Fix: smallest change that addresses the evidence

Match the size of your fix to the size of the confirmed cause. If the evidence shows a missing required field, add the field — don't refactor the entire request-building module. Narrow fixes are safer (less chance of introducing a new bug), easier to roll back, and easier for a code reviewer to evaluate. If you find yourself wanting to fix five things at once, you haven't isolated far enough.

Step 6 — Regression test: lock the fix in

A fix without a test is a debt with no due date. The next developer to touch that code — possibly you in six months — has no way to know the constraint exists. Write the smallest test that would have caught the bug: an integration test that sends the malformed request and asserts on the specific error code, or a unit test that exercises the boundary condition you just discovered.

Tools for each stage

Tool | Best for | Watch out for
curl -i | Reproducing without any client library; checking raw headers | Shell quoting around JSON bodies; missing -i loses response headers
Browser DevTools Network tab | Seeing exactly what the browser sent/received; timing waterfall | CORS errors look like network errors, so check the Console tab too
Server/app logs | Stack traces, request IDs, exact SQL executed | Log level set to WARN might suppress the INFO line you need
Provider status page | Ruling out a third-party outage in seconds | Status pages are often lagging; check community forums if the page says "operational"
Distributed trace (Jaeger, Zipkin, Honeycomb) | Multi-service latency; finding where time goes in a fan-out call | Sampling rates: if only 1 in 100 requests is traced, a rare bug may never appear

🎯 Interview angle

When an interviewer presents a debugging scenario, they're testing your process more than your knowledge of the specific error. Narrate the loop out loud: "First I'd reproduce it with a raw curl to rule out client library issues, then check whether the error appears in the server logs…" Demonstrating systematic thinking under pressure is the signal they're looking for. Arriving at a correct diagnosis quickly with a clear explanation is worth more than guessing the right answer silently.

Scenario

Your mobile app starts getting reports of intermittent payment failures. The error in your logs reads upstream connect error or disconnect/reset before headers. Apply the loop.

  1. Reproduce. Call the payment endpoint directly with curl, bypassing the mobile SDK. If curl also fails, the client layer is not the culprit.
  2. Isolate. The error message ("reset before headers") is a gateway-level signal — it means the upstream server closed the connection before sending any response. Check: is the gateway healthy? Is this a 502 or a connection-level failure?
  3. Hypothesise. "The payment processor's server is returning a connection reset — possibly an overloaded upstream or a TLS mismatch between the gateway and the upstream."
  4. Check evidence. Pull gateway access logs; look for 502 errors clustering on a specific upstream IP. Check the payment processor's status page. Look at the TLS handshake timeout in the gateway config.
  5. Fix. If the upstream is healthy but the TLS handshake is timing out: raise the gateway's upstream TLS timeout from 5 s to 10 s (matching the payment processor's documented SLA). Deploy to staging and confirm the reproduction from step 1 now passes.
  6. Regression test. Add a synthetic monitor that calls the payment endpoint from the gateway's network segment every 60 s and alerts on a connection-reset error, distinct from a normal application error.
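
The synthetic monitor in step 6 has to tell a connection reset apart from an ordinary application error. A minimal sketch of that classification follows; `classify_probe` and the `call` parameter are hypothetical, with `call` standing in for whatever HTTP client the monitor uses, assumed to return the status code and to raise Python's built-in exceptions on transport failures.

```python
from typing import Callable


def classify_probe(call: Callable[[], int]) -> str:
    """Run one probe and label the outcome.

    `call` is a hypothetical function that performs the request and
    returns the HTTP status code, raising on transport failures.
    """
    try:
        status = call()
    except ConnectionResetError:
        # The case this monitor exists to catch; must be tested before
        # the broader OSError, of which it is a subclass.
        return "connection_reset"
    except OSError:
        # Timeouts, refused connections, DNS failures and similar.
        return "transport_error"
    if status >= 500:
        return "server_error"
    if status >= 400:
        return "application_error"
    return "ok"
```

Alerting only on the `connection_reset` label keeps the monitor quiet during normal validation failures while still catching the failure mode from this scenario.
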
⚠️ Common trap

Changing two things at once. If you update the timeout and rotate the API keys simultaneously, and the bug disappears, you don't know which change fixed it. That ambiguity is debt: when the bug reappears, or when someone else debugs something similar, there is no actionable record of what actually worked. Change one variable per investigation step.

✅ Keep an investigation log

Even a simple scratch file — timestamp, hypothesis, what you checked, what you found — is enormously valuable. It prevents you from re-testing things you've already ruled out, and it gives you the material to write a clear post-mortem. "I checked X at 14:15, it was fine" is evidence. "I don't think X is the issue" is a hunch.
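
A scratch file needs no tooling, but if you prefer a helper, something like the following would do. It is a sketch only; `log_step` and the entry layout are hypothetical choices, not a prescribed format.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_step(path: Path, hypothesis: str, checked: str, found: str) -> str:
    """Append one timestamped entry to a scratch file and return the line written."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    line = f"{stamp} | hypothesis: {hypothesis} | checked: {checked} | found: {found}"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

For example, `log_step(Path("investigation.log"), "token expired", "exp claim", "exp is in the future")` records a disproved hypothesis so nobody re-tests it later.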

Under the hood: a fully worked debugging session

Reading about the loop is one thing; watching it run in real time against a real symptom is another. Below is a concrete, narrated session for a specific failure: intermittent 500s on POST /orders. This is the kind of thing a senior engineer actually does — each step produces evidence that shapes the next move.

Symptom

At 14:07 the on-call engineer receives an alert: error rate on POST /orders jumped from 0.1 % to 8 % over the last five minutes. Users are reporting "Something went wrong" on checkout. The 500s are intermittent — about 1 in 12 requests fails.

Step 1 — Reproduce: nail down a minimal reproduction

First goal: make the failure happen on demand, not wait for a user to trigger it.

    # Use the exact payload from a real failed request (copied from logs)
    curl -i -X POST https://api.example.com/v1/orders \
      -H "Authorization: Bearer sk_live_abc" \
      -H "Content-Type: application/json" \
      -d '{"user_id":"usr_99","items":[{"sku":"PROD-7","qty":2}],"promo_code":"SAVE10"}'
    HTTP/2 201    ← success on first try

    # Retry five more times in a tight loop
    for i in $(seq 1 5); do
      curl -s -o /dev/null -w "%{http_code}\n" -X POST https://api.example.com/v1/orders \
        -H "Authorization: Bearer sk_live_abc" -H "Content-Type: application/json" \
        -d '{"user_id":"usr_99","items":[{"sku":"PROD-7","qty":2}],"promo_code":"SAVE10"}'
    done
    201 201 201 500 201

    # Confirmed: reproducible, intermittent. ~1 in 5 in this test. Not every request fails.

Key observation: the failure is reproducible but not deterministic. That pattern — intermittent failures with no obvious trigger per request — points to concurrency, connection pooling, or a resource that is shared across requests and occasionally exhausted.

Step 2 — Isolate by layer: bypass each layer in turn

The symptom is a 500, which comes from the server layer. But is it the app server or something downstream (a database, a cache, a third-party call)? Start by confirming the server does log the 500 (ruling out a gateway-only failure).

    # Check the application log for the failed request's request ID
    # (extract the x-request-id from a failed curl response)
    curl -i -X POST https://api.example.com/v1/orders \
      -H "Authorization: Bearer sk_live_abc" -H "Content-Type: application/json" \
      -d '{"user_id":"usr_99","items":[{"sku":"PROD-7","qty":2}],"promo_code":"SAVE10"}' 2>&1 | grep -E "HTTP/|x-request-id"
    HTTP/2 500
    x-request-id: req_8c3f1a

    # Now search the app log for that request ID
    grep "req_8c3f1a" /var/log/app/orders.log
    2024-06-15T14:08:34Z ERROR req_8c3f1a POST /v1/orders 500 RuntimeError: connection pool exhausted (timeout=30s)

    # The 500 IS logged, so the gateway passed the request through. This is a server-layer failure.
    # "connection pool exhausted" is the key phrase.

Layer verdict: gateway is healthy (the request reached the app), network is healthy (no DNS or TLS errors), client is healthy (curl and the real app both fail the same way). The failure is in the server layer, specifically: a shared connection pool to some downstream resource.

Step 3 — Hypothesise: one testable claim

Written down in a scratch file at 14:09:

Hypothesis: The promo-code validation step holds a database connection for longer than expected (or doesn't release it in the error path), causing the connection pool to exhaust under moderate concurrency. Prediction: removing the promo_code field from the request will stop the 500s; the per-request latency for the promo lookup will be elevated in metrics.

A simpler alternative hypothesis (to test afterwards): "The orders service's pool size (currently 10) is simply too small for the current traffic volume, and the promo lookup is irrelevant." The first hypothesis is more testable right now because it predicts that changing one specific field will change the behavior.

Step 4 — Check evidence: logs, metrics, trace

    # Test hypothesis: retry without promo_code
    for i in $(seq 1 10); do
      curl -s -o /dev/null -w "%{http_code}\n" -X POST https://api.example.com/v1/orders \
        -H "Authorization: Bearer sk_live_abc" -H "Content-Type: application/json" \
        -d '{"user_id":"usr_99","items":[{"sku":"PROD-7","qty":2}]}'
    done
    201 201 201 201 201 201 201 201 201 201
    # Zero failures in 10 requests without promo_code. Strong signal that promo lookup is involved.

    # Confirm in Honeycomb: latency of promo_code validation over the last hour
    # (the trace span is named "validate_promo")
    # Dashboard shows: p50 = 12 ms   p95 = 340 ms   p99 = 28 000 ms (28 s)
    # → p99 promo validation latency is 28 s, approaching the pool timeout of 30 s.
    # → Look at the promo service logs for 14:05–14:15 UTC
    grep "SLOW_QUERY\|lock wait" /var/log/db/promos.log | tail -20
    2024-06-15T14:07:44Z SLOW_QUERY 26832ms SELECT * FROM promo_redemptions WHERE promo_code='SAVE10'
    2024-06-15T14:08:01Z SLOW_QUERY 27104ms SELECT * FROM promo_redemptions WHERE promo_code='SAVE10'

    # The promo_redemptions query is doing a full-table scan. No index on promo_code.
    # The table has 4M rows. A deploy at 13:55 inserted a batch of 2M new rows; table stats are stale.

Evidence summary: the promo-code lookup is the root cause, not connection-pool size. With no index on promo_code, every lookup scans the whole table. That was tolerable at 2M rows; after the batch load doubled the table, each scan took about 27 s and held its connection the whole time, until other requests timed out waiting for a free connection. The first hypothesis is confirmed.
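
A back-of-the-envelope calculation shows why slow queries exhaust a pool regardless of its size. The sketch below uses figures from this session (pool of 10, roughly 27 s per slow query, 12 ms at p50) and assumes each request holds exactly one pooled connection for the duration of the lookup, which is a simplification. `max_sustainable_rate` is a hypothetical helper name.

```python
def max_sustainable_rate(pool_size: int, hold_seconds: float) -> float:
    """Upper bound on requests per second when each request holds one
    pooled connection for `hold_seconds` (Little's law rearranged)."""
    if pool_size <= 0 or hold_seconds <= 0:
        raise ValueError("pool_size and hold_seconds must be positive")
    return pool_size / hold_seconds


# Figures from the session: pool of 10, ~27 s slow queries, 12 ms at p50.
slow = max_sustainable_rate(pool_size=10, hold_seconds=27.0)
healthy = max_sustainable_rate(pool_size=10, hold_seconds=0.012)
print(f"slow lookups: {slow:.2f} req/s, healthy lookups: {healthy:.0f} req/s")
```

Under these assumptions the slow path supports well under one promo lookup per second. Doubling the pool would only double that small number, which is why enlarging the pool does not address the cause.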

Step 5 — Fix: smallest change that addresses the evidence

    -- Immediate mitigation: add the missing index.
    -- CONCURRENTLY builds the index without blocking writes to the table.
    CREATE INDEX CONCURRENTLY idx_promo_redemptions_promo_code
      ON promo_redemptions (promo_code);
    -- Takes ~90 s to build. Monitor the slow-query log while it builds.
    CREATE INDEX

    -- Index built. Verify the query plan now uses the index:
    EXPLAIN ANALYZE SELECT * FROM promo_redemptions WHERE promo_code='SAVE10';
    Index Scan using idx_promo_redemptions_promo_code on promo_redemptions
      (cost=0.56..8.57 rows=1 width=76) (actual time=0.041..0.042 rows=0 loops=1)
    Planning Time: 0.3 ms
    Execution Time: 0.1 ms

    -- From 28 000 ms → 0.1 ms. No application deploy needed; the index is the entire fix.

Step 6 — Regression test: lock the fix in

    # Confirm the error rate returns to baseline after index creation
    # (read from the same metrics dashboard)
    # After the index build: error rate on POST /orders → 0.0 %
    #                        p99 promo validation latency → 6 ms

    # Verify 10 consecutive requests all succeed
    for i in $(seq 1 10); do
      curl -s -o /dev/null -w "%{http_code}\n" -X POST https://api.example.com/v1/orders \
        -H "Authorization: Bearer sk_live_abc" -H "Content-Type: application/json" \
        -d '{"user_id":"usr_99","items":[{"sku":"PROD-7","qty":2}],"promo_code":"SAVE10"}'
    done
    201 201 201 201 201 201 201 201 201 201

    # Add a check that runs after any promo_redemptions migration:
    # CI step: EXPLAIN SELECT * FROM promo_redemptions WHERE promo_code='X'
    # and assert the plan contains "Index Scan" (not "Seq Scan")
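
The CI assertion on the query plan could look roughly like this. It is a sketch: `plan_uses_index` is a hypothetical helper that inspects EXPLAIN output already fetched as text, and the cost figures in `bad_plan` are made up for illustration.

```python
def plan_uses_index(plan_text: str, index_name: str) -> bool:
    """True when the EXPLAIN output mentions the index and no sequential scan."""
    return index_name in plan_text and "Seq Scan" not in plan_text


good_plan = (
    "Index Scan using idx_promo_redemptions_promo_code on promo_redemptions "
    "(cost=0.56..8.57 rows=1 width=76)"
)
# Illustrative only: the numbers below are invented.
bad_plan = "Seq Scan on promo_redemptions (cost=0.00..91234.00 rows=1 width=76)"

assert plan_uses_index(good_plan, "idx_promo_redemptions_promo_code")
assert not plan_uses_index(bad_plan, "idx_promo_redemptions_promo_code")
```

One caveat: on a very small table the PostgreSQL planner may choose a sequential scan even when the index exists, so a plan-based check needs a CI database with a representative row count. Checking that the index is listed in the `pg_indexes` view is a more stable alternative.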

Total elapsed investigation time: 14 minutes from alert to root-cause confirmed; 17 minutes to fix deployed. The loop made every minute countable — each step produced a decision, not just activity.

✅ What made this session work

Three things kept the session tight: (1) reproducing with a specific curl command before checking anything else — this gave a reliable signal to test against; (2) writing the hypothesis down with a concrete prediction ("removing promo_code will stop the failures"), which could be tested in under a minute; (3) following the evidence trail from the slow-query log to the missing index rather than jumping straight to "increase the connection pool size" (a common wrong fix for pool exhaustion).

🧠 Quick check

1. You get a 502 from the API and there is no corresponding error entry in the application server's log. Which layer is the most likely culprit?

If the application server never logged the request, it likely never received it. A 502 with no server-side log is the signature of a gateway-level failure — the gateway tried to contact the upstream server and failed.

2. You want to rule out your client SDK as the cause of a failure. What is the right first move?

The fastest way to rule out a layer is to bypass it. If the raw curl also fails, the SDK is not the cause. If curl succeeds, the bug is in how the SDK constructs the request.

3. A valid hypothesis for step 3 of the debugging loop is:

A hypothesis must make a specific, testable prediction. A valid one names the cause, identifies how to verify it, and predicts what the fix will change. A description of a symptom is not a hypothesis, because it doesn't predict what evidence to look for.

4. Why should you write a regression test after fixing a bug?

Documentation is valuable, but the primary purpose of a regression test is mechanical prevention: it makes the system tell you immediately if the bug comes back, rather than relying on a human noticing it in a log or a user reporting it again.

✍️ Exercise: map a real incident to the loop

Pick any API failure you've encountered — or use this scenario: a user reports that creating an order worked yesterday but returns a 500 Internal Server Error today. The error message in the response body is "unexpected nil pointer".

Write out all six steps of the debugging loop for this failure. For each step, write: (a) what action you'd take, and (b) what you'd do if that step didn't give you useful information.

Model answer:

  1. Reproduce. Run curl -i -X POST …/orders -d '{"items":[]}'. Confirm the 500. If you can't reproduce: check whether it's user-specific (auth issue?) or time-specific (a deploy happened?).
  2. Isolate. There is a server-side error message ("nil pointer"), so this is in the server layer. The client layer is not involved. Check: does it happen for all users, or only this one? All item lists, or only empty ones?
  3. Hypothesise. "The order-creation handler dereferences a pointer without a nil check, and an empty items array is triggering a code path that wasn't tested."
  4. Evidence. Pull the server log for the request ID. Find the stack trace. Identify the exact line and the nil pointer. Check git log for changes to the orders handler in the last 24 hours.
  5. Fix. Add a nil check and a guard clause for empty items arrays. Return 422 Unprocessable Entity with a clear error message rather than crashing.
  6. Regression test. Add a unit test: POST /orders with {"items":[]} must return 422, not 500.

Rubric:

  ✓ Reproduce is a specific curl command, not "try the API"
  ✓ Isolate identifies the correct layer (server)
  ✓ Hypothesis is testable and specific
  ✓ Evidence step points to the stack trace and git log
  ✓ Fix is minimal (nil check, not a rewrite)
  ✓ Regression test covers the exact boundary condition that caused the crash
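
Steps 5 and 6 of the model answer can be sketched together as a guard clause and the test that pins it down. Everything here is hypothetical: `create_order`, the `items_required` error code, and the response shape are illustrative stand-ins for whatever the real orders handler uses.

```python
from typing import Any


def create_order(payload: dict[str, Any]) -> tuple[int, dict[str, Any]]:
    """Hypothetical handler: validate first, so bad input yields 422 rather than a crash."""
    items = payload.get("items")
    if not isinstance(items, list) or not items:
        return 422, {"error": {"code": "items_required",
                               "message": "items must be a non-empty list"}}
    # Placeholder for the real order-creation logic.
    return 201, {"item_count": len(items)}


def test_empty_items_returns_422() -> None:
    """Regression test for the exact boundary that crashed: an empty items array."""
    status, body = create_order({"items": []})
    assert status == 422
    assert body["error"]["code"] == "items_required"
```

The test asserts on the specific error code as well as the status, so a later change that returns 422 for an unrelated reason does not pass by accident.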

Key takeaways

Sources & further reading