Debugging & Real-World · Lesson 01
The API debugging mindset
Every API bug looks unique until you realise you're always answering the same question: which layer lied? A systematic loop — reproduce, isolate, hypothesise, verify, fix, protect — turns a wall of mystery into a series of answerable questions.
By the end you'll be able to
- Apply a six-step debugging loop to any API failure without guessing.
- Identify which layer — client, network, gateway, server, or database — is the most likely source of a given error.
- Choose the right tool (curl, browser DevTools, logs, status pages) for each stage of an investigation.
Why systematic beats intuitive
Intuition is experience compressed into a shortcut. It's valuable — but it only works when the bug fits a pattern you've already seen. The first time you hit a problem, you have no pattern. Worse, acting on a wrong hunch first can add confusion: you change something, the symptom shifts slightly, and now you're debugging your fix instead of the original failure.
The systematic loop below is designed to work on bugs you've never encountered. It doesn't require you to know the answer in advance; it requires you to know where to look next. That's a learnable skill, and it transfers to every API, every language, and every company's stack.
The debugging loop
Think of debugging as a scientific method running under time pressure. Each step produces a testable claim; the next step tests it.
Step 1 — Reproduce: confirm it's real before you touch anything
Never start debugging without a reproduction. If you can't make the failure happen on demand, you can't tell whether your fix worked. Write the smallest possible reproduction: the exact curl command, the exact request body, the exact sequence of events that triggers the symptom. If the report is "sometimes it fails," find the condition that makes "sometimes" into "always."
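One way to turn "sometimes it fails" into a number is to replay the identical request many times and count the outcomes. The sketch below assumes a hypothetical `send` function that performs one request and returns its HTTP status code; `make_fake_sender` is a stand-in so the example runs without a network.

```python
from collections import Counter
from typing import Callable


def measure_failure_rate(send: Callable[[], int], attempts: int = 50) -> dict:
    """Call `send` repeatedly and summarise the status codes it returns.

    `send` is a hypothetical zero-argument function that issues the same
    request every time (same method, URL, headers, body) and returns the
    HTTP status code.
    """
    statuses = Counter(send() for _ in range(attempts))
    failures = sum(count for status, count in statuses.items() if status >= 500)
    return {
        "attempts": attempts,
        "failures": failures,
        "failure_rate": failures / attempts,
        "statuses": dict(statuses),
    }


def make_fake_sender(fail_every: int) -> Callable[[], int]:
    """Stand-in for a real request: returns 500 on every `fail_every`-th call."""
    calls = {"n": 0}

    def send() -> int:
        calls["n"] += 1
        return 500 if calls["n"] % fail_every == 0 else 201

    return send


report = measure_failure_rate(make_fake_sender(fail_every=4), attempts=20)
print(report)
```

A measured failure rate also gives you a baseline: after a fix, the same harness should report zero failures over a comparable number of attempts.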
Step 2 — Isolate by layer
An API call crosses many layers on its way from your code to the database and back. A symptom in one layer is often caused by a different layer. The fastest way to narrow down is to move your test point closer to the source of truth until the symptom disappears — that boundary is where the bug lives.
The layers are described in detail in Lesson 03 — Layers & the Narrow Waist. For debugging purposes, what matters is that each layer has a distinct signature: client bugs appear through the client library but disappear when the same request is sent with a clean curl; network bugs show up even when the server is healthy; gateway bugs produce 502/504 with no corresponding server-side log entry; server bugs have stack traces; database bugs have slow-query logs or connection errors.
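The signatures above can be written down as a rough decision procedure. This is a heuristic sketch, not a definitive classifier: the `Observation` fields and the order of the checks are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What has been observed so far. Every field name is illustrative."""
    raw_curl_succeeds: bool       # same request replayed without the client library
    status: Optional[int]         # HTTP status, or None if no response arrived
    server_logged_request: bool   # the app server has a log entry for the request
    server_stack_trace: bool      # that log entry contains a stack trace
    db_errors_in_log: bool        # slow-query or connection errors appear with it


def likely_layer(obs: Observation) -> str:
    """Map an observation to the layer most worth inspecting first."""
    if obs.raw_curl_succeeds:
        return "client"      # the raw request works, so the library is suspect
    if obs.status is None:
        return "network"     # no HTTP response at all (DNS, TLS, connectivity)
    if obs.status in (502, 504) and not obs.server_logged_request:
        return "gateway"     # the gateway answered but the server never saw it
    if obs.db_errors_in_log:
        return "database"
    if obs.server_stack_trace:
        return "server"
    return "unknown"


print(likely_layer(Observation(False, 502, False, False, False)))
```

The result is a starting point for the next test, not a verdict; a "gateway" answer still needs the gateway's own logs to confirm it.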
Step 3 — Hypothesise: one testable claim at a time
A hypothesis is not a guess — it's a claim that predicts observable evidence. "I think the token is expired" is a hypothesis because it predicts: checking the token's exp claim will show a past timestamp, and refreshing the token will make the 401 go away. A hypothesis that doesn't predict anything specific is not a hypothesis; it's a hunch, and hunches lead to random changes.
Write the hypothesis down — even a one-liner in a scratch file. This forces precision and stops you from drifting mid-investigation.
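The expired-token hypothesis is cheap to test because a JWT's payload is just base64url-encoded JSON. The sketch below reads the `exp` claim without verifying the signature, which is acceptable for diagnosis only and never for authorisation. `make_demo_token` is a hypothetical helper that builds an unsigned token so the example is self-contained.

```python
import base64
import json
import time
from typing import Optional


def jwt_exp(token: str) -> Optional[int]:
    """Return the `exp` claim of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims.get("exp")


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """True if the token carries an `exp` claim that is not in the future."""
    exp = jwt_exp(token)
    if exp is None:
        return False  # no expiry claim, so this hypothesis does not apply
    return exp <= (time.time() if now is None else now)


def make_demo_token(claims: dict) -> str:
    """Hypothetical helper: build an unsigned token for demonstration only."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{b64({'alg': 'none'})}.{b64(claims)}.signature"


token = make_demo_token({"sub": "user-1", "exp": 1000})
print(jwt_exp(token), is_expired(token, now=2000))
```

If `is_expired` returns False for the failing token, the hypothesis is falsified and you move on to the next one instead of refreshing tokens at random.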
Step 4 — Check evidence: logs, metrics, traces
This is where most of the work happens. Collect only the evidence that tests your current hypothesis; don't dump every log line hoping something jumps out. The three instruments complement each other:
- Logs tell you what happened at a specific moment on a specific server — ideal for one-off failures and error messages with stack traces.
- Metrics tell you how much something happened over time — ideal for "it's slow" or "it started failing at 14:32 and is still failing."
- Traces tell you where time went across services in a single request — ideal for distributed systems where a single call fans out to five microservices.
If the API is external, check the provider's status page before anything else. Spending an hour debugging a provider outage is pure waste.
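For an "it started failing at 14:32" question, a metric can be approximated from logs by bucketing requests per minute. The sketch below assumes each record is a `(timestamp, status)` pair with an `HH:MM:SS` timestamp; that shape and the sample data are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_rate_by_minute(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Group (timestamp, status) records by minute and return the 5xx share."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for timestamp, status in records:
        minute = timestamp[:5]  # "14:32:02" -> "14:32"
        totals[minute] += 1
        if status >= 500:
            errors[minute] += 1
    return {minute: errors[minute] / totals[minute] for minute in sorted(totals)}


# Invented sample records, not taken from a real system.
sample = [
    ("14:31:05", 200), ("14:31:40", 200),
    ("14:32:02", 500), ("14:32:30", 200), ("14:32:59", 500),
    ("14:33:10", 500),
]
rates = error_rate_by_minute(sample)
first_bad_minute = next(minute for minute, rate in rates.items() if rate > 0)
print(rates, first_bad_minute)
```

The first minute with a non-zero rate is the moment to line up against deploys, config changes, and batch jobs.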
Step 5 — Fix: smallest change that addresses the evidence
Match the size of your fix to the size of the confirmed cause. If the evidence shows a missing required field, add the field — don't refactor the entire request-building module. Narrow fixes are safer (less chance of introducing a new bug), easier to roll back, and easier for a code reviewer to evaluate. If you find yourself wanting to fix five things at once, you haven't isolated far enough.
Step 6 — Regression test: lock the fix in
A fix without a test is a debt with no due date. The next developer to touch that code — possibly you in six months — has no way to know the constraint exists. Write the smallest test that would have caught the bug: an integration test that sends the malformed request and asserts on the specific error code, or a unit test that exercises the boundary condition you just discovered.
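As a sketch of "the smallest test that would have caught the bug", suppose the confirmed cause was a request builder that omitted a required field. The builder, the field names, and the `REQUIRED_FIELDS` contract below are all hypothetical.

```python
REQUIRED_FIELDS = {"customer_id", "items", "currency"}  # hypothetical API contract


def build_order_request(customer_id: str, items: list, currency: str) -> dict:
    """Hypothetical request builder. The assumed bug: `currency` was never sent."""
    return {"customer_id": customer_id, "items": items, "currency": currency}


def test_order_request_has_every_required_field() -> None:
    """Regression test: fails if any required field is dropped again."""
    body = build_order_request("c_123", [{"sku": "A1", "qty": 1}], "USD")
    missing = REQUIRED_FIELDS - set(body)
    assert not missing, f"request is missing required fields: {sorted(missing)}"


test_order_request_has_every_required_field()
```

The test names the constraint explicitly, so the next person to refactor the builder learns about it from a failing assertion rather than from production.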
Tools for each stage
| Tool | Best for | Watch out for |
|---|---|---|
| curl -i | Reproducing without any client library; checking raw headers | Shell quoting around JSON bodies; missing -i loses response headers |
| Browser DevTools Network tab | Seeing exactly what the browser sent/received; timing waterfall | CORS errors look like network errors — check the Console tab too |
| Server/app logs | Stack traces, request IDs, exact SQL executed | Log level set to WARN might suppress the INFO line you need |
| Provider status page | Ruling out a third-party outage in seconds | Status pages are often lagging; check community forums if the page says "operational" |
| Distributed trace (Jaeger, Zipkin, Honeycomb) | Multi-service latency; finding where time goes in a fan-out call | Sampling rates: if only 1 in 100 requests is traced, a rare bug may never appear |
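The sampling caveat in the last row can be quantified. Assuming each request is sampled independently and fails independently, the chance that at least one failing request ends up traced is 1 - (1 - failure_rate × sample_rate)^n. The rates below are illustrative, not measurements.

```python
def prob_failure_is_traced(failure_rate: float, sample_rate: float, requests: int) -> float:
    """Probability that at least one failing request is captured by tracing.

    Assumes sampling and failures are independent per request.
    """
    per_request = failure_rate * sample_rate
    return 1 - (1 - per_request) ** requests


# Illustrative: a bug hitting 1 in 1,000 requests, tracing 1 in 100 requests.
p = prob_failure_is_traced(failure_rate=0.001, sample_rate=0.01, requests=10_000)
print(round(p, 3))
```

With these assumed rates, even ten thousand requests leave the trace store more likely than not to contain no failing example, which is the cue to raise the sampling rate temporarily or sample on error.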
When an interviewer presents a debugging scenario, they're testing your process more than your knowledge of the specific error. Narrate the loop out loud: "First I'd reproduce it with a raw curl to rule out client library issues, then check whether the error appears in the server logs…" Demonstrating systematic thinking under pressure is the signal they're looking for. Arriving at a correct diagnosis quickly with a clear explanation is worth more than guessing the right answer silently.
Your mobile app starts getting reports of intermittent payment failures. The error in your logs reads "upstream connect error or disconnect/reset before headers". Apply the loop.
- Reproduce. Call the payment endpoint directly with curl, bypassing the mobile SDK. If curl also fails, the client layer is not the culprit.
- Isolate. The error message ("reset before headers") is a gateway-level signal — it means the upstream server closed the connection before sending any response. Check: is the gateway healthy? Is this a 502 or a connection-level failure?
- Hypothesise. "The payment processor's server is returning a connection reset — possibly an overloaded upstream or a TLS mismatch between the gateway and the upstream."
- Check evidence. Pull gateway access logs; look for 502 errors clustering on a specific upstream IP. Check the payment processor's status page. Look at the TLS handshake timeout in the gateway config.
- Fix. If the upstream is healthy but TLS is timing out: raise the gateway's upstream TLS timeout from 5 s to 10 s to match the payment processor's documented SLA. Deploy to staging and confirm that the reproduction from the first step now passes.
- Regression test. Add a synthetic monitor that calls the payment endpoint from the gateway's network segment every 60 s and alerts on a connection-reset error, distinct from a normal application error.
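The synthetic monitor in the last step only helps if it tells a connection reset apart from an ordinary application error. A minimal sketch of that classification follows. It assumes a hypothetical `send` function that returns a status code and raises Python's built-in `ConnectionResetError` or `TimeoutError`; real HTTP libraries usually wrap these in their own exception types.

```python
from typing import Callable


def classify_probe(send: Callable[[], int]) -> str:
    """Run one probe and label the outcome for alert routing."""
    try:
        status = send()
    except ConnectionResetError:
        return "connection_reset"   # peer closed the connection before replying
    except TimeoutError:
        return "timeout"
    if status >= 400:
        return "application_error"  # a response arrived; the app rejected it
    return "ok"


def fake_reset() -> int:
    """Stand-in for a probe whose connection is reset by the upstream."""
    raise ConnectionResetError("reset before headers")


print(classify_probe(fake_reset), classify_probe(lambda: 503), classify_probe(lambda: 200))
```

Routing `connection_reset` to a separate alert keeps a gateway-level regression from hiding inside the general error rate.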
Changing two things at once. If you update the timeout and rotate the API keys simultaneously, and the bug disappears, you don't know which change fixed it. That ambiguity is debt: when the bug reappears, or when someone else debugs something similar, there is no actionable record to work from. Change one variable per investigation step.
Even a simple scratch file — timestamp, hypothesis, what you checked, what you found — is enormously valuable. It prevents you from re-testing things you've already ruled out, and it gives you the material to write a clear post-mortem. "I checked X at 14:15, it was fine" is evidence. "I don't think X is the issue" is a hunch.
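A scratch file needs no tooling, but a fixed shape makes the notes easier to scan later. The sketch below is one possible layout; the field names and the file path are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """One line of an investigation log. Field names are illustrative."""
    time: str
    hypothesis: str
    checked: str
    found: str

    def line(self) -> str:
        return (f"{self.time} | H: {self.hypothesis} "
                f"| checked: {self.checked} | found: {self.found}")


def append_note(path: str, note: Note) -> None:
    """Append the note to a plain-text scratch file at a path of your choosing."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(note.line() + "\n")


note = Note("14:15", "X is down", "X health endpoint", "fine")
print(note.line())
```

Forcing a "checked" and a "found" entry on every line is what separates recorded evidence from a hunch.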
Under the hood: a fully worked debugging session
Reading about the loop is one thing; watching it run in real time against a real symptom is another. Below is a concrete, narrated session for a specific failure: intermittent 500s on POST /orders. This is the kind of thing a senior engineer actually does — each step produces evidence that shapes the next move.
At 14:07 the on-call engineer receives an alert: error rate on POST /orders jumped from 0.1 % to 8 % over the last five minutes. Users are reporting "Something went wrong" on checkout. The 500s are intermittent — about 1 in 12 requests fails.
Step 1 — Reproduce: nail down a minimal reproduction
First goal: make the failure happen on demand, not wait for a user to trigger it.
Key observation: the failure is reproducible but not deterministic. That pattern — intermittent failures with no obvious trigger per request — points to concurrency, connection pooling, or a resource that is shared across requests and occasionally exhausted.
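With a non-deterministic failure, it helps to know how many attempts a reproduction needs before "no failure seen" means anything. Assuming each request fails independently with probability p, the smallest n with 1 - (1 - p)^n at or above the desired confidence is ceil(ln(1 - confidence) / ln(1 - p)). The sketch applies that to the 1-in-12 rate from the alert.

```python
import math


def attempts_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Attempts needed to see at least one failure with the given confidence.

    Assumes each attempt fails independently with probability `failure_rate`.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


print(attempts_needed(1 / 12))        # 95 % confidence
print(attempts_needed(1 / 12, 0.99))  # 99 % confidence
```

The same number works in reverse after a fix: a clean run of that many attempts is reasonable evidence that the failure rate has genuinely dropped.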
Step 2 — Isolate by layer: bypass each layer in turn
The symptom is a 500, which comes from the server layer. But is it the app server or something downstream (a database, a cache, a third-party call)? Start by confirming the server does log the 500 (ruling out a gateway-only failure).
Layer verdict: gateway is healthy (the request reached the app), network is healthy (no DNS or TLS errors), client is healthy (curl and the real app both fail the same way). The failure is in the server layer, specifically: a shared connection pool to some downstream resource.
Step 3 — Hypothesise: one testable claim
Written down in a scratch file at 14:09:
Hypothesis: The promo-code validation step holds a database connection for longer than expected (or doesn't release it in the error path), causing the connection pool to exhaust under moderate concurrency. Prediction: removing the promo_code field from the request will stop the 500s; the per-request latency for the promo lookup will be elevated in metrics.
A competing hypothesis (to test afterwards): "The orders service's pool size (currently 10) is simply too small for the current traffic volume, and the promo lookup is irrelevant." The first hypothesis is more testable right now because it predicts that changing one specific field will change the behavior.
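The "doesn't release it in the error path" branch of the first hypothesis describes a common leak pattern. The toy pool below is not the orders service's code; it only illustrates how an early exit without a `finally` block exhausts a pool, and how `try`/`finally` prevents it.

```python
class TinyPool:
    """Toy connection pool that only counts checked-out connections."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    def acquire(self) -> None:
        if self.in_use >= self.size:
            raise RuntimeError("pool exhausted")
        self.in_use += 1

    def release(self) -> None:
        self.in_use -= 1


def lookup_leaky(pool: TinyPool, code: str) -> int:
    pool.acquire()
    if code == "BAD":
        raise ValueError("unknown promo code")  # connection is never released
    pool.release()
    return 10


def lookup_safe(pool: TinyPool, code: str) -> int:
    pool.acquire()
    try:
        if code == "BAD":
            raise ValueError("unknown promo code")
        return 10
    finally:
        pool.release()  # runs on both the success path and the error path


def count_exhausted(lookup, pool: TinyPool, codes: list) -> int:
    """Run each lookup and count how many hit an exhausted pool."""
    exhausted = 0
    for code in codes:
        try:
            lookup(pool, code)
        except ValueError:
            pass  # the expected validation failure
        except RuntimeError:
            exhausted += 1
    return exhausted


codes = ["BAD"] * 3 + ["OK"] * 2
print(count_exhausted(lookup_leaky, TinyPool(3), codes))
print(count_exhausted(lookup_safe, TinyPool(3), codes))
```

Note how the leaky version fails on perfectly valid requests that arrive after the bad ones, which is why this class of bug looks intermittent from the outside.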
Step 4 — Check evidence: logs, metrics, trace
Evidence summary: the root cause is the promo-code lookup, not the size of the connection pool. The query was fast when the table had 2M rows; after a batch load doubled the table, it started doing sequential scans, and latency ballooned until requests timed out waiting for a connection. The updated hypothesis is confirmed.
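Why a slow query exhausts a pool follows from Little's law: the average number of connections in use equals the arrival rate multiplied by how long each request holds a connection. The pool size of 10 comes from the session above; the request rate and query durations below are hypothetical numbers chosen only to show the effect.

```python
POOL_SIZE = 10  # pool size stated in the session above


def connections_needed(requests_per_second: float, hold_seconds: float) -> float:
    """Little's law: average connections in use = arrival rate x hold time."""
    return requests_per_second * hold_seconds


# Hypothetical figures: 50 requests/s, query slowing from 20 ms to 400 ms.
before = connections_needed(50, 0.020)
after = connections_needed(50, 0.400)
print(before, after, after > POOL_SIZE)
```

Under these assumed numbers, traffic did not change at all, yet demand for connections rose well past the pool size. Enlarging the pool would only have postponed the failure while the query kept getting slower.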
Step 5 — Fix: smallest change that addresses the evidence
Step 6 — Regression test: lock the fix in
Total elapsed investigation time: 14 minutes from alert to root-cause confirmed; 17 minutes to fix deployed. The loop made every minute countable — each step produced a decision, not just activity.
Three things kept the session tight: (1) reproducing with a specific curl command before checking anything else — this gave a reliable signal to test against; (2) writing the hypothesis down with a concrete prediction ("removing promo_code will stop the failures"), which could be tested in under a minute; (3) following the evidence trail from the slow-query log to the missing index rather than jumping straight to "increase the connection pool size" (a common wrong fix for pool exhaustion).
🧠 Quick check
1. You get a 502 from the API and there is no corresponding error entry in the application server's log. Which layer is the most likely culprit?
If the application server never logged the request, it likely never received it. A 502 with no server-side log is the signature of a gateway-level failure — the gateway tried to contact the upstream server and failed.
2. You want to rule out your client SDK as the cause of a failure. What is the right first move?
The fastest way to rule out a layer is to bypass it. If the raw curl also fails, the SDK is not the cause. If curl succeeds, the bug is in how the SDK constructs the request.
3. A valid hypothesis for step 3 of the debugging loop is:
A hypothesis must make a specific, testable prediction. Option A names the cause, identifies how to verify it, and predicts the fix. Options B and C are descriptions of a symptom, not a hypothesis — they don't predict what evidence to look for.
4. Why should you write a regression test after fixing a bug?
Documentation is valuable, but the primary purpose of a regression test is mechanical prevention: it makes the system tell you immediately if the bug comes back, rather than relying on a human noticing it in a log or a user reporting it again.
✍️ Exercise: map a real incident to the loop
Pick any API failure you've encountered — or use this scenario: a user reports that creating an order worked yesterday but returns a 500 Internal Server Error today. The error message in the response body is "unexpected nil pointer".
Write out all six steps of the debugging loop for this failure. For each step, write: (a) what action you'd take, and (b) what you'd do if that step didn't give you useful information.
Model answer:
- Reproduce. Run curl -i -X POST …/orders -d '{"items":[]}'. Confirm the 500. If you can't reproduce: check whether it's user-specific (auth issue?) or time-specific (a deploy happened?).
- Isolate. There is a server-side error message ("nil pointer"), so this is in the server layer. The client layer is not involved. Check: does it happen for all users, or only this one? All item lists, or only empty ones?
- Hypothesise. "The order-creation handler dereferences a pointer without a nil check, and an empty items array is triggering a code path that wasn't tested."
- Evidence. Pull the server log for the request ID. Find the stack trace. Identify the exact line and the nil pointer. Check git log for changes to the orders handler in the last 24 hours.
- Fix. Add a nil check and a guard clause for empty items arrays. Return 422 Unprocessable Entity with a clear error message rather than crashing.
- Regression test. Add a unit test: POST /orders with {"items":[]} must return 422, not 500.
Rubric: ✓ Reproduce is a specific curl command, not "try the API" ✓ Isolate identifies the correct layer (server) ✓ Hypothesis is testable and specific ✓ Evidence step points to the stack trace and git log ✓ Fix is minimal (nil check, not a rewrite) ✓ Regression test covers the exact boundary condition that caused the crash.
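The fix and regression test from the model answer might look like the following sketch. The scenario's "nil pointer" message suggests the real service is not written in Python, so treat `create_order` as a hypothetical stand-in for the handler that returns a `(status, body)` pair.

```python
from typing import Tuple


def create_order(body: dict) -> Tuple[int, dict]:
    """Hypothetical handler: reject missing or empty `items` instead of crashing."""
    items = body.get("items")
    if not items:  # covers a missing key, None, and an empty list
        return 422, {"error": "items must contain at least one entry"}
    return 201, {"item_count": len(items)}


def test_empty_items_returns_422() -> None:
    """Regression test for the exact boundary condition that caused the crash."""
    status, payload = create_order({"items": []})
    assert status == 422
    assert "items" in payload["error"]


test_empty_items_returns_422()
print(create_order({"items": [{"sku": "A1"}]}))
```

The guard is deliberately narrow: it changes the response for one invalid input and leaves every valid request on its existing path.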
Key takeaways
- The debugging loop — reproduce → isolate → hypothesise → evidence → fix → regress — works on any failure because it doesn't require pre-existing knowledge of the specific bug.
- Reproduce first, always. A fix you can't verify is a guess.
- Each layer has a distinct signature. No server-side log entry + 502 = gateway; stack trace in server log = server layer; TLS handshake failure = network/TLS layer.
- A hypothesis must be falsifiable: it predicts specific evidence, not just a direction to look.
- Change one thing at a time. Parallel changes make it impossible to know what worked.
- A regression test is the proof that the loop completed; without it, the fix is provisional.