Debugging & Real-World · Lesson 02
Reading errors & status codes
A status code is a first clue, not a complete diagnosis. Knowing what 401 vs 403, 502 vs 504, and "connection refused" vs "connection reset" actually mean — and which layer each implicates — cuts the time between "something broke" and "I know exactly where to look" from an hour to a minute.
By the end you'll be able to
- State what each common 4xx and 5xx code means and who is responsible for the failure.
- Distinguish timeout, connection-refused, connection-reset, and DNS failure — and name the layer each points to.
- Read a raw terminal output and immediately identify the next diagnostic step.
The first thing a status code tells you: whose fault is it?
HTTP status codes are grouped into classes by their leading digit. The most important debugging split is between 4xx (the request had a problem) and 5xx (the server had a problem). This determines where you look next:
- 4xx — your request was received and understood, but rejected. The problem is in the request: wrong credentials, missing fields, bad format, rate limit exceeded. You need to fix the call.
- 5xx — the server received the request but failed to handle it. The problem is on the server side. You might fix it yourself (if you own the server) or wait for a provider to fix it (if you don't).
This sounds simple, but it's often the first thing overlooked under pressure. Before you dive into server logs, check the status class. A 400 means your log-diving is wasted — the server is working fine and telling you exactly what was wrong with the request.
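As a rough illustration of that first triage step, the status class can be turned into a "who acts first" answer before any logs are opened. This is a minimal sketch; the function name `responsible_party` and the wording of its return values are made up for this example.

```python
def responsible_party(status: int) -> str:
    """Return who should act first, based only on the status class."""
    if 400 <= status <= 499:
        return "client: fix the request"
    if 500 <= status <= 599:
        return "server: check the server or the provider"
    return "not an error class"


for code in (400, 429, 500, 504):
    print(code, "->", responsible_party(code))
```

The check deliberately ignores the specific code: the point is only to decide which side to look at before reading anything else.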
The 4xx codes you'll hit repeatedly
400 Bad Request — the request was syntactically wrong
The server understood the HTTP protocol but couldn't parse or validate the payload. Typical causes: malformed JSON (a trailing comma, an unquoted string), a required field that was omitted, or a type mismatch (a string where an integer was expected). The response body usually says exactly what was wrong — read it before doing anything else.
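Malformed JSON is one cause of a 400 that can be ruled out locally before the request is sent. The sketch below uses Python's standard `json` module; the helper name `find_json_error` and the sample payloads are assumptions for illustration, and the exact error wording varies between Python versions.

```python
import json


def find_json_error(payload: str):
    """Return None if the payload parses, else a short note on where it fails."""
    try:
        json.loads(payload)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


print(find_json_error('{"amount": 500}'))   # parses cleanly
print(find_json_error('{"amount": 500,}'))  # trailing comma is rejected
```

This only covers syntax. A missing required field or a type mismatch still needs the server's error body, or the API's schema, to diagnose.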
401 vs 403 — authentication vs authorisation
These two are confused constantly. The key is that they answer different questions:
- 401 answers "Who are you?": the server could not identify the caller. The token is absent, malformed, or expired. The fix: authenticate and get a fresh token. The WWW-Authenticate header in the response tells you the expected auth scheme.
- 403 answers "Can you do this?": the server knows exactly who you are, but you don't have permission for this specific operation. Rotating your token will not help; you need a different role or scope.
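The two codes lead to different next steps, which can be sketched as a small decision function. The name `auth_next_step` and the returned messages are hypothetical; the only real protocol detail used is that a WWW-Authenticate value starts with the scheme name (for example `Bearer`).

```python
from typing import Optional


def auth_next_step(status: int, www_authenticate: Optional[str] = None) -> str:
    """Suggest the next action for a 401 or 403 response."""
    if status == 401:
        # The first token of WWW-Authenticate names the expected scheme.
        parts = (www_authenticate or "").split()
        scheme = parts[0] if parts else "unknown"
        return f"re-authenticate (expected scheme: {scheme})"
    if status == 403:
        return "request a different role or scope; a fresh token will not help"
    return "not an auth failure"


print(auth_next_step(401, 'Bearer realm="api"'))
print(auth_next_step(403))
```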
404 Not Found — the path was wrong, or the resource is gone
The most common cause is a typo in the URL, a version prefix mismatch (/v2/ vs /v1/), or a resource that was deleted. A 404 from a GET is usually harmless to diagnose — but a 404 from a DELETE or PUT on a resource you believe exists can signal a race condition or a caching problem.
409 Conflict — the request is valid but contradicts current state
The request was correctly formed and authenticated, but executing it would create an inconsistency. Classic examples: trying to create a user with an email address that already exists; trying to transition an order from "shipped" back to "pending". The response body should tell you what state was in conflict. Check your idempotency key handling — if you're replaying a request that already succeeded, a 409 may be the correct response (depending on the API's design).
422 Unprocessable Entity — valid syntax, failed business validation
The payload was syntactically correct JSON/XML, but the values failed semantic validation: a payment amount below the minimum, an end date before the start date, a prohibited combination of fields. APIs that use 422 are being precise — they distinguish "I couldn't parse this" (400) from "I parsed it but the values are logically wrong" (422). Some APIs use 400 for both; check the documentation.
429 Too Many Requests — rate limit exceeded
You've sent more requests in a window than the API allows. The response typically includes headers telling you how long to wait. Never retry a 429 immediately — that sends another request and may extend the backoff window. Read the lesson on handling 429s for the full treatment.
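One way to honour the wait is to convert the Retry-After header into a number of seconds. The header can carry either a delay in seconds or an HTTP date, so the sketch handles both. The function name `retry_after_seconds` and the sample values are assumptions for illustration; many APIs use their own headers instead, so check the provider's documentation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header value into a non-negative wait in seconds."""
    value = value.strip()
    if value.isdigit():
        # Form 1: a plain delay in seconds, e.g. "120".
        return float(value)
    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120"))
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", fixed_now))
```

The `now` parameter exists only so the date form can be exercised with a fixed clock; real callers would leave it out.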
The 5xx codes: something went wrong on the server
502 and 504 from a gateway — what they really mean
When you're calling an external API and you get a 502 or 504, the API provider's gateway (nginx, AWS ALB, Cloudflare, etc.) is telling you it couldn't complete the request on your behalf:
- 502 Bad Gateway: the gateway reached the upstream server, but the upstream returned something invalid — a connection reset, a garbled response, or a crash before headers were sent. The upstream server is reachable but behaving wrongly.
- 504 Gateway Timeout: the gateway reached out to the upstream but the upstream was too slow. The gateway gave up waiting before a response arrived. This usually means the upstream is overloaded or a query/operation is taking longer than the gateway's timeout is configured to allow.
Neither 502 nor 504 means the gateway itself is broken. They mean the gateway is working correctly and reporting a problem it detected with the backend. Check the provider's status page. If the provider's status page says "operational" and you're getting 504, the upstream may be fine for most requests but slow on the specific operation you're doing (a heavy query, a large file upload).
Network failures: below the HTTP layer
Not every failure produces an HTTP response. Some failures happen at the TCP/DNS layer — the connection never gets established — and your tool reports them as error messages rather than status codes.
| Error message | What it means | Layer / next check |
|---|---|---|
| Connection refused (ECONNREFUSED) | TCP connection reached the host but the port is not listening. The server process may be stopped, or you have the wrong port. | Network/server. Is the service running? Is the port correct? |
| Connection reset (ECONNRESET) | The server actively closed the connection mid-stream. Often: TLS mismatch, server crash during a request, or a firewall tearing down an idle connection. | Network/gateway. Check TLS version, firewall rules, gateway keep-alive config. |
| Connection timed out (ETIMEDOUT) | TCP SYN sent, no response. The host is unreachable or a firewall is silently dropping packets. | Network. Wrong IP/host, firewall rule, or the server is completely down. |
| Could not resolve host (DNS NXDOMAIN) | DNS lookup failed: the hostname returned no A record. Typo in the hostname, or DNS propagation delay after a change. | DNS. Run dig api.example.com to verify. |
| SSL certificate verify failed | The server's TLS certificate is expired, self-signed, or for a different hostname. | TLS layer. Check openssl s_client -connect api.example.com:443. |
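In client code these failures surface as exceptions rather than responses. The sketch below maps Python's standard-library exception types onto the layers in the table; the function name `diagnose` and the wording of the hints are made up for this example. With `urllib`, the underlying exception is usually wrapped in a `URLError` and available as its `reason` attribute.

```python
import socket
import ssl


def diagnose(exc: BaseException) -> str:
    """Map a low-level connection exception to the layer it implicates."""
    # The TLS check comes first because SSL errors are also OSError subclasses.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "TLS: certificate expired, self-signed, or for a different hostname"
    if isinstance(exc, socket.gaierror):
        return "DNS: hostname did not resolve"
    if isinstance(exc, ConnectionRefusedError):
        return "Network/server: port not listening"
    if isinstance(exc, ConnectionResetError):
        return "Network/gateway: connection closed abruptly mid-stream"
    # socket.timeout is a separate class on older Python versions.
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "Network: no reply, host unreachable or packets dropped"
    return "unclassified"


for sample in (ConnectionRefusedError(), ConnectionResetError(), socket.timeout()):
    print(type(sample).__name__, "->", diagnose(sample))
```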
The symptom-to-cause table
In a real incident, you observe a symptom first. Use this table as a quick-reference first-pass:
| What you see | Most likely cause | First thing to check |
|---|---|---|
| 400 with an error body | Malformed or invalid request | Read the error message — it usually names the field |
| 401 | Missing, expired, or malformed credentials | Inspect the token's exp claim; check WWW-Authenticate header |
| 403 | Authenticated but lacking permission/scope | Check the required scope in the API docs; verify the token's scopes |
| 404 on a known resource | Wrong URL version, typo, or resource deleted | Compare the path to the API reference; check if the resource was recently deleted |
| 409 | Request conflicts with current resource state | Read the error body for the conflicting state; check for duplicate submissions |
| 422 | Valid JSON but business rule violated | Read the error body; check each field against the API's constraints |
| 429 | Rate limit exceeded | Read Retry-After or X-RateLimit-Reset; back off before retrying |
| 500 | Server-side bug or unhandled exception | Check server logs for stack trace; check recent deployments |
| 502, no server log | Gateway got an invalid response, or none, from upstream | Check upstream server status; look for process crash |
| 504, no server log | Gateway timed out waiting for upstream | Check for slow queries; consider whether the operation is inherently slow |
| Connection refused | Port not listening | Is the service running? Is the port number correct? |
| DNS failure | Hostname doesn't resolve | dig api.example.com; check for typos |
| Connection timeout | Unreachable or firewall dropping packets | Verify IP routing; check firewall egress rules |
Interviewers often hand you a log line or a terminal output and ask "what's wrong and where would you look?" Lead with the status class ("this is a 4xx, so the problem is in the request"), then name the specific code, then describe what evidence you'd check next. A 401/403 distinction answer that explains "I need to check whether they're unauthenticated or unauthorised" signals senior-level precision — many candidates conflate the two.
A common mistake is treating a 502 or 504 from a gateway as proof that "the API is down." Those codes mean the gateway is working correctly and reporting a backend problem. The API may be partially functional: other endpoints may be fine, or the problem may be limited to specific request types. Always check the provider's status page and try a simpler endpoint (like a health check) before escalating.
A well-designed API packs most of the diagnostic signal into the response body — the error code, the specific field that failed, the violated constraint. The status code is the section header; the body is the sentence. Always run curl -i (not just curl) so you see headers, and always look at the full response before pulling logs or changing code.
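The same habit carries over to code: when a call fails, capture the status, the diagnostic headers, and the body together. This is a minimal sketch using Python's standard `urllib`; the helper name `summarize`, the choice of headers, and the hand-built 429 response are assumptions so the example runs without a network.

```python
import io
import json
from email.message import Message
from urllib.error import HTTPError


def summarize(err: HTTPError) -> dict:
    """Collect the status, diagnostic headers, and decoded body of an error response."""
    raw = err.read().decode("utf-8", errors="replace")
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        body = raw  # Not JSON: keep the raw text rather than lose it.
    wanted = ("WWW-Authenticate", "Retry-After", "X-Request-Id")
    headers = {name: err.headers[name] for name in wanted if err.headers.get(name)}
    return {"status": err.code, "reason": err.reason, "headers": headers, "body": body}


# Hypothetical response, built by hand for illustration.
hdrs = Message()
hdrs["Retry-After"] = "30"
fake = HTTPError("https://api.example.com/v1/orders", 429, "Too Many Requests",
                 hdrs, io.BytesIO(b'{"error": "rate limit exceeded"}'))
print(summarize(fake))
```

In real use the `HTTPError` would come from an `except` clause around `urllib.request.urlopen`; the body can only be read once, so summarize it before passing the error along.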
Under the hood: how each failure is produced at the protocol level
Every error message a developer sees is the end of a specific protocol-level event. Understanding the exact mechanism — the TCP exchange, the gateway decision logic — turns a vague message into a precise pointer to the failure point. Below is how each major failure class actually happens on the wire.
Connection refused — TCP RST in response to SYN
When you run curl https://api.example.com:443/v1/ping, your OS opens a TCP socket and sends a SYN segment to port 443 of the server's IP. If nothing is listening on that port, the server's kernel sends back a RST (reset) segment immediately. No application code is involved, no data is exchanged, and the connection is terminated before it was ever established. The whole exchange takes about one network round trip.
The distinguishing feature: the RST arrives almost instantly (sub-millisecond on a LAN, a few milliseconds over the internet). The host is reachable — packets are getting through — but the kernel actively rejects the connection because no process has called listen() on that port.
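The fast rejection can be observed locally without touching a real server. The sketch below finds a loopback port with no listener, attempts a connection, and reports the outcome and elapsed time. The helper name `probe` is made up for this example, and it assumes the loopback interface is available and that nothing else grabs the port in between.

```python
import socket
import time


def probe(host: str, port: int, timeout: float = 2.0) -> tuple:
    """Try a TCP connect and return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        outcome = "refused"  # RST came back: host reachable, port closed.
    except (socket.timeout, TimeoutError):
        outcome = "timeout"  # No reply at all within the timeout.
    return outcome, time.monotonic() - start


# Find a local port with no listener: bind to port 0, note the number, close.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("127.0.0.1", 0))
    free_port = s.getsockname()[1]

print(probe("127.0.0.1", free_port))
```

A "refused" outcome that returns well inside the timeout is the signature described above; a "timeout" outcome that uses the full window points to dropped packets instead.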
Connection timeout — SYN sent, no SYN-ACK ever arrives
If the packet is dropped silently — by a firewall, a misconfigured security group, or a network that simply cannot route to the destination — the SYN leaves your machine and disappears. The server never replies. Your OS retransmits the SYN several times (the default is 3–6 retransmits with exponential backoff), then gives up. This takes 20–127 seconds depending on the OS's TCP retransmission timeout settings.
Practical diagnostic: if a connection that used to work now times out, look for a change in firewall rules, VPC security groups, or network ACLs. "Refused" means the host is reachable and the kernel is there; "timeout" means the packets are not getting through at all.
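The give-up time follows from the backoff arithmetic: each unanswered SYN doubles the wait before the next one. The sketch below adds up those waits. The parameter values are assumptions chosen to illustrate the two ends of the range (a 1-second initial timeout with 6 retransmits, and a 3-second initial timeout with 2 retransmits); actual defaults depend on the OS and its configuration.

```python
def syn_give_up_seconds(initial_rto: float, retries: int) -> float:
    """Total wait before connect() fails: the first SYN plus `retries`
    retransmits, with the timeout doubling after each unanswered attempt."""
    return sum(initial_rto * 2 ** attempt for attempt in range(retries + 1))


print(syn_give_up_seconds(1.0, 6))  # 1+2+4+8+16+32+64 = 127.0
print(syn_give_up_seconds(3.0, 2))  # 3+6+12 = 21.0
```

This is why an unexplained hang of roughly two minutes before an error is a strong hint that SYN packets are being dropped rather than rejected.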
Connection reset mid-stream — RST during data transfer
A reset can also happen after the TCP three-way handshake and even after HTTP data has started flowing. The most common causes: the server-side process crashes or is killed (SIGKILL), a TLS version mismatch aborts the handshake after the SYN-ACK, a firewall with stateful inspection tears down an idle connection, or an intermediate proxy closes an idle keep-alive connection before the client expected.
502 Bad Gateway — the proxy got a bad response from upstream
A reverse proxy (nginx, AWS ALB, Cloudflare) sits between your client and the application server. When you request POST /v1/orders, the proxy forwards that request to an upstream server. A 502 means the proxy successfully connected to the upstream, but the upstream returned something unusable — either a connection reset, a blank response, a garbled HTTP status line, or the upstream crashed mid-response before sending valid headers.
From the proxy's perspective, it performed its job (forwarded the request) but could not produce a valid response from what it got back. The proxy generates the 502 itself; the upstream server either sent nothing or sent gibberish. The upstream server's own logs may show a crash, an OOM kill, or a process restart at that timestamp.
504 Gateway Timeout — the proxy gave up waiting for upstream
A 504 means the proxy connected to the upstream (or at least initiated the connection), but the upstream did not send a complete response within the proxy's configured upstream timeout. The proxy is working correctly; it's reporting that the upstream was too slow. The upstream might be running (not crashed) but stuck: executing a slow database query, waiting for a downstream service, or simply overwhelmed with requests and not getting to yours.
The speed of the error is diagnostic. A 502 often arrives in under a second (the upstream crashed and sent a RST immediately). A 504 takes as long as the gateway's configured upstream timeout, commonly 30, 60, or 120 seconds. If your API call fails after precisely 30 seconds every time, that's almost certainly a 504 from a gateway with a 30-second upstream timeout, not a 502 from a crash. Check the gateway's upstream timeout setting (on nginx, proxy_read_timeout).
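That timing rule of thumb can be written down as a small heuristic. This is a sketch only: the function name `timing_hint`, the one-second tolerance, and the list of common timeouts are assumptions taken from the figures in the paragraph above, not properties of any particular gateway.

```python
def timing_hint(status: int, elapsed_s: float,
                common_timeouts=(30, 60, 120), tolerance_s: float = 1.0) -> str:
    """Combine a gateway status code with elapsed time to suggest a cause."""
    if status == 504:
        for limit in common_timeouts:
            if abs(elapsed_s - limit) <= tolerance_s:
                return f"504 after ~{limit}s: likely the gateway's upstream timeout"
        return "504: upstream too slow; compare elapsed time with the gateway's timeout"
    if status == 502:
        if elapsed_s < 1.0:
            return "502 in under a second: upstream likely crashed or reset the connection"
        return "502: upstream returned an invalid response; check its logs at this timestamp"
    return "not a gateway error"


print(timing_hint(504, 30.2))
print(timing_hint(502, 0.15))
```

Treat the result as a pointer for where to look first, not as a verdict: a slow 502 or an oddly timed 504 still needs the upstream's own logs.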
🧠 Quick check
1. You call an API endpoint and receive a 401. You refresh your access token and call again. You receive a 403. What happened?
401 means authentication failed — the server didn't recognise who you were. After getting a fresh token, authentication succeeded (no more 401), but now the server knows who you are and has decided you don't have the required permission — hence 403. The fix is to request a token with the right scope, not to refresh again.
2. You get a 504 from an external API provider, and there is no entry in the server logs on your end. What most likely happened?
504 Gateway Timeout means the gateway (not your client, not necessarily a crash) waited past its timeout threshold for the backend to respond. The backend may be running but very slow — an overloaded queue, a slow database query, or a heavy operation that takes longer than the gateway's configured timeout.
3. curl: (7) Failed to connect to api.example.com port 443: Connection refused. What does this tell you?
"Connection refused" means the TCP SYN reached the server's IP address, and the server actively rejected it on port 443 — the port is not open. DNS worked (otherwise you'd see "could not resolve host"). The most likely cause: the web server process is not running, or it's listening on a different port.
4. You receive a 422 with the body {"error":"end_date must be after start_date"}. What kind of problem is this?
422 Unprocessable Entity is used when the request body was syntactically correct (valid JSON) but failed a business rule. The server parsed the dates successfully; it's telling you the dates are in the wrong order. Fix the request values — don't look at server logs or network config.
✍️ Exercise: diagnose the log line
You're handed the following terminal output from a staging environment. Diagnose each line: name the error type, which layer is implicated, and what you would check next.
Model answers:
- A (403): The caller is authenticated (no 401) but the token's scopes don't include write:invoices. This is an authorisation failure at the server layer. Check which scopes the token was issued with; request a new token with the correct scope.
- B (DNS failure): The DNS resolver returned no A record for api.acme.internal. The hostname doesn't exist in this environment's DNS, or the internal DNS server is unreachable. Run dig api.acme.internal; check the DNS zone configuration for the staging environment.
- C (504): The gateway (which issued the 504) timed out waiting for the upstream. There is no server log entry, which suggests the request either never reached the application or reached it but didn't return in time. Check for slow database queries or an overloaded service. Look at the upstream service's own logs, using the x-request-id to correlate.
- D (400): A request-layer validation failure: the amount field was -500, but the API requires a positive integer. No server problem. Fix the request to pass a valid positive integer.
Rubric: ✓ Correctly identifies the layer for each (server, DNS, gateway, client request) ✓ Names the specific error type ✓ Proposes a concrete next action ✓ Does not suggest checking server logs for A or D (those are request-side problems).
Key takeaways
- 4xx = fix the request; 5xx = investigate the server. The leading digit tells you whose side the problem is on.
- 401 ≠ 403. 401 is "I don't know who you are" (unauthenticated). 403 is "I know exactly who you are, and you can't do this" (unauthorised).
- 502 and 504 come from the gateway, not the app server. 502 = upstream returned bad/no response; 504 = upstream too slow.
- Network failures have no status code. Connection refused, DNS failure, and TLS errors are below HTTP; diagnose them with curl error messages and separate tools like dig and openssl s_client.
- Always read the response body. The status code is the chapter; the body is the sentence that tells you what specifically failed.