Debugging & Real-World · Lesson 02

Reading errors & status codes

A status code is a first clue, not a complete diagnosis. Knowing what 401 vs 403, 502 vs 504, and "connection refused" vs "connection reset" actually mean — and which layer each implicates — cuts the time between "something broke" and "I know exactly where to look" from an hour to a minute.

⏱ 13 min Difficulty: core Prereq: Lesson 07 (HTTP), dbg-01

By the end you'll be able to

State what each common 4xx and 5xx code means and who is responsible for the failure.
Distinguish timeout, connection-refused, connection-reset, and DNS failure — and name the layer each points to.
Read raw terminal output and immediately identify the next diagnostic step.

The first thing a status code tells you: whose fault is it?

HTTP status codes are grouped into classes by their leading digit. The most important debugging split is between 4xx (the request had a problem) and 5xx (the server had a problem). This determines where you look next:

4xx — your request was received and understood, but rejected. The problem is in the request: wrong credentials, missing fields, bad format, rate limit exceeded. You need to fix the call.
5xx — the server received the request but failed to handle it. The problem is on the server side. You might fix it yourself (if you own the server) or wait for a provider to fix it (if you don't).

This sounds simple, but it's often the first thing overlooked under pressure. Before you dive into server logs, check the status class. A 400 means your log-diving is wasted — the server is working fine and telling you exactly what was wrong with the request.
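The class check above is mechanical enough to write down. A minimal sketch in Python, where the function name and the returned wording are illustrative choices rather than part of any library:

```python
def responsible_party(status: int) -> str:
    """Map an HTTP status code to the side that most likely needs to act."""
    if not 100 <= status <= 599:
        raise ValueError(f"not an HTTP status code: {status}")
    leading_digit = status // 100
    if leading_digit == 4:
        return "client: fix the request"
    if leading_digit == 5:
        return "server: check server logs or the provider's status page"
    return "no error: 1xx, 2xx and 3xx are not failure classes"


for code in (400, 429, 500, 504):
    print(code, "->", responsible_party(code))
```

It only encodes the first-pass split; the specific code and the response body still decide what you actually do next.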

The 4xx codes you'll hit repeatedly

4xx codes in order. The key 401 vs 403 distinction: 401 means "I don't know who you are," 403 means "I know who you are but you can't do this."

400 Bad Request — the request was syntactically wrong

The server understood the HTTP protocol but couldn't parse or validate the payload. Typical causes: malformed JSON (a trailing comma, an unquoted string), a required field that was omitted, or a type mismatch (a string where an integer was expected). The response body usually says exactly what was wrong — read it before doing anything else.

curl -i -X POST https://api.example.com/v1/orders \
  -H "Content-Type: application/json" \
  -d '{"item_id": "prod_42", "quantity": "three"}'

HTTP/2 400
{"error":{"code":"invalid_type","field":"quantity","message":"quantity must be an integer"}}

# "quantity" was sent as a string. Fix: -d '{"item_id":"prod_42","quantity":3}'

401 vs 403 — authentication vs authorisation

These two are confused constantly. The key is that they answer different questions:

401 answers "Who are you?" — the server could not identify the caller. The token is absent, malformed, or expired. The fix: authenticate and get a fresh token. The WWW-Authenticate header in the response tells you the expected auth scheme.
403 answers "Can you do this?" — the server knows exactly who you are, but you don't have permission for this specific operation. Rotating your token will not help; you need a different role or scope.

# 401: no token, or expired token
curl -i https://api.example.com/v1/users/me

HTTP/2 401
WWW-Authenticate: Bearer realm="api"
{"error":"missing_or_invalid_token"}

# Fix: include Authorization: Bearer <token> and ensure the token hasn't expired

# 403: valid token, wrong scope
curl -i -H "Authorization: Bearer sk_read_only_abc" \
  -X DELETE https://api.example.com/v1/users/99

HTTP/2 403
{"error":"insufficient_scope","required":"users:write","got":"users:read"}

# Fix: request a token with the write scope, or check ACL configuration
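In client code, the same distinction decides whether re-authenticating is worth attempting at all. A small sketch, assuming a hypothetical helper name and plain-dict headers; the returned strings are illustrative, not an API contract:

```python
from typing import Mapping


def auth_next_step(status: int, headers: Mapping[str, str]) -> str:
    """Suggest the next action for a 401 or 403 response."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if status == 401:
        # The first token of WWW-Authenticate names the expected scheme.
        scheme = lowered.get("www-authenticate", "").split(" ", 1)[0]
        hint = f" using the {scheme} scheme" if scheme else ""
        return f"authenticate again{hint}: the token is missing, malformed, or expired"
    if status == 403:
        return "request a different role or scope: a fresh token with the same permissions will not help"
    return "not an authentication or authorisation failure"


print(auth_next_step(401, {"WWW-Authenticate": 'Bearer realm="api"'}))
print(auth_next_step(403, {}))
```

The important design point is that only the 401 branch ever leads to a token refresh; retrying a 403 with a new token of the same scope just repeats the failure.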

404 Not Found — the path was wrong, or the resource is gone

The most common causes are a typo in the URL, a version prefix mismatch (/v2/ vs /v1/), or a resource that was deleted. A 404 from a GET is usually easy to diagnose, but a 404 from a DELETE or PUT on a resource you believe exists can signal a race condition or a caching problem.

409 Conflict — the request is valid but contradicts current state

The request was correctly formed and authenticated, but executing it would create an inconsistency. Classic examples: trying to create a user with an email address that already exists; trying to transition an order from "shipped" back to "pending". The response body should tell you what state was in conflict. Check your idempotency key handling — if you're replaying a request that already succeeded, a 409 may be the correct response (depending on the API's design).

422 Unprocessable Entity — valid syntax, failed business validation

The payload was syntactically correct JSON/XML, but the values failed semantic validation: a payment amount below the minimum, an end date before the start date, a prohibited combination of fields. APIs that use 422 are being precise — they distinguish "I couldn't parse this" (400) from "I parsed it but the values are logically wrong" (422). Some APIs use 400 for both; check the documentation.

429 Too Many Requests — rate limit exceeded

You've sent more requests in a window than the API allows. The response typically includes headers telling you how long to wait. Never retry a 429 immediately — that sends another request and may extend the backoff window. Read the lesson on handling 429s for the full treatment.

HTTP/2 429
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1718380860
Retry-After: 47
{"error":"rate_limit_exceeded"}

# Wait 47 seconds (Retry-After) before retrying. Add jitter to avoid a thundering herd.
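One way to turn those headers into a wait time is sketched below. It assumes Retry-After carries either a number of seconds or an HTTP-date (both forms are allowed by the HTTP specification), and that X-RateLimit-Reset is a Unix timestamp as in the example above; that second header is a convention, not a standard, so check your provider's documentation. The function name and the one-second fallback are illustrative.

```python
import random
import time
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay_seconds(headers: Mapping[str, str],
                        now: Optional[float] = None,
                        max_jitter: float = 1.0) -> float:
    """Seconds to wait before retrying a 429, plus random jitter."""
    now = time.time() if now is None else now
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    base = 1.0  # illustrative fallback when no usable header is present

    retry_after = lowered.get("retry-after")
    reset = lowered.get("x-ratelimit-reset")
    if retry_after is not None:
        if retry_after.isdigit():
            base = float(retry_after)  # delta-seconds form
        else:
            try:
                base = parsedate_to_datetime(retry_after).timestamp() - now
            except (TypeError, ValueError):
                pass  # unparseable date: keep the fallback
    elif reset is not None and reset.isdigit():
        base = float(reset) - now  # assumed to be a Unix timestamp

    # Jitter spreads out clients that were all throttled at the same moment.
    return max(base, 0.0) + random.uniform(0.0, max_jitter)


delay = retry_delay_seconds({"Retry-After": "47"})
print(f"sleeping {delay:.1f} s before retrying")
```

Retry-After is preferred over the reset timestamp here because it does not depend on the client's clock being in sync with the server's.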

The 5xx codes: something went wrong on the server

502 and 504 come from a gateway or load balancer that couldn't get a valid or timely response from the backend. 500 and 503 usually come from the application server itself, although a gateway can also return 503 when it has no healthy backend to route to.

502 and 504 from a gateway — what they really mean

When you're calling an external API and you get a 502 or 504, the API provider's gateway (nginx, AWS ALB, Cloudflare, etc.) is telling you it couldn't complete the request on your behalf:

502 Bad Gateway: the gateway tried the upstream server but got nothing usable back: a connection reset, a garbled response, or a crash before headers were sent. Many proxies (nginx, for example) also return 502 when the upstream refuses the connection outright. Either way, the upstream host is reachable but behaving wrongly.
504 Gateway Timeout: the gateway reached out to the upstream but the upstream was too slow. The gateway gave up waiting before a response arrived. This usually means the upstream is overloaded or a query/operation is taking longer than the gateway's timeout is configured to allow.

Neither 502 nor 504 means the gateway itself is broken. They mean the gateway is working correctly and reporting a problem it detected with the backend. Check the provider's status page. If the provider's status page says "operational" and you're getting 504, the upstream may be fine for most requests but slow on the specific operation you're doing (a heavy query, a large file upload).

# 502: upstream crash or connection reset
HTTP/2 502
{"error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination"}
# → Check upstream server logs. Look for process restarts or OOM kills.

# 504: upstream too slow
HTTP/2 504
{"error":"upstream request timeout"}
# → Look for slow queries, or check if the gateway timeout config is too short for this operation.

Network failures: below the HTTP layer

Not every failure produces an HTTP response. Some failures happen at the TCP/DNS layer — the connection never gets established — and your tool reports them as error messages rather than status codes.

Error message	What it means	Layer / next check
`Connection refused (ECONNREFUSED)`	TCP connection reached the host but the port is not listening. The server process may be stopped, or you have the wrong port.	Network/server. Is the service running? Is the port correct?
`Connection reset (ECONNRESET)`	The server actively closed the connection mid-stream. Often: TLS mismatch, server crash during a request, or a firewall tearing down an idle connection.	Network/gateway. Check TLS version, firewall rules, gateway keep-alive config.
`Connection timed out (ETIMEDOUT)`	TCP SYN sent, no response. The host is unreachable or a firewall is silently dropping packets.	Network. Wrong IP/host, firewall rule, or the server is completely down.
`Could not resolve host (DNS NXDOMAIN)`	DNS lookup failed: the hostname returned no A or AAAA record. Typo in the hostname, or DNS propagation delay after a change.	DNS. Run `dig api.example.com` to verify.
`SSL certificate verify failed`	The server's TLS certificate is expired, self-signed, or for a different hostname.	TLS layer. Check `openssl s_client -connect api.example.com:443`.

curl -i https://api.example.com/v1/ping
curl: (6) Could not resolve host: api.example.com
# DNS failure: nothing was sent. Check the hostname for typos.
# Verify with: dig api.example.com

curl -i https://api.example.com/v1/ping
curl: (7) Failed to connect to api.example.com port 443 after 0 ms: Connection refused
# Host resolved, TCP reached the right IP, but port 443 is not listening.
# Server process not running, or wrong port.

curl -i https://api.example.com/v1/ping
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to api.example.com:443
# TCP connected, TLS handshake started, then torn down.
# Possible TLS version mismatch or certificate issue.
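Application code sees the same failures as exceptions rather than curl exit codes. The sketch below maps Python's standard-library exception types onto the layers from the table above; the function name and the returned descriptions are illustrative, and HTTP client libraries often wrap these exceptions in their own types, so you may need to unwrap the underlying cause first.

```python
import socket
import ssl


def classify_network_error(exc: BaseException) -> str:
    """Name the layer a below-HTTP failure points to."""
    # Check the most specific types first: all of these derive from OSError.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "TLS: certificate expired, self-signed, or for another hostname"
    if isinstance(exc, socket.gaierror):
        return "DNS: hostname did not resolve"
    if isinstance(exc, ConnectionRefusedError):
        return "TCP: host reachable, nothing listening on that port"
    if isinstance(exc, ConnectionResetError):
        return "TCP/TLS: the peer or a middlebox tore the connection down"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "network: no reply, host unreachable or packets dropped"
    return "unclassified"


print(classify_network_error(ConnectionRefusedError()))
print(classify_network_error(socket.gaierror(socket.EAI_NONAME, "Name or service not known")))
```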

The symptom-to-cause table

In a real incident, you observe a symptom first. Use this table as a quick-reference first-pass:

What you see	Most likely cause	First thing to check
400 with an error body	Malformed or invalid request	Read the error message — it usually names the field
401	Missing, expired, or malformed credentials	Check the token's expiry (the `exp` claim, if it is a JWT); check the `WWW-Authenticate` header
403	Authenticated but lacking permission/scope	Check the required scope in the API docs; verify the token's scopes
404 on a known resource	Wrong URL version, typo, or resource deleted	Compare the path to the API reference; check if the resource was recently deleted
409	Request conflicts with current resource state	Read the error body for the conflicting state; check for duplicate submissions
422	Valid JSON but business rule violated	Read the error body; check each field against the API's constraints
429	Rate limit exceeded	Read `Retry-After` or `X-RateLimit-Reset`; back off before retrying
500	Server-side bug or unhandled exception	Check server logs for stack trace; check recent deployments
502, no server log	Gateway got no valid response from upstream	Check upstream server status; look for process crash
504, no server log	Gateway timed out waiting for upstream	Check for slow queries; consider whether the operation is inherently slow
Connection refused	Port not listening	Is the service running? Is the port number correct?
DNS failure	Hostname doesn't resolve	`dig api.example.com`; check for typos
Connection timeout	Unreachable or firewall dropping packets	Verify IP routing; check firewall egress rules

🎯 Interview angle

Interviewers often hand you a log line or a terminal output and ask "what's wrong and where would you look?" Lead with the status class ("this is a 4xx, so the problem is in the request"), then name the specific code, then describe what evidence you'd check next. A 401/403 distinction answer that explains "I need to check whether they're unauthenticated or unauthorised" signals senior-level precision — many candidates conflate the two.

⚠️ Common trap

Treating a 502 or 504 from a gateway as proof that "the API is down." Those codes mean the gateway is working correctly and reporting a backend problem. The API may be partially functional — other endpoints may be fine, or the problem may be limited to specific request types. Always check the provider's status page and try a simpler endpoint (like a health check) before escalating.

✅ Read the response body before checking anything else

A well-designed API packs most of the diagnostic signal into the response body — the error code, the specific field that failed, the violated constraint. The status code is the section header; the body is the sentence. Always run curl -i (not just curl) so you see headers, and always look at the full response before pulling logs or changing code.
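When you are working from captured output rather than a live call, it can help to split a raw response into its three parts before reasoning about it. A minimal sketch, assuming the text looks like what curl -i prints (status line, headers, a blank line, then the body); the function name is hypothetical:

```python
import json
from typing import Any, Dict, Tuple


def parse_raw_response(raw: str) -> Tuple[int, Dict[str, str], Any]:
    """Split `curl -i` style text into (status, headers, body)."""
    head, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    status_line, *header_lines = head.strip().split("\n")
    status = int(status_line.split()[1])  # "HTTP/2 400" -> 400

    headers: Dict[str, str] = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    try:
        parsed_body: Any = json.loads(body)
    except json.JSONDecodeError:
        parsed_body = body.strip()  # not JSON: keep the raw text
    return status, headers, parsed_body


raw = (
    "HTTP/2 400\n"
    "\n"
    '{"error":{"code":"invalid_type","field":"quantity",'
    '"message":"quantity must be an integer"}}'
)
status, headers, body = parse_raw_response(raw)
print(status, body["error"]["field"], "-", body["error"]["message"])
```

For the 400 example from earlier in the lesson, the body alone already names the field and the constraint, which is the whole diagnosis.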

Under the hood: how each failure is produced at the protocol level

Every error message a developer sees is the end of a specific protocol-level event. Understanding the exact mechanism — the TCP exchange, the gateway decision logic — turns a vague message into a precise pointer to the failure point. Below is how each major failure class actually happens on the wire.

Connection refused — TCP RST in response to SYN

When you run curl https://api.example.com:443/v1/ping, your OS opens a TCP socket and sends a SYN segment to port 443 of the server's IP. If nothing is listening on that port, the server's kernel sends back a RST (reset) segment immediately. No application code is involved, no data is exchanged, and the connection is terminated before it was ever established. The whole exchange takes one network round-trip: about 1 ms in the trace below.

# tcpdump trace: connection refused (nothing listening on port 443)
# Run on the client machine while curl fails
tcpdump -nn -i eth0 'host api.example.com and port 443'

14:21:03.001 IP 10.0.1.5.54321 > 93.184.216.34.443: Flags [S], seq 2847391048
14:21:03.002 IP 93.184.216.34.443 > 10.0.1.5.54321: Flags [R.], seq 0, ack 2847391049

# SYN sent (Flags [S]) at .001
# RST received (Flags [R.]) at .002: 1 ms round-trip, port is not open
# curl reports: (7) Failed to connect: Connection refused

The distinguishing feature: the RST arrives almost instantly (sub-millisecond on a LAN, a few milliseconds over the internet). The host is reachable — packets are getting through — but the kernel actively rejects the connection because no process has called listen() on that port.
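You can reproduce this on your own machine without touching a real server. The sketch below binds a loopback port but deliberately never calls listen() on it, then tries to connect; on Linux the kernel answers the SYN with a RST and connect() fails almost immediately. The function name is illustrative, and behaviour on other operating systems may differ.

```python
import socket
import time
from typing import Tuple


def probe_unlistened_port() -> Tuple[str, float]:
    """Connect to a bound-but-not-listening loopback port; report outcome and seconds taken."""
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(("127.0.0.1", 0))  # reserve a free port, but never listen()
    port = holder.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    started = time.monotonic()
    try:
        client.connect(("127.0.0.1", port))
        outcome = "connected"
    except ConnectionRefusedError:
        outcome = "refused"  # the kernel replied with RST
    finally:
        elapsed = time.monotonic() - started
        client.close()
        holder.close()
    return outcome, elapsed


outcome, elapsed = probe_unlistened_port()
print(f"{outcome} after {elapsed * 1000:.2f} ms")
```

Holding the port with bind() rather than picking an arbitrary number avoids the race where some other process happens to be listening there.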

Connection timeout — SYN sent, no SYN-ACK ever arrives

If the packet is dropped silently — by a firewall, a misconfigured security group, or a network that simply cannot route to the destination — the SYN leaves your machine and disappears. The server never replies. Your OS retransmits the SYN several times (the default is 3–6 retransmits with exponential backoff), then gives up. This takes 20–127 seconds depending on the OS's TCP retransmission timeout settings.

# tcpdump trace: timeout (firewall dropping the SYN silently)
tcpdump -nn -i eth0 'host 10.5.0.1 and port 8443'

14:25:00.000 IP 10.0.1.5.61234 > 10.5.0.1.8443: Flags [S], seq 3910283748
# ... no reply for 1 second ...
14:25:01.001 IP 10.0.1.5.61234 > 10.5.0.1.8443: Flags [S], seq 3910283748   ← retransmit
# ... no reply for 2 seconds ...
14:25:03.002 IP 10.0.1.5.61234 > 10.5.0.1.8443: Flags [S], seq 3910283748   ← retransmit
# ... kernel gives up after configured timeout (default ~127 s on Linux)

# curl reports: (28) Connection timed out after 30000ms (if --max-time 30 set)
# KEY DIFFERENCE vs "refused": no RST. The SYN just keeps retransmitting into silence.
# If set lower than the OS limit, curl's --max-time (or --connect-timeout) fires first.

Practical diagnostic: if a connection that used to work now times out, look for a change in firewall rules, VPC security groups, or network ACLs. "Refused" means the host is reachable and the kernel is there; "timeout" means the packets are not getting through at all.
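The ~127-second Linux figure falls out of simple arithmetic: the first SYN waits one initial retransmission timeout, and each retransmit doubles the wait. A short sketch of that sum, assuming pure doubling with no cap and a 1-second initial timeout (the function name is illustrative; Linux's default of 6 SYN retransmits comes from its tcp_syn_retries setting):

```python
def syn_give_up_seconds(retransmits: int, initial_rto: float = 1.0) -> float:
    """Total wait before connect() fails when every SYN goes unanswered.

    One wait per SYN sent (the original plus each retransmit),
    doubling each time: initial_rto * (1 + 2 + 4 + ...).
    """
    return sum(initial_rto * 2 ** attempt for attempt in range(retransmits + 1))


print(syn_give_up_seconds(6))  # 127.0 with Linux's default of 6 retransmits
```

Lower retry counts or a different initial timeout give the shorter end of the range, which is why the same dropped-packet fault shows up as very different delays on different systems.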

Connection reset mid-stream — RST during data transfer

A reset can also happen after the TCP three-way handshake and even after HTTP data has started flowing. The most common causes: the server-side process crashes or is killed (SIGKILL), a TLS version mismatch aborts the handshake after the SYN-ACK, a firewall with stateful inspection tears down an idle connection, or an intermediate proxy closes an idle keep-alive connection before the client expected.

# tcpdump trace: connection reset mid-stream (TLS handshake started, then aborted)
tcpdump -nn -i eth0 'host api.example.com and port 443'

14:30:00.100 IP client > server: Flags [S]            ← SYN
14:30:00.120 IP server > client: Flags [S.] ack 1     ← SYN-ACK
14:30:00.121 IP client > server: Flags [.]            ← ACK, TCP established
14:30:00.125 IP client > server: Flags [P.] len 280   ← TLS ClientHello (data starts)
14:30:00.130 IP server > client: Flags [R.] seq ...   ← RST, connection torn down

# TCP connected, but the TLS layer rejected the client.
# Possible: server requires TLS 1.3, client sent TLS 1.2 ClientHello.
# curl reports: (35) OpenSSL SSL_connect: Connection reset by peer
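A mid-stream reset can also be provoked locally, which makes the client-side symptom easy to recognise. The sketch below is not a TLS example: it uses a plain loopback TCP connection and has the server side close abruptly with SO_LINGER set to a zero timeout, which makes the kernel send RST instead of the normal FIN. It assumes Linux or macOS, where the linger option is packed as two C ints; the function name is illustrative.

```python
import socket
import struct


def provoke_reset() -> str:
    """Establish a loopback TCP connection, then have the server side abort it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    client.connect(listener.getsockname())  # three-way handshake completes
    server_side, _ = listener.accept()

    client.sendall(b"GET /v1/ping HTTP/1.1\r\nHost: localhost\r\n\r\n")

    # l_onoff=1, l_linger=0: close() discards the connection and sends RST.
    server_side.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    server_side.close()

    try:
        data = client.recv(1024)
        return "closed cleanly" if data == b"" else "got data"
    except ConnectionResetError:
        return "reset by peer"
    finally:
        client.close()
        listener.close()


print(provoke_reset())
```

Unlike the refused case, the handshake succeeds here and data is sent before the failure, which is exactly the pattern to look for in a capture.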

502 Bad Gateway — the proxy got a bad response from upstream

A reverse proxy (nginx, AWS ALB, Cloudflare) sits between your client and the application server. When you request POST /v1/orders, the proxy forwards that request to an upstream server. A 502 typically means the proxy connected to the upstream, but the upstream returned something unusable: a connection reset, an empty response, a garbled HTTP status line, or a crash mid-response before valid headers were sent.

From the proxy's perspective, it performed its job (forwarded the request) but could not produce a valid response from what it got back. The proxy generates the 502 itself; the upstream server either sent nothing or sent gibberish. The upstream server's own logs may show a crash, an OOM kill, or a process restart at that timestamp.

504 Gateway Timeout — the proxy gave up waiting for upstream

A 504 means the proxy connected to the upstream (or at least initiated the connection), but the upstream did not send a complete response within the proxy's configured upstream timeout. The proxy is working correctly; it's reporting that the upstream was too slow. The upstream might be running (not crashed) but stuck: executing a slow database query, waiting for a downstream service, or simply overwhelmed with requests and not getting to yours.

502: proxy got a response but it was invalid (RST or crash). 504: proxy got nothing before the timeout expired. In both cases the proxy generated the error code, not the upstream app.
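The decision the proxy makes can be sketched in a few lines. This is a toy model, not how nginx or any real gateway is implemented: the upstream call is represented by a plain Python callable, and the names and error strings are illustrative.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (status code, body)


def gateway_forward(call_upstream: Callable[[], Response]) -> Response:
    """Forward one request and translate upstream failures into 502 or 504."""
    try:
        status, body = call_upstream()
    except TimeoutError:
        # Upstream did not answer within the gateway's upstream timeout.
        return 504, "upstream request timeout"
    except ConnectionError:
        # Covers reset, refused, and broken-pipe failures.
        return 502, "upstream connect error or disconnect/reset before headers"
    if not 100 <= status <= 599:
        return 502, "upstream sent an invalid status line"
    return status, body  # pass a valid response through unchanged


def slow_upstream() -> Response:
    raise TimeoutError("no response within the upstream timeout")


def crashed_upstream() -> Response:
    raise ConnectionResetError("RST before headers")


def healthy_upstream() -> Response:
    return 200, '{"ok":true}'


for upstream in (slow_upstream, crashed_upstream, healthy_upstream):
    print(upstream.__name__, "->", gateway_forward(upstream)[0])
```

Note that a 500 returned by the application would pass straight through in this model: the gateway only substitutes its own status when it has no valid upstream response to relay.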

⚠️ The 1-second vs 30-second clue

The speed of the error is diagnostic. A 502 often arrives in under a second (the upstream crashed and sent a RST immediately). A 504 takes about as long as the gateway's upstream timeout, commonly 30, 60, or 120 seconds. If your API call fails after almost exactly 30 seconds every time, that's almost certainly a 504 from a gateway with a 30-second upstream timeout, not a 502 from a crash. Check the gateway's upstream timeout setting (proxy_read_timeout in nginx, for example).
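If you have durations for several failed calls, the clustering check can be automated. A rough sketch, where the candidate timeouts come from the values mentioned above and the half-second tolerance and one-second "fast" threshold are arbitrary assumptions to tune for your setup:

```python
from typing import Sequence


def timing_hint(elapsed_seconds: Sequence[float],
                common_timeouts: Sequence[float] = (30, 60, 120),
                tolerance: float = 0.5) -> str:
    """Guess the failure type from how long the failed calls took."""
    if not elapsed_seconds:
        return "no samples"
    for limit in common_timeouts:
        # Every failure landing on the same round number suggests a configured timeout.
        if all(abs(t - limit) <= tolerance for t in elapsed_seconds):
            return f"consistent with a {limit} s gateway timeout (504)"
    if max(elapsed_seconds) < 1.0:
        return "fast failure: consistent with an upstream crash or reset (502)"
    return "no clear timing pattern"


print(timing_hint([30.01, 29.98, 30.02]))
print(timing_hint([0.2, 0.4]))
```

Treat the result as a hint for where to look first; the status code and response body remain the actual evidence.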

🧠 Quick check

1. You call an API endpoint and receive a 401. You refresh your access token and call again. You receive a 403. What happened?

401 means authentication failed — the server didn't recognise who you were. After getting a fresh token, authentication succeeded (no more 401), but now the server knows who you are and has decided you don't have the required permission — hence 403. The fix is to request a token with the right scope, not to refresh again.

2. You get a 504 from an external API provider, and there is no entry in the server logs on your end. What most likely happened?

504 Gateway Timeout means the gateway (not your client, not necessarily a crash) waited past its timeout threshold for the backend to respond. The backend may be running but very slow — an overloaded queue, a slow database query, or a heavy operation that takes longer than the gateway's configured timeout.

3. curl: (7) Failed to connect to api.example.com port 443: Connection refused. What does this tell you?

"Connection refused" means the TCP SYN reached the server's IP address, and the server actively rejected it on port 443 — the port is not open. DNS worked (otherwise you'd see "could not resolve host"). The most likely cause: the web server process is not running, or it's listening on a different port.

4. You receive a 422 with the body {"error":"end_date must be after start_date"}. What kind of problem is this?

422 Unprocessable Entity is used when the request body was syntactically correct (valid JSON) but failed a business rule. The server parsed the dates successfully; it's telling you the dates are in the wrong order. Fix the request values — don't look at server logs or network config.

✍️ Exercise: diagnose the log line

You're handed the following terminal output from a staging environment. Diagnose each line: name the error type, which layer is implicated, and what you would check next.

--- Request A ---
HTTP/2 403
{"error":"forbidden","detail":"token lacks write:invoices scope"}

--- Request B ---
curl: (6) Could not resolve host: api.acme.internal

--- Request C ---
HTTP/2 504
x-request-id: req_8f7a2c
(no server log entry for req_8f7a2c)

--- Request D ---
HTTP/2 400
{"error":"validation_error","field":"amount","message":"must be a positive integer, got -500"}

Model answers:

A (403): The caller is authenticated (no 401) but the token's scopes don't include write:invoices. This is an authorisation failure: the server is enforcing its policy correctly, so the fix is on the caller's side. Check which scopes the token was issued with; request a new token with the correct scope.
B (DNS failure): The DNS resolver returned no A record for api.acme.internal. The hostname doesn't exist in this environment's DNS, or the internal DNS server is unreachable. Run dig api.acme.internal; check the DNS zone configuration for the staging environment.
C (504): The gateway (which issued the 504) timed out waiting for the upstream. The missing server log entry is consistent with two possibilities: the request never reached the application, or it reached it but had not finished (and so had not been logged) when the gateway gave up. Check for slow database queries or an overloaded service. Use the x-request-id to correlate the gateway's record with the upstream service's own logs.
D (400): A request-layer validation failure — the amount field was -500, which the API requires to be a positive integer. No server problem. Fix the request to pass a valid positive integer.

Rubric: ✓ Correctly identifies the layer for each (authorisation, DNS, gateway, request validation) ✓ Names the specific error type ✓ Proposes a concrete next action ✓ Does not suggest checking server logs for A or D (those are request-side problems).

Key takeaways

4xx = fix the request; 5xx = investigate the server. The leading digit tells you whose side the problem is on.
401 ≠ 403. 401 is "I don't know who you are" (unauthenticated, despite its official reason phrase "Unauthorized"). 403 is "I know exactly who you are, and you can't do this" (authenticated but not authorised).
502 and 504 come from the gateway, not the app server. 502 = upstream returned bad/no response; 504 = upstream too slow.
Network failures have no status code. Connection refused, DNS failure, and TLS errors are below HTTP — diagnose them with curl error messages and separate tools like dig and openssl s_client.
Always read the response body. The status code is the section header; the body is the sentence that tells you what specifically failed.