Platform & API Product Engineering · Lesson 03
Designing your API error model
Your error contract is not an afterthought — it is the part of your API that determines how quickly an integration partner can diagnose and recover from a problem at 2 a.m. A badly designed error model makes every bug an escalation; a well-designed one lets clients self-serve.
By the end you'll be able to
- Design a complete error envelope with the fields needed for machine-readable branching, human debugging, and support tracing.
- Map every domain error to the right HTTP status code, and explain the three-layer model: transport status, machine code, and human message.
- Classify errors as client, transient, or business — and signal retryability in the response so clients can act without guessing.
Why error design matters as much as success design
Most API design effort goes into the happy path: the right resource shape, sensible field names, a clean pagination cursor. The error path is treated as an afterthought: a string message and a 400. But the error path is where integration problems live. When a payment fails, a webhook bounces, or a validation check rejects a request, the quality of your error response determines whether the developer can fix it in minutes or has to file a support ticket.
Think of the error response as a patient chart at a hospital. The HTTP status code is the triage category on the door — "urgent" or "stable". The machine-readable error code is the diagnosis — precise and stable so a computer can route it. The human message is the doctor's note — readable but not something software should parse, because next month the wording might change. The request ID is the chart number that lets support pull the full record. Strip any of these layers and the chart becomes useless to half the people who need it.
The error envelope
A well-designed error body wraps a single error object at the top level. The outer key keeps success and error shapes clearly separated — a client can check if ("error" in body) to know which branch to take, rather than peeking at status codes already captured at the transport layer.
// Full error envelope: canonical structure
{
  "error": {
    // MACHINE layer: clients branch on these; they are CONTRACT-STABLE
    "type": "invalid_request_error", // top-level category
    "code": "amount_too_small", // stable, specific sub-code
    // HUMAN layer: readable prose; NEVER parse this string
    "message": "Amount must be at least 50 cents (USD).",
    // FIELD layer: for validation errors; which param is wrong?
    "param": "amount",
    // SUPPORT layer: trace it in your logs; paste it in a ticket
    "request_id": "req_9GqX2mHk7fP3n",
    "doc_url": "https://docs.example.com/errors/amount_too_small"
  }
}
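On the client side, the envelope can be read into a small typed object. The following is a minimal sketch, assuming the field names shown above; the `ApiError` class and `parse_error` helper are hypothetical names introduced for illustration, not part of any particular SDK.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApiError:
    """Typed view of the error envelope; field names mirror the JSON above."""
    type: str
    code: str
    message: str
    param: Optional[str] = None
    request_id: Optional[str] = None
    doc_url: Optional[str] = None


def parse_error(raw_body: str) -> Optional[ApiError]:
    """Return an ApiError if the body has a top-level "error" key, else None."""
    body = json.loads(raw_body)
    if not isinstance(body, dict) or "error" not in body:
        return None  # success shape: no top-level "error" key
    err = body["error"]
    return ApiError(
        type=err["type"],
        code=err["code"],
        message=err.get("message", ""),
        param=err.get("param"),
        request_id=err.get("request_id"),
        doc_url=err.get("doc_url"),
    )
```

Treating `type` and `code` as required and everything else as optional reflects the idea that the machine layer is the contract, while the other layers are helpful extras.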
The three layers: status, code, message
The error contract has three distinct consumers. HTTP status codes are consumed by infrastructure — proxies, load balancers, CDNs, monitoring systems. Machine-readable codes are consumed by client code — the if code == "card_declined" branch. Human messages are consumed by people — the developer reading the log at 2 a.m. Mixing these layers causes problems: if your client branches on message text, a copy edit to fix a typo breaks your integration.
The three layers explained
| Layer | Consumer | Stability contract | Examples |
|---|---|---|---|
| HTTP status | Proxies, CDNs, monitoring, client retry logic | Stable forever — HTTP statuses are standardised by the IETF | 422, 429, 503 |
| Machine code (code field) | Client application code (if branches, switch statements) | Contract-stable: you never remove or rename a code; add new ones as needed | card_declined, rate_limit_exceeded, insufficient_funds |
| Human message | Developers reading logs or consoles; end users seeing UI copy | Unstable — you may reword it, translate it, or add detail; code must never parse it | "Amount must be at least 50 cents." |
A common integration bug: if (error.message.includes("card was declined")). This is fragile. The moment your team rewords the message (say, to "Your card has been declined") the client branch silently stops matching. Use error.code === "card_declined". The code is stable; the message is not.
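The difference between the two branching styles can be shown in a few lines. This is an illustrative sketch only: the two handler functions and the reworded message are hypothetical, chosen to show what happens when human-readable copy changes while the code stays fixed.

```python
def handle_by_message(error: dict) -> str:
    # Fragile: depends on the exact wording of human-readable prose.
    if "card was declined" in error["message"]:
        return "ask_for_new_card"
    return "unknown"


def handle_by_code(error: dict) -> str:
    # Stable: depends only on the contract-stable machine code.
    if error["code"] == "card_declined":
        return "ask_for_new_card"
    return "unknown"


# The same error before and after a hypothetical copy edit to the message.
before = {"code": "card_declined", "message": "Your card was declined."}
after = {"code": "card_declined", "message": "Your card has been declined."}
```

With these inputs, `handle_by_message` recognises `before` but falls through to `"unknown"` for `after`, while `handle_by_code` treats both identically.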
Mapping domain errors to HTTP status codes
HTTP status codes are the transport signal. They must be chosen based on what the error means to the protocol, not just what the error means to your domain. Many teams default to 400 for every client error, which is too coarse. A misconfigured rate limiter that silently returns 400 instead of 429 prevents every well-behaved SDK from triggering its automatic retry logic.
| Domain scenario | Correct status | Why not the common wrong choice |
|---|---|---|
| Request body fails schema validation (wrong type, missing required field) | 422 Unprocessable Entity | Not 400. 400 means "the request was malformed at the HTTP level" (bad JSON, illegal headers). 422 means "the syntax is fine but the semantics are invalid." |
| Missing or invalid credentials | 401 Unauthorized | Not 403. 401 means "we don't know who you are; authenticate." The browser knows to prompt for credentials on 401. |
| Valid credentials, but insufficient permission | 403 Forbidden | Not 401. 403 means "we know exactly who you are and the answer is still no." Re-authenticating won't help. |
| Client tries to create a resource that already exists (duplicate key) | 409 Conflict | Not 400. 409 is specifically "the resource state conflicts with the request." Useful for idempotency-key collisions and unique-constraint violations. |
| Client has exceeded their rate limit | 429 Too Many Requests | Not 503. 429 is a client-side quota issue. Respond with Retry-After. 503 means the server is genuinely overloaded — a different signal entirely. |
| Payment required or subscription expired | 402 Payment Required | This is the only HTTP status that exists specifically for billing-gate scenarios; use it. Include a machine code like subscription_expired so the client can route to the billing page. |
| Business logic decline (e.g. card declined, fraud block) | 402 (or 200 with a status: "failed" in the body) | Some platforms use 200 here because the request was received and processed correctly; the outcome was a decline. Both are defensible — pick one and be consistent. See anti-pattern below. |
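One way to keep this mapping consistent is to hold it in a single lookup table rather than scattering status literals through handlers. The sketch below assumes a flat dictionary keyed by machine code; codes such as `invalid_api_key`, `permission_denied`, and `resource_already_exists` are hypothetical names used for illustration, and the 402 choice for `card_declined` is one of the two defensible options from the table.

```python
from http import HTTPStatus

# Single source of truth: machine code -> transport status.
STATUS_BY_CODE = {
    "validation_failed": HTTPStatus.UNPROCESSABLE_ENTITY,   # 422
    "invalid_api_key": HTTPStatus.UNAUTHORIZED,             # 401
    "permission_denied": HTTPStatus.FORBIDDEN,              # 403
    "resource_already_exists": HTTPStatus.CONFLICT,         # 409
    "rate_limit_exceeded": HTTPStatus.TOO_MANY_REQUESTS,    # 429
    "subscription_expired": HTTPStatus.PAYMENT_REQUIRED,    # 402
    "card_declined": HTTPStatus.PAYMENT_REQUIRED,           # 402
}


def status_for(code: str) -> int:
    """Look up the HTTP status for a machine code.

    An unmapped code is treated as a server-side bug (500) instead of
    being silently downgraded to a generic 400.
    """
    return int(STATUS_BY_CODE.get(code, HTTPStatus.INTERNAL_SERVER_ERROR))
```

Falling back to 500 for unknown codes is a design choice in this sketch: it makes a missing mapping visible in monitoring rather than hiding it among client errors.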
The "200 with an error body" anti-pattern
A pervasive pattern in older APIs: every response returns HTTP 200, and success vs. failure is communicated only in the body with a status: "error" field. GraphQL uses a variant of this. The motivation is usually "it simplifies the client": you always get a 200, so you check the body every time. This trades one kind of complexity for several worse kinds.
| What breaks | Why 200-with-error causes it |
|---|---|
| HTTP caches | Caches store 200 responses. A cached "error" body will be served to subsequent clients. The second requester gets a stale error for a request they never made. |
| Monitoring & alerting | Error-rate dashboards count non-2xx responses. If errors return 200, your error rate is always 0%. Real problems are invisible in your SLO dashboard. |
| Client retry logic and SDKs | HTTP-aware SDKs, API gateways, and reverse proxies use the status code to decide whether to retry. Returning 200 for a transient error means the SDK never retries it. |
| Load balancers and health checks | HAProxy, Nginx, and AWS ALB health checks pass a 200 as "healthy." An endpoint returning 200 for every request — including ones that are fundamentally broken — looks healthy when it isn't. |
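The monitoring row is easy to demonstrate. The sketch below assumes a dashboard that counts any non-2xx response as an error, which is a common convention but not a universal one; the `error_rate` function and the two sample traffic lists are hypothetical.

```python
def error_rate(statuses: list[int]) -> float:
    """Share of responses outside the 2xx range, as a status-based dashboard counts it."""
    if not statuses:
        return 0.0
    failures = sum(1 for status in statuses if not 200 <= status < 300)
    return failures / len(statuses)


# Four requests: two succeed, one card is declined, one fails validation.
honest = [200, 402, 200, 422]
# The same four outcomes from an API that reports every error as 200.
masked = [200, 200, 200, 200]
```

Under these assumptions, `error_rate(honest)` is 0.5 while `error_rate(masked)` is 0.0, even though the underlying outcomes are identical.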
The single exception: a batch operation where some items succeed and some fail. Use 207 Multi-Status: the request as a whole succeeded, but the response contains per-item outcomes. This is distinct from a top-level request failure. See the "Partial and batch failures" section.
Error taxonomy: client, transient, and business
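Beyond individual status codes, a higher-level taxonomy determines what a client should do next: client errors (the request itself is wrong; don't retry it unchanged), transient errors (retry with backoff), and business errors (the request was valid but the outcome was a decline; fix the underlying cause, then retry). Getting this taxonomy right in your documentation, and in your response body, is what enables automated recovery in client SDKs.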
Signalling retryability
Documenting retryability in your developer docs is necessary but not sufficient — well-designed SDKs need to know it at runtime. The cleanest approach is an explicit field in the error body:
{
  "error": {
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "message": "Too many requests. Please slow down.",
    "retryable": true, // machine-readable retry signal
    "retry_after": 23 // seconds to wait (also in Retry-After header)
  }
}
For business errors that require user action before retrying, use a dedicated value — not just false — so the SDK can distinguish "don't retry" from "retry after user fixes something":
{
  "error": {
    "type": "card_error",
    "code": "card_declined",
    "decline_code": "insufficient_funds", // Stripe's second-level code
    "message": "Your card has insufficient funds.",
    "retryable": "after_user_action" // not a bool; distinguishes the case
  }
}
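A client can turn the `retryable` and `retry_after` fields into a retry decision without inspecting the message. The following is a minimal sketch under stated assumptions: the `next_action` function, its action names, the three-attempt cap, and the exponential fallback delay are all hypothetical choices, not a prescribed SDK behaviour.

```python
from typing import Optional


def next_action(error: dict, attempt: int, max_attempts: int = 3) -> tuple[str, Optional[float]]:
    """Return (action, delay_seconds) for an error object.

    attempt is the number of retries already made (0 for the first failure).
    """
    retryable = error.get("retryable", False)
    if retryable == "after_user_action":
        # Business error: retrying unchanged will fail again; involve the user.
        return ("prompt_user", None)
    if retryable is True and attempt < max_attempts:
        delay = error.get("retry_after")
        if delay is None:
            delay = 2 ** attempt  # fallback: exponential backoff in seconds
        return ("retry", float(delay))
    return ("fail", None)
```

Preferring the server's `retry_after` hint over a locally computed delay lets the API steer clients during rate limiting, while the fallback keeps the client functional when the hint is absent.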
Field-level validation errors
Validation errors deserve special treatment because a single request can violate multiple constraints at once. A user submitting a form should get all the errors in one round trip, not one at a time. The pattern is to promote the error body to contain a list of per-field problems:
// 422: multiple validation failures in one response
{
  "error": {
    "type": "invalid_request_error",
    "code": "validation_failed",
    "message": "The request contains invalid parameters.",
    "details": [
      {
        "param": "amount",
        "code": "amount_too_small",
        "message": "Must be at least 50 (cents)."
      },
      {
        "param": "currency",
        "code": "unsupported_currency",
        "message": "'XYZ' is not a supported currency code."
      }
    ],
    "request_id": "req_9GqX2mHk7fP3n"
  }
}
The top-level code is validation_failed (machine-readable, stable). The per-field codes in details are also stable. The message at each level is human-readable prose. A web client can walk details and highlight each problematic field; a CLI tool can print them all at once.
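On the server, the key behaviour is collecting every violation before responding instead of failing on the first one. Here is a minimal sketch, assuming a charge request with only `amount` and `currency`; the supported-currency set, the `amount_invalid` code, and both function names are hypothetical.

```python
SUPPORTED_CURRENCIES = {"usd", "eur", "gbp"}  # hypothetical list for illustration
MIN_AMOUNT_CENTS = 50


def validate_charge(params: dict) -> list[dict]:
    """Check every field and return all violations, not just the first."""
    details = []
    amount = params.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, int):
        details.append({"param": "amount", "code": "amount_invalid",
                        "message": "Must be an integer number of cents."})
    elif amount < MIN_AMOUNT_CENTS:
        details.append({"param": "amount", "code": "amount_too_small",
                        "message": "Must be at least 50 (cents)."})
    currency = params.get("currency")
    if not isinstance(currency, str) or currency.lower() not in SUPPORTED_CURRENCIES:
        details.append({"param": "currency", "code": "unsupported_currency",
                        "message": f"{currency!r} is not a supported currency code."})
    return details


def validation_error_body(details: list[dict], request_id: str) -> dict:
    """Wrap the collected violations in the 422 envelope."""
    return {"error": {"type": "invalid_request_error",
                      "code": "validation_failed",
                      "message": "The request contains invalid parameters.",
                      "details": details,
                      "request_id": request_id}}
```

A caller would return 422 with `validation_error_body(...)` whenever `validate_charge` returns a non-empty list, and proceed to domain logic otherwise.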
Partial and batch failures
A batch endpoint accepts many items in one request. Some may succeed; others may fail. Returning 400 or 422 implies the whole batch failed. Returning 200 implies all items succeeded. Neither is right for partial success. The correct status is 207 Multi-Status, borrowed from WebDAV and adopted widely for batch APIs. Each item in the response carries its own status code and, if it failed, its own error object:
// POST /v1/charges/batch: 207 Multi-Status
{
  "results": [
    {
      "index": 0,
      "status": 200,
      "id": "ch_abc"
    },
    {
      "index": 1,
      "status": 422,
      "error": {
        "type": "invalid_request_error",
        "code": "amount_too_small",
        "message": "Amount must be at least 50 cents.",
        "param": "amount"
      }
    },
    {
      "index": 2,
      "status": 200,
      "id": "ch_xyz"
    }
  ]
}
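Assembling and consuming this shape takes little code. The sketch below is illustrative: `batch_response` and `failed_items` are hypothetical helpers, and the choice to answer 207 for every batch (even one where all items succeed) is an assumption made to give clients a single code path.

```python
def batch_response(outcomes: list[dict]) -> tuple[int, dict]:
    """Build a 207 body from per-item outcomes.

    Each outcome is either {"id": ...} for success or
    {"status": <4xx/5xx>, "error": {...}} for failure.
    """
    results = []
    for index, outcome in enumerate(outcomes):
        if "error" in outcome:
            results.append({"index": index,
                            "status": outcome["status"],
                            "error": outcome["error"]})
        else:
            results.append({"index": index, "status": 200, "id": outcome["id"]})
    return 207, {"results": results}


def failed_items(body: dict) -> list[int]:
    """Client side: indexes of the items that need attention."""
    return [item["index"] for item in body["results"] if item["status"] >= 400]
```

Because each entry carries its own `index`, a client can resubmit only the failed items after fixing them rather than replaying the whole batch.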
The request_id: linking errors to logs
The request_id is a unique identifier generated by your API server at the start of processing — before any business logic runs. It should appear in every response, successful or not. It must appear in every error. Its purpose is to allow a developer or support agent to paste one string into a logging query and retrieve the complete, correlated trace for that request.
Generate it as a prefixed opaque token — req_ followed by a URL-safe base62 string. Prefix the ID so it is immediately recognisable in a paste. Log it at the start and end of the request lifecycle, including every downstream call made during that request. Set it as the X-Request-Id response header as well, so clients who can't read the body (e.g. a load balancer doing health checks) can still correlate.
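A prefixed base62 token of this kind can be generated with the standard library alone. This is a minimal sketch; the 16-character length and the `new_request_id` name are assumptions, so pick a length that suits your traffic volume and collision tolerance.

```python
import secrets
import string

# Base62 alphabet: digits plus upper- and lower-case ASCII letters (URL-safe).
BASE62 = string.digits + string.ascii_letters


def new_request_id(length: int = 16) -> str:
    """Return an opaque, prefixed request ID such as "req_" + 16 base62 chars."""
    token = "".join(secrets.choice(BASE62) for _ in range(length))
    return f"req_{token}"
```

Using `secrets` instead of `random` keeps the IDs unpredictable, which matters if they ever appear in URLs or are accepted back from clients.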
Standards: RFC 9457, Google, and gRPC
The API industry has not converged on a single error format, but three influential standards are worth knowing. None of them is inherently better — your choice depends on your ecosystem and how much you want to align with existing tooling.
| Standard | Key shape | Best for |
|---|---|---|
| RFC 9457 Problem+JSON | { type (URI), title, status, detail, instance }. Content-Type: application/problem+json. type is a URI that is the stable machine-readable identifier; detail is human text. | RESTful HTTP APIs where interoperability and IETF alignment matter. Standard tooling (OpenAPI, API gateways) increasingly understands application/problem+json. |
| Google AIP-193 / Status+Details | { error: { code (HTTP int), message, status (canonical string), details (any[]) } }. Details is a list of typed objects, e.g. a BadRequest.FieldViolation proto for field errors, a RetryInfo for transient errors. | Protobuf / gRPC APIs that need rich, typed detail objects. Google Cloud APIs follow this model universally. |
| gRPC Status Codes | 17 canonical codes (OK plus 16 error codes), including INVALID_ARGUMENT, NOT_FOUND, ALREADY_EXISTS, RESOURCE_EXHAUSTED (rate limit), UNAVAILABLE (transient), FAILED_PRECONDITION, etc. | Any gRPC service. The canonical code maps directly to the HTTP status via a standard mapping table. |
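To see how close a custom envelope is to RFC 9457, it can help to translate one into the other. The following is a sketch under stated assumptions: the `to_problem_details` helper and the docs base URL are hypothetical, and `code` and `request_id` are carried as extension members, which the RFC permits alongside its five standard fields.

```python
def to_problem_details(error: dict, status: int,
                       docs_base: str = "https://docs.example.com/errors/") -> dict:
    """Map a custom error object onto RFC 9457 problem-details members."""
    problem = {
        "type": docs_base + error["code"],  # URI as the stable identifier
        "title": error["code"].replace("_", " ").capitalize(),
        "status": status,
        "detail": error["message"],  # human-readable text
        "code": error["code"],  # extension member
    }
    if "request_id" in error:
        problem["request_id"] = error["request_id"]  # extension member
    return problem
```

A response built this way would be served with Content-Type application/problem+json. The mapping is lossy in one direction only: everything in the custom envelope survives, but `type`/`code` nesting is flattened.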
Under the hood: how it actually works end-to-end
Trace what happens when a payment API receives a request with an invalid amount. The error model is not a separate layer — it runs through the entire stack from input parsing to the bytes on the wire.
- Request arrives at the API gateway. The gateway performs authentication (checking the API key against the database). If authentication fails, the gateway returns 401 immediately, before the request even reaches the application server. The request_id is generated here and attached to every downstream log entry for this request.
- Request deserialization. The framework parses the JSON body. If the JSON is malformed (syntax error), the framework returns 400 Bad Request before your code runs. The error body at this level typically says "failed to parse body"; it does not have a domain error code because no domain logic ran.
- Schema validation middleware. Your validation layer checks the parsed body against the request schema. It collects all validation violations (not just the first). If any violations exist, it constructs the details array and returns 422. The request_id is included in the body, having been propagated through a thread-local or request context.
- Domain logic. Validation passed; the business logic runs. A card charge is attempted. The payment processor declines: insufficient_funds. The application maps this to the error envelope: type=card_error, code=card_declined, decline_code=insufficient_funds. HTTP status: 402.
- Error middleware assembles the response. A top-level error handler catches any unhandled exception (e.g. a database connection failure). It logs the full stack trace internally and returns a sanitised 500 with only the request_id in the body, with no internal details. This is the firewall between your internal implementation and what the client sees.
- Response serialisation. The error object is serialised to JSON. Content-Type is set to application/json (or application/problem+json if using RFC 9457). Status line, headers, and body are written to the socket.
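The stages above can be condensed into a single handler to show how each one exits with its own status. This is a simplified sketch, not a framework integration: `ApiException`, `handle`, the `malformed_json` and `api_error` names, and the injected `charge` callable are hypothetical, and authentication is omitted because it happens at the gateway.

```python
import json
import logging

logger = logging.getLogger("api.errors")


class ApiException(Exception):
    """An expected failure that already knows its status and envelope fields."""

    def __init__(self, status: int, type_: str, code: str, message: str, **extra):
        super().__init__(message)
        self.status = status
        self.fields = {"type": type_, "code": code, "message": message, **extra}


def handle(raw_body: str, request_id: str, charge) -> tuple[int, dict]:
    """Parse, validate, run domain logic; always return (status, body)."""
    try:
        try:
            params = json.loads(raw_body)
        except json.JSONDecodeError:
            raise ApiException(400, "invalid_request_error", "malformed_json",
                               "Failed to parse body.") from None
        # One check stands in for the full schema; a missing amount is
        # treated like a too-small one to keep the sketch short.
        amount = params.get("amount") if isinstance(params, dict) else None
        if not isinstance(amount, int) or amount < 50:
            detail = {"param": "amount", "code": "amount_too_small",
                      "message": "Must be at least 50 (cents)."}
            raise ApiException(422, "invalid_request_error", "validation_failed",
                               "The request contains invalid parameters.",
                               details=[detail])
        return 200, charge(params)  # domain logic may raise ApiException (e.g. 402)
    except ApiException as exc:
        return exc.status, {"error": {**exc.fields, "request_id": request_id}}
    except Exception:
        # Catch-all firewall: the full trace goes to logs, never to the client.
        logger.exception("unhandled error request_id=%s", request_id)
        return 500, {"error": {"type": "api_error",
                               "message": "An unexpected error occurred.",
                               "request_id": request_id}}
```

Every exit path attaches the same `request_id`, so a 400, 402, 422, or 500 can all be traced with one log query.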
Design trade-offs
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Code format | Numeric codes (e.g. 1042) | String codes (e.g. "card_declined") | String. Numeric codes require a lookup table; strings are self-describing. Stripe switched from numeric to string codes early in its history for exactly this reason. |
| Code hierarchy | Flat: a single code field per error | Nested: a type (category) + a code (specific) | Nested. Clients who need coarse branching match on type; clients who need fine-grained handling match on code. One without the other forces all clients to the same level of granularity. |
| Error list | Single error per response | Multiple errors per response (details array) | Both simultaneously: a top-level code for the overall failure, plus a details list for field-level violations. Single error for non-validation failures; list for validation failures. |
| Format standard | RFC 9457 problem+json | Custom envelope (Stripe-style) | Custom envelope for product-focused APIs where DX matters (richer, more navigable). RFC 9457 for infrastructure or B2B APIs where IETF interoperability and existing tooling support is valued. |
By the numbers
Consider a payment API processing 5,000 charges/min (modeled). Historical data shows a 2% card decline rate and a 0.5% validation error rate. The error model choices have measurable operational impact.
| Scenario | Calculation (modeled) | Impact |
|---|---|---|
| Card declines per minute | 5,000 × 0.02 = 100 decline events/min | 100 support tickets/min without a self-service doc_url — each one costs ~5 min of support time |
| Validation errors per minute | 5,000 × 0.005 = 25 validation errors/min | Without field-level codes, each requires a developer to read docs to find which field is wrong — multiplied by all API consumers |
| Retries amplified by wrong status code | If card declines return 503 instead of 402: each decline triggers an SDK retry × 3 retries = 300 extra requests/min (6% traffic increase) for requests that will never succeed | Unnecessary load; hides real errors in retry noise |
| Support ticket deflection via doc_url | Industry benchmark (modeled): ~40% of error-related tickets are resolved when a contextual docs link is present in the error. 100 declines/min × 60 min × 0.40 = 2,400 fewer tickets/hour (about 57,600/day if the rate held around the clock) | Direct reduction in support cost |
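The modeled figures in this table can be reproduced with a few lines of arithmetic. The sketch below uses only the inputs stated above; the `modeled_impact` function name and dictionary keys are hypothetical. Note that 100 × 60 × 0.40 covers one hour of traffic, so a daily figure needs a further factor of 24.

```python
def modeled_impact(charges_per_min: int = 5_000,
                   decline_rate: float = 0.02,
                   validation_rate: float = 0.005,
                   sdk_retries: int = 3,
                   deflection_rate: float = 0.40) -> dict:
    """Recompute the modeled table rows from their stated inputs."""
    declines = charges_per_min * decline_rate                # decline events per minute
    validation_errors = charges_per_min * validation_rate    # validation errors per minute
    extra_requests = declines * sdk_retries                  # wasted retries per minute
    deflected_per_hour = declines * 60 * deflection_rate     # tickets avoided per hour
    return {
        "declines_per_min": declines,
        "validation_errors_per_min": validation_errors,
        "extra_requests_per_min": extra_requests,
        "traffic_increase": extra_requests / charges_per_min,
        "tickets_deflected_per_hour": deflected_per_hour,
        "tickets_deflected_per_day": deflected_per_hour * 24,
    }
```

Keeping the inputs as parameters makes it easy to rerun the model with your own traffic, decline rate, and deflection estimate.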
How real platforms do it
Error model design is where you see the starkest differences between platforms. The choices reflect the platform's maturity, their support volume, and how much they invest in developer experience.
| Platform | Error shape | Notable feature | Source |
|---|---|---|---|
| Stripe | { error: { type, code, decline_code, message, param, charge, doc_url } } | Three-level hierarchy: type (category) → code (specific error) → decline_code (processor decline reason). Allows both coarse (type == "card_error") and fine (decline_code == "do_not_honor") branching. A doc_url pointing at the documentation for the reported code is included in the body. | Stripe error object docs |
| Twilio | { code, message, more_info, status } | Every error includes a more_info URL that points to the exact error reference page. The Twilio error dictionary has over 3,000 entries, each with cause and resolution. This alone significantly reduces support volume. | Twilio error dictionary |
| Google Cloud APIs | { error: { code (HTTP int), message, status (canonical string), details [] } } | The details array is typed: each element is a proto Any with a known type URL. BadRequest carries field violations; RetryInfo carries a retry_delay; QuotaFailure names the quota that was exceeded. Richer than most, but requires proto tooling to parse fully. | Google Cloud API design guide: errors |
| GitHub REST API | { message, errors [], documentation_url } | For validation errors, the errors array includes resource, field, and code per item. The top-level documentation_url is always present on errors. Simple, consistent, docs-first. | GitHub REST API error handling |
"Design the error model for a payment API." Interviewers at Stripe-calibre companies look for: (1) the three layers — HTTP status, machine code, human message — and why they're separate; (2) the retryability taxonomy with explicit signalling; (3) field-level validation errors in a details array; (4) the 200-with-error anti-pattern and its three failure modes; (5) a request_id in every response for support correlation; and (6) a decision on RFC 9457 vs. custom envelope with a reasoned preference. Most candidates cover only the happy-path → error-code mapping and miss the retryability and support-correlation dimensions entirely.
Error codes are public API — once published, you can never remove one without breaking clients. Teams that don't audit codes regularly end up with hundreds of overlapping codes where card_declined, charge_declined, and payment_failed all exist for the same scenario, created by different engineers at different times. Establish a code registry (a YAML or markdown file in your repo), gate new codes through review, and treat code naming with the same care as a public function name.
Write integration tests that assert the exact type, code, and HTTP status of every known error scenario — not just that the response is "some 4xx." If a refactor changes amount_too_small to charge_amount_invalid, your test catches it as a breaking change before it reaches production. Error code stability is a contract; enforce it with the same rigour as your schema.
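Both practices, the code registry and the exact-match contract test, can be sketched in a few lines. The names below (`PUBLISHED_CODES`, `removed_codes`, `assert_error_contract`) are hypothetical, and the registry is shown as an in-code set for brevity where a real one might be loaded from the YAML or markdown file mentioned above.

```python
# Codes that have shipped to clients; in practice loaded from the registry file.
PUBLISHED_CODES = frozenset({
    "amount_too_small",
    "card_declined",
    "rate_limit_exceeded",
    "validation_failed",
})


def removed_codes(published, current) -> list[str]:
    """Codes that clients may depend on but the current build no longer emits."""
    return sorted(set(published) - set(current))


def assert_error_contract(status: int, body: dict, *, expected_status: int,
                          expected_type: str, expected_code: str) -> None:
    """Assert the exact (status, type, code) triple, not merely "some 4xx"."""
    error = body["error"]
    actual = (status, error["type"], error["code"])
    expected = (expected_status, expected_type, expected_code)
    assert actual == expected, f"error contract changed: {actual} != {expected}"
```

A CI step could fail the build whenever `removed_codes` returns a non-empty list, which turns an accidental rename into a visible breaking change.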
How to debug & inspect it
When an API client reports an error, the fastest path to resolution is the request_id. Everything else — the error code, the status code, the message — narrows down the cause. The request_id is the key that opens the server log.
Trace an error end-to-end with curl
Symptom → cause → fix
| Symptom | Likely cause | Fix |
|---|---|---|
| Client gets 500 with no error body | Unhandled exception; error middleware is missing or not catching it | Add a catch-all error handler that logs the full trace internally and returns a sanitised 500 with only request_id |
| Client SDK retries a card decline 3 times, burning the customer's card limit | Card decline returned 503 instead of 402; SDK sees 503 as retryable | Map card errors to 402 (or a non-retryable 4xx); add "retryable": false in the body |
| Client gets 400 but doesn't know which field is wrong | Validation error returns only a top-level message, no param or details | Collect all validation violations and return a 422 with a details array |
| Error code suddenly changes mid-integration | No code registry; codes renamed during refactor | Maintain a code registry; treat code names as breaking changes; enforce with integration tests |
| Support can't find the request in logs when given an error code but no ID | request_id is missing from the error body | Generate a request_id at the start of every request; include it in both the response header (X-Request-Id) and the error body |
🧠 Quick check
1. A client request passes JSON parsing and HTTP routing, but the amount field is negative when it must be positive. Which HTTP status should the server return?
422 is specifically for semantically invalid requests: the HTTP framing is fine, the JSON parsed correctly, but the content violates the domain rules. 400 is for syntactically malformed requests (bad JSON, invalid headers). 422 gives clients the precise signal they need to show field-level errors.
2. A payment API returns HTTP 200 for every response, including card declines, with a JSON body that has "status": "declined". Which of the following breaks?
Caches store 200 responses — a "declined" body gets cached and served stale. Monitoring dashboards count non-2xx as errors — an all-200 API looks healthy even when every payment fails. SDK retry logic uses the status to decide whether to retry — a 200 decline is never retried. The status code is consumed by infrastructure, not just application code.
3. Which field in an error envelope should a client application branch on to decide how to handle the error?
The code field (e.g. "card_declined") is the machine-readable stable identifier. The message is for humans and can be reworded at any time. The HTTP status is too coarse — 402 covers many distinct card errors. The request_id is for tracing, not branching.
4. A batch endpoint receives 3 items: item 0 succeeds, item 1 fails validation, item 2 succeeds. What is the correct HTTP status for the response?
207 Multi-Status is the correct code for a batch operation where different items have different outcomes. The response body contains a results array, each entry with its own status code and, if applicable, an error object. Returning 200 would hide the failure; returning 422 would hide the success.
5. A client receives a card_declined error with HTTP 402. According to the error taxonomy, what should the client do?
Card declines are business errors — the request was syntactically and technically valid, but the business outcome failed due to the card's state. The correct action is to surface the issue to the user (e.g. "Your card was declined — please update your payment method") and retry only after the user takes corrective action. Retrying the same request immediately will always fail.
✍️ Exercise: audit and redesign an error model
The following API returns this error body for a failed charge. Identify every problem with this design and rewrite the response correctly.
Model answer — four problems:
- HTTP 200 for an error. Caches will store this. Monitoring shows 0% error rate. SDKs won't retry. Fix: use 422 Unprocessable Entity for a validation failure (the amount is below minimum).
- No machine-readable error code. The client has to parse the message string to understand the error. Fix: add "type": "invalid_request_error" and "code": "amount_too_small".
- No param field. The client cannot programmatically know which field to highlight. Fix: add "param": "amount".
- No request_id. Support cannot trace this error. Fix: add "request_id": "req_..." in both the body and the X-Request-Id response header.
Corrected response:
Rubric: Full marks for all four problems identified with correct fixes. Partial marks for any two. Bonus: noting that the human message in the original embeds the exact dollar amount, which is fine in a message but would be wrong in a code.
Key takeaways
- The error envelope has three layers for three consumers: HTTP status for infrastructure, machine code for application branching (stable; never parse the message), human message for developers (unstable).
- Map domain errors to precise HTTP status codes: 422 for semantic validation, 401 vs. 403 for authn vs. authz, 409 for conflicts, 429 for rate limits, 402 for business gates.
- The "200 with an error body" anti-pattern breaks caches, monitoring, and SDK retry logic simultaneously.
- Error taxonomy (client: don't retry; transient: retry with backoff; business: fix then retry) should be explicitly signalled with a retryable field, not just implied by the status code.
- Return all validation errors in a single 422 with a details array. Never make the client fix one error at a time.
- A request_id in every response, in both header and body, is the single most valuable investment in your support and debuggability story.
Sources & further reading
- RFC 9457 — Problem Details for HTTP APIs
- Google AIP-193 — Errors
- Google Cloud API Design Guide — Errors
- Stripe API error object reference
- Twilio error dictionary
- gRPC status codes
- GitHub REST API best practices — error handling
- Lesson 07 — HTTP (status codes, headers)
- Reliability 05 — Retries & backoff
- Debug 02 — Reading error responses