Platform & API Product Engineering · Lesson 03

Designing your API error model

Your error contract is not an afterthought — it is the part of your API that determines how quickly an integration partner can diagnose and recover from a problem at 2 a.m. A badly designed error model makes every bug an escalation; a well-designed one lets clients self-serve.

⏱ 20 min · Difficulty: advanced · Prereq: HTTP basics, REST

By the end you'll be able to

Design a complete error envelope with the fields needed for machine-readable branching, human debugging, and support tracing.
Map every domain error to the right HTTP status code, and explain the three-layer model: transport status, machine code, and human message.
Classify errors as client, transient, or business — and signal retryability in the response so clients can act without guessing.

Why error design matters as much as success design

Most API design effort goes into the happy path: the right resource shape, sensible field names, a clean pagination cursor. The error path gets an afterthought — a string message and a 400. But the error path is where integration problems live. When a payment fails, a webhook bounces, or a validation check rejects a request, the quality of your error response determines whether the developer can fix it in minutes or files a support ticket.

Think of the error response as a patient chart at a hospital. The HTTP status code is the triage category on the door — "urgent" or "stable". The machine-readable error code is the diagnosis — precise and stable so a computer can route it. The human message is the doctor's note — readable but not something software should parse, because next month the wording might change. The request ID is the chart number that lets support pull the full record. Strip any of these layers and the chart becomes useless to half the people who need it.

The error envelope

A well-designed error body wraps a single error object at the top level. The outer key keeps success and error shapes clearly separated — a client can check if ("error" in body) to know which branch to take, rather than peeking at status codes already captured at the transport layer.

type: stable top-level category; clients branch on this (machine-readable).
code: specific, stable sub-code; never changes meaning across versions.
message: prose for humans; may be updated in place; never parse it in code.
param: field that caused the error, for field-level UI highlighting.
request_id: opaque trace ID; correlates to your internal logs.
doc_url: deep link to the error reference page; eliminates the "what does this mean?" support ticket.

// Full error envelope — canonical structure
{
  "error": {
    // MACHINE layer — clients branch on these; they are CONTRACT-STABLE
    "type":       "invalid_request_error",  // top-level category
    "code":       "amount_too_small",       // stable, specific sub-code

    // HUMAN layer — readable prose; NEVER parse this string
    "message":    "Amount must be at least 50 cents (USD).",

    // FIELD layer — for validation errors; which param is wrong?
    "param":      "amount",

    // SUPPORT layer — trace it in your logs; paste it in a ticket
    "request_id": "req_9GqX2mHk7fP3n",
    "doc_url":    "https://docs.example.com/errors/amount_too_small"
  }
}
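
A minimal Python sketch of how a server might assemble this envelope. The `ApiError` class and its `to_body` method are hypothetical names chosen for illustration, not part of any framework; the sketch assumes optional fields are simply omitted when unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiError:
    """Hypothetical server-side representation of one error envelope."""
    type: str                      # machine layer: top-level category
    code: str                      # machine layer: specific sub-code
    message: str                   # human layer: never parsed by clients
    request_id: str                # support layer: trace ID
    param: Optional[str] = None    # field layer: only for validation errors
    doc_url: Optional[str] = None  # support layer: link to the error reference

    def to_body(self) -> dict:
        """Return the JSON-serialisable body, dropping unset optional fields."""
        fields = {
            "type": self.type,
            "code": self.code,
            "message": self.message,
            "param": self.param,
            "request_id": self.request_id,
            "doc_url": self.doc_url,
        }
        return {"error": {k: v for k, v in fields.items() if v is not None}}


body = ApiError(
    type="invalid_request_error",
    code="amount_too_small",
    message="Amount must be at least 50 cents (USD).",
    request_id="req_9GqX2mHk7fP3n",
    param="amount",
).to_body()
```

Building every error through one type like this is one way to keep the field names and the outer `error` key consistent across endpoints.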

The three layers: status, code, message

The error contract has three distinct consumers. HTTP status codes are consumed by infrastructure — proxies, load balancers, CDNs, monitoring systems. Machine-readable codes are consumed by client code — the if code == "card_declined" branch. Human messages are consumed by people — the developer reading the log at 2 a.m. Mixing these layers causes problems: if your client branches on message text, a copy edit to fix a typo breaks your integration.

Fig 1 — Domain errors map to HTTP status through the error class. Client errors (4xx) signal "don't retry until the request is fixed." Transient errors (429, 503) signal "retry after a delay." Business errors signal "fix the underlying condition first."

The three layers explained

Layer	Consumer	Stability contract	Examples
HTTP status	Proxies, CDNs, monitoring, client retry logic	Stable forever — HTTP statuses are standardised by the IETF	422, 429, 503
Machine code (`code` field)	Client application code (`if` branches, switch statements)	Contract-stable — you never remove or rename a code; add new ones as needed	`card_declined`, `rate_limit_exceeded`, `insufficient_funds`
Human message	Developers reading logs or consoles; end users seeing UI copy	Unstable — you may reword it, translate it, or add detail; code must never parse it	"Amount must be at least 50 cents."

⚠️ Never parse error message strings in code

A common integration bug: if (error.message.includes("card was declined")). This is fragile. The moment your team rewords the message to, say, "Your card has been declined", the client branch silently stops matching. Use error.code === "card_declined". The code is stable; the message is not.
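
A small Python sketch of the safer pattern, branching only on the machine layer. The function name `handle_error` and the returned action strings are hypothetical; a real client would map them to its own recovery flows.

```python
def handle_error(body: dict) -> str:
    """Pick a client action from the machine layer only (type and code)."""
    error = body["error"]
    if error["code"] == "card_declined":
        return "ask_for_new_card"
    if error["type"] == "invalid_request_error":
        return "fix_request"
    return "show_generic_failure"


# The message wording differs between these two bodies, but the outcome
# is the same because the branch never looks at the message.
old = {"error": {"type": "card_error", "code": "card_declined",
                 "message": "Card was declined."}}
new = {"error": {"type": "card_error", "code": "card_declined",
                 "message": "Your card has been declined."}}
assert handle_error(old) == handle_error(new) == "ask_for_new_card"
```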

Mapping domain errors to HTTP status codes

HTTP status codes are the transport signal. They must be chosen based on what the error means to the protocol, not just what the error means to your domain. Many teams default to 400 for every client error, which is too coarse. A misconfigured rate limiter that silently returns 400 instead of 429 prevents every well-behaved SDK from triggering its automatic retry logic.

Domain scenario	Correct status	Why not the common wrong choice
Request body fails schema validation (wrong type, missing required field)	422 Unprocessable Entity	Not 400. 400 means "the request was malformed at the HTTP level" (bad JSON, illegal headers). 422 means "the syntax is fine but the semantics are invalid."
Missing or invalid credentials	401 Unauthorized	Not 403. 401 means "we don't know who you are; authenticate." A 401 response carries a `WWW-Authenticate` header that tells the client which authentication scheme to use.
Valid credentials, but insufficient permission	403 Forbidden	Not 401. 403 means "we know exactly who you are and the answer is still no." Re-authenticating won't help.
Client tries to create a resource that already exists (duplicate key)	409 Conflict	Not 400. 409 is specifically "the resource state conflicts with the request." Useful for idempotency-key collisions and unique-constraint violations.
Client has exceeded their rate limit	429 Too Many Requests	Not 503. 429 is a client-side quota issue. Respond with `Retry-After`. 503 means the server is genuinely overloaded — a different signal entirely.
Payment required or subscription expired	402 Payment Required	RFC 9110 reserves 402 for future use, but it is the only status aimed at billing-gate scenarios and payment APIs use it widely. Include a machine code like `subscription_expired` so the client can route to the billing page.
Business logic decline (e.g. card declined, fraud block)	402 (or 200 with a `status: "failed"` in the body)	Some platforms use 200 here because the request was received and processed correctly; the outcome was a decline. Both are defensible — pick one and be consistent. See anti-pattern below.
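
One way to keep this mapping consistent is a single lookup table on the server. The sketch below is illustrative: apart from `validation_failed`, `rate_limit_exceeded`, `subscription_expired`, and `card_declined`, the code names are hypothetical, and falling back to 500 for unmapped codes is an assumed design choice rather than a rule from the table.

```python
# Hypothetical registry: machine code -> HTTP status, following the table above.
STATUS_BY_CODE = {
    "validation_failed": 422,        # semantic validation failure
    "authentication_required": 401,  # missing or invalid credentials
    "permission_denied": 403,        # authenticated but not allowed
    "resource_already_exists": 409,  # duplicate key / state conflict
    "rate_limit_exceeded": 429,      # client quota exhausted
    "subscription_expired": 402,     # billing gate
    "card_declined": 402,            # business decline
}


def status_for(code: str) -> int:
    """Look up the HTTP status for a domain error code.

    Unknown codes fall back to 500 so an unmapped error is loud
    rather than silently reported as a generic 400.
    """
    return STATUS_BY_CODE.get(code, 500)
```

Keeping the table in one place also makes it easy to review whenever a new code is added.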

The "200 with an error body" anti-pattern

A pervasive pattern in older APIs: every response returns HTTP 200, and success vs. failure is communicated only in the body with a status: "error" field. GraphQL uses a variant of this. The motivation is usually "it simplifies the client": you always get a 200, so you check the body every time. This trades one kind of complexity for four worse kinds.

What breaks	Why 200-with-error causes it
HTTP caches	Caches may store 200 responses to cacheable requests (typically GET). A cached "error" body is then served to subsequent clients, so the second requester gets a stale error for a request they never made.
Monitoring & alerting	Error-rate dashboards count non-2xx responses. If errors return 200, your error rate is always 0%. Real problems are invisible in your SLO dashboard.
Client retry logic and SDKs	HTTP-aware SDKs, API gateways, and reverse proxies use the status code to decide whether to retry. Returning 200 for a transient error means the SDK never retries it.
Load balancers and health checks	HAProxy, Nginx, and AWS ALB health checks pass a 200 as "healthy." An endpoint returning 200 for every request — including ones that are fundamentally broken — looks healthy when it isn't.

✅ Use HTTP status codes as the primary signal

Signal every failure with a non-2xx status. The single exception is a batch operation where some items succeed and some fail. Use 207 Multi-Status: the request as a whole was processed, but the response contains per-item outcomes. This is distinct from a top-level request failure. See the "Partial and batch failures" section.

Error taxonomy: client, transient, and business

Beyond individual status codes, there is a higher-level taxonomy that determines the correct action a client should take. Getting this taxonomy right in your documentation — and in your response body — is what enables fully automated recovery in client SDKs.

Fig 2 — The three error classes and their client actions. Signal the class explicitly in the response so clients can implement automated recovery without guessing.

Signalling retryability

Documenting retryability in your developer docs is necessary but not sufficient — well-designed SDKs need to know it at runtime. The cleanest approach is an explicit field in the error body:

{
  "error": {
    "type":      "rate_limit_error",
    "code":      "rate_limit_exceeded",
    "message":   "Too many requests. Please slow down.",
    "retryable": true,         // machine-readable retry signal
    "retry_after": 23          // seconds to wait (also in Retry-After header)
  }
}

For business errors that require user action before retrying, use a dedicated value — not just false — so the SDK can distinguish "don't retry" from "retry after user fixes something":

{
  "error": {
    "type":      "card_error",
    "code":      "card_declined",
    "decline_code": "insufficient_funds",  // Stripe's second-level code
    "message":   "Your card has insufficient funds.",
    "retryable": "after_user_action"      // not a bool; distinguishes the case
  }
}
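
A sketch of how a client SDK might act on these fields. The function `next_action`, its action names, and the exponential fallback delay are assumptions for illustration; only `retryable`, `retry_after`, and the `"after_user_action"` value come from the examples above.

```python
def next_action(error: dict, attempt: int, max_attempts: int = 3) -> tuple:
    """Return (action, delay_seconds) from the explicit retry signal.

    `attempt` is 1-based: pass 1 after the first failed call.
    """
    retryable = error.get("retryable", False)
    if retryable == "after_user_action":
        # Business error: retrying unchanged will fail again.
        return ("prompt_user", None)
    if retryable is True and attempt < max_attempts:
        # Prefer the server's hint; otherwise back off exponentially (1s, 2s, 4s).
        delay = error.get("retry_after", 2 ** (attempt - 1))
        return ("retry", delay)
    return ("give_up", None)


rate_limited = {"code": "rate_limit_exceeded", "retryable": True, "retry_after": 23}
declined = {"code": "card_declined", "retryable": "after_user_action"}
```

With these inputs, `next_action(rate_limited, 1)` asks for a retry after 23 seconds, while `next_action(declined, 1)` routes to the user instead of retrying.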

Field-level validation errors

Validation errors deserve special treatment because a single request can violate multiple constraints at once. A user submitting a form should get all the errors in one round trip, not one at a time. The pattern is to promote the error body to contain a list of per-field problems:

// 422 — multiple validation failures in one response
{
  "error": {
    "type":    "invalid_request_error",
    "code":    "validation_failed",
    "message": "The request contains invalid parameters.",
    "details": [
      {
        "param":   "amount",
        "code":    "amount_too_small",
        "message": "Must be at least 50 (cents)."
      },
      {
        "param":   "currency",
        "code":    "unsupported_currency",
        "message": "'XYZ' is not a supported currency code."
      }
    ],
    "request_id": "req_9GqX2mHk7fP3n"
  }
}

The top-level code is validation_failed (machine-readable, stable). The per-field codes in details are also stable. The message at each level is human-readable prose. A web client can walk details and highlight each problematic field; a CLI tool can print them all at once.
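
A minimal sketch of a validator that collects every violation before responding. The supported-currency set, the `amount_invalid` code, and the function names are hypothetical; the envelope shape follows the example above.

```python
SUPPORTED_CURRENCIES = {"usd", "eur", "gbp"}  # assumed for illustration
MIN_AMOUNT = 50  # smallest allowed amount, in cents


def validate_charge(payload: dict) -> list:
    """Collect every violation instead of stopping at the first one."""
    details = []
    amount = payload.get("amount")
    if not isinstance(amount, int) or isinstance(amount, bool):
        details.append({"param": "amount", "code": "amount_invalid",
                        "message": "Must be an integer number of cents."})
    elif amount < MIN_AMOUNT:
        details.append({"param": "amount", "code": "amount_too_small",
                        "message": f"Must be at least {MIN_AMOUNT} (cents)."})
    currency = payload.get("currency")
    if not isinstance(currency, str) or currency not in SUPPORTED_CURRENCIES:
        details.append({"param": "currency", "code": "unsupported_currency",
                        "message": f"{currency!r} is not a supported currency code."})
    return details


def validation_response(payload: dict, request_id: str):
    """Return (status, body) for an invalid payload, or None when it is valid."""
    details = validate_charge(payload)
    if not details:
        return None
    return 422, {"error": {
        "type": "invalid_request_error",
        "code": "validation_failed",
        "message": "The request contains invalid parameters.",
        "details": details,
        "request_id": request_id,
    }}
```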

Partial and batch failures

A batch endpoint accepts many items in one request. Some may succeed; others may fail. Returning 400 or 422 implies the whole batch failed. Returning 200 implies all items succeeded. Neither is right for partial success. The correct status is 207 Multi-Status, borrowed from WebDAV and adopted widely for batch APIs. Each item in the response carries its own status code and, if it failed, its own error object:

// POST /v1/charges/batch — 207 Multi-Status
{
  "results": [
    {
      "index":  0,
      "status": 200,
      "id":     "ch_abc"
    },
    {
      "index":  1,
      "status": 422,
      "error":  {
        "type":    "invalid_request_error",
        "code":    "amount_too_small",
        "message": "Amount must be at least 50 cents.",
        "param":   "amount"
      }
    },
    {
      "index":  2,
      "status": 200,
      "id":     "ch_xyz"
    }
  ]
}
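
On the client side, a 207 body has to be walked item by item. The sketch below assumes the `results` shape shown above; `split_batch` is a hypothetical helper name.

```python
def split_batch(body: dict):
    """Separate a 207 body into succeeded IDs and failed items keyed by index."""
    succeeded, failed = [], {}
    for item in body["results"]:
        if 200 <= item["status"] < 300:
            succeeded.append(item["id"])
        else:
            # Keep the machine code so the caller can branch per item.
            failed[item["index"]] = item["error"]["code"]
    return succeeded, failed


batch_body = {"results": [
    {"index": 0, "status": 200, "id": "ch_abc"},
    {"index": 1, "status": 422, "error": {
        "type": "invalid_request_error", "code": "amount_too_small",
        "message": "Amount must be at least 50 cents.", "param": "amount"}},
    {"index": 2, "status": 200, "id": "ch_xyz"},
]}
succeeded, failed = split_batch(batch_body)
```

Keying failures by `index` lets the caller resubmit only the items that failed once they are fixed.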

The request_id: linking errors to logs

The request_id is a unique identifier generated by your API server at the start of processing — before any business logic runs. It should appear in every response, successful or not. It must appear in every error. Its purpose is to allow a developer or support agent to paste one string into a logging query and retrieve the complete, correlated trace for that request.

Generate it as a prefixed opaque token — req_ followed by a URL-safe base62 string. Prefix the ID so it is immediately recognisable in a paste. Log it at the start and end of the request lifecycle, including every downstream call made during that request. Set it as the X-Request-Id response header as well, so clients who can't read the body (e.g. a load balancer doing health checks) can still correlate.
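
A minimal sketch of generating such an ID with Python's standard library. The 16-character length and the function name are assumptions; the `req_` prefix and base62 alphabet follow the description above.

```python
import secrets
import string

BASE62 = string.digits + string.ascii_letters  # 0-9, a-z, A-Z


def new_request_id(length: int = 16) -> str:
    """Return an opaque, prefixed ID: 'req_' followed by base62 characters."""
    token = "".join(secrets.choice(BASE62) for _ in range(length))
    return f"req_{token}"


request_id = new_request_id()
```

The same value would then be written to the log context, the `X-Request-Id` header, and the `request_id` field of any error body.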

Standards: RFC 9457, Google, and gRPC

The API industry has not converged on a single error format, but three influential standards are worth knowing. None of them is inherently better — your choice depends on your ecosystem and how much you want to align with existing tooling.

Standard	Key shape	Best for
RFC 9457 Problem+JSON	`{ type (URI), title, status, detail, instance }`. Content-Type: `application/problem+json`. `type` is a URI that is the stable machine-readable identifier; `detail` is human text.	RESTful HTTP APIs where interoperability and IETF alignment matter. Standard tooling (OpenAPI, API gateways) increasingly understands `application/problem+json`.
Google AIP-193 / Status+Details	`{ error: { code (HTTP int), message, status (canonical string), details (any[]) } }`. Details is a list of typed objects — e.g. a `BadRequest.FieldViolation` proto for field errors, a `RetryInfo` for transient errors.	Protobuf / gRPC APIs that need rich, typed detail objects. Google Cloud APIs follow this model universally.
gRPC Status Codes	17 canonical codes (`OK` plus 16 error codes): `OK`, `INVALID_ARGUMENT`, `NOT_FOUND`, `ALREADY_EXISTS`, `RESOURCE_EXHAUSTED` (rate limit), `UNAVAILABLE` (transient), `FAILED_PRECONDITION`, etc.	Any gRPC service. The canonical code maps directly to the HTTP status via a standard mapping table.
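
To make the shapes concrete, here is a hedged sketch that converts the custom envelope used in this lesson into an RFC 9457 problem object. The mapping choices are assumptions, not part of the RFC: `doc_url` is reused as the `type` URI, `title` is derived from the code, and `code` and `request_id` are carried as extension members (which RFC 9457 permits).

```python
def to_problem_json(envelope: dict, status: int) -> dict:
    """Map the custom envelope onto RFC 9457 members plus extension members."""
    error = envelope["error"]
    problem = {
        # URI identifying the problem type; "about:blank" is the RFC default.
        "type": error.get("doc_url", "about:blank"),
        "title": error["code"].replace("_", " ").capitalize(),
        "status": status,
        "detail": error["message"],
    }
    # Extension members: keep the machine code and the trace ID.
    problem["code"] = error["code"]
    if "request_id" in error:
        problem["request_id"] = error["request_id"]
    return problem


problem = to_problem_json({"error": {
    "type": "invalid_request_error",
    "code": "amount_too_small",
    "message": "Amount must be at least 50 cents (USD).",
    "request_id": "req_9GqX2mHk7fP3n",
    "doc_url": "https://docs.example.com/errors/amount_too_small",
}}, 422)
```

A response built this way would be sent with Content-Type `application/problem+json`.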

Under the hood: how it actually works end-to-end

Trace what happens when a payment API receives a charge request, and where each kind of failure surfaces along the way. The error model is not a separate layer: it runs through the entire stack from input parsing to the bytes on the wire.

Request arrives at the API gateway. The gateway first generates the request_id and attaches it to every downstream log entry for this request. It then performs authentication (checking the API key against the database). If authentication fails, the gateway returns 401 immediately, before the request even reaches the application server.
Request deserialization. The framework parses the JSON body. If the JSON is malformed (syntax error), the framework returns 400 Bad Request before your code runs. The error body at this level typically says "failed to parse body" — it does not have a domain error code because no domain logic ran.
Schema validation middleware. Your validation layer checks the parsed body against the request schema. It collects all validation violations (not just the first). If any violations exist, it constructs the details array and returns 422. The request_id is included in the body, having been propagated through a thread-local or request context.
Domain logic. Validation passed; the business logic runs. A card charge is attempted. The payment processor declines: insufficient_funds. The application maps this to the error envelope: type=card_error, code=card_declined, decline_code=insufficient_funds. HTTP status: 402.
Error middleware assembles the response. A top-level error handler catches any unhandled exception (e.g. a database connection failure). It logs the full stack trace internally and returns a sanitised 500 with only the request_id in the body — no internal details. This is the firewall between your internal implementation and what the client sees.
Response serialisation. The error object is serialised to JSON. Content-Type is set to application/json (or application/problem+json if using RFC 9457). Status line, headers, and body are written to the socket.

POST /v1/charges HTTP/1.1
Authorization: Bearer sk_live_abc123
Content-Type: application/json

{"amount": 2000, "currency": "usd", "source": "tok_visa"}

------

HTTP/1.1 402 Payment Required
Content-Type: application/json
X-Request-Id: req_9GqX2mHk7fP3n

{
  "error": {
    "type": "card_error",
    "code": "card_declined",
    "decline_code": "insufficient_funds",
    "message": "Your card has insufficient funds.",
    "request_id": "req_9GqX2mHk7fP3n",
    "doc_url": "https://docs.example.com/errors/card_declined"
  }
}
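
The catch-all handler from the walk-through can be sketched framework-free as a wrapper around the request handler. Everything named here is hypothetical (`run_with_error_firewall`, the `api_error` type, the `internal_error` code); the sketch also assumes a generic type, code, and message are acceptable alongside the request_id, since they reveal nothing internal.

```python
import logging

logger = logging.getLogger("api")


def run_with_error_firewall(handler, request: dict, request_id: str):
    """Call handler(request); turn any unhandled exception into a sanitised 500."""
    try:
        return handler(request)
    except Exception:
        # The full stack trace stays in the internal log, keyed by request_id.
        logger.exception("unhandled error request_id=%s", request_id)
        return 500, {"error": {
            "type": "api_error",
            "code": "internal_error",
            "message": "An unexpected error occurred.",
            "request_id": request_id,
        }}


def failing_handler(request: dict):
    """Simulates a database failure whose text must never reach the client."""
    raise RuntimeError("connection to db-primary.internal refused")


status, body = run_with_error_firewall(failing_handler, {}, "req_9GqX2mHk7fP3n")
```

The exception text is logged but never copied into the response, which is the point of the firewall.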

Design trade-offs

Decision	Option A	Option B	Recommendation
Code format	Numeric codes (e.g. `1042`)	String codes (e.g. `"card_declined"`)	String. Numeric codes require a lookup table; strings are self-describing. Stripe's public API, for example, uses string codes such as `card_declined`.
Code hierarchy	Flat — a single code field per error	Nested — a `type` (category) + a `code` (specific)	Nested. Clients who need coarse branching match on `type`; clients who need fine-grained handling match on `code`. One without the other forces all clients to the same level of granularity.
Error list	Single error per response	Multiple errors per response (details array)	Both simultaneously: a top-level code for the overall failure, plus a `details` list for field-level violations. Single error for non-validation failures; list for validation failures.
Format standard	RFC 9457 problem+json	Custom envelope (Stripe-style)	Custom envelope for product-focused APIs where DX matters (richer, more navigable). RFC 9457 for infrastructure or B2B APIs where IETF interoperability and existing tooling support is valued.

By the numbers

Consider a payment API processing 5,000 charges/min (modeled). Historical data shows a 2% card decline rate and a 0.5% validation error rate. The error model choices have measurable operational impact.

Scenario	Calculation (modeled)	Impact
Card declines per minute	5,000 × 0.02 = 100 decline events/min	Up to 100 potential support tickets/min without a self-service `doc_url`, if every decline were escalated; each ticket costs ~5 min of support time
Validation errors per minute	5,000 × 0.005 = 25 validation errors/min	Without field-level codes, each requires a developer to read docs to find which field is wrong — multiplied by all API consumers
Retries amplified by wrong status code	If card declines return 503 instead of 402: each decline triggers an SDK retry × 3 retries = 300 extra requests/min (6% traffic increase) for requests that will never succeed	Unnecessary load; hides real errors in retry noise
Support ticket deflection via `doc_url`	Industry benchmark (modeled): ~40% of error-related tickets are resolved when a contextual docs link is present in the error. Upper bound, if every decline became a ticket: 100 declines/min × 60 min × 0.40 = 2,400 fewer tickets/hour	Direct reduction in support cost
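
The arithmetic in the table can be reproduced in a few lines of Python. All inputs are the modeled figures from the table; the variable names are illustrative, and the deflection figure is an upper bound that assumes every decline would otherwise become a ticket. Note that 100 × 60 × 0.40 covers one hour of traffic.

```python
charges_per_min = 5_000
decline_rate = 0.02
validation_error_rate = 0.005
retries_per_failure = 3   # assumed SDK retry count
deflection_rate = 0.40    # modeled share of tickets avoided by doc_url

declines_per_min = charges_per_min * decline_rate                     # 100 per minute
validation_errors_per_min = charges_per_min * validation_error_rate   # 25 per minute

# Wrong status code (503 instead of 402) makes every decline retry.
extra_requests_per_min = declines_per_min * retries_per_failure       # 300 per minute
traffic_increase = extra_requests_per_min / charges_per_min           # 0.06, i.e. 6%

# Ticket deflection over one hour, then scaled to a 24-hour day.
deflected_per_hour = declines_per_min * 60 * deflection_rate          # 2,400 per hour
deflected_per_day = deflected_per_hour * 24                           # 57,600 per day
```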

How real platforms do it

Error model design is where you see the starkest differences between platforms. The choices reflect the platform's maturity, their support volume, and how much they invest in developer experience.

Platform	Error shape	Notable feature	Source
Stripe	`{ error: { type, code, decline_code, message, param, charge, doc_url } }`	Three-level hierarchy: `type` (category) → `code` (specific error) → `decline_code` (processor decline reason). Allows both coarse (`type == "card_error"`) and fine (`decline_code == "do_not_honor"`) branching. The error object also carries a `doc_url` linking to the reference page for the code.	Stripe error object docs
Twilio	`{ code, message, more_info, status }`	Every error includes a `more_info` URL that points to the exact error reference page — the Twilio error dictionary has over 3,000 entries, each with cause and resolution. This alone significantly reduces support volume.	Twilio error dictionary
Google Cloud APIs	`{ error: { code (HTTP int), message, status (canonical string), details [] } }`	The `details` array is typed: each element is a proto Any with a known type URL. `BadRequest` carries field violations; `RetryInfo` carries a `retry_delay`; `QuotaFailure` names the quota that was exceeded. Richer than most, but requires proto tooling to parse fully.	Google Cloud API design guide — errors
GitHub REST API	`{ message, errors [], documentation_url }`	For validation errors, the `errors` array includes `resource`, `field`, and `code` per item. The top-level `documentation_url` is always present on errors. Simple, consistent, docs-first.	GitHub REST API error handling

🎯 Interview angle

"Design the error model for a payment API." Interviewers at Stripe-calibre companies look for: (1) the three layers (HTTP status, machine code, human message) and why they're separate; (2) the retryability taxonomy with explicit signalling; (3) field-level validation errors in a details array; (4) the 200-with-error anti-pattern and its failure modes (caches, monitoring, retry logic, health checks); (5) a request_id in every response for support correlation; and (6) a decision on RFC 9457 vs. custom envelope with a reasoned preference. Most candidates cover only the happy-path → error-code mapping and miss the retryability and support-correlation dimensions entirely.

⚠️ The "code creep" trap

Error codes are public API — once published, you can never remove one without breaking clients. Teams that don't audit codes regularly end up with hundreds of overlapping codes where card_declined, charge_declined, and payment_failed all exist for the same scenario, created by different engineers at different times. Establish a code registry (a YAML or markdown file in your repo), gate new codes through review, and treat code naming with the same care as a public function name.

✅ Test your error contract like a public API

Write integration tests that assert the exact type, code, and HTTP status of every known error scenario — not just that the response is "some 4xx." If a refactor changes amount_too_small to charge_amount_invalid, your test catches it as a breaking change before it reaches production. Error code stability is a contract; enforce it with the same rigour as your schema.
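
A sketch of such a contract test in plain Python (it would also run under pytest). `create_charge` is a stand-in for a call to the real endpoint, included only so the example is self-contained.

```python
def create_charge(payload: dict):
    """Stand-in for the real endpoint; returns (status, body)."""
    if payload.get("amount", 0) < 50:
        return 422, {"error": {
            "type": "invalid_request_error",
            "code": "amount_too_small",
            "message": "Amount must be at least 50 cents (USD).",
            "param": "amount",
        }}
    return 200, {"id": "ch_abc"}


def test_amount_too_small_contract():
    status, body = create_charge({"amount": 20, "currency": "usd"})
    # Pin the machine layer exactly; deliberately no assertion on message text.
    assert status == 422
    assert body["error"]["type"] == "invalid_request_error"
    assert body["error"]["code"] == "amount_too_small"
    assert body["error"]["param"] == "amount"


test_amount_too_small_contract()
```

Leaving the message out of the assertions keeps copy edits free while any rename of `type` or `code` fails the build.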

How to debug & inspect it

When an API client reports an error, the fastest path to resolution is the request_id. Everything else — the error code, the status code, the message — narrows down the cause. The request_id is the key that opens the server log.

Trace an error end-to-end with curl

# Step 1: reproduce the request and note the X-Request-Id
$ curl -sv -X POST https://api.example.com/v1/charges \
    -H "Authorization: Bearer sk_test_abc" \
    -H "Content-Type: application/json" \
    -d '{"amount": 2000, "currency": "usd", "source": "tok_visa"}'
< HTTP/1.1 402 Payment Required
< X-Request-Id: req_9GqX2mHk7fP3n
< Content-Type: application/json

{
  "error": {
    "type": "card_error",
    "code": "card_declined",
    "decline_code": "insufficient_funds",
    "message": "Your card has insufficient funds.",
    "request_id": "req_9GqX2mHk7fP3n",
    "doc_url": "https://docs.example.com/errors/card_declined"
  }
}

# Step 2: look up the request in your logging system
$ grep "req_9GqX2mHk7fP3n" /var/log/api.log
2026-06-20T14:07:33Z req_9GqX2mHk7fP3n POST /v1/charges user=usr_42 status=402 code=card_declined duration=234ms

Symptom → cause → fix

Symptom	Likely cause	Fix
Client gets 500 with no error body	Unhandled exception; error middleware is missing or not catching it	Add a catch-all error handler that logs the full trace internally and returns a sanitised 500 with only `request_id`
Client SDK retries a card decline 3 times, burning the customer's card limit	Card decline returned 503 instead of 402; SDK sees 503 as retryable	Map card errors to 402 (or a non-retryable 4xx); add `"retryable": "after_user_action"` in the body so the SDK does not retry automatically
Client gets 400 but doesn't know which field is wrong	Validation error returns only a top-level message, no `param` or `details`	Collect all validation violations and return a 422 with a `details` array
Error code suddenly changes mid-integration	No code registry; codes renamed during refactor	Maintain a code registry; treat code names as breaking changes; enforce with integration tests
Support can't find the request in logs when given an error code but no ID	`request_id` is missing from the error body	Generate a `request_id` at the start of every request; include it in both the response header (`X-Request-Id`) and the error body

🧠 Quick check

1. A client request passes JSON parsing and HTTP routing, but the amount field is negative when it must be positive. Which HTTP status should the server return?

422 is specifically for semantically invalid requests: the HTTP framing is fine, the JSON parsed correctly, but the content violates the domain rules. 400 is for syntactically malformed requests (bad JSON, invalid headers). 422 gives clients the precise signal they need to show field-level errors.

2. A payment API returns HTTP 200 for every response, including card declines, with a JSON body that has "status": "declined". Which of the following breaks?

Caches store 200 responses — a "declined" body gets cached and served stale. Monitoring dashboards count non-2xx as errors — an all-200 API looks healthy even when every payment fails. SDK retry logic uses the status to decide whether to retry — a 200 decline is never retried. The status code is consumed by infrastructure, not just application code.

3. Which field in an error envelope should a client application branch on to decide how to handle the error?

The code field (e.g. "card_declined") is the machine-readable stable identifier. The message is for humans and can be reworded at any time. The HTTP status is too coarse — 402 covers many distinct card errors. The request_id is for tracing, not branching.

4. A batch endpoint receives 3 items: item 0 succeeds, item 1 fails validation, item 2 succeeds. What is the correct HTTP status for the response?

207 Multi-Status is the correct code for a batch operation where different items have different outcomes. The response body contains a results array, each entry with its own status code and, if applicable, an error object. Returning 200 would hide the failure; returning 422 would hide the success.

5. A client receives a card_declined error with HTTP 402. According to the error taxonomy, what should the client do?

Card declines are business errors — the request was syntactically and technically valid, but the business outcome failed due to the card's state. The correct action is to surface the issue to the user (e.g. "Your card was declined — please update your payment method") and retry only after the user takes corrective action. Retrying the same request immediately will always fail.

✍️ Exercise: audit and redesign an error model

The following API returns this error body for a failed charge. Identify every problem with this design and rewrite the response correctly.

HTTP/1.1 200 OK
Content-Type: application/json

{"status": "error", "msg": "Your payment amount of $0.20 is below the minimum of $0.50 required for USD charges"}

Model answer — four problems:

HTTP 200 for an error. Caches may store it, monitoring shows a 0% error rate, and HTTP clients and SDKs treat the call as a success. Fix: use 422 Unprocessable Entity for a validation failure (the amount is below minimum).
No machine-readable error code. The client has to parse the message string to understand the error. Fix: add "type": "invalid_request_error" and "code": "amount_too_small".
No param field. The client cannot programmatically know which field to highlight. Fix: add "param": "amount".
No request_id. Support cannot trace this error. Fix: add "request_id": "req_..." in both the body and the X-Request-Id response header.

Corrected response:

HTTP/1.1 422 Unprocessable Entity
X-Request-Id: req_7XkPqR2mL
Content-Type: application/json

{
  "error": {
    "type": "invalid_request_error",
    "code": "amount_too_small",
    "message": "Amount must be at least 50 cents (USD).",
    "param": "amount",
    "request_id": "req_7XkPqR2mL"
  }
}

Rubric: Full marks for all four problems identified with correct fixes. Partial marks for any two. Bonus: noting that the human message in the original embeds the exact dollar amount, which is fine in a message but would be wrong in a code.

Key takeaways

The error envelope has three layers for three consumers: HTTP status for infrastructure, machine code for application branching (stable; never parse the message), human message for developers (unstable).
Map domain errors to precise HTTP status codes: 422 for semantic validation, 401 vs. 403 for authn vs. authz, 409 for conflicts, 429 for rate limits, 402 for business gates.
The "200 with an error body" anti-pattern breaks caches, monitoring, SDK retry logic, and health checks simultaneously.
Error taxonomy — client (don't retry), transient (retry with backoff), business (fix then retry) — should be explicitly signalled with a retryable field, not just implied by the status code.
Return all validation errors in a single 422 with a details array. Never make the client fix one error at a time.
A request_id in every response — header and body — is the single most valuable investment in your support and debuggability story.