API Design

Reliability & Scale · Lesson 11

Evolving APIs without breaking clients

Every live API is a promise made to unknown strangers. Keeping that promise while still shipping new features is one of the hardest acts of software craftsmanship — because the moment you break it, every caller breaks silently, at a time you didn't choose.

⏱ 14 min Difficulty: advanced Prereq: Versioning (rel-01)

The tyranny of distributed callers

When you change a function inside one codebase, you find every call site with a compiler or a grep and fix them all in one commit. When you change a public API endpoint, your callers are out in the wild: mobile apps that haven't updated, third-party integrations you've never heard of, scripts running in a bank's data center. You cannot find them all, and you cannot fix them. The only lever you have is backward compatibility: ensuring that requests from old clients still work and that the responses they receive still match what they expect.

Think of the API like a train track gauge. The trains (clients) were built to fit the existing gauge. Widening the track is fine — old trains still run. Narrowing it, or moving the rails, derails everything already on the line.

The golden rule: additive changes are safe

An additive change adds something new without removing or modifying what already exists. It is the safest kind of evolution. A client that knows nothing about the new thing will simply ignore it, and that's the point. Concretely, adding an optional response field, a new endpoint, or a new optional request parameter is safe to ship without a version bump, provided clients tolerate data they do not recognise.

What actually breaks clients

A breaking change is anything that causes a previously working client to fail, misbehave, or need code changes. The canonical list:

Change | Why it breaks
Renaming a field | Old clients read the old name; they now get undefined.
Removing a field or endpoint | Clients that depend on it 404 or receive incomplete data.
Changing a field's type | "42" → 42 breaks clients and typed deserializers expecting a string.
Changing semantics without changing shape | A field named count that used to mean total results and now means results per page: same shape, silent data corruption.
Tightening validation | A previously accepted value now returns 400; client workflows fail.
Changing auth or error codes | A client that pattern-matches on 403 vs 404 will misclassify errors.
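
The structural rows of this list (renames, removals, type changes) can be checked mechanically. Below is a minimal sketch, assuming a hypothetical schema representation in which each response is described as a mapping from field name to JSON type name. It is an illustration, not a replacement for a real schema-diff tool.

```python
from typing import Dict, List

# Hypothetical schema shape for illustration: field name -> JSON type name.
Schema = Dict[str, str]


def classify_change(old: Schema, new: Schema) -> List[str]:
    """Return descriptions of breaking changes; an empty list means additive-only."""
    problems = []
    for field, old_type in old.items():
        if field not in new:
            # Covers both removals and renames: the old name is gone either way.
            problems.append(f"removed or renamed: {field}")
        elif new[field] != old_type:
            problems.append(f"type changed: {field} ({old_type} -> {new[field]})")
    # Fields present only in `new` are additive and deliberately not reported.
    return problems


old = {"id": "string", "fullName": "string", "email": "string"}
additive = {**old, "avatarUrl": "string"}
renamed = {"id": "string", "displayName": "string", "email": "string"}

print(classify_change(old, additive))  # []
print(classify_change(old, renamed))   # ['removed or renamed: fullName']
```

A check like this cannot see semantic changes, tightened validation, or changed error codes. Those rows still need human review.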

The tolerant reader pattern

Jon Postel's robustness principle — "be conservative in what you send, be liberal in what you accept" — has a direct API design corollary called the tolerant reader. A client that applies this pattern:

  1. Reads only the fields it needs; ignores everything else.
  2. Treats unknown enum values as a known "unknown" sentinel rather than throwing.
  3. Does not hard-code the exact set of keys in a JSON object.

If you build your clients this way, the producer gains freedom: adding fields to a response is truly free, because consumers won't blow up when they encounter something new. The danger is in the other direction — if consumers are fragile (strict schema validation that rejects unknown fields, for instance), every additive change by the producer can still break them.
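
The three habits above can be sketched with the standard library alone. The status field and its values are hypothetical, added only to show the enum sentinel; the other field names follow the user profile example used in this lesson.

```python
import json
from dataclasses import dataclass
from enum import Enum


class AccountStatus(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"
    UNKNOWN = "unknown"  # sentinel for values this client version has never seen


@dataclass
class User:
    id: str
    full_name: str
    status: AccountStatus


def parse_status(raw: object) -> AccountStatus:
    """Map unrecognised (or missing) values to the sentinel instead of raising."""
    try:
        return AccountStatus(raw)
    except ValueError:
        return AccountStatus.UNKNOWN


def parse_user(body: str) -> User:
    data = json.loads(body)
    # Pull out only the keys this client needs; any other key is never touched.
    return User(
        id=data["id"],
        full_name=data["fullName"],
        status=parse_status(data.get("status")),
    )


user = parse_user(
    '{"id": "usr_99", "fullName": "Laila Ahmadi", '
    '"status": "archived", "avatarUrl": "https://cdn.example.com/avatars/99.png"}'
)
print(user.status.name)  # UNKNOWN
```

The parser never enumerates the keys of the object, so a new field such as avatarUrl passes through unnoticed, and a new status value degrades to UNKNOWN rather than crashing.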

✅ Do this, not that

Do configure your JSON deserializer to ignore unknown properties (e.g. @JsonIgnoreProperties(ignoreUnknown = true) in Jackson, or extra = "ignore" in Pydantic). Don't use strict deserialization modes in production API clients — you will break every time the server adds a new field, even harmless ones.

Backward vs forward compatibility

Backward compatibility means new server code can handle old-format requests — clients that haven't updated still work. Forward compatibility means old server code can handle new-format requests — useful when you need to roll out a schema change to producers before consumers catch up. Both matter in practice: backward compatibility protects your existing consumers; forward compatibility protects rolling deployments where different server versions run simultaneously.
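
As a rough illustration of both directions on the request side, here is a sketch of a hypothetical profile-update handler helper. It assumes requests may carry either the old fullName key or a newer displayName key; the function name and the error message are made up for this example.

```python
from typing import Any, Dict


def read_display_name(payload: Dict[str, Any]) -> str:
    """Accept both the old and the new request shape; ignore everything else."""
    if "displayName" in payload:   # new-format clients
        return payload["displayName"]
    if "fullName" in payload:      # old-format clients (backward compatibility)
        return payload["fullName"]
    raise ValueError("payload needs 'displayName' or 'fullName'")


# An old client, a new client, and a newer client sending a key this build has never seen.
print(read_display_name({"fullName": "Laila Ahmadi"}))
print(read_display_name({"displayName": "Laila Ahmadi"}))
print(read_display_name({"displayName": "Laila Ahmadi", "pronouns": "she/her"}))
```

Accepting the old key is the backward-compatible half. Ignoring keys the handler does not recognise is the forward-compatible half: an older build running during a rolling deployment keeps working when newer clients send extra data.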

Expand-and-contract: the safe rename

Suppose you want to rename fullName to displayName in a response — a seemingly trivial change. Done naively it is a hard breaking change. The expand-and-contract pattern (also called parallel-change migration) turns one big breaking swap into three small safe steps:

Figure: expand-and-contract timeline.
Phase 1, Expand (Day 0): add the new field alongside the old one. Both are present in the response; safe to deploy now.
Phase 2, Migrate (Day N, consumers updated): update all consumers to read the new field. The old field is still present, so no consumer breaks.
Phase 3, Contract (Day M, field removed): remove the old field only after all consumers have migrated.
Expand-and-contract turns a single breaking rename into three individually-safe deployments. The critical guarantee: old and new fields coexist throughout phase 2.

Worked example: safe vs breaking change

Consider a user profile endpoint. Below are two scenarios — adding a field (safe) and renaming one (breaking without expand-and-contract).

# --- SCENARIO A: Safe additive change ---
# Old response
{
  "id": "usr_99",
  "fullName": "Laila Ahmadi",
  "email": "laila@example.com"
}

# New response — adds optional "avatarUrl"; old clients ignore it
{
  "id": "usr_99",
  "fullName": "Laila Ahmadi",
  "email": "laila@example.com",
  "avatarUrl": "https://cdn.example.com/avatars/99.png"  // NEW — additive
}
# ✓ Safe: tolerant readers ignore the new field.


# --- SCENARIO B: Breaking rename (BAD — do NOT do this) ---
# Renaming "fullName" → "displayName" in one deploy
{
  "id": "usr_99",
  "displayName": "Laila Ahmadi",   // RENAMED — old clients read undefined
  "email": "laila@example.com"
}
# ✗ Breaking: any client using response.fullName now gets undefined.


# --- SCENARIO C: Expand-and-contract (GOOD) ---
# Phase 1: Expand — emit BOTH fields
{
  "id": "usr_99",
  "fullName": "Laila Ahmadi",     // OLD — still present
  "displayName": "Laila Ahmadi",  // NEW — added; consumers start reading this
  "email": "laila@example.com"
}
# Phase 2: Migrate all consumers to read "displayName".
# Phase 3: Contract — remove "fullName" only once all consumers migrated.
# ✓ No consumer ever encounters a missing field.

Deprecation with the Sunset header

When you need to retire something — an endpoint, a field, an entire API version — the worst thing you can do is remove it silently. A Sunset header (standardised in RFC 8594) tells API consumers exactly when a resource will stop working, so their monitoring systems can surface the warning automatically.

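# Response from a deprecated endpoint
# Sunset is an HTTP-date (RFC 8594). Deprecation is a structured date in
# Unix seconds (RFC 9745): @1751328000 is 2025-07-01T00:00:00Z.
HTTP/1.1 200 OK
Sunset: Wed, 31 Dec 2025 23:59:59 GMT
Deprecation: @1751328000
Link: <https://docs.example.com/migrate-v2>; rel="deprecation"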
Content-Type: application/json

{
  // ... response body unchanged during deprecation period
}

Good deprecation also means: a changelog announcement with migration docs; enough runway for consumers (typically 6–12 months for public APIs); a migration guide linked from the Link header; and ideally a developer-portal banner. The header alone is not enough — but it's the machine-readable anchor that tooling and alert systems can act on.
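
On the consumer side, acting on the header can be as small as the sketch below. It assumes response headers are available as a plain mapping keyed by the exact name "Sunset" (real HTTP clients usually offer case-insensitive lookup), and the 90-day warning threshold is an arbitrary choice for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def days_until_sunset(headers: Mapping[str, str], now: datetime) -> Optional[int]:
    """Return whole days left before the Sunset date, or None if no header is set."""
    raw = headers.get("Sunset")
    if raw is None:
        return None
    sunset = parsedate_to_datetime(raw)  # parses the HTTP-date format used by Sunset
    return (sunset - now).days


headers = {"Sunset": "Wed, 31 Dec 2025 23:59:59 GMT"}
now = datetime(2025, 12, 1, tzinfo=timezone.utc)
remaining = days_until_sunset(headers, now)
if remaining is not None and remaining < 90:
    print(f"warning: endpoint sunsets in {remaining} days")
```

Running a check like this in a scheduled job or an HTTP client hook is one way for a consumer to surface the warning well before the cutoff.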

Feature flags for safe rollout

Feature flags let you deploy new API behaviour to a small slice of consumers before everyone sees it. A common pattern for API evolution:

  1. Deploy new behaviour behind a flag; it's off for all callers.
  2. Enable it for internal consumers first — your own frontend, your own mobile app.
  3. Enable for a beta cohort (opt-in partners, early adopters).
  4. Ramp to 100% and remove the old code path only after monitoring shows no regressions.

For breaking changes you cannot avoid (a full major version), feature flags let you keep the new behaviour dark until all consumers have had time to migrate, rather than forcing a flag-day cutover.
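
The ramp in steps 1 to 4 is often implemented with stable hashing, so that a given caller stays on the same side of the flag as the percentage grows. This is a sketch of that idea only; the flag name and client IDs are hypothetical, and a production system would normally use a flag service with allow-lists for the internal and beta cohorts.

```python
import hashlib


def in_rollout(client_id: str, flag: str, percent: int) -> bool:
    """Deterministically place a client in one of 100 buckets for a given flag."""
    digest = hashlib.sha256(f"{flag}:{client_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


# Hypothetical flag and client IDs: the same client always gets the same answer,
# and raising the percentage only ever adds clients, never removes them.
clients = [f"client-{n}" for n in range(1000)]
at_10 = {c for c in clients if in_rollout(c, "display-name-field", 10)}
at_50 = {c for c in clients if in_rollout(c, "display-name-field", 50)}
print(at_10 <= at_50)  # True
```

Hashing the flag name together with the client ID keeps cohorts independent across flags, so the same early adopters are not always first for every change.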

🎯 Interview angle

When asked "how would you evolve a live API that has thousands of external consumers?", the answer they want has three layers: (1) classify the change — is it additive or breaking? (2) if breaking, use expand-and-contract to do it in phases; (3) for removals, use a Sunset header with ample runway and a migration guide. Bonus: mention that tolerant readers on the client side reduce the blast radius of any change the server makes.

⚠️ Common trap

Tightening validation on a live endpoint is a silent breaking change that is easy to overlook. Suppose you previously accepted country as any string and now you enforce it must be a valid ISO 3166 code. Every existing client sending "UK" (which is not a valid ISO code — it's "GB") suddenly gets 400 errors. Relaxing validation is safe; tightening it is a breaking change and needs the same care as a rename or removal.
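
One way to give a tightened rule that care is to run it in report-only mode first, so you learn who would be rejected before anyone is. This is an assumption layered on top of the lesson, not a prescribed procedure. The code set below is a tiny illustrative subset, and the enforce switch stands in for a feature flag.

```python
import logging
from typing import Set

logger = logging.getLogger("validation")

# Tiny illustrative subset; a real service would load the full ISO 3166-1 list.
KNOWN_COUNTRY_CODES: Set[str] = {"GB", "US", "DE", "FR"}


def check_country(value: str, enforce: bool) -> bool:
    """Return True if the request should be accepted."""
    if value in KNOWN_COUNTRY_CODES:
        return True
    if enforce:
        return False  # the caller would respond with 400
    # Report-only: accept as before, but record who would have been rejected.
    logger.warning("country %r would fail the stricter rule", value)
    return True


print(check_country("UK", enforce=False))  # True
print(check_country("UK", enforce=True))   # False
```

Once the warning log stays quiet, or the remaining callers have been contacted, switching enforce on carries far less risk.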

Under the hood: expand-and-contract, step by step

The diagram earlier showed three phases as boxes on a timeline. This section makes each phase concrete — what code ships, what the wire looks like, how you detect that each phase is complete, and what tolerant-reader parsing looks like in practice.

The concrete scenario

The existing GET /v1/users/:id response contains fullName (a single string). The team wants to rename it displayName. There are three known consumers (a web frontend, a mobile app, an internal reporting job) and an unknown number of third-party integrations. No version bump.

Phase 1 — Expand: dual-write both fields

Deploy a server change that emits both the old field and the new field, with identical values. Old consumers keep reading fullName and are unaffected. New consumers (including any you write today) can start reading displayName.

# Server response after Phase 1 deploy
{
  "id":          "usr_99",
  "fullName":    "Laila Ahmadi",     // OLD — still emitted
  "displayName": "Laila Ahmadi",  // NEW — added alongside
  "email":       "laila@example.com"
}

# Server-side pseudocode: write both from the same source field
def serialize_user(user):
    return {
        "id":          user.id,
        "fullName":    user.display_name,   # legacy alias
        "displayName": user.display_name,   # new canonical name
        "email":       user.email,
    }

This phase is safe to deploy at any time — no existing consumer breaks. The field fullName still has the same value, so code reading it still works. The deployment signal to advance to Phase 2: changelog published, all known consumer teams notified, and (ideally) a feature-flag or canary confirmed no regressions.

Phase 2 — Migrate readers: update each consumer to read the new field

Each consumer switches its read path from response.fullName to response.displayName. Both fields are still emitted, so this migration can be done one consumer at a time, in any order, with no coordinated downtime.

# Tolerant-reader parsing: the consumer only extracts what it needs
# and ignores everything else. This is what makes additive changes free.

# Python (Pydantic: extra fields are ignored by default)
from pydantic import BaseModel

class UserResponse(BaseModel):
    id:          str
    displayName: str          # read new field
    email:       str
    # fullName is not declared, so Pydantic drops it (default extra="ignore")

# Java (Jackson)
@JsonIgnoreProperties(ignoreUnknown = true)   // extra fields silently dropped
public class UserResponse {
    public String id;
    public String displayName;   // reads new field
    public String email;
    // fullName not declared — ignored
}

A non-tolerant client with strict schema validation (e.g. Pydantic with extra="forbid", or a JSON Schema validator with additionalProperties: false) rejects any response containing a key outside its expected shape. Before it migrates, such a client fails on displayName as soon as Phase 1 ships; after it migrates, it fails on the still-present fullName. This is why Phase 1 is safe when consumers are tolerant but dangerous when they are strict. Audit your consumers' deserialization settings before Phase 1.

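Signal to advance to Phase 3: the server cannot observe which fields a client actually reads from a JSON body, so rely on proxies. Contract tests or confirmations from known consumer teams show who has switched, and access logs or metrics broken down by client ID and app version show how much traffic still comes from versions known to read fullName. For mobile apps, this means the old app versions (those that read fullName) have been sufficiently retired from your active user base. Instrument the server to count, per client version, every response that still carries fullName; this makes the migration auditable.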
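
A server can only count what it emits, not what clients read, so the instrumentation usually pairs an emission counter with a list of client versions known to depend on the old field. The sketch below assumes clients identify themselves with a version string; the class name, method names, and version strings are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


class LegacyFieldTracker:
    """Counts responses that still carried the legacy field, per client version."""

    def __init__(self) -> None:
        self.emitted: Counter = Counter()

    def record(self, client_version: str) -> None:
        self.emitted[client_version] += 1

    def still_dependent(self, legacy_versions: Set[str]) -> Dict[str, int]:
        """Traffic from versions known (e.g. from release notes) to read the old field."""
        return {v: n for v, n in self.emitted.items() if v in legacy_versions}


tracker = LegacyFieldTracker()
for version in ["mobile-app/4.1.2", "mobile-app/4.1.2", "web-frontend/2025-06-10"]:
    tracker.record(version)

blocking = tracker.still_dependent({"mobile-app/4.1.2", "mobile-app/4.0.8"})
print(blocking)  # {'mobile-app/4.1.2': 2}
```

Phase 3 becomes a candidate once the dictionary of blocking versions is empty, or small enough to accept, over a representative window.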

Phase 3 — Contract: stop emitting the old field

Remove the fullName key from the server response. Any consumer that survived Phases 1 and 2 without switching is now broken — but you have empirical evidence (from Phase 2 monitoring) that none did.

# Server response after Phase 3 — clean
{
  "id":          "usr_99",
  "displayName": "Laila Ahmadi",
  "email":       "laila@example.com"
}

# Server-side: legacy alias removed
def serialize_user(user):
    return {
        "id":          user.id,
        "displayName": user.display_name,
        "email":       user.email,
    }

Phase timeline summary

Phase | Server emits | Old clients | New clients | Safe to deploy?
Before Phase 1 | fullName only | Work fine | Cannot use displayName | N/A
Phase 1 (Expand) | Both fullName + displayName | Work fine (tolerant readers ignore new field) | Can start reading displayName | Yes
Phase 2 (Migrate) | Both | Actively being migrated to read displayName | Reading displayName | Yes (rolling)
Phase 3 (Contract) | displayName only | Any still reading fullName break | Work fine | Only after monitoring confirms zero fullName reads

How to debug & inspect it

Breaking changes almost always announce themselves through one of three failure modes: a consumer gets an unexpected undefined / null for a field they depend on; a consumer gets a 400 where they previously got 200 (validation tightened); or a consumer's data is silently wrong (semantic change — same field, different meaning). The debugging question is: which change broke which client, and when?

Detect a breaking change before you ship (contract tests)

The gold standard is consumer-driven contract testing (CDCT): each consumer publishes a "contract" describing exactly what fields and types it reads. A CI step on the producer side runs all consumer contracts against the new response shape and fails the build if any contract is violated. Tools: Pact (language-agnostic), Spring Cloud Contract.
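
To make the idea concrete without depending on any tool, here is a hand-rolled sketch of the check a producer's CI step performs. The contract format is hypothetical and far simpler than what Pact or Spring Cloud Contract use: each consumer lists the fields it reads and the Python type it expects.

```python
from typing import Any, Dict, List

# Hypothetical contract format: each consumer lists the fields and types it reads.
CONTRACTS: Dict[str, Dict[str, type]] = {
    "web-frontend": {"id": str, "displayName": str},
    "mobile-app": {"id": str, "fullName": str},
    "reporting-job": {"id": str, "email": str},
}


def failing_consumers(response: Dict[str, Any]) -> List[str]:
    """Return the consumers whose contract the candidate response would violate."""
    failures = []
    for consumer, fields in CONTRACTS.items():
        for name, expected in fields.items():
            # A missing field yields None, which fails the isinstance check too.
            if not isinstance(response.get(name), expected):
                failures.append(consumer)
                break
    return failures


phase3_response = {"id": "usr_99", "displayName": "Laila Ahmadi", "email": "laila@example.com"}
print(failing_consumers(phase3_response))  # ['mobile-app']
```

A non-empty result fails the build, which is how a premature Phase 3 is caught before it reaches production.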

# Illustrative session; the exact command and flags depend on your Pact tooling.
$ pact verify --provider-base-url http://localhost:8080 \
    --pact-broker-url https://pact-broker.internal \
    --provider users-service

Verifying 3 pacts for provider users-service...
  web-frontend (contract v12) ... passed
  mobile-app (contract v8) ... FAILED
    1 interaction failed:
      Expected field 'fullName' to be present in response body but was absent
  reporting-job (contract v3) ... passed

2/3 passed. Build failed.

# The mobile-app contract pins 'fullName'. Removing it (Phase 3)
# before this contract is updated would break that consumer.

Audit field usage in server access logs

If you lack CDCT, server-side request/response logging is the fallback. Ideally you would track which response fields each client (identified by User-Agent or client ID) actually uses, but in REST this can only be approximated: the server cannot tell what a client does with a JSON response. Practical proxies are logging which client versions still receive the old field, tracking which clients still send the old field in request bodies, and watching for behavior changes (error rates) after a deploy.

$ grep '"fullName"' /var/log/api/responses.log | \
    awk '{print $5}' | sort | uniq -c | sort -rn | head -10
   2341 mobile-app/4.1.2
    198 mobile-app/4.0.8
# Mobile app versions 4.1.2 and 4.0.8 are still being served 'fullName'.
# web-frontend/2025-06-10 has no matching lines, so it does not appear
# (uniq -c never prints a zero count).
# Phase 3 is not safe yet: those app versions are still in use.

Symptom | Likely cause | Fix
Consumer shows undefined/null for a field after server deploy | Field was renamed or removed (breaking change) without expand-and-contract | Roll back server; re-introduce the old field alongside the new one; follow the three-phase pattern
Consumer gets 400 on a request that previously worked | Validation tightened (e.g. new enum value restriction, shorter max-length, required field added) | Roll back; treat as a breaking change; use a new API version or feature flag
Consumer data is subtly wrong (wrong totals, wrong labels) with no errors | Semantic change: field name unchanged but meaning changed | This is the hardest to detect. It requires consumer-side monitoring of output correctness; document semantic changes explicitly in changelogs
Contract test fails on CI after a server change | A consumer's contract pins a field/type that was modified | Do not merge the server change; coordinate with the consumer team to update the contract first or apply expand-and-contract
Old API version traffic unexpectedly high months after sunset date | Mobile app versions with long update cycles; internal scripts nobody knew about | Export caller breakdown from access logs; reach out to high-traffic callers directly; extend sunset deadline if needed
Consumer strict schema rejects response after Phase 1 expand | Consumer uses strict deserialization (extra="forbid") that errors on unknown fields | Fix the consumer's deserialization config to ignore unknown fields; this is a consumer bug, but you must work around it during the migration

Debug checklist:

  1. When a consumer reports breakage after a server deploy: check the git diff for the serializer — was a field removed or renamed in one step?
  2. Run consumer-driven contract tests locally against the proposed change before merging.
  3. For Phase 3 readiness: query access logs for traffic from client versions known to read the old field name; do not proceed until the count is zero (or acceptably low per your SLA).
  4. Check the Deprecation and Sunset response headers on the deprecated field/endpoint — are they present and correctly dated?
  5. For strict-schema consumer failures: confirm the consumer's JSON library is configured to ignore unknown properties.
  6. Add a server-side metric/log line every time the old deprecated field is included in a response — set an alert on its traffic so you know when it reaches zero.

🧠 Quick check

1. Which of these response changes is safe to ship without a version bump?

Adding an optional response field is a purely additive change. Tolerant readers ignore it; existing clients continue to work without modification. Renames and type changes break callers that depend on the old shape.

2. In the expand-and-contract pattern, when is it safe to remove the old field?

The whole point of the pattern is that the old field must co-exist with the new one for as long as any consumer still reads it. Removing it before all consumers migrate defeats the purpose. Time-based cutoffs ("one week") are unreliable — you need positive confirmation or monitoring data.

3. The Sunset HTTP header communicates:

RFC 8594 defines Sunset as the date/time after which the resource is expected to become unresponsive. It is a machine-readable retirement signal, distinct from Cache-Control (caching) or release notes (human history).

4. A tolerant reader client encounters a JSON response with a field it has never seen before. What should it do?

Tolerant readers ignore unknown fields. This is Postel's robustness principle applied to consumers: liberal in what you accept. Crashing on unknown fields makes additive changes on the server side de facto breaking changes for that client.

✍️ Exercise: design a safe migration for a field rename

You maintain a payments API. The current response for GET /v1/charges/:id includes a field called amountCents (an integer number of cents). Your team wants to rename it to amount and change it to a decimal string (e.g. "12.50") to make it easier for front-ends to display. You have approximately 40 known integrations and an unknown number of mobile app versions in the wild.

Sketch the full migration plan: what phases, what you ship in each phase, and what signals tell you it's safe to move to the next phase.


Model answer:

Phase 1 — Expand (safe to ship now): Add the new field amount (decimal string) to the response alongside the existing amountCents. Document both fields. Emit both in every response. No callers are broken; new integrations can start reading amount.

Phase 2 (Migrate consumers): Notify all 40 known integrations with a deadline, a migration guide, and the Deprecation + Sunset headers on responses. The server cannot see which field a client reads, so track which integrations have not yet confirmed the switch to amount (per API key in access logs, direct confirmation, or client-side analytics). For mobile apps, release updates and wait for adoption to reach an acceptable threshold (e.g. 99% of active sessions on a version that reads amount).

Phase 3 — Contract: Only after monitoring confirms no production traffic relies on amountCents, remove it from the response. Keep an API changelog entry explaining the removal date.
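
The Phase 1 serializer from this model answer might look like the sketch below. It assumes a two-decimal currency, and the function name and charge ID are hypothetical. Deriving both fields from the same stored integer keeps them from drifting apart.

```python
from decimal import Decimal


def serialize_charge(charge_id: str, amount_cents: int) -> dict:
    """Phase 1 shape: emit the legacy integer and the new decimal string together."""
    # Assumes a currency with two decimal places; Decimal avoids float rounding.
    amount = (Decimal(amount_cents) / Decimal(100)).quantize(Decimal("0.01"))
    return {
        "id": charge_id,
        "amountCents": amount_cents,  # legacy field, removed only in Phase 3
        "amount": str(amount),        # new field, derived from the same source value
    }


print(serialize_charge("ch_1", 1250))
# {'id': 'ch_1', 'amountCents': 1250, 'amount': '12.50'}
```

Because the new field has a new name as well as a new type, no existing client ever sees amountCents change shape, which is what lets the rename and the type change travel together through the three phases.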

Rubric: ✓ Three distinct phases clearly named ✓ Both fields coexist during phase 2 ✓ Uses Sunset/Deprecation headers ✓ Identifies mobile app version lag as a specific risk ✓ Names a measurable signal (traffic/adoption) for moving phases ✓ Does not time-box phase 2 arbitrarily.
