API Design

Data & Formats · Lesson 02

Textual formats: JSON & XML

JSON conquered the web API world in under a decade — but XML never went away. Understanding both formats and their genuine trade-offs lets you make an informed choice instead of a reflex one.

⏱ 12 min Difficulty: core Prereq: df-01 — Data representation

By the end you'll be able to

JSON: the format that was already there

JSON — JavaScript Object Notation — won the web API wars not because it was invented in a standards body or optimized for computers, but because it was already inside every browser. When a JavaScript page needed to talk to a server in the early 2000s, parsing JSON was a one-liner: JSON.parse(text). That accidental head start became a permanent advantage once REST APIs and AJAX made browser-to-server calls routine.

The format itself is deliberately narrow. There are exactly six value types: string, number, boolean, null, array, and object. No dates, no binary blobs, no comments, no references. What looks like a limitation is actually the feature: any developer in any language can learn the entire format in an afternoon.

object { } string "Ada" number 42 / 3.14 boolean true / false null null array [ ] ordered list object nested Every JSON value is one of exactly six types. Objects and arrays are the containers; the other four are leaves. Compose them arbitrarily deep — but the type system never grows beyond these six.
The JSON value tree. Root is always one of the six types; objects and arrays can nest the others to any depth.

Why JSON won

Four concrete reasons explain JSON's dominance over earlier XML-heavy REST and SOAP approaches:

  1. Zero parsing friction in the browser. JSON.parse() and JSON.stringify() are built into every JavaScript engine. No library, no schema, no namespace declarations.
  2. Maps directly to native types. A JSON object becomes a Python dict, a Go struct, a Ruby Hash — the binding is mechanical. XML nodes have no such natural mapping; you need an ORM-like layer.
  3. Readable without tooling. A developer can paste a JSON blob into a browser console or VS Code and immediately see the structure. This reduces debugging time dramatically.
  4. Minimal syntax surface. The entire grammar fits on a postcard. Fewer rules mean fewer parser bugs and interop surprises.

JSON Schema: adding structure without abandoning flexibility

JSON's lack of a built-in schema is its main weakness for server validation. JSON Schema fills the gap: it's a JSON document that describes what another JSON document should look like — required fields, value types, string patterns, numeric ranges. Swagger/OpenAPI uses JSON Schema under the hood to describe every request and response body in an API.

// JSON Schema fragment — validates a User object
{
  "type": "object",
  "required": ["id", "email"],
  "properties": {
    "id":    { "type": "integer", "minimum": 1 },
    "email": { "type": "string",  "format": "email" },
    "role":  { "type": "string",  "enum": ["admin", "viewer"] }
  },
  "additionalProperties": false
}

XML: the format that refused to die

XML (eXtensible Markup Language) predates JSON by a decade and was the default for web services throughout the 2000s via SOAP and WS-* protocols. It never really lost ground in the domains where it has genuine advantages.

XML excels at mixed content — text that contains markup interspersed with data, like a legal document with embedded annotations. JSON cannot represent "a paragraph of text with a bold section in the middle" without awkward workarounds; XML's element nesting handles it naturally. This is why the publishing industry, legal tech, and government document standards (DITA, DocBook, HL7 for healthcare) still use XML.

XML also has a mature ecosystem: XSD (XML Schema Definition) for strict structural validation, XSLT for declarative transformation, XPath/XQuery for querying. SOAP APIs built on these standards are still alive in banking, insurance, and enterprise middleware — not because XML is better, but because the toolchain is deeply embedded and migration risk is high.

Same document, two formats

Here is a simple invoice represented in both formats. Notice how XML's attributes and mixed-content nesting can describe a line item differently than JSON's flat key-value pairs, and how XML is nearly three times more verbose for this purely data-oriented example.

// JSON — 185 bytes
{
  "invoice": {
    "id":   "INV-2024-0042",
    "total": 149.99,
    "currency": "USD",
    "lines": [
      { "sku": "WDG-7", "qty": 2, "unit_price": 49.99 },
      { "sku": "SVC-1", "qty": 1, "unit_price": 50.01 }
    ]
  }
}
<!-- XML — 330 bytes -->
<invoice xmlns="urn:example:billing">
  <id>INV-2024-0042</id>
  <total currency="USD">149.99</total>
  <lines>
    <line sku="WDG-7" qty="2" unit_price="49.99"/>
    <line sku="SVC-1" qty="1" unit_price="50.01"/>
  </lines>
</invoice>

Trade-offs at a glance

ConcernJSONXML
VerbosityLowHigh (closing tags, namespace declarations)
Browser-nativeYesNo — needs a parser
Mixed contentAwkwardFirst-class
Schema / validationJSON Schema (external)XSD (built into ecosystem)
TransformationNone built inXSLT
CommentsNot allowedAllowed
Binary dataBase64 string hackBase64 or MTOM
Still used forWeb APIs, config, storageSOAP, documents, HL7, DITA
⚠️ Common trap: large integers and float precision

JSON's number type maps to IEEE 754 double-precision float — 64-bit. That sounds generous until your database uses 64-bit integer IDs larger than 253. Above that threshold, JavaScript's Number type cannot represent every integer exactly, so 9007199254740993 silently becomes 9007199254740992 in a browser. The fix: send large integers as strings ("id": "9007199254740993") and document that convention explicitly. Similarly, avoid representing monetary amounts as floats — use integer cents or a string decimal instead.

The same trap bites dates: JSON has no date type. "2024-03-15" is just a string — nothing enforces it is a valid date or a consistent format. Agree on ISO 8601 in UTC ("2024-03-15T09:00:00Z") and validate with JSON Schema's "format": "date-time".

🎯 Interview angle

"JSON or XML — and when would you choose XML?" A strong answer: JSON for any new web or mobile API because of browser-native parsing and lower verbosity. XML when the domain requires it — SOAP integrations with legacy enterprise systems, document formats with mixed content (legal, publishing), or when the consuming team has deep XSD/XSLT tooling that would be expensive to replace. Mentioning SOAP and mixed content specifically signals real experience rather than cargo-culting JSON.

✅ Lock in your date and number conventions early

Add to your API style guide on day one: all timestamps in ISO 8601 UTC; all IDs that may exceed 253 serialized as strings; all monetary values as integer-minor-units (cents) or decimal strings. These decisions are trivial to make upfront and painful to change after clients depend on them.

Under the hood: how it actually works

When a runtime calls JSON.parse(), it runs two distinct phases. First, lexing/tokenizing: the parser scans bytes left-to-right, emitting a flat stream of tokens — STRING, NUMBER, TRUE, FALSE, NULL, {, }, [, ], :, and ,. There is no meaning yet, just classification. Second, value construction: the token stream is consumed recursively and assembled into the language's native type tree — a Python dict, a JavaScript object, a Go struct. The type a NUMBER token maps to is decided here, in the language runtime, not by the JSON spec itself. That is exactly where the IEEE-754 precision trap is introduced.

// IEEE-754 double precision: 53 bits of mantissa
// 2^53     = 9007199254740992  ← exactly representable
// 2^53 + 1 = 9007199254740993  ← requires 54 bits; rounds DOWN to 2^53

// JavaScript — silent precision loss
JSON.parse('{"id": 9007199254740993}').id
// → 9007199254740992  ✗ wrong! the last bit was silently dropped

// Safe fix: send the ID as a string
JSON.parse('{"id": "9007199254740993"}').id
// → "9007199254740993"  ✓ correct string; parse with BigInt() or int64
BigInt(JSON.parse('{"id": "9007199254740993"}').id)
// → 9007199254740993n  ✓

// Monetary amounts — IEEE-754 cannot represent 0.1 exactly
0.1 + 0.2
// → 0.30000000000000004  ✗  not 0.3
// So {"price": 0.1} decoded as float and summed gives wrong totals.
// Fix: use integer minor units  →  {"price_cents": 1099}  for $10.99
# Inspect numbers with jq — precision loss is visible echo '{"id":9007199254740993}' | jq '.id' 9007199254740992 # jq itself uses C double — it loses precision too echo '{"id":"9007199254740993"}' | jq -r '.id' 9007199254740993 # string field: value preserved exactly # Pretty-print a live API response curl -s https://api.example.com/v1/orders/42 | jq '.' { "order_id": "9007199254741001", "total_cents": 4999, ... } # Extract nested values from an array curl -s https://api.example.com/v1/orders/42 | jq '.items[].price' 1999 2999 # List top-level keys curl -s https://api.example.com/v1/users | jq 'keys' ["data", "meta", "pagination"]

A JSON Schema validator works by walking the schema tree and the document tree in parallel, applying three categories of checks at each node: type checks (is this value actually a string?), constraint checks (does this number meet minimum? does this string match pattern? does this string length respect maxLength?), and required-field checks (are all keys listed in "required" present?). When a check fails, the validator records the JSON Pointer path to the offending node — for example #/items/0/price — so errors are precise and actionable rather than vague "invalid document" messages.

# Validate with Python's jsonschema library
python3 -c "
import jsonschema, json
schema = json.load(open('schema.json'))
data   = json.load(open('response.json'))
jsonschema.validate(data, schema)
print('valid')
"
# A violation prints: jsonschema.exceptions.ValidationError: 'foo' is not of type 'integer'
# with a .json_path pointing to the exact failing field.

# ajv-cli (Node.js) works similarly:
npx ajv validate -s schema.json -d response.json

How to debug & inspect it

# 1. Confirm a response is valid JSON and see its structure curl -s https://api.example.com/v1/foo | python3 -m json.tool { "status": "ok", "count": 3 } # If invalid JSON, python3 -m json.tool exits non-zero with a clear error message. # 2. Find numbers that exceed 2^53 — candidates for string encoding curl -s https://api.example.com/v1/orders | jq '.. | numbers | select(. > 9007199254740991)' 9007199254741001 # Any match here is a precision risk on JavaScript clients. # 3. Duplicate-key check — JSON spec allows it; most parsers silently keep the last value python3 -c "import json; json.loads(open('resp.json').read())" # Standard library silently wins with the LAST occurrence of a duplicate key. # Use object_pairs_hook to detect dups explicitly (see jsonschema or custom hook). # 4. Inspect raw bytes for BOM or encoding issues curl -s https://api.example.com/v1/foo | file - /dev/stdin: JSON text data hexdump -C response.json | head -2 00000000 ef bb bf 7b 22 73 74 61 74 75 73 22 3a 22 6f 6b |...{"status":"ok| # ef bb bf = UTF-8 BOM — strip it; most JSON parsers reject or silently corrupt it.
Symptom Cause Fix
Large integer ID silently changes value on the client Exceeds 253 (IEEE-754 double precision) Serialize IDs > 253 as JSON strings; parse with BigInt/int64 on the client
Date fields parsed inconsistently across regions Non-standard date format ("March 15 2024") interpreted by locale Mandate ISO 8601 UTC ("2024-03-15T00:00:00Z") in your API contract; validate with JSON Schema "format": "date-time"
Client reads wrong value when key appears twice Duplicate keys: JSON spec technically allows them but parser behavior is undefined Use a schema validator at ingress; add "additionalProperties": false to JSON Schema
Money calculation errors (0.1 + 0.2 ≠ 0.3) Monetary amounts stored as IEEE-754 float in JSON Use integer minor units (cents) or string decimal; never float for money
Parse error: "Unexpected token" / "Invalid character" UTF-8 BOM (0xEF 0xBB 0xBF) at start of JSON file, or non-UTF-8 encoding Strip BOM; ensure output is UTF-8 without BOM; set Content-Type: application/json; charset=utf-8
  1. Pipe the response through jq '.' or python3 -m json.tool to confirm it is valid JSON and see structure.
  2. Check for numbers larger than 9007199254740992 (253) that should be IDs — flag them for string encoding.
  3. Verify all date fields use ISO 8601 UTC format.
  4. Run a JSON Schema validator against the response if a schema exists — it will pinpoint type mismatches precisely.
  5. For encoding issues, hexdump -C the first 4 bytes: a BOM starts with ef bb bf 7b.

🧠 Quick check

1. JSON has exactly six value types. Which list is correct?

JSON has no dedicated date or distinction between integer and float — both are "number." The six types are string, number, boolean, null, array, and object.

2. A client receives the JSON {"order_id": 9007199254741001}. What risk does this create for a JavaScript consumer?

JavaScript's Number is IEEE 754 double precision. Integers above 2^53 lose precision. The standard fix is to send large IDs as strings: "order_id": "9007199254741001".

3. A legal tech team asks you to design a format for contracts where clause text can contain embedded annotations and cross-references mid-sentence. Which format is a better fit and why?

Mixed content — e.g., a sentence like "See <ref id="3">clause 3</ref> for details" — is a first-class XML concept. JSON has no equivalent; you'd need an awkward array of text/markup objects. XML's design specifically anticipates this pattern.

4. What is the recommended way to represent a monetary amount (e.g., $12.99) in a JSON API to avoid floating-point errors?

Floating-point arithmetic cannot represent most decimal fractions exactly. The safe options are integer cents (1299) or a string decimal ("12.99"). Both are unambiguous and avoid representation errors. The two-element array is unusual and increases parsing complexity for consumers.

✍️ Exercise: audit a JSON API response for format problems

Review the following API response fragment and identify every format issue. Propose a corrected version.

{
  "user_id":    9007199254741234,
  "balance":   49.99,
  "joined":    "March 15 2024",
  "last_login": "15-03-2024 09:32",
  "active":    "yes"
}

Model answer — five issues:

  1. user_id is a large integer exceeding 253; use "user_id": "9007199254741234" (string).
  2. balance is a float that cannot represent 49.99 exactly in IEEE 754; use integer cents 4999 or string "49.99", and document the currency and unit.
  3. joined is a non-standard date string; use ISO 8601 UTC: "2024-03-15T00:00:00Z".
  4. last_login uses a locale-specific format with an ambiguous separator; use ISO 8601: "2024-03-15T09:32:00Z".
  5. active is the string "yes" instead of the boolean true; use the JSON boolean type.

Rubric: ✓ identified large-integer risk ✓ identified float monetary risk ✓ both date fields flagged ✓ boolean-as-string flagged ✓ proposed concrete fixes. Four of five = solid; all five = exceptional.

Key takeaways

Sources & further reading