Data & Formats · Lesson 02

Textual formats: JSON & XML

JSON conquered the web API world in under a decade — but XML never went away. Understanding both formats and their genuine trade-offs lets you make an informed choice instead of a reflex one.

⏱ 12 min Difficulty: core Prereq: df-01 — Data representation

By the end you'll be able to

Name JSON's six value types and explain why its simplicity made it the default web API format.
Identify the scenarios where XML still has a legitimate edge over JSON.
Describe the large-integer/float precision problem and the standard workaround.

JSON: the format that was already there

JSON (JavaScript Object Notation) won the web API wars not because it was invented in a standards body or optimized for computers, but because it was already inside every browser. When a JavaScript page needed to talk to a server in the early 2000s, a JSON response was already valid JavaScript syntax and could be turned into objects with a single eval() call; the safer built-in JSON.parse(text) was standardized later, in ECMAScript 5 (2009). That accidental head start became a permanent advantage once REST APIs and AJAX made browser-to-server calls routine.

The format itself is deliberately narrow. There are exactly six value types: string, number, boolean, null, array, and object. No dates, no binary blobs, no comments, no references. What looks like a limitation is actually the feature: any developer in any language can learn the entire format in an afternoon.
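As a rough illustration of how small the type system is, the sketch below decodes one value of each of the six types with Python's standard json module and records the native type each one becomes. The sample document and its key names are invented for illustration.

```python
import json

# One sample of each of JSON's six value types in a single document.
# The keys (s, n, b, z, a, o) are hypothetical names chosen for this example.
sample = '{"s": "text", "n": 1.5, "b": true, "z": null, "a": [1, 2], "o": {"k": "v"}}'

decoded = json.loads(sample)

# Map each key to the name of the Python type the decoder produced.
type_names = {key: type(value).__name__ for key, value in decoded.items()}
print(type_names)
```

Other languages make the same kind of mechanical mapping, though the exact native type chosen for "number" differs between runtimes.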

The JSON value tree. Root is always one of the six types; objects and arrays can nest the others to any depth.

Why JSON won

Four concrete reasons explain JSON's dominance over earlier XML-heavy REST and SOAP approaches:

Zero parsing friction in the browser. JSON.parse() and JSON.stringify() are built into every JavaScript engine. No library, no schema, no namespace declarations.
Maps directly to native types. A JSON object becomes a Python dict, a Go map or struct, a Ruby Hash; the binding is mechanical. XML nodes have no such natural mapping; you need a data-binding layer to turn elements and attributes into native objects.
Readable without tooling. A developer can paste a JSON blob into a browser console or VS Code and immediately see the structure. This reduces debugging time dramatically.
Minimal syntax surface. The entire grammar fits on a postcard. Fewer rules mean fewer parser bugs and interop surprises.

JSON Schema: adding structure without abandoning flexibility

JSON's lack of a built-in schema is its main weakness for server validation. JSON Schema fills the gap: it's a JSON document that describes what another JSON document should look like, including required fields, value types, string patterns, and numeric ranges. OpenAPI (formerly Swagger) uses a JSON Schema dialect to describe the request and response bodies in an API.

// JSON Schema fragment — validates a User object
{
  "type": "object",
  "required": ["id", "email"],
  "properties": {
    "id":    { "type": "integer", "minimum": 1 },
    "email": { "type": "string",  "format": "email" },
    "role":  { "type": "string",  "enum": ["admin", "viewer"] }
  },
  "additionalProperties": false
}

XML: the format that refused to die

XML (eXtensible Markup Language) became a W3C Recommendation in 1998, a few years before JSON was popularized, and was the default for web services throughout the 2000s via SOAP and WS-* protocols. It never really lost ground in the domains where it has genuine advantages.

XML excels at mixed content — text that contains markup interspersed with data, like a legal document with embedded annotations. JSON cannot represent "a paragraph of text with a bold section in the middle" without awkward workarounds; XML's element nesting handles it naturally. This is why the publishing industry, legal tech, and government document standards (DITA, DocBook, HL7 for healthcare) still use XML.
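To make "mixed content" concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The sentence, the p and ref element names, and the id attribute are made up for illustration; the point is that the text before, inside, and after the inline element all survive parsing in document order.

```python
import xml.etree.ElementTree as ET

# A sentence with an inline cross-reference in the middle (hypothetical markup).
sentence = ET.fromstring('<p>See <ref id="3">clause 3</ref> for details.</p>')
ref = sentence.find("ref")

before = sentence.text   # text before the inline element
inside = ref.text        # text wrapped by the inline element
after = ref.tail         # text following the inline element

# itertext() walks the tree in document order, so the plain sentence
# can be rebuilt without losing where the markup sat.
plain = "".join(sentence.itertext())
print(repr(before), repr(inside), repr(after))
print(plain)
```

A JSON encoding of the same sentence would need something like an array that alternates strings and objects, which is the awkward workaround the paragraph above refers to.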

XML also has a mature ecosystem: XSD (XML Schema Definition) for strict structural validation, XSLT for declarative transformation, XPath/XQuery for querying. SOAP APIs built on these standards are still alive in banking, insurance, and enterprise middleware — not because XML is better, but because the toolchain is deeply embedded and migration risk is high.

Same document, two formats

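Here is a simple invoice represented in both formats. Notice how XML can split information between attributes and child elements (currency becomes an attribute of total, and each line item is an attribute-only element) where JSON uses nested objects and arrays. XML is usually the more verbose format for data-oriented payloads, but this attribute-heavy encoding keeps the gap small: as formatted here the two documents are nearly the same size, and the difference grows once every field becomes its own element with a closing tag.

// JSON: about 230 bytes as formatted here
{
  "invoice": {
    "id":   "INV-2024-0042",
    "total": 149.99,
    "currency": "USD",
    "lines": [
      { "sku": "WDG-7", "qty": 2, "unit_price": 49.99 },
      { "sku": "SVC-1", "qty": 1, "unit_price": 50.01 }
    ]
  }
}

<!-- XML: about 235 bytes as formatted here -->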
<invoice xmlns="urn:example:billing">
  <id>INV-2024-0042</id>
  <total currency="USD">149.99</total>
  <lines>
    <line sku="WDG-7" qty="2" unit_price="49.99"/>
    <line sku="SVC-1" qty="1" unit_price="50.01"/>
  </lines>
</invoice>
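One practical difference between the two invoices is typing: JSON numbers arrive as numbers, while every XML attribute arrives as a string that the consumer must convert. The sketch below, using only Python's standard library, recomputes the invoice total from each representation. The function names are hypothetical, and Decimal is used so the arithmetic is exact.

```python
import json
import xml.etree.ElementTree as ET
from decimal import Decimal

INVOICE_JSON = """
{"invoice": {"id": "INV-2024-0042", "total": 149.99, "currency": "USD",
  "lines": [
    {"sku": "WDG-7", "qty": 2, "unit_price": 49.99},
    {"sku": "SVC-1", "qty": 1, "unit_price": 50.01}
  ]}}
"""

INVOICE_XML = """
<invoice xmlns="urn:example:billing">
  <id>INV-2024-0042</id>
  <total currency="USD">149.99</total>
  <lines>
    <line sku="WDG-7" qty="2" unit_price="49.99"/>
    <line sku="SVC-1" qty="1" unit_price="50.01"/>
  </lines>
</invoice>
"""


def total_from_json(text):
    # parse_float=Decimal keeps 49.99 as an exact decimal, not a binary float.
    doc = json.loads(text, parse_float=Decimal)
    return sum(line["qty"] * line["unit_price"] for line in doc["invoice"]["lines"])


def total_from_xml(text):
    # Elements live in the default namespace; attributes are plain strings.
    ns = {"b": "urn:example:billing"}
    root = ET.fromstring(text.strip())
    return sum(
        int(line.get("qty")) * Decimal(line.get("unit_price"))
        for line in root.findall("b:lines/b:line", ns)
    )


json_total = total_from_json(INVOICE_JSON)
xml_total = total_from_xml(INVOICE_XML)
print(json_total, xml_total)
```

Both totals agree with the invoice's stated total, but only the XML path needed explicit int() and Decimal() conversions plus a namespace map.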

Trade-offs at a glance

Concern	JSON	XML
Verbosity	Low	High (closing tags, namespace declarations)
Browser-native	Yes (JSON.parse)	Yes (DOMParser), but it yields a DOM tree rather than plain objects
Mixed content	Awkward	First-class
Schema / validation	JSON Schema (external)	XSD (built into ecosystem)
Transformation	None built in	XSLT
Comments	Not allowed	Allowed
Binary data	Base64 string hack	Base64 or MTOM
Still used for	Web APIs, config, storage	SOAP, documents, HL7, DITA

⚠️ Common trap: large integers and float precision

The JSON specification sets no limit on the size or precision of a number, but JavaScript, along with many other parsers, decodes every JSON number into an IEEE 754 double-precision (64-bit) float. That sounds generous until your database uses 64-bit integer IDs larger than 2⁵³. Above that threshold, a double cannot represent every integer exactly, so 9007199254740993 silently becomes 9007199254740992 in a browser. The fix: send large integers as strings ("id": "9007199254740993") and document that convention explicitly. Similarly, avoid representing monetary amounts as floats; use integer cents or a string decimal instead.

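A related trap involves dates: JSON has no date type. "2024-03-15" is just a string, and nothing enforces that it is a valid date or a consistent format. Agree on ISO 8601 in UTC ("2024-03-15T09:00:00Z") and validate with JSON Schema's "format": "date-time". Be aware that many validators treat "format" as an annotation only, unless format checking is explicitly enabled.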
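A rough sketch of the kind of check a consumer can run on timestamp strings, using only Python's standard library. The function name is hypothetical, and the check is deliberately narrow: it accepts only the second-precision UTC form used in this lesson, not every string that ISO 8601 allows.

```python
from datetime import datetime


def is_iso8601_utc(value: str) -> bool:
    """Return True if value matches YYYY-MM-DDTHH:MM:SSZ and names a real date."""
    try:
        # strptime raises ValueError for a wrong layout or an impossible date.
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return False
    return True


good = is_iso8601_utc("2024-03-15T09:00:00Z")
prose_date = is_iso8601_utc("March 15 2024")
impossible = is_iso8601_utc("2024-02-30T00:00:00Z")
print(good, prose_date, impossible)
```

A string check like this catches both the wrong layout and calendar-impossible values such as 30 February, which a plain "type": "string" rule would accept.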

🎯 Interview angle

"JSON or XML — and when would you choose XML?" A strong answer: JSON for any new web or mobile API because of browser-native parsing and lower verbosity. XML when the domain requires it — SOAP integrations with legacy enterprise systems, document formats with mixed content (legal, publishing), or when the consuming team has deep XSD/XSLT tooling that would be expensive to replace. Mentioning SOAP and mixed content specifically signals real experience rather than cargo-culting JSON.

✅ Lock in your date and number conventions early

Add to your API style guide on day one: all timestamps in ISO 8601 UTC; all IDs that may exceed 2⁵³ serialized as strings; all monetary values as integer-minor-units (cents) or decimal strings. These decisions are trivial to make upfront and painful to change after clients depend on them.
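The three conventions can be captured in one small serializer so they are applied consistently. This is a minimal sketch under stated assumptions: the function name to_wire and the field names order_id, total_cents, and created_at are invented for illustration, the currency is assumed to have two decimal places, and created_at is assumed to be a timezone-aware datetime.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal


def to_wire(order_id: int, total: Decimal, created_at: datetime) -> str:
    """Serialize one order using the style-guide conventions (hypothetical helper)."""
    payload = {
        # IDs that may exceed 2**53 travel as strings.
        "order_id": str(order_id),
        # Money travels as integer minor units (cents), assuming 2 decimal places.
        "total_cents": int((total * 100).to_integral_value()),
        # Timestamps travel as ISO 8601 in UTC; created_at must be timezone-aware.
        "created_at": created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return json.dumps(payload)


wire = to_wire(
    9007199254741001,
    Decimal("49.99"),
    datetime(2024, 3, 15, 9, 0, tzinfo=timezone.utc),
)
print(wire)
```

Centralizing the conversions in one place means a client never sees a raw float amount or a locale-formatted date, even when new endpoints are added later.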

Under the hood: how it actually works

When a runtime calls JSON.parse(), it conceptually runs two phases (production parsers often interleave them in a single pass). First, lexing/tokenizing: the parser scans bytes left-to-right, emitting a flat stream of tokens: STRING, NUMBER, TRUE, FALSE, NULL, {, }, [, ], :, and ,. There is no meaning yet, just classification. Second, value construction: the token stream is consumed recursively and assembled into the language's native type tree (a Python dict, a JavaScript object, a Go struct). The type a NUMBER token maps to is decided here, in the language runtime, not by the JSON spec itself. That is exactly where the IEEE-754 precision trap is introduced.

// IEEE-754 double precision: 53 bits of mantissa
// 2^53     = 9007199254740992  ← exactly representable
// 2^53 + 1 = 9007199254740993  ← requires 54 bits; rounds DOWN to 2^53

// JavaScript — silent precision loss
JSON.parse('{"id": 9007199254740993}').id
// → 9007199254740992  ✗ wrong! the last bit was silently dropped

// Safe fix: send the ID as a string
JSON.parse('{"id": "9007199254740993"}').id
// → "9007199254740993"  ✓ correct string; parse with BigInt() or int64
BigInt(JSON.parse('{"id": "9007199254740993"}').id)
// → 9007199254740993n  ✓

// Monetary amounts — IEEE-754 cannot represent 0.1 exactly
0.1 + 0.2
// → 0.30000000000000004  ✗  not 0.3
// So {"price": 0.1} decoded as float and summed gives wrong totals.
// Fix: use integer minor units  →  {"price_cents": 1099}  for $10.99

# Inspect numbers with jq: precision loss is visible
echo '{"id":9007199254740993}' | jq '.id'
9007199254740992
# jq 1.6 and earlier decode numbers as C doubles, so they lose precision too
# (jq 1.7 and later preserve an unmodified number literal)

echo '{"id":"9007199254740993"}' | jq -r '.id'
9007199254740993
# string field: value preserved exactly

# Pretty-print a live API response
curl -s https://api.example.com/v1/orders/42 | jq '.'
{ "order_id": "9007199254741001", "total_cents": 4999, ... }

# Extract nested values from an array
curl -s https://api.example.com/v1/orders/42 | jq '.items[].price'
1999
2999

# List top-level keys
curl -s https://api.example.com/v1/users | jq 'keys'
["data", "meta", "pagination"]
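The two phases described earlier in this section can be sketched in a few lines of Python. This is an illustrative toy, not how any particular engine is implemented: the regular expression and the tokenize function are assumptions made for the example, and the second phase is reduced to the single decision that matters here, namely which native type a NUMBER token is converted to.

```python
import re

# Phase 1: a toy lexer. Each alternative classifies one kind of token.
TOKEN_RE = re.compile(r'''
    (?P<STRING>"(?:[^"\\]|\\.)*")
  | (?P<NUMBER>-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)
  | (?P<TRUE>true) | (?P<FALSE>false) | (?P<NULL>null)
  | (?P<PUNCT>[{}\[\]:,])
  | (?P<WS>\s+)
''', re.VERBOSE)


def tokenize(text):
    """Return a flat list of (kind, lexeme) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize('{"id": 9007199254740993}')
number_lexeme = next(lexeme for kind, lexeme in tokens if kind == "NUMBER")

# Phase 2 (reduced to one decision): the lexeme is still exact text here.
# Precision is lost only if the runtime chooses a double for it.
as_double = int(float(number_lexeme))   # what a double-only runtime keeps
as_integer = int(number_lexeme)         # what an arbitrary-precision int keeps
print(tokens)
print(as_double, as_integer)
```

The lexeme is identical in both cases; only the conversion step differs, which is why the same document can be read exactly by one client and rounded by another.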

A JSON Schema validator works by walking the schema tree and the document tree in parallel, applying three categories of checks at each node: type checks (is this value actually a string?), constraint checks (does this number meet minimum? does this string match pattern? does this string length respect maxLength?), and required-field checks (are all keys listed in "required" present?). When a check fails, the validator records the JSON Pointer path to the offending node — for example #/items/0/price — so errors are precise and actionable rather than vague "invalid document" messages.

# Validate with Python's jsonschema library
python3 -c "
import jsonschema, json
schema = json.load(open('schema.json'))
data   = json.load(open('response.json'))
jsonschema.validate(data, schema)
print('valid')
"
# A violation prints: jsonschema.exceptions.ValidationError: 'foo' is not of type 'integer'
# with a .json_path pointing to the exact failing field.

# ajv-cli (Node.js) works similarly:
npx ajv-cli validate -s schema.json -d response.json
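The parallel walk over schema and document can be sketched with a small hand-rolled checker. This is a teaching sketch, not a substitute for a real validator: it supports only type, minimum, enum, required, properties, additionalProperties, and items, it ignores "format", and it does not escape special characters in JSON Pointer segments. The function name and error messages are made up for the example.

```python
TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "null": lambda v: v is None,
}


def validate(instance, schema, path=""):
    """Walk schema and document together; return (json_pointer, message) pairs."""
    errors = []
    expected = schema.get("type")
    # 1. Type check. Stop here on failure: later checks assume the right type.
    if expected is not None and not TYPE_CHECKS[expected](instance):
        return [(path, f"expected {expected}")]
    # 2. Constraint checks.
    if "minimum" in schema and isinstance(instance, (int, float)):
        if instance < schema["minimum"]:
            errors.append((path, f"below minimum {schema['minimum']}"))
    if "enum" in schema and instance not in schema["enum"]:
        errors.append((path, "not one of the allowed values"))
    # 3. Required-field checks, then recurse into children.
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append((f"{path}/{key}", "required property missing"))
        properties = schema.get("properties", {})
        for key, value in instance.items():
            if key in properties:
                errors.extend(validate(value, properties[key], f"{path}/{key}"))
            elif schema.get("additionalProperties") is False:
                errors.append((f"{path}/{key}", "additional property not allowed"))
    if isinstance(instance, list) and "items" in schema:
        for index, item in enumerate(instance):
            errors.extend(validate(item, schema["items"], f"{path}/{index}"))
    return errors


USER_SCHEMA = {
    "type": "object",
    "required": ["id", "email"],
    "properties": {
        "id": {"type": "integer", "minimum": 1},
        "email": {"type": "string", "format": "email"},
        "role": {"type": "string", "enum": ["admin", "viewer"]},
    },
    "additionalProperties": False,
}

bad_user = {"id": 0, "email": 42, "role": "root", "debug": True}
problems = validate(bad_user, USER_SCHEMA)
for pointer, message in problems:
    print(pointer, message)
```

Each failure carries the path to the offending node, which is what makes validator output actionable; a real library adds many more keywords and edge-case rules on top of the same walk.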

How to debug & inspect it

# 1. Confirm a response is valid JSON and see its structure
curl -s https://api.example.com/v1/foo | python3 -m json.tool
{ "status": "ok", "count": 3 }
# If invalid JSON, python3 -m json.tool exits non-zero with a clear error message.

# 2. Find numbers that exceed 2^53 - 1: candidates for string encoding
curl -s https://api.example.com/v1/orders | jq '.. | numbers | select(. > 9007199254740991)'
9007199254741001
# Any match here is a precision risk on JavaScript clients.

# 3. Duplicate-key check: the JSON spec discourages duplicate keys but does not forbid them; most parsers silently keep the last value
python3 -c "import json; json.loads(open('resp.json').read())"
# The standard library silently keeps the LAST occurrence of a duplicate key.
# Pass a custom object_pairs_hook to json.loads to detect duplicates explicitly.

# 4. Inspect raw bytes for BOM or encoding issues
curl -s https://api.example.com/v1/foo | file -
/dev/stdin: JSON text data
hexdump -C response.json | head -2
00000000 ef bb bf 7b 22 73 74 61 74 75 73 22 3a 22 6f 6b |...{"status":"ok|
# ef bb bf = UTF-8 BOM: strip it; many JSON parsers reject it.
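The duplicate-key check in step 3 can be made explicit with the object_pairs_hook parameter of Python's json.loads, which receives every key/value pair of an object before a dict is built. The hook name and the sample document below are made up for illustration.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook: build a dict, but raise if any key repeats."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


text = '{"role": "viewer", "role": "admin"}'

# Default behavior: no error, the last occurrence wins.
lenient = json.loads(text)

# Strict behavior: the hook sees both pairs and refuses the document.
try:
    json.loads(text, object_pairs_hook=reject_duplicate_keys)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(lenient, strict_error)
```

The check has to happen at parse time because once the default decoder has produced a dict, the first value is already gone.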

Symptom	Cause	Fix
Large integer ID silently changes value on the client	Exceeds 2⁵³ (IEEE-754 double precision)	Serialize IDs > 2⁵³ as JSON strings; parse with `BigInt`/int64 on the client
Date fields parsed inconsistently across regions	Non-standard date format ("March 15 2024") interpreted by locale	Mandate ISO 8601 UTC (`"2024-03-15T00:00:00Z"`) in your API contract; validate with JSON Schema `"format": "date-time"`
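Client reads wrong value when key appears twice	Duplicate keys: the JSON spec says names should be unique but does not forbid duplicates, so parser behavior varies	Reject duplicates at parse time (for example with an `object_pairs_hook` in Python's `json` module); a schema validator runs on already-parsed data and cannot see them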
Money calculation errors (0.1 + 0.2 ≠ 0.3)	Monetary amounts stored as IEEE-754 float in JSON	Use integer minor units (cents) or string decimal; never float for money
Parse error: "Unexpected token" / "Invalid character"	UTF-8 BOM (0xEF 0xBB 0xBF) at start of JSON file, or non-UTF-8 encoding	Strip BOM; ensure output is UTF-8 without BOM; set `Content-Type: application/json; charset=utf-8`
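For the first row of the table, a server-side audit can flag risky integers before any JavaScript client sees them. The sketch below relies on the fact that Python's json module decodes integers with arbitrary precision, so the values are still exact when inspected. The function name and the sample response are hypothetical.

```python
import json

MAX_SAFE_INTEGER = 2**53 - 1  # largest integer a double-based client handles safely


def find_unsafe_integers(value, path=""):
    """Return (json_pointer, integer) pairs for integers outside the safe range."""
    found = []
    if isinstance(value, bool):
        return found  # bool is an int subclass in Python; never a large ID
    if isinstance(value, int):
        if abs(value) > MAX_SAFE_INTEGER:
            found.append((path, value))
    elif isinstance(value, dict):
        for key, child in value.items():
            found.extend(find_unsafe_integers(child, f"{path}/{key}"))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found.extend(find_unsafe_integers(child, f"{path}/{index}"))
    return found


response = json.loads(
    '{"order_id": 9007199254741001, "total_cents": 4999,'
    ' "items": [{"id": 42}, {"id": 9007199254740993}]}'
)
risky = find_unsafe_integers(response)
print(risky)
```

Every path reported is a candidate for the string-encoding fix in the table.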

Pipe the response through jq '.' or python3 -m json.tool to confirm it is valid JSON and see structure.
Check for numbers larger than 9007199254740991 (2⁵³ − 1, JavaScript's Number.MAX_SAFE_INTEGER) that should be IDs, and flag them for string encoding.
Verify all date fields use ISO 8601 UTC format.
Run a JSON Schema validator against the response if a schema exists — it will pinpoint type mismatches precisely.
For encoding issues, hexdump -C the first 4 bytes: a UTF-8 BOM is the three bytes ef bb bf, followed here by 7b (the opening {).
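The BOM check in the last step can also be done in code. This sketch shows one parser's behavior, Python's standard json module, which refuses a decoded string that still starts with the BOM character; other parsers may behave differently. The payload is a made-up example.

```python
import codecs
import json

# A response body that starts with the UTF-8 byte order mark (ef bb bf).
raw = codecs.BOM_UTF8 + b'{"status":"ok"}'
has_bom = raw.startswith(codecs.BOM_UTF8)

# Decoding as plain utf-8 leaves U+FEFF at the front, which json.loads rejects.
try:
    json.loads(raw.decode("utf-8"))
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# Decoding as utf-8-sig strips the BOM if present and is harmless if absent.
clean = json.loads(raw.decode("utf-8-sig"))
print(has_bom, naive_ok, clean)
```

Stripping on input is a defensive measure; the longer-term fix is still to make the producer emit UTF-8 without a BOM.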

🧠 Quick check

1. JSON has exactly six value types. Which list is correct?

JSON has no dedicated date type and no distinction between integer and float; both are "number." The six types are string, number, boolean, null, array, and object.

2. A client receives the JSON {"order_id": 9007199254741001}. What risk does this create for a JavaScript consumer?

JavaScript's Number is IEEE 754 double precision. Integers above 2^53 lose precision. The standard fix is to send large IDs as strings: "order_id": "9007199254741001".

3. A legal tech team asks you to design a format for contracts where clause text can contain embedded annotations and cross-references mid-sentence. Which format is a better fit and why?

Mixed content — e.g., a sentence like "See <ref id="3">clause 3</ref> for details" — is a first-class XML concept. JSON has no equivalent; you'd need an awkward array of text/markup objects. XML's design specifically anticipates this pattern.

4. What is the recommended way to represent a monetary amount (e.g., $12.99) in a JSON API to avoid floating-point errors?

Floating-point arithmetic cannot represent most decimal fractions exactly. The safe options are integer cents (1299) or a string decimal ("12.99"). Both are unambiguous and avoid representation errors. The two-element array is unusual and increases parsing complexity for consumers.

✍️ Exercise: audit a JSON API response for format problems

Review the following API response fragment and identify every format issue. Propose a corrected version.

{
  "user_id":    9007199254741234,
  "balance":   49.99,
  "joined":    "March 15 2024",
  "last_login": "15-03-2024 09:32",
  "active":    "yes"
}

Model answer — five issues:

user_id is a large integer exceeding 2⁵³; use "user_id": "9007199254741234" (string).
balance is a float that cannot represent 49.99 exactly in IEEE 754; use integer cents 4999 or string "49.99", and document the currency and unit.
joined is a non-standard date string; use ISO 8601 UTC: "2024-03-15T00:00:00Z".
last_login uses a locale-specific day-first format with no time zone, so both the day/month order and the exact instant are ambiguous; use ISO 8601: "2024-03-15T09:32:00Z".
active is the string "yes" instead of the boolean true; use the JSON boolean type.

Rubric: ✓ identified large-integer risk ✓ identified float monetary risk ✓ both date fields flagged ✓ boolean-as-string flagged ✓ proposed concrete fixes. Four of five = solid; all five = exceptional.

Key takeaways

JSON has six types: string, number, boolean, null, array, object — no dates, no binary.
JSON won the web because it mapped cleanly to JavaScript and required zero tooling; simplicity is a feature.
XML earns its place in mixed-content documents and legacy SOAP/enterprise systems where XSD and XSLT provide real value.
Large integers above 2⁵³ must be sent as strings; monetary values should use integer minor units or string decimals.
JSON Schema adds optional validation without abandoning the format's flexibility.