Data & Formats · Lesson 02
Textual formats: JSON & XML
JSON conquered the web API world in under a decade — but XML never went away. Understanding both formats and their genuine trade-offs lets you make an informed choice instead of a reflex one.
By the end you'll be able to
- Name JSON's six value types and explain why its simplicity made it the default web API format.
- Identify the scenarios where XML still has a legitimate edge over JSON.
- Describe the large-integer/float precision problem and the standard workaround.
JSON: the format that was already there
JSON (JavaScript Object Notation) won the web API wars not because it was invented in a standards body or optimized for computers, but because it was already inside every browser. JSON is based on JavaScript's object-literal syntax, so when a page needed to talk to a server in the early 2000s, decoding a response was a one-liner with eval(text); the safer built-in JSON.parse(text) was standardized later, in ECMAScript 5 (2009). That accidental head start became a permanent advantage once REST APIs and AJAX made browser-to-server calls routine.
The format itself is deliberately narrow. There are exactly six value types: string, number, boolean, null, array, and object. No dates, no binary blobs, no comments, no references. What looks like a limitation is actually the feature: any developer in any language can learn the entire format in an afternoon.
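To make the size of that type set concrete, here is a minimal sketch that decodes one document containing all six value types with Python's standard `json` module and prints the Python type each one lands on. The sample document and the names `text` and `decoded` are made up for this illustration.

```python
import json

# One hypothetical document that uses all six JSON value types.
text = '''
{
  "name": "Ada",
  "age": 36,
  "admin": true,
  "nickname": null,
  "tags": ["math", "engines"],
  "address": {"city": "London"}
}
'''

decoded = json.loads(text)

# Show which Python type each JSON value becomes.
for key, value in decoded.items():
    print(f"{key:10} -> {type(value).__name__}")
```

Note that Python splits JSON's single number type into `int` and `float`. That split is a choice made by the decoder, not something the format defines.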
Why JSON won
Four concrete reasons explain JSON's dominance over earlier XML-heavy REST and SOAP approaches:
- Zero parsing friction in the browser. JSON.parse() and JSON.stringify() are built into every JavaScript engine. No library, no schema, no namespace declarations.
- Maps directly to native types. A JSON object becomes a Python dict, a Go struct, a Ruby Hash: the binding is mechanical. XML nodes have no such natural mapping; you need an ORM-like layer.
- Readable without tooling. A developer can paste a JSON blob into a browser console or VS Code and immediately see the structure. This reduces debugging time dramatically.
- Minimal syntax surface. The entire grammar fits on a postcard. Fewer rules mean fewer parser bugs and interop surprises.
JSON Schema: adding structure without abandoning flexibility
JSON's lack of a built-in schema is its main weakness for server validation. JSON Schema fills the gap: it's a JSON document that describes what another JSON document should look like, including required fields, value types, string patterns, and numeric ranges. OpenAPI (formerly Swagger) uses a JSON Schema dialect to describe every request and response body in an API.
// JSON Schema fragment — validates a User object
{
"type": "object",
"required": ["id", "email"],
"properties": {
"id": { "type": "integer", "minimum": 1 },
"email": { "type": "string", "format": "email" },
"role": { "type": "string", "enum": ["admin", "viewer"] }
},
"additionalProperties": false
}
XML: the format that refused to die
XML (eXtensible Markup Language) predates JSON by a few years (XML 1.0 became a W3C Recommendation in 1998) and was the default for web services throughout the 2000s via SOAP and WS-* protocols. It never really lost ground in the domains where it has genuine advantages.
XML excels at mixed content — text that contains markup interspersed with data, like a legal document with embedded annotations. JSON cannot represent "a paragraph of text with a bold section in the middle" without awkward workarounds; XML's element nesting handles it naturally. This is why the publishing industry, legal tech, and government document standards (DITA, DocBook, HL7 for healthcare) still use XML.
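As a rough illustration of mixed content, the sketch below parses a one-sentence paragraph with an inline reference using Python's standard `xml.etree.ElementTree`. The `<p>` and `<ref>` element names are invented for this example, and the `json_equivalent` list is only one possible workaround, not a standard encoding.

```python
import xml.etree.ElementTree as ET

# A sentence with an inline reference in the middle (mixed content).
xml_text = '<p>See <ref id="3">clause 3</ref> for details.</p>'
p = ET.fromstring(xml_text)
ref = p.find("ref")

print(repr(p.text))          # text before the child element
print(ref.attrib, ref.text)  # the inline markup and its own text
print(repr(ref.tail))        # text after the child element

# Reassemble the plain sentence by walking every text node in document order.
sentence = "".join(p.itertext())
print(sentence)

# One possible JSON workaround: an ordered array of text runs and markup objects.
json_equivalent = ["See ", {"ref": {"id": "3", "text": "clause 3"}}, " for details."]
```

The XML version keeps the sentence readable as a sentence, while the JSON workaround forces every consumer to agree on how text runs and markup objects are interleaved.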
XML also has a mature ecosystem: XSD (XML Schema Definition) for strict structural validation, XSLT for declarative transformation, XPath/XQuery for querying. SOAP APIs built on these standards are still alive in banking, insurance, and enterprise middleware — not because XML is better, but because the toolchain is deeply embedded and migration risk is high.
Same document, two formats
Here is a simple invoice represented in both formats. Notice how XML's attributes and element nesting describe a line item differently than JSON's key-value pairs, and how XML is nearly twice as verbose (330 bytes against 185) for this purely data-oriented example.
// JSON — 185 bytes
{
"invoice": {
"id": "INV-2024-0042",
"total": 149.99,
"currency": "USD",
"lines": [
{ "sku": "WDG-7", "qty": 2, "unit_price": 49.99 },
{ "sku": "SVC-1", "qty": 1, "unit_price": 50.01 }
]
}
}
<!-- XML — 330 bytes -->
<invoice xmlns="urn:example:billing">
<id>INV-2024-0042</id>
<total currency="USD">149.99</total>
<lines>
<line sku="WDG-7" qty="2" unit_price="49.99"/>
<line sku="SVC-1" qty="1" unit_price="50.01"/>
</lines>
</invoice>
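To see how the two representations behave once they reach application code, here is a small sketch that reads both with Python's standard library. It assumes the two documents above are available as strings; the variable names are illustrative.

```python
import json
import xml.etree.ElementTree as ET

json_text = '''{"invoice": {"id": "INV-2024-0042", "total": 149.99, "currency": "USD",
  "lines": [{"sku": "WDG-7", "qty": 2, "unit_price": 49.99},
            {"sku": "SVC-1", "qty": 1, "unit_price": 50.01}]}}'''

xml_text = '''<invoice xmlns="urn:example:billing">
  <id>INV-2024-0042</id>
  <total currency="USD">149.99</total>
  <lines>
    <line sku="WDG-7" qty="2" unit_price="49.99"/>
    <line sku="SVC-1" qty="1" unit_price="50.01"/>
  </lines>
</invoice>'''

# JSON: values arrive already typed (str, int, float).
invoice = json.loads(json_text)["invoice"]
json_qtys = [line["qty"] for line in invoice["lines"]]

# XML: every element lives in the declared namespace, and every
# attribute value is a string until the caller converts it.
ns = {"b": "urn:example:billing"}
root = ET.fromstring(xml_text)
xml_lines = root.findall("b:lines/b:line", ns)
xml_qtys = [int(line.get("qty")) for line in xml_lines]
currency = root.find("b:total", ns).get("currency")

print(json_qtys, xml_qtys, currency)
```

The JSON side needs no type conversion, while the XML side needs a namespace map and an explicit `int()` call. With XSD in play, that typing information would come from the schema instead of from hand-written conversions.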
Trade-offs at a glance
| Concern | JSON | XML |
|---|---|---|
| Verbosity | Low | High (closing tags, namespace declarations) |
| Browser-native | Yes, decodes straight to native objects | Parser is built in (DOMParser), but it yields a DOM tree you must traverse |
| Mixed content | Awkward | First-class |
| Schema / validation | JSON Schema (external) | XSD (built into ecosystem) |
| Transformation | None built in | XSLT |
| Comments | Not allowed | Allowed |
| Binary data | Base64 string hack | Base64 or MTOM |
| Still used for | Web APIs, config, storage | SOAP, documents, HL7, DITA |
JSON itself sets no limit on the size or precision of a number, but JavaScript decodes every JSON number into its Number type, an IEEE 754 double-precision (64-bit) float. That sounds generous until your database uses 64-bit integer IDs larger than 2^53. Above that threshold, Number cannot represent every integer exactly, so 9007199254740993 silently becomes 9007199254740992 in a browser. The fix: send large integers as strings ("id": "9007199254740993") and document that convention explicitly. Similarly, avoid representing monetary amounts as floats; use integer cents or a string decimal instead.
Dates are a related trap: JSON has no date type. "2024-03-15" is just a string; nothing enforces that it is a valid date or a consistent format. Agree on ISO 8601 in UTC ("2024-03-15T09:00:00Z") and validate with JSON Schema's "format": "date-time", keeping in mind that many validators treat format as an annotation unless format checking is explicitly enabled.
"JSON or XML — and when would you choose XML?" A strong answer: JSON for any new web or mobile API because of browser-native parsing and lower verbosity. XML when the domain requires it — SOAP integrations with legacy enterprise systems, document formats with mixed content (legal, publishing), or when the consuming team has deep XSD/XSLT tooling that would be expensive to replace. Mentioning SOAP and mixed content specifically signals real experience rather than cargo-culting JSON.
Add to your API style guide on day one: all timestamps in ISO 8601 UTC; all IDs that may exceed 2^53 serialized as strings; all monetary values as integer-minor-units (cents) or decimal strings. These decisions are trivial to make upfront and painful to change after clients depend on them.
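Those three conventions can be sketched as small encoder helpers. This is a minimal illustration, not a prescribed implementation: the helper names (`encode_id`, `encode_timestamp`, `encode_money`) and the payload fields are hypothetical, and `encode_money` assumes a currency with exactly two decimal places.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

def encode_id(value: int) -> str:
    """Serialize an ID as a string so no client can round it."""
    return str(value)

def encode_timestamp(value: datetime) -> str:
    """Serialize a timezone-aware datetime as ISO 8601 in UTC with a trailing Z."""
    if value.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before encoding")
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def encode_money(amount: Decimal) -> int:
    """Serialize a decimal amount as integer minor units (assumes 2 decimal places)."""
    cents = amount * 100
    if cents != cents.to_integral_value():
        raise ValueError("amount has more than two decimal places")
    return int(cents)

payload = {
    "id": encode_id(9007199254740993),
    "created_at": encode_timestamp(datetime(2024, 3, 15, 9, 0, tzinfo=timezone.utc)),
    "price_cents": encode_money(Decimal("10.99")),
}
body = json.dumps(payload)
print(body)
```

Keeping the rules in one place like this means a later change, such as supporting three-decimal currencies, touches a single function instead of every endpoint.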
Under the hood: how it actually works
When a runtime calls JSON.parse(), it runs two distinct phases. First, lexing/tokenizing: the parser scans bytes left-to-right, emitting a flat stream of tokens — STRING, NUMBER, TRUE, FALSE, NULL, {, }, [, ], :, and ,. There is no meaning yet, just classification. Second, value construction: the token stream is consumed recursively and assembled into the language's native type tree — a Python dict, a JavaScript object, a Go struct. The type a NUMBER token maps to is decided here, in the language runtime, not by the JSON spec itself. That is exactly where the IEEE-754 precision trap is introduced.
// IEEE-754 double precision: 53 bits of mantissa
// 2^53 = 9007199254740992 ← exactly representable
// 2^53 + 1 = 9007199254740993 ← requires 54 bits; rounds DOWN to 2^53
// JavaScript — silent precision loss
JSON.parse('{"id": 9007199254740993}').id
// → 9007199254740992 ✗ wrong! the last bit was silently dropped
// Safe fix: send the ID as a string
JSON.parse('{"id": "9007199254740993"}').id
// → "9007199254740993" ✓ correct string; parse with BigInt() or int64
BigInt(JSON.parse('{"id": "9007199254740993"}').id)
// → 9007199254740993n ✓
// Monetary amounts — IEEE-754 cannot represent 0.1 exactly
0.1 + 0.2
// → 0.30000000000000004 ✗ not 0.3
// So {"price": 0.1} decoded as float and summed gives wrong totals.
// Fix: use integer minor units → {"price_cents": 1099} for $10.99
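The two parsing phases can be sketched in Python to show where the number decision is actually made. This is a toy lexer written for this lesson, not how any production parser is implemented; `TOKEN_RE` and `tokenize` are illustrative names, and the lexer does no validation of string escapes.

```python
import re
from decimal import Decimal

# Phase 1: a toy lexer. It classifies text into tokens and nothing more.
TOKEN_RE = re.compile(r'''
    (?P<STRING>"(?:[^"\\]|\\.)*")
  | (?P<NUMBER>-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)
  | (?P<TRUE>true) | (?P<FALSE>false) | (?P<NULL>null)
  | (?P<PUNCT>[{}\[\]:,])
  | (?P<WS>\s+)
''', re.VERBOSE)

def tokenize(text):
    """Return a flat list of (kind, raw_text) tokens, skipping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

tokens = tokenize('{"id": 9007199254740993}')
print(tokens)

# Phase 2: value construction. The NUMBER token is still just text here,
# and the runtime chooses what to turn it into.
number_text = next(raw for kind, raw in tokens if kind == "NUMBER")
as_int = int(number_text)          # exact: arbitrary-precision integer
as_float = float(number_text)      # what a double-only runtime stores
as_decimal = Decimal(number_text)  # exact: decimal representation

print(as_int, int(as_float), as_decimal)
```

The token carries all the digits; precision is lost only if the construction phase picks a double. Python's own `json` module exposes that choice through its `parse_int` and `parse_float` arguments.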
A JSON Schema validator works by walking the schema tree and the document tree in parallel, applying three categories of checks at each node: type checks (is this value actually a string?), constraint checks (does this number meet minimum? does this string match pattern? does this string length respect maxLength?), and required-field checks (are all keys listed in "required" present?). When a check fails, the validator records the JSON Pointer path to the offending node — for example #/items/0/price — so errors are precise and actionable rather than vague "invalid document" messages.
# Validate with Python's jsonschema library
python3 -c "
import jsonschema, json
schema = json.load(open('schema.json'))
data = json.load(open('response.json'))
jsonschema.validate(data, schema)
print('valid')
"
# A violation prints: jsonschema.exceptions.ValidationError: 'foo' is not of type 'integer'
# with a .json_path pointing to the exact failing field.
# ajv-cli (Node.js) works similarly:
npx ajv-cli validate -s schema.json -d response.json
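The parallel walk described above can be sketched with the standard library alone. This is a deliberately tiny validator written for illustration: it supports only `type`, `minimum`, `maxLength`, `required`, `properties`, and `items`, it does not escape `~` or `/` in pointer segments, and it is not how `jsonschema` or ajv are implemented. The `validate` function, `TYPE_MAP`, and the sample schema are all hypothetical.

```python
TYPE_MAP = {
    "object": dict, "array": list, "string": str, "boolean": bool,
    "integer": int, "number": (int, float), "null": type(None),
}

def validate(value, schema, path="#"):
    """Return a list of (json_pointer, message) for every failed check."""
    errors = []

    # 1. Type check.
    expected = schema.get("type")
    if expected is not None:
        matches = isinstance(value, TYPE_MAP[expected])
        # bool is a subclass of int in Python, so it must not pass as a number.
        if expected in ("integer", "number") and isinstance(value, bool):
            matches = False
        if not matches:
            return [(path, f"expected {expected}")]

    # 2. Constraint checks.
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if "minimum" in schema and is_number and value < schema["minimum"]:
        errors.append((path, f"{value} is below minimum {schema['minimum']}"))
    if "maxLength" in schema and isinstance(value, str) and len(value) > schema["maxLength"]:
        errors.append((path, f"longer than maxLength {schema['maxLength']}"))

    # 3. Required-field checks, then recurse into children.
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append((path, f"missing required property {key!r}"))
        for key, subschema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], subschema, f"{path}/{key}"))
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}/{index}"))
    return errors

schema = {
    "type": "object",
    "required": ["id", "items"],
    "properties": {
        "id": {"type": "integer", "minimum": 1},
        "items": {"type": "array", "items": {
            "type": "object",
            "required": ["price"],
            "properties": {"price": {"type": "integer", "minimum": 0}},
        }},
    },
}

errors = validate({"id": 0, "items": [{"price": "foo"}]}, schema)
for pointer, message in errors:
    print(pointer, message)
```

Because the path is threaded through every recursive call, each error names the exact node that failed, which is what makes validator output actionable.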
How to debug & inspect it
| Symptom | Cause | Fix |
|---|---|---|
| Large integer ID silently changes value on the client | Exceeds 2^53 (IEEE-754 double precision) | Serialize IDs > 2^53 as JSON strings; parse with BigInt/int64 on the client |
| Date fields parsed inconsistently across regions | Non-standard date format ("March 15 2024") interpreted by locale | Mandate ISO 8601 UTC ("2024-03-15T00:00:00Z") in your API contract; validate with JSON Schema "format": "date-time" |
| Client reads wrong value when key appears twice | Duplicate keys: the JSON spec says names should be unique but does not forbid duplicates, and parsers differ on which value wins | Reject duplicate keys at parse time with a strict parser; a schema validator runs on the already-parsed document and never sees the duplicate |
| Money calculation errors (0.1 + 0.2 ≠ 0.3) | Monetary amounts stored as IEEE-754 float in JSON | Use integer minor units (cents) or string decimal; never float for money |
| Parse error: "Unexpected token" / "Invalid character" | UTF-8 BOM (0xEF 0xBB 0xBF) at start of JSON file, or non-UTF-8 encoding | Strip BOM; ensure output is UTF-8 without BOM; set Content-Type: application/json; charset=utf-8 |
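For the duplicate-key row, detection has to happen while parsing, because the duplicate is gone once the object has been built. A minimal sketch using the `object_pairs_hook` argument of Python's standard `json.loads` follows; the `reject_duplicates` helper name and the sample payload are made up for this example.

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that raises if a key repeats within one object."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen

text = '{"role": "viewer", "role": "admin"}'

# Default behaviour in Python: the last occurrence silently wins.
lenient = json.loads(text)

# Strict behaviour: fail at parse time, before any schema validation runs.
try:
    json.loads(text, object_pairs_hook=reject_duplicates)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(lenient, strict_error)
```

The hook runs once per object, so a key that appears in two different nested objects is still accepted; only repeats inside the same object are rejected.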
- Pipe the response through jq '.' or python3 -m json.tool to confirm it is valid JSON and see its structure.
- Check for numbers larger than 9007199254740992 (2^53) that should be IDs, and flag them for string encoding.
- Verify all date fields use ISO 8601 UTC format.
- Run a JSON Schema validator against the response if a schema exists; it will pinpoint type mismatches precisely.
- For encoding issues, hexdump -C the first 4 bytes: a file with a UTF-8 BOM starts with ef bb bf, followed by the 7b of the opening {.
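Two of the checks in this list are easy to automate. The sketch below strips a UTF-8 BOM and then walks the decoded document looking for integers above 2^53. It relies on Python decoding JSON integers exactly, so the original value is still visible. The function names and the sample bytes are hypothetical.

```python
import codecs
import json

MAX_SAFE = 2**53  # above this, not every integer is exactly representable as a double

def strip_bom(raw: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (ef bb bf) if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

def find_unsafe_integers(value, path="#"):
    """Yield (json_pointer, value) for every integer whose magnitude exceeds 2**53."""
    if isinstance(value, bool):
        return  # bool is an int subclass in Python; it is never a large ID
    if isinstance(value, int) and abs(value) > MAX_SAFE:
        yield path, value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from find_unsafe_integers(item, f"{path}/{key}")
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from find_unsafe_integers(item, f"{path}/{index}")

# Sample response bytes with a BOM in front and one oversized ID inside.
raw = codecs.BOM_UTF8 + b'{"order_id": 9007199254741001, "lines": [{"qty": 2}]}'
document = json.loads(strip_bom(raw).decode("utf-8"))
unsafe = list(find_unsafe_integers(document))
print(unsafe)
```

A check like this fits naturally into a contract test that runs against recorded API responses.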
🧠 Quick check
1. JSON has exactly six value types. Which list is correct?
JSON has no dedicated date type and no distinction between integer and float; both are "number." The six types are string, number, boolean, null, array, and object.
2. A client receives the JSON {"order_id": 9007199254741001}. What risk does this create for a JavaScript consumer?
JavaScript's Number is IEEE 754 double precision. Above 2^53 not every integer is representable, and an odd value such as this one cannot be stored exactly, so the ID silently changes. The standard fix is to send large IDs as strings: "order_id": "9007199254741001".
3. A legal tech team asks you to design a format for contracts where clause text can contain embedded annotations and cross-references mid-sentence. Which format is a better fit and why?
Mixed content — e.g., a sentence like "See <ref id="3">clause 3</ref> for details" — is a first-class XML concept. JSON has no equivalent; you'd need an awkward array of text/markup objects. XML's design specifically anticipates this pattern.
4. What is the recommended way to represent a monetary amount (e.g., $12.99) in a JSON API to avoid floating-point errors?
Floating-point arithmetic cannot represent most decimal fractions exactly. The safe options are integer cents (1299) or a string decimal ("12.99"). Both are unambiguous and avoid representation errors. The two-element array is unusual and increases parsing complexity for consumers.
✍️ Exercise: audit a JSON API response for format problems
Review the following API response fragment and identify every format issue. Propose a corrected version.
{
"user_id": 9007199254741234,
"balance": 49.99,
"joined": "March 15 2024",
"last_login": "15-03-2024 09:32",
"active": "yes"
}
Model answer — five issues:
- user_id is a large integer exceeding 2^53; use "user_id": "9007199254741234" (string).
- balance is a float that cannot represent 49.99 exactly in IEEE 754; use integer cents 4999 or string "49.99", and document the currency and unit.
- joined is a non-standard date string; use ISO 8601 UTC: "2024-03-15T00:00:00Z".
- last_login uses a locale-specific format with an ambiguous day/month order and no time zone; use ISO 8601: "2024-03-15T09:32:00Z" (assuming the original time was UTC).
- active is the string "yes" instead of the boolean true; use the JSON boolean type.
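One way to produce the corrected version is sketched below. It makes assumptions the exercise does not state: both timestamps are taken to be UTC already, the month name is parsed under an English or C locale, and the renamed `balance_cents` field is a hypothetical choice for documenting the unit.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

flawed = '''{
  "user_id": 9007199254741234,
  "balance": 49.99,
  "joined": "March 15 2024",
  "last_login": "15-03-2024 09:32",
  "active": "yes"
}'''

def to_iso_utc(text, fmt):
    """Parse text with fmt and render it as ISO 8601, assuming it was already UTC."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")

# parse_float=Decimal keeps 49.99 exact instead of rounding it to a double.
raw = json.loads(flawed, parse_float=Decimal)

corrected = {
    "user_id": str(raw["user_id"]),              # large ID as a string
    "balance_cents": int(raw["balance"] * 100),  # integer minor units
    "joined": to_iso_utc(raw["joined"], "%B %d %Y"),
    "last_login": to_iso_utc(raw["last_login"], "%d-%m-%Y %H:%M"),
    "active": raw["active"] == "yes",            # real JSON boolean
}
print(json.dumps(corrected, indent=2))
```

In a real migration the server would emit the corrected shape directly; a converter like this is mainly useful for backfilling stored responses or writing a compatibility shim.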
Rubric: ✓ identified large-integer risk ✓ identified float monetary risk ✓ both date fields flagged ✓ boolean-as-string flagged ✓ proposed concrete fixes. Four of five = solid; all five = exceptional.
Key takeaways
- JSON has six types: string, number, boolean, null, array, object — no dates, no binary.
- JSON won the web because it mapped cleanly to JavaScript and required zero tooling; simplicity is a feature.
- XML earns its place in mixed-content documents and legacy SOAP/enterprise systems where XSD and XSLT provide real value.
- Large integers above 2^53 must be sent as strings; monetary values should use integer minor units or string decimals.
- JSON Schema adds optional validation without abandoning the format's flexibility.