Data & Formats · Lesson 01

Data representation & efficient communication

Your program holds a User object in RAM. The other program lives on a different continent. Before a single byte can travel the wire, that object must be flattened into a sequence of bytes — and faithfully rebuilt on the other end.

⏱ 11 min Difficulty: core Prereq: Lesson 01 — What an API really is

By the end you'll be able to

Explain why serialization is necessary and describe the full round trip.
Contrast schema-driven and schema-less formats and name one trade-off each.
Describe how Accept, Content-Type, and Content-Encoding headers coordinate format and compression between client and server.

The root problem: memory addresses don't travel

Inside your process, a User object might live at memory address 0x7ff3a.... A pointer to it is just a number — meaningful only within that one running program, on that one machine, right now. Send the raw memory to another host and you'll get garbage: different OS, different language runtime, different byte ordering, and the address points nowhere useful.

Think of it like sharing a recipe. You can't hand someone a photocopy of your brain's neural patterns for "how to make lasagna." You write the recipe down in words (serialize), mail it (transmit), and the reader builds their own mental model of the dish from the description (deserialize). The written form is the agreed, portable representation.

Serialization is the act of converting an in-memory data structure into a portable sequence of bytes. Deserialization is the reverse. Together they form the round trip that makes distributed systems possible.
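
The round trip can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `User` dataclass and the `serialize` / `deserialize` helpers are hypothetical names, the fields are borrowed from the user object used later in this lesson, and JSON over UTF-8 is assumed as the wire format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class User:
    # Hypothetical shape, chosen only for illustration.
    id: int
    name: str
    role: str
    active: bool


def serialize(user: User) -> bytes:
    """In-memory object -> portable bytes (JSON text encoded as UTF-8)."""
    return json.dumps(asdict(user), separators=(",", ":")).encode("utf-8")


def deserialize(payload: bytes) -> User:
    """Portable bytes -> a fresh in-memory object on the receiving side."""
    return User(**json.loads(payload.decode("utf-8")))


original = User(id=9001, name="Priya Sharma", role="admin", active=True)
wire_bytes = serialize(original)     # the only thing that crosses the network
rebuilt = deserialize(wire_bytes)    # a new object, equal in value
print(len(wire_bytes), rebuilt == original)
```

Note that `rebuilt` is equal to `original` but is not the same object: no memory address survived the trip, only the described values.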

The round trip: every API call walks this path. The bytes in the middle are the only thing that actually crosses the network.

Schema-driven vs schema-less formats

Once you accept that data must be encoded, the next question is: who defines what the bytes mean? There are two broad camps.

A schema-less format (like JSON or XML) bundles the field names alongside the values. Every message is self-describing: you can read it without any external definition. The cost is twofold. Field names are repeated in every single message, adding bytes, and nothing in the format stops a sender from renaming "email" to "emailAddress" without telling consumers.

A schema-driven format (like Protocol Buffers or Avro) moves field names out of the message and into a shared schema file. Protobuf, for example, identifies each field by a small integer tag declared in the schema. The message itself carries almost no metadata, so without the schema the bytes are opaque. The payoff is a substantial size reduction and an enforced shape: when both sides generate code from the same schema, many mismatches surface at build time rather than at 3 AM in production.

Property	Schema-less (JSON)	Schema-driven (Protobuf)
Human-readable	Yes	No (binary)
Self-describing	Yes	No — needs the .proto file
Typical size	Larger	Smaller (3–10×)
Schema enforcement	Optional (JSON Schema)	Built in
Tooling needed	None	Code generator

Text vs binary: the quick take

Text formats encode everything as readable characters (UTF-8 bytes). They're debuggable with curl and a browser; they're also bigger, because the number 1234567 occupies 7 bytes as text vs 4 bytes as a 32-bit integer. Binary formats pack values into their native machine representations — smaller and faster to parse, but opaque to the naked eye.
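
The 7-bytes-versus-4-bytes claim is easy to check. The sketch below uses Python's standard `struct` module and assumes a little-endian signed 32-bit layout for the binary form; a real binary format would fix its own byte order and width.

```python
import struct

value = 1234567

as_text = str(value).encode("utf-8")   # one byte per decimal digit
as_int32 = struct.pack("<i", value)    # fixed 4 bytes, little-endian signed

print(len(as_text), as_text)
print(len(as_int32), as_int32.hex())

# Decoding reverses each representation.
from_text = int(as_text.decode("utf-8"))
from_int32 = struct.unpack("<i", as_int32)[0]
```

The text form grows with the number of digits, while the fixed-width form stays at 4 bytes for any value that fits in 32 bits.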

For most public web APIs, JSON wins on developer experience. For high-throughput internal services where every microsecond and every kilobyte matters, binary formats earn their keep. Lesson 03 of this module goes deep on binary options.

Compression: trading CPU for bandwidth

Even with an efficient format, payload bytes can still be large. A hotel bill is verbose text — but it compresses beautifully because repeated patterns (dates, the hotel name, currency symbols) collapse down. HTTP lets you apply the same idea to API payloads.

gzip (RFC 1952) and Brotli are the two dominant HTTP compression algorithms. gzip is universal; Brotli (developed by Google) typically shrinks text 15–25% more than gzip at comparable CPU cost. On a 200 KB JSON payload, compression commonly yields a 5–10× reduction — slashing transfer time at the price of one compression and one decompression step. On fast internal networks where bandwidth is cheap and latency is low, compression may not be worth the CPU; on mobile or intercontinental links, it nearly always is.

Content negotiation: how client and server agree

HTTP has a built-in, three-step exchange for agreeing on format and encoding. Think of it as a waiter taking a dietary order before cooking:

The client sends Accept: application/json (or application/xml, etc.) — "I can eat this."
The client also sends Accept-Encoding: gzip, br — "I can digest these compressions."
The server responds with Content-Type: application/json and Content-Encoding: gzip — "Here's JSON, gzip-compressed."

Content-Type describes the format of the body. Content-Encoding describes the content coding (usually compression) layered on top of that format; it is not the same thing as the hop-by-hop Transfer-Encoding header. The two are independent: you can have JSON compressed with Brotli, or XML with no compression at all.

# Client signals what it can accept
GET /api/v1/report HTTP/1.1
Host:        api.example.com
Accept:      application/json
Accept-Encoding: br, gzip;q=0.9

# Server responds with compressed JSON
HTTP/1.1 200 OK
Content-Type:     application/json; charset=utf-8
Content-Encoding: br
Vary:             Accept-Encoding

# <Brotli-compressed JSON bytes>

The Vary: Accept-Encoding header tells caches that two requests for the same URL with different Accept-Encoding values should be stored separately — otherwise a cache might serve a gzip blob to a client that only understands plain text.
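
Server-side, the negotiation boils down to parsing the client's preference list and picking the best coding the server also supports. The following is a simplified sketch under stated assumptions: `parse_accept_encoding`, `choose_encoding`, and `response_headers` are hypothetical helper names, the server is assumed to support `br` and `gzip`, and edge cases such as an explicit `identity;q=0` are ignored.

```python
from __future__ import annotations


def parse_accept_encoding(header: str) -> dict[str, float]:
    """Parse 'br, gzip;q=0.9' into {'br': 1.0, 'gzip': 0.9}."""
    prefs: dict[str, float] = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # a coding listed without a q-value defaults to 1.0
        for param in params:
            key, _, val = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs[name.lower()] = q
    return prefs


def choose_encoding(header: str, supported: list[str]) -> str:
    """Return the supported coding the client prefers most, else 'identity'."""
    prefs = parse_accept_encoding(header)
    wildcard = prefs.get("*", 0.0)
    best, best_q = "identity", 0.0
    for coding in supported:  # the server's own order breaks ties
        q = prefs.get(coding, wildcard)
        if q > best_q:
            best, best_q = coding, q
    return best


def response_headers(accept_encoding: str) -> dict[str, str]:
    """Build the negotiation-related response headers for a JSON body."""
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "Vary": "Accept-Encoding",  # sent even when the body is uncompressed
    }
    coding = choose_encoding(accept_encoding, ["br", "gzip"])
    if coding != "identity":
        headers["Content-Encoding"] = coding
    return headers


print(response_headers("br, gzip;q=0.9"))
```

Sending `Vary: Accept-Encoding` on the uncompressed variant too is a deliberate choice in this sketch: a cache needs the hint on every variant of the URL, not only the compressed ones.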

The same object: JSON vs compact binary

To make the size difference concrete, here is a minimal user object serialized two ways. The JSON field names travel with every message; the binary version encodes field identity as small integers defined in a schema file.

// JSON: 82 bytes as laid out here (pretty-printed for readability)
{
  "id":    9001,
  "name": "Priya Sharma",
  "role": "admin",
  "active": true
}
// Minified JSON: 62 bytes
// {"id":9001,"name":"Priya Sharma","role":"admin","active":true}

// Equivalent Protobuf binary: 26 bytes (field numbers 1-4)
// 08 a9 46  12 0c 50 72 69 79 61 20 53 68 61 72 6d 61
// 1a 05 61 64 6d 69 6e  20 01
// Saving: ~58% vs minified JSON; the gap widens when field names repeat across many records
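
To see where the binary bytes come from, the sketch below hand-encodes the same object following the Protocol Buffers wire format (varint keys and values, length-delimited strings). It is for illustration only: real code would use classes generated from a `.proto` file, and `encode_varint`, `tag`, and `encode_user` are hypothetical helper names.

```python
import json

VARINT, LENGTH_DELIMITED = 0, 2  # Protobuf wire types used here


def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """Field key = (field_number << 3) | wire_type, itself stored as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_user(id_: int, name: str, role: str, active: bool) -> bytes:
    name_b, role_b = name.encode("utf-8"), role.encode("utf-8")
    return b"".join([
        tag(1, VARINT), encode_varint(id_),
        tag(2, LENGTH_DELIMITED), encode_varint(len(name_b)), name_b,
        tag(3, LENGTH_DELIMITED), encode_varint(len(role_b)), role_b,
        tag(4, VARINT), encode_varint(int(active)),
    ])


binary = encode_user(9001, "Priya Sharma", "admin", True)
minified = json.dumps(
    {"id": 9001, "name": "Priya Sharma", "role": "admin", "active": True},
    separators=(",", ":"),
).encode("utf-8")

print(len(binary), len(minified))
print(binary.hex(" "))
```

Each field costs one key byte plus its value, whereas JSON spends bytes on the quoted field name, the colon, and the comma for every field in every message.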

⚠️ Common trap

Shipping large uncompressed payloads — or worse, over-fetching: returning 40 fields when the client needs 3. Both problems compound at scale. A 200 KB uncompressed JSON payload becomes 2 GB per 10 000 requests. Always ask: is the client actually using every field in this response? If not, trim the shape before reaching for compression.
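
A quick way to see the effect of trimming is to project a record down to the fields the client asked for and compare wire sizes. This is a sketch with made-up data: the 40-field record, the `project` helper, and the `wire_size` helper are all hypothetical.

```python
import json

# Hypothetical 40-field record; the client only uses three of the fields.
full_record = {f"field_{i:02d}": f"value-{i:02d}" for i in range(40)}
wanted = ("field_00", "field_01", "field_02")


def project(record: dict, fields: tuple) -> dict:
    """Keep only the requested fields, silently skipping unknown names."""
    return {key: record[key] for key in fields if key in record}


def wire_size(obj: dict) -> int:
    """Size in bytes of the minified JSON encoding."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


full_size = wire_size(full_record)
trimmed_size = wire_size(project(full_record, wanted))
print(full_size, trimmed_size)
```

Trimming removes the bytes before the compressor ever sees them, so it also saves serialization and compression CPU, which compression alone cannot do.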

🎯 Interview angle

"How would you cut payload size?" is a common system-design follow-up. A strong answer layers three levers in order: (1) trim the schema — return only what clients need; (2) choose a compact format — binary if the consumers can handle it; (3) enable compression — Content-Encoding: gzip or br at the gateway. Mentioning all three, and their respective trade-offs, signals depth.

✅ Always set both headers

When your server supports multiple formats, always inspect the client's Accept header and set Content-Type precisely in every response — including error responses. A client that asks for JSON and receives an HTML error page from your load balancer will fail in mysterious ways. Make the contract explicit in both directions.

Under the hood: how it actually works

When your server returns a User object, it walks a four-step path before any bits leave the machine: the in-memory object is serialized to a JSON string, that string is encoded as UTF-8 bytes, those bytes are compressed, and the compressed bytes become the HTTP response body. Understanding each step explains why compression ratios vary and why small payloads sometimes bloat instead of shrink.

// The User object — same one from the binary comparison above
// {"id":9001,"name":"Priya Sharma","role":"admin","active":true}

// Stage 1 — Pretty-printed JSON string (UTF-8 bytes)
//   {
//     "id":    9001,
//     "name": "Priya Sharma",
//     "role": "admin",
//     "active": true
//   }
//   Size: 82 bytes (as laid out in the JSON comparison above)

// Stage 2 — Minified JSON (whitespace removed)
// {"id":9001,"name":"Priya Sharma","role":"admin","active":true}
//   Size: 62 bytes  (all insignificant whitespace removed)

// Stage 3 — gzip-compressed (DEFLATE: LZ77 + Huffman)
//   Size: ~80 bytes  (larger than the minified input: framing overhead dominates!)
//   gzip adds a 10-byte header + 8-byte trailer; payload too small to benefit

// Stage 4 — Brotli-compressed (LZ77 + Huffman + static dictionary)
//   Size: ~60 bytes  (~same as minified — dictionary helps a little)
//   Still not a win on a payload this small; compression pays off above ~150–200 bytes

// On a real 50 KB JSON report body:
//   Raw JSON:   51 200 bytes
//   gzip:       ~9 200 bytes  (~82% reduction)
//   Brotli:     ~7 700 bytes  (~85% reduction)

How gzip works. gzip uses the DEFLATE algorithm, which chains two techniques. First, LZ77 scans a 32 KB sliding window of recently seen bytes and replaces any repeated sequence with a (distance, length) back-reference, so the word "name" appearing ten times becomes one literal plus nine tiny pointers. Second, Huffman coding assigns shorter bit patterns to the most frequent symbols, squeezing further. The 10-byte gzip header and 8-byte trailer (a CRC-32 plus the input length) mean payloads below roughly 150 bytes often grow after compression: the overhead bytes exceed the savings from whatever redundancy exists in such a short string.
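
Both effects described above can be observed with Python's standard `gzip` module. The sketch compares three inputs: the tiny user object, a list of 500 made-up records with repeated field names, and pseudo-random bytes standing in for content that has no redundancy left. The record shape and the `gzip_size` helper are assumptions for illustration.

```python
import gzip
import json
import random


def gzip_size(data: bytes) -> int:
    # mtime=0 keeps the header bytes, and so the output, reproducible
    return len(gzip.compress(data, compresslevel=6, mtime=0))


# 1. A tiny object: almost nothing repeats, so the framing overhead wins.
tiny = b'{"id":9001,"name":"Priya Sharma","role":"admin","active":true}'

# 2. Many similar records: field names repeat, so back-references pay off.
records = [{"id": i, "name": f"user-{i}", "status": "active"} for i in range(500)]
repetitive = json.dumps(records, separators=(",", ":")).encode("utf-8")

# 3. Pseudo-random bytes: a stand-in for already-compressed content.
rng = random.Random(42)
noise = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

for label, data in [("tiny", tiny), ("repetitive", repetitive), ("noise", noise)]:
    print(f"{label:11s} raw={len(data):6d}  gzip={gzip_size(data):6d}")
```

The tiny and noise inputs come out larger than they went in, while the repetitive list shrinks to a small fraction of its raw size.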

How Brotli works. Brotli also combines LZ77-style matching with Huffman coding, but it ships with a built-in static dictionary of roughly 13 000 words and fragments (about 120 KB) drawn from common web text: natural-language words plus HTML and JavaScript tokens. A back-reference into that dictionary costs only a handful of bits, so Brotli can shrink content it has never seen before simply by recognising familiar vocabulary. That head start is a large part of why Brotli beats gzip by 15–25% on typical API and HTML payloads at comparable CPU cost.
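
Brotli is not in Python's standard library, but the dictionary idea can be approximated with zlib's preset-dictionary feature (`zdict`). This is an analogy, not Brotli itself: the `shared_dict` contents below are made up for this example, whereas Brotli's dictionary is fixed by its specification, and the `deflate` / `inflate` helpers are hypothetical names.

```python
import zlib

payload = b'{"id":9001,"name":"Priya Sharma","role":"admin","active":true}'

# Hypothetical shared dictionary: byte strings both sides expect to see often.
shared_dict = b'"active":true,"active":false,"role":"admin","name":"{"id":'


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress, optionally priming the LZ77 window with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def inflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Decompress; the receiver must hold the same dictionary as the sender."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(data) + decomp.flush()


plain = deflate(payload)
primed = deflate(payload, shared_dict)
restored = inflate(primed, shared_dict)

print(len(payload), len(plain), len(primed))
```

With the dictionary in place, even the first occurrence of a field name can be encoded as a back-reference, which is the same advantage Brotli gets on short web payloads.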

# Full Accept-Encoding / Content-Encoding negotiation, traced with curl -v
$ curl -v -H "Accept-Encoding: br, gzip;q=0.9" https://api.example.com/v1/users/9001 2>&1
> GET /v1/users/9001 HTTP/1.1
> Host: api.example.com
> Accept-Encoding: br, gzip;q=0.9       ← client preference list: Brotli first
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=utf-8
< Content-Encoding: br                  ← server chose Brotli (highest q-value match)
< Vary: Accept-Encoding                 ← tells CDNs to cache encoding variants separately
< Content-Length: 60                    ← compressed wire size (illustrative)
<
* Connection #0 to host api.example.com left intact
[60 bytes of Brotli-encoded data; pass --compressed to have curl decode it transparently]

# Without Accept-Encoding: server sends plain JSON
$ curl -v https://api.example.com/v1/users/9001 2>&1 | grep -E "(Content-|< HTTP)"
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=utf-8
< Content-Length: 62                    ← no Content-Encoding header = plain bytes

Scenario	Compression useful?	Why
Large JSON response (>1 KB)	Yes	Repeated field names and string patterns compress well; wire savings far outweigh CPU cost
Small API response (<200 bytes)	No	gzip framing alone is 18 bytes (10-byte header + 8-byte trailer); on a short payload the overhead exceeds any savings
Already-compressed content (images, ZIP, video)	No	Compressed data has almost no exploitable redundancy left; compressing it again typically produces a slightly larger result
CPU-constrained embedded / IoT device	No	Compression and decompression are CPU-bound; cycles are more constrained than bandwidth on low-power hardware
Intercontinental or mobile link	Yes	Bandwidth is limited and every extra round trip is slow; smaller payloads finish sooner and reduce per-byte mobile data cost

How to debug & inspect it

When compression is misbehaving — or you simply want to confirm it's working — curl gives you everything you need. The commands below let you measure actual wire sizes, inspect negotiation headers, and compare compressed vs uncompressed transfer times.

# 1. Uncompressed body size: no Accept-Encoding sent
$ curl -s https://api.example.com/v1/report | wc -c
48312
# 47 KB of raw JSON bytes

# 2. Decompressed body size: --compressed sends Accept-Encoding and auto-decodes
#    wc -c counts the decoded body; should match the uncompressed number above
$ curl -s --compressed https://api.example.com/v1/report | wc -c
48312
# Same decoded size, which confirms the server round-trips correctly

# 3. Inspect negotiation headers only
$ curl -v --compressed https://api.example.com/v1/report 2>&1 \
    | grep -E "(Content-Encoding|Content-Length|Transfer-Encoding)"
< Content-Encoding: br
< Content-Length: 8104
# 8 KB on the wire vs 47 KB decoded: ~83% reduction

# 4. Measure wire transfer size + total time in one shot
$ curl -s -H "Accept-Encoding: br,gzip" https://api.example.com/v1/report \
    -o /dev/null -w "%{size_download} bytes, %{time_total}s\n"
8104 bytes, 0.142s

# Compare against the same request without the Accept-Encoding header:
$ curl -s https://api.example.com/v1/report \
    -o /dev/null -w "%{size_download} bytes, %{time_total}s\n"
48312 bytes, 0.381s
# Brotli: 6× smaller payload, 2.7× faster on this link

Symptom	Likely cause	Fix
Server returns uncompressed body despite client sending `Accept-Encoding`	Compression not enabled in server or gateway config	Enable gzip/Brotli at the gateway — nginx: `gzip on; gzip_types application/json text/plain;`
Response is larger with compression enabled	Payload is too small (<~150 bytes) or content is already binary	Disable compression for small responses; set `gzip_min_length 256` in nginx
Client gets garbled response or JSON parse error	Server sent `Content-Encoding: gzip` but client did not decompress (bug in custom HTTP client)	Ensure the HTTP client auto-decompresses (most do when you pass the right flag), or manually pipe through `gunzip`
CDN serves wrong encoding to some clients	Missing `Vary: Accept-Encoding` header — CDN collapses encoding variants into one cached entry	Add `Vary: Accept-Encoding` to all compressed responses; purge existing cache entries
Brotli not working through a proxy	Proxy strips `Accept-Encoding` or does not support `br` token	Use gzip as the fallback encoding; check proxy `Accept-Encoding` passthrough configuration

Check whether a Content-Encoding header is present in the response — use curl -v or the DevTools Network tab → Response Headers pane.
Verify that the server config has compression enabled and the correct MIME types listed (e.g., application/json must be in gzip_types; it is not always included by default).
Measure actual wire size with curl -o /dev/null -w "%{size_download}" and compare against the uncompressed figure to confirm real savings.
Confirm Vary: Accept-Encoding is set so CDN caches do not serve a Brotli-encoded response to a client that only supports gzip or plain text.
For Brotli: verify that the reverse proxy (nginx or Caddy) has the Brotli module compiled in — it is not bundled by default in many nginx distributions. Apache requires mod_brotli.

By the numbers

Compression turns payload size into a bandwidth and latency trade-off. The governing formula:

bandwidth_saved_MB_s = QPS × payload_bytes × reduction / 1_000_000   # reduction = fraction of bytes removed, e.g. 0.75
cpu_cost_ms_per_req  ≈ 1–5 ms (gzip level 6) or 0.5–2 ms (Brotli quality 4)   # server-side, for bodies of a few hundred KB; scales roughly with payload size
break_even_size      ≈ 150 bytes   # below this, gzip framing overhead exceeds savings

Scenario: a reporting API returning user analytics. Each JSON response is 12 KB uncompressed. After gzip it compresses to ~3 KB — a 75% reduction, typical for verbose JSON with repeated field names.

Payload size → compressed size trace at 5,000 req/s:

Payload (raw)	gzip size	Reduction	Bandwidth saved @ 5 k req/s	Compress?
100 bytes (small JSON)	~110 bytes	bloats	−0.05 MB/s (worse)	No: overhead exceeds savings
800 bytes	~480 bytes	40%	+1.6 MB/s saved	Borderline — marginal gain
1.5 KB	~600 bytes	60%	+4.5 MB/s saved	Yes
12 KB (scenario)	~3 KB	75%	+45 MB/s saved	Yes — clear win
100 KB (large report)	~18 KB	82%	+410 MB/s saved	Yes — significant
JPEG image (100 KB)	~100 KB	~0%	≈ 0	No — already binary-compressed

Worked trace — 12 KB JSON at 5,000 req/s:

payload_raw  = 12,288 bytes   (12 KB)
payload_gzip =  3,072 bytes   (3 KB, a 75% reduction)
QPS          =  5,000 requests/second

# Bandwidth saved:
bandwidth_saved = 5,000 req/s × (12,288 - 3,072) bytes
                = 5,000 × 9,216
                = 46,080,000 bytes/s ≈ 46 MB/s saved on the wire

# CPU cost on the server (gzip level 6, assuming ~200 MB/s throughput on a modern CPU):
compress_time_per_req = 12,288 bytes / (200 × 1024 × 1024 bytes/s) ≈ 0.06 ms per response
total_cpu_overhead    = 5,000 req/s × 0.06 ms = 300 ms of CPU per second
                      = 0.3 CPU cores at 5k req/s   ← modest next to 46 MB/s of saved bandwidth

# Transfer time saving on a 10 Mbps mobile link (1.25 MB/s):
time_raw  = 12,288 / 1,250,000 = 9.8 ms per response
time_gzip =  3,072 / 1,250,000 = 2.5 ms per response
latency_saved_per_req = 9.8 - 2.5 = 7.3 ms per mobile client request
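
The same trace can be recomputed programmatically from its stated inputs, which makes it easy to plug in your own numbers. The 200 MB/s compressor throughput is the assumption used in the trace, not a measured figure, and the variable names are chosen for this sketch.

```python
payload_raw = 12_288                   # bytes (12 KB)
payload_gzip = 3_072                   # bytes (3 KB)
qps = 5_000                            # requests per second
gzip_throughput = 200 * 1024 * 1024    # bytes/s, assumed compressor speed
link_speed = 10_000_000 / 8            # bytes/s on a 10 Mbps link

bytes_saved_per_req = payload_raw - payload_gzip
bandwidth_saved = qps * bytes_saved_per_req        # bytes per second
compress_time = payload_raw / gzip_throughput      # seconds per response
cpu_cores = qps * compress_time                    # CPU-seconds spent per second
latency_saved = bytes_saved_per_req / link_speed   # seconds per response

print(f"bandwidth saved: {bandwidth_saved / 1e6:.1f} MB/s")
print(f"compress time:   {compress_time * 1e3:.3f} ms per response")
print(f"CPU overhead:    {cpu_cores:.2f} cores")
print(f"latency saved:   {latency_saved * 1e3:.1f} ms per response")
```

Expressing CPU cost as cores (CPU-seconds per wall-clock second) puts it in the same per-second terms as the bandwidth figure, so the two sides of the trade can be compared directly.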

Decision math — when to compress:

Compress when:
  payload_raw > break_even_size (~150 bytes for gzip, ~100 bytes for Brotli)
  AND content is compressible (text, JSON, XML, HTML; not images, ZIP, video)
  AND bandwidth is the constraint (mobile link, cross-region, metered egress)

Do NOT compress when:
  payload < ~150 bytes                           → gzip's 10-byte header + 8-byte trailer bloat small responses
  content is already compressed                  → JPEG, PNG, MP4, PDF, gzip-of-gzip yield little or no saving
  CPU is the bottleneck on constrained hardware  → IoT devices, small serverless instances

Threshold formula:
  compression_worthwhile = (payload_raw × reduction) > gzip_overhead
  at reduction = 0.75: payload_raw × 0.75 > 18 bytes → payload_raw > 24 bytes (framing overhead only)
  in practice: tiny payloads rarely reach a 75% reduction, so set gzip_min_length 256 to skip them safely
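
The decision rules translate into a small predicate. This is a sketch under assumptions: `should_compress` is a hypothetical helper, the 256-byte floor mirrors the `gzip_min_length` suggestion, and the list of compressible media types is deliberately short rather than exhaustive.

```python
COMPRESSIBLE_PREFIXES = ("text/", "application/json", "application/xml")
MIN_LENGTH = 256  # bytes; same floor as the gzip_min_length suggestion


def should_compress(content_type: str, body_length: int,
                    cpu_constrained: bool = False) -> bool:
    """Return True when compressing the response body is likely to pay off."""
    # Drop parameters such as '; charset=utf-8' before matching.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if cpu_constrained:
        return False  # cycles are scarcer than bandwidth on this host
    if body_length < MIN_LENGTH:
        return False  # framing overhead would eat the savings
    return media_type.startswith(COMPRESSIBLE_PREFIXES)


print(should_compress("application/json; charset=utf-8", 12_288))
print(should_compress("image/jpeg", 102_400))
```

In production this logic usually lives in the gateway configuration rather than in application code; the function only makes the three conditions explicit.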

Sources: nginx gzip module — gzip_min_length; RFC 1952 — GZIP format specification; RFC 7932 — Brotli Compressed Data Format; web.dev — Enable text compression.

🧠 Quick check

1. Why can't a program send a pointer (memory address) directly to another process over the network?

Memory addresses are virtual — they point into a process's own address space. A different process, language runtime, or machine maps memory entirely differently, so the address is meaningless outside its origin process.

2. A client sends Accept-Encoding: gzip, br. The server responds with Content-Encoding: br. What does that mean?

Content-Encoding tells the client how the body is wrapped. Because the client listed br in Accept-Encoding, the server knows the client can decompress Brotli. The client decompresses first, then parses the underlying format (e.g. JSON).

3. Which statement best describes a schema-driven binary format compared to schema-less JSON?

Schema-driven formats (Protobuf, Avro) keep field names in a schema file instead of in every message; Protobuf, for instance, identifies fields by compact numeric tags. This shrinks messages dramatically but introduces a build-time dependency: both sides must share and agree on the same schema.

4. What does the Vary: Accept-Encoding response header instruct HTTP caches to do?

Vary tells caches which request headers affect the response. Accept-Encoding is listed so that a gzip-compressed response is not served to a client that only listed identity (no compression) in its own request.

✍️ Exercise: plan the serialization strategy for a high-traffic event stream

A mobile analytics platform receives 50 000 events per second. Each event has 8 fields: user_id (int), session_id (string, 16 chars), event_type (string, ~10 chars), timestamp (int64), x (float), y (float), screen (string, ~20 chars), app_version (string, ~8 chars). The consumer is an internal data pipeline you control entirely. Design the serialization and compression strategy. Justify each choice.

Model answer:

Format: Use a schema-driven binary format (Protocol Buffers or Avro). At 50K events/sec, even a 50-byte saving per event eliminates 2.5 MB/s. Field names in JSON (~75 bytes overhead per event) are never worth it for internal, high-throughput pipelines.
Compression: Batch events into chunks of ~1 000, then apply Snappy or LZ4 (not gzip) — both optimize for speed over compression ratio, which matters when the CPU bottleneck is real. Single-event gzip at 50K/s would be wasteful; batch compression amortizes the overhead.
Content negotiation: Not applicable for an internal message bus (Kafka/Kinesis). Set the schema ID in the message envelope instead; consumers look up the schema from a registry.
Schema evolution plan: Reserve field numbers 1–8 for current fields; commit to never renumbering. Add new fields at 9+ with sensible defaults so old consumers ignore them gracefully.
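
The batching argument in the model answer can be checked with a small experiment. Two substitutions are made so the sketch runs on the standard library alone, and both are assumptions of this example: zlib at level 1 stands in for Snappy or LZ4, and JSON stands in for the binary event encoding. The event values are randomly generated.

```python
import json
import random
import zlib

rng = random.Random(7)
EVENT_TYPES = ["tap", "scroll", "view", "purchase"]


def make_event(i: int) -> dict:
    """Generate one synthetic event with the eight fields from the exercise."""
    return {
        "user_id": rng.randrange(1_000_000),
        "session_id": f"{rng.getrandbits(64):016x}",
        "event_type": rng.choice(EVENT_TYPES),
        "timestamp": 1_700_000_000_000 + i,
        "x": round(rng.random() * 1080, 1),
        "y": round(rng.random() * 1920, 1),
        "screen": "checkout/payment-form",
        "app_version": "4.12.0",
    }


events = [
    json.dumps(make_event(i), separators=(",", ":")).encode("utf-8")
    for i in range(1000)
]

raw_total = sum(len(e) for e in events)
per_event_total = sum(len(zlib.compress(e, 1)) for e in events)  # 1000 tiny streams
batch_total = len(zlib.compress(b"\n".join(events), 1))          # one stream per batch

print(raw_total, per_event_total, batch_total)
```

Compressing each event alone gives the compressor almost no history to reference, while a batch lets the repeated structure of earlier events pay for the later ones.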

Rubric: ✓ chose binary format with justification ✓ named a fast compression algorithm appropriate for throughput ✓ noted that content negotiation headers don't apply to internal message buses ✓ addressed schema evolution ✓ cited approximate byte savings. Four of five = strong answer.

Key takeaways

Serialization exists because in-memory objects cannot travel; they must become bytes, cross the wire, and be rebuilt.
The round trip is always: serialize → transmit → deserialize.
Schema-less formats (JSON) are self-describing but verbose; schema-driven formats (Protobuf) are compact but require shared schemas.
Compression (gzip, Brotli) trades CPU cycles for smaller payloads — a good deal on bandwidth-limited links.
Accept / Content-Type negotiate format; Accept-Encoding / Content-Encoding negotiate compression — they are independent axes.