Data & Formats · Lesson 01
Data representation & efficient communication
Your program holds a User object in RAM. The other program lives on a different continent. Before a single byte can travel the wire, that object must be flattened into a sequence of bytes — and faithfully rebuilt on the other end.
By the end you'll be able to
- Explain why serialization is necessary and describe the full round trip.
- Contrast schema-driven and schema-less formats and name one trade-off each.
- Describe how Accept, Content-Type, and Content-Encoding headers coordinate format and compression between client and server.
The root problem: memory addresses don't travel
Inside your process, a User object might live at memory address 0x7ff3a.... A pointer to it is just a number — meaningful only within that one running program, on that one machine, right now. Send the raw memory to another host and you'll get garbage: different OS, different language runtime, different byte ordering, and the address points nowhere useful.
Think of it like sharing a recipe. You can't hand someone a photocopy of your brain's neural patterns for "how to make lasagna." You write the recipe down in words (serialize), mail it (transmit), and the reader builds their own mental model of the dish from the description (deserialize). The written form is the agreed, portable representation.
Serialization is the act of converting an in-memory data structure into a portable sequence of bytes. Deserialization is the reverse. Together they form the round trip that makes distributed systems possible.
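The round trip described above can be sketched in a few lines. This is a minimal illustration, assuming JSON as the portable form; the `User` fields and the `serialize` / `deserialize` helper names are hypothetical and chosen only for this example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class User:
    # Hypothetical fields, chosen for illustration only.
    id: int
    name: str
    role: str
    active: bool


def serialize(user: User) -> bytes:
    """In-memory object -> JSON text -> UTF-8 bytes ready for the wire."""
    return json.dumps(asdict(user)).encode("utf-8")


def deserialize(payload: bytes) -> User:
    """UTF-8 bytes -> JSON text -> a fresh object in the receiver's memory."""
    return User(**json.loads(payload.decode("utf-8")))


original = User(id=9001, name="Priya Sharma", role="admin", active=True)
wire_bytes = serialize(original)     # what actually crosses the network
rebuilt = deserialize(wire_bytes)    # a new object, equal in value

print(rebuilt == original)   # same values
print(rebuilt is original)   # but a different object at a different address
```

The receiver never sees the sender's memory address; it only sees bytes and rebuilds its own object from them.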
Schema-driven vs schema-less formats
Once you accept that data must be encoded, the next question is: who defines what the bytes mean? There are two broad camps.
A schema-less format (like JSON or XML) bundles the field names alongside the values. Every message is self-describing: you can read it without any external definition. The cost: field names are repeated in every single message, adding bytes, and nothing stops a sender from silently renaming "email" to "emailAddress".
A schema-driven format (like Protocol Buffers or Avro) moves field names out of the message and into a shared schema file. Protocol Buffers identifies each field by a small integer tag, while Avro omits per-field tags and relies on the field order declared in the schema. The message itself carries almost no metadata, so without the schema the bytes are opaque. The payoff is a dramatic size reduction and an enforced shape: when both sides generate code from the same schema, many mismatches are caught at build time, not at 3 AM in production.
| Property | Schema-less (JSON) | Schema-driven (Protobuf) |
|---|---|---|
| Human-readable | Yes | No (binary) |
| Self-describing | Yes | No — needs the .proto file |
| Typical size | Larger | Smaller (3–10×) |
| Schema enforcement | Optional (JSON Schema) | Built in |
| Tooling needed | None | Code generator |
Text vs binary: the quick take
Text formats encode everything as readable characters (UTF-8 bytes). They're debuggable with curl and a browser; they're also bigger, because the number 1234567 occupies 7 bytes as text vs 4 bytes as a 32-bit integer. Binary formats pack values into their native machine representations — smaller and faster to parse, but opaque to the naked eye.
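The size claim for 1234567 can be checked directly. The sketch below assumes a little-endian signed 32-bit layout for the binary form; that layout is an illustrative choice, and any real protocol would have to agree on it explicitly.

```python
import struct

value = 1234567

as_text = str(value).encode("utf-8")    # one byte per decimal digit
as_binary = struct.pack("<i", value)    # signed 32-bit integer, little-endian

print(len(as_text), as_text)            # 7 bytes of readable digits
print(len(as_binary), as_binary.hex())  # 4 bytes, opaque without the layout

# Decoding the binary form requires the same agreed layout on the other side.
(decoded,) = struct.unpack("<i", as_binary)
```

The text form can be read in any terminal; the binary form is only meaningful to a reader who knows it is a little-endian 32-bit integer.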
For most public web APIs, JSON wins on developer experience. For high-throughput internal services where every microsecond and every kilobyte matters, binary formats earn their keep. Lesson 03 of this module goes deep on binary options.
Compression: trading CPU for bandwidth
Even with an efficient format, payload bytes can still be large. A hotel bill is verbose text — but it compresses beautifully because repeated patterns (dates, the hotel name, currency symbols) collapse down. HTTP lets you apply the same idea to API payloads.
gzip (RFC 1952) and Brotli are the two dominant HTTP compression algorithms. gzip is universal; Brotli (developed by Google) typically shrinks text 15–25% more than gzip at comparable CPU cost. On a 200 KB JSON payload, compression commonly yields a 5–10× reduction — slashing transfer time at the price of one compression and one decompression step. On fast internal networks where bandwidth is cheap and latency is low, compression may not be worth the CPU; on mobile or intercontinental links, it nearly always is.
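To see the trade in action, the following sketch compresses a synthetic, highly repetitive JSON payload with Python's standard gzip module. The record fields are made up for illustration, and the exact ratio depends entirely on the data, so treat the printed numbers as indicative only. Brotli is left out because it is not part of the standard library.

```python
import gzip
import json

# Synthetic payload: 2,000 records sharing the same field names (made-up data).
records = [
    {
        "id": i,
        "hotel": "Grand Example Hotel",
        "currency": "EUR",
        "date": f"2024-01-{(i % 28) + 1:02d}",
        "amount": 100 + i % 50,
    }
    for i in range(2000)
]
raw = json.dumps(records).encode("utf-8")

compressed = gzip.compress(raw)          # sender spends CPU once
restored = gzip.decompress(compressed)   # receiver spends CPU once more

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
print(f"ratio: {len(raw) / len(compressed):.1f}x")
```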
Content negotiation: how client and server agree
HTTP has a built-in mechanism, content negotiation, for agreeing on format and encoding in a single request/response exchange. Think of it as a waiter taking a dietary order before cooking:
- The client sends Accept: application/json (or application/xml, etc.) to say "I can eat this."
- The client also sends Accept-Encoding: gzip, br to say "I can digest these compressions."
- The server responds with Content-Type: application/json and Content-Encoding: gzip to say "Here's JSON, gzip-compressed."
Content-Type describes the format of the body. Content-Encoding describes the content coding (usually compression) layered on top of that format; it is distinct from the hop-by-hop Transfer-Encoding header. They are independent: you can have JSON compressed with Brotli, or plain-text XML with no compression.
# Client signals what it can accept
GET /api/v1/report HTTP/1.1
Host: api.example.com
Accept: application/json
Accept-Encoding: br, gzip;q=0.9
# Server responds with compressed JSON
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Encoding: br
Vary: Accept-Encoding
# <Brotli-compressed JSON bytes>
The Vary: Accept-Encoding header tells caches that two requests for the same URL with different Accept-Encoding values should be stored separately — otherwise a cache might serve a gzip blob to a client that only understands plain text.
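On the server side, the negotiation above boils down to parsing Accept-Encoding and picking the best coding the server also supports. The sketch below is a simplified illustration: `parse_accept_encoding` and `choose_encoding` are hypothetical helper names, and the logic deliberately ignores the `*` wildcard and an explicit `identity;q=0`, which a production implementation would need to handle.

```python
def parse_accept_encoding(header: str) -> dict[str, float]:
    """Parse 'br, gzip;q=0.9' into {'br': 1.0, 'gzip': 0.9}."""
    prefs: dict[str, float] = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        params = params.strip()
        q = 1.0  # a coding listed without a q-value has weight 1.0
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        prefs[name.strip().lower()] = q
    return prefs


def choose_encoding(accept_encoding: str, supported: list[str]) -> str:
    """Return the supported coding with the highest client weight."""
    prefs = parse_accept_encoding(accept_encoding)
    best, best_q = "identity", 0.0
    for name in supported:  # server preference order breaks ties
        q = prefs.get(name, 0.0)
        if q > best_q:
            best, best_q = name, q
    return best


coding = choose_encoding("br, gzip;q=0.9", supported=["gzip", "br"])
headers = {"Content-Type": "application/json; charset=utf-8",
           "Vary": "Accept-Encoding"}
if coding != "identity":
    headers["Content-Encoding"] = coding
print(coding, headers)
```

Returning "identity" when nothing matches means the body goes out uncompressed and no Content-Encoding header is set.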
The same object: JSON vs compact binary
To make the size difference concrete, here is a minimal user object serialized two ways. The JSON field names travel with every message; the binary version encodes field identity as small integers defined in a schema file.
// JSON: 79 bytes with two-space indentation (pretty-printed for readability)
{
"id": 9001,
"name": "Priya Sharma",
"role": "admin",
"active": true
}
// Minified JSON: 62 bytes
// {"id":9001,"name":"Priya Sharma","role":"admin","active":true}
// Equivalent Protobuf binary: 26 bytes (field numbers 1-4)
// 08 a9 46 12 0c 50 72 69 79 61 20 53 68 61 72 6d 61
// 1a 05 61 64 6d 69 6e 20 01
// Saving: ~58% vs minified JSON; grows with repetition
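The binary bytes can be reproduced by hand. The sketch below is a hand-rolled encoder for the Protocol Buffers wire format, not the official protobuf library; it assumes field numbers 1 to 4 with id as a varint, name and role as length-delimited strings, and active as a varint boolean. Under those assumptions the message comes to 26 bytes, against 62 bytes for the minified JSON.

```python
import json


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """A field key is (field_number << 3) | wire_type, itself a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return tag(field_number, 2) + encode_varint(len(data)) + data  # wire type 2


def encode_user(user: dict) -> bytes:
    return (
        tag(1, 0) + encode_varint(user["id"])               # wire type 0: varint
        + encode_string(2, user["name"])
        + encode_string(3, user["role"])
        + tag(4, 0) + encode_varint(1 if user["active"] else 0)
    )


user = {"id": 9001, "name": "Priya Sharma", "role": "admin", "active": True}
binary = encode_user(user)
minified = json.dumps(user, separators=(",", ":")).encode("utf-8")

print(binary.hex(" "))
print(len(binary), "bytes binary vs", len(minified), "bytes minified JSON")
```

Note that the field names never appear in the binary output; only the numbers 1 to 4 do, which is why the receiver needs the schema to make sense of it.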
A common mistake is shipping large uncompressed payloads or, worse, over-fetching: returning 40 fields when the client needs 3. Both problems compound at scale. A 200 KB uncompressed JSON payload becomes 2 GB per 10 000 requests. Always ask: is the client actually using every field in this response? If not, trim the shape before reaching for compression.
"How would you cut payload size?" is a common system-design follow-up. A strong answer layers three levers in order: (1) trim the schema — return only what clients need; (2) choose a compact format — binary if the consumers can handle it; (3) enable compression — Content-Encoding: gzip or br at the gateway. Mentioning all three, and their respective trade-offs, signals depth.
When your server supports multiple formats, always inspect the client's Accept header and set Content-Type precisely in every response — including error responses. A client that asks for JSON and receives an HTML error page from your load balancer will fail in mysterious ways. Make the contract explicit in both directions.
Under the hood: how it actually works
When your server returns a User object, it walks a four-step path before any bits leave the machine: the in-memory object is serialized to a JSON string, that string is encoded as UTF-8 bytes, those bytes are compressed, and the compressed bytes become the HTTP response body. Understanding each step explains why compression ratios vary and why small payloads sometimes bloat instead of shrink.
// The User object — same one from the binary comparison above
// {"id":9001,"name":"Priya Sharma","role":"admin","active":true}
// Stage 1 — Pretty-printed JSON string (UTF-8 bytes)
// {
// "id": 9001,
// "name": "Priya Sharma",
// "role": "admin",
// "active": true
// }
// Size: 79 bytes (two-space indentation, LF newlines)
// Stage 2 — Minified JSON (whitespace removed)
// {"id":9001,"name":"Priya Sharma","role":"admin","active":true}
// Size: 62 bytes (-22% vs pretty-printed)
// Stage 3 — gzip-compressed (DEFLATE: LZ77 + Huffman)
// Size: ~78 bytes (larger than the minified input; framing overhead dominates!)
// gzip adds a 10-byte header + 8-byte trailer; payload too small to benefit
// Stage 4 — Brotli-compressed (LZ77 + Huffman + static dictionary)
// Size: ~60 bytes (~same as minified — dictionary helps a little)
// Still not a win at this size; compression pays off above ~150–200 bytes
// On a real 50 KB JSON report body:
// Raw JSON: 51 200 bytes
// gzip: ~9 200 bytes (~82% reduction)
// Brotli: ~7 700 bytes (~85% reduction)
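The four-step path (object, JSON string, UTF-8 bytes, compressed body) can be walked explicitly. This is a minimal sketch using only the standard library, with gzip standing in for whichever coding was negotiated; the `response_headers` dictionary is a plain illustration and not tied to any particular web framework.

```python
import gzip
import json

user = {"id": 9001, "name": "Priya Sharma", "role": "admin", "active": True}

json_text = json.dumps(user, separators=(",", ":"))   # 1. object -> JSON string
utf8_bytes = json_text.encode("utf-8")                # 2. string -> UTF-8 bytes
body = gzip.compress(utf8_bytes)                      # 3. bytes -> compressed bytes
response_headers = {                                  # 4. compressed bytes become the body
    "Content-Type": "application/json; charset=utf-8",
    "Content-Encoding": "gzip",
    "Content-Length": str(len(body)),
}

# The client undoes each step in reverse order.
decoded = json.loads(gzip.decompress(body).decode("utf-8"))

print(len(utf8_bytes), "bytes before compression,", len(body), "bytes after")
```

Content-Length reflects the compressed body, because that is what actually travels on the wire.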
How gzip works. gzip uses the DEFLATE algorithm, which chains two techniques. First, LZ77 scans a 32 KB sliding window of recently seen bytes and replaces any repeated sequence with a (distance, length) back-reference, so the word "name" appearing ten times becomes one literal plus nine tiny pointers. Second, Huffman coding assigns shorter bit patterns to the most frequent symbols, squeezing further. The 10-byte gzip header and 8-byte trailer (CRC-32 plus uncompressed length) mean payloads below roughly 150 bytes often grow after compression: the overhead bytes exceed the savings from any redundancy found in such a short string.
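The framing overhead can be isolated by producing the same DEFLATE stream twice, once bare and once wrapped as gzip. The sketch below uses Python's zlib module, where a negative `wbits` value requests a raw stream and 31 requests gzip framing; the `deflate` helper name and the sample payloads are illustrative assumptions.

```python
import zlib


def deflate(data: bytes, wbits: int) -> bytes:
    """Compress at level 9; wbits=-15 gives raw DEFLATE, wbits=31 gives gzip."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


small = b'{"id":9001,"name":"Priya Sharma","role":"admin","active":true}'
raw_deflate = deflate(small, -15)   # bare DEFLATE stream, no framing
gzip_framed = deflate(small, 31)    # same stream plus gzip header and trailer

print("input:", len(small), "raw deflate:", len(raw_deflate), "gzip:", len(gzip_framed))
print("framing overhead:", len(gzip_framed) - len(raw_deflate), "bytes")

# A long, repetitive payload gives LZ77 something to point back at.
large = b'{"name":"Priya Sharma","role":"admin"},' * 200
print("large input:", len(large), "gzip:", len(deflate(large, 31)))
```

On the small payload the fixed framing outweighs whatever DEFLATE saves; on the repetitive one the back-references dominate and the framing becomes negligible.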
How Brotli works. Brotli is also LZ77 + Huffman, but it ships with a built-in static dictionary of roughly 13 000 common web strings — substrings like Content-Type, application/json, "status", "message", and hundreds of HTML/HTTP tokens. A back-reference into that dictionary costs just a few bits, so Brotli can compress web content it has never seen before by recognising the vocabulary. This is why Brotli beats gzip by 15–25% on typical API and HTML payloads at comparable CPU cost: it arrives already knowing most of the words in the document.
| Scenario | Compression useful? | Why |
|---|---|---|
| Large JSON response (>1 KB) | Yes | Repeated field names and string patterns compress well; wire savings far outweigh CPU cost |
| Small API response (<200 bytes) | No | gzip framing alone is 18 bytes (10-byte header plus 8-byte trailer); overhead exceeds any savings on a short payload |
| Already-compressed content (images, ZIP, video) | No | Compressed formats have little exploitable redundancy left; compressing them again saves almost nothing and can make the result slightly larger |
| CPU-constrained embedded / IoT device | No | Compression and decompression are CPU-bound; cycles are more constrained than bandwidth on low-power hardware |
| Intercontinental or mobile link | Yes | Round-trip latency is dominated by propagation delay; smaller payloads finish sooner and reduce per-byte mobile data cost |
How to debug & inspect it
When compression is misbehaving, or you simply want to confirm it's working, curl gives you most of what you need: you can measure actual wire sizes, inspect negotiation headers, and compare compressed vs uncompressed transfer times. The table and checklist below map common symptoms to likely causes and fixes.
| Symptom | Likely cause | Fix |
|---|---|---|
| Server returns uncompressed body despite client sending Accept-Encoding | Compression not enabled in server or gateway config | Enable gzip/Brotli at the gateway; nginx: gzip on; gzip_types application/json text/plain; |
| Response is larger with compression enabled | Payload is too small (<~150 bytes) or content is already binary | Disable compression for small responses; set gzip_min_length 256 in nginx |
| Client gets garbled response or JSON parse error | Server sent Content-Encoding: gzip but client did not decompress (bug in custom HTTP client) | Ensure the HTTP client auto-decompresses (most do when you pass the right flag), or manually pipe through gunzip |
| CDN serves wrong encoding to some clients | Missing Vary: Accept-Encoding header, so the CDN collapses encoding variants into one cached entry | Add Vary: Accept-Encoding to all compressed responses; purge existing cache entries |
| Brotli not working through a proxy | Proxy strips Accept-Encoding or does not support br token | Use gzip as the fallback encoding; check proxy Accept-Encoding passthrough configuration |
- Check whether a Content-Encoding header is present in the response: use curl -v or the DevTools Network tab → Response Headers pane.
- Verify that the server config has compression enabled and the correct MIME types listed (e.g., application/json must be in gzip_types; it is not always included by default).
- Measure actual wire size with curl -o /dev/null -w "%{size_download}" and compare against the uncompressed figure to confirm real savings.
- Confirm Vary: Accept-Encoding is set so CDN caches do not serve a Brotli-encoded response to a client that only supports gzip or plain text.
- For Brotli: verify that the reverse proxy (nginx or Caddy) has the Brotli module compiled in; it is not bundled by default in many nginx distributions. Apache requires mod_brotli.
By the numbers
Compression turns payload size into a bandwidth and latency trade-off. The governing formula:
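One form of the relationship that is consistent with every row of the table below, offered as a sketch under stated assumptions, uses S_raw and S_compressed for the payload sizes in bytes and R for the request rate in requests per second. These symbols are introduced here for illustration.

```latex
\text{reduction} = 1 - \frac{S_{\text{compressed}}}{S_{\text{raw}}}
\qquad
\text{bandwidth saved} = \left(S_{\text{raw}} - S_{\text{compressed}}\right) \times R
```

A negative result means compression made the payload larger and costs bandwidth instead of saving it.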
Scenario: a reporting API returning user analytics. Each JSON response is 12 KB uncompressed. After gzip it compresses to ~3 KB — a 75% reduction, typical for verbose JSON with repeated field names.
Payload size → compressed size trace at 5,000 req/s:
| Payload (raw) | gzip size | Reduction | Bandwidth saved @ 5 k req/s | Compress? |
|---|---|---|---|---|
| 300 bytes (small JSON) | ~310 bytes | bloats | −0.05 MB/s (worse) | No — overhead exceeds savings |
| 800 bytes | ~480 bytes | 40% | +1.6 MB/s saved | Borderline — marginal gain |
| 1.5 KB | ~600 bytes | 60% | +4.5 MB/s saved | Yes |
| 12 KB (scenario) | ~3 KB | 75% | +45 MB/s saved | Yes — clear win |
| 100 KB (large report) | ~18 KB | 82% | +410 MB/s saved | Yes — significant |
| JPEG image (100 KB) | ~100 KB | ~0% | ≈ 0 | No — already binary-compressed |
Worked trace — 12 KB JSON at 5,000 req/s:
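The arithmetic for the 12 KB scenario can be laid out step by step. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes), which is what the figures in the table imply; the `bandwidth_mb_per_s` helper is a hypothetical name.

```python
def bandwidth_mb_per_s(payload_bytes: int, requests_per_s: int) -> float:
    """Bytes per response times responses per second, expressed in MB/s."""
    return payload_bytes * requests_per_s / 1_000_000


RAW_BYTES = 12_000     # 12 KB uncompressed JSON
GZIP_BYTES = 3_000     # ~3 KB after gzip
RATE = 5_000           # requests per second

raw_bw = bandwidth_mb_per_s(RAW_BYTES, RATE)     # traffic without compression
gzip_bw = bandwidth_mb_per_s(GZIP_BYTES, RATE)   # traffic with compression
saved = raw_bw - gzip_bw
reduction = 1 - GZIP_BYTES / RAW_BYTES

print(f"uncompressed: {raw_bw:.0f} MB/s")
print(f"compressed:   {gzip_bw:.0f} MB/s")
print(f"saved:        {saved:.0f} MB/s ({reduction:.0%} reduction)")
```

The same function reproduces the other rows: a 300-byte payload that grows to 310 bytes yields a small negative saving at the same request rate.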
Decision math — when to compress:
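One common way to frame the decision, offered here as an assumption rather than the lesson's own rule, is to compress when the transfer time saved exceeds the CPU time spent. B stands for the effective link bandwidth in bytes per second, and t_compress and t_decompress for the per-response CPU times; all three are symbols introduced for illustration.

```latex
\frac{S_{\text{raw}} - S_{\text{compressed}}}{B} \;>\; t_{\text{compress}} + t_{\text{decompress}}
```

On a slow link B is small, so the left side grows and compression wins easily; on a fast internal network the left side shrinks toward zero and the CPU cost can dominate.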
Sources: nginx gzip module — gzip_min_length; RFC 1952 — GZIP format specification; RFC 7932 — Brotli Compressed Data Format; web.dev — Enable text compression.
🧠 Quick check
1. Why can't a program send a pointer (memory address) directly to another process over the network?
Memory addresses are virtual — they point into a process's own address space. A different process, language runtime, or machine maps memory entirely differently, so the address is meaningless outside its origin process.
2. A client sends Accept-Encoding: gzip, br. The server responds with Content-Encoding: br. What does that mean?
Content-Encoding tells the client how the body is wrapped. Because the client listed br in Accept-Encoding, the server knows the client can decompress Brotli. The client decompresses first, then parses the underlying format (e.g. JSON).
3. Which statement best describes a schema-driven binary format compared to schema-less JSON?
Schema-driven formats (Protobuf, Avro) keep field identity in a shared schema file instead of repeating names in every message; Protobuf, for example, encodes each field as a compact number. This shrinks messages dramatically but introduces a build-time dependency: both sides must share and agree on the same schema.
4. What does the Vary: Accept-Encoding response header instruct HTTP caches to do?
Vary tells caches which request headers affect the response. Accept-Encoding is listed so that a gzip-compressed response is not served to a client that only listed identity (no compression) in its own request.
✍️ Exercise: plan the serialization strategy for a high-traffic event stream
A mobile analytics platform receives 50 000 events per second. Each event has 8 fields: user_id (int), session_id (string, 16 chars), event_type (string, ~10 chars), timestamp (int64), x (float), y (float), screen (string, ~20 chars), app_version (string, ~8 chars). The consumer is an internal data pipeline you control entirely. Design the serialization and compression strategy. Justify each choice.
Model answer:
- Format: Use a schema-driven binary format (Protocol Buffers or Avro). At 50K events/sec, even a 50-byte saving per event eliminates 2.5 MB/s. Field names in JSON (~75 bytes overhead per event) are never worth it for internal, high-throughput pipelines.
- Compression: Batch events into chunks of ~1 000, then apply Snappy or LZ4 (not gzip) — both optimize for speed over compression ratio, which matters when the CPU bottleneck is real. Single-event gzip at 50K/s would be wasteful; batch compression amortizes the overhead.
- Content negotiation: Not applicable for an internal message bus (Kafka/Kinesis). Set the schema ID in the message envelope instead; consumers look up the schema from a registry.
- Schema evolution plan: Reserve field numbers 1–8 for current fields; commit to never renumbering. Add new fields at 9+ with sensible defaults so old consumers ignore them gracefully.
Rubric: ✓ chose binary format with justification ✓ named a fast compression algorithm appropriate for throughput ✓ noted that content negotiation headers don't apply to internal message buses ✓ addressed schema evolution ✓ cited approximate byte savings. Four of five = strong answer.
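The batching argument in the model answer can be checked with a small experiment. This sketch makes several loud assumptions: the event values are synthetic and deterministic, events are serialized as JSON purely to keep the example within the standard library, and zlib at level 1 stands in for Snappy or LZ4, which are not in the standard library. Only the relative sizes matter, not the absolute numbers.

```python
import json
import zlib


def make_event(i: int) -> dict:
    """Build one made-up event with the eight fields from the exercise."""
    return {
        "user_id": 1000 + i % 500,
        "session_id": f"{i % 97:016x}",          # 16 hex characters
        "event_type": "screen_tap",
        "timestamp": 1_700_000_000_000 + i,
        "x": (i * 7 % 1080) / 2,
        "y": (i * 13 % 1920) / 2,
        "screen": "checkout_payment_form",
        "app_version": "4.12.003",
    }


events = [
    json.dumps(make_event(i), separators=(",", ":")).encode("utf-8")
    for i in range(1000)
]

raw_total = sum(len(e) for e in events)
# Compressing each event alone pays the framing cost 1,000 times
# and never sees the repetition across events.
per_event_total = sum(len(zlib.compress(e, 1)) for e in events)
# Compressing the whole batch lets repeated names and values collapse.
batched_total = len(zlib.compress(b"\n".join(events), 1))

print(f"raw:       {raw_total} bytes")
print(f"per-event: {per_event_total} bytes")
print(f"batched:   {batched_total} bytes")
```

Batching trades a little latency (waiting to fill the batch) for a much better compression ratio and far fewer compressor invocations.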
Key takeaways
- Serialization exists because in-memory objects cannot travel; they must become bytes, cross the wire, and be rebuilt.
- The round trip is always: serialize → transmit → deserialize.
- Schema-less formats (JSON) are self-describing but verbose; schema-driven formats (Protobuf) are compact but require shared schemas.
- Compression (gzip, Brotli) trades CPU cycles for smaller payloads — a good deal on bandwidth-limited links.
- Accept / Content-Type negotiate format; Accept-Encoding / Content-Encoding negotiate compression. They are independent axes.