Data & Formats · Lesson 03

Binary formats: Protocol Buffers & Avro

When a system processes millions of messages per second, the overhead of printing and parsing field names in JSON stops being a convenience and starts being a bottleneck. Binary formats eliminate that overhead — but they demand a different way of thinking about schemas.

⏱ 13 min Difficulty: advanced Prereq: df-01 — Data representation, df-02 — Textual formats

By the end you'll be able to

Explain why binary formats are smaller and faster to parse than text formats, using field numbers as the core mechanism.
Write a minimal .proto message definition and reason about safe schema evolution rules.
Choose between JSON and a binary format given a specific scenario, justifying the trade-off.

Why binary: the tax on human readability

Every JSON message carries a metadata tax you never think about: the field names. In {"latitude": 37.7749, "longitude": -122.4194}, the values are 16 characters; the quoted field names are 21. For a single response it doesn't matter. For a sensor that broadcasts its position 500 times per second to 10 000 subscribers, those 21 characters per message add up to roughly 105 MB of pure overhead every second (over 6 GB/min): bytes that carry zero information once the schema is agreed.
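The overhead figure can be checked with a few lines of arithmetic. This is a minimal sketch that counts only the quoted key names (colons, commas and braces are left out) and takes the send rate and subscriber count straight from the example above.

```python
import json

message = {"latitude": 37.7749, "longitude": -122.4194}

# Bytes spent on the quoted key names alone (quotes included, colons excluded).
key_bytes = sum(len(json.dumps(name)) for name in message)

sends_per_second = 500   # broadcast rate from the example
subscribers = 10_000     # fan-out from the example

overhead_per_second = key_bytes * sends_per_second * subscribers
print(key_bytes)                    # 21
print(overhead_per_second / 1e6)    # 105.0 (MB per second)
```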

Binary formats eliminate the tax by replacing field names with small integers. Both sides agree on a schema (a shared document that says "field number 1 is latitude, field number 2 is longitude") and the wire carries only the numbers, not the names. A one-byte field tag instead of a ten-byte quoted string; a four-byte float instead of its ten-character text representation.

The trade-off is direct: binary messages are not human-readable. You cannot paste them into a terminal and understand them. You always need the schema, and usually a code generator, to work with them. For public developer APIs where debugging experience matters, that cost is often too high. For internal, high-throughput infrastructure — event pipelines, gRPC microservices, mobile push — it's the right call.

Approximate on-wire sizes for a two-field location message. Protobuf and Avro are 3–4× smaller than minified JSON for this shape — savings that compound at scale.
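Sizes of this kind can be approximated by hand. The sketch below packs the two coordinates with Python's struct module rather than the real Protobuf or Avro libraries, so treat it as an illustration of the layouts only. The 4-byte float variant is an added assumption, included to show how the numeric type moves the ratio.

```python
import json
import struct

lat, lon = 37.7749, -122.4194

# Minified JSON: no spaces after separators.
json_bytes = json.dumps({"latitude": lat, "longitude": lon},
                        separators=(",", ":")).encode("utf-8")

# Protobuf-style: one tag byte per field, then the fixed-width value.
# tag = (field_number << 3) | wire_type; wire type 1 = 64-bit, 5 = 32-bit.
proto_double = (bytes([(1 << 3) | 1]) + struct.pack("<d", lat) +
                bytes([(2 << 3) | 1]) + struct.pack("<d", lon))
proto_float = (bytes([(1 << 3) | 5]) + struct.pack("<f", lat) +
               bytes([(2 << 3) | 5]) + struct.pack("<f", lon))

# Avro-style record: field values back to back, no tags at all.
avro_double = struct.pack("<dd", lat, lon)

print(len(json_bytes), len(proto_double), len(proto_float), len(avro_double))
# 42 18 10 16
```

With 8-byte doubles the binary forms come to 18 and 16 bytes against 42 for JSON; 4-byte floats shrink the Protobuf form to 10 bytes at the cost of precision.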

Protocol Buffers: field numbers as the contract

Protocol Buffers (Protobuf), developed at Google, centres everything on a .proto file. This schema assigns each field a permanent integer: the field number. A code generator turns that file into strongly typed classes in Go, Java, Python, and other languages. The wire format carries field numbers, never names.

// location.proto — a minimal Protobuf schema
syntax = "proto3";
package telemetry;

message Location {
  double latitude  = 1;   // field number 1 — permanent
  double longitude = 2;   // field number 2 — permanent
  int64  timestamp = 3;   // Unix ms; field number 3
}

// Safe additions — new fields get new numbers
message LocationV2 {
  double latitude  = 1;
  double longitude = 2;
  int64  timestamp = 3;
  float  altitude  = 4;   // added in v2 — old consumers ignore it
  string sensor_id = 5;   // added in v2 — old consumers ignore it
}

The code generator (invoked as protoc) reads this file and emits serialization/deserialization code for your target language. Serializing a Location in Go is a single library call on the generated type; the developer never writes a loop over bytes.
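Why "old consumers ignore it" holds can be seen in a hand-written decoder. The sketch below is not generated protoc code; it is a simplified reader, with the hypothetical names read_varint and decode_known, that walks a LocationV2 message and keeps only the field numbers a v1 consumer knows. The message bytes are assembled by hand for illustration.

```python
import struct

def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:          # continuation bit clear: last byte
            return value, pos
        shift += 7

def decode_known(buf, known):
    """Return {name: raw value} for known field numbers; skip the rest."""
    out, pos = {}, 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:                       # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:                     # 64-bit fixed
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:                     # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:                     # 32-bit fixed
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if number in known:                      # unknown numbers are dropped
            out[known[number]] = value
    return out

# A LocationV2 message assembled by hand.
v2_message = (
    b"\x09" + struct.pack("<d", 37.7749) +       # field 1, double
    b"\x11" + struct.pack("<d", -122.4194) +     # field 2, double
    b"\x18\x2a" +                                # field 3, varint 42
    b"\x25" + struct.pack("<f", 12.5) +          # field 4, float (new in v2)
    b"\x2a\x02s1"                                # field 5, string "s1" (new in v2)
)

V1_FIELDS = {1: "latitude", 2: "longitude", 3: "timestamp"}
decoded = decode_known(v2_message, V1_FIELDS)
print(sorted(decoded))   # ['latitude', 'longitude', 'timestamp']
```

The wire type in each tag tells the reader how many bytes to skip, which is what lets a consumer step over fields 4 and 5 without knowing what they mean.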

Avro: schema stored alongside data

Apache Avro takes a different angle: the schema is written in JSON and is often stored alongside or embedded in the data file. In the Kafka ecosystem, Avro schemas are registered in a Schema Registry — a shared catalogue that assigns each schema an integer ID. Messages carry only that small ID; consumers look up the full schema from the registry before decoding.
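To make "messages carry only that small ID" concrete, here is a sketch of the framing, assuming the layout used by Confluent's serializers: a zero magic byte, then a 4-byte big-endian schema ID, then the Avro-encoded body. The function names, the ID 42 and the payload bytes are placeholders, and no registry lookup is performed.

```python
import struct

MAGIC_BYTE = 0  # marker byte at the start of a framed message (assumed layout)

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro-encoded payload with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes):
    """Split a framed message into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a framed message")
    return schema_id, message[5:]

framed = frame(42, b"payload")
print(framed[:5].hex())   # 000000002a
print(unframe(framed))    # (42, b'payload')
```

A consumer would use the recovered ID to fetch the writer's schema from the registry, typically caching it, before decoding the body.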

Avro's headline feature is schema resolution: when a producer and consumer have slightly different versions of a schema, Avro can map old fields to new fields by name — useful in event-sourcing or data lake scenarios where you can't atomically redeploy all producers and consumers at the same time.

// Avro schema (JSON notation)
{
  "type":   "record",
  "name":   "Location",
  "fields": [
    { "name": "latitude",  "type": "double" },
    { "name": "longitude", "type": "double" },
    { "name": "timestamp", "type": "long"   }
  ]
}

// Adding a field safely: supply a default so consumers on the
// new schema can decode old records that lack the field
{ "name": "altitude", "type": ["null", "float"], "default": null }
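The resolution rule can be sketched on plain dictionaries. This is not the Avro library: resolve is a hypothetical helper that mimics name-based matching on records that have already been decoded, using the reader schema from the example above with the altitude field added.

```python
READER_FIELDS = [
    {"name": "latitude",  "type": "double"},
    {"name": "longitude", "type": "double"},
    {"name": "timestamp", "type": "long"},
    {"name": "altitude",  "type": ["null", "float"], "default": None},
]

def resolve(written: dict, reader_fields: list) -> dict:
    """Project a decoded record onto the reader's schema, matching by name."""
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in written:
            result[name] = written[name]          # writer supplied the value
        elif "default" in field:
            result[name] = field["default"]       # fill from the reader's default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result  # writer fields the reader does not declare are dropped

old_record = {"latitude": 37.7749, "longitude": -122.4194,
              "timestamp": 1700000000000}
print(resolve(old_record, READER_FIELDS)["altitude"])   # None
```

Removing the "default" entry from the altitude field makes the same call raise, which is the failure mode a missing default produces.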

When binary beats JSON: the gRPC connection

Google's gRPC framework pairs HTTP/2 with Protobuf. The combination gives you multiplexed streams over a single TCP connection, strongly typed IDL-generated clients, bidirectional streaming, and binary payloads, all in one package. gRPC is a widely used choice for internal microservice communication, including at companies operating at Google / Netflix scale.

The rule of thumb: choose binary when all three of the following are true — (1) both endpoints are internal services you control; (2) throughput or latency is genuinely constrained; (3) you have a CI pipeline that can compile and distribute schema changes. If any condition is false, lean JSON until proven otherwise.

Factor	Favour JSON	Favour Binary (Protobuf/Avro)
Audience	External developers, public API	Internal services you control
Throughput	Low–medium (<10K req/s)	High (>100K msg/s)
Debuggability	High priority	Tooling acceptable
Schema enforcement	Nice-to-have	Required
Language diversity	Many, uncontrolled	Fixed set, code-gen available

⚠️ Common trap: reusing or renumbering field numbers

The single most destructive Protobuf mistake is deleting a field and then assigning its number to a new field with a different type or meaning. Old messages sitting in a queue or storage layer still carry the original field number; they will be decoded as the wrong field. The rules are absolute: never reuse a field number and never change a field's wire type. When you delete a field, mark it reserved in the .proto file so the compiler rejects any future reuse. Schema evolution in Avro requires similarly careful management of defaults: if a field is added without a default, readers using the new schema cannot decode any record written before the change.
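The failure is silent when the old and new types share a wire type. As an illustration, assume a field 3 that was once a double is reused as an sfixed64; both are 8-byte fixed-width values, so the decoder has no way to notice. The value 19.99 is made up for the example.

```python
import struct

# Old producer: field 3 was a double (wire type 1, 8 bytes on the wire).
old_value = 19.99
payload = struct.pack("<d", old_value)

# New schema reuses number 3 as sfixed64, which is also wire type 1.
# The same 8 bytes are now read as a signed integer.
(misread,) = struct.unpack("<q", payload)

print(old_value, "decoded as", misread)
```

No error is raised; the consumer simply receives a very large integer where a price used to be.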

🎯 Interview angle

"When would you choose a binary format over JSON?" Frame the answer around the constraints, not a preference. Binary wins when: throughput or latency budgets are tight, both producer and consumer are services you deploy (no third-party consumers), and you can afford the schema management overhead. JSON wins everywhere else because debugging a bad deployment at 2 AM is much easier when you can read the wire with curl. Naming gRPC + Protobuf as the canonical production example, plus Avro for Kafka pipelines, rounds out a senior answer.

✅ Start with JSON; migrate with measurement

Premature optimisation applies to serialization formats too. Start with JSON, measure actual payload sizes and parse latency, and migrate to binary only when data proves a bottleneck. When you do migrate, keep a JSON fallback path during the cutover: binary and JSON decoders can coexist behind a version header, letting you roll back without a full redeployment.
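One way to let both decoders coexist is a one-byte format marker in front of every payload. The sketch below is an assumption about how such a header could look, not an established convention: the marker values and the struct-packed binary body are hypothetical stand-ins for a real binary codec.

```python
import json
import struct

FORMAT_JSON = 0x00     # hypothetical marker values
FORMAT_BINARY = 0x01

def encode(lat: float, lon: float, binary: bool) -> bytes:
    """Serialize a location, prefixed with a one-byte format marker."""
    if binary:
        return bytes([FORMAT_BINARY]) + struct.pack("<dd", lat, lon)
    body = json.dumps({"latitude": lat, "longitude": lon}).encode("utf-8")
    return bytes([FORMAT_JSON]) + body

def decode(message: bytes) -> dict:
    """Dispatch on the marker so old and new producers can run side by side."""
    marker, body = message[0], message[1:]
    if marker == FORMAT_JSON:
        return json.loads(body)
    if marker == FORMAT_BINARY:
        lat, lon = struct.unpack("<dd", body)
        return {"latitude": lat, "longitude": lon}
    raise ValueError(f"unknown format marker {marker:#x}")

print(decode(encode(1.5, -2.5, binary=True)) == decode(encode(1.5, -2.5, binary=False)))
# True
```

Rolling back then means flipping the producer's binary flag; consumers keep accepting both forms throughout the cutover.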

Under the hood: how it actually works

Every field on the wire starts with a tag computed by the formula tag = (field_number << 3) | wire_type. The tag is itself a varint, so it fits in a single byte for field numbers 1–15. Wire types are small integers: 0 = varint, 1 = 64-bit fixed, 2 = length-delimited (string / bytes / embedded message), 5 = 32-bit fixed. So field 1 carrying a varint has tag byte 0x08 = (1 << 3) | 0. Field 2 carrying a varint has tag 0x10 = (2 << 3) | 0. Field 3 carrying a length-delimited string has tag 0x1a = (3 << 3) | 2. The decoder reads the tag, extracts the field number and wire type, and knows exactly how to read the bytes that follow, without needing the field name at all.
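The tag formula and its inverse fit in two one-line functions. The names make_tag and split_tag are illustrative, not part of any Protobuf library.

```python
WIRE_VARINT, WIRE_FIXED64, WIRE_LEN, WIRE_FIXED32 = 0, 1, 2, 5

def make_tag(field_number: int, wire_type: int) -> int:
    """Pack a field number and wire type into a tag value."""
    return (field_number << 3) | wire_type

def split_tag(tag: int):
    """Recover (field_number, wire_type) from a tag value."""
    return tag >> 3, tag & 0x07

print(hex(make_tag(1, WIRE_VARINT)))   # 0x8
print(hex(make_tag(2, WIRE_VARINT)))   # 0x10
print(hex(make_tag(3, WIRE_LEN)))      # 0x1a
print(split_tag(0x1a))                 # (3, 2)
```

Field number 16 already yields a tag value of 128, which no longer fits in seven bits, so frequently used fields are usually given numbers 1 through 15.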

Values themselves use varint encoding for integers: small numbers fit in fewer bytes. Values 0–127 fit in a single byte. Values 128–16383 fit in two bytes. The most-significant bit (MSB) of each byte is a continuation bit: 1 means more bytes follow; 0 means this is the final byte of the integer. The integer 1 therefore encodes as 0x01 (MSB is 0 — done). The integer 300 encodes as 0xac 0x02: first byte 0xac has MSB 1 (more follows) and carries the low 7 bits; second byte 0x02 has MSB 0 and carries the remaining high bits.
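The continuation-bit scheme can be written out directly. This is a minimal sketch for non-negative integers only; it leaves out how signed types are encoded, and the function names are illustrative.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)   # more bytes follow: set the MSB
        else:
            out.append(low7)          # final byte: MSB stays 0
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Decode a single varint from the start of data."""
    value = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:           # MSB clear: this was the last byte
            break
    return value

print(encode_varint(1).hex())        # 01
print(encode_varint(300).hex())      # ac02
print(decode_varint(b"\xac\x02"))    # 300
```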

// .proto definition
message User {
  int32  id   = 1;    // field 1, wire type 0 (varint)
  string name = 2;    // field 2, wire type 2 (length-delimited)
}

// Encoding User { id: 42, name: "Ada" }
// Field 1 (id=42):
//   tag = (1 << 3) | 0 = 0x08
//   value 42 = 0x2a (varint, fits in 1 byte, MSB=0)
//   bytes: 08 2a

// Field 2 (name="Ada"):
//   tag = (2 << 3) | 2 = 0x12
//   length = 3 (0x03)
//   UTF-8 bytes of "Ada" = 41 64 61
//   bytes: 12 03 41 64 61

// Full message: 7 bytes total
08 2a  12 03 41 64 61
// ^tag ^42  ^tag  ^len ^"Ada"

// JSON equivalent: {"id":42,"name":"Ada"} = 22 bytes (minified)
// Binary is about 68% smaller for this tiny message

The compactness advantage is structural, not incidental. In JSON, the key "id" costs 5 bytes (quotes + letters + colon), and "name" costs 7 bytes: 12 bytes of overhead just for two field names in a 22-byte message. Protobuf uses 2 tag bytes total for those same two fields. Scale that up: a message with 10 fields and an average key cost of 8 bytes (quotes and colon included) carries roughly 80 bytes of JSON key overhead, replaced by roughly 10 tag bytes in Protobuf. Before even accounting for the more efficient numeric encoding (4-byte float vs. its 12-character text form), a typical API message is often already 3–5× smaller on the wire.

The 7-byte wire encoding of User{id: 42, name: "Ada"}. Teal boxes are tag bytes; each encodes a field number and wire type in a single byte. The equivalent minified JSON is 22 bytes.
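The byte counts for the User example can be reproduced by encoding the message by hand. This sketch hard-codes the two fields rather than using generated Protobuf code; encode_user and varint are illustrative names.

```python
import json

def varint(value: int) -> bytes:
    """Base-128 varint for a non-negative integer."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def encode_user(user_id: int, name: str) -> bytes:
    """Hand-encode User { id = 1 (varint), name = 2 (length-delimited) }."""
    name_bytes = name.encode("utf-8")
    field_1 = varint((1 << 3) | 0) + varint(user_id)
    field_2 = varint((2 << 3) | 2) + varint(len(name_bytes)) + name_bytes
    return field_1 + field_2

binary = encode_user(42, "Ada")
text = json.dumps({"id": 42, "name": "Ada"}, separators=(",", ":")).encode("utf-8")

print(binary.hex(" "))          # 08 2a 12 03 41 64 61
print(len(binary), len(text))   # 7 22
```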

How to debug & inspect it

Because Protobuf is binary and opaque to the naked eye, debugging requires dedicated tooling. The protoc --decode_raw command decodes any Protobuf binary blob without the schema — it shows field numbers and wire types, but not field names. When you do have the .proto file, protoc --decode=package.MessageName produces a full human-readable text representation with field names. For gRPC specifically, grpcurl lets you call a live endpoint and read the JSON-decoded response without writing any client code.

# --decode_raw: no schema needed; shows field numbers only
$ echo "082a1203416461" | xxd -r -p | protoc --decode_raw
1: 42
2: "Ada"

# Or pipe a binary file directly:
$ cat response.bin | protoc --decode_raw

# With the .proto schema available (full field names):
$ cat response.bin | protoc --decode=user.User user.proto
id: 42
name: "Ada"

# grpcurl: call a gRPC endpoint and see the JSON-decoded response
$ grpcurl -d '{"id": 1}' api.example.com:443 user.UserService/GetUser
{
  "id": 1,
  "name": "Ada Lovelace"
}

For inspection without protoc, use protoscope (a Protobuf wire-format dumper distributed as a Go command-line tool) or, when installing local tools is not an option, an online Protobuf decoder: paste hex bytes and it returns a field-by-field breakdown with wire types.

Symptom	Cause	Fix
Decoded field value is garbage / wrong type	Field number was reused with a different wire type after a schema change	Never reuse field numbers; mark deleted fields `reserved`; roll back and re-encode affected messages
Consumer silently reads zero / default for a field the producer populated	Producer sends the field, but consumer's schema has a different field number for that field name (a numbering mismatch between teams)	Enforce schema review in CI; use a schema registry; run `protoc --decode` on a sample message to compare expected vs. actual field numbers
Message decodes but `altitude` field is always 0	New field (e.g., `altitude = 4`) not sent by old producer; consumer reads the proto3 default value	Expected behaviour; proto3 defaults are by design. Declare the field `optional` (which adds presence tracking), or add a sentinel value or a `has_altitude` bool field, if you need to distinguish "not set" from "zero"
Large integers wrap or truncate unexpectedly	`int32` field overflows (max 2³¹−1); should be `int64`	Change the field type to `int64` using a new field number — avoid wire-format collision with cached old messages
Schema evolution breaks existing Kafka messages	New required field added without a default in Avro, or Protobuf field number renumbered	Add fields with defaults (Avro) or new field numbers (Protobuf); never modify existing field semantics

Debug checklist:

Use protoc --decode_raw on a captured binary blob to confirm the field numbers match your .proto definition before assuming a parsing bug.
When adding a field, check the .proto for reserved declarations — if the number you want is reserved, pick the next available number.
Run buf lint (or another Protobuf linter such as protolint) in CI to catch naming violations, and buf breaking to catch field number or type changes, before they reach production.
For Avro: after adding a field, test schema compatibility with avro-tools or the Confluent Schema Registry compatibility check API before deploying the new producer.
Keep a changelog of field number allocations in the .proto file as comments — future engineers need to know which numbers are reserved and why.

🧠 Quick check

1. What does a Protobuf field number on the wire replace?

Field numbers encode field identity on the wire instead of the full string name. The name exists only in the .proto file (and generated code). This is the primary source of Protobuf's size advantage over JSON.

2. A team deletes field number 3 from a Protobuf message and reassigns that number to a new field with a different type. What is the consequence?

Protobuf field numbers are permanent identifiers. Reusing a number for a different type or meaning causes any legacy record that still carries that number to be misinterpreted. The safe path is to mark the old number as reserved and use a new, never-before-used number for the new field.

3. You are building a public REST API for third-party mobile developers. Should you use Protobuf or JSON as the default format, and why?

For a public API, developer experience and broad client compatibility are paramount. JSON requires no special tooling. Protobuf's benefits accrue on internal, high-throughput services where you control both ends. Avro is primarily a data serialization format for pipelines (Kafka), not HTTP APIs.

4. In an Avro schema, what is the purpose of supplying a "default" value when adding a new field?

Avro's schema resolution rule: a new optional field with a default can be read from records written under the old schema. Without a default, the field is required, and decoding an old record that lacks it fails. Defaults are the primary tool for backward-compatible schema evolution in Avro.

✍️ Exercise: evolve a Protobuf schema safely

You have the following production Protobuf schema at version 1:

message Order {
  int64  order_id     = 1;
  string customer_id  = 2;
  double total_cents  = 3;  // BUG: should be int64
}

You need to: (a) fix the total_cents type to int64, (b) add a currency_code field (e.g., "USD"), and (c) remove the buggy double field without breaking consumers while old producers still send it. Write the v2 schema with correct field numbers and reserved declarations.

Model answer:

message Order {
  int64  order_id      = 1;
  string customer_id   = 2;
  // Field 3 (total_cents double) is REMOVED — mark reserved
  reserved 3;
  reserved "total_cents";  // prevents name reuse
  int64  total_cents_v2 = 4;  // correct type, new number
  string currency_code  = 5;  // new field — old consumers ignore
}

Why this works: Old producers still send field 3 (the buggy double); new consumers skip it because 3 is not defined in the new message. New producers send fields 1, 2, 4, 5. Old consumers that only know about 1–3 ignore 4 and 5. During migration, the application layer reads total_cents_v2 when non-zero. If it must also fall back to converting the old double, keep field 3 declared (marked deprecated) for a transition release and switch it to reserved once no producer sends it, because a reserved field is no longer accessible from generated code.

Rubric: ✓ did not reuse field 3 ✓ added reserved for both the number and the name ✓ used a new number (4) for the fixed field ✓ new field (5) gets a new number ✓ explained the migration path. All five = exceptional; three or four = passing.

Key takeaways

Binary formats replace verbose field names with compact integer tags, yielding messages that are often 3–5× smaller for typical API shapes.
Protobuf field numbers are permanent: never reuse or renumber them; mark deleted fields reserved.
Avro uses schema resolution and defaults to handle version mismatches between producers and consumers.
Binary is best for internal, high-throughput services (gRPC, Kafka) where you control both ends and can manage schema changes.
JSON remains the right default for public APIs — readability and zero-tooling onboarding outweigh the size advantage for most cases.