Design Case Studies · Lesson 20

Design: IoT Data Ingestion API

IoT networks feature millions of resource-constrained devices streaming high-frequency sensor readings. The primary bottleneck is the ingestion path: handling millions of write requests per second without saturating server network interfaces or overwhelming the database. We must design a highly scalable, append-only ingestion pipeline using binary protocols, pre-aggregated batches, and optimized time-series storage.

⏱ ~16 min Advanced Prereq: latency-throughput, df-03, pub/sub

By the end you'll be able to

Choose between HTTP batch REST endpoints and streaming protocols (MQTT/CoAP) for IoT ingestion.
Design compact Protobuf schemas for sensor payloads to minimize device transmission overhead.
Compute raw daily ingress bandwidth and calculate storage savings from double-delta compression.
Architect a buffered write path using message queues and time-series partitioning (TimescaleDB/InfluxDB).

Requirements

Designing an IoT ingestion API requires balancing device battery, network bandwidth, and server IOPS:

Ingestion Scale. Support $1,000,000$ active devices transmitting sensor logs.
Write Path Durability. Ingested data must be reliably committed to disk; however, minor telemetry loss (e.g., 0.01% of temp readings) is acceptable if it prevents complete pipeline lockup.
Bandwidth Constraints. Devices operate on cellular or satellite links (pay-per-byte billing); payloads must be as small as possible.
Database Efficiency. Relational index updates on every insert ($O(\log N)$ b-tree operations) will fail at $100,000+$ writes/sec. The database must support append-only high-rate writes.

Design decisions

Transport Protocols: MQTT vs. CoAP vs. HTTP/2 Batch

For high-frequency, resource-constrained telemetry, standard HTTP/1.1 REST is highly inefficient due to TCP handshake and text header overhead (around 500+ bytes per request). The primary options are:

MQTT (Message Queuing Telemetry Transport). Runs over TCP. Lightweight publish/subscribe model. Best when devices maintain persistent connections and require bidirectional command-and-control messaging.
CoAP (Constrained Application Protocol). Runs over UDP. Mimics REST semantics but uses a tiny binary header. Ideal for extremely battery-constrained or memory-constrained microcontrollers.
HTTP/2 or HTTP/3 REST with Batching. Runs over multiplexed TCP/QUIC. Ideal for gateways or smarter edge devices that can buffer readings locally and upload them in batches (e.g. once every 60 seconds) over TLS.

Data Serialization: JSON vs. Protobuf

To reduce payloads, we avoid verbose JSON text. A typical JSON telemetry payload consumes ~150 bytes. A corresponding Protobuf (Protocol Buffers) payload compiles to just ~24 bytes (an 84% reduction), drastically lowering client transmission energy and data cost.

Time-Series Storage Architecture

Traditional relational tables index rows using B-Trees, which requires updating the index file on disk for every insert. Under high write loads, this causes random disk IOPS exhaustion. We use a Time-Series Database (TSDB) or TimescaleDB hypertables. Hypertables partition data into time-based chunks (e.g., 1-day tables). Since new data always targets the current active partition, the indices fit entirely in RAM, allowing constant-time ($O(1)$) append operations.

The API model

Device Batch Telemetry API (HTTP/2 REST)

For edge gateways and aggregated uploads, we expose a batch endpoint accepting a compressed binary payload containing a sequence of sensor events.

# Upload batched binary Protobuf payload
POST /v1/telemetry/upload HTTP/2
Host: ingest.example.com
Content-Type: application/x-protobuf
Content-Encoding: gzip
X-Device-Signature: hmac-signature-here

[Compressed Protobuf Binary Data]

# Success response
HTTP/2 202 Accepted
Content-Type: application/json

{
  "status": "queued",
  "records_processed": 120
}

Protobuf Payload Definition

This compact schema defines the batch request. By using delta-compression for timestamps, we reduce payload sizes even further.

syntax = "proto3";

message MetricReading {
  uint32 metric_id = 1;      // ID mapping (e.g., 1=temp, 2=humidity)
  double value = 2;          // Numeric sensor value
  sint32 time_offset_ms = 3; // MS delta relative to batch base_timestamp
}

message TelemetryBatch {
  string device_id = 1;
  uint64 base_timestamp_ms = 2; // Absolute Epoch timestamp for reference
  repeated MetricReading readings = 3;
}

Under the hood: Ingestion pipeline

To handle high-throughput writes without database locks, the ingestion path isolates the HTTP/MQTT listeners from the database using a durable message broker.

IoT Ingestion Architecture. Stateless gateways accept telemetry streams and immediately append them to Kafka, completing the client request. Downstream workers drain Kafka, assemble batch inserts, and stream them to partition-active Time-Series DB shards.

By the numbers: telemetry scaling & storage math

Let's calculate the ingestion load and storage footprint for a large-scale telemetry platform.

Governing Equations

Total Raw Inbound Event Rate: $$Event_{rate} = \frac{N_{devices} \times F_{readings}}{T_{interval}}$$ Where $N_{devices}$ is the number of devices, $F_{readings}$ is the readings per device, and $T_{interval}$ is the interval in seconds.
Daily Raw Data Size (Uncompressed): $$Size_{raw} = Event_{rate} \times Payload_{bytes} \times 86,400\text{ sec/day}$$
Double-Delta TSDB Compression: High-density time-series data compresses exceptionally well using Gorilla compression (delta-of-delta for timestamps, XOR for float values). We achieve an average compression factor of **10× to 35×** compared to text-based JSON.

Scenario Parameters

Active Devices ($N$): 1,000,000
Transmission Interval ($T$): 10 seconds (1 reading per device every 10s)
Average JSON Text Payload size: 180 bytes
Equivalent Binary Protobuf size: 32 bytes
Average TSDB Compressed record size on disk: 1.5 bytes

Worked Calculations: Network & Storage Footprint

Metric	Text Payload (JSON)	Binary Payload (Protobuf)	TSDB Double-Delta Storage
Write Operations / sec	$100,000\text{ ops/s}$	$100,000\text{ ops/s}$	$20\text{ batches/s}$ (Using client-side buffering of 50 readings)
Ingress Network Bandwidth	$18.0\text{ MB/s} \approx 144\text{ Mbps}$	$3.2\text{ MB/s} \approx 25.6\text{ Mbps}$	$3.2\text{ MB/s}$ (at gateway)
Daily Ingested Volume	$1.55\text{ TB/day}$	$276.4\text{ GB/day}$	12.96 GB/day
Storage Needed for 1 Year	$565.75\text{ TB}$ (Unusable at scale without partitions)	$100.88\text{ TB}$	4.73 TB

🧮 Step-by-step arithmetic: computing the daily storage savings

Let's trace how 1,000,000 devices sending telemetry every 10 seconds compiles on disk:

Compute daily event count: $$Events_{day} = \frac{1,000,000 \text{ devices}}{10 \text{ seconds}} \times 86,400 \text{ seconds/day} = 8,640,000,000 \text{ events/day}$$
Compute raw JSON size (180 bytes/event): $$Size_{JSON} = 8.64 \times 10^9 \text{ events} \times 180 \text{ bytes} = 1.5552 \times 10^{12} \text{ bytes} \approx 1.55 \text{ TB/day}$$
Compute Protobuf payload size (32 bytes/event): $$Size_{Proto} = 8.64 \times 10^9 \text{ events} \times 32 \text{ bytes} = 2.7648 \times 10^{11} \text{ bytes} \approx 276.4 \text{ GB/day}$$
Compute TSDB storage with Gorilla Double-Delta compression (1.5 bytes/event): $$Size_{TSDB} = 8.64 \times 10^9 \text{ events} \times 1.5 \text{ bytes} = 1.296 \times 10^{10} \text{ bytes} \approx 12.96 \text{ GB/day}$$

By moving from raw JSON storage to binary protobuf ingestion and compressing time series data, daily storage demands fall from **1.55 TB to 12.96 GB** — a **99.1% reduction in storage cost**.

How to debug & inspect it

To debug IoT ingestion pipelines, monitor the message broker lag (to ensure consumers are keeping pace with device writes) and inspect binary payloads directly.

# 1. Check Kafka Consumer Group Lag for telemetry consumers $ kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group telemetry-inserters GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID telemetry-inserters device-readings 0 84210452 84210492 40 worker-0-abc telemetry-inserters device-readings 1 84192080 84594200 402120 worker-1-xyz # Note: Partition 1 shows high LAG (402,120 messages). Worker 1 is struggling or crashed! # 2. Inspect a raw binary protobuf payload received at the edge using protoc $ tcpdump -i eth0 -w - port 50051 | protoc --decode=TelemetryBatch telemetry.proto device_id: "dev_sensor_9921" base_timestamp_ms: 1718908412000 readings { metric_id: 1 value: 23.85 time_offset_ms: 0 } readings { metric_id: 2 value: 62.4 time_offset_ms: 5000 }

Use the guide below to address common failures in high-throughput append-only pipelines:

Symptom	Likely Cause	Fix
The ingest gateway returns `HTTP 503 Service Unavailable` under sudden device reconnect spikes	Gateway threads are blocked waiting for writes to complete on the message broker	Ensure gateway writes to Kafka are asynchronous; allocate local memory buffers on gateways to spool events.
Database write IOPS spikes to 100% capacity; insertion speed plummets	Hypertables/Partitions are too large, forcing index updates to hit disk swap space instead of RAM	Decrease partition time-slice size (e.g. from 7 days to 1 day) to ensure the index for the active write range fits in RAM.
High data billing charges on device cellular plans	HTTP header overhead and verbose JSON formatting are consuming excessive bytes per report	Transition from JSON-over-HTTP to CoAP-over-UDP or Protobuf-over-MQTT; introduce client-side batching.

🧠 Quick check

1. Why are traditional relational database B-Tree indices unsuitable for direct high-throughput IoT writes?

For large datasets, B-Tree indexes exceed memory limits. Inserting new records with random or non-sequential keys forces frequent disk reads/writes to update index pages, saturating disk IOPS.

2. How does a Time-Series Database (TSDB) partition strategy maintain constant-time ($O(1)$) write speeds?

By creating partition chunks based on timestamp ranges (e.g. daily), all new writes target the newest chunk. The index for this active partition is small enough to remain in memory, avoiding disk seeks.

3. Why is CoAP (over UDP) preferred over MQTT (over TCP) for extremely battery-constrained sensors?

TCP requires handshakes (SYN, SYN-ACK) and keep-alive packets to maintain a connection state. For sensors that sleep and wake up occasionally to send 20 bytes of data, UDP avoids connection overhead and conserves battery.

4. What is the role of a message queue (like Kafka) in the ingestion pipeline?

A message queue decouples the high-velocity ingestion layer from the database write layer, absorbing incoming traffic spikes and letting background consumers write data to the DB at a steady, manageable rate.

✍️ Exercise: design the rollup schema

Your TSDB stores 10 billion raw temperature readings ($12.96\text{ GB/day}$). The data retention policy specifies that raw data is deleted after 30 days. However, analysts need long-term temperature trends over a 3-year period.

Design a rollup query or system and the corresponding target table schema to store aggregated metrics. What aggregation window should you use, and what fields must the rollup table contain?

Model answer:

To retain long-term trends while purging raw data, we perform hourly or daily rollups. For long-term trend analysis, an hourly rollup window reduces the data volume by 360× (from 10s intervals to 1h intervals) while preserving temperature dynamics.

Rollup Table Schema (SQL):

CREATE TABLE metric_hourly_rollups (
    device_id VARCHAR(64) NOT NULL,
    metric_id INT NOT NULL,
    hour_timestamp TIMESTAMP NOT NULL,
    reading_count INT NOT NULL,     -- Tracks number of raw events rolled up
    min_value DOUBLE PRECISION,     -- Minimum value in the hour
    max_value DOUBLE PRECISION,     -- Maximum value in the hour
    avg_value DOUBLE PRECISION,     -- Arithmetic mean
    PRIMARY KEY (device_id, metric_id, hour_timestamp)
);

Key Design requirements:

Separate metrics: Avoid storing raw points. Roll up values using `min`, `max`, and `avg`.
Keep counts: Storing `reading_count` is essential if you ever need to combine multiple hourly rollups into a daily rollup (to calculate a weighted average: $\frac{\sum (\text{avg} \times \text{count})}{\sum \text{count}}$).

Job scheduling: A cron task or continuous background trigger (e.g. TimescaleDB continuous aggregates) runs this query hourly:

INSERT INTO metric_hourly_rollups
SELECT device_id, metric_id, date_trunc('hour', time) as hour_timestamp,
       COUNT(*), MIN(value), MAX(value), AVG(value)
FROM raw_device_readings
WHERE time >= NOW() - INTERVAL '2 hours' AND time < NOW() - INTERVAL '1 hour'
GROUP BY device_id, metric_id, hour_timestamp;

Key takeaways

**Telemetry APIs are write-heavy**. Avoid verbose HTTP structures and JSON text formats; use binary protocols (MQTT/CoAP) and Protobuf payload structures.
**Decouple writes with a broker**. Run stateless edge ingest gateways that write directly to a message queue (Kafka) to guarantee sub-10ms write responses.
**Index in memory**. B-Tree indexes on large tables fail at scale. Use a Time-Series Database (TSDB) that partitions tables into time chunks so indices fit in RAM.
**Incorporate delta-compression**. TSDB storage engines use delta-of-delta compression (Gorilla) to reduce numeric telemetry storage by 90%+.
**Implement Rollup Retention**. Purge high-resolution raw data after a short window (e.g., 30 days) and keep consolidated summaries (hourly averages, min/max) for long-term analytics.

Sources & further reading

Gorilla Paper (Facebook) — A Fast, In-Memory Time Series Database — describes double-delta timestamp compression and XOR floating point compression
TimescaleDB Architecture — Hypertables & Chunking — explanation of memory-resident index scaling for time-series
MQTT Specification & Protocols — explaining the light-overhead design parameters for constrained networks