API Design

Design Case Studies · Lesson 20

Design: IoT Data Ingestion API

IoT networks feature millions of resource-constrained devices streaming high-frequency sensor readings. The primary bottleneck is the ingestion path: handling millions of write requests per second without saturating server network interfaces or overwhelming the database. We must design a highly scalable, append-only ingestion pipeline using binary protocols, pre-aggregated batches, and optimized time-series storage.

⏱ ~16 min Advanced Prereq: latency-throughput, df-03, pub/sub

By the end you'll be able to

Requirements

Designing an IoT ingestion API requires balancing device battery, network bandwidth, and server IOPS:

Design decisions

Transport Protocols: MQTT vs. CoAP vs. HTTP/2 Batch

For high-frequency, resource-constrained telemetry, standard HTTP/1.1 REST is highly inefficient due to TCP handshake and text header overhead (around 500+ bytes per request). The primary options are:

Data Serialization: JSON vs. Protobuf

To reduce payloads, we avoid verbose JSON text. A typical JSON telemetry payload consumes ~150 bytes. A corresponding Protobuf (Protocol Buffers) payload compiles to just ~24 bytes (an 84% reduction), drastically lowering client transmission energy and data cost.

Time-Series Storage Architecture

Traditional relational tables index rows using B-Trees, which requires updating the index file on disk for every insert. Under high write loads, this causes random disk IOPS exhaustion. We use a Time-Series Database (TSDB) or TimescaleDB hypertables. Hypertables partition data into time-based chunks (e.g., 1-day tables). Since new data always targets the current active partition, the indices fit entirely in RAM, allowing constant-time ($O(1)$) append operations.

The API model

Device Batch Telemetry API (HTTP/2 REST)

For edge gateways and aggregated uploads, we expose a batch endpoint accepting a compressed binary payload containing a sequence of sensor events.

# Upload batched binary Protobuf payload
POST /v1/telemetry/upload HTTP/2
Host: ingest.example.com
Content-Type: application/x-protobuf
Content-Encoding: gzip
X-Device-Signature: hmac-signature-here

[Compressed Protobuf Binary Data]

# Success response
HTTP/2 202 Accepted
Content-Type: application/json

{
  "status": "queued",
  "records_processed": 120
}

Protobuf Payload Definition

This compact schema defines the batch request. By using delta-compression for timestamps, we reduce payload sizes even further.

syntax = "proto3";

message MetricReading {
  uint32 metric_id = 1;      // ID mapping (e.g., 1=temp, 2=humidity)
  double value = 2;          // Numeric sensor value
  sint32 time_offset_ms = 3; // MS delta relative to batch base_timestamp
}

message TelemetryBatch {
  string device_id = 1;
  uint64 base_timestamp_ms = 2; // Absolute Epoch timestamp for reference
  repeated MetricReading readings = 3;
}

Under the hood: Ingestion pipeline

To handle high-throughput writes without database locks, the ingestion path isolates the HTTP/MQTT listeners from the database using a durable message broker.

IoT Devices MQTT / HTTP Protobuf Ingest Gateways Terminates TLS Validates signature Writes to Broker Message Queue Apache Kafka / Pulsar Durable buffer Prevents spikes Consumer Workers Batch micro-flushes (e.g., 5,000 rows/flush) Time-Series DB TimescaleDB / InfluxDB Active partitions in RAM
IoT Ingestion Architecture. Stateless gateways accept telemetry streams and immediately append them to Kafka, completing the client request. Downstream workers drain Kafka, assemble batch inserts, and stream them to partition-active Time-Series DB shards.

By the numbers: telemetry scaling & storage math

Let's calculate the ingestion load and storage footprint for a large-scale telemetry platform.

Governing Equations

Scenario Parameters

Worked Calculations: Network & Storage Footprint

Metric Text Payload (JSON) Binary Payload (Protobuf) TSDB Double-Delta Storage
Write Operations / sec $100,000\text{ ops/s}$ $100,000\text{ ops/s}$ $20\text{ batches/s}$ (Using client-side buffering of 50 readings)
Ingress Network Bandwidth $18.0\text{ MB/s} \approx 144\text{ Mbps}$ $3.2\text{ MB/s} \approx 25.6\text{ Mbps}$ $3.2\text{ MB/s}$ (at gateway)
Daily Ingested Volume $1.55\text{ TB/day}$ $276.4\text{ GB/day}$ 12.96 GB/day
Storage Needed for 1 Year $565.75\text{ TB}$ (Unusable at scale without partitions) $100.88\text{ TB}$ 4.73 TB
🧮 Step-by-step arithmetic: computing the daily storage savings

Let's trace how 1,000,000 devices sending telemetry every 10 seconds compiles on disk:

  1. Compute daily event count: $$Events_{day} = \frac{1,000,000 \text{ devices}}{10 \text{ seconds}} \times 86,400 \text{ seconds/day} = 8,640,000,000 \text{ events/day}$$
  2. Compute raw JSON size (180 bytes/event): $$Size_{JSON} = 8.64 \times 10^9 \text{ events} \times 180 \text{ bytes} = 1.5552 \times 10^{12} \text{ bytes} \approx 1.55 \text{ TB/day}$$
  3. Compute Protobuf payload size (32 bytes/event): $$Size_{Proto} = 8.64 \times 10^9 \text{ events} \times 32 \text{ bytes} = 2.7648 \times 10^{11} \text{ bytes} \approx 276.4 \text{ GB/day}$$
  4. Compute TSDB storage with Gorilla Double-Delta compression (1.5 bytes/event): $$Size_{TSDB} = 8.64 \times 10^9 \text{ events} \times 1.5 \text{ bytes} = 1.296 \times 10^{10} \text{ bytes} \approx 12.96 \text{ GB/day}$$

By moving from raw JSON storage to binary protobuf ingestion and compressing time series data, daily storage demands fall from **1.55 TB to 12.96 GB** — a **99.1% reduction in storage cost**.

How to debug & inspect it

To debug IoT ingestion pipelines, monitor the message broker lag (to ensure consumers are keeping pace with device writes) and inspect binary payloads directly.

# 1. Check Kafka Consumer Group Lag for telemetry consumers $ kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group telemetry-inserters GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID telemetry-inserters device-readings 0 84210452 84210492 40 worker-0-abc telemetry-inserters device-readings 1 84192080 84594200 402120 worker-1-xyz # Note: Partition 1 shows high LAG (402,120 messages). Worker 1 is struggling or crashed! # 2. Inspect a raw binary protobuf payload received at the edge using protoc $ tcpdump -i eth0 -w - port 50051 | protoc --decode=TelemetryBatch telemetry.proto device_id: "dev_sensor_9921" base_timestamp_ms: 1718908412000 readings { metric_id: 1 value: 23.85 time_offset_ms: 0 } readings { metric_id: 2 value: 62.4 time_offset_ms: 5000 }

Use the guide below to address common failures in high-throughput append-only pipelines:

Symptom Likely Cause Fix
The ingest gateway returns HTTP 503 Service Unavailable under sudden device reconnect spikes Gateway threads are blocked waiting for writes to complete on the message broker Ensure gateway writes to Kafka are asynchronous; allocate local memory buffers on gateways to spool events.
Database write IOPS spikes to 100% capacity; insertion speed plummets Hypertables/Partitions are too large, forcing index updates to hit disk swap space instead of RAM Decrease partition time-slice size (e.g. from 7 days to 1 day) to ensure the index for the active write range fits in RAM.
High data billing charges on device cellular plans HTTP header overhead and verbose JSON formatting are consuming excessive bytes per report Transition from JSON-over-HTTP to CoAP-over-UDP or Protobuf-over-MQTT; introduce client-side batching.

🧠 Quick check

1. Why are traditional relational database B-Tree indices unsuitable for direct high-throughput IoT writes?

For large datasets, B-Tree indexes exceed memory limits. Inserting new records with random or non-sequential keys forces frequent disk reads/writes to update index pages, saturating disk IOPS.

2. How does a Time-Series Database (TSDB) partition strategy maintain constant-time ($O(1)$) write speeds?

By creating partition chunks based on timestamp ranges (e.g. daily), all new writes target the newest chunk. The index for this active partition is small enough to remain in memory, avoiding disk seeks.

3. Why is CoAP (over UDP) preferred over MQTT (over TCP) for extremely battery-constrained sensors?

TCP requires handshakes (SYN, SYN-ACK) and keep-alive packets to maintain a connection state. For sensors that sleep and wake up occasionally to send 20 bytes of data, UDP avoids connection overhead and conserves battery.

4. What is the role of a message queue (like Kafka) in the ingestion pipeline?

A message queue decouples the high-velocity ingestion layer from the database write layer, absorbing incoming traffic spikes and letting background consumers write data to the DB at a steady, manageable rate.

✍️ Exercise: design the rollup schema

Your TSDB stores 10 billion raw temperature readings ($12.96\text{ GB/day}$). The data retention policy specifies that raw data is deleted after 30 days. However, analysts need long-term temperature trends over a 3-year period.

Design a rollup query or system and the corresponding target table schema to store aggregated metrics. What aggregation window should you use, and what fields must the rollup table contain?


Model answer:

To retain long-term trends while purging raw data, we perform hourly or daily rollups. For long-term trend analysis, an hourly rollup window reduces the data volume by 360× (from 10s intervals to 1h intervals) while preserving temperature dynamics.

Rollup Table Schema (SQL):

CREATE TABLE metric_hourly_rollups (
    device_id VARCHAR(64) NOT NULL,
    metric_id INT NOT NULL,
    hour_timestamp TIMESTAMP NOT NULL,
    reading_count INT NOT NULL,     -- Tracks number of raw events rolled up
    min_value DOUBLE PRECISION,     -- Minimum value in the hour
    max_value DOUBLE PRECISION,     -- Maximum value in the hour
    avg_value DOUBLE PRECISION,     -- Arithmetic mean
    PRIMARY KEY (device_id, metric_id, hour_timestamp)
);

Key Design requirements:

  1. Separate metrics: Avoid storing raw points. Roll up values using `min`, `max`, and `avg`.
  2. Keep counts: Storing `reading_count` is essential if you ever need to combine multiple hourly rollups into a daily rollup (to calculate a weighted average: $\frac{\sum (\text{avg} \times \text{count})}{\sum \text{count}}$).
  3. Job scheduling: A cron task or continuous background trigger (e.g. TimescaleDB continuous aggregates) runs this query hourly:
    INSERT INTO metric_hourly_rollups
    SELECT device_id, metric_id, date_trunc('hour', time) as hour_timestamp,
           COUNT(*), MIN(value), MAX(value), AVG(value)
    FROM raw_device_readings
    WHERE time >= NOW() - INTERVAL '2 hours' AND time < NOW() - INTERVAL '1 hour'
    GROUP BY device_id, metric_id, hour_timestamp;

Key takeaways

Sources & further reading