Design Case Studies · Lesson 20
Design: IoT Data Ingestion API
IoT networks feature millions of resource-constrained devices streaming high-frequency sensor readings. The primary bottleneck is the ingestion path: handling millions of write requests per second without saturating server network interfaces or overwhelming the database. We must design a highly scalable, append-only ingestion pipeline using binary protocols, pre-aggregated batches, and optimized time-series storage.
By the end you'll be able to
- Choose between HTTP batch REST endpoints and streaming protocols (MQTT/CoAP) for IoT ingestion.
- Design compact Protobuf schemas for sensor payloads to minimize device transmission overhead.
- Compute raw daily ingress bandwidth and calculate storage savings from double-delta compression.
- Architect a buffered write path using message queues and time-series partitioning (TimescaleDB/InfluxDB).
Requirements
Designing an IoT ingestion API requires balancing device battery, network bandwidth, and server IOPS:
- Ingestion Scale. Support $1,000,000$ active devices transmitting sensor logs.
- Write Path Durability. Ingested data must be reliably committed to disk; however, minor telemetry loss (e.g., 0.01% of temp readings) is acceptable if it prevents complete pipeline lockup.
- Bandwidth Constraints. Devices operate on cellular or satellite links (pay-per-byte billing); payloads must be as small as possible.
- Database Efficiency. Relational index updates on every insert ($O(\log N)$ b-tree operations) will fail at $100,000+$ writes/sec. The database must support append-only high-rate writes.
Design decisions
Transport Protocols: MQTT vs. CoAP vs. HTTP/2 Batch
For high-frequency, resource-constrained telemetry, standard HTTP/1.1 REST is highly inefficient due to TCP handshake and text header overhead (around 500+ bytes per request). The primary options are:
- MQTT (Message Queuing Telemetry Transport). Runs over TCP. Lightweight publish/subscribe model. Best when devices maintain persistent connections and require bidirectional command-and-control messaging.
- CoAP (Constrained Application Protocol). Runs over UDP. Mimics REST semantics but uses a tiny binary header. Ideal for extremely battery-constrained or memory-constrained microcontrollers.
- HTTP/2 or HTTP/3 REST with Batching. Runs over multiplexed TCP/QUIC. Ideal for gateways or smarter edge devices that can buffer readings locally and upload them in batches (e.g. once every 60 seconds) over TLS.
Data Serialization: JSON vs. Protobuf
To reduce payloads, we avoid verbose JSON text. A typical JSON telemetry payload consumes ~150 bytes. A corresponding Protobuf (Protocol Buffers) payload compiles to just ~24 bytes (an 84% reduction), drastically lowering client transmission energy and data cost.
Time-Series Storage Architecture
Traditional relational tables index rows using B-Trees, which requires updating the index file on disk for every insert. Under high write loads, this causes random disk IOPS exhaustion. We use a Time-Series Database (TSDB) or TimescaleDB hypertables. Hypertables partition data into time-based chunks (e.g., 1-day tables). Since new data always targets the current active partition, the indices fit entirely in RAM, allowing constant-time ($O(1)$) append operations.
The API model
Device Batch Telemetry API (HTTP/2 REST)
For edge gateways and aggregated uploads, we expose a batch endpoint accepting a compressed binary payload containing a sequence of sensor events.
# Upload batched binary Protobuf payload
POST /v1/telemetry/upload HTTP/2
Host: ingest.example.com
Content-Type: application/x-protobuf
Content-Encoding: gzip
X-Device-Signature: hmac-signature-here
[Compressed Protobuf Binary Data]
# Success response
HTTP/2 202 Accepted
Content-Type: application/json
{
"status": "queued",
"records_processed": 120
}
Protobuf Payload Definition
This compact schema defines the batch request. By using delta-compression for timestamps, we reduce payload sizes even further.
syntax = "proto3";
message MetricReading {
uint32 metric_id = 1; // ID mapping (e.g., 1=temp, 2=humidity)
double value = 2; // Numeric sensor value
sint32 time_offset_ms = 3; // MS delta relative to batch base_timestamp
}
message TelemetryBatch {
string device_id = 1;
uint64 base_timestamp_ms = 2; // Absolute Epoch timestamp for reference
repeated MetricReading readings = 3;
}
Under the hood: Ingestion pipeline
To handle high-throughput writes without database locks, the ingestion path isolates the HTTP/MQTT listeners from the database using a durable message broker.
By the numbers: telemetry scaling & storage math
Let's calculate the ingestion load and storage footprint for a large-scale telemetry platform.
Governing Equations
- Total Raw Inbound Event Rate: $$Event_{rate} = \frac{N_{devices} \times F_{readings}}{T_{interval}}$$ Where $N_{devices}$ is the number of devices, $F_{readings}$ is the readings per device, and $T_{interval}$ is the interval in seconds.
- Daily Raw Data Size (Uncompressed): $$Size_{raw} = Event_{rate} \times Payload_{bytes} \times 86,400\text{ sec/day}$$
- Double-Delta TSDB Compression: High-density time-series data compresses exceptionally well using Gorilla compression (delta-of-delta for timestamps, XOR for float values). We achieve an average compression factor of **10× to 35×** compared to text-based JSON.
Scenario Parameters
- Active Devices ($N$): 1,000,000
- Transmission Interval ($T$): 10 seconds (1 reading per device every 10s)
- Average JSON Text Payload size: 180 bytes
- Equivalent Binary Protobuf size: 32 bytes
- Average TSDB Compressed record size on disk: 1.5 bytes
Worked Calculations: Network & Storage Footprint
| Metric | Text Payload (JSON) | Binary Payload (Protobuf) | TSDB Double-Delta Storage |
|---|---|---|---|
| Write Operations / sec | $100,000\text{ ops/s}$ | $100,000\text{ ops/s}$ | $20\text{ batches/s}$ (Using client-side buffering of 50 readings) |
| Ingress Network Bandwidth | $18.0\text{ MB/s} \approx 144\text{ Mbps}$ | $3.2\text{ MB/s} \approx 25.6\text{ Mbps}$ | $3.2\text{ MB/s}$ (at gateway) |
| Daily Ingested Volume | $1.55\text{ TB/day}$ | $276.4\text{ GB/day}$ | 12.96 GB/day |
| Storage Needed for 1 Year | $565.75\text{ TB}$ (Unusable at scale without partitions) | $100.88\text{ TB}$ | 4.73 TB |
Let's trace how 1,000,000 devices sending telemetry every 10 seconds compiles on disk:
- Compute daily event count: $$Events_{day} = \frac{1,000,000 \text{ devices}}{10 \text{ seconds}} \times 86,400 \text{ seconds/day} = 8,640,000,000 \text{ events/day}$$
- Compute raw JSON size (180 bytes/event): $$Size_{JSON} = 8.64 \times 10^9 \text{ events} \times 180 \text{ bytes} = 1.5552 \times 10^{12} \text{ bytes} \approx 1.55 \text{ TB/day}$$
- Compute Protobuf payload size (32 bytes/event): $$Size_{Proto} = 8.64 \times 10^9 \text{ events} \times 32 \text{ bytes} = 2.7648 \times 10^{11} \text{ bytes} \approx 276.4 \text{ GB/day}$$
- Compute TSDB storage with Gorilla Double-Delta compression (1.5 bytes/event): $$Size_{TSDB} = 8.64 \times 10^9 \text{ events} \times 1.5 \text{ bytes} = 1.296 \times 10^{10} \text{ bytes} \approx 12.96 \text{ GB/day}$$
By moving from raw JSON storage to binary protobuf ingestion and compressing time series data, daily storage demands fall from **1.55 TB to 12.96 GB** — a **99.1% reduction in storage cost**.
How to debug & inspect it
To debug IoT ingestion pipelines, monitor the message broker lag (to ensure consumers are keeping pace with device writes) and inspect binary payloads directly.
Use the guide below to address common failures in high-throughput append-only pipelines:
| Symptom | Likely Cause | Fix |
|---|---|---|
The ingest gateway returns HTTP 503 Service Unavailable under sudden device reconnect spikes |
Gateway threads are blocked waiting for writes to complete on the message broker | Ensure gateway writes to Kafka are asynchronous; allocate local memory buffers on gateways to spool events. |
| Database write IOPS spikes to 100% capacity; insertion speed plummets | Hypertables/Partitions are too large, forcing index updates to hit disk swap space instead of RAM | Decrease partition time-slice size (e.g. from 7 days to 1 day) to ensure the index for the active write range fits in RAM. |
| High data billing charges on device cellular plans | HTTP header overhead and verbose JSON formatting are consuming excessive bytes per report | Transition from JSON-over-HTTP to CoAP-over-UDP or Protobuf-over-MQTT; introduce client-side batching. |
🧠 Quick check
1. Why are traditional relational database B-Tree indices unsuitable for direct high-throughput IoT writes?
For large datasets, B-Tree indexes exceed memory limits. Inserting new records with random or non-sequential keys forces frequent disk reads/writes to update index pages, saturating disk IOPS.
2. How does a Time-Series Database (TSDB) partition strategy maintain constant-time ($O(1)$) write speeds?
By creating partition chunks based on timestamp ranges (e.g. daily), all new writes target the newest chunk. The index for this active partition is small enough to remain in memory, avoiding disk seeks.
3. Why is CoAP (over UDP) preferred over MQTT (over TCP) for extremely battery-constrained sensors?
TCP requires handshakes (SYN, SYN-ACK) and keep-alive packets to maintain a connection state. For sensors that sleep and wake up occasionally to send 20 bytes of data, UDP avoids connection overhead and conserves battery.
4. What is the role of a message queue (like Kafka) in the ingestion pipeline?
A message queue decouples the high-velocity ingestion layer from the database write layer, absorbing incoming traffic spikes and letting background consumers write data to the DB at a steady, manageable rate.
✍️ Exercise: design the rollup schema
Your TSDB stores 10 billion raw temperature readings ($12.96\text{ GB/day}$). The data retention policy specifies that raw data is deleted after 30 days. However, analysts need long-term temperature trends over a 3-year period.
Design a rollup query or system and the corresponding target table schema to store aggregated metrics. What aggregation window should you use, and what fields must the rollup table contain?
Model answer:
To retain long-term trends while purging raw data, we perform hourly or daily rollups. For long-term trend analysis, an hourly rollup window reduces the data volume by 360× (from 10s intervals to 1h intervals) while preserving temperature dynamics.
Rollup Table Schema (SQL):
CREATE TABLE metric_hourly_rollups (
device_id VARCHAR(64) NOT NULL,
metric_id INT NOT NULL,
hour_timestamp TIMESTAMP NOT NULL,
reading_count INT NOT NULL, -- Tracks number of raw events rolled up
min_value DOUBLE PRECISION, -- Minimum value in the hour
max_value DOUBLE PRECISION, -- Maximum value in the hour
avg_value DOUBLE PRECISION, -- Arithmetic mean
PRIMARY KEY (device_id, metric_id, hour_timestamp)
);
Key Design requirements:
- Separate metrics: Avoid storing raw points. Roll up values using `min`, `max`, and `avg`.
- Keep counts: Storing `reading_count` is essential if you ever need to combine multiple hourly rollups into a daily rollup (to calculate a weighted average: $\frac{\sum (\text{avg} \times \text{count})}{\sum \text{count}}$).
- Job scheduling: A cron task or continuous background trigger (e.g. TimescaleDB continuous aggregates) runs this query hourly:
INSERT INTO metric_hourly_rollups SELECT device_id, metric_id, date_trunc('hour', time) as hour_timestamp, COUNT(*), MIN(value), MAX(value), AVG(value) FROM raw_device_readings WHERE time >= NOW() - INTERVAL '2 hours' AND time < NOW() - INTERVAL '1 hour' GROUP BY device_id, metric_id, hour_timestamp;
Key takeaways
- **Telemetry APIs are write-heavy**. Avoid verbose HTTP structures and JSON text formats; use binary protocols (MQTT/CoAP) and Protobuf payload structures.
- **Decouple writes with a broker**. Run stateless edge ingest gateways that write directly to a message queue (Kafka) to guarantee sub-10ms write responses.
- **Index in memory**. B-Tree indexes on large tables fail at scale. Use a Time-Series Database (TSDB) that partitions tables into time chunks so indices fit in RAM.
- **Incorporate delta-compression**. TSDB storage engines use delta-of-delta compression (Gorilla) to reduce numeric telemetry storage by 90%+.
- **Implement Rollup Retention**. Purge high-resolution raw data after a short window (e.g., 30 days) and keep consolidated summaries (hourly averages, min/max) for long-term analytics.
Sources & further reading
- Gorilla Paper (Facebook) — A Fast, In-Memory Time Series Database — describes double-delta timestamp compression and XOR floating point compression
- TimescaleDB Architecture — Hypertables & Chunking — explanation of memory-resident index scaling for time-series
- MQTT Specification & Protocols — explaining the light-overhead design parameters for constrained networks