API Design

Reliability & Scale · Lesson 09

Monitoring & Observability

An API you cannot measure is an API you cannot trust. Observability is the engineering discipline of making a system's internal state inferable from the signals it emits — so you can answer "is it healthy?" without logging into the box to look.

⏱ 14 min Difficulty: core Prereq: SLIs, SLOs & SLAs

By the end you'll be able to

The dashboard that lies to you

A server that responds with HTTP 200 to every request can still be quietly wrong: returning stale data, taking 8 seconds per call, silently dropping one in fifty writes. A CPU graph at 30% doesn't catch any of that. Monitoring that shows only "server up / server down" is a smoke detector with no batteries — it passes every test until the house is on fire.

Observability is the practice of instrumenting a system so you can ask any question about its behaviour and get a real answer. The three complementary tools that make this possible are metrics, logs, and traces — the "three pillars."

[Figure: Your API emits three kinds of signal. Metrics: numbers over time ("Is it fast / busy?"). Logs: timestamped events ("What happened?"). Traces: request journeys ("Why is it slow?").]
Each pillar answers a different question. You need all three because none of them fully replaces the others.

The three pillars in depth

Metrics are numeric measurements sampled over time: request rate (req/s), p99 latency (ms), error count, CPU %. They're cheap to store as aggregates, easy to graph and alert on, and excellent for spotting trends. They cannot tell you why something is wrong — just that something is.

Logs are discrete, timestamped records of events: "user 42 attempted login, result: bad password." They contain rich context — user ID, request body, stack traces — and are invaluable for debugging specific incidents. The cost is storage: a busy service can emit gigabytes of logs per hour, so selective retention and structured formatting matter.

Traces follow a single request as it crosses service boundaries — from the edge load balancer through the API server to the database and back. Each hop produces a span with a start time and duration. The tree of spans is a trace. Traces reveal exactly which step is slow in a chain of microservice calls — the thing metrics and logs cannot show directly.

The four golden signals

Google's SRE book distils everything you need to measure about a user-facing service into four signals. If you only have time to instrument four things, instrument these:

Signal | What it measures | Typical metric
Latency | How long requests take; distinguish successful from failed (a fast 500 is not success) | p50 / p99 / p999 response time
Traffic | Demand on the system | HTTP requests per second
Errors | Rate of failed requests, both explicit (5xx) and implicit (200 with wrong data) | Error rate % = 5xx / total
Saturation | How "full" the service is; a leading indicator before failure | CPU %, queue depth, memory used / total

RED and USE: two practical checklists

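Two complementary methods reduce "what should I monitor?" to a short checklist: RED (Rate, Errors, Duration) for services that handle requests, and USE (Utilisation, Saturation, Errors) for the infrastructure resources underneath them.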

Apply RED to every service endpoint; apply USE to every shared resource underneath it. Together they cover almost everything that causes incidents.
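As a rough illustration of the RED half of that checklist, the sketch below derives Rate, Errors and Duration for one endpoint from a list of finished requests. The `RequestRecord` type, the `red_summary` helper and the nearest-rank p99 are assumptions made for this example; a real service would normally export counters and histograms rather than keep raw records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestRecord:
    """One finished request (hypothetical shape, for illustration only)."""
    status: int
    duration_ms: float


def red_summary(records: list[RequestRecord], window_seconds: float) -> dict:
    """Compute Rate, Errors and Duration (p99) over one time window."""
    total = len(records)
    errors = sum(1 for r in records if r.status >= 500)
    durations = sorted(r.duration_ms for r in records)

    p99: Optional[float] = None
    if durations:
        # Nearest-rank percentile: ceil(0.99 * total), done in integer maths.
        rank = (99 * total + 99) // 100
        p99 = durations[rank - 1]

    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p99_ms": p99,
    }
```

The same three numbers, computed per endpoint, are usually enough for a first service dashboard.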

SLO-driven alerting

An alert that fires every time a metric crosses a hard threshold ("alert if p99 > 200 ms") is a recipe for alert fatigue: it fires during routine traffic spikes, wakes engineers unnecessarily, and trains people to ignore it. SLO-driven alerting inverts the question: instead of "is this metric unusual?" it asks "are we on track to exhaust this month's error budget?" (see the prerequisite lesson on SLIs, SLOs & SLAs).

The practical pattern is the burn-rate alert. If your SLO is 99.9% availability over 30 days, your error budget is 0.1% of 43,200 minutes, i.e. 43.2 minutes of downtime. A burn rate of 1× means you'd exhaust it in exactly 30 days. A burn rate of 14× means you'd exhaust it in 720 h / 14, about 51 hours. Alerting on a high burn rate (e.g. >14× for 5 minutes) catches serious degradations quickly; a low sustained burn rate (e.g. >3× for 30 minutes) catches slow erosion before the budget disappears.
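The arithmetic behind a burn-rate alert is small enough to sketch directly. The function names below are hypothetical, and the sketch assumes the error ratio has already been measured over whatever alert window you chose.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int) -> float:
    """Hours until the whole budget is spent if this burn rate is sustained."""
    return window_days * 24 / rate


def should_page(observed_error_ratio: float, slo: float, threshold: float = 14.0) -> bool:
    """Page when the measured burn rate exceeds the chosen threshold."""
    return burn_rate(observed_error_ratio, slo) > threshold
```

With a 99.9% SLO, an error ratio of 2% over the alert window is a burn rate of about 20×, which would page; an error ratio of 0.2% is about 2× and would not.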

Distributed tracing: traces, spans, and correlation IDs

When a single user action travels through five microservices, a stack trace from service 3 alone isn't enough — you need the whole journey. Distributed tracing reconstructs that journey by attaching a trace ID to the incoming request and propagating it in every outbound call as an HTTP header (traceparent in the W3C standard, X-B3-TraceId in Zipkin). Each service creates a span and reports it (with the trace ID, a parent span ID, timing, and metadata) to a tracing back-end like Jaeger or Tempo.

[Figure: trace waterfall for trace-id f3a8b…, total 310 ms, drawn on a 0 ms to 310 ms time axis. Spans: api-gateway 0 ms → 310 ms; auth-service 12 ms → 70 ms; orders-service 80 ms → 280 ms; db-query (postgres) 140 ms → 250 ms; cache 82 ms → 98 ms.]
A trace waterfall: the root span wraps the entire request. Child spans show the exact timing of auth, a DB query, and a cache call. The wide DB span (110 ms) is immediately visible as the slowest step.

A correlation ID (also called a request ID) is a simpler version of the same idea for monoliths or two-service architectures: generate a UUID at the edge, pass it in a header (X-Request-Id), and emit it in every log line. Now you can grep all logs for one ID and reconstruct the full lifecycle of a single request across log files.
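A minimal sketch of the edge-side handling, assuming headers arrive as a plain dictionary: reuse an inbound X-Request-Id if one is present, otherwise mint a UUID, and copy it onto every outbound call. The helper names are hypothetical and not tied to any particular web framework.

```python
import uuid
from typing import Optional

REQUEST_ID_HEADER = "X-Request-Id"


def ensure_request_id(headers: dict[str, str]) -> str:
    """Return the inbound request ID, or generate a new UUID at the edge."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == REQUEST_ID_HEADER.lower() and value:
            return value
    return str(uuid.uuid4())


def outbound_headers(request_id: str, extra: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Build headers for a downstream call, always carrying the request ID."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers
```

Whether to trust an ID supplied by an external client is a design choice; some teams always generate a fresh one at the edge and only accept inbound IDs from internal callers.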

Structured logging

Unstructured logs are English sentences: "User 42 logged in from 1.2.3.4 at 2025-09-01 10:00". Parsing that with regex is fragile. Structured logs emit machine-readable key-value records (JSON being the standard) so log aggregation platforms can filter, group, and alert on any field without text parsing:

{
  "timestamp": "2025-09-01T10:00:00Z",
  "level":     "INFO",
  "service":   "auth-service",
  "event":     "user.login",
  "user_id":   42,
  "ip":        "1.2.3.4",
  "trace_id":  "f3a8b9c1d2e3",
  "duration_ms": 58
}

Every field is queryable. Finding all slow logins from a specific IP is now level=INFO AND event=user.login AND duration_ms > 500 AND ip=1.2.3.4 rather than a grep pipeline.
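One way to emit records of this shape is a small formatter on top of Python's standard logging module. This is a sketch under assumptions: the `JsonFormatter` class, the `make_logger` helper and the convention of passing event fields through `extra={"fields": ...}` are invented for this example rather than taken from any logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "event": record.getMessage(),
        }
        # Extra key-value pairs supplied by the caller (our own convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


def make_logger(name: str, service: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `log.info("user.login", extra={"fields": {"user_id": 42, "trace_id": "f3a8b9c1d2e3", "duration_ms": 58}})` then produces one JSON line with every field queryable.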

🎯 Interview angle

"How do you know your API is healthy?" is a reliability interview staple. A strong answer covers all three pillars (metrics for alerting, logs for debugging, traces for distributed bottlenecks), names the four golden signals (latency / traffic / errors / saturation), and ties alerting to SLOs rather than arbitrary thresholds. Bonus points: mention the difference between knowing something is wrong (alerting on error budget burn rate) versus diagnosing why (trace waterfall shows the slow span).

⚠️ Common trap

Two in one: alert fatigue and logging secrets. Threshold-based alerts that fire on every spike teach engineers to mute them — the real incident gets ignored in the noise. SLO burn-rate alerting fires rarely and always matters. Separately: structured logs capture rich context, but "rich context" must never include passwords, API keys, payment card numbers, or other PII. Log the user ID, not the session token; log "payment attempted", not the card number. Audit your log fields before they reach a third-party aggregator.
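A simple guard is to pass every set of log fields through a redaction step before it is serialised. The sketch below uses a hypothetical deny-list of key names; the list is an assumption and would need to match the field names your own services use.

```python
# Hypothetical deny-list; extend it to match your own field names.
SENSITIVE_KEYS = {
    "password", "api_key", "card_number", "cvv", "session_token", "authorization",
}


def redact(fields: dict) -> dict:
    """Return a copy of the log fields with sensitive values masked."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            # Recurse so nested payloads are covered too.
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```

A deny-list only catches names you thought of in advance; an allow-list of fields that may be logged is stricter, at the cost of more maintenance.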

✅ Do this, not that

Do add a trace_id/request_id field to every log line emitted inside a request handler — it costs almost nothing and saves hours when debugging a production incident. Don't log unstructured strings and then try to parse them with regex later; structured JSON logs from day one pay compounding dividends in log query time.

Under the hood: how telemetry is actually collected

Counters vs histograms — and where p99 comes from

Prometheus (the de-facto standard) has four metric types. The two you see most often in API monitoring are Counter and Histogram.

A Counter is a monotonically increasing number: it only goes up, or resets to zero when the process restarts. You increment it when something happens: a request completes, an error fires, a payment is processed. The raw value is meaningless on its own; what matters is its rate of change. PromQL's rate() function computes the average per-second increase over a sliding window:

# Counter: http_requests_total{status="200"} = 4 817 390
# 5 minutes later:                             4 819 580
rate(http_requests_total{status="200"}[5m])
  → (4819580 - 4817390) / 300 = 7.3 req/s

A Histogram tracks value distributions — request durations, payload sizes. Instead of storing every measurement, it increments a set of pre-defined buckets. A request of 87 ms increments every bucket whose upper bound is >= 87 ms:

# Histogram buckets for http_request_duration_seconds:
le=0.05  → 2 100 requests (completed in ≤ 50 ms)
le=0.1   → 5 300 requests (completed in ≤ 100 ms)
le=0.25  → 7 800 requests (completed in ≤ 250 ms)
le=0.5   → 8 200 requests (completed in ≤ 500 ms)
le=1     → 8 220 requests (completed in ≤ 1 000 ms)
le=+Inf  → 8 222 requests (total)

# p99 = the latency below which 99% of requests fell
# 99th percentile rank = 8222 * 0.99 = 8139.78 → the 8140th fastest request
# 8140 falls between le=0.25 (7800) and le=0.5 (8200)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
  → ~0.46 s (linear interpolation between the 0.25 and 0.5 buckets)

# This is approximate: histograms trade precision for low storage cost.

The key insight: averages hide tail latency. A mean of 50 ms can coexist with a p99 of 2 000 ms if a small fraction of requests hit a slow code path. Histograms make the tail visible. Choose bucket boundaries that straddle your SLO threshold (if your SLO is 99% of requests under 300 ms, put a bucket at 0.3).
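To make the cumulative-bucket behaviour concrete, here is a small sketch of a histogram that counts observations the way the le buckets above do. The `Histogram` class is a simplified stand-in written for this example, not the Prometheus client library.

```python
import bisect


class Histogram:
    """Cumulative-bucket histogram in the style of Prometheus 'le' buckets."""

    def __init__(self, upper_bounds: list[float]) -> None:
        # The +Inf bucket always exists and counts every observation.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.counts = [0] * len(self.upper_bounds)
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value.
        first = bisect.bisect_left(self.upper_bounds, value)
        for i in range(first, len(self.counts)):
            self.counts[i] += 1
        self.sum += value

    def mean(self) -> float:
        total = self.counts[-1]
        return self.sum / total if total else 0.0
```

Feeding it 98 requests of 30 ms and 2 requests of 2 s gives a mean just under 70 ms, while the buckets show that 2% of requests never made it under 1 s: the tail the average hides.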

Trace context propagation — how a traceparent header carries a trace across services

The W3C Trace Context spec defines a header, traceparent, that encodes four fields as a hyphen-separated string:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

  00                                version
  4bf92f3577b34da6a3ce929d0e0e4736  trace-id (128-bit hex)
  00f067aa0ba902b7                  parent-span-id
  01                                flags (01 = sampled)

Here is what happens when a request crosses three services — from creation to span assembly in the tracing back-end:

# 1. Edge / API Gateway receives an inbound request with NO traceparent.
#    It generates a new trace-id and its own span-id:
trace-id = 4bf92f3577b34da6a3ce929d0e0e4736   (random 128-bit value)
span-id  = 00f067aa0ba902b7                   (random 64-bit value)
Records: {trace_id, span_id, parent=null, service="api-gw", start=0ms}

# 2. API Gateway calls Orders Service; injects traceparent as an HTTP header:
GET /orders/99 HTTP/1.1
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

Orders Service receives this and notes the parent-span-id (00f067aa0ba902b7).
It creates its own span: {trace_id=same, span_id=ab12cd34ef567890, parent=00f067aa...}

# 3. Orders Service calls Postgres; injects traceparent again,
#    now with its own span-id in the parent position:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-ab12cd34ef567890-01
DB span: {trace_id=same, span_id=ff00ee11, parent=ab12cd34..., start=140ms, end=250ms}

# 4. All three services emit their spans to the tracing back-end (Jaeger/Tempo).
#    The back-end assembles them by trace-id and produces the waterfall you saw earlier.
root: api-gw (0→310ms)
└─ orders-service (80→280ms)
   └─ db-query (140→250ms)   ← widest child = the bottleneck

Every service only needs to: (1) read the incoming traceparent header; (2) create a child span using that trace-id and the received span-id as its parent; (3) propagate the header outbound with its own span-id as the new parent. OpenTelemetry instrumentation libraries handle all three for most popular frameworks.
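The three steps can be sketched by hand to show what instrumentation libraries do for you. This is a simplified illustration, not OpenTelemetry's API: the `parse_traceparent` and `child_traceparent` helpers are hypothetical, the parser only accepts the four-field layout shown above, and starting every new trace with flags 01 (sampled) is an assumption.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Step 1: read the incoming header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    ctx = TraceContext(*match.groups())
    # Version ff and all-zero IDs are invalid.
    if ctx.version == "ff" or set(ctx.trace_id) == {"0"} or set(ctx.parent_id) == {"0"}:
        return None
    return ctx


def child_traceparent(incoming: Optional[str]) -> tuple[str, str, Optional[str]]:
    """Steps 2 and 3: create a child span ID and build the outbound header.

    Returns (outbound_header, own_span_id, parent_span_id).
    """
    ctx = parse_traceparent(incoming) if incoming else None
    span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    if ctx is None:
        # No usable context: this service becomes the root of a new trace.
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    else:
        trace_id, parent_id, flags = ctx.trace_id, ctx.parent_id, ctx.flags
    return f"00-{trace_id}-{span_id}-{flags}", span_id, parent_id
```

The span a service reports carries `own_span_id` and `parent_span_id`; the header it sends downstream carries `own_span_id` in the parent position.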

How to debug & inspect monitoring

PromQL availability + p99 queries

# Availability: fraction of successful requests in the last 5 minutes.
# sum() collapses the label sets so both sides of the division match.
1 - (
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
)
  → 0.9994 (99.94% availability)

# p99 latency across all endpoints:
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
  → 0.38 (380 ms)

# p99 broken out per endpoint, to find the slow one:
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler)
)
  handler="/v1/orders"   → 0.38 s
  handler="/v1/payments" → 1.92 s   ← this is the outlier
  handler="/v1/healthz"  → 0.004 s

# Error-budget burn rate (SLO = 99.9%, window = 1h):
burn_rate = (1 - availability_1h) / (1 - 0.999)
# If burn_rate > 14 → page immediately (exhausts the 30-day budget in ~51 h)

Reading a trace waterfall to find the slow span

Open the trace in Jaeger, Tempo, or Honeycomb. Sort spans by duration descending. The widest child span is your bottleneck. The gap between spans (white space) is time the parent spent computing locally — narrow gaps and a wide child mean the child is the problem; wide gaps and narrow children mean the parent's own code is slow.

# Trace 4bf92f... (total 310 ms)
api-gateway         0 ms ──────────────────────── 310 ms   (310 ms total)
  auth-service     12 ms ──── 70 ms                        (58 ms, acceptable)
  orders-service   80 ms ──────────────────── 280 ms       (200 ms total)
    db-query      140 ms ──────────── 250 ms               (110 ms, WIDE = suspect)
    cache          82 ms ── 98 ms                          (16 ms, fast)

# Gap analysis inside orders-service:
#   80→82 ms:    2 ms own code before the cache call (fine)
#   98→140 ms:  42 ms own code between the cache and DB calls (investigate)
#   250→280 ms: 30 ms own code after the DB returns (investigate)
# Action: add a child span around the 98→140 ms gap to find what's happening.
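The gap analysis can be automated once spans are available as start and end times. The sketch below is a hypothetical helper, assuming times in milliseconds and children that lie inside the parent span; it returns the intervals in which the parent had no child running.

```python
Interval = tuple[float, float]


def self_time_gaps(parent: Interval, children: list[Interval]) -> list[Interval]:
    """Return the intervals inside `parent` not covered by any child span."""
    gaps: list[Interval] = []
    cursor = parent[0]
    for start, end in sorted(children):
        if start > cursor:
            gaps.append((cursor, start))
        # Children may overlap, so never move the cursor backwards.
        cursor = max(cursor, end)
    if cursor < parent[1]:
        gaps.append((cursor, parent[1]))
    return gaps


# orders-service span with its cache and db-query children (times in ms)
orders_gaps = self_time_gaps((80, 280), [(140, 250), (82, 98)])
```

Sorting the resulting gaps by length points straight at the stretch of the parent's own code that deserves a new child span.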

Using correlation/request IDs to stitch logs

If distributed tracing is not yet set up, a request ID in every log line provides the same cross-service join:

$ grep "req-f3a8b9c1" /var/log/api-gateway.log /var/log/auth.log /var/log/orders.log
/var/log/api-gateway.log:{"ts":"10:00:00.000","req_id":"req-f3a8b9c1","event":"request.received","path":"/v1/orders/99"}
/var/log/auth.log:{"ts":"10:00:00.070","req_id":"req-f3a8b9c1","event":"token.verified","user_id":42,"dur_ms":58}
/var/log/orders.log:{"ts":"10:00:00.080","req_id":"req-f3a8b9c1","event":"order.fetch.start"}
/var/log/orders.log:{"ts":"10:00:00.140","req_id":"req-f3a8b9c1","event":"db.query.start","query":"SELECT * FROM orders WHERE id=99"}
/var/log/orders.log:{"ts":"10:00:00.250","req_id":"req-f3a8b9c1","event":"db.query.done","dur_ms":110}
# The 110 ms db.query.done confirms what the trace showed.

Symptom | Likely cause | Fix
Alert fires but Grafana shows nothing unusual | Alert is on a raw counter, not a rate; or the time window is too short to show in the graph | Use rate() in both the alert rule and the graph; align time ranges
p99 is high but average looks fine | Tail latency from a slow code path (N+1 query, cache miss, downstream timeout) affecting a minority of requests | Break p99 down by handler label; use traces on the slow requests to find the wide span
p99 in the tracing tool differs from the Prometheus p99 | Tracing is sampled (e.g. 1% of requests); the Prometheus histogram counts every request | Expected: traces are sampled for cost reasons, while the histogram counts every request (within bucket resolution). Use metrics for alerting, traces for diagnosis
Spans arrive in the back-end with broken parent links (orphan spans) | One service strips the traceparent header before forwarding (proxy, WAF, or middleware resetting headers) | Audit every hop that transforms headers; explicitly pass traceparent through proxies/load balancers
Logs from different services can't be correlated for a single request | No request ID generated, or ID not propagated to downstream services | Generate a UUID at the edge; pass it as the X-Request-Id header; read it and include it in every log line

Debug checklist for "p99 is high, where is it?"

  1. Run histogram_quantile(0.99, ...) broken out by handler — identify which endpoint is slow.
  2. Open a trace for a slow request in Jaeger/Tempo — sort child spans by duration.
  3. Find the widest span; check the gap before and after it for hidden local computation.
  4. Grep service logs by trace_id or request_id to get the log-level context around the slow span.
  5. If no trace is available: add a span around the suspected code path; re-deploy; re-measure.

By the numbers

Make the math concrete. Scenario: a payment API at 5,000 req/s, 1% trace sampling, Prometheus histograms with 6 latency buckets.

p99 via linear interpolation from histogram buckets

Prometheus histograms store cumulative counts per bucket. To find the p99 you locate the bucket that straddles the 99th-percentile rank and linearly interpolate within it.

# http_request_duration_seconds buckets (cumulative counts over 5 min):
le=0.05  → 2,100 requests (≤ 50 ms)
le=0.1   → 5,300 requests (≤ 100 ms)
le=0.25  → 7,800 requests (≤ 250 ms)
le=0.5   → 8,200 requests (≤ 500 ms)
le=1.0   → 8,220 requests (≤ 1,000 ms)
le=+Inf  → 8,222 requests (total)

Step 1: p99 rank
  total × 0.99 = 8,222 × 0.99 = 8,139.78 → the 8,140th request

Step 2: locate the bucket
  8,140 > 7,800 (le=0.25) and 8,140 ≤ 8,200 (le=0.5)
  → p99 falls inside the (0.25 s, 0.5 s] bucket

Step 3: interpolate
  bucket_low  = 0.25 s, count at low  = 7,800
  bucket_high = 0.50 s, count at high = 8,200
  p99 ≈ 0.25 + (8,140 − 7,800) / (8,200 − 7,800) × (0.50 − 0.25)
      = 0.25 + (340 / 400) × 0.25
      = 0.25 + 0.2125
      ≈ 0.46 s (460 ms)

This is the value histogram_quantile(0.99, ...) returns. The result is an approximation — the true p99 lives somewhere in [250 ms, 500 ms]. Bucket boundaries are the knobs: if your SLO is "99% of requests under 300 ms," add a bucket at le=0.3 to bracket the threshold precisely. (Prometheus histogram best practices)
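The interpolation can be written out as a short function. This is a simplified sketch of the idea only: the function name mirrors the PromQL one for readability, it works on plain cumulative counts rather than per-second rates, and it uses the exact rank 8,139.78 instead of rounding up to 8,140.

```python
import math

Bucket = tuple[float, int]  # (upper bound in seconds, cumulative count)


def histogram_quantile(q: float, buckets: list[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets sorted by upper bound.

    The last bucket must be the +Inf bucket holding the total count.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # No width to interpolate over: report the lower edge.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


BUCKETS = [
    (0.05, 2100), (0.1, 5300), (0.25, 7800),
    (0.5, 8200), (1.0, 8220), (math.inf, 8222),
]
p99 = histogram_quantile(0.99, BUCKETS)
```

The result lands at roughly 0.46 s, matching the hand calculation; the estimate assumes requests are spread evenly inside the bucket, which is exactly why narrower buckets near the SLO threshold matter.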

Counter rate formula

rate = Δcount / Δt

# Example: http_requests_total{status="200"}
t=0:    counter = 4,817,390
t=300s: counter = 4,819,580   (Δcount = 2,190 over Δt = 300 s)
rate = 2,190 / 300 = 7.3 req/s

# PromQL:
rate(http_requests_total{status="200"}[5m]) → 7.3
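A sketch of the same formula in code, extended to handle a counter that restarts from zero inside the window. The `counter_rate` helper is hypothetical; the real rate() also extrapolates to the edges of the window, which this sketch leaves out.

```python
Sample = tuple[float, float]  # (timestamp in seconds, counter value)


def counter_rate(samples: list[Sample]) -> float:
    """Average per-second increase across samples, oldest first."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter went down, so the process restarted from zero.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

Treating any decrease as a reset is why counters must never be decremented in application code: a deliberate decrement would be counted as a restart.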

Sampling: how many traces does 1% capture?

At 5,000 req/s and 1% head sampling:

traces_per_second = total_QPS × sample_rate = 5,000 × 0.01 = 50 traces/s
traces_per_minute = 50 × 60    = 3,000 traces/min
traces_per_hour   = 50 × 3,600 = 180,000 traces/hr

Is 50 traces/s enough to compute p99 from traces? Not reliably. p99 requires a representative sample of the tail, and with 50 traces/s sampled uniformly you expect to capture roughly 50 × 0.01 = 0.5 of the slowest 1% per second, i.e. about one slow trace every two seconds. Histograms (which measure all 5,000 req/s) are better for p99; traces are better for diagnosing why a specific slow request was slow.

Sampling rate | Traces/s @ 5,000 req/s | Tail trace capture (slowest 1%) per second | Storage at 10 KB/trace
0.1% (1 in 1,000) | 5 | ~0.05/s | ~4.3 GB/day
1% (1 in 100) | 50 | ~0.5/s | ~43 GB/day
10% (1 in 10) | 500 | ~5/s | ~432 GB/day
100% (no sampling) | 5,000 | 50/s | ~4.3 TB/day

Decision math — sampling rate vs storage: at 1% sampling and 10 KB/trace, daily storage is ~43 GB, which is reasonable for most teams. To actually debug tail latency, use tail-based sampling (keep 100% of requests that took >500 ms, drop the fast ones) rather than head-based sampling. Tail sampling keeps the traces that matter and controls storage. (OpenTelemetry — Sampling concepts)
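The storage arithmetic and the tail-sampling decision can be sketched together. The function names, the decimal-gigabyte convention and the optional `baseline_rate` (a small random share of fast requests kept for comparison) are assumptions made for this example; a real collector makes the keep-or-drop decision after the whole trace has been assembled.

```python
import random
from typing import Callable


def traces_per_second(qps: float, sample_rate: float) -> float:
    """Head sampling: a fixed fraction of all requests is traced."""
    return qps * sample_rate


def storage_gb_per_day(traces_s: float, kb_per_trace: float = 10.0) -> float:
    """Daily trace storage in decimal gigabytes."""
    return traces_s * kb_per_trace * 86_400 / 1_000_000


def keep_trace(
    duration_ms: float,
    slow_threshold_ms: float = 500.0,
    baseline_rate: float = 0.0,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Tail sampling: decide after the request finishes whether to keep it."""
    if duration_ms > slow_threshold_ms:
        return True  # always keep slow requests
    # Optionally keep a small random share of fast requests.
    return rng() < baseline_rate
```

With the default `baseline_rate` of 0.0, storage scales with the number of slow requests rather than with total traffic.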

Decision math: bucket boundaries for useful p99 resolution

If your SLO threshold is 300 ms and your buckets jump from 250 ms to 500 ms, you cannot tell whether p99 is 260 ms (well inside SLO) or 490 ms (more than 60% over it). The interpolation uncertainty is the entire bucket width. Add boundaries that straddle your SLO:

# SLO: 99% of requests under 300 ms
# Without an le=0.3 bucket: uncertainty = ±125 ms (half the 250–500 ms bucket)
# With an le=0.3 bucket added: p99 is pinned to ≤ 300 ms or > 300 ms immediately

# Recommended buckets for a latency SLO at 300 ms:
[0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 5.0, +Inf]
#                      ^^^ SLO boundary here
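A quick way to see why the boundary matters: when the SLO threshold is itself a bucket boundary, the fraction of requests within the SLO is an exact ratio of two counts and needs no interpolation. The `fraction_within` helper below is a hypothetical sketch that returns None when no bucket ends at the threshold.

```python
import math
from typing import Optional

Bucket = tuple[float, int]  # (upper bound in seconds, cumulative count)


def fraction_within(threshold_s: float, buckets: list[Bucket]) -> Optional[float]:
    """Exact share of requests at or under the threshold, if a bucket ends there."""
    total = buckets[-1][1]
    if total == 0:
        return None
    for upper_bound, cumulative in buckets:
        if math.isclose(upper_bound, threshold_s):
            return cumulative / total
    # No boundary at the threshold: only an interpolated estimate is possible.
    return None


COARSE = [
    (0.05, 2100), (0.1, 5300), (0.25, 7800),
    (0.5, 8200), (1.0, 8220), (math.inf, 8222),
]
```

Against the coarse buckets, `fraction_within(0.3, COARSE)` returns None, whereas `fraction_within(0.25, COARSE)` returns an exact ratio; adding an le=0.3 bucket gives the 300 ms SLO the same exactness.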

🧠 Quick check

1. You have a spike in p99 latency but no change in error rate. Which pillar tells you which service call is causing the slowdown?

Traces stitch together the timing of each service call in a single waterfall view. A wide span immediately identifies the bottleneck. Metrics tell you latency is high; traces tell you which hop caused it.

2. The RED method applies to which kind of system component?

RED (Rate, Errors, Duration) is designed for services that handle requests. USE (Utilisation, Saturation, Errors) is for the infrastructure resources underneath them.

3. What is the main advantage of SLO burn-rate alerting over simple threshold alerts?

A threshold alert fires every time a metric crosses a line, including harmless blips. A burn-rate alert fires only when you're consuming your error budget fast enough to exhaust it before the SLO window ends, so every page represents a real threat to the SLO.

4. Which of the four golden signals is the best leading indicator — it warns you before user requests start failing?

Saturation measures how full a resource is: a connection pool at 95% or a CPU pegged at 98% will cause errors soon. Measuring saturation lets you scale up or shed load before users feel anything.

✍️ Exercise: instrument a payment API — what would you measure and why?

You're the first engineer at a startup building a payment processing API. You have one week to add meaningful observability before launch. Describe exactly what metrics, logs, and traces you would add, and write the alert(s) you would configure.

Model answer:

  1. Metrics (RED + golden signals):
    • payment_requests_total (counter, labels: status=success|declined|error): traffic + errors
    • payment_request_duration_seconds (histogram, buckets at 0.1s, 0.5s, 1s, 3s, 5s; the 3s boundary matches the SLO threshold): latency
    • payment_queue_depth (gauge): saturation leading indicator
  2. Structured logs (never log payment data):
    • Every request: {"event":"payment.attempt","payment_id":"pay_…","amount_cents":1000,"currency":"USD","trace_id":"…","duration_ms":80}
    • No card numbers, CVVs, or raw bank account numbers — ever
  3. Traces: Propagate traceparent through every hop (payment processor call → fraud check → DB write). This immediately reveals whether the third-party processor is the latency source.
  4. Alerts:
    • SLO: 99.9% of payments succeed within 3 s. Error-budget burn-rate alert: >14× for 5 min → page on-call.
    • Saturation: payment queue depth > 1000 for 2 min → warning (scale out before it becomes an error).

Rubric: ✓ uses RED method ✓ histogram for latency (not just average) ✓ structured logs with trace ID ✓ explicitly excludes PII from logs ✓ burn-rate alert tied to SLO ✓ saturation metric as leading indicator. Five or more = strong answer.

Key takeaways

Sources & further reading