Reliability & Scale · Lesson 09
Monitoring & Observability
An API you cannot measure is an API you cannot trust. Observability is the engineering discipline of making a system's internal state inferable from the signals it emits — so you can answer "is it healthy?" without logging into the box to look.
By the end you'll be able to
- Name the three pillars of observability and what question each one answers.
- Apply the four golden signals and the RED / USE methods to decide what to measure.
- Explain how a distributed trace is built from spans, and why correlation IDs matter for debugging.
The dashboard that lies to you
A server that responds with HTTP 200 to every request can still be quietly wrong: returning stale data, taking 8 seconds per call, silently dropping one in fifty writes. A CPU graph at 30% doesn't catch any of that. Monitoring that shows only "server up / server down" is a smoke detector with no batteries — it passes every test until the house is on fire.
Observability is the practice of instrumenting a system so you can ask any question about its behaviour and get a real answer. The three complementary tools that make this possible are metrics, logs, and traces — the "three pillars."
The three pillars in depth
Metrics are numeric measurements sampled over time: request rate (req/s), p99 latency (ms), error count, CPU %. They're cheap to store as aggregates, easy to graph and alert on, and excellent for spotting trends. They cannot tell you why something is wrong — just that something is.
Logs are discrete, timestamped records of events: "user 42 attempted login, result: bad password." They contain rich context — user ID, request body, stack traces — and are invaluable for debugging specific incidents. The cost is storage: a busy service can emit gigabytes of logs per hour, so selective retention and structured formatting matter.
Traces follow a single request as it crosses service boundaries — from the edge load balancer through the API server to the database and back. Each hop produces a span with a start time and duration. The tree of spans is a trace. Traces reveal exactly which step is slow in a chain of microservice calls — the thing metrics and logs cannot show directly.
The four golden signals
Google's SRE book distils everything you need to measure about a user-facing service into four signals. If you only have time to instrument four things, instrument these:
| Signal | What it measures | Typical metric |
|---|---|---|
| Latency | How long requests take — distinguish successful from failed (a fast 500 is not success) | p50 / p99 / p999 response time |
| Traffic | Demand on the system | HTTP requests per second |
| Errors | Rate of failed requests — both explicit (5xx) and implicit (200 with wrong data) | Error rate % = 5xx / total |
| Saturation | How "full" the service is — leading indicator before failure | CPU %, queue depth, memory used / total |
RED and USE: two practical checklists
Two complementary methods reduce "what should I monitor?" to a short checklist:
- RED (for services / APIs): Rate, Errors, Duration. A microservice is healthy if its request rate is normal, its error rate is low, and its response durations are within SLO.
- USE (for resources — CPU, disks, queues): Utilisation, Saturation, Errors. A database connection pool is healthy if utilisation is under ~70%, the queue waiting for a connection (saturation) is near zero, and no connection errors occur.
Apply RED to every service endpoint; apply USE to every shared resource underneath it. Together they cover almost everything that causes incidents.
SLO-driven alerting
An alert that fires every time a metric crosses a hard threshold ("alert if p99 > 200 ms") is a recipe for alert fatigue — it fires during routine traffic spikes, wakes engineers unnecessarily, and trains people to ignore it. SLO-driven alerting inverts the question: instead of "is this metric unusual?" it asks "are we on track to exhaust this month's error budget?" (see Lesson 10 on SLOs).
The practical pattern is the burn-rate alert. If your SLO is 99.9% availability over 30 days, your error budget is 43.2 minutes of downtime. A burn rate of 1× means you'd exhaust it in exactly 30 days. A burn rate of 14× means you'd exhaust it in ~51 hours (720 h ÷ 14). Alerting on a high burn rate (e.g. >14× for 5 minutes) catches serious degradations quickly; a low sustained burn rate (e.g. >3× for 30 minutes) catches slow erosion before the budget disappears.
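The burn-rate arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names are made up for illustration and are not part of any alerting tool.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed unavailability inside the SLO window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return observed_error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int) -> float:
    """Hours until the budget is gone if the current burn rate holds."""
    return window_days * 24 / rate


budget = error_budget_minutes(0.999, 30)       # 99.9% over 30 days
current = burn_rate(0.014, 0.999)              # 1.4% of requests failing right now
remaining = hours_to_exhaustion(current, 30)   # time left at that pace
print(f"budget={budget:.1f} min, burn={current:.1f}x, exhausted in {remaining:.1f} h")
```

A 1.4% error ratio against a 0.1% budget is a 14× burn, which is why a short evaluation window is enough to justify paging.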
Distributed tracing: traces, spans, and correlation IDs
When a single user action travels through five microservices, a stack trace from service 3 alone isn't enough — you need the whole journey. Distributed tracing reconstructs that journey by attaching a trace ID to the incoming request and propagating it in every outbound call as an HTTP header (traceparent in the W3C standard, X-B3-TraceId in Zipkin). Each service creates a span and reports it (with the trace ID, a parent span ID, timing, and metadata) to a tracing back-end like Jaeger or Tempo.
A correlation ID (also called a request ID) is a simpler version of the same idea for monoliths or two-service architectures: generate a UUID at the edge, pass it in a header (X-Request-Id), and emit it in every log line. Now you can grep all logs for one ID and reconstruct the full lifecycle of a single request across log files.
Structured logging
Unstructured logs are English sentences: "User 42 logged in from 1.2.3.4 at 2025-09-01 10:00". Parsing that with regex is fragile. Structured logs emit machine-readable key-value records (JSON being the standard) so log aggregation platforms can filter, group, and alert on any field without text parsing:
{
  "timestamp": "2025-09-01T10:00:00Z",
  "level": "INFO",
  "service": "auth-service",
  "event": "user.login",
  "user_id": 42,
  "ip": "1.2.3.4",
  "trace_id": "f3a8b9c1d2e3",
  "duration_ms": 58
}
Every field is queryable. Finding all slow logins from a specific IP is now level=INFO AND event=user.login AND duration_ms > 500 AND ip=1.2.3.4 rather than a grep pipeline.
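One way to emit records like this from application code is a small JSON formatter on top of Python's standard logging module. The sketch below is illustrative only: request_id_var, the "fields" convention for extra data, and the logger name are assumptions made for this example, not a standard API.

```python
import contextvars
import io
import json
import logging
import time

# Hypothetical context variable that request middleware would set per request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    converter = time.gmtime  # timestamps in UTC so the trailing "Z" is accurate

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "auth-service",
            "event": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Extra structured fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


buffer = io.StringIO()  # stands in for stdout or a log shipper
log = logging.getLogger("auth-service-demo")
log.handlers.clear()
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonFormatter())
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

request_id_var.set("req-123")
log.info("user.login", extra={"fields": {"user_id": 42, "duration_ms": 58}})
line = json.loads(buffer.getvalue())
```

Because the request ID lives in a context variable, handlers deeper in the call stack get it on every line without passing it around explicitly.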
"How do you know your API is healthy?" is a reliability interview staple. A strong answer covers all three pillars (metrics for alerting, logs for debugging, traces for distributed bottlenecks), names the four golden signals (latency / traffic / errors / saturation), and ties alerting to SLOs rather than arbitrary thresholds. Bonus points: mention the difference between knowing something is wrong (alerting on error budget burn rate) versus diagnosing why (trace waterfall shows the slow span).
Two in one: alert fatigue and logging secrets. Threshold-based alerts that fire on every spike teach engineers to mute them — the real incident gets ignored in the noise. SLO burn-rate alerting fires rarely and always matters. Separately: structured logs capture rich context, but "rich context" must never include passwords, API keys, payment card numbers, or other PII. Log the user ID, not the session token; log "payment attempted", not the card number. Audit your log fields before they reach a third-party aggregator.
Do add a trace_id/request_id field to every log line emitted inside a request handler — it costs almost nothing and saves hours when debugging a production incident. Don't log unstructured strings and then try to parse them with regex later; structured JSON logs from day one pay compounding dividends in log query time.
Under the hood: how telemetry is actually collected
Counters vs histograms — and where p99 comes from
Prometheus (the de-facto standard) has four metric types. The two you see most often in API monitoring are Counter and Histogram.
A Counter is a monotonically increasing number: it only goes up, and it resets to zero only when the process restarts. You increment it when something happens: a request completes, an error fires, a payment is processed. The raw value is meaningless on its own; what matters is its rate of change. PromQL's rate() function computes the average per-second increase over a sliding window:
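As a rough illustration of what a rate over a window means, the Python sketch below computes the per-second increase from raw counter samples and tolerates a restart. It is a simplification: Prometheus' own rate() also extrapolates to the window edges, which is ignored here, and the sample values are invented.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase across the samples, tolerating counter resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again from 0,
        # so everything counted since the restart is new increase.
        increase += curr if curr < prev else curr - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Scraped every 15 s; the counter resets between t=30 and t=45.
window = [(0, 1000), (15, 1750), (30, 2500), (45, 750), (60, 1500)]
print(simple_rate(window))  # 50.0
```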
A Histogram tracks value distributions — request durations, payload sizes. Instead of storing every measurement, it increments a set of pre-defined buckets. A request of 87 ms increments every bucket whose upper bound is >= 87 ms:
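A toy version of that bucket bookkeeping, written in Python, might look like the following. The class and the bucket boundaries are assumptions chosen for illustration; it mirrors the cumulative "le" semantics described above, while real client libraries may store counts differently internally.

```python
import math
from typing import Dict, List


class CumulativeHistogram:
    """Counts observations in cumulative 'less than or equal' buckets."""

    def __init__(self, upper_bounds: List[float]) -> None:
        # An implicit +Inf bucket always catches every observation.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.counts: Dict[float, int] = {bound: 0 for bound in self.bounds}
        self.total_sum = 0.0

    def observe(self, value: float) -> None:
        self.total_sum += value
        for bound in self.bounds:
            if value <= bound:
                self.counts[bound] += 1


hist = CumulativeHistogram([0.05, 0.1, 0.25, 0.5, 1.0])  # seconds
hist.observe(0.087)  # the 87 ms request from the text
print(hist.counts)
```

The 87 ms observation leaves the 0.05 bucket at zero and adds one to every bucket from 0.1 upward, including +Inf.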
The key insight: averages hide tail latency. A mean of 50 ms can coexist with a p99 of 2 000 ms if a small fraction of requests hit a slow code path. Histograms make the tail visible. Choose bucket boundaries that straddle your SLO threshold (if your SLO is 99% of requests under 300 ms, put a bucket at 0.3).
Trace context propagation — how a traceparent header carries a trace across services
The W3C Trace Context spec defines a header, traceparent, that encodes four fields as a hyphen-separated string:
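The four fields are version, trace-id (32 hex characters), parent-id (16 hex characters), and trace-flags. A minimal Python parser, assuming only the version-00 layout and using the example header value from the W3C specification, could look like this; the TraceParent type and function name are hypothetical.

```python
import re
from typing import NamedTuple

# Version-00 layout only: version, trace-id, parent-id, trace-flags (lowercase hex).
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> TraceParent:
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    if match["trace_id"] == "0" * 32 or match["parent_id"] == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    # The lowest bit of trace-flags is the 'sampled' flag.
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceParent(match["version"], match["trace_id"], match["parent_id"], sampled)


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed)
```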
Here is what happens when a request crosses three services — from creation to span assembly in the tracing back-end:
Every service only needs to: (1) read the incoming traceparent header; (2) create a child span using that trace-id and the received span-id as its parent; (3) propagate the header outbound with its own span-id as the new parent. OpenTelemetry's auto-instrumentation libraries handle all three for many popular frameworks.
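Those three steps can be simulated in a few lines of Python. This is a sketch under stated assumptions: the service names, the Span type, and the in-memory "backend" list are invented for illustration, and a real system would use an OpenTelemetry SDK rather than hand-built headers.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    service: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]


def handle(service: str, headers: Dict[str, str], collected: List[Span]) -> Dict[str, str]:
    """Perform the three steps for one hop and return the outbound headers."""
    incoming = headers.get("traceparent")
    if incoming:
        # Step 1: read the incoming header.
        _, trace_id, parent_id, flags = incoming.split("-")
    else:
        # No header: this hop is the edge, so it starts a new trace.
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    # Step 2: create a child span under the received parent.
    span = Span(service, trace_id, secrets.token_hex(8), parent_id)
    collected.append(span)  # stands in for reporting to the tracing back-end
    # Step 3: propagate, with this hop's span-id as the new parent.
    return {"traceparent": f"00-{trace_id}-{span.span_id}-{flags}"}


backend: List[Span] = []
headers: Dict[str, str] = {}
for name in ["api-gateway", "orders", "payments"]:
    headers = handle(name, headers, backend)

for span in backend:
    print(span.service, span.span_id, "parent:", span.parent_id)
```

The back-end can rebuild the tree because every span carries the same trace ID plus the ID of the span that called it.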
How to debug & inspect monitoring
PromQL availability + p99 queries
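Two queries of this kind are sketched below as Python string constants, together with a helper that builds a URL for Prometheus' instant-query HTTP endpoint (/api/v1/query). The metric names http_requests_total and http_request_duration_seconds_bucket and the status and handler labels are assumptions; substitute whatever your service actually exports.

```python
from urllib.parse import urlencode

# Availability: share of requests over the last 5 minutes that were not 5xx.
AVAILABILITY_5M = (
    'sum(rate(http_requests_total{status!~"5.."}[5m]))'
    " / "
    "sum(rate(http_requests_total[5m]))"
)

# p99 latency per handler, estimated from histogram buckets.
P99_BY_HANDLER_5M = (
    "histogram_quantile(0.99, "
    "sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))"
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


print(instant_query_url("http://localhost:9090", P99_BY_HANDLER_5M))
```

Note that the le label must survive the aggregation, otherwise histogram_quantile has no buckets to interpolate between.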
Reading a trace waterfall to find the slow span
Open the trace in Jaeger, Tempo, or Honeycomb. Sort spans by duration descending. The widest child span is your bottleneck. The gap between spans (white space) is time the parent spent computing locally — narrow gaps and a wide child mean the child is the problem; wide gaps and narrow children mean the parent's own code is slow.
Using correlation/request IDs to stitch logs
If distributed tracing is not yet set up, a request ID in every log line provides the same cross-service join:
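A minimal Python sketch of that join is shown here: it filters JSON log lines from several services by request ID and orders them by timestamp. The field names request_id and timestamp, and the sample lines, are assumptions for illustration.

```python
import json
from typing import Dict, Iterable, List


def stitch(log_lines: Iterable[str], request_id: str) -> List[Dict]:
    """Return one request's log records from all services, oldest first."""
    events = []
    for raw in log_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines rather than fail the whole search
        if isinstance(record, dict) and record.get("request_id") == request_id:
            events.append(record)
    # ISO-8601 UTC timestamps in one consistent format sort correctly as strings.
    return sorted(events, key=lambda r: r.get("timestamp", ""))


lines = [
    '{"timestamp": "2025-09-01T10:00:00.120Z", "service": "auth-service", "request_id": "req-123", "event": "user.login"}',
    '{"timestamp": "2025-09-01T10:00:00.050Z", "service": "gateway", "request_id": "req-123", "event": "request.received"}',
    '{"timestamp": "2025-09-01T10:00:00.070Z", "service": "gateway", "request_id": "req-999", "event": "request.received"}',
    "plain text line that is not JSON",
]
timeline = stitch(lines, "req-123")
for event in timeline:
    print(event["timestamp"], event["service"], event["event"])
```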
| Symptom | Likely cause | Fix |
|---|---|---|
| Alert fires but Grafana shows nothing unusual | Alert is on a raw counter, not a rate; or the time window is too short to show in the graph | Use rate() in both the alert rule and the graph; align time ranges |
| p99 is high but average looks fine | Tail latency from a slow code path (N+1 query, cache miss, downstream timeout) affecting a minority of requests | Break p99 down by handler label; use traces on the slow requests to find the wide span |
| p99 in the tracing tool differs from the Prometheus p99 | Tracing is sampled (e.g. 1% of requests); the Prometheus histogram counts every request | Expected: traces are sampled for cost reasons, while Prometheus histograms observe all requests (bucketed, so still an approximation). Use metrics for alerting, traces for diagnosis |
| Spans arrive in the back-end with broken parent links (orphan spans) | One service strips the traceparent header before forwarding (proxy, WAF, or middleware resetting headers) | Audit every hop that transforms headers; explicitly pass traceparent through proxies/load balancers |
| Logs from different services can't be correlated for a single request | No request ID generated, or ID not propagated to downstream services | Generate a UUID at the edge; pass it as X-Request-Id header; read it and include in every log line |
Debug checklist for "p99 is high, where is it?"
- Run histogram_quantile(0.99, ...) broken out by handler to identify which endpoint is slow.
- Open a trace for a slow request in Jaeger/Tempo and sort child spans by duration.
- Find the widest span; check the gap before and after it for hidden local computation.
- Grep service logs by trace_id or request_id to get the log-level context around the slow span.
- If no trace is available: add a span around the suspected code path; re-deploy; re-measure.
By the numbers
Make the math concrete. Scenario: a payment API at 5,000 req/s, 1% trace sampling, Prometheus histograms with 6 latency buckets.
p99 via linear interpolation from histogram buckets
Prometheus histograms store cumulative counts per bucket. To find the p99 you locate the bucket that straddles the 99th-percentile rank and linearly interpolate within it.
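A worked example in Python follows. The cumulative counts are invented for illustration (one second of traffic at 5,000 req/s spread over six buckets, including +Inf), and the function is a simplified sketch of the interpolation idea rather than Prometheus' actual implementation.

```python
import math
from typing import List, Tuple

# (upper bound in seconds, cumulative count); counts are hypothetical.
BUCKETS: List[Tuple[float, int]] = [
    (0.05, 2000), (0.1, 3800), (0.25, 4900), (0.5, 4980), (1.0, 4998), (math.inf, 5000),
]


def quantile_from_buckets(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile by linear interpolation inside the matching bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # e.g. 0.99 * 5000 = the 4950th request
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into +Inf
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + fraction * (upper_bound - lower_bound)
        lower_bound, lower_count = upper_bound, count
    return lower_bound


p99 = quantile_from_buckets(0.99, BUCKETS)
print(f"p99 is roughly {p99 * 1000:.0f} ms")
```

With these counts the 4,950th request falls in the (0.25 s, 0.5 s] bucket, 50 requests into its 80, so the estimate is 0.25 + (50 / 80) × 0.25 ≈ 0.406 s.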
This is the value histogram_quantile(0.99, ...) returns. The result is an approximation — the true p99 lives somewhere in [250 ms, 500 ms]. Bucket boundaries are the knobs: if your SLO is "99% of requests under 300 ms," add a bucket at le=0.3 to bracket the threshold precisely. (Prometheus histogram best practices)
Counter rate formula
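In its simplest form, ignoring counter resets and the extrapolation Prometheus applies at window edges, the rate is the increase in the counter value v divided by the elapsed time. The counter readings in the worked example are hypothetical, chosen to match the 5,000 req/s scenario over a 300-second window.

```latex
\text{rate} = \frac{v(t_2) - v(t_1)}{t_2 - t_1}
\qquad\text{e.g.}\qquad
\frac{13{,}500{,}000 - 12{,}000{,}000}{300\ \text{s}} = 5{,}000\ \text{req/s}
```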
Sampling: how many traces does 1% capture?
At 5,000 req/s and 1% head sampling:
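The arithmetic can be sketched in Python as follows. The 10 KB average trace size is the assumption used in this section's table, and the function names are illustrative.

```python
REQUESTS_PER_SECOND = 5_000
SECONDS_PER_DAY = 86_400
TRACE_SIZE_BYTES = 10_000  # assumed average of 10 KB per stored trace


def traces_per_second(sample_rate: float, rps: int = REQUESTS_PER_SECOND) -> float:
    """Traces kept per second under uniform head sampling."""
    return rps * sample_rate


def tail_traces_per_second(sample_rate: float, tail_fraction: float = 0.01) -> float:
    """Expected kept traces per second that belong to the slowest 1%."""
    return traces_per_second(sample_rate) * tail_fraction


def storage_gb_per_day(sample_rate: float) -> float:
    """Daily trace storage in gigabytes (10^9 bytes)."""
    return traces_per_second(sample_rate) * TRACE_SIZE_BYTES * SECONDS_PER_DAY / 1e9


rate = 0.01
print(traces_per_second(rate), tail_traces_per_second(rate), storage_gb_per_day(rate))
```

At 1% sampling this gives 50 traces per second, about 0.5 tail traces per second, and roughly 43 GB per day.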
Is 50 traces/s enough to compute p99 from traces? No — p99 requires a representative sample of the tail. With 50 traces/s sampled uniformly, you expect to capture roughly 50 × 0.01 = 0.5 of the slowest 1% per second — less than one slow trace per second. Histograms (which measure all 5,000 req/s) are better for p99; traces are better for diagnosing why a specific slow request was slow.
| Sampling rate | Traces/s @ 5,000 req/s | Tail trace capture (slowest 1%) per second | Storage at 10 KB/trace |
|---|---|---|---|
| 0.1% (1 in 1,000) | 5 | ~0.05/s | ~4.3 GB/day |
| 1% (1 in 100) | 50 | ~0.5/s | ~43 GB/day |
| 10% (1 in 10) | 500 | ~5/s | ~432 GB/day |
| 100% (no sampling) | 5,000 | 50/s | ~4.3 TB/day |
Decision math — sampling rate vs storage: at 1% sampling and 10 KB/trace, daily storage is ~43 GB, which is reasonable for most teams. To actually debug tail latency, use tail-based sampling (keep 100% of requests that took >500 ms, drop the fast ones) rather than head-based sampling. Tail sampling keeps the traces that matter and controls storage. (OpenTelemetry — Sampling concepts)
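The key difference from head sampling is that the keep-or-drop decision happens after the trace has finished, when its total duration is known. A minimal Python sketch of that decision is below; the data layout and the 0.5 s threshold are assumptions, and production collectors implement this with buffering and timeouts that are omitted here.

```python
from typing import Dict, List, Tuple

SpanTiming = Tuple[float, float]  # (start_seconds, end_seconds)


def trace_duration(spans: List[SpanTiming]) -> float:
    """Wall-clock duration of a finished trace: earliest start to latest end."""
    return max(end for _, end in spans) - min(start for start, _ in spans)


def select_traces(
    finished: Dict[str, List[SpanTiming]], slow_threshold_s: float = 0.5
) -> List[str]:
    """Keep only the trace IDs whose total duration exceeded the threshold."""
    return [
        trace_id
        for trace_id, spans in finished.items()
        if trace_duration(spans) > slow_threshold_s
    ]


finished_traces = {
    "fast": [(0.00, 0.08), (0.01, 0.05)],
    "slow": [(0.00, 0.90), (0.10, 0.85)],
}
kept = select_traces(finished_traces)
print(kept)  # ['slow']
```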
Decision math: bucket boundaries for useful p99 resolution
If your SLO threshold is 300 ms and your buckets jump from 250 ms to 500 ms, you cannot tell whether p99 is 260 ms (well inside SLO) or 490 ms (nearly double the SLO). The interpolation uncertainty is the entire bucket width. Add boundaries that straddle your SLO:
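As an illustration in Python, the sketch below compares a coarse bucket layout with one that adds a boundary at the 300 ms SLO. Both layouts are assumptions chosen for this example; the helper just reports which bucket a given latency would land in.

```python
import bisect
import math
from typing import List, Tuple

COARSE = [0.05, 0.1, 0.25, 0.5, 1.0]                   # seconds
WITH_SLO_BOUNDARY = [0.05, 0.1, 0.25, 0.3, 0.5, 1.0]   # adds le=0.3 at the SLO


def bracketing_bucket(bounds: List[float], latency_s: float) -> Tuple[float, float]:
    """Return the (lower, upper] bucket that a latency falls into."""
    ordered = sorted(bounds)
    i = bisect.bisect_left(ordered, latency_s)
    lower = ordered[i - 1] if i > 0 else 0.0
    upper = ordered[i] if i < len(ordered) else math.inf
    return lower, upper


for latency in (0.26, 0.49):
    print(latency, bracketing_bucket(COARSE, latency), bracketing_bucket(WITH_SLO_BOUNDARY, latency))
```

With the coarse layout both 260 ms and 490 ms land in (0.25, 0.5]; with the extra boundary they fall on opposite sides of 0.3, so the bucket counts alone show whether the SLO is met.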
🧠 Quick check
1. You have a spike in p99 latency but no change in error rate. Which pillar tells you which service call is causing the slowdown?
Traces stitch together the timing of each service call in a single waterfall view. A wide span immediately identifies the bottleneck. Metrics tell you latency is high; traces tell you which hop caused it.
2. The RED method applies to which kind of system component?
RED (Rate, Errors, Duration) is designed for services that handle requests. USE (Utilisation, Saturation, Errors) is for the infrastructure resources underneath them.
3. What is the main advantage of SLO burn-rate alerting over simple threshold alerts?
A threshold alert fires every time a metric crosses a line, including harmless blips. A burn-rate alert fires only when you're consuming your error budget fast enough to exhaust it before the SLO window ends, so every page represents a real threat to the SLO.
4. Which of the four golden signals is the best leading indicator — it warns you before user requests start failing?
Saturation measures how full a resource is: a connection pool at 95% or a CPU pegged at 98% will cause errors soon. Measuring saturation lets you scale up or shed load before users feel anything.
✍️ Exercise: instrument a payment API — what would you measure and why?
You're the first engineer at a startup building a payment processing API. You have one week to add meaningful observability before launch. Describe exactly what metrics, logs, and traces you would add, and write the alert(s) you would configure.
Model answer:
- Metrics (RED + golden signals):
  - payment_requests_total (counter, labels: status=success|declined|error): traffic + errors
  - payment_request_duration_seconds (histogram, buckets at 0.1s, 0.5s, 1s, 5s): latency
  - payment_queue_depth (gauge): saturation leading indicator
- Structured logs (never log payment data):
  - Every request: {"event":"payment.attempt","payment_id":"pay_…","amount_cents":1000,"currency":"USD","trace_id":"…","duration_ms":80}
  - No card numbers, CVVs, or raw bank account numbers, ever
- Traces: Propagate traceparent through the payment processor call → fraud check → DB write. Immediately reveals if the third-party processor is the latency source.
- Alerts:
  - SLO: 99.9% of payments succeed within 3 s. Error-budget burn-rate alert: >14× for 5 min → page on-call.
  - Saturation: payment queue depth > 1000 for 2 min → warning (scale out before it becomes an error).
Rubric: ✓ uses RED method ✓ histogram for latency (not just average) ✓ structured logs with trace ID ✓ explicitly excludes PII from logs ✓ burn-rate alert tied to SLO ✓ saturation metric as leading indicator. Five or more = strong answer.
Key takeaways
- The three pillars — metrics, logs, traces — answer different questions: "is it happening?", "what happened?", "why is it slow?"
- The four golden signals (latency, traffic, errors, saturation) are the minimal set to instrument on any user-facing service.
- RED (Rate / Errors / Duration) applies to services; USE (Utilisation / Saturation / Errors) applies to resources.
- SLO burn-rate alerting fires only when the error budget is genuinely threatened, dramatically reducing alert fatigue.
- A trace ID propagated through every outbound call and log line makes incident debugging go from hours to minutes.
- Structured logs (JSON) are queryable; never log secrets or PII.