API Design

Foundations · Lesson 10

SLIs, SLOs & SLAs

"Reliable" is meaningless until you measure it. These three acronyms turn a vague promise into numbers: what you measure (SLI), what you aim for (SLO), and what you promise contractually (SLA). They're how serious teams reason about quality — and a favourite interview topic because they reveal whether you think about operations, not just code.

⏱ 11 min · Difficulty: core · Prereq: Lesson 04

By the end you'll be able to

Three layers of the same promise

SLI (Indicator): what you measure, e.g. "99.95% of requests succeeded"
SLO (Objective): your target, e.g. "≥ 99.9% monthly"
SLA (Agreement): the promise + penalty

A pyramid: you can't promise (SLA) what you don't target (SLO), and you can't target what you don't measure (SLI). Build bottom-up.
✅ The one-line distinction

SLI = the speedometer reading. SLO = the speed limit you set yourself. SLA = the ticket you pay if a cop catches you above a second, higher limit. If you can say that, you understand all three.

Good indicators measure the user's experience

Pick SLIs that reflect what users actually feel, not what's easy to graph. The common ones for an API:

| SLI | Measures | Typical phrasing |
|---|---|---|
| Availability | Did the request succeed at all? | % of non-5xx responses |
| Latency | Was it fast enough? | % of requests under 300 ms (use p99, not average; see Lesson 04) |
| Error rate | How often it fails | % of requests returning errors |
| Throughput / correctness | Volume served / right answer | requests/sec; % of correct results |

⚠️ Common trap: chasing 100%

"We'll target 100% uptime" is a red flag, not an ambition. The last fraction of a percent costs exponentially more (redundant everything, no risky deploys) and is statistically unprovable. Worse, if users reach you through their own flaky networks, they can't tell the difference between 99.99% and 100% anyway. Good teams pick the lowest reliability target that keeps users happy — and spend the freed-up effort on features.

The error budget: turning an SLO into decisions

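Here's the idea that makes SLOs powerful. If your SLO is 99.9% success, then 0.1% of requests are allowed to fail, and that allowance is your error budget. Over a 30-day month, 99.9% availability permits about 43 minutes of downtime. That budget is a currency you get to spend: while budget remains, you can afford risky changes and fast releases; once it is gone, you stop taking risks and harden the system.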

This dissolves the eternal dev-vs-ops fight ("move fast" vs "don't break things") into a shared number. The budget, not opinions, decides whether today is a day for features or for hardening.

| Availability SLO | Allowed downtime / 30 days |
|---|---|
| 99% | ~7.2 hours |
| 99.9% ("three nines") | ~43 minutes |
| 99.99% ("four nines") | ~4.3 minutes |
| 99.999% ("five nines") | ~26 seconds |

🎯 Interview angle

In any design question, stating a target reframes the whole solution: "I'll target 99.9% availability and p99 latency under 300 ms." That single sentence justifies your redundancy, caching, and retry choices — each is in service of a number, not decoration. And if asked "why not five nines?", answer with cost and the error-budget trade-off. That's senior-level reasoning interviewers specifically look for.

Under the hood: how it actually works

How an SLI is computed from raw events

An SLI is not a single sensor reading — it is a ratio computed over a time window. The canonical formula:

## Availability SLI (most common for APIs)
SLI = good_events / valid_events

## Example over a 1-minute window
valid_events = all HTTP requests (excluding health checks, pre-planned maintenance, and 4xx client errors)
good_events  = valid requests that returned a non-5xx response within the latency threshold

## If in one minute the server handled 10,000 valid requests:
##   9,987 returned 2xx/3xx within 300ms
##   13 returned 5xx or timed out
SLI (this minute) = 9987 / 10000 = 0.9987 = 99.87%

## The monthly SLI aggregates all 1-minute windows:
SLI (month) = sum(good_events over all windows) / sum(valid_events over all windows)

Critical design decision: what counts as valid? Client errors (4xx) caused by bad requests are typically excluded from the denominator because they don't reflect service health: a spike in 400s is a client problem, not a server problem. Among the valid requests, only 5xx responses, timeouts, and (when the SLI includes a latency threshold, as above) responses slower than that threshold count as failures.
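
The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `Request` record, the use of status `0` to mark a timeout, and the 300 ms default are all assumptions made for the example, and it follows the convention described above of leaving 4xx responses and health checks out of the ratio entirely.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    status: int                    # HTTP status; 0 is a hypothetical marker for "timed out"
    latency_ms: float              # time to respond, in milliseconds
    is_health_check: bool = False


def is_valid(req: Request) -> bool:
    """Counts toward the denominator unless it is a health check or a 4xx."""
    return not req.is_health_check and not (400 <= req.status < 500)


def is_good(req: Request, latency_threshold_ms: float = 300.0) -> bool:
    """A valid request is good if it returned 2xx/3xx within the latency threshold."""
    return 200 <= req.status < 400 and req.latency_ms <= latency_threshold_ms


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    """Return good_events / valid_events, or None when there were no valid events."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None  # undefined rather than 100%: zero traffic may itself be an outage
    good = sum(1 for r in valid if is_good(r))
    return good / len(valid)


window = (
    [Request(200, 120.0)] * 9_987                        # fast successes
    + [Request(503, 40.0)] * 10                          # server errors
    + [Request(0, 30_000.0)] * 3                         # timeouts
    + [Request(404, 15.0)] * 250                         # client errors: excluded
    + [Request(200, 2.0, is_health_check=True)] * 60     # health checks: excluded
)
print(f"{availability_sli(window):.4f}")  # 0.9987
```

Returning `None` for an empty window is a deliberate choice in this sketch: a caller has to decide what "no data" means instead of silently reading it as a perfect score.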

Error budget arithmetic — worked example

The error budget is the complement of the SLO, expressed as time or events per window. Walk through it numerically:

## Given: SLO = 99.9% availability over a 30-day month

total_minutes = 30 × 24 × 60 = 43,200 minutes
budget_fraction = 1 - 0.999 = 0.001
allowed_bad_minutes = 43,200 × 0.001 = 43.2 minutes  ← the error budget

## If the service handles 1,000 requests/minute on average:
total_requests_per_month = 43,200 × 1,000 = 43,200,000
allowed_failures = 43,200,000 × 0.001 = 43,200 failed requests

## A 10-minute outage at 1,000 req/min costs:
outage_cost = 10 min × 1,000 req/min = 10,000 failures
budget_consumed = 10,000 / 43,200 ≈ 23% of the monthly budget

## Budget remaining after the outage:
remaining = 43,200 - 10,000 = 33,200 failures ≈ 33.2 budget-minutes left
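
The same arithmetic, as a small Python sketch for readers who want to plug in their own numbers. The `ErrorBudget` class and its field names are hypothetical; the defaults (30-day window, 1,000 requests per minute) are taken from the worked example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo: float                            # e.g. 0.999 for 99.9%
    window_days: int = 30
    requests_per_minute: float = 1_000.0  # average traffic, assumed constant

    @property
    def window_minutes(self) -> int:
        return self.window_days * 24 * 60

    @property
    def allowed_bad_minutes(self) -> float:
        """The budget expressed as minutes of total outage."""
        return self.window_minutes * (1 - self.slo)

    @property
    def allowed_failures(self) -> float:
        """The budget expressed as failed requests."""
        return self.window_minutes * self.requests_per_minute * (1 - self.slo)

    def consumed_fraction(self, failed_requests: float) -> float:
        """Share of the window's budget used up by this many failures."""
        return failed_requests / self.allowed_failures


budget = ErrorBudget(slo=0.999)
outage_failures = 10 * budget.requests_per_minute   # a 10-minute total outage
print(f"{budget.allowed_bad_minutes:.1f} budget-minutes")            # 43.2
print(f"{budget.allowed_failures:,.0f} allowed failures")            # 43,200
print(f"{budget.consumed_fraction(outage_failures):.0%} consumed")   # 23%
```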

Burn rate alerting

Checking whether the SLO was met at the end of the month is too late. Burn rate alerting fires when you are consuming budget faster than sustainable. The burn rate is how many times faster you are burning the budget compared to the allowed baseline:

## Burn rate = (current error rate) / (SLO error rate)
SLO = 99.9%  → SLO error rate = 0.001 (0.1%)

## If in the last hour the error rate is 1%:
burn_rate = 0.01 / 0.001 = 10x

## At burn rate 10x you exhaust the monthly budget in:
time_to_exhaust = 30 days / 10 = 3 days

## Common alert thresholds (Google SRE Workbook):
##   Page immediately (burn rate ≥ 14.4 over 1 hour → 2% of budget consumed; budget gone in ~2 days)
##   Page (burn rate ≥ 6 over 6 hours → 5% of budget consumed)
##   Ticket (burn rate ≥ 1 over 3 days → 10% of budget consumed)

Burn rate alerts avoid two failure modes: missing a slow, low-rate degradation that quietly drains the budget, and being flooded with alerts for brief spikes that recover quickly.
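
A minimal Python sketch of the two burn-rate formulas, assuming a 30-day window. The function names are illustrative only.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return observed_error_rate / budget


def days_to_exhaust(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the current burn rate holds."""
    return float("inf") if rate <= 0 else window_days / rate


rate = burn_rate(observed_error_rate=0.01, slo=0.999)
print(f"burn rate {rate:.1f}x, budget gone in {days_to_exhaust(rate):.1f} days")  # 10.0x, 3.0 days
print(f"{days_to_exhaust(14.4) * 24:.0f} hours at 14.4x")                         # 50 hours
```

Note what the second line shows: a sustained 14.4x burn rate empties a 30-day budget in about 50 hours, which is the same thing as spending 2% of it every hour.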

[Figure: error budget remaining (100% down to 0%) plotted against days in the month (0 to 30). Outage 1 consumes 10% of the budget; after outage 2 only 20% is left, which triggers the alert.]
Error budget for a 99.9% SLO over 30 days. Outages create steep drops; a burn-rate alert fires when the remaining budget hits the threshold line — not at month's end when it is too late to react.

How to debug & inspect it

SLIs live in your metrics system. The queries below use PromQL (Prometheus Query Language), a widely used format, but the same logic applies in Datadog, CloudWatch, or any other metrics platform.

# Availability SLI: fraction of valid requests that did not return 5xx, over the last 30 minutes
sum(rate(http_requests_total{status!~"4..|5.."}[30m]))
/
sum(rate(http_requests_total{status!~"4.."}[30m]))
# 4xx responses are excluded from both numerator and denominator (client errors ≠ service failures)

# p99 latency SLI: 99th percentile request duration over the last 5 minutes
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Current burn rate over the last hour (SLO error rate = 0.001 for a 99.9% SLO)
(
  1 -
  sum(rate(http_requests_total{status!~"4..|5.."}[1h]))
  /
  sum(rate(http_requests_total{status!~"4.."}[1h]))
) / 0.001
# Result > 1    → burning faster than the SLO allows
# Result > 14.4 → page immediately (2% of the monthly budget burned per hour; gone in about 2 days)

# Error budget remaining for the month (assumes a 30d window and a 99.9% SLO)
1 - (
  1 -
  sum(rate(http_requests_total{status!~"4..|5.."}[30d]))
  /
  sum(rate(http_requests_total{status!~"4.."}[30d]))
) / 0.001
# 1.0 = full budget · 0.0 = exhausted · negative = overdrawn

| Symptom / finding | Likely cause | Action |
|---|---|---|
| SLI drops suddenly at a known time | Deployment, config change, or dependency outage triggered at that time | Correlate with deployment log; roll back or fix-forward; add change markers to dashboards |
| SLI looks fine but users report slowness | SLI measures the wrong thing, e.g. average latency instead of p99, or it excludes a subset of users | Switch to percentile SLI; segment by region, user tier, endpoint; check which requests are slow |
| Burn rate alert fires but SLI looks normal | Alert window is shorter than SLI window; a short spike consumed budget faster than the long average reveals | Add a short-window (5m) SLI chart next to the long-window (30d) one to detect spikes |
| SLI is 100% but the service is actually down | Metrics pipeline itself is broken: if no requests flow, the ratio is undefined or the scraper is offline | Add a request-rate alert: if rate(http_requests_total[5m]) == 0 during business hours → investigate |
| Error budget exhausted on day 3 of 30 | Major outage or high sustained error rate | Freeze risky changes; hold incident retro; fix root cause; consider temporary SLO adjustment with stakeholders |
| SLI is 99.95% but SLA says 99.9%: am I safe? | SLI and SLA windows may differ; SLA may exclude maintenance windows | Read the SLA definition carefully; measure your SLI over the exact same window and exclusions the SLA specifies |

Checklist for a new SLI:

  1. Define what "valid" means — which requests/events count toward the denominator? (Exclude health checks, pre-planned maintenance, and client errors unless you want to measure them.)
  2. Define what "good" means — non-5xx? Under 300ms? Both? Write the exact metric query before setting a target.
  3. Check the SLI in production before setting the SLO — if current availability is 99.5%, a 99.9% SLO means your error budget is already gone.
  4. Add a rate alert alongside the SLI — if the request rate drops to zero, a perfect SLI is meaningless and potentially masking a total outage.
  5. Review with users: does a p99 latency SLI actually capture what makes them complain, or do you need to segment by endpoint or user tier?
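
Items 3 and 4 of the checklist can be checked numerically before an SLO is adopted. The sketch below is one possible way to do it; `budget_consumed` is a hypothetical helper, and the request counts are made-up inputs chosen to match the 99.5% example in item 3.

```python
from typing import Optional


def budget_consumed(good: int, valid: int, slo: float) -> Optional[float]:
    """Fraction of the error budget already used; None if there was no valid traffic."""
    if valid == 0:
        return None  # no data: alert on the missing traffic instead of reporting 100%
    error_rate = 1 - good / valid
    return error_rate / (1 - slo)


# Item 3: historical availability of 99.5% measured against a proposed 99.9% SLO.
used = budget_consumed(good=995_000, valid=1_000_000, slo=0.999)
print(round(used, 2))   # 5.0, i.e. the budget would already be overspent five times

# Item 4: a window with no requests at all is "unknown", not "perfect".
print(budget_consumed(good=0, valid=0, slo=0.999))   # None
```
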
⚠️ The Goodhart's Law trap

Once a metric becomes a target, people optimise for the metric rather than what it was measuring. Common examples: excluding "expected" error codes to hit the SLO number; targeting average latency (which ignores the slow tail that users actually feel); or setting an availability SLO on the load balancer health check instead of the real user flow. Guard against this by validating SLIs against user-reported complaints — the SLI should predict user pain, not just be easy to hit.

By the numbers

Error budgets and burn rates convert abstract SLOs into concrete operational decisions. The governing formulas:

error_budget       = 1 − SLO                            # fraction of failures allowed
budget_minutes_30d = (1 − SLO) × 30 × 24 × 60           # allowed downtime-equivalent in a month
burn_rate          = observed_error_rate / error_budget # how fast you are consuming the budget
days_to_exhaust    = 30 / burn_rate                     # days until budget is gone at current rate

Scenario: a checkout API running at 1,000 req/s with a 99.9% availability SLO. The team is observing a 1% error rate during an incident. How fast is the budget burning?

Error budget table — SLO targets over 30 days:

| SLO target | Error budget (fraction) | Allowed downtime / 30 d | Allowed failures / 30 d @ 1k req/s |
|---|---|---|---|
| 99% | 0.01 (1%) | ~7.2 hours | ~25,920,000 |
| 99.9% ("three nines") | 0.001 (0.1%) | ~43.2 min | ~2,592,000 |
| 99.95% | 0.0005 (0.05%) | ~21.6 min | ~1,296,000 |
| 99.99% ("four nines") | 0.0001 (0.01%) | ~4.3 min | ~259,200 |
| 99.999% ("five nines") | 0.00001 (0.001%) | ~26 sec | ~25,920 |
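
A table like this can be regenerated for any traffic level with a short script. This sketch assumes a 30-day window (2,592,000 seconds) and the scenario's constant 1,000 requests per second, so the window contains 2,592,000,000 requests; `budget_row` is an illustrative name.

```python
from typing import Tuple

SECONDS_PER_30_DAYS = 30 * 24 * 60 * 60  # 2,592,000


def budget_row(slo_percent: float, requests_per_second: float = 1_000.0) -> Tuple[float, float]:
    """Return (allowed downtime in minutes, allowed failed requests) for a 30-day window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    downtime_minutes = SECONDS_PER_30_DAYS * budget_fraction / 60
    allowed_failures = SECONDS_PER_30_DAYS * requests_per_second * budget_fraction
    return downtime_minutes, allowed_failures


for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes, failures = budget_row(slo)
    print(f"{slo:>7}%  {minutes:8.2f} min  {failures:>12,.0f} failures")
```

Changing `requests_per_second` scales the failure column linearly, while the downtime column depends only on the SLO.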

Worked burn-rate trace — 99.9% SLO, 1% observed error rate:

SLO = 99.9% → error_budget = 0.001 (0.1% failures allowed)
observed_rate = 0.01 (1% of requests are failing right now)

# Burn rate:
burn_rate = 0.01 / 0.001 = 10×            ← consuming budget 10× faster than sustainable

# Days until budget exhausted:
days_to_exhaust = 30 / 10 = 3 days        ← the monthly budget is gone in 3 days

# Budget consumed per hour at 1,000 req/s:
failures_per_hour = 1,000 req/s × 3,600 s × 0.01 = 36,000 failures
total_budget      = 1,000 req/s × 2,592,000 s × 0.001 = 2,592,000 allowed failures this month
consumed_per_hour = 36,000 / 2,592,000 ≈ 1.4% of monthly budget per hour of incident

# At t=1h into incident:
budget_remaining = 2,592,000 - 36,000 = 2,556,000 failures ≈ 42.6 budget-minutes left

Multi-window alert thresholds (following the Google SRE Workbook burn-rate model):

| Alert window | Burn rate threshold | Budget consumed before alert fires | Severity |
|---|---|---|---|
| 5 min + 1 h | ≥ 14.4× | ~2% (fires fast; at this rate the budget is gone in ~2 days) | Page immediately |
| 30 min + 6 h | ≥ 6× | ~5% (budget exhausted in ~5 days) | Page (urgent) |
| 6 h + 3 d | ≥ 1× | ~10% (slowly burning; ticket review) | Ticket |
| End-of-month review | SLO missed | 100% (too late to react) | Post-mortem only |

# Worked multi-window example: SLO=99.9%, error_budget=0.001
# Incident sustains burn_rate=14.4× over both the 5-minute AND the 1-hour window:
burn_rate = 14.4
budget_share_per_hour = 1 / 720            # at burn rate 1×, each of the 720 hours in 30 days uses 1/720 of the budget
fraction_consumed_in_1h = burn_rate × (1/720) = 14.4/720 = 2%
# → 2% of monthly budget gone in 1 hour; page-level alert fires.
# Two windows required (5m and 1h both above threshold) to reduce false positives:
# a 5-minute spike at 14.4× alone would only consume 14.4/(720×12) ≈ 0.17%, which is not critical,
# but if the 1h window is also above threshold, the degradation is real and sustained.
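
The two-window rule can be sketched as a small evaluator. This is an illustration under stated assumptions, not how any particular alerting system implements it: `BurnRateRule`, `RULES`, and `evaluate` are hypothetical names, the thresholds and window pairs are copied from the table above, and the caller is assumed to have already measured the burn rate over each window.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    threshold: float     # burn-rate multiple that must be met or exceeded
    short_window: str    # key into the measured burn rates
    long_window: str


# Ordered from most to least severe, so the first match wins.
RULES = (
    BurnRateRule("page-immediately", 14.4, "5m", "1h"),
    BurnRateRule("page-urgent", 6.0, "30m", "6h"),
    BurnRateRule("ticket", 1.0, "6h", "3d"),
)


def evaluate(burn_rates: Mapping[str, float]) -> Optional[str]:
    """Return the most severe rule whose short AND long windows both meet the threshold."""
    for rule in RULES:
        short = burn_rates.get(rule.short_window, 0.0)
        long_ = burn_rates.get(rule.long_window, 0.0)
        if short >= rule.threshold and long_ >= rule.threshold:
            return rule.name
    return None


brief_spike = {"5m": 20.0, "30m": 5.0, "1h": 2.0, "6h": 0.5, "3d": 0.1}
sustained = {"5m": 15.0, "30m": 15.0, "1h": 14.5, "6h": 3.0, "3d": 0.4}
slow_leak = {"5m": 1.5, "30m": 1.5, "1h": 1.5, "6h": 1.5, "3d": 1.2}

print(evaluate(brief_spike))  # None
print(evaluate(sustained))    # page-immediately
print(evaluate(slow_leak))    # ticket
```

The brief spike exceeds 14.4x on the 5-minute window but not on the 1-hour window, so nothing fires; that is the false-positive suppression the second window is there to provide.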

Decision math — SLO target vs release velocity:

Looser SLO → larger error budget → more room to take risks → faster feature velocity
Tighter SLO → smaller budget → any incident eats a large fraction → forced reliability focus

Example: a 10-minute deploy-caused outage at 1k req/s:
  at 99.9% SLO (budget = 43.2 min): 10 min ≈ 23% of budget → painful but recoverable
  at 99.99% SLO (budget = 4.3 min): 10 min ≈ 231% of budget → SLO breached by the first outage

Break-even: set the SLO so a typical deploy outage burns <20% of the monthly budget.
If deploys cause ~10 min of error spikes per month:
  10 / total_budget_minutes < 0.20 → total_budget_minutes > 50 min
  → SLO must be no tighter than ~99.88% (1 − 50/43,200); 99.9% (43.2 min budget) narrowly misses the rule, at ~23%
Tighten the SLO only after deploy-caused error time is reduced (blue/green, canary, feature flags).
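
The break-even reasoning can be turned into two small functions. This is a sketch assuming a 30-day window and treating deploy-caused error time as equivalent to full downtime; `outage_share` and `tightest_slo` are illustrative names, and the 20% ceiling is the rule of thumb used above.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 43,200


def outage_share(slo: float, outage_minutes: float) -> float:
    """Fraction of the 30-day budget that an outage of this length consumes."""
    return outage_minutes / (WINDOW_MINUTES * (1 - slo))


def tightest_slo(outage_minutes: float, max_share: float = 0.20) -> float:
    """Tightest SLO for which the outage stays within max_share of the budget."""
    required_budget_minutes = outage_minutes / max_share
    return 1 - required_budget_minutes / WINDOW_MINUTES


print(f"{outage_share(0.999, 10):.0%}")    # 23%
print(f"{outage_share(0.9999, 10):.0%}")   # 231%
print(f"{tightest_slo(10):.3%}")           # 99.884%
```

Halving the typical deploy-caused error time to 5 minutes moves the break-even to roughly 99.94%, which is the quantitative version of "tighten only after deploys get safer".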

Sources: Google SRE Workbook — Alerting on SLOs (burn rates); Google SRE Book — Service Level Objectives; Google Cloud — SRE fundamentals.

🧠 Quick check

1. Which is the contractual promise to a customer, with penalties for missing it?

SLA = Agreement: the external contract with consequences. The SLI is the measurement; the SLO is your internal target (usually stricter than the SLA).

2. Why is "target 100% availability" considered a poor goal?

Diminishing returns: each extra nine multiplies cost while delivering imperceptible benefit. Teams pick the lowest target that keeps users happy and spend the rest on features.

3. Your error budget for the month is exhausted. The disciplined response is to:

The error budget governs risk: out of budget means stop taking risks and harden the system. With budget to spare, you're free to move fast.

✍️ Drill: set SLOs for a checkout API

You own the payment-checkout API. Propose two SLIs, sensible SLOs, and how the SLA should relate. Decide first.

Model answer: SLIs: (1) availability = % of checkout requests returning success (non-5xx); (2) latency = % of checkout requests completing in under, say, 500 ms. SLOs: availability ≥ 99.95% monthly (payments are high-stakes, so stricter than a typical service); latency: ≥ 99% of requests under 500 ms, which is the same as saying p99 latency ≤ 500 ms. SLA: promise customers something looser, e.g. 99.9%, so the gap between SLO (99.95%) and SLA (99.9%) is your safety margin: you'll usually breach your internal target and react long before you owe anyone a refund. Tie it together: the strict SLO justifies redundancy, idempotent retries (Lesson 08), and aggressive monitoring.

Rubric: ✓ user-centric SLIs (success + latency at a percentile) ✓ SLO stricter than SLA with a stated margin ✓ justifies strictness by payment stakes ✓ connects targets to design choices.

Key takeaways

Sources & further reading