Foundations · Lesson 10
SLIs, SLOs & SLAs
"Reliable" is meaningless until you measure it. These three acronyms turn a vague promise into numbers: what you measure (SLI), what you aim for (SLO), and what you promise contractually (SLA). They're how serious teams reason about quality — and a favourite interview topic because they reveal whether you think about operations, not just code.
By the end you'll be able to
- Define SLI, SLO, and SLA and keep them straight.
- Pick good indicators and explain why "100% uptime" is the wrong target.
- Use an error budget to make a real engineering decision.
Three layers of the same promise
- SLI — Service Level Indicator: the actual measurement. e.g. "the fraction of requests that returned a 2xx within 300 ms." It's a number you collect, ideally as a ratio of good events to total.
- SLO — Service Level Objective: the target you set for an SLI, internally. e.g. "99.9% of requests succeed each month." This is the line your team holds itself to.
- SLA — Service Level Agreement: a contractual promise to a customer, with consequences (refunds, credits) if you miss. SLAs are deliberately looser than SLOs so you have a safety margin before money is on the line.
SLI = the speedometer reading. SLO = the speed limit you set yourself. SLA = the ticket you pay if a cop catches you over a higher line. If you can say that, you understand all three.
Good indicators measure the user's experience
Pick SLIs that reflect what users actually feel, not what's easy to graph. The common ones for an API:
| SLI | Measures | Typical phrasing |
|---|---|---|
| Availability | Did the request succeed at all? | % of non-5xx responses |
| Latency | Was it fast enough? | % of requests under 300 ms (use p99, not average — Lesson 04) |
| Error rate | How often it fails | % of requests returning errors |
| Throughput / correctness | Volume served / right answer | requests/sec; % of correct results |
"We'll target 100% uptime" is a red flag, not an ambition. The last fraction of a percent costs exponentially more (redundant everything, no risky deploys) and is statistically unprovable. Worse, if users reach you through their own flaky networks, they can't tell the difference between 99.99% and 100% anyway. Good teams pick the lowest reliability target that keeps users happy — and spend the freed-up effort on features.
The error budget: turning an SLO into decisions
Here's the idea that makes SLOs powerful. If your SLO is 99.9% success, then 0.1% failures are allowed — that's your error budget. Over a 30-day month, 99.9% availability permits about 43 minutes of downtime. That budget is a currency you get to spend:
- Budget remaining? Ship faster, run experiments, take risks — you've earned the room.
- Budget exhausted? Freeze risky launches and pour effort into reliability until you're back under budget.
This dissolves the eternal dev-vs-ops fight ("move fast" vs "don't break things") into a shared number. The budget, not opinions, decides whether today is a day for features or for hardening.
| Availability SLO | Allowed downtime / 30 days |
|---|---|
| 99% | ~7.2 hours |
| 99.9% ("three nines") | ~43 minutes |
| 99.99% ("four nines") | ~4.3 minutes |
| 99.999% ("five nines") | ~26 seconds |
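The downtime figures in the table follow directly from the SLO fraction. A minimal Python sketch of that arithmetic, assuming a 30-day window; the function name `allowed_downtime_minutes` is illustrative and not part of the lesson:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


# Recompute the rows of the table above.
for slo in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo:.3%} -> {minutes:.2f} min ({minutes * 60:.0f} s)")
```

Each extra nine divides the allowance by ten, which is why the step from four to five nines leaves only seconds of slack per month.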
In any design question, stating a target reframes the whole solution: "I'll target 99.9% availability and p99 latency under 300 ms." That single sentence justifies your redundancy, caching, and retry choices — each is in service of a number, not decoration. And if asked "why not five nines?", answer with cost and the error-budget trade-off. That's senior-level reasoning interviewers specifically look for.
Under the hood: how it actually works
How an SLI is computed from raw events
An SLI is not a single sensor reading — it is a ratio computed over a time window. The canonical formula:
## Availability SLI (most common for APIs)
SLI = good_events / valid_events
## Example over a 1-minute window
valid_events = all HTTP requests (excluding health checks, pre-planned maintenance)
good_events = requests that returned a non-5xx response within the latency threshold
## If in one minute the server handled 10,000 requests:
## 9,987 returned 2xx/3xx/4xx within 300ms
## 13 returned 5xx or timed out
SLI (this minute) = 9987 / 10000 = 0.9987 = 99.87%
## The monthly SLI aggregates all 1-minute windows:
SLI (month) = sum(good_events over all windows) / sum(valid_events over all windows)
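Critical design decision: what counts as valid? Health checks and pre-planned maintenance are excluded from the denominator. Client errors (4xx) caused by bad requests don't reflect service health: a spike in 400s is a client problem, not a server problem. Teams either count them as good events (as in the example above) or exclude them from the denominator entirely, but they should be treated the same way everywhere. Either way, only 5xx responses, timeouts, and (under the definition above) responses slower than the latency threshold count as failures.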
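The ratio above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production collector: the `Request` record and the `availability_sli` function are hypothetical names, 4xx responses are counted as good (as in the 1-minute example), and health checks are dropped from the denominator.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Request:
    status: int                    # HTTP status code
    latency_ms: float              # time taken to respond
    is_health_check: bool = False  # health checks are not valid events


def availability_sli(requests: Iterable[Request],
                     latency_threshold_ms: float = 300.0) -> Optional[float]:
    """Return good_events / valid_events, or None if there were no valid events."""
    valid = 0
    good = 0
    for r in requests:
        if r.is_health_check:
            continue  # excluded from the denominator
        valid += 1
        # Good = non-5xx AND within the latency threshold.
        if r.status < 500 and r.latency_ms <= latency_threshold_ms:
            good += 1
    if valid == 0:
        return None  # no traffic: the ratio is undefined, not 100%
    return good / valid


# The 1-minute example: 9,987 good requests and 13 server errors.
window = [Request(200, 120.0)] * 9987 + [Request(503, 50.0)] * 13
print(availability_sli(window))
```

Returning `None` for an empty window is a deliberate choice: a window with no traffic should not be reported as a perfect score.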
Error budget arithmetic — worked example
The error budget is the complement of the SLO, expressed as time or events per window. Walk through it numerically:
## Given: SLO = 99.9% availability over a 30-day month
total_minutes = 30 × 24 × 60 = 43,200 minutes
budget_fraction = 1 - 0.999 = 0.001
allowed_bad_minutes = 43,200 × 0.001 = 43.2 minutes ← the error budget
## If the service handles 1,000 requests/minute on average:
total_requests_per_month = 43,200 × 1,000 = 43,200,000
allowed_failures = 43,200,000 × 0.001 = 43,200 failed requests
## A 10-minute outage at 1,000 req/min costs:
outage_cost = 10 min × 1,000 req/min = 10,000 failures
budget_consumed = 10,000 / 43,200 ≈ 23% of the monthly budget
## Budget remaining after the outage:
remaining = 43,200 - 10,000 = 33,200 failures ≈ 33.2 budget-minutes left
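The same walk-through as a small Python sketch, so the numbers can be recomputed for other traffic levels or outage lengths. The function name and the returned keys are illustrative, and the sketch assumes every request fails for the whole outage:

```python
def error_budget_report(slo: float, requests_per_minute: float,
                        outage_minutes: float, window_days: int = 30) -> dict:
    """Express the error budget in requests and charge one full outage against it."""
    total_requests = window_days * 24 * 60 * requests_per_minute
    allowed_failures = total_requests * (1 - slo)
    # Assumption: 100% of requests fail during the outage.
    failed = outage_minutes * requests_per_minute
    return {
        "allowed_failures": allowed_failures,
        "consumed_fraction": failed / allowed_failures,
        "remaining_failures": allowed_failures - failed,
    }


report = error_budget_report(slo=0.999, requests_per_minute=1000, outage_minutes=10)
print(f"allowed failures:   {report['allowed_failures']:,.0f}")
print(f"budget consumed:    {report['consumed_fraction']:.1%}")
print(f"remaining failures: {report['remaining_failures']:,.0f}")
```

A partial outage can be modelled by scaling `failed` by the fraction of requests that actually errored.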
Burn rate alerting
Checking whether the SLO was met at the end of the month is too late. Burn rate alerting fires when you are consuming budget faster than sustainable. The burn rate is how many times faster you are burning the budget compared to the allowed baseline:
## Burn rate = (current error rate) / (SLO error rate)
SLO = 99.9% → SLO error rate = 0.001 (0.1%)
## If in the last hour the error rate is 1%:
burn_rate = 0.01 / 0.001 = 10x
## At burn rate 10x you exhaust the monthly budget in:
time_to_exhaust = 30 days / 10 = 3 days
## Common alert thresholds (Google SRE Workbook):
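## Page immediately (burn rate ≥ 14.4 over 1 hour → 2% of the budget consumed; budget gone in ~2 days)
## Page (burn rate ≥ 6 over 6 hours → 5% of the budget consumed; budget gone in ~5 days)
## Ticket (burn rate ≥ 1 over 3 days → 10% of the budget consumed)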
Burn rate alerts avoid two failure modes: missing a slow, low-rate degradation that quietly drains the budget, and being flooded with alerts for brief spikes that recover quickly.
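A minimal Python sketch of the burn-rate arithmetic, assuming a 30-day SLO window. The three helper names are illustrative. The last helper shows where a threshold like 14.4 comes from: it is the rate that spends 2% of a 30-day budget in one hour.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than the allowed baseline the budget is being spent."""
    return error_rate / (1 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until the whole budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def budget_consumed(rate: float, hours: float, window_days: float = 30) -> float:
    """Fraction of the window's budget spent after `hours` at a constant burn rate."""
    return rate * hours / (window_days * 24)


rate = burn_rate(error_rate=0.01, slo=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget gone in: {days_to_exhaustion(rate):.1f} days")
print(f"14.4x for 1 hour spends {budget_consumed(14.4, hours=1):.1%} of the budget")
```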
How to debug & inspect it
SLIs live in your metrics system. The query in the table below uses PromQL (Prometheus Query Language), a widely used format, but the logic is the same in Datadog, CloudWatch, or any other metrics platform.
| Symptom / finding | Likely cause | Action |
|---|---|---|
| SLI drops suddenly at a known time | Deployment, config change, or dependency outage triggered at that time | Correlate with deployment log; roll back or fix-forward; add change markers to dashboards |
| SLI looks fine but users report slowness | SLI measures the wrong thing, e.g. average latency instead of p99, or it excludes a subset of users | Switch to a percentile SLI; segment by region, user tier, and endpoint; check which requests are slow |
| Burn rate alert fires but SLI looks normal | Alert window is shorter than SLI window; a short spike consumed budget faster than the long average reveals | Add a short-window (5m) SLI chart next to the long-window (30d) one to detect spikes |
| SLI is 100% but the service is actually down | No requests are reaching the service, so no failures are recorded and the ratio is undefined, or the metrics pipeline or scraper is itself offline | Add a request-rate alert: if sum(rate(http_requests_total[5m])) == 0 during business hours, or the series is missing entirely, investigate |
| Error budget exhausted on day 3 of 30 | Major outage or high sustained error rate | Freeze risky changes; hold incident retro; fix root cause; consider temporary SLO adjustment with stakeholders |
| SLI is 99.95% but SLA says 99.9% — am I safe? | SLI and SLA windows may differ; SLA may exclude maintenance windows | Read the SLA definition carefully; measure your SLI over the exact same window and exclusions the SLA specifies |
Checklist for a new SLI:
- Define what "valid" means — which requests/events count toward the denominator? (Exclude health checks, pre-planned maintenance, and client errors unless you want to measure them.)
- Define what "good" means — non-5xx? Under 300ms? Both? Write the exact metric query before setting a target.
- Check the SLI in production before setting the SLO — if current availability is 99.5%, a 99.9% SLO means your error budget is already gone.
- Add a rate alert alongside the SLI — if the request rate drops to zero, a perfect SLI is meaningless and potentially masking a total outage.
- Review with users: does a p99 latency SLI actually capture what makes them complain, or do you need to segment by endpoint or user tier?
Once a metric becomes a target, people optimise for the metric rather than what it was measuring. Common examples: excluding "expected" error codes to hit the SLO number; targeting average latency (which ignores the slow tail that users actually feel); or setting an availability SLO on the load balancer health check instead of the real user flow. Guard against this by validating SLIs against user-reported complaints — the SLI should predict user pain, not just be easy to hit.
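To see why an average-latency target is easy to hit while users still suffer, here is a small Python sketch on made-up data. The sample (97 fast requests, 3 very slow ones) is hypothetical, and `percentile_nearest_rank` is an illustrative helper using the nearest-rank definition:

```python
def percentile_nearest_rank(values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 3% of requests hit a slow path.
latencies_ms = [100.0] * 97 + [3000.0] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile_nearest_rank(latencies_ms, 99)
under_300 = sum(l < 300 for l in latencies_ms) / len(latencies_ms)

print(f"mean latency: {mean_ms:.0f} ms")      # looks healthy against a 300 ms target
print(f"p99 latency:  {p99_ms:.0f} ms")       # exposes the slow tail
print(f"under 300 ms: {under_300:.0%} of requests")
```

The mean sits comfortably under 300 ms even though three requests in every hundred take three seconds, which is the kind of gap between the metric and user pain the paragraph above warns about.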
By the numbers
Error budgets and burn rates convert abstract SLOs into concrete operational decisions. The governing formulas:
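The formulas are not shown at this point in the text, so the block below restates the relationships used in the worked examples earlier in the lesson. The symbol names are assumptions introduced here: $B$ is the error budget as a fraction, $W$ the SLO window length, $e$ the observed error rate, $r$ the burn rate, and $t$ the elapsed time.

```latex
\begin{align*}
  B &= 1 - \mathrm{SLO}
    && \text{error budget (fraction)} \\
  T_{\mathrm{bad}} &= B \cdot W
    && \text{allowed bad time in the window} \\
  r &= \frac{e}{B}
    && \text{burn rate} \\
  T_{\mathrm{exhaust}} &= \frac{W}{r}
    && \text{time until the budget is gone} \\
  c(t) &= \frac{r \cdot t}{W}
    && \text{fraction of budget consumed after time } t
\end{align*}
```

With SLO = 0.999 and $W$ = 30 days, $B$ = 0.001 and $T_{\mathrm{bad}}$ = 43.2 minutes, matching the worked example.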
Scenario: a checkout API running at 1,000 req/min with a 99.9% availability SLO. The team is observing a 1% error rate during an incident. How fast is the budget burning?
Error budget table (SLO targets over 30 days):
| SLO target | Error budget (fraction) | Allowed downtime / 30 d | Allowed failures / 30 d @ 1k req/min |
|---|---|---|---|
| 99% | 0.01 (1%) | ~7.2 hours | ~432,000 |
| 99.9% ("three nines") | 0.001 (0.1%) | ~43.2 min | ~43,200 |
| 99.95% | 0.0005 (0.05%) | ~21.6 min | ~21,600 |
| 99.99% ("four nines") | 0.0001 (0.01%) | ~4.3 min | ~4,320 |
| 99.999% ("five nines") | 0.00001 (0.001%) | ~26 sec | ~432 |
Worked burn-rate trace — 99.9% SLO, 1% observed error rate:
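The trace itself is not shown here, so the Python sketch below reconstructs one from the figures the lesson already gives (99.9% SLO, 1% observed error rate, 30-day window). It assumes the error rate stays constant; the checkpoints (1 h, 6 h, 24 h, 48 h, 72 h) are chosen for illustration.

```python
SLO = 0.999
ERROR_RATE = 0.01          # observed during the incident
WINDOW_HOURS = 30 * 24     # 30-day SLO window

# Burn rate = observed error rate / allowed error rate.
rate = ERROR_RATE / (1 - SLO)


def consumed_after(hours: float) -> float:
    """Fraction of the monthly budget spent after `hours` at the constant burn rate."""
    return rate * hours / WINDOW_HOURS


print(f"burn rate: {rate:.1f}x")
for hours in (1, 6, 24, 48, 72):
    print(f"after {hours:>2} h: {consumed_after(hours):6.1%} of the budget used")
```

At this rate the budget is fully spent after 72 hours, which agrees with the 3-day figure derived earlier.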
Multi-window alert thresholds (following the Google SRE Workbook burn-rate model):
| Alert window | Burn rate threshold | Budget consumed before alert fires | Severity |
|---|---|---|---|
| 5 min + 1 h | ≥ 14.4× | ~2% (fires fast; at this rate the budget is gone in ~2 days) | Page immediately |
| 30 min + 6 h | ≥ 6× | ~5% (budget exhausted in ~5 days) | Page (urgent) |
| 6 h + 3 d | ≥ 1× | ~10% (slowly burning — ticket review) | Ticket |
| End-of-month review | SLO missed | 100% (too late to react) | Post-mortem only |
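A hedged Python sketch of how the first three rows of this table could be evaluated. It follows the multi-window idea from the SRE Workbook, where an alert fires only when both the long and the short window exceed the threshold, so a spike that has already recovered does not page anyone. The `Rule` type, the `evaluate` function, and the window keys are hypothetical names, and the error rates are assumed to be supplied by your metrics system.

```python
from typing import Dict, NamedTuple, Optional


class Rule(NamedTuple):
    long_window: str
    short_window: str
    threshold: float   # burn-rate multiple
    severity: str


# Thresholds taken from the table above, most urgent first.
RULES = [
    Rule("1h", "5m", 14.4, "page"),
    Rule("6h", "30m", 6.0, "page"),
    Rule("3d", "6h", 1.0, "ticket"),
]


def evaluate(error_rates: Dict[str, float], slo: float) -> Optional[str]:
    """Return the severity of the first rule whose two windows both exceed its threshold."""
    budget = 1 - slo
    for rule in RULES:
        long_burn = error_rates[rule.long_window] / budget
        short_burn = error_rates[rule.short_window] / budget
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return rule.severity
    return None


# Sustained 2% errors in every window: burn rate of about 20x, so this pages.
sustained = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.02, "3d": 0.02}
print(evaluate(sustained, slo=0.999))

# A spike that has already recovered: the short windows are clean, so nothing fires.
recovered = {"5m": 0.0, "30m": 0.0, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
print(evaluate(recovered, slo=0.999))
```

Requiring the short window as well makes the alert reset soon after the problem stops, instead of staying red until the long window drains.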
Decision math — SLO target vs release velocity:
Sources: Google SRE Workbook — Alerting on SLOs (burn rates); Google SRE Book — Service Level Objectives; Google Cloud — SRE fundamentals.
🧠 Quick check
1. Which is the contractual promise to a customer, with penalties for missing it?
SLA = Agreement: the external contract with consequences. The SLI is the measurement; the SLO is your internal target (usually stricter than the SLA).
2. Why is "target 100% availability" considered a poor goal?
Diminishing returns: each extra nine multiplies cost while delivering imperceptible benefit. Teams pick the lowest target that keeps users happy and spend the rest on features.
3. Your error budget for the month is exhausted. The disciplined response is to:
The error budget governs risk: out of budget means stop taking risks and harden the system. With budget to spare, you're free to move fast.
✍️ Drill: set SLOs for a checkout API
You own the payment-checkout API. Propose two SLIs, sensible SLOs, and how the SLA should relate. Decide first.
Model answer: SLIs: (1) availability = % of checkout requests returning success (non-5xx); (2) latency = % of checkout requests completing in under, say, 500 ms. SLOs: availability ≥ 99.95% monthly (payments are high-stakes, so stricter than a typical service); latency: ≥ 99% of requests under 500 ms, which is the same as saying p99 latency ≤ 500 ms. SLA: promise customers something looser, e.g. 99.9%, so the gap between SLO (99.95%) and SLA (99.9%) is your safety margin: you'll usually breach your internal target and react long before you owe anyone a refund. Tie it together: the strict SLO justifies redundancy, idempotent retries (Lesson 08), and aggressive monitoring.
Rubric: ✓ user-centric SLIs (success + latency at a percentile) ✓ SLO stricter than SLA with a stated margin ✓ justifies strictness by payment stakes ✓ connects targets to design choices.
Key takeaways
- SLI = the measurement, SLO = your internal target, SLA = the external promise (kept looser for margin).
- Choose SLIs that reflect the user's experience — availability and latency at a percentile, not averages.
- 100% is the wrong target; pick the lowest reliability that keeps users happy.
- The error budget (1 − SLO) turns reliability into a currency that decides when to ship vs harden.