401 — SLIs, SLOs & Error Budgets

Advanced

Learn to define Service Level Indicators (SLIs), set Service Level Objectives (SLOs), and manage error budgets to make data-driven reliability decisions.

Learning Objectives

1. Define meaningful SLIs for any service
2. Set and measure SLOs with error budgets
3. Implement SLO recording rules in Prometheus
4. Build SLO dashboards and burn-rate alerts
Step 1

What is a Service Level Indicator (SLI)?

An SLI is a quantitative measure of some aspect of the level of service provided. It is the metric that tells you how your service is performing from the user's perspective.

Commands to Run

cat <<'EOF'
=== SERVICE LEVEL INDICATORS (SLIs) ===

An SLI is a ratio of good events to total events:

  SLI = (good events / total events) * 100%

Common SLI types:
  Availability:  successful requests / total requests
  Latency:       requests < 300ms / total requests
  Correctness:   correct responses / total responses
  Throughput:    served requests / expected capacity
EOF

What This Does

SLIs measure what users actually experience. Unlike internal metrics such as CPU usage, SLIs directly reflect service quality. The key insight is to express each SLI as a ratio of good events to total events, reported as a percentage from 0% to 100%, so that SLIs are comparable across services and thresholds are easy to set. Always measure from the user's perspective, not the server's.
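
To make the ratio concrete, here is a minimal sketch that turns event counts into an SLI percentage. The helper name sli_percent and the request counts are hypothetical and chosen only for illustration; they are not measurements from a real service.

```shell
# Hypothetical helper: print an SLI as a percentage with two decimals.
# Usage: sli_percent GOOD_EVENTS TOTAL_EVENTS
sli_percent() {
  awk -v good="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "n/a"; exit }   # no traffic: the SLI is undefined
    printf "%.2f\n", (good / total) * 100   # SLI = good / total * 100%
  }'
}

# Illustrative numbers only (assumed, not real measurements).
echo "Availability SLI: $(sli_percent 9990 10000)%"   # successful / total requests
echo "Latency SLI:      $(sli_percent 9500 10000)%"   # requests < 300ms / total requests
```

Handling the zero-traffic case explicitly avoids a division by zero and keeps an idle service from being reported as 0% or 100% by accident.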

Expected Outcome

You see the SLI formula and common SLI types printed to the terminal.

Pro Tips

  • Start with availability and latency SLIs; they cover most user-facing concerns
  • Measure SLIs at the load balancer or API gateway, which is closer to what users experience than per-server metrics
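
As a rough sketch of measuring at the edge, the snippet below derives availability and latency SLIs from request log lines. The log format (one request per line, HTTP status followed by latency in milliseconds), the function name sli_from_log, and the sample lines are all assumptions made for illustration; real load balancer or gateway logs will need their own field parsing.

```shell
# Hypothetical log format: one request per line, "<status> <latency_ms>".
sli_from_log() {
  awk '
    { total++ }
    $1 < 500 { ok++ }      # assumption: any non-5xx response counts as good
    $2 < 300 { fast++ }    # 300 ms threshold, matching the latency SLI in this step
    END {
      if (total == 0) { print "no requests"; exit }
      printf "availability=%.2f%% latency=%.2f%%\n", ok / total * 100, fast / total * 100
    }'
}

# Sample lines are made up for illustration.
printf '%s\n' "200 120" "200 450" "503 30" "200 80" "200 310" | sli_from_log
```

Both SLIs here use total requests as the denominator, following the formulas in this step. Some teams instead count only successful requests toward the latency SLI, so that fast errors do not inflate it.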