Core Idea
The SLI/SLO/SLA hierarchy formalizes performance and reliability aspirations into measurable commitments. SLIs are the metrics you measure, SLOs are the targets you set for those metrics, and SLAs are the external contracts with consequences for missing them. Error budgets derived from SLOs turn reliability governance into a continuous decision-making framework rather than a periodic audit.
The three-tier hierarchy:
- SLI (Service Level Indicator): A quantitative measure of system behaviour, such as p95 request latency, error rate, or availability ratio. SLIs are facts, not targets.
- SLO (Service Level Objective): A target for an SLI, e.g., “p95 latency < 200ms over a 30-day rolling window.” SLOs are internal commitments: missing one triggers internal action (deployment freeze, reliability-focused sprint). Set each SLO stricter than the corresponding external SLA, so that an SLO miss gives warning before the contract is breached.
- SLA (Service Level Agreement): An externally committed contract with financial or legal consequences for violation. The key distinction: SLA violations have external consequences; SLO violations trigger an internal response.
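The difference between an SLI (a measured fact) and an SLO (a target applied to that fact) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the function names `percentile` and `evaluate_slo`, the nearest-rank percentile method, and the sample latencies are all hypothetical; only the 200ms p95 target is taken from the SLO example in the list above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("cannot compute a percentile of zero samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_slo(latencies_ms, pct=95, target_ms=200.0):
    """Return (sli_value, slo_met).

    The SLI is the measurement itself; the SLO is the comparison of
    that measurement against a target.
    """
    sli_value = percentile(latencies_ms, pct)
    return sli_value, sli_value < target_ms


# Hypothetical measurement window of 100 requests: 95 fast, 5 slow.
window = [120.0] * 95 + [450.0] * 5
sli, met = evaluate_slo(window)
print(f"p95 = {sli} ms, SLO met: {met}")
```

In practice the percentile would usually come from a metrics backend over the rolling window rather than from an in-memory list; the sketch only shows where the measurement ends and the target begins.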
Selecting the right percentile for your SLI:
- User-facing interactive services: Target p95—this is the majority experience
- Internal platforms and APIs: Target p99. Every tenant depends on the platform and typically has no alternative, so the tail experience matters as much as the median
- Batch systems: Throughput-focused SLIs (jobs per hour, queue depth) rather than per-request latency
Error budgets as a governance mechanism: Error budget = 1 − SLO target. A 99.5% availability SLO yields a budget of 0.5%, which over a 30-day window (720 hours) is 3.6 hours of allowed downtime. When the budget is burning fast, freeze non-critical deployments and prioritize reliability work; when it is ample, accelerate feature delivery. This converts reliability from a subjective debate into a quantified business trade-off.
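The budget arithmetic can be checked with a short Python sketch. The function names `error_budget` and `budget_remaining` and the 54 minutes of downtime are hypothetical, introduced only for illustration; the 99.5% target and 30-day window are the figures used in this note.

```python
from datetime import timedelta


def error_budget(slo_target, window=timedelta(days=30)):
    """Return (budget_fraction, allowed_bad_time) for an availability SLO.

    slo_target is a fraction, e.g. 0.995 for 99.5%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget_fraction = 1.0 - slo_target
    return budget_fraction, window * budget_fraction


def budget_remaining(slo_target, downtime_so_far, window=timedelta(days=30)):
    """Fraction of the window's budget still unspent (negative if overspent)."""
    _, allowed = error_budget(slo_target, window)
    return 1.0 - downtime_so_far / allowed


fraction, allowed = error_budget(0.995)
hours = allowed.total_seconds() / 3600
print(f"budget: {fraction:.1%} of the window, about {hours:.1f} hours")

# Hypothetical: 54 minutes of downtime so far in the current window.
left = budget_remaining(0.995, timedelta(minutes=54))
print(f"budget remaining: {left:.0%}")
```

A value such as the remaining fraction is what an error budget policy would typically key on, for example by freezing non-critical deployments once it drops below an agreed level.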
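Alerting on burn rate, not raw thresholds: A rule such as “page when p99 exceeds 500ms for one minute” fires on transient spikes, producing alert storms, and stays silent during gradual degradation that never crosses the threshold. A better rule pages when the error budget is being consumed at, for example, 5× the sustainable rate over the past hour. The one-hour window catches acute incidents; pairing it with a longer window at a lower multiplier catches slow burns. Both reduce page fatigue relative to raw thresholds.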
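A burn rate is the observed error ratio divided by the error ratio the SLO allows. The sketch below shows that calculation and a paging decision in Python. The names `burn_rate` and `should_page` and the event counts are assumptions for illustration; the 5× threshold and one-hour window are the example values from this note, not a recommendation for any particular service.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget would be spent exactly at the end of the SLO
    window; 5.0 means it is being spent five times too fast.
    """
    if total_events == 0:
        return 0.0  # no traffic, so no budget was consumed
    allowed_ratio = 1.0 - slo_target
    return (bad_events / total_events) / allowed_ratio


def should_page(bad_events, total_events, slo_target, threshold=5.0):
    """True when the burn rate over the evaluated window reaches the threshold."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold


# Hypothetical counts for the last hour against a 99.5% SLO.
# 3% errors is a burn rate of about 6, which reaches the 5x threshold.
print(should_page(bad_events=300, total_events=10_000, slo_target=0.995))
# 1% errors is a burn rate of about 2, which does not.
print(should_page(bad_events=100, total_events=10_000, slo_target=0.995))
```

The same function can be evaluated over a second, longer window with a lower threshold to cover slow burns; the window lengths and thresholds are policy choices rather than fixed constants.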
Anti-patterns to avoid:
- Setting SLOs from current performance—bakes in existing inefficiencies rather than reflecting user needs
- Monitoring averages instead of percentiles—average latency hides tail latency (see Latency-Percentiles)
- Defining SLOs without an agreed-upon error budget policy—makes them decoration, not governance
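The averages anti-pattern is easy to demonstrate with a small, made-up sample. In the Python sketch below, the latency values and the helper name `nearest_rank` are hypothetical; the point is only that a mean can look acceptable while the tail is far outside any reasonable target.

```python
import math
from statistics import mean


def nearest_rank(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 95 requests at 100 ms and 5 requests at 3000 ms.
latencies_ms = [100.0] * 95 + [3000.0] * 5

avg = mean(latencies_ms)
p50 = nearest_rank(latencies_ms, 50)
p99 = nearest_rank(latencies_ms, 99)

print(f"mean = {avg} ms, p50 = {p50} ms, p99 = {p99} ms")
```

Here the mean sits between the typical request and the slow tail and describes neither, which is why a latency SLI defined on the average would hide the five slow requests entirely.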
Related Concepts
- Latency-Percentiles — The measurement foundation for latency SLIs
- Availability — SLA availability targets (nines) and how they relate to SLOs
- Fitness Functions — Automated enforcement of SLO thresholds in CI/CD
- Operational-Measures — SLIs as a specific class of operational runtime metrics
- Architecture-Decision-Records — SLO targets warrant ADR documentation as architectural decisions
Sources
- Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media. Chapter: “Service Level Objectives.” ISBN: 978-1-491-92912-4.
  - Canonical definition of SLI/SLO/SLA hierarchy and error budget concept
  - Available: https://sre.google/sre-book/service-level-objectives/
- Uptrace (2025). “Defining SLA/SLO-Driven Monitoring Requirements in 2025.” Uptrace Blog.
  - Practitioner guide to implementing SLO-driven monitoring and alerting
  - Available: https://uptrace.dev/blog/sla-slo-monitoring-requirements
- Nobl9 (2024). “SLO Metrics: A Best Practices Guide.” Nobl9 Blog.
  - Best practices for selecting SLIs and setting realistic SLO targets
  - Available: https://www.nobl9.com/service-level-objectives/slo-metrics
- incident.io (2024). “SLOs, SLAs, and SLIs: A complete guide to service reliability metrics.” incident.io Blog.
  - Comprehensive guide distinguishing the three tiers with practical examples
  - Available: https://incident.io/blog/slo-sla-sli
Note
This content was drafted with assistance from AI tools for research, organization, and initial content generation. All final content has been reviewed, fact-checked, and edited by the author to ensure accuracy and alignment with the author’s intentions and perspective.