Core Idea

The SLI/SLO/SLA hierarchy formalizes performance and reliability aspirations into measurable commitments. SLIs are the metrics you measure, SLOs are the targets you set for those metrics, and SLAs are the external contracts with consequences for missing them. Error budgets derived from SLOs turn reliability governance into a continuous decision-making framework rather than a periodic audit.

The Three-Tier Hierarchy

SLI — Service Level Indicator:

  • A quantitative measure of system behaviour
  • Examples: p95 request latency, error rate over a window, availability (successful requests / total requests)
  • SLIs are facts, not targets — they are measurements

SLO — Service Level Objective:

  • A target range or threshold for an SLI
  • Examples: “p95 latency < 200ms, measured over a 30-day rolling window” or “error rate < 0.1% of requests”
  • SLOs are internal commitments — missing them triggers internal action (deployment freeze, engineering sprint)
  • SLOs should maintain a safety margin below the external SLA: if your SLA promises 99.9% uptime, your internal SLO might target 99.95%

SLA — Service Level Agreement:

  • An externally committed contract, typically with financial or legal consequences for violation
  • Examples: customer contracts, cloud provider commitments (AWS EC2: 99.99% monthly uptime)
  • Key distinction: SLA violations have external consequences (refunds, penalties, churn); SLO violations trigger internal response

Common Latency SLIs

  • p50 request latency: Median; baseline experience
  • p95 request latency: Most users’ worst-case; preferred SLI for interactive services
  • p99 request latency: Tail latency; preferred for internal platforms and B2B services where all tenants matter equally
  • Error rate: Percentage of requests that return a 5xx response or time out
  • Availability ratio: (Successful requests / Total requests) × 100%
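The SLIs above can be computed directly from raw request records. A minimal sketch in Python, using a nearest-rank percentile and illustrative data (the request tuples and function names are assumptions, not from any particular monitoring system):

```python
# Sketch: computing latency-percentile and availability SLIs from raw
# request records. Data values are illustrative.
requests = [
    # (latency_ms, status_code)
    (120, 200), (95, 200), (340, 200), (80, 500),
    (210, 200), (1500, 504), (60, 200), (105, 200),
]

latencies = sorted(ms for ms, _ in requests)

def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value covering p% of observations."""
    k = max(0, round(p / 100 * len(sorted_values)) - 1)
    return sorted_values[k]

p50 = percentile(latencies, 50)   # median: baseline experience
p95 = percentile(latencies, 95)   # most users' worst case
p99 = percentile(latencies, 99)   # tail latency

errors = sum(1 for _, code in requests if code >= 500)
error_rate = errors / len(requests)
availability = 1 - error_rate     # successful / total

print(f"p50={p50}ms p95={p95}ms p99={p99}ms")
print(f"error_rate={error_rate:.1%} availability={availability:.1%}")
```

Note how the single 1500 ms timeout dominates p95 and p99 while leaving p50 untouched, which is exactly why averages and medians alone cannot drive SLO compliance.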

Setting Realistic Latency SLOs

Start from user research, not current performance:

  • Ask: what latency causes users to abandon or complain? What does your competitor offer?
  • Avoid the anti-pattern of setting SLOs based on what the system currently delivers — this bakes in existing inefficiencies

Match the percentile to the service type:

  • User-facing interactive services: Target p95 — it bounds the worst-case experience for 95% of users
  • Internal platforms and APIs: Target p99 — internal clients have no alternative; all experience levels matter
  • Batch systems: Throughput-focused SLIs (jobs per hour, queue depth) rather than per-request latency

Maintain a safety margin:

  • Internal SLO target should be stricter than the external SLA commitment
  • Gives time to detect and respond before SLA violation occurs
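The headroom that a stricter internal SLO buys can be quantified. A small sketch, using the 99.9%/99.95% figures from the text and a 30-day window (the variable names are illustrative):

```python
# Sketch: detection headroom between an internal SLO and an external SLA
# over a 30-day window. Targets are the examples from the text.
WINDOW_HOURS = 30 * 24  # 720 hours

sla_target = 0.999    # external commitment: 99.9% uptime
slo_target = 0.9995   # stricter internal objective: 99.95% uptime

sla_downtime_allowed = (1 - sla_target) * WINDOW_HOURS   # hours before SLA breach
slo_downtime_allowed = (1 - slo_target) * WINDOW_HOURS   # hours before SLO breach

headroom_hours = sla_downtime_allowed - slo_downtime_allowed

print(f"SLA allows {sla_downtime_allowed:.2f} h, SLO allows {slo_downtime_allowed:.2f} h")
print(f"Headroom to detect and respond: {headroom_hours:.2f} h")
```

Here the internal SLO is exhausted roughly 22 minutes of downtime before the SLA would be, which is the window in which the team can detect and mitigate.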

Error Budgets

An error budget is the allowable unreliability derived from the SLO:

  • Error budget = 1 - SLO target
  • Example: SLO of 99.5% availability → 0.5% error budget
  • Over a 30-day window, 0.5% = approximately 3.6 hours of allowed downtime
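The arithmetic above, expressed as a short sketch (targets and window are the example figures from the text):

```python
# Sketch: error budget derived from an availability SLO.
slo_target = 0.995               # 99.5% availability SLO
error_budget = 1 - slo_target    # 0.5% allowed unreliability

window_hours = 30 * 24           # 30-day window = 720 hours
allowed_downtime_hours = error_budget * window_hours

print(f"error budget = {error_budget:.1%}")
print(f"allowed downtime over 30 days = {allowed_downtime_hours:.1f} hours")
```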

Error budgets as a governance mechanism:

  • Budget burning fast → Freeze non-critical deployments; prioritize reliability work
  • Budget ample → Accelerate feature velocity; the system has reliability headroom
  • This turns reliability from a subjective debate (“are we reliable enough?”) into a quantified business trade-off

Alerting on Burn Rate, Not Raw Thresholds

Anti-pattern: Alert when p99 latency exceeds 500ms for one minute

  • Produces alert storms during normal traffic spikes
  • Misses slow, gradual SLO degradation that stays below the threshold

Better approach: Alert on SLO burn rate

  • “Error budget is being consumed at 5× the expected rate for the past hour”
  • Catches both acute spikes and slow burns
  • Reduces page fatigue by alerting on trajectory, not instantaneous threshold crossings
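A burn-rate check can be sketched in a few lines. The 5× threshold and the example traffic numbers below are illustrative; production setups commonly pair a fast window (minutes) with a slow one (an hour or more) to catch both spikes and slow burns:

```python
# Sketch: alerting on error-budget burn rate rather than a raw threshold.
def burn_rate(bad_requests, total_requests, slo_target):
    """How fast the error budget is burning relative to the sustainable rate.

    A burn rate of 1.0 exhausts the budget in exactly one SLO window;
    5.0 exhausts it in one fifth of the window.
    """
    if total_requests == 0:
        return 0.0
    observed_error_rate = bad_requests / total_requests
    budget = 1 - slo_target
    return observed_error_rate / budget

# Last hour: 1,200 failures out of 40,000 requests against a 99.5% SLO.
rate = burn_rate(bad_requests=1_200, total_requests=40_000, slo_target=0.995)

ALERT_THRESHOLD = 5.0  # page when the budget burns at 5x the sustainable rate
if rate >= ALERT_THRESHOLD:
    print(f"PAGE: error budget burning at {rate:.1f}x the expected rate")
```

A 3% observed error rate against a 0.5% budget yields a burn rate of 6×, which pages; the same logic stays quiet during a brief latency spike that barely dents the budget.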

Anti-Patterns

  • SLOs from current performance: “We’re currently at 99.7% uptime, so let’s SLO at 99.5%” — bakes in existing problems and doesn’t reflect user needs
  • Monitoring averages instead of percentiles: Average latency hides tail latency that determines SLO compliance for real users (see Latency-Percentiles)
  • No error budget policy: Defining SLOs without agreed-upon responses to budget exhaustion makes them decoration, not governance

Sources

  • Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media. Chapter: “Service Level Objectives.” ISBN: 978-1-491-92912-4.

  • Uptrace (2025). “Defining SLA/SLO-Driven Monitoring Requirements in 2025.” Uptrace Blog.

  • Nobl9 (2024). “SLO Metrics: A Best Practices Guide.” Nobl9 Blog.

  • incident.io (2024). “SLOs, SLAs, and SLIs: A complete guide to service reliability metrics.” incident.io Blog.

Note

This content was drafted with assistance from AI tools for research, organization, and initial content generation. All final content has been reviewed, fact-checked, and edited by the author to ensure accuracy and alignment with the author’s intentions and perspective.