Argument

Architectural integrity requires operationalizing performance as measurable commitments. The SLI→SLO→SLA hierarchy — rooted in percentile-based latency targets — combined with error budgets and automated fitness functions, creates a feedback loop that makes performance governance continuous rather than periodic. Without this loop, “performance” remains a vague aspiration that degrades silently under deadline pressure.

1. From Measurement to Commitment

Most teams can describe their performance aspirations: “the system should be fast” or “we aim for high availability.” These aspirations are nearly worthless as architectural governance tools because they are unverifiable, unenforceable, and unmeasurable.

The transition from aspiration to commitment requires three things:

  1. A metric that captures the aspiration (SLI — the measurement)
  2. A target for that metric (SLO — the commitment)
  3. A consequence for missing the target (SLA — the contract, or internal error budget policy)

Once you have percentile-based SLIs (see Latency-Percentiles and Service-Level-Indicators), you can set specific, verifiable targets. “p95 latency < 200ms over a 30-day rolling window” is a governance object. It can be monitored, violated, and debated with precision. “The system should be performant” cannot.

This specificity also enables meaningful trade-off conversations. Instead of debating whether performance or maintainability matters more in the abstract, teams can ask: “Is p95 < 300ms acceptable if it lets us reduce deployment complexity?” The concrete target grounds the discussion in business value.

2. The SLI → SLO → SLA Hierarchy

SLI (Service Level Indicator): A quantitative measurement of system behaviour.

  • Latency SLI: p95 request latency, measured in milliseconds over a rolling window
  • Error SLI: Percentage of requests returning error (non-2xx) status codes
  • Availability SLI: Successful requests / total requests
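As a concrete sketch, these three SLIs can be computed from raw request records. A minimal Python example (the `Request` shape is an illustrative assumption) using the nearest-rank percentile method and counting non-2xx responses as errors:

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int

def p95_latency(requests):
    """Latency SLI: p95 over the window, via the nearest-rank method."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = math.ceil(0.95 * len(latencies))
    return latencies[rank - 1]

def error_rate(requests):
    """Error SLI: fraction of requests with a non-2xx status."""
    errors = sum(1 for r in requests if not 200 <= r.status < 300)
    return errors / len(requests)

def availability(requests):
    """Availability SLI: successful requests / total requests."""
    return 1.0 - error_rate(requests)
```

In production these numbers come from the monitoring stack rather than in-process lists, but the definitions are the same.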

SLO (Service Level Objective): A target range for an SLI, maintained as an internal commitment.

  • Example: “p95 latency < 200ms for 99.9% of measurements over any 30-day window”
  • Missing an SLO triggers internal action: investigation, freeze on non-critical deployments, engineering sprint
  • The SLO should be stricter than the SLA, leaving a safety margin: if customers are promised 99.9% availability, the internal SLO targets 99.95%

SLA (Service Level Agreement): An external contract with consequences for violation.

  • Examples: customer contracts with refund clauses, cloud provider uptime guarantees
  • SLA violations have external consequences: financial penalties, customer churn, reputational damage
  • The safety margin between SLO and SLA gives time to detect and respond before SLA breach

Key distinction: missing an SLA triggers external consequences; missing an SLO triggers internal action. The distinction matters because it determines the urgency and the type of response.

3. Error Budgets as a Decision Framework

The error budget is the most powerful concept in SLO-driven governance. It transforms reliability from a subjective judgment into a quantified business resource:

  • Error budget = 1 - SLO target
  • Example: SLO of 99.5% availability → 0.5% error budget → approximately 3.6 hours allowed downtime per 30-day window
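The budget arithmetic is simple enough to make executable; a minimal sketch:

```python
def error_budget_hours(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in hours, implied by an availability SLO over a window."""
    budget_fraction = 1.0 - slo_target        # e.g. 0.005 for a 99.5% SLO
    window_hours = window_days * 24           # 720 hours in a 30-day window
    return budget_fraction * window_hours

# A 99.5% SLO over 30 days allows 0.005 * 720 = 3.6 hours of downtime.
```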

The error budget creates a natural governance mechanism:

Budget Status                          | Interpretation                     | Action
---------------------------------------|------------------------------------|-------------------------------------------------------------
Budget nearly exhausted                | Reliability risk exceeds tolerance | Freeze non-critical deployments; prioritize reliability work
Budget burning fast (5× expected rate) | Acute reliability problem          | Immediate investigation; consider incident response
Budget ample                           | System has reliability headroom    | Accelerate feature velocity; reduce process overhead
Budget consistently unconsumed         | SLO may be too loose               | Consider tightening SLO target

This governance mechanism is neutral: it makes the trade-off between reliability and feature velocity explicit and quantified, rather than leaving it to subjective judgment in each sprint planning meeting.
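The table above can be read as a policy function. A sketch in Python; the numeric thresholds are illustrative assumptions, not prescribed values:

```python
def budget_action(remaining_fraction: float, burn_rate: float) -> str:
    """Map error-budget status to a governance action (illustrative thresholds)."""
    if burn_rate >= 5.0:              # burning 5x faster than the budget allows
        return "immediate investigation; consider incident response"
    if remaining_fraction <= 0.1:     # budget nearly exhausted
        return "freeze non-critical deployments; prioritize reliability work"
    if remaining_fraction >= 0.9:     # budget (so far) largely unconsumed
        return "consider tightening the SLO target"
    return "accelerate feature velocity; reduce process overhead"
```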

Google’s SRE book captures the philosophy: once the error budget is spent, the entire engineering team’s incentive aligns with reliability work. The budget creates a shared accountability structure rather than a developer-vs-SRE tension.

4. Latency SLO Targets by Service Type

Not all services warrant the same percentile target or stringency:

User-facing interactive services (APIs serving human users):

  • Target: p95 latency
  • Rationale: p95 captures the experience of the vast majority of users; p99 would be overly strict for UX optimization, since users rarely perceive the difference between p95 and p99 at typical latency levels
  • Typical target: < 200ms to feel instantaneous; < 500ms acceptable; > 1s causes measurable abandonment

Internal platforms and B2B APIs:

  • Target: p99 latency
  • Rationale: Internal clients have no alternative service to switch to; all callers’ experiences matter. For B2B with enterprise customers, per-tenant p99 monitoring may be more appropriate than aggregate
  • Typical target: < 500ms for most internal services; < 100ms for high-throughput data pipelines

Batch and background systems:

  • Target: throughput SLIs (jobs per hour, queue depth, processing lag)
  • Rationale: Individual request latency matters less than end-to-end batch completion time
  • Typical target: Pipeline lag < 5 minutes; daily batch jobs complete within 2-hour window

Critical path financial or safety systems:

  • Target: p99.9 or maximum
  • Rationale: Every user’s experience matters equally; tail latency is a correctness concern, not just a UX concern
  • Typical target: < 500μs for order execution; < 10ms for pricing feeds
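These per-service-type targets lend themselves to a declarative registry that fitness functions can read. A sketch (the registry structure and service names are assumptions; the numbers mirror the examples above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatencySLO:
    percentile: float       # e.g. 0.95 for p95
    target_ms: float
    window_days: int = 30

# Illustrative entries mirroring the service-type guidance above
SLO_REGISTRY = {
    "user-facing-api":  LatencySLO(percentile=0.95,  target_ms=200.0),
    "internal-service": LatencySLO(percentile=0.99,  target_ms=500.0),
    "pricing-feed":     LatencySLO(percentile=0.999, target_ms=10.0),
}

def meets_slo(service: str, observed_ms: float) -> bool:
    """Check an observed percentile latency against the registered target."""
    return observed_ms <= SLO_REGISTRY[service].target_ms
```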

5. Fitness Functions as SLO Enforcement

SLOs defined in documents degrade over time without enforcement. The most effective enforcement mechanism is Fitness Functions — automated checks integrated into CI/CD pipelines:

Pre-merge fitness functions:

  • Run load tests against a staging environment on every pull request
  • Block merge if p95 latency target would be violated
  • Tools: Gatling, k6, JMeter with threshold assertions
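A pre-merge gate reduces to parsing the load-test summary and failing the build on a threshold breach. A minimal Python sketch, assuming the load tool exports a JSON summary containing a p95 figure (the summary schema here is hypothetical; k6 and Gatling can express the same check natively as threshold assertions):

```python
import json
import sys

P95_BUDGET_MS = 200.0  # the SLO target this pipeline enforces

def gate(summary_path: str) -> int:
    """Return a process exit code: 0 allows the merge, 1 blocks it."""
    with open(summary_path) as f:
        summary = json.load(f)
    p95 = summary["latency_ms"]["p95"]  # hypothetical summary schema
    if p95 > P95_BUDGET_MS:
        print(f"FAIL: p95 {p95:.1f}ms exceeds the {P95_BUDGET_MS:.0f}ms budget")
        return 1
    print(f"OK: p95 {p95:.1f}ms is within budget")
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```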

Continuous production fitness functions:

  • Monitor SLO burn rate in real-time
  • Alert when error budget consumption rate exceeds 5× expected (not when raw threshold is crossed)
  • Correlate anomalies with recent deployments for root-cause attribution

This closes the architecture governance loop:

  1. Architecture decision → SLO commitment
  2. SLO commitment → fitness function implementation
  3. Fitness function → automated enforcement in CI/CD
  4. Enforcement → no deployment that violates the commitment reaches production

Without step 3, the SLO is advisory. With it, the SLO becomes a hard constraint. This is the difference between Architectural-Governance as documentation and governance as enforcement.

The connection to Architecture-Decision-Records (ADRs) is direct: each SLO target is an architectural decision that warrants an ADR documenting the rationale, the measurement methodology, the target, and the enforcement mechanism. This prevents SLOs from drifting without conscious revision.

6. Anti-Patterns

SLOs derived from current performance (“We’re currently at 99.7% uptime, so let’s commit to 99.5%”):

  • Bakes existing inefficiencies into the commitment
  • Doesn’t reflect user needs or competitive requirements
  • Correct approach: Start from user research and competitive analysis, then work backward to what the architecture must achieve

Monitoring averages instead of percentiles:

  • Average latency can be green while significant portions of users experience unacceptable latency
  • SLOs defined on averages are easily gamed by fast-path optimization that leaves slow-path users unaddressed
  • Correct approach: Define SLIs using Latency-Percentiles (p95 or p99), not mean response time

No error budget policy:

  • Defining SLOs without agreed-upon consequences for budget exhaustion makes them decoration
  • Engineers don’t know whether to treat budget depletion as urgent
  • Correct approach: Write the policy before the first SLO violation, not during it

SLOs without safety margins:

  • Setting the SLO target at the same level as the SLA commitment means any SLO miss is immediately an SLA miss
  • No time to detect and recover before external consequences trigger
  • Correct approach: Make the internal SLO stricter than the external SLA by a margin of roughly 0.1-0.5 percentage points

Synthesis

SLO-driven governance is not an operations concern — it is an architectural one. The choice of which percentile to target, what threshold to commit to, and how to enforce that commitment automatically are architectural decisions that determine whether performance is a first-class constraint or a nice-to-have aspiration.

The SLI→SLO→SLA hierarchy combined with error budgets creates a continuous feedback loop: measure precisely, commit explicitly, budget for unreliability, enforce automatically, and respond to budget burn as a governance signal. This loop makes performance governance continuous rather than a periodic review exercise that lags months behind the code.

Sources

  • Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media. Chapter: “Service Level Objectives.” ISBN: 978-1-491-92912-4.

  • Uptrace (2025). “Defining SLA/SLO-Driven Monitoring Requirements in 2025.” Uptrace Blog.

  • Nobl9 (2024). “SLO Metrics: A Best Practices Guide.” Nobl9 Blog.

  • incident.io (2024). “SLOs, SLAs, and SLIs: A complete guide to service reliability metrics.” incident.io Blog.

  • Ford, Neal, Mark Richards, Pramod Sadalage, and Zhamak Dehghani (2022). Software Architecture: The Hard Parts - Modern Trade-Off Analyses for Distributed Architectures. O’Reilly Media. ISBN: 9781492086895.

    • Governance and fitness function patterns for distributed architectures

Note

This content was drafted with assistance from AI tools for research, organization, and initial content generation. All final content has been reviewed, fact-checked, and edited by the author to ensure accuracy and alignment with the author’s intentions and perspective.