Argument
Architectural integrity requires operationalizing performance as measurable commitments. The SLI→SLO→SLA hierarchy — rooted in percentile-based latency targets — combined with error budgets and automated fitness functions, creates a feedback loop that makes performance governance continuous rather than periodic. Without this loop, “performance” remains a vague aspiration that degrades silently under deadline pressure.
1. From Measurement to Commitment
Most teams can describe their performance aspirations: “the system should be fast” or “we aim for high availability.” These aspirations are nearly worthless as architectural governance tools because they are unverifiable, unenforceable, and unmeasurable.
The transition from aspiration to commitment requires three things:
- A metric that captures the aspiration (SLI — the measurement)
- A target for that metric (SLO — the commitment)
- A consequence for missing the target (SLA — the contract, or internal error budget policy)
Once you have percentile-based SLIs (see Latency-Percentiles and Service-Level-Indicators), you can set specific, verifiable targets. “P95 latency < 200ms over a 30-day rolling window” is a governance object. It can be monitored, violated, and debated with precision. “The system should be performant” cannot.
This specificity also enables meaningful trade-off conversations. Instead of debating whether performance or maintainability matters more in the abstract, teams can ask: “Is P95 < 300ms acceptable if it lets us reduce deployment complexity?” The concrete target grounds the discussion in business value.
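The difference between an aspiration and a governance object can be made concrete in code. A minimal sketch, where the `LatencySLO` type and its field names are illustrative rather than taken from any library:

```python
# Sketch: an SLO as a concrete "governance object" rather than an aspiration.
# The LatencySLO type and its fields are illustrative, not from any library.
from dataclasses import dataclass

@dataclass(frozen=True)
class LatencySLO:
    metric: str          # the SLI, e.g. "p95 request latency"
    threshold_ms: float  # the committed target
    window_days: int     # rolling evaluation window

    def is_met(self, measured_ms: float) -> bool:
        return measured_ms < self.threshold_ms

checkout_slo = LatencySLO("p95 request latency", threshold_ms=200, window_days=30)
print(checkout_slo.is_met(185))  # True: a measured p95 of 185ms meets the target
print(checkout_slo.is_met(240))  # False: a violation that can be debated with precision
```

Unlike "the system should be performant", this object can be monitored, versioned, and cited in a design review.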
2. The SLI → SLO → SLA Hierarchy
SLI (Service Level Indicator): A quantitative measurement of system behaviour.
- Latency SLI: p95 request latency, measured in milliseconds over a rolling window
- Error SLI: Percentage of requests returning error responses (typically HTTP 5xx; counting all non-200 codes would misclassify redirects and client errors as failures)
- Availability SLI: Successful requests / total requests
SLO (Service Level Objective): A target range for an SLI, maintained as an internal commitment.
- Example: “p95 latency < 200ms for 99.9% of measurements over any 30-day window”
- Missing an SLO triggers internal action: investigation, freeze on non-critical deployments, engineering sprint
- The SLO should maintain a safety margin below the SLA: if customers are promised 99.9% availability, the internal SLO targets 99.95%
SLA (Service Level Agreement): An external contract with consequences for violation.
- Examples: customer contracts with refund clauses, cloud provider uptime guarantees
- SLA violations have external consequences: financial penalties, customer churn, reputational damage
- The safety margin between SLO and SLA gives time to detect and respond before SLA breach
Key distinction: missing an SLA triggers external consequences; missing an SLO triggers internal action. The distinction matters because it determines the urgency and the type of response.
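The hierarchy above can be sketched as a small evaluation function. The thresholds are illustrative, using an internal SLO of 99.95% against an external SLA of 99.9% to match the safety-margin example:

```python
# Sketch: evaluating an availability SLI against an internal SLO and an
# external SLA. Thresholds are illustrative, not from any library.

def availability_sli(successful: int, total: int) -> float:
    """Availability SLI: fraction of successful requests."""
    return successful / total if total else 1.0

def classify(sli: float, slo: float = 0.9995, sla: float = 0.999) -> str:
    """The SLO is stricter than the SLA, leaving a safety margin to react."""
    if sli < sla:
        return "SLA breach: external consequences (penalties, refunds, churn)"
    if sli < slo:
        return "SLO miss: internal action (investigate, slow deployments)"
    return "healthy"

print(classify(availability_sli(999_600, 1_000_000)))  # 0.9996 -> healthy
print(classify(availability_sli(999_200, 1_000_000)))  # 0.9992 -> SLO miss, SLA intact
print(classify(availability_sli(998_000, 1_000_000)))  # 0.9980 -> SLA breach
```

The middle case is the safety margin doing its job: the internal commitment is missed, triggering action before the external contract is violated.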
3. Error Budgets as a Decision Framework
The error budget is the most powerful concept in SLO-driven governance. It transforms reliability from a subjective judgment into a quantified business resource:
- Error budget = 1 - SLO target
- Example: SLO of 99.5% availability → 0.5% error budget → approximately 3.6 hours allowed downtime per 30-day window
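The budget arithmetic is simple enough to sketch directly:

```python
# Sketch: deriving the error budget (as allowed "bad" time) from an SLO target.
from datetime import timedelta

def error_budget(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Allowed downtime per window: (1 - SLO) x window length."""
    return (1.0 - slo) * window

print(error_budget(0.995))  # 0.5% of 30 days, i.e. about 3.6 hours
print(error_budget(0.999))  # 0.1% of 30 days, i.e. about 43 minutes
```

Note how quickly the budget shrinks as nines are added; each extra nine divides the allowed downtime by ten.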
The error budget creates a natural governance mechanism:
| Budget Status | Interpretation | Action |
|---|---|---|
| Budget nearly exhausted | Reliability risk exceeds tolerance | Freeze non-critical deployments; prioritize reliability work |
| Budget burning fast (5× expected rate) | Acute reliability problem | Immediate investigation; consider incident response |
| Budget ample | System has reliability headroom | Accelerate feature velocity; reduce process overhead |
| Budget consistently unconsumed | SLO may be too loose | Consider tightening SLO target |
This governance mechanism is neutral: it makes the trade-off between reliability and feature velocity explicit and quantified, rather than leaving it to subjective judgment in each sprint planning meeting.
Google’s SRE book captures the philosophy: once the error budget is spent, the entire engineering team’s incentive aligns with reliability work. The budget creates a shared accountability structure rather than a developer-vs-SRE tension.
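The policy table can be encoded as an explicit decision function, so the governance response is mechanical rather than re-debated per incident. The specific thresholds below (5x burn, 10%/90% remaining budget) are illustrative policy choices, not standards:

```python
# Sketch of an error-budget policy as a decision function mirroring the table.
# Thresholds (5x burn, 10%/90% remaining) are illustrative policy choices.

def budget_action(remaining_fraction: float, burn_rate: float) -> str:
    """Map budget status to a governance action.

    remaining_fraction: share of the window's error budget still unspent (0..1)
    burn_rate: observed consumption rate relative to the sustainable pace
    """
    if burn_rate >= 5.0:
        return "incident: investigate immediately"
    if remaining_fraction <= 0.10:
        return "freeze non-critical deployments; prioritize reliability work"
    if remaining_fraction >= 0.90:
        return "headroom: accelerate feature velocity; consider tightening the SLO"
    return "normal operations"

print(budget_action(remaining_fraction=0.05, burn_rate=1.0))
print(budget_action(remaining_fraction=0.50, burn_rate=6.0))
```

Writing the policy down as code (or as an equally explicit runbook) is what makes the trade-off quantified rather than subjective.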
4. Latency SLO Targets by Service Type
Not all services warrant the same percentile target or stringency:
User-facing interactive services (APIs serving human users):
- Target: p95 latency
- Rationale: p95 bounds the experience of 95% of requests, which covers the vast majority of users; p99 is overly strict as a UX optimization target, since users rarely perceive the difference between p95 and p99 at typical latency levels
- Typical target: < 200ms to feel instantaneous; < 500ms acceptable; > 1s causes measurable abandonment
Internal platforms and B2B APIs:
- Target: p99 latency
- Rationale: Internal clients have no alternative service to switch to; all callers’ experiences matter. For B2B with enterprise customers, per-tenant p99 monitoring may be more appropriate than aggregate
- Typical target: < 500ms for most internal services; < 100ms for high-throughput data pipelines
Batch and background systems:
- Target: throughput SLIs (jobs per hour, queue depth, processing lag)
- Rationale: Individual request latency matters less than end-to-end batch completion time
- Typical target: Pipeline lag < 5 minutes; daily batch jobs complete within 2-hour window
Critical path financial or safety systems:
- Target: p99.9 or maximum
- Rationale: Every user’s experience matters equally; tail latency is a correctness concern, not just a UX concern
- Typical target: < 500μs for order execution; < 10ms for pricing feeds
5. Fitness Functions as SLO Enforcement
SLOs defined in documents degrade over time without enforcement. The most effective enforcement mechanism is Fitness Functions — automated checks integrated into CI/CD pipelines:
Pre-merge fitness functions:
- Run load tests against a staging environment on every pull request
- Block merge if p95 latency target would be violated
- Tools: Gatling, k6, JMeter with threshold assertions
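A pre-merge gate along these lines might look like the following sketch. It assumes the load tool has dumped raw per-request latencies for the gate to evaluate; in practice, k6, Gatling, and JMeter can also assert thresholds natively in their own configuration:

```python
# Sketch of a pre-merge fitness function: flag the build when staging
# load-test latencies violate the p95 target. Assumes raw per-request
# latencies are available; real tools can assert thresholds natively.
import statistics

P95_TARGET_MS = 200.0

def p95(latencies_ms: list[float]) -> float:
    # 99 percentile cut points; index 94 is the 95th percentile
    return statistics.quantiles(latencies_ms, n=100)[94]

def gate(latencies_ms: list[float]) -> int:
    observed = p95(latencies_ms)
    if observed >= P95_TARGET_MS:
        print(f"FAIL: p95 {observed:.0f}ms >= target {P95_TARGET_MS:.0f}ms")
        return 1  # CI would exit non-zero here, blocking the merge
    print(f"PASS: p95 {observed:.0f}ms < target {P95_TARGET_MS:.0f}ms")
    return 0

# In CI this would parse the load-test report rather than a literal list.
exit_code = gate([50.0] * 950 + [400.0] * 50)
```

In a real pipeline the script would end with `sys.exit(exit_code)` so the CI system interprets the violation as a failed check.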
Continuous production fitness functions:
- Monitor SLO burn rate in real-time
- Alert when error budget consumption rate exceeds 5× expected (not when raw threshold is crossed)
- Correlate anomalies with recent deployments for root-cause attribution
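Burn-rate alerting can be sketched as a ratio of the observed error rate to the budgeted error rate; the 5x paging threshold used here is a common starting point rather than a standard:

```python
# Sketch: error-budget burn-rate alerting. Alerts on the *rate* of budget
# consumption rather than a raw threshold crossing.

def burn_rate(errors: int, total: int, slo: float) -> float:
    """Observed error rate divided by the budgeted error rate (1 - SLO).

    1.0 means the budget is being spent at exactly the sustainable pace;
    5.0 means it will be exhausted in one fifth of the SLO window.
    """
    budgeted = 1.0 - slo
    return (errors / total) / budgeted if total else 0.0

def should_page(errors: int, total: int, slo: float, factor: float = 5.0) -> bool:
    return burn_rate(errors, total, slo) >= factor

# 60 errors in 10,000 requests against a 99.9% SLO:
# observed rate 0.6% vs budgeted 0.1% -> burn rate ~6x -> page.
print(should_page(60, 10_000, slo=0.999))
```

Production implementations typically evaluate this over multiple windows (e.g. a fast and a slow window) to balance detection speed against false alarms.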
This closes the architecture governance loop:
- Architecture decision → SLO commitment
- SLO commitment → fitness function implementation
- Fitness function → automated enforcement in CI/CD
- Enforcement → no deployment that violates the commitment reaches production
Without step 3, the SLO is advisory. With it, the SLO becomes a hard constraint. This is the difference between Architectural-Governance as documentation and governance as enforcement.
The connection to Architecture-Decision-Records (ADRs) is direct: each SLO target is an architectural decision that warrants an ADR documenting the rationale, the measurement methodology, the target, and the enforcement mechanism. This prevents SLOs from drifting without conscious revision.
6. Anti-Patterns
SLOs from current performance (“We’re currently at 99.7% uptime, so let’s commit to 99.5%”):
- Bakes existing inefficiencies into the commitment
- Doesn’t reflect user needs or competitive requirements
- Correct approach: Start from user research and competitive analysis, then work backward to what the architecture must achieve
Monitoring averages instead of percentiles:
- Average latency can be green while significant portions of users experience unacceptable latency
- SLOs defined on averages are easily gamed by fast-path optimization that leaves slow-path users unaddressed
- Correct approach: Define SLIs using Latency-Percentiles (p95 or p99), not mean response time
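A small numeric example makes the failure mode concrete: a mostly-fast service with a slow tail shows a healthy mean while p95 blows through a 200ms target. The latency values below are synthetic:

```python
# Sketch: why averages hide tail pain. Synthetic latencies for illustration:
# 95% of requests at 60ms, 5% stuck behind a slow dependency at 1200ms.
import statistics

latencies = [60.0] * 950 + [1200.0] * 50

mean = statistics.fmean(latencies)
p95 = statistics.quantiles(latencies, n=100)[94]

print(f"mean = {mean:.0f}ms")  # well under a 200ms target
print(f"p95  = {p95:.0f}ms")   # reveals the violation the mean conceals
```

An SLO defined on the mean would report this service green while one user in twenty waits over a second.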
No error budget policy:
- Defining SLOs without agreed-upon consequences for budget exhaustion makes them decoration
- Engineers don’t know whether to treat budget depletion as urgent
- Correct approach: Write the policy before the first SLO violation, not during it
SLOs without safety margins:
- Setting the SLO target at the same level as the SLA commitment means any SLO miss is immediately an SLA miss
- No time to detect and recover before external consequences trigger
- Correct approach: Set the internal SLO stricter than the external SLA, typically by 0.1-0.5 percentage points (e.g., an internal SLO of 99.95% backing an SLA of 99.9%)
Synthesis
SLO-driven governance is not an operations concern — it is an architectural one. The choice of which percentile to target, what threshold to commit to, and how to enforce that commitment automatically are architectural decisions that determine whether performance is a first-class constraint or a nice-to-have aspiration.
The SLI→SLO→SLA hierarchy combined with error budgets creates a continuous feedback loop: measure precisely, commit explicitly, budget for unreliability, enforce automatically, and respond to budget burn as a governance signal. This loop makes performance governance continuous rather than a periodic review exercise that lags months behind the code.
Related Concepts
- Service-Level-Indicators — The SLI/SLO/SLA hierarchy in detail
- Latency-Percentiles — The measurement foundation for latency SLIs
- Fitness Functions — Automated enforcement of SLO thresholds in CI/CD
- Architectural-Governance — The broader governance framework SLOs operate within
- Architecture-Decision-Records — SLO targets as architectural decisions requiring documentation
- Operational-Measures — SLIs as a specific class of operational runtime metrics
- Availability — Availability as the most common SLA metric
Sources
- Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media. Chapter: “Service Level Objectives.” ISBN: 978-1-491-92912-4.
  - Canonical reference for the SLI/SLO/SLA hierarchy and the error budget governance model
  - Available: https://sre.google/sre-book/service-level-objectives/
- Uptrace (2025). “Defining SLA/SLO-Driven Monitoring Requirements in 2025.” Uptrace Blog.
  - Practitioner guide to implementing SLO-driven monitoring and alerting patterns
  - Available: https://uptrace.dev/blog/sla-slo-monitoring-requirements
- Nobl9 (2024). “SLO Metrics: A Best Practices Guide.” Nobl9 Blog.
  - Best practices for SLI selection, SLO target-setting, and error budget policy
  - Available: https://www.nobl9.com/service-level-objectives/slo-metrics
- incident.io (2024). “SLOs, SLAs, and SLIs: A complete guide to service reliability metrics.” incident.io Blog.
  - Comprehensive guide with anti-pattern analysis and implementation guidance
  - Available: https://incident.io/blog/slo-sla-sli
- Ford, Neal, Mark Richards, Pramod Sadalage, and Zhamak Dehghani (2022). Software Architecture: The Hard Parts - Modern Trade-Off Analyses for Distributed Architectures. O’Reilly Media. ISBN: 978-1-492-08689-5.
  - Governance and fitness function patterns for distributed architectures
Note
This content was drafted with assistance from AI tools for research, organization, and initial content generation. All final content has been reviewed, fact-checked, and edited by the author to ensure accuracy and alignment with the author’s intentions and perspective.