Argument

Average latency is a dangerously misleading metric. Percentile-based measurement (P50→P99) is the correct lens for system performance because latency distributions are long-tailed, tail latency compounds catastrophically across distributed fan-out calls, and most benchmarking tools silently undercount bad performance through coordinated omission. The right measurement unit — aggregate vs per-tenant vs per-endpoint — is a business decision, not a technical default.

1. The Problem with Averages

Performance dashboards showing “average response time: 45ms” convey an illusion of precision while hiding the most important information.

Latency distributions are not normal distributions. They are long-tailed:

  • The majority of requests complete quickly — near the fast “happy path”
  • A small fraction take dramatically longer due to GC pauses, cache misses, lock contention, or network jitter
  • These outliers drag the arithmetic mean upward without representing any particular user’s experience

Concrete example: in a system where 90% of requests complete in 5ms and 10% take 1,000ms, the mean is 0.9 × 5ms + 0.1 × 1,000ms = 104.5ms. No user experiences ~105ms: 90% experience 5ms and 10% experience 1,000ms. The mean describes nobody.
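
The arithmetic above can be checked with the standard library. This is a minimal sketch using a hypothetical bimodal sample; the values are the ones from the example, not real measurements:

```python
import statistics

# Hypothetical bimodal sample: 90% of requests at 5 ms, 10% at 1,000 ms.
latencies_ms = [5.0] * 900 + [1000.0] * 100

mean = statistics.fmean(latencies_ms)                # 104.5 ms: describes nobody
cuts = statistics.quantiles(latencies_ms, n=100)     # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]                        # median 5.0 ms, P99 1000.0 ms

print(f"mean={mean:.1f}ms  P50={p50:.0f}ms  P99={p99:.0f}ms")
```

The percentiles recover both modes of the distribution; the mean lands on a value no request ever produced.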

This is not a theoretical concern. Production systems routinely exhibit this bimodal pattern because the “slow” requests hit a qualitatively different code path: they miss the cache, they trigger a GC pause, they wait behind a lock held by another thread.

The mean also fails as an SLO target. “Average latency < 200ms” can be met while 20% of users experience multi-second responses — a perfectly compliant metric masking a terrible user experience.

2. What Percentiles Reveal

Latency percentiles solve the averaging problem by measuring the shape of the distribution rather than a single summary statistic:

  • P50 (median): The typical experience — half of users are faster, half slower. This is your baseline
  • P90: Nine out of ten users are this fast or faster — the majority experience
  • P95: Early-warning tier. 5% of requests are slower. When P95 rises but P50 stays flat, rare pathological events are emerging (not systemic load)
  • P99: Only 1% are slower. This is your structural problem detector — if P99 is high, something in the architecture creates occasional catastrophic slowdowns
  • P99.9: For extreme reliability requirements; rarely practical to optimize without specialized hardware and software

The P99/P50 ratio as a consistency signal: A system with P50 = 10ms and P99 = 12ms is highly consistent. A system with P50 = 10ms and P99 = 2,000ms has a severe consistency problem, regardless of whether either number violates its SLO threshold in isolation.

Reading the gap diagnostically:

  • Stable P50, rising P99 → Rare pathological events (GC, lock contention, noisy neighbour)
  • Rising P50 and P99 together → Systemic load problem — capacity or efficiency
  • Large P50-to-P99 gap at all times → High variance; the fast path and slow path are structurally different code paths
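
The P99/P50 consistency ratio is easy to compute from raw samples. A minimal sketch, using a hypothetical nearest-rank percentile helper (not any particular library's implementation) and the two example systems from above:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with >= pct% of samples at or below it."""
    s = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(s)) - 1)
    return s[k]

def tail_ratio(samples):
    """P99/P50 ratio: close to 1.0 means consistent latency; large means high variance."""
    return percentile(samples, 99) / percentile(samples, 50)

# Two hypothetical systems with the same median but very different tails.
consistent = [10.0] * 980 + [12.0] * 20     # P50 = 10ms, P99 = 12ms  -> ratio 1.2
spiky      = [10.0] * 980 + [2000.0] * 20   # P50 = 10ms, P99 = 2,000ms -> ratio 200
```

Both systems look identical on a median-only dashboard; the ratio separates them immediately.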

3. Coordinated Omission: Why Benchmarks Lie

Even after accepting percentiles over averages, a deeper problem threatens measurement validity: coordinated omission.

Most benchmarking tools — JMeter, Locust, wrk, and others in default configuration — send requests at a fixed rate and stop issuing new requests while waiting for slow responses. When a server stalls for 500ms, the benchmark tool simply waits. The 500ms stall period generates zero latency measurements instead of potentially thousands of “I was blocked for 500ms” measurements.

Gil Tene (Azul Systems), who coined the term, describes the result: “The 99th percentile that your tool reports is the 99th percentile of requests that the tool was not stalled during.” The worst moments are systematically invisible.

The correct approach uses tools like HdrHistogram (also by Tene), which tracks scheduled-but-not-yet-issued requests during stalls and correctly attributes the stall time to those waiting requests. The resulting histograms look dramatically worse — and more accurately reflect what users actually experience.
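
The correction can be sketched in a few lines. This mimics the idea behind HdrHistogram's recordValueWithExpectedInterval (a simplified model, not the library's actual code): for every measured latency longer than the intended send interval, also record the latencies the blocked-but-scheduled requests would have observed:

```python
def correct_coordinated_omission(measured_ms, expected_interval_ms):
    """Backfill the samples a closed-loop load generator failed to issue.

    If a response took longer than the intended interval between requests,
    the requests that should have been sent during the stall are recorded
    with the wait they would have experienced (value - k * interval).
    """
    corrected = []
    for value in measured_ms:
        corrected.append(value)
        backfill = value - expected_interval_ms
        while backfill >= expected_interval_ms:
            corrected.append(backfill)
            backfill -= expected_interval_ms
    return corrected

# Three measurements at a 10 ms send interval; the 500 ms stall hid 49 samples.
corrected = correct_coordinated_omission([1.0, 1.0, 500.0], 10.0)
```

With the backfilled samples included, the upper percentiles shift sharply toward the stall, which is exactly what a real open-loop client population would have experienced.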

Gil Tene’s key insight: “The number one indicator you should never get rid of is the maximum value. That is not noise, that is the signal.”

4. Tail Latency Amplification in Distributed Systems

In a monolith, a single GC pause causes one slow call. In a microservices system, a single user request triggers N parallel or sequential calls across services — and the probability of all N completing within their individual P99 threshold is 0.99^N.

See Tail-Latency for the full analysis, but the key numbers:

| Fan-out (N services) | System-wide P99 compliance |
| --- | --- |
| 5 | 95.1% |
| 10 | 90.4% |
| 25 | 77.8% |
| 50 | 60.5% |
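
The numbers in the table follow directly from the 0.99^N arithmetic and can be reproduced in one loop:

```python
# Probability that all N independent fan-out calls individually meet their P99.
def p99_compliance(n_services):
    return 0.99 ** n_services

for n in (5, 10, 25, 50):
    print(f"N={n:>2}: {p99_compliance(n):.1%} of composite requests meet every P99")
```

The assumption of independence is itself optimistic: correlated slowdowns (shared hosts, shared caches) make the real compliance rate worse, not better.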

This means architectural decisions about service granularity have direct, quantifiable latency consequences. A system decomposed into 25 microservices starts from a position where only ~78% of composite requests can meet each individual service’s P99 SLO — before any actual load-related problems occur.

Dean and Barroso (2013) documented this at Google scale and introduced hedged requests as the canonical mitigation: send duplicate requests to two replicas, use whichever responds first, discard the other. This doubles read load but dramatically reduces P99 latency for fan-out workloads.
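
The hedged-request pattern can be sketched with asyncio. This is a minimal illustration, not a production client: `fetch` is an assumed coroutine taking a replica handle, and real implementations usually delay the second request until the first exceeds a percentile threshold rather than duplicating every call:

```python
import asyncio

async def hedged_get(fetch, replica_a, replica_b):
    """Send the same request to two replicas and return whichever answers first.

    The slower duplicate is cancelled. Read load doubles, but the caller's
    latency becomes min(replica_a, replica_b) instead of one replica's tail.
    """
    tasks = [asyncio.create_task(fetch(replica_a)),
             asyncio.create_task(fetch(replica_b))]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

async def demo():
    # Simulated replicas: one fast, one stuck in a 200 ms tail event.
    async def fetch(delay_s):
        await asyncio.sleep(delay_s)
        return delay_s
    return await hedged_get(fetch, 0.2, 0.01)
```

Running `asyncio.run(demo())` returns the fast replica's result; the request that would have hit the tail never reaches the caller.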

5. Practical Decision Framework: When to Alert on Which Percentile

Different percentile tiers warrant different responses:

| Signal | Interpretation | Response |
| --- | --- | --- |
| P50 rises | Systemic load or regression | Capacity review, code profiling |
| P95 rises, P50 stable | Emerging pathological events | GC tuning, cache analysis, lock profiling |
| P99 rises, P50/P95 stable | Rare catastrophic events | Deep tail analysis; consider hedged requests |
| P99 spikes correlate with deploy | Regression introduced | Roll back, profile the change |
| P99 spikes without deploy | Infrastructure noise | Noisy-neighbour investigation, hardware check |

Alert on SLO burn rate, not raw thresholds: Alerting when P99 exceeds 500ms for one minute produces alert storms during normal traffic variation. Alerting when the error budget is burning at 5× the expected rate catches both acute spikes and slow burns, while dramatically reducing page fatigue.
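The burn-rate computation itself is simple. A minimal sketch, assuming an availability-style latency SLO where a fixed fraction of requests (the error budget) may exceed the threshold; the window figures are hypothetical:

```python
def burn_rate(slow_requests, total_requests, error_budget=0.01):
    """How fast the error budget is being consumed relative to plan.

    1.0 means exactly on budget; 5.0 means burning five times faster than
    the SLO allows, which exhausts a 30-day budget in about six days.
    """
    bad_fraction = slow_requests / total_requests
    return bad_fraction / error_budget

# Hypothetical one-hour window: 1,200 of 12,000 requests exceeded 500 ms.
rate = burn_rate(1_200, 12_000)   # roughly 10x the allowed burn
should_page = rate >= 5.0          # page on sustained fast burn, not one spike
```

In practice this check runs over multiple windows (e.g. a fast 5-minute window and a slow 1-hour window together) so that both acute spikes and slow burns trigger exactly once.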

6. The Skewed Call-Count Pitfall: When Aggregate Percentiles Mislead

Aggregate percentiles pool all requests together and say nothing about individual clients or tenants. This creates a specific failure mode in B2B and multi-tenant systems:

  • A large enterprise customer generating 10× more requests than others (due to larger data volumes or higher usage intensity) may experience systematically worse latency — because their requests hit more data, more cache misses, more lock contention
  • Yet the aggregate P95 continues to pass, because their requests are diluted across the distribution
  • The SLO appears green while the enterprise customer’s actual experience is far worse than committed

The inverse problem also exists: a high-volume client can dominate the aggregate distribution, masking the worse experience of lower-volume clients whose requests cluster in the slow tail.
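
The dilution effect is easy to demonstrate. A minimal sketch with hypothetical tenant traffic (the tenant names, volumes, and latencies are invented for illustration):

```python
from collections import defaultdict
import statistics

def per_tenant_p95(samples):
    """P95 per tenant from (tenant_id, latency_ms) samples.

    Surfaces what pooling hides: one tenant can blow its committed
    threshold while the aggregate P95 still looks green.
    """
    by_tenant = defaultdict(list)
    for tenant, latency in samples:
        by_tenant[tenant].append(latency)
    return {t: statistics.quantiles(v, n=100)[94] for t, v in by_tenant.items()}

# Hypothetical traffic: many fast small-tenant requests dilute one slow
# enterprise tenant down to 4% of the pool.
samples = [("small", 20.0)] * 960 + [("enterprise", 800.0)] * 40
pooled_p95 = statistics.quantiles([lat for _, lat in samples], n=100)[94]
by_tenant = per_tenant_p95(samples)
```

Here the pooled P95 sits at the fast tenants' latency and the SLO dashboard stays green, while the per-tenant view shows the enterprise tenant far over any plausible commitment.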

There is no canned solution. Whether to measure aggregate, per-tenant, per-endpoint, or per-customer-tier percentiles is a business decision that depends on:

  • The structure of customer contracts and SLA commitments
  • The distribution of data volumes and usage patterns across tenants
  • Whether the accountability unit is the service or the customer relationship

For B2B SaaS with large enterprise customers on differentiated contracts, per-tenant percentiles are often the correct unit of accountability. For high-volume consumer services where individual users are interchangeable, aggregate percentiles suffice.

Synthesis

Measuring performance correctly requires:

  1. Percentiles, not averages — the distribution shape tells the real story
  2. Coordinated-omission-aware benchmarks — use HdrHistogram-based tools; treat the maximum as signal
  3. Tail latency awareness in architecture — service granularity choices have direct P99 consequences
  4. The right diagnostic tier — use P50 for load, P95 for early warning, P99 for structural problems
  5. The right measurement unit — decide whether aggregate or per-tenant percentiles match accountability commitments

These are not measurement technicalities. They are the difference between architecture governance that reflects reality and governance that creates an illusion of control.

Note

This content was drafted with assistance from AI tools for research, organization, and initial content generation. All final content has been reviewed, fact-checked, and edited by the author to ensure accuracy and alignment with the author’s intentions and perspective.