Core Idea

Latency percentiles (P50, P90, P95, P99) are the correct lens for measuring system performance because latency distributions are long-tailed — averages hide the worst-case experiences that real users encounter. The Nth percentile means N% of requests complete within that time; the remaining (100-N)% are slower.

Definition

A latency percentile (Pn) is the response time threshold below which n% of requests fall:

  • P50 (median): Half of requests complete within this time — your baseline, “typical” experience
  • P90: 90% of requests are this fast or faster — the majority experience
  • P95: Only 5% of requests are slower — an early-warning tier for emerging problems
  • P99: Only 1% of requests are slower — tail latency; it sounds rare per request, but at scale most users still encounter it
  • P99.9: Only 0.1% of requests are slower — extreme reliability requirements (financial, safety-critical)
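The ladder above can be computed directly by sorting. A minimal sketch using the nearest-rank method (the sample latencies are invented for illustration; production systems typically use histogram-based estimators instead of exact sorts):

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p% of observations are <= it."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# 100 simulated requests: 90 fast, 9 moderate, 1 pathological
latencies = [5] * 90 + [40] * 9 + [900]

print(percentile(latencies, 50))   # 5   (median: the "typical" request)
print(percentile(latencies, 95))   # 40  (5% of requests are slower)
print(percentile(latencies, 100))  # 900 (the maximum - never discard it)
```

Note how P50 and P90 are identical here while P95 and the maximum differ by 20×: each rung of the ladder exposes a different slice of the distribution.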

Why Averages Fail

Latency distributions are not normal curves — they are long-tailed:

  • GC pauses (JVM stop-the-world events), cold starts, cache misses, and network jitter create rare but dramatic outliers
  • These outliers drag the arithmetic mean upward without affecting most users
  • Example: A system where 90% of requests complete in 5ms and 10% take 1,000ms has a mean of ~104ms — a number that describes nobody’s actual experience
  • The mean obscures the bimodal reality: most users see 5ms, some users see 1,000ms
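The bimodal example above is easy to reproduce; the simulated sample sizes are illustrative:

```python
import statistics

# 90% of requests at 5 ms, 10% at 1,000 ms (1,000 simulated requests)
latencies = [5] * 900 + [1000] * 100

mean = statistics.mean(latencies)            # 104.5 ms - describes nobody
p50 = sorted(latencies)[len(latencies) // 2]  # 5 ms - the typical request

print(f"mean={mean} ms, P50={p50} ms")
```

The mean lands in a region of the distribution where no request actually occurs, which is exactly why it fails as a summary statistic here.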

The Percentile Ladder as a Diagnostic Tool

Each tier of the ladder signals something different:

  • Stable P50, rising P99 → Rare pathological events (GC, lock contention, noisy neighbour) — not systemic load
  • Rising P50 and P99 together → Systemic load problem — capacity or efficiency
  • Wide P50-to-P99 gap → High variance/inconsistency; even if both pass SLO targets, the experience is unpredictable
  • Narrow P50-to-P99 gap → Highly consistent system; the 99th percentile closely tracks the median
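The ladder can be turned into a toy triage rule. This is a sketch only: the 10% rise threshold, parameter names, and messages are invented for illustration, not an operational standard:

```python
def diagnose(p50_delta_pct, p99_delta_pct, rise_threshold=10):
    """Classify a latency shift from week-over-week percentage changes
    in P50 and P99. Thresholds are illustrative assumptions."""
    p50_rising = p50_delta_pct > rise_threshold
    p99_rising = p99_delta_pct > rise_threshold
    if p99_rising and not p50_rising:
        return "tail event: suspect GC, lock contention, noisy neighbour"
    if p50_rising and p99_rising:
        return "systemic load: check capacity and efficiency"
    return "no significant shift"

print(diagnose(p50_delta_pct=2, p99_delta_pct=40))
# tail event: suspect GC, lock contention, noisy neighbour
```

A real alerting rule would also track the P50-to-P99 ratio over time to catch the variance signal described above.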

The Coordinated Omission Problem

Naive benchmarking tools systematically underreport bad latency — a phenomenon called coordinated omission (Gil Tene, Azul Systems):

  • When a benchmark tool stalls waiting for a slow response, it stops issuing new requests
  • This means the stall period generates zero latency measurements instead of thousands of “slow” ones
  • The resulting histogram looks excellent because the worst moments are invisible
  • Gil Tene: “The number one indicator you should never get rid of is the maximum value. That is not noise, that is the signal.”
  • Correct approach: Tools like HdrHistogram account for scheduled-but-not-yet-issued requests during stalls
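The correction can be sketched as back-filling the samples a stalled load generator failed to issue, in the spirit of HdrHistogram's expected-interval recording (function name, the 100 ms interval, and the 1,000 ms stall are illustrative assumptions):

```python
def corrected_samples(measured_ms, expected_interval_ms):
    """Yield the measured latency plus synthetic samples for the
    requests that should have been issued on schedule during the stall
    but never were - each would have waited progressively less long."""
    yield measured_ms
    backlog = measured_ms - expected_interval_ms
    while backlog >= expected_interval_ms:
        yield backlog
        backlog -= expected_interval_ms

# One 1,000 ms stall at a 100 ms issue rate hides nine slow requests:
samples = list(corrected_samples(1000, 100))
print(samples)  # [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]
```

A naive tool records the single 1,000 ms sample; the corrected histogram records ten slow samples, which is what a steady stream of real clients would actually have experienced.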

The 99th Percentile Is More Common Than It Seems

A common misconception: “P99 only affects 1% of users, so it’s not important.”

  • On a page that loads 200 resources (images, scripts, API calls), the probability that a user escapes P99 latency on every single one is 0.99^200 ≈ 13%
  • In other words, ~87% of page loads will be affected by at least one P99 event
  • For microservices systems with fan-out calls, see Tail-Latency for how this compounds
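The arithmetic behind the 0.99^200 claim:

```python
# Probability that all 200 independent resources avoid their P99 tail
p_clean = 0.99 ** 200
print(f"{p_clean:.1%} of page loads avoid every P99 event")  # ~13.4%
print(f"{1 - p_clean:.1%} hit at least one")                 # ~86.6%
```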

The Skewed Call-Count Pitfall

Aggregate percentiles pool all requests together and say nothing about individual clients or tenants:

  • A large enterprise customer generating 10× more requests than others (due to larger data volumes) may experience systematically worse latency
  • Yet the aggregate P95 can still pass because their requests are diluted across the distribution
  • Conversely, a high-volume client can dominate the distribution and mask the worse experience of lower-volume clients
  • There is no canned solution: whether to measure aggregate, per-tenant, or per-endpoint percentiles is a business decision based on client contracts, data volume distribution, and accountability commitments
  • For B2B SaaS with large enterprise customers, per-tenant percentiles are often the correct unit of accountability
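A minimal sketch of aggregate vs per-tenant P95, assuming request logs arrive as (tenant, latency_ms) pairs; tenant names, latencies, and the hypothetical 200 ms SLO are invented for illustration. Here a high-volume tenant dilutes the distribution so the aggregate passes while a smaller tenant's P95 fails badly:

```python
import math
from collections import defaultdict

def p95(values):
    """Nearest-rank P95 over a list of latencies."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]

# Simulated log: 90 fast requests from a big tenant, 10 from a small
# tenant whose tail is bad (3 of its 10 requests take 400 ms)
requests = ([("bigco", 10)] * 90
            + [("acme", 50)] * 7 + [("acme", 400)] * 3)

by_tenant = defaultdict(list)
for tenant, latency in requests:
    by_tenant[tenant].append(latency)

aggregate = p95([latency for _, latency in requests])
per_tenant = {t: p95(v) for t, v in by_tenant.items()}

print(aggregate)   # 50  - passes a 200 ms SLO
print(per_tenant)  # {'bigco': 10, 'acme': 400} - acme fails it badly
```

The aggregate number is true but useless for acme: only the per-tenant breakdown reveals who is actually having a bad time.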

Industry Benchmarks

  Domain                 P99 Target   Rationale
  AdTech (RTB)           < 100 ms     Real-time bidding auction window
  Financial trading      < 500 μs     Order execution competitiveness
  E-commerce page load   < 2 s        Conversion rate preservation
  Interactive APIs       < 200 ms     Perceived instantaneousness threshold

Google research: 500ms additional latency caused a 20% drop in search traffic. Amazon: 100ms of extra latency costs 1% of revenue.

Note

This content was drafted with assistance from AI tools for research, organization, and initial content generation. All final content has been reviewed, fact-checked, and edited by the author to ensure accuracy and alignment with the author’s intentions and perspective.