Glossary

Latency

Time between request and response

Latency is the time between a request being sent and the response arriving. In networked systems it is usually measured in milliseconds, and inside tightly coupled distributed systems sometimes in microseconds. Users start to notice delays once they reach the hundreds of milliseconds.

Three measurements every engineer should know about a service’s latency:

  • Mean (average) latency. Usually misleading: a single slow outlier drags it well above what most requests experience.
  • Median (p50) latency. The typical request’s experience. More honest than mean.
  • Tail latencies (p95, p99, p99.9). The 95th, 99th, 99.9th percentile of response times. p99 means “1% of requests are slower than this.” For user-facing systems, p99 captures the experience of unlucky users.
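The three measurements above can be compared on a small synthetic sample. This is a minimal sketch: the `percentile` helper uses the nearest-rank definition (one of several in common use), and the latency values are invented for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: 98 fast requests and 2 slow ones, in milliseconds.
latencies_ms = [100] * 98 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

In this sample the mean is 198 ms, a value no request actually experienced, while p50 and p95 stay at 100 ms and p99 exposes the 5000 ms tail.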

Why tails matter: at scale, every user hits the tail eventually. A service with 100ms p50 and 5000ms p99 has fast typical performance and occasional 5-second freezes. A user making 100 requests in a session will likely hit the tail at least once: assuming independent requests, the probability is 1 − (0.99)¹⁰⁰ ≈ 63%.
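The session arithmetic can be checked with a few lines of Python. This is a sketch that assumes requests are independent and that exactly 1% of them land beyond p99. The function name is illustrative.

```python
def p_at_least_one_slow(n_requests, slow_fraction=0.01):
    """Probability that at least one of n independent requests falls in the slow tail."""
    return 1 - (1 - slow_fraction) ** n_requests

# A session of 100 requests against a service where 1% of requests exceed p99.
session_risk = p_at_least_one_slow(100)
print(round(session_risk, 3))  # roughly 0.634
```

Real traffic is rarely independent (slow periods cluster), so treat the result as a rough guide rather than a guarantee.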

Sources of latency in a typical HTTP request:

  • DNS: 1-50 ms first lookup, ~0 cached.
  • TCP handshake: 1 round-trip time (RTT).
  • TLS handshake: 1-2 additional RTTs.
  • Server processing: highly variable, from microseconds to seconds.
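  • Network propagation: ~5ms NY-to-Chicago, ~70ms NY-to-London, ~150ms NY-to-Sydney (approximate figures; real values depend on routing). The lower bound is set by the speed of light in fiber, roughly 200 km per millisecond.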
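The components above can be combined into a rough estimate for a cold HTTPS request. The sketch below rests on stated assumptions: light travels about 200 km per millisecond in fiber, the request and response cost one further round trip after the handshakes, and the 5,570 km NY-London great-circle distance is approximate. Function names are illustrative.

```python
FIBER_KM_PER_MS = 200.0  # assumption: light in optical fiber covers ~200 km per millisecond

def fiber_rtt_floor_ms(distance_km):
    """Lower bound on round-trip time over a straight fiber path of the given length."""
    return 2 * distance_km / FIBER_KM_PER_MS

def cold_https_request_ms(rtt_ms, dns_ms, tls_rtts, server_ms):
    """Rough latency of a first request on a new connection."""
    tcp_handshake = rtt_ms            # 1 RTT
    tls_handshake = tls_rtts * rtt_ms  # 1-2 RTTs depending on TLS version
    exchange = rtt_ms                  # request out, response back (assumed 1 RTT)
    return dns_ms + tcp_handshake + tls_handshake + exchange + server_ms

# NY-London: the floor is below the ~70 ms seen in practice because real routes are not straight.
print(fiber_rtt_floor_ms(5570))

# Cold transatlantic request: 20 ms DNS, 70 ms RTT, 1-RTT TLS, 50 ms server time.
print(cold_https_request_ms(rtt_ms=70, dns_ms=20, tls_rtts=1, server_ms=50))  # 280
```

Three of the four network terms scale with RTT, which is why reusing connections or moving servers closer to users tends to help more than shaving server time.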

For real-world API performance, the percentile distribution matters far more than the mean. Reporting only mean latency is one of the classic ways monitoring dashboards mislead.

Why p99 is the metric that defines “feels broken”: a service with 99% of requests under 100 ms but 1% taking 5 seconds feels broken to every user who eventually hits the slow path. The math: if a typical page load makes 50 backend calls and 1% of calls take 5 s, then, assuming the calls are independent, the probability that the page hits at least one slow call is 1 − (0.99)⁵⁰ ≈ 40%. Roughly two in five page loads are slow. Improving the median barely changes that; cutting p99 directly improves perceived performance. Google’s SRE book makes a similar case for percentile-based indicators over averages, and the practice has spread industry-wide. Reference: Dean & Barroso, The Tail at Scale (CACM, 2013).
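To see why tail work pays off more than median work, the fan-out case can be modelled directly. This sketch assumes a deliberately simple two-point model: each call is either fast or slow, the 50 calls are independent and issued in parallel, and the page waits for the slowest one. The function name and the "improved" values are illustrative.

```python
def expected_page_latency_ms(fast_ms, slow_ms, slow_fraction=0.01, calls=50):
    """Expected latency of a page that waits for the slowest of `calls` parallel calls."""
    p_any_slow = 1 - (1 - slow_fraction) ** calls
    return (1 - p_any_slow) * fast_ms + p_any_slow * slow_ms

baseline = expected_page_latency_ms(fast_ms=100, slow_ms=5000)
faster_median = expected_page_latency_ms(fast_ms=50, slow_ms=5000)  # halve the typical call
shorter_tail = expected_page_latency_ms(fast_ms=100, slow_ms=500)   # cut the slow path 10x

print(f"baseline={baseline:.0f} ms")
print(f"faster median={faster_median:.0f} ms")
print(f"shorter tail={shorter_tail:.0f} ms")
```

Under these assumptions, halving the typical call time moves the expected page latency by about 30 ms out of roughly 2 seconds, while shortening the tail brings it under 300 ms.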

Worked example: budget allocation across a request

Target: 200 ms p95 for a checkout page from a US user. A NY-London round trip is roughly 70 ms, close to the propagation floor, so any transatlantic call on the critical path uses about a third of the budget. A typical fanout: 50 ms TLS setup (after connection-reuse savings), 40 ms regional DB read, 60 ms third-party payment authorisation, 20 ms HTML rendering, 30 ms client-side hydration. Adding these naively gives 200 ms, leaving zero headroom for retries, GC pauses, or noisy-neighbour spikes. The fix is structural: move the payment call off the critical path (defer it to post-redirect), cache the DB read at the edge for 60 s, and use HTTP/3, whose QUIC transport folds the TLS handshake into connection setup. Each change removes tens of milliseconds from the critical path; deferring the payment call alone frees 60 ms.
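Keeping the budget as data makes the headroom explicit. A minimal sketch using the figures from the worked example; the dictionary keys and the `headroom_ms` helper are illustrative names.

```python
TARGET_MS = 200

# Component timings from the worked example, in milliseconds.
budget_ms = {
    "tls_setup": 50,
    "regional_db_read": 40,
    "payment_authorisation": 60,
    "html_rendering": 20,
    "client_hydration": 30,
}

def headroom_ms(components, target_ms=TARGET_MS):
    """Milliseconds left once every component on the critical path is paid for."""
    return target_ms - sum(components.values())

print(headroom_ms(budget_ms))  # 0: nothing left for retries or GC pauses

# Deferring payment authorisation to post-redirect removes it from the critical path.
critical_path = {k: v for k, v in budget_ms.items() if k != "payment_authorisation"}
print(headroom_ms(critical_path))  # 60
```

A check like this can run in CI against measured timings, so a new dependency that uses up the headroom is caught before release.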

How to instrument it properly

Aggregating by mean discards the very shape of the distribution you need. Sample raw timings or use HDR histograms (Gil Tene’s HdrHistogram library, ported to most languages), which preserve percentiles cheaply. Compute percentiles per region, per endpoint, and per release: a 5 ms global p99 regression can hide a 200 ms regression in a single region that the global aggregate masks. Watch for “coordinated omission”: load generators that wait for a slow request before issuing the next one never send the requests that would have queued behind it, so they can understate p99 by orders of magnitude. See also percentile and median. For protocol-level RTT breakdown, the RFC 9000 QUIC specification documents the 1-RTT and 0-RTT connection establishment modes that materially reduce handshake latency.
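The per-region point can be illustrated with synthetic data. This sketch uses plain lists and a nearest-rank percentile instead of HdrHistogram so that it runs with the standard library alone. The region names, request counts, and latencies are invented for illustration.

```python
import math
from collections import defaultdict

def percentile(samples, p):
    """Nearest-rank percentile over raw samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def p99_by_region(records):
    """Group (region, latency_ms) records and compute p99 for each region."""
    groups = defaultdict(list)
    for region, latency_ms in records:
        groups[region].append(latency_ms)
    return {region: percentile(values, 99) for region, values in groups.items()}

# Synthetic traffic: a large region plus a small one, before and after a release.
us = [("us", 80)] * 9300 + [("us", 400)] * 200
ap_before = [("ap", 80)] * 490 + [("ap", 400)] * 10
ap_after = [("ap", 80)] * 450 + [("ap", 600)] * 50  # regression only in "ap"

for label, records in (("before", us + ap_before), ("after", us + ap_after)):
    global_p99 = percentile([ms for _, ms in records], 99)
    print(label, "global p99:", global_p99, "per region:", p99_by_region(records))
```

With this data the global p99 stays at 400 ms across the release while the small region's p99 moves from 400 ms to 600 ms, which is why the breakdown matters.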

Frequently asked questions

What is latency?
Latency is the time elapsed between sending a request and receiving the response, often measured to the first byte of the response (time to first byte). It is typically measured in milliseconds and is commonly broken into three main components: propagation delay (speed of light over distance), processing delay (server computation), and queuing delay (network congestion).
How is latency different from throughput?
Latency measures delay for a single request; throughput measures how many requests or bytes per second a system can handle. A system can have high throughput and high latency simultaneously, like a container ship that carries thousands of containers slowly.
Why do teams monitor p95 or p99 latency instead of average?
Averages hide the worst-case experience. The 99th percentile (p99) marks the latency that the slowest 1% of requests exceed, which is often 5–10× the median. SLAs and user satisfaction are typically broken by tail latencies, not the median.
What is a realistic latency budget for a web page load?
An aggressive target is under 200 ms for the first contentful paint, broken down roughly as: 50 ms DNS + TLS, 50 ms server processing, 100 ms network round-trips and rendering. For comparison, Google’s Core Web Vitals guidance treats a first contentful paint under 1.8 s as good. Every additional API call, CDN miss, or synchronous script adds to the total.

Published May 16, 2026 · Last reviewed May 31, 2026