Calculating Percentiles on Streaming Data Part 1: Introduction


Calculating Percentiles on Streaming Data Part 1: Introduction

This is part 1/8 of my Calculating Percentiles on Streaming Data series.

Suppose that you are dealing with a system which processes one million requests per second, and you’d like to calculate the median percentile response time over the last 24 hours.

The naive approach would be to store every response time, sort them all, and then return the value in the middle. Unfortunately, this approach would require manipulating 1,000,000 * 60 * 60 * 24 = 86.4 billion values – almost certainly too many to fit into RAM, and thus rather unwieldy to work with. This begs the question “Is it possible to compute quantiles without storing every observation?”

Munro and Paterson [MP80] proved that a lower bound of $\Omega(n)$ space is required to exactly compute the median of $n$ values. However, if we’re allowed to compute an approximation, then there are a set of algorithms which can process the data in a single pass, and thus can be used without storing every observation. Notable algorithms with this property include:

  1. Manku, Rajagopalan, and Lindsay [MRL98] – single-pass, but an upper bound on $n$ must be known a priori; seems generally superseded by [GK01]
  2. Manku, Rajagopalan, and Lindsay [MRL99] – randomized algorithm
  3. Greenwald-Khanna [GK01] – single-pass, improves on the space bounds of [MRL98], and removes the requirement that $n$ is known in advance
  4. Gilbert, Kotidis, Muthukrishnan, and Strauss [GKMS02] – randomized algorithm, supports deletes as well as inserts
  5. Cormode and Muthukrishnan [CM04] – improves on the space bounds of [GKMS02]
  6. Cormode, Korn, Muthukrishnan, Divesh Srivastava [CKMS05] – improves [GK01] by better handling distributions with skew when finding targeted quantiles
  7. Tim Dunning’s T-Digest

The above list is intended to be representative, not exhaustive.

In this blog post series, I will explore various methods for calculating percentiles on streaming data.