At various points in my career, I have created or managed systems which need to perform calculations on enormous amounts of numeric data in real-time. Whether I was running a Varnish cluster that served 30,000 requests per second or writing financial analytics which needed to calculate aggregates on terabytes of price and return data in a few seconds, a common need emerged: how can I calcalate aggregates on large amounts of numeric data in a single pass?
The answer, of course, is streaming analytics: the ability to constantly calculate statistical analytics while moving within the stream of data. A streaming analytics system can calculate statistics in a single pass, in real time, and without necessarily having all the raw data (i.e. calculation as-you-go)
I decided to explore the family of approximate percentile algorithms such as Greenwald-Khanna and Cormode, Korn, Muthukrishnan, and Srivastava. This exploration turned into the following: