Calculating Percentiles on Streaming Data Part 2: Notes on Implementing Greenwald-Khanna

Published: 2018-03-07

This is part 2/8 of my Calculating Percentiles on Streaming Data series.

The most famous algorithm for calculating percentiles on streaming data appears to be Greenwald-Khanna [GK01]. I spent a few days implementing the Greenwald-Khanna algorithm from the paper and I discovered a few things I wanted to share.

Insert Operation

The insert operation is defined in [GK01] as follows:

INSERT(v)
Find the smallest $i$, such that $v_{i-1} \le v < v_i$, and insert the tuple $t_x = (v_x, g_x, \Delta_x) = (v, 1, \lfloor 2\epsilon n \rfloor)$, between $t_{i-1}$ and $t_i$. Increment $s$. As a special case, if $v$ is the new minimum or the maximum observation seen, then insert $(v, 1, 0)$.

However, when I implemented this operation, I noticed via testing that there were certain queries I could not fulfill. For example, consider applying Greenwald-Khanna with $\epsilon = 0.1$ to the sequence of values $11, 20, 18, 5, 12, 6, 3, 2$, and then applying $\mathrm{QUANTILE}(\phi = 0.5)$. This means that $r = \phi n = 0.5 \times 8 = 4$ and $\epsilon n = 0.1 \times 8 = 0.8$. There is no $i$ that satisfies both $r - r_{\min}(v_i) \le \epsilon n$ and $r_{\max}(v_i) - r \le \epsilon n$, as you can see below:

| $i$ | $t_i$ | $r_{\min}$ | $r_{\max}$ | $r - r_{\min}$ | $r_{\max} - r$ | $r - r_{\min} \le \epsilon n$? | $r_{\max} - r \le \epsilon n$? |
|----:|-----------|---:|---:|-----:|-----:|:--:|:--:|
| 0   | $(2,1,0)$  | 1 | 1 | 3    | $-3$ | F | T |
| 1   | $(3,1,0)$  | 2 | 2 | 2    | $-2$ | F | T |
| 2   | $(5,1,0)$  | 3 | 3 | 1    | $-1$ | F | T |
| 3   | $(6,1,1)$  | 4 | 5 | 0    | 1    | T | F |
| 4   | $(11,1,0)$ | 5 | 5 | $-1$ | 1    | T | F |
| 5   | $(12,1,0)$ | 6 | 6 | $-2$ | 2    | T | F |
| 6   | $(18,1,0)$ | 7 | 7 | $-3$ | 3    | T | F |
| 7   | $(20,1,0)$ | 8 | 8 | $-4$ | 4    | T | F |
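To make the failure concrete, here is a small Python check (a sketch, using the $(v, g, \Delta)$ tuples from the table above) that walks the summary, computes $r_{\min}$ and $r_{\max}$ for each tuple, and confirms that no tuple satisfies both bounds:

```python
# Tuples (v, g, delta) from the table above, built with eps = 0.1 from
# the sequence 11, 20, 18, 5, 12, 6, 3, 2.
summary = [(2, 1, 0), (3, 1, 0), (5, 1, 0), (6, 1, 1),
           (11, 1, 0), (12, 1, 0), (18, 1, 0), (20, 1, 0)]
n, eps, phi = 8, 0.1, 0.5
r = phi * n          # desired rank: 4.0
allowed = eps * n    # allowed rank error: 0.8

rmin = 0
found = False
for v, g, delta in summary:
    rmin += g            # rmin(v_i) is the running sum of g
    rmax = rmin + delta  # rmax(v_i) = rmin(v_i) + delta_i
    if r - rmin <= allowed and rmax - r <= allowed:
        found = True

print(found)  # False: no tuple satisfies both bounds
```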

I believe the fundamental problem is that the definition of INSERT(v) fails to maintain the key invariant $\max_i(g_i + \Delta_i) \le 2\epsilon n$. To correct this issue, the Greenwald-Khanna INSERT(v) operation must be modified as follows:

INSERT(v)*
Find the smallest $i$, such that $v_{i-1} \le v < v_i$, and insert the tuple $t_x = (v_x, g_x, \Delta_x) = (v, 1, \lfloor 2\epsilon n \rfloor - 1)$, between $t_{i-1}$ and $t_i$. Increment $s$. As a special case, if $v$ is the new minimum or the maximum observation seen, then insert $(v, 1, 0)$. Also, the first $1/(2\epsilon)$ elements must be inserted with $\Delta_i = 0$.

I found the above modification in Prof. Chandra Chekuri's lecture notes for his course CS 598CSC: Algorithms for Big Data. I believe this modification maintains the above invariant even after the first $1/(2\epsilon)$ insertions.
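As a sketch of the corrected operation, the following Python function (hypothetical names; `summary` is assumed to be a list of `(v, g, delta)` tuples kept sorted by `v`, and `n` is the number of observations seen before this insert) implements INSERT(v)* with both special cases:

```python
import math

def insert(summary, v, n, eps):
    """Sketch of the corrected INSERT(v)*. Assumes `summary` is a sorted
    list of (v, g, delta) tuples and n is the observation count so far."""
    # Special cases: new minimum/maximum, or one of the first 1/(2*eps)
    # observations -- insert with delta = 0.
    if (not summary or v < summary[0][0] or v > summary[-1][0]
            or n < 1.0 / (2 * eps)):
        delta = 0
    else:
        # Corrected general case: floor(2*eps*n) - 1, not floor(2*eps*n).
        delta = math.floor(2 * eps * n) - 1
    # Find the smallest i such that v < v_i and insert before it.
    i = 0
    while i < len(summary) and summary[i][0] <= v:
        i += 1
    summary.insert(i, (v, 1, delta))
```

Incrementing $s$ is left to the caller, since $s$ is just the tuple count (`len(summary)` here).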

Band Construction

Banding in [GK01] refers to the grouping together of possible values of Δ for purposes of merging in the COMPRESS operation. The paper defines banding as the following:

BAND($\Delta$, $2\epsilon n$)
For $\alpha$ from 1 to $\lceil \log 2\epsilon n \rceil$, we let $p = \lfloor 2\epsilon n \rfloor$ and we define $\mathrm{band}_\alpha$ to be the set of all $\Delta$ such that $p - 2^\alpha - (p \bmod 2^\alpha) < \Delta \le p - 2^{\alpha-1} - (p \bmod 2^{\alpha-1})$. We define $\mathrm{band}_0$ to simply be $p$. As a special case, we consider the first $1/(2\epsilon)$ observations, with $\Delta = 0$, to be in a band of their own. We will denote by $\mathrm{band}(t_i, n)$ the band of $\Delta_i$ at time $n$, and by $\mathrm{band}_\alpha(n)$ all tuples (or, equivalently, the $\Delta$ values associated with these tuples) that have a band value of $\alpha$.

Here are some things I found when implementing banding:

  • It is important to note that in the above, $\log$ refers to the base-2 logarithm.
  • I found it useful to construct an array, bands, at the beginning of the COMPRESS operation, such that bands[Δ] is the band for the provided Δ.
  • I used a very large constant to denote the band of Δ=0, as this made the tree-building operation simpler.
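Putting those notes together, here is one way the `bands` array might be built in Python. This is a sketch under the assumption that `two_eps_n` $= \lfloor 2\epsilon n \rfloor \ge 1$; the "very large constant" for $\Delta = 0$ is represented by infinity:

```python
import math

def build_bands(two_eps_n):
    """Sketch: returns `bands` such that bands[delta] is the band of delta,
    for 0 <= delta <= two_eps_n = floor(2*eps*n)."""
    p = two_eps_n
    bands = [0] * (p + 1)
    bands[0] = float('inf')  # special band for delta = 0; a large constant
                             # simplifies the tree-building step of COMPRESS
    bands[p] = 0             # band_0 is simply {p}
    for alpha in range(1, int(math.ceil(math.log2(p))) + 1):
        # band_alpha covers p - 2^a - (p mod 2^a) < delta
        #                   <= p - 2^(a-1) - (p mod 2^(a-1))
        lo = p - 2**alpha - (p % 2**alpha)
        hi = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        for delta in range(max(lo + 1, 1), hi + 1):
            if delta < p:  # don't overwrite band_0 or the delta = 0 band
                bands[delta] = alpha
    return bands
```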

Compress

Compression in [GK01] is defined as follows:

COMPRESS()
for $i$ from $s - 2$ to 0 do …

However, with this definition, it is possible for the first (0th) tuple to be deleted, which would then violate the following invariant of the summary data structure: “We ensure that, at all times, the maximum and the minimum values are part of the summary.”

Therefore, I modified COMPRESS() to iterate from $s - 2$ down to 1, so that the first tuple is never deleted and the invariant is maintained.
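A sketch of the corrected loop bounds in Python, simplified to pairwise merges only (the full COMPRESS in [GK01] also merges a tuple together with its band-tree descendants, which is omitted here; `band` is a hypothetical function mapping a $\Delta$ value to its band):

```python
def compress(summary, two_eps_n, band):
    """Sketch of COMPRESS with corrected bounds: iterate from s - 2 down
    to 1 (not 0), so tuple 0 -- the minimum -- is never merged away."""
    i = len(summary) - 2
    while i >= 1:  # stop at 1; tuple 0 holds the minimum observation
        v_i, g_i, d_i = summary[i]
        v_j, g_j, d_j = summary[i + 1]
        # Merge tuple i into its successor when bands allow it and the
        # merged tuple still satisfies the g + delta capacity bound.
        if band(d_i) <= band(d_j) and g_i + g_j + d_j < two_eps_n:
            summary[i + 1] = (v_j, g_i + g_j, d_j)
            del summary[i]
        i -= 1
```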

References

  • [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of ACM SIGMOD, pages 58–66, 2001.