# Quantile Summary

# Quantile Summary

Quantile Summary maintains approximate quantiles of a stream using bounded memory. Instead of storing all values, it stores a compressed representation that allows queries such as median, percentiles, and rank estimates.

The goal is to answer queries of the form:

$$
\text{find } x \text{ such that } rank(x) \approx \phi n
$$

for a given quantile $\phi \in (0,1)$.

## Problem

Given a stream

$$
x_1, x_2, \dots, x_n
$$

support queries:

```id="qs1"
quantile(phi):
    return approximate value with rank near phi * n
```

with guaranteed error bounds, using limited memory.

## Data Structure

A quantile summary stores a sorted list of tuples:

```id="qs2"
(value, g, delta)
```

* `value`: observed data point
* `g`: minimum gap to previous value
* `delta`: maximum additional slack

These parameters bound the possible rank range of each stored value.

## Algorithm

### Insert

```id="qs3"
insert(x):
    find position i where value[i] <= x < value[i+1]

    insert (x, 1, floor(2 * epsilon * n)) at position i

    compress if needed
```

### Compress

Merge adjacent entries when their combined uncertainty stays within bounds.

```id="qs4"
compress():
    for i from end to beginning:
        if g[i] + g[i+1] + delta[i+1] <= floor(2 * epsilon * n):
            merge entries i and i+1
```

### Query

Scan accumulated ranks to find the value whose rank interval covers the target.

```id="qs5"
quantile(phi):
    r = floor(phi * n)
    running = 0

    for each entry:
        running += g
        if running + delta >= r:
            return value
```

## Error Guarantee

For parameter $\epsilon$, the summary ensures:

$$
|rank(x) - \phi n| \le \epsilon n
$$

This gives a deterministic bound on rank error.

## Example

Let the stream be:

$$
[7, 2, 9, 4, 3]
$$

A quantile summary might store a small subset such as:

$$
[(2, \cdot), (4, \cdot), (7, \cdot)]
$$

Querying for $\phi = 0.5$ returns an approximate median close to $4$.

## Correctness

Each tuple represents a range of possible ranks. The algorithm maintains invariants that bound the total uncertainty for each value.

Insertion adds a new value with minimal gap. Compression merges entries only when the resulting uncertainty remains within the allowed error bound.

Thus the summary always respects the rank error guarantee.

## Complexity

| operation |                                                 cost |
| --------- | ---------------------------------------------------: |
| insertion |                                $O(\log m)$ amortized |
| query     |                                               $O(m)$ |
| memory    | $O\left(\frac{1}{\epsilon} \log (\epsilon n)\right)$ |

Here $m$ is the number of stored entries.

## When to Use

Use Quantile Summary when:

* the data is streaming
* approximate percentiles are sufficient
* memory must be bounded
* deterministic error guarantees are required

For simpler implementations with similar guarantees, use Greenwald Khanna. For probabilistic alternatives, use sketches such as t-digest.

## Implementation

```id="qs6"
class QuantileSummary:
    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.summary = []
        self.n = 0

    def insert(self, x):
        import bisect

        self.n += 1
        pos = bisect.bisect_left([v for v, _, _ in self.summary], x)

        if pos == 0 or pos == len(self.summary):
            delta = 0
        else:
            delta = int(2 * self.epsilon * self.n)

        self.summary.insert(pos, [x, 1, delta])
        self.compress()

    def compress(self):
        i = len(self.summary) - 2
        while i >= 0:
            v1, g1, d1 = self.summary[i]
            v2, g2, d2 = self.summary[i + 1]

            if g1 + g2 + d2 <= int(2 * self.epsilon * self.n):
                self.summary[i + 1][1] += g1
                self.summary.pop(i)

            i -= 1

    def quantile(self, phi):
        target = int(phi * self.n)
        running = 0

        for v, g, d in self.summary:
            running += g
            if running + d >= target:
                return v

        return None
```

