# 5.13 Count-Min Sketch

## Problem

You need approximate frequency counts for a large stream of keys, and an exact counting map would use too much memory.

A counting map stores one entry per distinct key. This is exact, but memory grows with the number of distinct keys. In a large log stream, network trace, search query stream, or telemetry pipeline, the number of distinct keys may be too large to store directly.

## Solution

Use a count-min sketch. It stores a small two-dimensional table of counters and uses one hash function per row.

```text id="l2j9qc"
CountMinSketch:
    counters: d rows by w columns
    d: number of hash functions
    w: number of counters per row
```

To add one occurrence of a key:

```text id="gbqgmr"
add(sketch, key):
    for row from 0 to d - 1:
        col = hash_row(row, key) mod w
        counters[row][col] = counters[row][col] + 1
```

To estimate the count of a key:

```text id="fds733"
estimate(sketch, key):
    answer = infinity

    for row from 0 to d - 1:
        col = hash_row(row, key) mod w
        answer = min(answer, counters[row][col])

    return answer
```

The estimate is the minimum counter among the rows.
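The pseudocode above can be sketched in Python. This is a minimal illustration, not a production implementation: row-specific hashes are derived by pairing each key with a per-row random seed and feeding the pair to Python's built-in `hash`, which is adequate for a demonstration but weaker than a dedicated hash such as xxHash.

```python
import random

class CountMinSketch:
    """Minimal count-min sketch: d rows of w counters each."""

    def __init__(self, d, w, seed=0):
        self.d = d                                   # number of hash functions (rows)
        self.w = w                                   # counters per row
        self.counters = [[0] * w for _ in range(d)]
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(64) for _ in range(d)]

    def _col(self, row, key):
        # Derive a row-specific column from the key and a per-row seed.
        return hash((self.seeds[row], key)) % self.w

    def add(self, key, weight=1):
        # Increment the key's counter in every row.
        for row in range(self.d):
            self.counters[row][self._col(row, key)] += weight

    def estimate(self, key):
        # The estimate is the minimum counter among the rows.
        return min(self.counters[row][self._col(row, key)]
                   for row in range(self.d))
```

A short usage sketch: after adding `"go"` three times, `estimate("go")` returns at least `3`, and exactly `3` unless another key collided with it in every row.

```python
sketch = CountMinSketch(d=4, w=64)
for word in ["go", "go", "rust", "go"]:
    sketch.add(word)
sketch.estimate("go")   # at least 3
```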

## Discussion

A count-min sketch compresses a frequency table into a fixed amount of memory. Each key is mapped to one counter in each row. Adding a key increments all of those counters. Querying a key checks the same counters and returns the smallest value.

The estimate never falls below the true count. Every occurrence of the key increments all its selected counters. Collisions with other keys can only add extra increments. Taking the minimum across rows reduces the effect of those collisions.

This makes the structure useful when overestimation is acceptable. Examples include finding heavy hitters, ranking frequent search queries, detecting high-volume IP addresses, and estimating event counts in telemetry streams.

A count-min sketch differs from a Bloom filter. A Bloom filter approximates membership. A count-min sketch approximates frequency. Both use hashing and fixed memory, but their answers have different meanings.

## Correctness

Let `count(x)` be the true number of times key `x` has been added.

For every row, the counter selected by `x` is incremented once for each occurrence of `x`. Therefore, each selected counter is at least `count(x)`.

Other keys may collide with `x` in a row and add extra increments to that row's counter. Therefore, each selected counter is:

```text id="rgkas6"
count(x) + collision_noise
```

where `collision_noise >= 0`.

The query returns the minimum of these counters. Since every selected counter is at least `count(x)`, the estimate is never smaller than the true count. Since the minimum tends to select the row with the least collision noise, it is usually closer to the true count than any single row would be.
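The one-sided guarantee can be checked empirically. The sketch below (using the same per-row-seed hashing idea, an assumption of this illustration) streams 1,000 keys through a deliberately small table and verifies that no estimate falls below the true count, even though collisions are frequent at this size.

```python
import random
from collections import Counter

d, w = 4, 32
rng = random.Random(1)
seeds = [rng.getrandbits(64) for _ in range(d)]
counters = [[0] * w for _ in range(d)]

def col(row, key):
    return hash((seeds[row], key)) % w

# Stream 1,000 keys drawn from a small alphabet.
stream = [rng.choice("abcdefgh") for _ in range(1000)]
for key in stream:
    for row in range(d):
        counters[row][col(row, key)] += 1

# Every estimate is at least the true count: collisions only add.
true_counts = Counter(stream)
for key, n in true_counts.items():
    est = min(counters[row][col(row, key)] for row in range(d))
    assert est >= n
```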

## Complexity

Let `d` be the number of rows and `w` be the number of counters per row.

Insertion takes `O(d)` time.

Query takes `O(d)` time.

Space usage is `O(d w)` counters.

The cost is independent of the number of distinct keys in the stream.

## Example

Suppose the sketch has `3` rows and `8` columns.

After processing a stream, key `"go"` maps to these counters:

```text id="bmsuqz"
row 0, col 2: 18
row 1, col 5: 15
row 2, col 1: 17
```

The estimate is:

```text id="n46o22"
min(18, 15, 17) = 15
```

If the true count of `"go"` is `14`, the estimate overcounts by `1`. If the true count is `15`, the estimate is exact.

## Parameter Choice

Increasing `w` reduces collisions within each row, at the cost of memory.

Increasing `d` reduces the probability that every row has significant collision noise, at the cost of both memory and per-operation time.

Use wider rows when the stream contains many distinct keys. Use more rows when you need higher confidence.

A practical implementation often chooses `d` between `4` and `8`, then sets `w` according to the memory budget.
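The standard analysis of the count-min sketch gives a direct sizing rule: with `w = ceil(e / epsilon)` and `d = ceil(ln(1 / delta))`, the estimate exceeds the true count by at most `epsilon * N` (where `N` is the total number of insertions) with probability at least `1 - delta`. A small helper can turn an error budget into dimensions:

```python
import math

def cms_dimensions(epsilon, delta):
    """Rows and columns for an overestimate of at most epsilon * N
    with probability at least 1 - delta, per the standard analysis."""
    w = math.ceil(math.e / epsilon)        # columns: controls error magnitude
    d = math.ceil(math.log(1.0 / delta))   # rows: controls failure probability
    return d, w

cms_dimensions(0.001, 0.01)   # -> (5, 2719)
```

For example, a 0.1% error bound with 99% confidence needs only 5 rows of 2,719 counters, regardless of how many distinct keys the stream contains.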

## Implementation Notes

Use independent hash functions or derive row-specific hashes from a strong base hash.

Use counters wide enough for the stream length. Small counters can overflow silently.

For weighted updates, increment by `weight` instead of `1`.

For sliding windows, combine the sketch with time buckets and expire old buckets.

For conservative update, increment only counters currently equal to the minimum estimate. This can reduce overestimation in some workloads.
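The conservative-update rule can be sketched as follows (again using illustrative per-row-seed hashing). Only counters equal to the current minimum estimate are incremented, which keeps collision noise from accumulating in counters that are already inflated; the no-underestimate property for point queries is preserved.

```python
import random

d, w = 4, 32
rng = random.Random(2)
seeds = [rng.getrandbits(64) for _ in range(d)]
counters = [[0] * w for _ in range(d)]

def cols(key):
    # One column per row, derived from a per-row seed.
    return [hash((seeds[row], key)) % w for row in range(d)]

def add_conservative(key):
    cs = cols(key)
    current = min(counters[row][cs[row]] for row in range(d))
    for row in range(d):
        # Counters already above the minimum are left alone.
        if counters[row][cs[row]] == current:
            counters[row][cs[row]] = current + 1

def estimate(key):
    return min(counters[row][c] for row, c in enumerate(cols(key)))
```

Note that conservative update only works for unit or point increments where the minimum is re-read on every add; it does not combine with sketch merging as cleanly as the plain update rule.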

## Common Mistakes

Expecting exact counts.

Forgetting that estimates are biased upward.

Using too few columns for a high-cardinality stream.

Allowing counters to overflow.

Comparing estimates from sketches built with different parameters or hash seeds.

Using count-min sketch when underestimation is safer than overestimation.

