# Count Min Sketch Query

# Count Min Sketch Query

Count Min Sketch is a probabilistic data structure for estimating item frequencies in a stream. It stores counts in a small two dimensional table and uses several hash functions to update and query the table.

It gives approximate answers with one-sided error: the estimated count is never below the true count, but it may be above it because of hash collisions.

## Problem

Given a stream

$$
x_1, x_2, \dots, x_n
$$

support approximate frequency queries:

$$
\text{estimate}(x) \approx \text{count}(x)
$$

without storing a full frequency table.

## Algorithm

Use $d$ hash functions and a table with $d$ rows and $w$ columns.

To add an item, increment one counter per row.

```id="cms1"
count_min_add(C, hashes, x):
    for r from 0 to d - 1:
        c = hashes[r](x) mod w
        C[r][c] = C[r][c] + 1
```

To query an item, compute all matching counters and return the minimum.

```id="cms2"
count_min_query(C, hashes, x):
    answer = infinity

    for r from 0 to d - 1:
        c = hashes[r](x) mod w
        answer = min(answer, C[r][c])

    return answer
```

## Key Idea

Each counter may include counts from other items that collide with the query item. Therefore each row gives an overestimate. Taking the minimum across rows reduces the collision effect.

## Error Bound

With suitable choices of $w$ and $d$, Count Min Sketch gives:

$$
\hat{f}(x) \le f(x) + \epsilon n
$$

with probability at least:

$$
1 - \delta
$$

where $f(x)$ is the true frequency and $\hat{f}(x)$ is the estimate.

A common parameter choice is:

$$
w = \lceil e / \epsilon \rceil
$$

and

$$
d = \lceil \ln(1 / \delta) \rceil
$$

## Example

Suppose a stream contains many events:

$$
[a, b, a, c, a, d, b, a]
$$

The true count of $a$ is $4$. A Count Min Sketch query for $a$ may return $4$ or a larger value, depending on collisions. It will not return less than $4$.

## Correctness

Each occurrence of item $x$ increments exactly one counter in each row used by $x$. Therefore every queried counter includes all occurrences of $x$.

Other items may add extra increments through collisions. Thus every row gives an upper bound on the true count. The minimum of upper bounds is still an upper bound, and usually tighter than any single row.

## Complexity

| operation |    cost |
| --------- | ------: |
| update    |  $O(d)$ |
| query     |  $O(d)$ |
| memory    | $O(wd)$ |

The structure is suitable for large streams because memory does not depend on the number of distinct items.

## When to Use

Use Count Min Sketch when:

* the stream is too large for exact counting
* queries arrive online
* small overestimates are acceptable
* memory must be fixed in advance

Avoid it when exact counts are required or when underestimates would be preferable.

## Implementation

```id="cms3"
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon, delta):
        self.w = math.ceil(math.e / epsilon)
        self.d = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, row, x):
        data = f"{row}:{x}".encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.w

    def add(self, x, count=1):
        for r in range(self.d):
            c = self._hash(r, x)
            self.table[r][c] += count

    def query(self, x):
        answer = float("inf")

        for r in range(self.d):
            c = self._hash(r, x)
            answer = min(answer, self.table[r][c])

        return answer
```

```id="cms4"
package main

import (
	"encoding/binary"
	"hash/fnv"
	"math"
)

type CountMinSketch struct {
	w     int
	d     int
	table [][]int
}

func NewCountMinSketch(epsilon, delta float64) *CountMinSketch {
	w := int(math.Ceil(math.E / epsilon))
	d := int(math.Ceil(math.Log(1 / delta)))

	table := make([][]int, d)
	for i := range table {
		table[i] = make([]int, w)
	}

	return &CountMinSketch{
		w:     w,
		d:     d,
		table: table,
	}
}

func (s *CountMinSketch) hash(row int, x string) int {
	h := fnv.New64a()

	var b [8]byte
	binary.LittleEndian.PutUint64(b[:], uint64(row))

	h.Write(b[:])
	h.Write([]byte(x))

	return int(h.Sum64() % uint64(s.w))
}

func (s *CountMinSketch) Add(x string, count int) {
	for r := 0; r < s.d; r++ {
		c := s.hash(r, x)
		s.table[r][c] += count
	}
}

func (s *CountMinSketch) Query(x string) int {
	answer := int(^uint(0) >> 1)

	for r := 0; r < s.d; r++ {
		c := s.hash(r, x)
		if s.table[r][c] < answer {
			answer = s.table[r][c]
		}
	}

	return answer
}
```

