Count Min Sketch is a probabilistic data structure for estimating item frequencies in a stream. It stores counts in a small two dimensional table and uses several hash functions to update and query the table.
It gives approximate answers with one-sided error: the estimated count is never below the true count, but it may be above it because of hash collisions.
Problem
Given a stream
support approximate frequency queries:
without storing a full frequency table.
Algorithm
Use hash functions and a table with rows and columns.
To add an item, increment one counter per row.
count_min_add(C, hashes, x):
for r from 0 to d - 1:
c = hashes[r](x) mod w
C[r][c] = C[r][c] + 1To query an item, compute all matching counters and return the minimum.
count_min_query(C, hashes, x):
answer = infinity
for r from 0 to d - 1:
c = hashes[r](x) mod w
answer = min(answer, C[r][c])
return answerKey Idea
Each counter may include counts from other items that collide with the query item. Therefore each row gives an overestimate. Taking the minimum across rows reduces the collision effect.
Error Bound
With suitable choices of and , Count Min Sketch gives:
with probability at least:
where is the true frequency and is the estimate.
A common parameter choice is:
and
Example
Suppose a stream contains many events:
The true count of is . A Count Min Sketch query for may return or a larger value, depending on collisions. It will not return less than .
Correctness
Each occurrence of item increments exactly one counter in each row used by . Therefore every queried counter includes all occurrences of .
Other items may add extra increments through collisions. Thus every row gives an upper bound on the true count. The minimum of upper bounds is still an upper bound, and usually tighter than any single row.
Complexity
| operation | cost |
|---|---|
| update | |
| query | |
| memory |
The structure is suitable for large streams because memory does not depend on the number of distinct items.
When to Use
Use Count Min Sketch when:
- the stream is too large for exact counting
- queries arrive online
- small overestimates are acceptable
- memory must be fixed in advance
Avoid it when exact counts are required or when underestimates would be preferable.
Implementation
import hashlib
import math
class CountMinSketch:
def __init__(self, epsilon, delta):
self.w = math.ceil(math.e / epsilon)
self.d = math.ceil(math.log(1 / delta))
self.table = [[0] * self.w for _ in range(self.d)]
def _hash(self, row, x):
data = f"{row}:{x}".encode()
digest = hashlib.blake2b(data, digest_size=8).digest()
return int.from_bytes(digest, "little") % self.w
def add(self, x, count=1):
for r in range(self.d):
c = self._hash(r, x)
self.table[r][c] += count
def query(self, x):
answer = float("inf")
for r in range(self.d):
c = self._hash(r, x)
answer = min(answer, self.table[r][c])
return answerpackage main
import (
"encoding/binary"
"hash/fnv"
"math"
)
type CountMinSketch struct {
w int
d int
table [][]int
}
func NewCountMinSketch(epsilon, delta float64) *CountMinSketch {
w := int(math.Ceil(math.E / epsilon))
d := int(math.Ceil(math.Log(1 / delta)))
table := make([][]int, d)
for i := range table {
table[i] = make([]int, w)
}
return &CountMinSketch{
w: w,
d: d,
table: table,
}
}
func (s *CountMinSketch) hash(row int, x string) int {
h := fnv.New64a()
var b [8]byte
binary.LittleEndian.PutUint64(b[:], uint64(row))
h.Write(b[:])
h.Write([]byte(x))
return int(h.Sum64() % uint64(s.w))
}
func (s *CountMinSketch) Add(x string, count int) {
for r := 0; r < s.d; r++ {
c := s.hash(r, x)
s.table[r][c] += count
}
}
func (s *CountMinSketch) Query(x string) int {
answer := int(^uint(0) >> 1)
for r := 0; r < s.d; r++ {
c := s.hash(r, x)
if s.table[r][c] < answer {
answer = s.table[r][c]
}
}
return answer
}