Skip to content

Count Min Sketch Query

Estimate item frequencies in a stream using a compact hash based summary.

Count Min Sketch is a probabilistic data structure for estimating item frequencies in a stream. It stores counts in a small two dimensional table and uses several hash functions to update and query the table.

It gives approximate answers with one-sided error: the estimated count is never below the true count, but it may be above it because of hash collisions.

Problem

Given a stream

x1,x2,,xn x_1, x_2, \dots, x_n

support approximate frequency queries:

estimate(x)count(x) \text{estimate}(x) \approx \text{count}(x)

without storing a full frequency table.

Algorithm

Use dd hash functions and a table with dd rows and ww columns.

To add an item, increment one counter per row.

count_min_add(C, hashes, x):
    for r from 0 to d - 1:
        c = hashes[r](x) mod w
        C[r][c] = C[r][c] + 1

To query an item, compute all matching counters and return the minimum.

count_min_query(C, hashes, x):
    answer = infinity

    for r from 0 to d - 1:
        c = hashes[r](x) mod w
        answer = min(answer, C[r][c])

    return answer

Key Idea

Each counter may include counts from other items that collide with the query item. Therefore each row gives an overestimate. Taking the minimum across rows reduces the collision effect.

Error Bound

With suitable choices of ww and dd, Count Min Sketch gives:

f^(x)f(x)+ϵn \hat{f}(x) \le f(x) + \epsilon n

with probability at least:

1δ 1 - \delta

where f(x)f(x) is the true frequency and f^(x)\hat{f}(x) is the estimate.

A common parameter choice is:

w=e/ϵ w = \lceil e / \epsilon \rceil

and

d=ln(1/δ) d = \lceil \ln(1 / \delta) \rceil

Example

Suppose a stream contains many events:

[a,b,a,c,a,d,b,a] [a, b, a, c, a, d, b, a]

The true count of aa is 44. A Count Min Sketch query for aa may return 44 or a larger value, depending on collisions. It will not return less than 44.

Correctness

Each occurrence of item xx increments exactly one counter in each row used by xx. Therefore every queried counter includes all occurrences of xx.

Other items may add extra increments through collisions. Thus every row gives an upper bound on the true count. The minimum of upper bounds is still an upper bound, and usually tighter than any single row.

Complexity

operationcost
updateO(d)O(d)
queryO(d)O(d)
memoryO(wd)O(wd)

The structure is suitable for large streams because memory does not depend on the number of distinct items.

When to Use

Use Count Min Sketch when:

  • the stream is too large for exact counting
  • queries arrive online
  • small overestimates are acceptable
  • memory must be fixed in advance

Avoid it when exact counts are required or when underestimates would be preferable.

Implementation

import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon, delta):
        self.w = math.ceil(math.e / epsilon)
        self.d = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _hash(self, row, x):
        data = f"{row}:{x}".encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.w

    def add(self, x, count=1):
        for r in range(self.d):
            c = self._hash(r, x)
            self.table[r][c] += count

    def query(self, x):
        answer = float("inf")

        for r in range(self.d):
            c = self._hash(r, x)
            answer = min(answer, self.table[r][c])

        return answer
package main

import (
	"encoding/binary"
	"hash/fnv"
	"math"
)

type CountMinSketch struct {
	w     int
	d     int
	table [][]int
}

func NewCountMinSketch(epsilon, delta float64) *CountMinSketch {
	w := int(math.Ceil(math.E / epsilon))
	d := int(math.Ceil(math.Log(1 / delta)))

	table := make([][]int, d)
	for i := range table {
		table[i] = make([]int, w)
	}

	return &CountMinSketch{
		w:     w,
		d:     d,
		table: table,
	}
}

func (s *CountMinSketch) hash(row int, x string) int {
	h := fnv.New64a()

	var b [8]byte
	binary.LittleEndian.PutUint64(b[:], uint64(row))

	h.Write(b[:])
	h.Write([]byte(x))

	return int(h.Sum64() % uint64(s.w))
}

func (s *CountMinSketch) Add(x string, count int) {
	for r := 0; r < s.d; r++ {
		c := s.hash(r, x)
		s.table[r][c] += count
	}
}

func (s *CountMinSketch) Query(x string) int {
	answer := int(^uint(0) >> 1)

	for r := 0; r < s.d; r++ {
		c := s.hash(r, x)
		if s.table[r][c] < answer {
			answer = s.table[r][c]
		}
	}

	return answer
}