19.12 Count-Min Sketch Revisited

A Count-Min Sketch estimates item frequencies in a stream using fixed memory.

19.12 Count-Min Sketch Revisited

A Count-Min Sketch estimates item frequencies in a stream using fixed memory. It answers questions such as:

How many times has this key appeared?

The answer is approximate. Unlike a hash map, a Count-Min Sketch does not store every key. It stores counters arranged in a small table and updates several counters for each observed item.

The key property is one-sided error. A Count-Min Sketch never underestimates a frequency. It may overestimate because unrelated keys can collide into the same counters. This makes it useful when approximate upper bounds are acceptable and memory is limited.

Problem

Suppose you receive a stream of events:

login
search
login
purchase
search
login

The exact frequencies are:

Item Count
login 3
search 2
purchase 1

The exact solution uses a dictionary:

from collections import Counter

counts = Counter()

for event in events:
    counts[event] += 1

This works when the number of distinct keys is manageable. It becomes expensive when the stream contains millions or billions of distinct keys.

You want fixed memory while still estimating frequencies.

Solution

Use several hash functions and several rows of counters.

A Count-Min Sketch is a d × w table:

d = depth
w = width

Each row has its own hash function.

To add an item:

  1. Hash the item once per row.
  2. Increment the selected counter in each row.

To query an item:

  1. Hash the item once per row.
  2. Read the selected counter in each row.
  3. Return the minimum.

The minimum is used because collisions can only increase counters.

Basic Structure

Example with depth 3 and width 8:

Row 0: 0 0 0 0 0 0 0 0
Row 1: 0 0 0 0 0 0 0 0
Row 2: 0 0 0 0 0 0 0 0

Add:

"login"

Suppose the hash positions are:

Row 0 -> 2
Row 1 -> 5
Row 2 -> 1

After insertion:

Row 0: 0 0 1 0 0 0 0 0
Row 1: 0 0 0 0 0 1 0 0
Row 2: 0 1 0 0 0 0 0 0

Add "login" twice more and those same counters become 3.

A query reads:

3, 3, 3

and returns:

3

Why the Minimum Works

Suppose another item collides with "login" in one row.

Row 0 counter: 7
Row 1 counter: 3
Row 2 counter: 3

The true count is still 3.

The row 0 counter overestimates because of collision noise.

Taking the minimum returns:

min(7, 3, 3) = 3

If every row has some collision noise, the estimate may exceed the true count. It cannot be smaller, because each update for the item increments all of its own counters.

Implementation

This implementation is intentionally compact. It uses independent keyed hashes by prefixing the row number.

import hashlib

class CountMinSketch:
    def __init__(self, depth: int, width: int) -> None:
        if depth <= 0:
            raise ValueError("depth must be positive")
        if width <= 0:
            raise ValueError("width must be positive")

        self.depth = depth
        self.width = width
        self.table = [
            [0] * width
            for _ in range(depth)
        ]

    def _hash(self, item: str, row: int) -> int:
        digest = hashlib.blake2b(
            f"{row}:{item}".encode("utf-8"),
            digest_size=8,
        ).digest()

        value = int.from_bytes(digest, "big")
        return value % self.width

    def add(self, item: str, count: int = 1) -> None:
        if count < 0:
            raise ValueError("count must be nonnegative")

        for row in range(self.depth):
            column = self._hash(item, row)
            self.table[row][column] += count

    def estimate(self, item: str) -> int:
        return min(
            self.table[row][self._hash(item, row)]
            for row in range(self.depth)
        )

Usage:

cms = CountMinSketch(depth=5, width=1000)

for event in [
    "login",
    "search",
    "login",
    "purchase",
    "search",
    "login",
]:
    cms.add(event)

print(cms.estimate("login"))
print(cms.estimate("search"))
print(cms.estimate("purchase"))

Possible output:

3
2
1

With a small stream and a wide enough table, estimates often match exact counts. With larger streams, estimates may become larger than the true counts.

Weighted Updates

Streams often contain weighted events.

Example:

product_id = "p-17"
quantity = 4

Rather than adding one occurrence, add weight 4.

cms.add("p-17", count=4)

This is useful for:

  • Sales quantities
  • Network bytes
  • Request costs
  • Event weights
  • Frequency aggregates

The same one-sided error property holds for nonnegative updates.

Error Model

Let:

N = total count of all updates

A Count-Min Sketch estimates a frequency:

f(x)

with an upper bound of roughly:

estimate(x) ≤ f(x) + εN

with probability at least:

1 - δ

The parameters are controlled by table dimensions:

width  ≈ e / ε
depth  ≈ ln(1 / δ)

Wider tables reduce collision error. More rows reduce the probability that all rows are badly affected.

Choosing Dimensions

Suppose you want error at most about 1% of total stream mass with high probability.

Set:

ε = 0.01

Then:

width ≈ e / 0.01 ≈ 272

For failure probability:

δ = 0.001

use:

depth ≈ ln(1000) ≈ 7

So a sketch with approximately:

7 × 272

counters is already meaningful.

In production, widths are often much larger because memory is cheap relative to the cost of exact per-key state.

Application: Heavy Hitters

A common task is finding frequent items.

Examples:

  • Most requested URLs
  • Most active users
  • Most common search queries
  • Most frequent error codes
  • Highest-volume network flows

A Count-Min Sketch can estimate frequencies, but it does not directly list the keys. It only answers queries for keys you provide.

A practical heavy-hitter system usually combines:

  1. Count-Min Sketch for approximate counts.
  2. Candidate tracking structure for likely heavy keys.

Example:

from collections import Counter

class HeavyHitters:
    def __init__(self, depth: int, width: int, candidates: int) -> None:
        self.sketch = CountMinSketch(depth, width)
        self.candidates = Counter()
        self.limit = candidates

    def add(self, item: str) -> None:
        self.sketch.add(item)
        self.candidates[item] = self.sketch.estimate(item)

        if len(self.candidates) > self.limit * 2:
            self.candidates = Counter(
                dict(self.candidates.most_common(self.limit))
            )

    def top(self, k: int) -> list[tuple[str, int]]:
        return [
            (item, self.sketch.estimate(item))
            for item, _ in self.candidates.most_common(k)
        ]

This is not the only way to track heavy hitters, but it shows the standard pattern: sketches estimate counts, while a separate structure remembers which keys are worth asking about.

Application: Network Traffic

A router or telemetry collector may observe millions of flows:

source_ip, destination_ip, port, protocol

Exact counting for every flow can consume too much memory.

A Count-Min Sketch can track approximate byte counts:

flow_key = "10.0.0.1:443->10.0.0.9:51520/tcp"
cms.add(flow_key, count=packet_size)

Later:

cms.estimate(flow_key)

returns an approximate byte count for that flow.

This is useful for anomaly detection, rate limiting, and traffic engineering.

Application: Search and Analytics

Search systems can use Count-Min Sketches to estimate:

  • Query frequencies
  • Term frequencies
  • Document feature counts
  • Click counts
  • User-action counts

Analytics systems use sketches when exact aggregation would require too much memory or too much cross-machine coordination.

Mergeability

Count-Min Sketches are mergeable.

If two sketches use the same dimensions and hash functions, their tables can be added elementwise.

def merge(left: CountMinSketch, right: CountMinSketch) -> CountMinSketch:
    if left.depth != right.depth or left.width != right.width:
        raise ValueError("sketch dimensions must match")

    result = CountMinSketch(left.depth, left.width)

    result.table = [
        [
            left.table[row][column] + right.table[row][column]
            for column in range(left.width)
        ]
        for row in range(left.depth)
    ]

    return result

This property is important in distributed systems. Each worker processes a partition of the stream, then sketches are merged to estimate global frequencies.

Comparison with Bloom Filters

Property Bloom Filter Count-Min Sketch
Question Is item present? How many times?
Stored value Bits Counters
False positives Yes Overestimates
False negatives No No underestimates
Deletion Not standard Not with negative updates unless handled carefully
Mergeable Yes Yes

A Bloom filter is a membership sketch. A Count-Min Sketch is a frequency sketch.

Negative Updates

Standard Count-Min Sketch assumes nonnegative updates.

If you allow negative updates:

cms.add("x", count=-1)

the one-sided guarantee breaks.

For streams with increments and decrements, use a different sketching method, such as Count Sketch, or carefully analyze the update model.

Complexity

Operation Cost
Add O(depth)
Estimate O(depth)
Merge O(depth × width)
Memory O(depth × width)

The costs are independent of the number of distinct keys.

Common Mistakes

Do not expect the sketch to enumerate keys. It can estimate counts only for queried keys.

Do not use too narrow a table. Collisions dominate quickly.

Do not use different hash functions across sketches you plan to merge.

Do not allow negative updates unless the sketch design supports them.

Do not interpret estimates as exact counts. They are upper bounds with probabilistic error.

Takeaway

A Count-Min Sketch is a compact, mergeable frequency estimator for large streams. It replaces exact per-key storage with a small table of counters, giving fixed memory and fast updates. Its estimates never fall below the true count, but may exceed it due to collisions. Use it when approximate frequency information is more valuable than exact counts that exceed memory or coordination budgets.