19.12 Count-Min Sketch Revisited
A Count-Min Sketch estimates item frequencies in a stream using fixed memory.
19.12 Count-Min Sketch Revisited
A Count-Min Sketch estimates item frequencies in a stream using fixed memory. It answers questions such as:
How many times has this key appeared?
The answer is approximate. Unlike a hash map, a Count-Min Sketch does not store every key. It stores counters arranged in a small table and updates several counters for each observed item.
The key property is one-sided error. A Count-Min Sketch never underestimates a frequency. It may overestimate because unrelated keys can collide into the same counters. This makes it useful when approximate upper bounds are acceptable and memory is limited.
Problem
Suppose you receive a stream of events:
login
search
login
purchase
search
login
The exact frequencies are:
| Item | Count |
|---|---|
| login | 3 |
| search | 2 |
| purchase | 1 |
The exact solution uses a dictionary:
from collections import Counter
counts = Counter()
for event in events:
counts[event] += 1
This works when the number of distinct keys is manageable. It becomes expensive when the stream contains millions or billions of distinct keys.
You want fixed memory while still estimating frequencies.
Solution
Use several hash functions and several rows of counters.
A Count-Min Sketch is a d × w table:
d = depth
w = width
Each row has its own hash function.
To add an item:
- Hash the item once per row.
- Increment the selected counter in each row.
To query an item:
- Hash the item once per row.
- Read the selected counter in each row.
- Return the minimum.
The minimum is used because collisions can only increase counters.
Basic Structure
Example with depth 3 and width 8:
Row 0: 0 0 0 0 0 0 0 0
Row 1: 0 0 0 0 0 0 0 0
Row 2: 0 0 0 0 0 0 0 0
Add:
"login"
Suppose the hash positions are:
Row 0 -> 2
Row 1 -> 5
Row 2 -> 1
After insertion:
Row 0: 0 0 1 0 0 0 0 0
Row 1: 0 0 0 0 0 1 0 0
Row 2: 0 1 0 0 0 0 0 0
Add "login" twice more and those same counters become 3.
A query reads:
3, 3, 3
and returns:
3
Why the Minimum Works
Suppose another item collides with "login" in one row.
Row 0 counter: 7
Row 1 counter: 3
Row 2 counter: 3
The true count is still 3.
The row 0 counter overestimates because of collision noise.
Taking the minimum returns:
min(7, 3, 3) = 3
If every row has some collision noise, the estimate may exceed the true count. It cannot be smaller, because each update for the item increments all of its own counters.
Implementation
This implementation is intentionally compact. It uses independent keyed hashes by prefixing the row number.
import hashlib
class CountMinSketch:
def __init__(self, depth: int, width: int) -> None:
if depth <= 0:
raise ValueError("depth must be positive")
if width <= 0:
raise ValueError("width must be positive")
self.depth = depth
self.width = width
self.table = [
[0] * width
for _ in range(depth)
]
def _hash(self, item: str, row: int) -> int:
digest = hashlib.blake2b(
f"{row}:{item}".encode("utf-8"),
digest_size=8,
).digest()
value = int.from_bytes(digest, "big")
return value % self.width
def add(self, item: str, count: int = 1) -> None:
if count < 0:
raise ValueError("count must be nonnegative")
for row in range(self.depth):
column = self._hash(item, row)
self.table[row][column] += count
def estimate(self, item: str) -> int:
return min(
self.table[row][self._hash(item, row)]
for row in range(self.depth)
)
Usage:
cms = CountMinSketch(depth=5, width=1000)
for event in [
"login",
"search",
"login",
"purchase",
"search",
"login",
]:
cms.add(event)
print(cms.estimate("login"))
print(cms.estimate("search"))
print(cms.estimate("purchase"))
Possible output:
3
2
1
With a small stream and a wide enough table, estimates often match exact counts. With larger streams, estimates may become larger than the true counts.
Weighted Updates
Streams often contain weighted events.
Example:
product_id = "p-17"
quantity = 4
Rather than adding one occurrence, add weight 4.
cms.add("p-17", count=4)
This is useful for:
- Sales quantities
- Network bytes
- Request costs
- Event weights
- Frequency aggregates
The same one-sided error property holds for nonnegative updates.
Error Model
Let:
N = total count of all updates
A Count-Min Sketch estimates a frequency:
f(x)
with an upper bound of roughly:
estimate(x) ≤ f(x) + εN
with probability at least:
1 - δ
The parameters are controlled by table dimensions:
width ≈ e / ε
depth ≈ ln(1 / δ)
Wider tables reduce collision error. More rows reduce the probability that all rows are badly affected.
Choosing Dimensions
Suppose you want error at most about 1% of total stream mass with high probability.
Set:
ε = 0.01
Then:
width ≈ e / 0.01 ≈ 272
For failure probability:
δ = 0.001
use:
depth ≈ ln(1000) ≈ 7
So a sketch with approximately:
7 × 272
counters is already meaningful.
In production, widths are often much larger because memory is cheap relative to the cost of exact per-key state.
Application: Heavy Hitters
A common task is finding frequent items.
Examples:
- Most requested URLs
- Most active users
- Most common search queries
- Most frequent error codes
- Highest-volume network flows
A Count-Min Sketch can estimate frequencies, but it does not directly list the keys. It only answers queries for keys you provide.
A practical heavy-hitter system usually combines:
- Count-Min Sketch for approximate counts.
- Candidate tracking structure for likely heavy keys.
Example:
from collections import Counter
class HeavyHitters:
def __init__(self, depth: int, width: int, candidates: int) -> None:
self.sketch = CountMinSketch(depth, width)
self.candidates = Counter()
self.limit = candidates
def add(self, item: str) -> None:
self.sketch.add(item)
self.candidates[item] = self.sketch.estimate(item)
if len(self.candidates) > self.limit * 2:
self.candidates = Counter(
dict(self.candidates.most_common(self.limit))
)
def top(self, k: int) -> list[tuple[str, int]]:
return [
(item, self.sketch.estimate(item))
for item, _ in self.candidates.most_common(k)
]
This is not the only way to track heavy hitters, but it shows the standard pattern: sketches estimate counts, while a separate structure remembers which keys are worth asking about.
Application: Network Traffic
A router or telemetry collector may observe millions of flows:
source_ip, destination_ip, port, protocol
Exact counting for every flow can consume too much memory.
A Count-Min Sketch can track approximate byte counts:
flow_key = "10.0.0.1:443->10.0.0.9:51520/tcp"
cms.add(flow_key, count=packet_size)
Later:
cms.estimate(flow_key)
returns an approximate byte count for that flow.
This is useful for anomaly detection, rate limiting, and traffic engineering.
Application: Search and Analytics
Search systems can use Count-Min Sketches to estimate:
- Query frequencies
- Term frequencies
- Document feature counts
- Click counts
- User-action counts
Analytics systems use sketches when exact aggregation would require too much memory or too much cross-machine coordination.
Mergeability
Count-Min Sketches are mergeable.
If two sketches use the same dimensions and hash functions, their tables can be added elementwise.
def merge(left: CountMinSketch, right: CountMinSketch) -> CountMinSketch:
if left.depth != right.depth or left.width != right.width:
raise ValueError("sketch dimensions must match")
result = CountMinSketch(left.depth, left.width)
result.table = [
[
left.table[row][column] + right.table[row][column]
for column in range(left.width)
]
for row in range(left.depth)
]
return result
This property is important in distributed systems. Each worker processes a partition of the stream, then sketches are merged to estimate global frequencies.
Comparison with Bloom Filters
| Property | Bloom Filter | Count-Min Sketch |
|---|---|---|
| Question | Is item present? | How many times? |
| Stored value | Bits | Counters |
| False positives | Yes | Overestimates |
| False negatives | No | No underestimates |
| Deletion | Not standard | Not with negative updates unless handled carefully |
| Mergeable | Yes | Yes |
A Bloom filter is a membership sketch. A Count-Min Sketch is a frequency sketch.
Negative Updates
Standard Count-Min Sketch assumes nonnegative updates.
If you allow negative updates:
cms.add("x", count=-1)
the one-sided guarantee breaks.
For streams with increments and decrements, use a different sketching method, such as Count Sketch, or carefully analyze the update model.
Complexity
| Operation | Cost |
|---|---|
| Add | O(depth) |
| Estimate | O(depth) |
| Merge | O(depth × width) |
| Memory | O(depth × width) |
The costs are independent of the number of distinct keys.
Common Mistakes
Do not expect the sketch to enumerate keys. It can estimate counts only for queried keys.
Do not use too narrow a table. Collisions dominate quickly.
Do not use different hash functions across sketches you plan to merge.
Do not allow negative updates unless the sketch design supports them.
Do not interpret estimates as exact counts. They are upper bounds with probabilistic error.
Takeaway
A Count-Min Sketch is a compact, mergeable frequency estimator for large streams. It replaces exact per-key storage with a small table of counters, giving fixed memory and fast updates. Its estimates never fall below the true count, but may exceed it due to collisions. Use it when approximate frequency information is more valuable than exact counts that exceed memory or coordination budgets.