Approximate frequency counting for large key streams using a compact probabilistic data structure with bounded error.
Problem
You need approximate frequency counts for a large stream of keys, and an exact counting map would use too much memory.
A counting map stores one entry per distinct key. This is exact, but memory grows with the number of distinct keys. In a large log stream, network trace, search query stream, or telemetry pipeline, the number of distinct keys may be too large to store directly.
Solution
Use a count-min sketch. It stores a small two-dimensional table of counters and uses several hash functions.
CountMinSketch:
counters: d rows by w columns
d: number of hash functions
w: number of counters per rowTo add one occurrence of a key:
add(sketch, key):
for row from 0 to d - 1:
col = hash_row(row, key) mod w
counters[row][col] = counters[row][col] + 1To estimate the count of a key:
estimate(sketch, key):
answer = infinity
for row from 0 to d - 1:
col = hash_row(row, key) mod w
answer = min(answer, counters[row][col])
return answerThe estimate is the minimum counter among the rows.
Discussion
A count-min sketch compresses a frequency table into a fixed amount of memory. Each key is mapped to one counter in each row. Adding a key increments all of those counters. Querying a key checks the same counters and returns the smallest value.
The estimate never falls below the true count. Every occurrence of the key increments all its selected counters. Collisions with other keys can only add extra increments. Taking the minimum across rows reduces the effect of those collisions.
This makes the structure useful when overestimation is acceptable. Examples include finding heavy hitters, ranking frequent search queries, detecting high-volume IP addresses, and estimating event counts in telemetry streams.
A count-min sketch differs from a Bloom filter. A Bloom filter approximates membership. A count-min sketch approximates frequency. Both use hashing and fixed memory, but their answers have different meanings.
Correctness
Let count(x) be the true number of times key x has been added.
For every row, the counter selected by x is incremented once for each occurrence of x. Therefore, each selected counter is at least count(x).
Other keys may collide with x in a row and add extra increments to that row’s counter. Therefore, each selected counter is:
count(x) + collision_noisewhere collision_noise >= 0.
The query returns the minimum of these counters. Since every selected counter is at least count(x), the estimate is never smaller than the true count. Since the minimum tends to select the row with the least collision noise, it is usually closer to the true count than any single row would be.
Complexity
Let d be the number of rows and w be the number of counters per row.
Insertion takes O(d) time.
Query takes O(d) time.
Space usage is O(d w) counters.
The cost is independent of the number of distinct keys in the stream.
Example
Suppose the sketch has 3 rows and 8 columns.
After processing a stream, key "go" maps to these counters:
row 0, col 2: 18
row 1, col 5: 15
row 2, col 1: 17The estimate is:
min(18, 15, 17) = 15If the true count of "go" is 14, the estimate overcounts by 1. If the true count is 15, the estimate is exact.
Parameter Choice
Increasing w reduces collisions within each row.
Increasing d reduces the probability that every row has significant collision noise.
Use wider rows when the stream contains many distinct keys. Use more rows when you need higher confidence. Both choices increase memory or CPU cost.
A practical implementation often chooses d between 4 and 8, then sets w according to the memory budget.
Implementation Notes
Use independent hash functions or derive row-specific hashes from a strong base hash.
Use counters wide enough for the stream length. Small counters can overflow silently.
For weighted updates, increment by weight instead of 1.
For sliding windows, combine the sketch with time buckets and expire old buckets.
For conservative update, increment only counters currently equal to the minimum estimate. This can reduce overestimation in some workloads.
Common Mistakes
Expecting exact counts.
Forgetting that estimates are biased upward.
Using too few columns for a high-cardinality stream.
Allowing counters to overflow.
Comparing estimates from sketches built with different parameters or hash seeds.
Using count-min sketch when underestimation is safer than overestimation.