19.10 HyperLogLog
HyperLogLog estimates the number of distinct elements in a stream using a small, fixed amount of memory.
19.10 HyperLogLog
HyperLogLog estimates the number of distinct elements in a stream using a small, fixed amount of memory. This problem appears constantly in large systems: distinct users, distinct IP addresses, distinct search queries, distinct URLs, distinct products viewed, distinct errors, distinct devices, and distinct sessions.
The exact solution is straightforward: store every distinct value in a set and return the size of the set. That works for small inputs. It breaks down when the stream contains billions of events and the number of distinct values is large. HyperLogLog replaces the exact set with a compact probabilistic sketch.
The result is approximate, but the memory savings are substantial.
Problem
You receive a stream of events:
user-7
user-12
user-7
user-31
user-12
user-99
The exact distinct count is:
4
because the distinct users are:
user-7
user-12
user-31
user-99
For a small stream, this is trivial:
def exact_distinct_count(items: list[str]) -> int:
return len(set(items))
For a large stream, the set may become too large to store.
You want an estimate using fixed memory.
Solution
Hash each item and observe patterns in the hash values.
A good hash function spreads values uniformly across a large bit space. If you hash many distinct inputs, eventually you will see rare bit patterns. HyperLogLog uses the rarity of those patterns to estimate how many distinct items have appeared.
The most important pattern is the number of leading zero bits.
Example 8-bit hash values:
| Hash Bits | Leading Zeros |
|---|---|
10110100 |
0 |
01101011 |
1 |
00101110 |
2 |
00010101 |
3 |
00000110 |
5 |
Seeing many leading zeros is unlikely.
If you observe a hash with 10 leading zeros, that suggests you have probably seen many distinct values. Roughly, one out of every 2^10 uniformly random hash values begins with 10 zero bits.
Single-Register Estimator
Start with one register:
max_zero_count = 0
For each item:
- Hash the item.
- Count leading zeros.
- Store the maximum seen so far.
import hashlib
def hash64(value: str) -> int:
digest = hashlib.blake2b(
value.encode("utf-8"),
digest_size=8,
).digest()
return int.from_bytes(digest, "big")
def leading_zeros_64(x: int) -> int:
if x == 0:
return 64
return 64 - x.bit_length()
def single_register_estimate(items: list[str]) -> int:
max_zero_count = 0
for item in items:
h = hash64(item)
z = leading_zeros_64(h)
max_zero_count = max(max_zero_count, z)
return 2 ** max_zero_count
This estimator is too noisy for production use. One unusually rare hash value can distort the estimate.
HyperLogLog improves stability by using many registers.
Multiple Registers
Split the hash value into two parts:
Register index bits
Remaining bits
The register index chooses which bucket to update. The remaining bits are used to count leading zeros.
Example with m = 16 registers:
64-bit hash:
iiii vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
The first 4 bits select one of 16 registers. The rest determine the zero count.
Each register observes an independent-looking subset of the stream.
Implementation
This small implementation is suitable for learning. Production implementations require additional bias correction and careful hash handling.
import hashlib
import math
class HyperLogLog:
def __init__(self, precision: int = 10) -> None:
if precision < 4 or precision > 16:
raise ValueError("precision should usually be between 4 and 16")
self.precision = precision
self.register_count = 1 << precision
self.registers = [0] * self.register_count
def _hash64(self, value: str) -> int:
digest = hashlib.blake2b(
value.encode("utf-8"),
digest_size=8,
).digest()
return int.from_bytes(digest, "big")
def _rank(self, value: int, width: int) -> int:
if value == 0:
return width + 1
return width - value.bit_length() + 1
def add(self, value: str) -> None:
h = self._hash64(value)
index = h >> (64 - self.precision)
remaining = h & ((1 << (64 - self.precision)) - 1)
rank = self._rank(
remaining,
64 - self.precision,
)
self.registers[index] = max(
self.registers[index],
rank,
)
def estimate(self) -> float:
m = self.register_count
if m == 16:
alpha = 0.673
elif m == 32:
alpha = 0.697
elif m == 64:
alpha = 0.709
else:
alpha = 0.7213 / (1 + 1.079 / m)
indicator_sum = sum(
2.0 ** (-register)
for register in self.registers
)
raw_estimate = alpha * m * m / indicator_sum
empty_registers = self.registers.count(0)
if raw_estimate <= 2.5 * m and empty_registers > 0:
return m * math.log(m / empty_registers)
return raw_estimate
Usage:
hll = HyperLogLog(precision=10)
for user_id in [
"user-7",
"user-12",
"user-7",
"user-31",
"user-12",
"user-99",
]:
hll.add(user_id)
print(round(hll.estimate()))
Possible output:
4
For tiny inputs, approximate sketches can vary. HyperLogLog is most useful when the stream is large.
Choosing Precision
The precision controls the number of registers:
m = 2^p
where p is the precision.
| Precision | Registers | Typical Memory Shape |
|---|---|---|
| 4 | 16 | tiny, noisy |
| 10 | 1,024 | common teaching default |
| 12 | 4,096 | better accuracy |
| 14 | 16,384 | high accuracy |
| 16 | 65,536 | larger, more accurate |
More registers reduce error, but require more memory.
A standard rule of thumb for relative error is approximately:
1.04 / sqrt(m)
So with:
m = 16,384
the relative error is about:
1.04 / 128 ≈ 0.008125
or:
0.8125%
Why the Harmonic Mean Appears
Each register produces a noisy estimate. Some registers overestimate. Some underestimate. A simple average is too sensitive to large outliers.
HyperLogLog combines register values using a harmonic-mean-like calculation:
m² / Σ 2^(-register[i])
This reduces the influence of extreme register values and stabilizes the estimate.
That is the core improvement over a single-register estimator.
Mergeability
One of HyperLogLog’s most useful properties is mergeability.
Suppose one server sees:
A stream
and another server sees:
B stream
Each server maintains its own sketch.
To estimate the distinct count of the union, merge registers by taking the maximum at each position:
def merge(left: HyperLogLog, right: HyperLogLog) -> HyperLogLog:
if left.precision != right.precision:
raise ValueError("precisions must match")
result = HyperLogLog(left.precision)
result.registers = [
max(a, b)
for a, b in zip(left.registers, right.registers)
]
return result
This works because each register stores the maximum rank observed for its subset. The maximum over two sketches is exactly the maximum that would have been observed if both streams had been processed by one sketch.
Mergeability makes HyperLogLog extremely useful in distributed systems.
Application: Distinct Users
A web analytics system may need distinct daily users.
Exact approach:
Store every user ID seen today.
Count unique IDs at midnight.
This can be expensive.
HyperLogLog approach:
Maintain one sketch per day.
Add each user ID to today's sketch.
Estimate at query time.
Memory stays fixed even as traffic grows.
Application: Distributed Analytics
Suppose each region maintains its own sketch:
us-east
eu-west
ap-south
A global distinct estimate can be computed by merging the regional sketches.
This avoids transferring raw user IDs across regions.
It also avoids double-counting users who appear in multiple regions, because the merged sketch estimates the union.
Application: Query Planning
Databases often need approximate cardinalities.
Example:
SELECT COUNT(DISTINCT user_id)
FROM events
WHERE event_date = CURRENT_DATE;
Exact computation may require sorting or hashing many rows.
A precomputed sketch can answer quickly:
estimated distinct users today ≈ 18.4 million
The estimate may be sufficient for dashboards, optimizers, and monitoring.
Complexity
Let:
n = number of events
m = number of registers
| Operation | Cost |
|---|---|
| Add item | O(1) |
| Estimate | O(m) |
| Merge | O(m) |
| Memory | O(m) |
The memory cost is independent of the number of distinct items.
That is the central advantage.
Correctness and Error Model
HyperLogLog does not provide exact correctness. It provides an approximate estimate with predictable statistical error under the assumption that the hash function behaves uniformly.
Important guarantees:
| Property | Meaning |
|---|---|
| No stored elements | Cannot recover original keys |
| Fixed memory | Memory depends on registers, not input size |
| Approximate count | Estimate has statistical error |
| Mergeable | Sketches can estimate unions |
The algorithm counts distinct hash patterns, not distinct original values directly. Hash quality is therefore critical.
Common Mistakes
Do not use HyperLogLog when exact counts are legally, financially, or operationally required.
Do not use a weak hash function. Poor hash distribution invalidates the estimator.
Do not compare sketches with different precision values unless the implementation explicitly supports conversion.
Do not expect accurate results for very small cardinalities unless the implementation includes small-range correction.
Do not use HyperLogLog to list distinct elements. It estimates counts only.
Takeaway
HyperLogLog is the standard example of a scalable probabilistic cardinality estimator. It answers the question “How many distinct values have appeared?” using fixed memory, constant-time updates, and mergeable summaries. It trades exactness for scale, and the trade is often favorable in analytics, monitoring, databases, and distributed systems.