19.10 HyperLogLog

HyperLogLog estimates the number of distinct elements in a stream using a small, fixed amount of memory. This problem appears constantly in large systems: distinct users, distinct IP addresses, distinct search queries, distinct URLs, distinct products viewed, distinct errors, distinct devices, and distinct sessions.

The exact solution is straightforward: store every distinct value in a set and return the size of the set. That works for small inputs. It breaks down when the stream contains billions of events and the number of distinct values is large. HyperLogLog replaces the exact set with a compact probabilistic sketch.

The result is approximate, but the memory savings are substantial.

Problem

You receive a stream of events:

user-7
user-12
user-7
user-31
user-12
user-99

The exact distinct count is:

because the distinct users are:

user-7
user-12
user-31
user-99

For a small stream, this is trivial:

def exact_distinct_count(items: list[str]) -> int:
    return len(set(items))

For a large stream, the set may become too large to store.

You want an estimate using fixed memory.

Solution

Hash each item and observe patterns in the hash values.

A good hash function spreads values uniformly across a large bit space. If you hash many distinct inputs, eventually you will see rare bit patterns. HyperLogLog uses the rarity of those patterns to estimate how many distinct items have appeared.

The most important pattern is the number of leading zero bits.

Example 8-bit hash values:

Hash Bits	Leading Zeros
`10110100`	0
`01101011`	1
`00101110`	2
`00010101`	3
`00000110`	5

Seeing many leading zeros is unlikely.

If you observe a hash with 10 leading zeros, that suggests you have probably seen many distinct values. Roughly, one out of every 2^10 uniformly random hash values begins with 10 zero bits.

Single-Register Estimator

Start with one register:

max_zero_count = 0

For each item:

Hash the item.
Count leading zeros.
Store the maximum seen so far.

import hashlib

def hash64(value: str) -> int:
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
    ).digest()

    return int.from_bytes(digest, "big")

def leading_zeros_64(x: int) -> int:
    if x == 0:
        return 64

    return 64 - x.bit_length()

def single_register_estimate(items: list[str]) -> int:
    max_zero_count = 0

    for item in items:
        h = hash64(item)
        z = leading_zeros_64(h)
        max_zero_count = max(max_zero_count, z)

    return 2 ** max_zero_count

This estimator is too noisy for production use. One unusually rare hash value can distort the estimate.

HyperLogLog improves stability by using many registers.

Multiple Registers

Split the hash value into two parts:

Register index bits
Remaining bits

The register index chooses which bucket to update. The remaining bits are used to count leading zeros.

Example with m = 16 registers:

64-bit hash:
iiii vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

The first 4 bits select one of 16 registers. The rest determine the zero count.

Each register observes an independent-looking subset of the stream.

Implementation

This small implementation is suitable for learning. Production implementations require additional bias correction and careful hash handling.

import hashlib
import math

class HyperLogLog:
    def __init__(self, precision: int = 10) -> None:
        if precision < 4 or precision > 16:
            raise ValueError("precision should usually be between 4 and 16")

        self.precision = precision
        self.register_count = 1 << precision
        self.registers = [0] * self.register_count

    def _hash64(self, value: str) -> int:
        digest = hashlib.blake2b(
            value.encode("utf-8"),
            digest_size=8,
        ).digest()

        return int.from_bytes(digest, "big")

    def _rank(self, value: int, width: int) -> int:
        if value == 0:
            return width + 1

        return width - value.bit_length() + 1

    def add(self, value: str) -> None:
        h = self._hash64(value)

        index = h >> (64 - self.precision)
        remaining = h & ((1 << (64 - self.precision)) - 1)

        rank = self._rank(
            remaining,
            64 - self.precision,
        )

        self.registers[index] = max(
            self.registers[index],
            rank,
        )

    def estimate(self) -> float:
        m = self.register_count

        if m == 16:
            alpha = 0.673
        elif m == 32:
            alpha = 0.697
        elif m == 64:
            alpha = 0.709
        else:
            alpha = 0.7213 / (1 + 1.079 / m)

        indicator_sum = sum(
            2.0 ** (-register)
            for register in self.registers
        )

        raw_estimate = alpha * m * m / indicator_sum

        empty_registers = self.registers.count(0)

        if raw_estimate <= 2.5 * m and empty_registers > 0:
            return m * math.log(m / empty_registers)

        return raw_estimate

Usage:

hll = HyperLogLog(precision=10)

for user_id in [
    "user-7",
    "user-12",
    "user-7",
    "user-31",
    "user-12",
    "user-99",
]:
    hll.add(user_id)

print(round(hll.estimate()))

Possible output:

For tiny inputs, approximate sketches can vary. HyperLogLog is most useful when the stream is large.

Choosing Precision

The precision controls the number of registers:

m = 2^p

where p is the precision.

Precision	Registers	Typical Memory Shape
4	16	tiny, noisy
10	1,024	common teaching default
12	4,096	better accuracy
14	16,384	high accuracy
16	65,536	larger, more accurate

More registers reduce error, but require more memory.

A standard rule of thumb for relative error is approximately:

1.04 / sqrt(m)

So with:

m = 16,384

the relative error is about:

1.04 / 128 ≈ 0.008125

or:

0.8125%

Why the Harmonic Mean Appears

Each register produces a noisy estimate. Some registers overestimate. Some underestimate. A simple average is too sensitive to large outliers.

HyperLogLog combines register values using a harmonic-mean-like calculation:

m² / Σ 2^(-register[i])

This reduces the influence of extreme register values and stabilizes the estimate.

That is the core improvement over a single-register estimator.

Mergeability

One of HyperLogLog’s most useful properties is mergeability.

Suppose one server sees:

A stream

and another server sees:

B stream

Each server maintains its own sketch.

To estimate the distinct count of the union, merge registers by taking the maximum at each position:

def merge(left: HyperLogLog, right: HyperLogLog) -> HyperLogLog:
    if left.precision != right.precision:
        raise ValueError("precisions must match")

    result = HyperLogLog(left.precision)

    result.registers = [
        max(a, b)
        for a, b in zip(left.registers, right.registers)
    ]

    return result

This works because each register stores the maximum rank observed for its subset. The maximum over two sketches is exactly the maximum that would have been observed if both streams had been processed by one sketch.

Mergeability makes HyperLogLog extremely useful in distributed systems.

Application: Distinct Users

A web analytics system may need distinct daily users.

Exact approach:

Store every user ID seen today.
Count unique IDs at midnight.

This can be expensive.

HyperLogLog approach:

Maintain one sketch per day.
Add each user ID to today's sketch.
Estimate at query time.

Memory stays fixed even as traffic grows.

Application: Distributed Analytics

Suppose each region maintains its own sketch:

us-east
eu-west
ap-south

A global distinct estimate can be computed by merging the regional sketches.

This avoids transferring raw user IDs across regions.

It also avoids double-counting users who appear in multiple regions, because the merged sketch estimates the union.

Application: Query Planning

Databases often need approximate cardinalities.

Example:

SELECT COUNT(DISTINCT user_id)
FROM events
WHERE event_date = CURRENT_DATE;

Exact computation may require sorting or hashing many rows.

A precomputed sketch can answer quickly:

estimated distinct users today ≈ 18.4 million

The estimate may be sufficient for dashboards, optimizers, and monitoring.

Complexity

Let:

n = number of events
m = number of registers

Operation	Cost
Add item	O(1)
Estimate	O(m)
Merge	O(m)
Memory	O(m)

The memory cost is independent of the number of distinct items.

That is the central advantage.

Correctness and Error Model

HyperLogLog does not provide exact correctness. It provides an approximate estimate with predictable statistical error under the assumption that the hash function behaves uniformly.

Important guarantees:

Property	Meaning
No stored elements	Cannot recover original keys
Fixed memory	Memory depends on registers, not input size
Approximate count	Estimate has statistical error
Mergeable	Sketches can estimate unions

The algorithm counts distinct hash patterns, not distinct original values directly. Hash quality is therefore critical.

Common Mistakes

Do not use HyperLogLog when exact counts are legally, financially, or operationally required.

Do not use a weak hash function. Poor hash distribution invalidates the estimator.

Do not compare sketches with different precision values unless the implementation explicitly supports conversion.

Do not expect accurate results for very small cardinalities unless the implementation includes small-range correction.

Do not use HyperLogLog to list distinct elements. It estimates counts only.

Takeaway

HyperLogLog is the standard example of a scalable probabilistic cardinality estimator. It answers the question “How many distinct values have appeared?” using fixed memory, constant-time updates, and mergeable summaries. It trades exactness for scale, and the trade is often favorable in analytics, monitoring, databases, and distributed systems.