19.11 Bloom Filters Revisited

A Bloom filter is one of the most widely deployed probabilistic data structures in modern computing. It answers a simple question:

Have I seen this item before?

The answer is not exact. A Bloom filter may occasionally report that an item is present when it is not. It never reports that a previously inserted item is absent.

This asymmetry makes Bloom filters useful in systems where false positives are acceptable but false negatives are expensive. Databases use them to avoid unnecessary disk reads. Distributed systems use them to reduce network requests. Storage engines use them to skip files that definitely do not contain a key.

In this section, we revisit Bloom filters from the perspective of randomized algorithms, examining how probability, hashing, and space efficiency combine to create a remarkably effective membership test.

Problem

Suppose you maintain a cache containing:

50 million URLs

Before performing an expensive database lookup, you want to know:

Could this URL exist?

A hash set provides exact answers:

if url in url_set:
    ...

Unfortunately:

50 million entries

may consume substantial memory.

Can we answer the question using a much smaller structure?

Solution

Store bits instead of keys.

A Bloom filter consists of:

A bit array.
Multiple hash functions.

Initially every bit is:

Example:

Index:

0 1 2 3 4 5 6 7 8 9

Bits:

0 0 0 0 0 0 0 0 0 0

To insert an item:

Compute several hash values.
Set the corresponding bits.

Example

Suppose:

Item = "apple"

Hash results:

h1("apple") = 2
h2("apple") = 5
h3("apple") = 8

Set those bits:

0 0 1 0 0 1 0 0 1 0

Insert:

"banana"

Hash results:

h1("banana") = 1
h2("banana") = 5
h3("banana") = 9

Updated bit array:

0 1 1 0 0 1 0 0 1 1

Notice:

Bit 5 was already set.

This is normal.

Multiple keys may share bits.

Membership Queries

To test whether:

"apple"

exists:

Compute:

2
5
8

Check bits:

1
1
1

All bits are present.

Result:

Possibly present

Now test:

"orange"

Suppose:

h1("orange") = 0
h2("orange") = 3
h3("orange") = 7

Bits:

0
0
0

At least one required bit is absent.

Result:

Definitely not present

This is the key property.

False Positives

Consider:

"grape"

Suppose hashes map to:

1
5
8

All three bits happen to be set by other items.

The filter reports:

Possibly present

even though:

"grape"

was never inserted.

This is a false positive.

Bloom filters allow false positives.

They never allow false negatives.

Why False Negatives Cannot Occur

Suppose:

"apple"

was inserted.

Insertion sets:

2
5
8

to:

A standard Bloom filter never clears bits.

Therefore:

2
5
8

remain set forever.

Future queries always observe:

1
1
1

for those positions.

The filter may become less precise over time, but it never forgets an inserted item.

Implementation

A simple implementation:

import hashlib

class BloomFilter:
    def __init__(
        self,
        size: int,
        hash_count: int,
    ):
        self.size = size
        self.hash_count = hash_count
        self.bits = [0] * size

    def _hashes(
        self,
        value: str,
    ) -> list[int]:

        hashes = []

        for i in range(self.hash_count):
            digest = hashlib.blake2b(
                f"{i}:{value}".encode(),
                digest_size=8,
            ).digest()

            h = int.from_bytes(
                digest,
                "big",
            )

            hashes.append(
                h % self.size
            )

        return hashes

    def add(
        self,
        value: str,
    ) -> None:

        for index in self._hashes(value):
            self.bits[index] = 1

    def contains(
        self,
        value: str,
    ) -> bool:

        return all(
            self.bits[index]
            for index in self._hashes(value)
        )

Usage:

bf = BloomFilter(
    size=1000,
    hash_count=3,
)

bf.add("apple")
bf.add("banana")

print(bf.contains("apple"))
print(bf.contains("orange"))

Output:

True
False

The first answer means:

Possibly present

The second means:

Definitely absent

Understanding Saturation

As more items are inserted:

More bits become 1.

Initially:

0 0 0 0 0 0 0 0

After a few insertions:

0 1 0 1 0 1 0 0

Later:

1 1 1 1 0 1 1 1

Eventually:

1 1 1 1 1 1 1 1

At that point:

Everything appears present.

The filter has lost usefulness.

This phenomenon is called saturation.

Choosing Filter Size

Suppose:

n = expected insertions
m = bit array size
k = number of hash functions

These parameters determine accuracy.

Large:

reduces collisions.

Small:

increases collisions.

More bits generally mean fewer false positives.

Choosing Hash Count

Too few hash functions:

Bits remain underutilized.

Too many hash functions:

Bits become saturated quickly.

There is an optimal value:

k ≈ (m / n) ln(2)

For practical systems, values between:

and

are common.

False Positive Probability

Let:

m = number of bits
n = inserted items
k = hash functions

After inserting:

items:

Probability a bit remains zero:

(1 - 1/m)^(kn)

Approximation:

e^(-kn/m)

Probability a bit equals one:

1 - e^(-kn/m)

False positive probability:

(1 - e^(-kn/m))^k

This formula guides Bloom filter sizing.

Example Calculation

Suppose:

n = 1,000,000
m = 10,000,000
k = 7

The resulting false positive rate is roughly:

0.8%

Memory usage:

10,000,000 bits
≈
1.25 MB

An exact hash set storing one million strings would require significantly more memory.

This illustrates why Bloom filters are attractive.

Double Hashing

Computing many independent hash functions can be expensive.

A common optimization derives multiple hashes from two base hashes:

h(i)
=
h1 + i × h2

This technique reduces computational cost while preserving good distribution properties.

Many production Bloom filter implementations use double hashing.

Application: Database Storage Engines

Suppose a storage engine maintains many files:

File A
File B
File C
...

Each file contains only a subset of keys.

Without Bloom filters:

Read file metadata.
Possibly read file.
Check key.

Repeated many times.

With Bloom filters:

Query Bloom filter first.

If the filter says:

Definitely absent

the file can be skipped entirely.

This saves disk I/O.

Systems such as entity["software","Apache Cassandra","distributed database"], entity["software","Apache HBase","distributed database"], and entity["software","RocksDB","embedded key-value store"] rely heavily on this technique.

Application: Web Crawlers

A crawler may encounter billions of URLs.

Before fetching:

Have we already visited this URL?

A Bloom filter provides a compact answer.

Occasional false positives are usually acceptable.

Revisiting every URL would be expensive.

Application: Distributed Systems

Suppose node A asks:

Do you have key X?

Instead of transmitting complete key sets, nodes exchange Bloom filters.

Membership tests become:

O(1)

and communication volume decreases dramatically.

Counting Bloom Filters

Standard Bloom filters cannot delete items.

Why?

Because multiple items may share bits.

Clearing a bit could remove evidence for another item.

Counting Bloom filters replace bits with counters:

0 0 0 0

becomes:

0 2 1 0

Insertions increment.

Deletions decrement.

Memory usage increases, but deletion becomes possible.

Comparison with Hash Sets

Property	Bloom Filter	Hash Set
Exact membership	No	Yes
False positives	Yes	No
False negatives	No	No
Memory efficiency	Excellent	Moderate
Deletion	No*	Yes
Mergeability	Easy	Expensive

* Standard Bloom filter.

Complexity

Operation	Cost
Insert	O(k)
Query	O(k)
Memory	O(m)

These costs are independent of the number of stored elements.

This scalability is the primary advantage.

Common Mistakes

Ignoring Expected Capacity

Bloom filters must be sized according to expected insertions.

Underestimating:

causes saturation.

Using Poor Hash Functions

Uniform distribution is essential.

Weak hashes increase collision rates.

Treating Positive Results as Certain

A positive answer means:

Possibly present

not:

Definitely present

Expecting Deletion

Standard Bloom filters support insertion only.

Use counting Bloom filters if deletion is required.

Discussion

Bloom filters demonstrate one of the most important ideas in randomized algorithms: exact answers are often unnecessary. By accepting a small, controllable probability of false positives, a Bloom filter achieves dramatic reductions in memory usage while maintaining constant-time operations.

The structure succeeds because the error is one-sided. Negative answers are always correct. Positive answers merely identify candidates for further investigation. This pattern appears repeatedly in large-scale systems: use a cheap probabilistic filter first, then perform expensive exact work only when necessary.

The next section explores Count-Min Sketch, another influential probabilistic data structure. Whereas Bloom filters answer membership questions, Count-Min Sketch estimates frequencies in massive streams while using fixed memory and predictable error bounds.