19.11 Bloom Filters Revisited
A Bloom filter is one of the most widely deployed probabilistic data structures in modern computing.
19.11 Bloom Filters Revisited
A Bloom filter is one of the most widely deployed probabilistic data structures in modern computing. It answers a simple question:
Have I seen this item before?
The answer is not exact. A Bloom filter may occasionally report that an item is present when it is not. It never reports that a previously inserted item is absent.
This asymmetry makes Bloom filters useful in systems where false positives are acceptable but false negatives are expensive. Databases use them to avoid unnecessary disk reads. Distributed systems use them to reduce network requests. Storage engines use them to skip files that definitely do not contain a key.
In this section, we revisit Bloom filters from the perspective of randomized algorithms, examining how probability, hashing, and space efficiency combine to create a remarkably effective membership test.
Problem
Suppose you maintain a cache containing:
50 million URLs
Before performing an expensive database lookup, you want to know:
Could this URL exist?
A hash set provides exact answers:
if url in url_set:
...
Unfortunately:
50 million entries
may consume substantial memory.
Can we answer the question using a much smaller structure?
Solution
Store bits instead of keys.
A Bloom filter consists of:
- A bit array.
- Multiple hash functions.
Initially every bit is:
0
Example:
Index:
0 1 2 3 4 5 6 7 8 9
Bits:
0 0 0 0 0 0 0 0 0 0
To insert an item:
- Compute several hash values.
- Set the corresponding bits.
Example
Suppose:
Item = "apple"
Hash results:
h1("apple") = 2
h2("apple") = 5
h3("apple") = 8
Set those bits:
0 0 1 0 0 1 0 0 1 0
Insert:
"banana"
Hash results:
h1("banana") = 1
h2("banana") = 5
h3("banana") = 9
Updated bit array:
0 1 1 0 0 1 0 0 1 1
Notice:
Bit 5 was already set.
This is normal.
Multiple keys may share bits.
Membership Queries
To test whether:
"apple"
exists:
Compute:
2
5
8
Check bits:
1
1
1
All bits are present.
Result:
Possibly present
Now test:
"orange"
Suppose:
h1("orange") = 0
h2("orange") = 3
h3("orange") = 7
Bits:
0
0
0
At least one required bit is absent.
Result:
Definitely not present
This is the key property.
False Positives
Consider:
"grape"
Suppose hashes map to:
1
5
8
All three bits happen to be set by other items.
The filter reports:
Possibly present
even though:
"grape"
was never inserted.
This is a false positive.
Bloom filters allow false positives.
They never allow false negatives.
Why False Negatives Cannot Occur
Suppose:
"apple"
was inserted.
Insertion sets:
2
5
8
to:
1
A standard Bloom filter never clears bits.
Therefore:
2
5
8
remain set forever.
Future queries always observe:
1
1
1
for those positions.
The filter may become less precise over time, but it never forgets an inserted item.
Implementation
A simple implementation:
import hashlib
class BloomFilter:
def __init__(
self,
size: int,
hash_count: int,
):
self.size = size
self.hash_count = hash_count
self.bits = [0] * size
def _hashes(
self,
value: str,
) -> list[int]:
hashes = []
for i in range(self.hash_count):
digest = hashlib.blake2b(
f"{i}:{value}".encode(),
digest_size=8,
).digest()
h = int.from_bytes(
digest,
"big",
)
hashes.append(
h % self.size
)
return hashes
def add(
self,
value: str,
) -> None:
for index in self._hashes(value):
self.bits[index] = 1
def contains(
self,
value: str,
) -> bool:
return all(
self.bits[index]
for index in self._hashes(value)
)
Usage:
bf = BloomFilter(
size=1000,
hash_count=3,
)
bf.add("apple")
bf.add("banana")
print(bf.contains("apple"))
print(bf.contains("orange"))
Output:
True
False
The first answer means:
Possibly present
The second means:
Definitely absent
Understanding Saturation
As more items are inserted:
More bits become 1.
Initially:
0 0 0 0 0 0 0 0
After a few insertions:
0 1 0 1 0 1 0 0
Later:
1 1 1 1 0 1 1 1
Eventually:
1 1 1 1 1 1 1 1
At that point:
Everything appears present.
The filter has lost usefulness.
This phenomenon is called saturation.
Choosing Filter Size
Suppose:
n = expected insertions
m = bit array size
k = number of hash functions
These parameters determine accuracy.
Large:
m
reduces collisions.
Small:
m
increases collisions.
More bits generally mean fewer false positives.
Choosing Hash Count
Too few hash functions:
Bits remain underutilized.
Too many hash functions:
Bits become saturated quickly.
There is an optimal value:
k ≈ (m / n) ln(2)
For practical systems, values between:
4
and
10
are common.
False Positive Probability
Let:
m = number of bits
n = inserted items
k = hash functions
After inserting:
n
items:
Probability a bit remains zero:
(1 - 1/m)^(kn)
Approximation:
e^(-kn/m)
Probability a bit equals one:
1 - e^(-kn/m)
False positive probability:
(1 - e^(-kn/m))^k
This formula guides Bloom filter sizing.
Example Calculation
Suppose:
n = 1,000,000
m = 10,000,000
k = 7
The resulting false positive rate is roughly:
0.8%
Memory usage:
10,000,000 bits
≈
1.25 MB
An exact hash set storing one million strings would require significantly more memory.
This illustrates why Bloom filters are attractive.
Double Hashing
Computing many independent hash functions can be expensive.
A common optimization derives multiple hashes from two base hashes:
h(i)
=
h1 + i × h2
This technique reduces computational cost while preserving good distribution properties.
Many production Bloom filter implementations use double hashing.
Application: Database Storage Engines
Suppose a storage engine maintains many files:
File A
File B
File C
...
Each file contains only a subset of keys.
Without Bloom filters:
Read file metadata.
Possibly read file.
Check key.
Repeated many times.
With Bloom filters:
Query Bloom filter first.
If the filter says:
Definitely absent
the file can be skipped entirely.
This saves disk I/O.
Systems such as entity["software","Apache Cassandra","distributed database"], entity["software","Apache HBase","distributed database"], and entity["software","RocksDB","embedded key-value store"] rely heavily on this technique.
Application: Web Crawlers
A crawler may encounter billions of URLs.
Before fetching:
Have we already visited this URL?
A Bloom filter provides a compact answer.
Occasional false positives are usually acceptable.
Revisiting every URL would be expensive.
Application: Distributed Systems
Suppose node A asks:
Do you have key X?
Instead of transmitting complete key sets, nodes exchange Bloom filters.
Membership tests become:
O(1)
and communication volume decreases dramatically.
Counting Bloom Filters
Standard Bloom filters cannot delete items.
Why?
Because multiple items may share bits.
Clearing a bit could remove evidence for another item.
Counting Bloom filters replace bits with counters:
0 0 0 0
becomes:
0 2 1 0
Insertions increment.
Deletions decrement.
Memory usage increases, but deletion becomes possible.
Comparison with Hash Sets
| Property | Bloom Filter | Hash Set |
|---|---|---|
| Exact membership | No | Yes |
| False positives | Yes | No |
| False negatives | No | No |
| Memory efficiency | Excellent | Moderate |
| Deletion | No* | Yes |
| Mergeability | Easy | Expensive |
* Standard Bloom filter.
Complexity
| Operation | Cost |
|---|---|
| Insert | O(k) |
| Query | O(k) |
| Memory | O(m) |
These costs are independent of the number of stored elements.
This scalability is the primary advantage.
Common Mistakes
Ignoring Expected Capacity
Bloom filters must be sized according to expected insertions.
Underestimating:
n
causes saturation.
Using Poor Hash Functions
Uniform distribution is essential.
Weak hashes increase collision rates.
Treating Positive Results as Certain
A positive answer means:
Possibly present
not:
Definitely present
Expecting Deletion
Standard Bloom filters support insertion only.
Use counting Bloom filters if deletion is required.
Discussion
Bloom filters demonstrate one of the most important ideas in randomized algorithms: exact answers are often unnecessary. By accepting a small, controllable probability of false positives, a Bloom filter achieves dramatic reductions in memory usage while maintaining constant-time operations.
The structure succeeds because the error is one-sided. Negative answers are always correct. Positive answers merely identify candidates for further investigation. This pattern appears repeatedly in large-scale systems: use a cheap probabilistic filter first, then perform expensive exact work only when necessary.
The next section explores Count-Min Sketch, another influential probabilistic data structure. Whereas Bloom filters answer membership questions, Count-Min Sketch estimates frequencies in massive streams while using fixed memory and predictable error bounds.