5.12 Bloom Filters

Problem

You need to test whether a key is in a set using very little memory, and you can tolerate occasional false positives but not false negatives.

A standard hash set stores all keys explicitly. This gives exact answers but uses memory proportional to the number of keys. For large-scale or streaming data, this can be too expensive.

Solution

Use a Bloom filter. It is a bit array combined with multiple hash functions.

BloomFilter:
    bits: array of m bits (initially all zero)
    k: number of hash functions

To insert a key:

insert(filter, key):
    for i from 1 to k:
        h = hash_i(key) mod m
        bits[h] = 1

To query membership:

contains(filter, key):
    for i from 1 to k:
        h = hash_i(key) mod m
        if bits[h] == 0:
            return false

    return true

If any bit is zero, the key is definitely not present. If all bits are one, the key may be present.

Discussion

A Bloom filter compresses a set into a fixed-size bit array. Each key maps to k positions. Insertion sets those bits. Queries check those same positions.

False positives occur when different keys set overlapping bits. A query may find all required bits set even though the key was never inserted. False negatives do not occur because bits are only set, never cleared. If a key was inserted, all its bits remain set.

The error rate depends on three parameters: number of bits m, number of inserted keys n, and number of hash functions k. With too few bits or too many keys, the bit array becomes saturated and almost all queries return true.

The main advantage is space efficiency. A Bloom filter can represent millions of keys using a few megabytes, far less than a full hash set. This makes it useful for prefilters, caches, networking systems, and large-scale data processing.

Correctness

Bloom filters provide one-sided correctness.

If contains(filter, key) returns false, then at least one required bit is zero. Since insertion sets all required bits, the key cannot have been inserted. Therefore, the answer is definitely correct.

If contains(filter, key) returns true, all required bits are one. This can happen either because the key was inserted or because other keys set the same bits. Therefore, the answer is uncertain.

Complexity

Insertion takes O(k) time.

Query takes O(k) time.

Both operations are independent of the number of stored keys.

Space usage is O(m) bits, fixed in advance.

False Positive Rate

After inserting n keys, each bit is zero with probability approximately:

(1 - 1/m)^(k n)

The probability that a query sees all k bits set is approximately:

(1 - (1 - 1/m)^(k n))^k

This is the false positive rate.

For a fixed m and n, the optimal number of hash functions is approximately:

k ≈ (m / n) * ln(2)

Example

Suppose m = 10 bits and k = 2.

Insert keys:

insert("cat") sets bits {2, 7}
insert("dog") sets bits {3, 7}

Bit array:

index: 0 1 2 3 4 5 6 7 8 9
bits:  0 0 1 1 0 0 0 1 0 0

Query "cat":

bits[2] = 1, bits[7] = 1 → maybe present

Query "owl":

bits[2] = 1, bits[3] = 1 → maybe present (false positive)

Query "ant":

bits[4] = 0 → definitely not present

Implementation Notes

Use independent hash functions or derive multiple hashes from a single base hash.

Choose m based on expected number of keys and acceptable false positive rate.

Use a bit array, not a byte array, to save memory.

Avoid clearing individual keys. Standard Bloom filters do not support deletion. Use counting Bloom filters if deletion is required.

For large systems, store the bit array in a contiguous block to benefit from cache locality.

Common Mistakes

Expecting exact membership results.

Using too few bits for the expected number of keys.

Using correlated hash functions, which increases collisions.

Attempting to delete keys without a counting mechanism.

Using Bloom filters when false positives are unacceptable.