Bloom Filter Membership

A Bloom filter is a compact probabilistic structure for membership testing. It can answer that a key is definitely absent or possibly present. It may report false positives, but it does not report false negatives when implemented correctly.

You use it when memory is limited and exact storage of all keys is too expensive.

Problem

Given a set $S$ and a query key $k$, determine whether $k$ may belong to $S$.

The result is one of:

definitely not present
possibly present

Model

A Bloom filter consists of:

bit array $B$ of length $m$
hash functions $h_1, h_2, \dots, h_r$

Each hash function maps a key to a bit position:

$$ h_j(k) \in {0, 1, \dots, m - 1} $$

To insert a key, set all corresponding bits to 1.

To query a key, check all corresponding bits.

Algorithm

bloom_filter_lookup(B, k):
    for each hash function h_j:
        i = h_j(k)
        if B[i] == 0:
            return DEFINITELY_NOT_PRESENT

    return POSSIBLY_PRESENT

Example

Let the bit array have length $m = 10$, and suppose a key uses three hash positions:

$$ h_1(k) = 2,\quad h_2(k) = 5,\quad h_3(k) = 8 $$

If the bit array contains:

index	0	1	2	3	4	5	6	7	8	9
bit	0	1	1	0	0	1	0	1	1	0

then all bits 2, 5, and 8 are set, so the answer is possibly present.

If any one of these bits were 0, the key would be definitely absent.

Correctness

When a key is inserted, every bit selected by its hash functions is set to 1. Therefore, a later lookup for the same key will find all those bits set.

If lookup finds a 0 bit, the key could not have been inserted, because insertion would have set that bit. Therefore the key is definitely absent.

If all bits are 1, the key may have been inserted, but the bits may also have been set by other keys. Therefore the result is only possibly present.

False Positive Rate

For $n$ inserted keys, $m$ bits, and $r$ hash functions, the approximate false positive probability is:

$$ p \approx \left(1 - e^{-rn/m}\right)^r $$

The optimal number of hash functions is approximately:

$$ r \approx \frac{m}{n}\ln 2 $$

Complexity

operation	time
lookup	$O(r)$

When $r$ is a small fixed constant, lookup is constant time.

Space

$$ O(m) $$

A Bloom filter is often much smaller than storing the full key set.

When to Use

Bloom filters are appropriate when:

memory is limited
false positives are acceptable
false negatives are unacceptable
membership tests happen before expensive lookups
deletions are not required

They are common in databases, caches, web crawlers, storage engines, and distributed systems.

Implementation

class BloomFilter:
    def __init__(self, m=1024, r=3):
        self.m = m
        self.r = r
        self.bits = 0

    def _hashes(self, key):
        for seed in range(self.r):
            yield hash((seed, key)) % self.m

    def add(self, key):
        for i in self._hashes(key):
            self.bits |= 1 << i

    def contains(self, key):
        for i in self._hashes(key):
            if ((self.bits >> i) & 1) == 0:
                return False
        return True

type BloomFilter struct {
	Bits []uint64
	M    uint64
	R    uint64
}

func (b *BloomFilter) hash(key string, seed uint64) uint64 {
	var h uint64 = 1469598103934665603 + seed
	for i := 0; i < len(key); i++ {
		h ^= uint64(key[i])
		h *= 1099511628211
	}
	return h % b.M
}

func (b *BloomFilter) Contains(key string) bool {
	for seed := uint64(0); seed < b.R; seed++ {
		i := b.hash(key, seed)
		word := i / 64
		bit := i % 64

		if (b.Bits[word]&(uint64(1)<<bit)) == 0 {
			return false
		}
	}
	return true
}