Skip to content

Bloom Filter Membership

Test set membership with a compact probabilistic bit array that allows false positives but no false negatives.

A Bloom filter is a compact probabilistic structure for membership testing. It can answer that a key is definitely absent or possibly present. It may report false positives, but it does not report false negatives when implemented correctly.

You use it when memory is limited and exact storage of all keys is too expensive.

Problem

Given a set SS and a query key kk, determine whether kk may belong to SS.

The result is one of:

  • definitely not present
  • possibly present

Model

A Bloom filter consists of:

  • bit array BB of length mm
  • hash functions h1,h2,,hrh_1, h_2, \dots, h_r

Each hash function maps a key to a bit position:

hj(k)0,1,,m1 h_j(k) \in {0, 1, \dots, m - 1}

To insert a key, set all corresponding bits to 1.

To query a key, check all corresponding bits.

Algorithm

bloom_filter_lookup(B, k):
    for each hash function h_j:
        i = h_j(k)
        if B[i] == 0:
            return DEFINITELY_NOT_PRESENT

    return POSSIBLY_PRESENT

Example

Let the bit array have length m=10m = 10, and suppose a key uses three hash positions:

h1(k)=2,h2(k)=5,h3(k)=8 h_1(k) = 2,\quad h_2(k) = 5,\quad h_3(k) = 8

If the bit array contains:

index0123456789
bit0110010110

then all bits 2, 5, and 8 are set, so the answer is possibly present.

If any one of these bits were 0, the key would be definitely absent.

Correctness

When a key is inserted, every bit selected by its hash functions is set to 1. Therefore, a later lookup for the same key will find all those bits set.

If lookup finds a 0 bit, the key could not have been inserted, because insertion would have set that bit. Therefore the key is definitely absent.

If all bits are 1, the key may have been inserted, but the bits may also have been set by other keys. Therefore the result is only possibly present.

False Positive Rate

For nn inserted keys, mm bits, and rr hash functions, the approximate false positive probability is:

p(1ern/m)r p \approx \left(1 - e^{-rn/m}\right)^r

The optimal number of hash functions is approximately:

rmnln2 r \approx \frac{m}{n}\ln 2

Complexity

operationtime
lookupO(r)O(r)

When rr is a small fixed constant, lookup is constant time.

Space

O(m) O(m)

A Bloom filter is often much smaller than storing the full key set.

When to Use

Bloom filters are appropriate when:

  • memory is limited
  • false positives are acceptable
  • false negatives are unacceptable
  • membership tests happen before expensive lookups
  • deletions are not required

They are common in databases, caches, web crawlers, storage engines, and distributed systems.

Implementation

class BloomFilter:
    def __init__(self, m=1024, r=3):
        self.m = m
        self.r = r
        self.bits = 0

    def _hashes(self, key):
        for seed in range(self.r):
            yield hash((seed, key)) % self.m

    def add(self, key):
        for i in self._hashes(key):
            self.bits |= 1 << i

    def contains(self, key):
        for i in self._hashes(key):
            if ((self.bits >> i) & 1) == 0:
                return False
        return True
type BloomFilter struct {
	Bits []uint64
	M    uint64
	R    uint64
}

func (b *BloomFilter) hash(key string, seed uint64) uint64 {
	var h uint64 = 1469598103934665603 + seed
	for i := 0; i < len(key); i++ {
		h ^= uint64(key[i])
		h *= 1099511628211
	}
	return h % b.M
}

func (b *BloomFilter) Contains(key string) bool {
	for seed := uint64(0); seed < b.R; seed++ {
		i := b.hash(key, seed)
		word := i / 64
		bit := i % 64

		if (b.Bits[word]&(uint64(1)<<bit)) == 0 {
			return false
		}
	}
	return true
}