Resolve hash collisions using separate chaining or open addressing, with trade-offs in memory, locality, and load tolerance.
Problem
Different keys can produce the same bucket index. A hash table must store and retrieve all such keys without losing correctness or degrading excessively in performance.
Solution
Use a collision resolution strategy. The two standard approaches are separate chaining and open addressing.
Separate chaining
Each bucket stores a small collection of entries.
bucket[i] = list of (key, value)

Lookup scans only the selected bucket.
lookup(table, key):
    i = hash(key) mod capacity
    for entry in bucket[i]:
        if entry.key == key:
            return entry.value
    return not_found

Insertion appends to the bucket if the key is new.
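As a concrete sketch, the bucket structure and lookup above can be written in Python. The class name `ChainedTable` and the use of Python's built-in `hash()` are illustrative assumptions, not part of the pattern itself:

```python
# Minimal separate-chaining table: each bucket is a list of (key, value) pairs.

class ChainedTable:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        return hash(key) % self.capacity

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: append to its bucket

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None                      # stands in for "not_found"
```

Deleting an entry only touches its own bucket, which is why chaining keeps deletion simple.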
Open addressing
All entries live in the table array. When a collision occurs, probe alternative slots.
index = hash(key) mod capacity

If the slot is occupied, move to the next candidate according to a probe rule.
Linear probing:
probe(i, k) = (i + k) mod capacity

Lookup continues probing until it finds the key or an empty slot.
lookup(table, key):
    i = hash(key) mod capacity
    k = 0
    while slot(probe(i, k)) is not empty:
        s = slot(probe(i, k))
        if s is not a tombstone and s.key == key:
            return s.value
        k = k + 1
    return not_found

Deletion in open addressing uses a special marker (tombstone) instead of clearing the slot.
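The probing logic, including tombstone handling on deletion, can be sketched in Python. The sentinel objects `EMPTY` and `TOMBSTONE` and the class name `ProbedTable` are assumptions made for illustration:

```python
# Linear-probing table with tombstone deletion.

EMPTY = object()       # slot never used
TOMBSTONE = object()   # slot previously occupied, then deleted

class ProbedTable:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        # Yields slot indices in linear-probe order: i, i+1, i+2, ... mod capacity.
        i = hash(key) % self.capacity
        for k in range(self.capacity):
            yield (i + k) % self.capacity

    def insert(self, key, value):
        for idx in self._probe(key):
            s = self.slots[idx]
            if s is EMPTY or s is TOMBSTONE or s[0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table full")

    def lookup(self, key):
        for idx in self._probe(key):
            s = self.slots[idx]
            if s is EMPTY:               # truly empty: key cannot be further along
                return None
            if s is not TOMBSTONE and s[0] == key:
                return s[1]
        return None

    def delete(self, key):
        for idx in self._probe(key):
            s = self.slots[idx]
            if s is EMPTY:
                return
            if s is not TOMBSTONE and s[0] == key:
                self.slots[idx] = TOMBSTONE   # mark, do not clear
                return
```

Note how `lookup` stops only at `EMPTY` and probes past `TOMBSTONE`, preserving the probe-chain invariant discussed under Correctness.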
Discussion
Collision handling determines the real behavior of a hash table. The choice affects memory layout, cache locality, deletion semantics, and worst case degradation.
Separate chaining keeps collisions local to each bucket. The cost of a lookup equals the cost of hashing plus scanning one bucket. Buckets can grow independently, so performance depends on the distribution of keys. This approach tolerates higher load factors because buckets expand dynamically. It also simplifies deletion, since removing an entry from a bucket does not affect others.
Open addressing keeps all entries in a contiguous array. This improves cache locality because probing touches nearby memory. For workloads dominated by lookups, this can outperform chaining even with slightly higher collision rates. However, performance becomes sensitive to the load factor. As the table fills, probe sequences grow longer, and clusters form. Deletion becomes more complex because removing an entry must not break probe chains, which is why tombstones are used.
Clustering is a key phenomenon in open addressing. Linear probing tends to create primary clusters, where consecutive occupied slots attract more insertions. Alternative probe schemes such as quadratic probing or double hashing reduce clustering but introduce their own trade-offs in implementation complexity and memory access patterns.
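The three probe rules can be written side by side. Here `h1` and `h2` stand in for two independent hash values; these names and the function signatures are assumptions of this sketch:

```python
# Slot index produced on the k-th probe attempt, for each probe rule.

def linear_probe(h1, k, capacity):
    return (h1 + k) % capacity            # step 1: primary clustering

def quadratic_probe(h1, k, capacity):
    return (h1 + k * k) % capacity        # growing step: reduces primary clustering

def double_hash_probe(h1, h2, k, capacity):
    # h2 must be nonzero and coprime with capacity so the sequence reaches
    # every slot; otherwise parts of the table are unreachable.
    return (h1 + k * h2) % capacity
```

Double hashing gives each key its own step size, which breaks up the clusters that linear probing creates, at the cost of a second hash computation and less sequential memory access.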
Correctness
For separate chaining, correctness follows from the bucket invariant. Every key is stored in exactly one bucket determined by the hash function and table capacity. Lookup inspects that bucket and compares keys directly, so it finds the entry if it exists.
For open addressing, correctness depends on preserving probe sequences. When inserting a key, the algorithm follows a probe sequence until it finds an empty slot. Lookup must follow the same sequence and must not stop prematurely. This is why an empty slot signals that the key is absent, while a tombstone signals that probing must continue.
Deletion must preserve this invariant. If deletion simply clears a slot, it may break a probe chain. A later lookup could stop at that empty slot and incorrectly conclude that a key is absent. Tombstones avoid this by marking the slot as previously occupied, allowing lookup to continue probing.
Complexity
Let the load factor be α = size / capacity.
For separate chaining, the expected bucket size is proportional to α. Lookup, insertion, and deletion take expected time O(1 + α). If α is kept bounded by a constant, operations are expected constant time.
For open addressing, expected probe length depends more sharply on α. With linear probing, average probe length grows as the table approaches capacity. When α remains below a threshold such as 0.7, operations are still close to constant time. As α approaches 1, performance degrades rapidly.
Worst case behavior for both strategies is O(n). This occurs when all keys collide or when probe sequences span most of the table.
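The load-factor sensitivity can be made concrete with the classic estimates from Knuth's analysis under the uniform hashing assumption; the function names are illustrative:

```python
# Expected cost of an unsuccessful search as a function of load factor alpha,
# under the uniform hashing assumption (Knuth's classic estimates).

def expected_chain_scan(alpha):
    # Chaining: one hash plus scanning an average-length bucket.
    return 1 + alpha

def expected_linear_probes(alpha):
    # Linear probing, unsuccessful search: (1/2) * (1 + 1/(1 - alpha)^2).
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)
```

At alpha = 0.5 linear probing expects about 2.5 probes; at alpha = 0.9 the estimate jumps past 50, while chaining's expected scan is still under 2. This is the quantitative reason open addressing tables resize well before they fill.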
Implementation Notes
Separate chaining benefits from compact bucket representations. Small arrays or vectors often outperform linked lists due to better locality.
Open addressing requires careful handling of tombstones. Too many tombstones increase probe length. Periodic rehashing removes tombstones and restores performance.
Choose capacity growth carefully. Doubling the table size is common because it keeps amortized cost low and preserves distribution properties for many hash functions.
For open addressing, ensure the probe sequence can reach all slots. Certain combinations of capacity and probe rule can leave parts of the table unreachable.
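A rehash that drops tombstones while moving entries into a fresh array might look like the following sketch; the sentinel values and the use of linear probing in the new table are assumptions:

```python
# Rebuild the slot array at a new capacity, discarding tombstones.

EMPTY = None          # an empty slot (illustrative sentinel)
TOMBSTONE = "DEL"     # a deleted-slot marker (illustrative sentinel)

def rehash(slots, new_capacity):
    new_slots = [EMPTY] * new_capacity
    for entry in slots:
        if entry is EMPTY or entry == TOMBSTONE:
            continue                          # tombstones are not carried over
        key, value = entry
        i = hash(key) % new_capacity
        while new_slots[i] is not EMPTY:      # linear probing in the new table
            i = (i + 1) % new_capacity
        new_slots[i] = (key, value)
    return new_slots
```

Because every live entry is reinserted from scratch, probe chains are rebuilt without tombstones and probe lengths shrink back toward the fresh-table average.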
Example
Suppose capacity is 8 and three keys map to the same initial index 3.
Separate chaining:
bucket 3: (k1, v1), (k2, v2), (k3, v3)

Open addressing with linear probing:
index 3: (k1, v1)
index 4: (k2, v2)
index 5: (k3, v3)

A lookup for k3 starts at index 3, then probes 4, then 5, where it finds the key.
If k2 is deleted:
index 4: tombstone

Lookup for k3 must continue past index 4 to reach index 5.
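This deletion scenario can be replayed in code. The fixed slot layout below simulates three keys that all hashed to index 3, with k2 already deleted:

```python
# Replaying the example: slots 3..5 hold the probe chain, slot 4 is a tombstone.

TOMBSTONE = "DEL"
slots = [None] * 8
slots[3] = ("k1", "v1")
slots[4] = TOMBSTONE            # k2 was deleted here
slots[5] = ("k3", "v3")

def lookup_from(slots, start, key):
    i = start
    while slots[i] is not None:          # stop only at a truly empty slot
        if slots[i] != TOMBSTONE and slots[i][0] == key:
            return slots[i][1]
        i = (i + 1) % len(slots)
    return None

# lookup_from(slots, 3, "k3") probes 3, then 4 (tombstone, continue), then 5.
```

If slot 4 had been cleared to `None` instead of marked, the same lookup would stop there and wrongly report k3 as absent.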
Common Mistakes
Stopping probing at a tombstone in open addressing. This breaks lookup.
Using a probe rule that does not cover the entire table.
Allowing the load factor to grow too high in open addressing.
Using linked lists for chaining in performance-critical paths without measuring cache effects.
Ignoring clustering effects when evaluating performance.
Forgetting to rehash after many deletions when tombstones accumulate.