# 5.3 Collision Handling

## Problem

Different keys can produce the same bucket index. A hash table must store and retrieve all such keys without losing correctness or degrading excessively in performance.

## Solution

Use a collision resolution strategy. The two standard approaches are separate chaining and open addressing.

### Separate chaining

Each bucket stores a small collection of entries.

```text
bucket[i] = list of (key, value)
```

Lookup scans only the selected bucket.

```text
lookup(table, key):
    i = hash(key) mod capacity
    for entry in bucket[i]:
        if entry.key == key:
            return entry.value
    return not_found
```

Insertion appends to the bucket if the key is new.
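The bucket invariant above can be sketched as a small Python class (an illustrative sketch; the class and method names are invented, and `None` stands in for `not_found`):

```python
class ChainedTable:
    """Minimal separate-chaining hash table: each bucket is a list of pairs."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        return hash(key) % self.capacity

    def lookup(self, key):
        # Scan only the selected bucket.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None  # not_found

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update an existing key in place
                return
        bucket.append((key, value))       # append if the key is new
```

Note that deletion is equally local: removing a pair from one bucket cannot affect lookups in any other bucket.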

### Open addressing

All entries live in the table array. When a collision occurs, probe alternative slots.

```text
index = hash(key) mod capacity
```

If the slot is occupied, move to the next candidate according to a probe rule.

Linear probing:

```text
probe(i, k) = (i + k) mod capacity
```

Lookup continues probing until it finds the key or an empty slot.

```text
lookup(table, key):
    i = hash(key) mod capacity
    k = 0

    while k < capacity:
        s = slot(probe(i, k))
        if s is empty:
            return not_found
        if s.key == key:
            return s.value
        k = k + 1

    return not_found
```

Bounding the loop by `capacity` guarantees termination even when the table is full and the key is absent.

Deletion in open addressing uses a special marker (tombstone) instead of clearing the slot.
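The full scheme, including tombstone deletion, can be sketched in Python (an illustrative sketch; the class name and sentinel objects are invented):

```python
EMPTY = object()      # never occupied: probing may stop here
TOMBSTONE = object()  # previously occupied: probing must continue past it

class ProbingTable:
    """Minimal open-addressing table with linear probing and tombstones."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        # Yield the linear probe sequence for this key, bounded by capacity.
        i = hash(key) % self.capacity
        for k in range(self.capacity):
            yield (i + k) % self.capacity

    def insert(self, key, value):
        target = None  # first reusable slot (tombstone or empty) seen
        for j in self._probe(key):
            s = self.slots[j]
            if s is EMPTY:
                if target is None:
                    target = j
                break
            if s is TOMBSTONE:
                if target is None:
                    target = j
            elif s[0] == key:
                self.slots[j] = (key, value)  # update in place
                return
        if target is None:
            raise RuntimeError("table full")
        self.slots[target] = (key, value)

    def lookup(self, key):
        for j in self._probe(key):
            s = self.slots[j]
            if s is EMPTY:
                return None  # empty slot: key is absent
            if s is not TOMBSTONE and s[0] == key:
                return s[1]
        return None

    def delete(self, key):
        for j in self._probe(key):
            s = self.slots[j]
            if s is EMPTY:
                return
            if s is not TOMBSTONE and s[0] == key:
                self.slots[j] = TOMBSTONE  # keep the probe chain intact
                return
```

Note that `insert` reuses the first tombstone it passes, which keeps tombstone counts from growing without bound under mixed insert/delete workloads.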

## Discussion

Collision handling determines the real behavior of a hash table. The choice affects memory layout, cache locality, deletion semantics, and worst case degradation.

Separate chaining keeps collisions local to each bucket. The cost of a lookup equals the cost of hashing plus scanning one bucket. Buckets can grow independently, so performance depends on the distribution of keys. This approach tolerates higher load factors because buckets expand dynamically. It also simplifies deletion, since removing an entry from a bucket does not affect others.

Open addressing keeps all entries in a contiguous array. This improves cache locality because probing touches nearby memory. For workloads dominated by lookups, this can outperform chaining even with slightly higher collision rates. However, performance becomes sensitive to the load factor. As the table fills, probe sequences grow longer, and clusters form. Deletion becomes more complex because removing an entry must not break probe chains, which is why tombstones are used.

Clustering is a key phenomenon in open addressing. Linear probing tends to create primary clusters, where consecutive occupied slots attract more insertions. Alternative probe schemes such as quadratic probing or double hashing reduce clustering but introduce their own trade-offs in implementation complexity and memory access patterns.
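The three probe rules can be written side by side (a sketch; the function names are invented, `home` is the initial index and `k` the probe number):

```python
def linear(home, k, capacity):
    # Step of 1 each probe: fast and cache-friendly, but consecutive
    # occupied runs attract further insertions (primary clustering).
    return (home + k) % capacity

def quadratic(home, k, capacity):
    # Quadratically growing offsets break up primary clusters, though keys
    # with the same home index still share a sequence (secondary clustering).
    return (home + k * k) % capacity

def double_hash(home, step, k, capacity):
    # Per-key step derived from a second hash function; `step` must be
    # nonzero and share no factor with `capacity` to reach every slot.
    return (home + k * step) % capacity
```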

## Correctness

For separate chaining, correctness follows from the bucket invariant. Every key is stored in exactly one bucket determined by the hash function and table capacity. Lookup inspects that bucket and compares keys directly, so it finds the entry if it exists.

For open addressing, correctness depends on preserving probe sequences. When inserting a key, the algorithm follows a probe sequence until it finds an empty slot. Lookup must follow the same sequence and must not stop prematurely. This is why an empty slot signals that the key is absent, while a tombstone signals that probing must continue.

Deletion must preserve this invariant. If deletion simply clears a slot, it may break a probe chain. A later lookup could stop at that empty slot and incorrectly conclude that a key is absent. Tombstones avoid this by marking the slot as previously occupied, allowing lookup to continue probing.

## Complexity

Let the load factor be `α = size / capacity`.

For separate chaining, the expected bucket size is proportional to `α`. Lookup, insertion, and deletion take expected time `O(1 + α)`. If `α` is kept bounded by a constant, operations are expected constant time.

For open addressing, expected probe length depends more sharply on `α`. With linear probing, average probe length grows as the table approaches capacity. When `α` remains below a threshold such as `0.7`, operations are still close to constant time. As `α` approaches `1`, performance degrades rapidly.

Worst case behavior for both strategies is `O(n)`. This occurs when all keys collide or when probe sequences span most of the table.
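The sensitivity of linear probing to `α` is easy to observe empirically. The sketch below (invented helper name, fixed seed for reproducibility) fills a table with random keys to a target load factor and reports the average number of slots inspected per insertion:

```python
import random

def avg_insert_probes(load_factor, capacity=4096, seed=0):
    """Average slots inspected per insertion under linear probing."""
    rng = random.Random(seed)
    slots = [None] * capacity
    n = int(load_factor * capacity)  # number of keys to insert (n < capacity)
    total = 0
    for _ in range(n):
        key = rng.getrandbits(64)
        i = key % capacity
        k = 0
        while slots[(i + k) % capacity] is not None:
            k += 1
        slots[(i + k) % capacity] = key
        total += k + 1  # count every slot inspected, including the final one
    return total / n
```

On a typical run the average at `α = 0.9` is several times the average at `α = 0.5`, consistent with the rapid degradation described above.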

## Implementation Notes

Separate chaining benefits from compact bucket representations. Small arrays or vectors often outperform linked lists due to better locality.

Open addressing requires careful handling of tombstones. Too many tombstones increase probe length. Periodic rehashing removes tombstones and restores performance.

Choose capacity growth carefully. Doubling the table size is common because it keeps amortized cost low and preserves distribution properties for many hash functions.

For open addressing, ensure the probe sequence can reach all slots. Certain combinations of capacity and probe rule can leave parts of the table unreachable.
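Coverage is easy to check directly. The sketch below (invented helper names) enumerates every slot a probe rule can reach from one home index; quadratic probing with capacity `8`, for instance, reaches only three of the eight slots, which is one reason quadratic probing is usually paired with a prime capacity or a variant designed for power-of-two tables:

```python
def linear_probe(home, k, capacity):
    return (home + k) % capacity

def quadratic_probe(home, k, capacity):
    return (home + k * k) % capacity

def probed_slots(capacity, probe, home=0):
    # All distinct slots the rule can reach from `home` in `capacity` probes.
    return {probe(home, k, capacity) for k in range(capacity)}
```

Here `probed_slots(8, linear_probe)` is the full set `{0, ..., 7}`, while `probed_slots(8, quadratic_probe)` is only `{0, 1, 4}`.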

## Example

Suppose capacity is `8` and three keys map to the same initial index `3`.

Separate chaining:

```text
bucket 3: (k1, v1), (k2, v2), (k3, v3)
```

Open addressing with linear probing:

```text
index 3: (k1, v1)
index 4: (k2, v2)
index 5: (k3, v3)
```

A lookup for `k3` starts at index `3`, then probes `4`, then `5`, where it finds the key.

If `k2` is deleted:

```text
index 4: tombstone
```

Lookup for `k3` must continue past index `4` to reach index `5`.

## Common Mistakes

Stopping probing at a tombstone in open addressing. This breaks lookup.

Using a probe rule that does not cover the entire table.

Allowing the load factor to grow too high in open addressing.

Using linked lists for chaining in performance-critical paths without measuring cache effects.

Ignoring clustering effects when evaluating performance.

Forgetting to rehash after many deletions when tombstones accumulate.

