# 5.1 Hash Tables

## Problem

You need a data structure that supports fast insertion, lookup, and deletion by key.

Arrays give fast access when the key is already a small integer index. A hash table generalizes that idea. It accepts keys such as strings, records, tuples, or large integers, converts each key into a table position, and stores the associated value there.

## Solution

Use a hash function to convert each key into an integer hash code, then reduce that hash code to a valid bucket index.

```text
index = hash(key) mod capacity
```
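
As a concrete sketch in Python: the helper name `bucket_index` is ours, not a standard API. Note that CPython randomizes `hash()` for strings, so a string key maps to a stable bucket only within one process.

```python
def bucket_index(key, capacity):
    # Python's % always returns a non-negative result for a positive
    # divisor, so the index is valid even when hash(key) is negative.
    return hash(key) % capacity

print(bucket_index("cat", 8))  # some value in 0..7
```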

The table stores entries in buckets. Each entry contains at least a key and usually also a value.

```text
Entry:
    key
    value

HashTable:
    buckets: array of bucket
    size: number of stored entries
    capacity: number of buckets
```
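
A minimal Python realization of this layout, used by the sketches that follow, is a list of lists: each inner list is one bucket holding `(key, value)` tuples. The name `make_table` is our own.

```python
def make_table(capacity=8):
    # One inner list per bucket, each holding (key, value) tuples.
    # Note: [[]] * capacity would alias a single list; use a comprehension.
    return [[] for _ in range(capacity)]
```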

A basic lookup follows the same path used during insertion.

```text
lookup(table, key):
    i = hash(key) mod table.capacity

    for each entry in table.buckets[i]:
        if entry.key == key:
            return entry.value

    return not_found
```
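
A direct Python translation of the pseudocode, assuming the list-of-lists layout from `make_table` above. The sentinel `NOT_FOUND` is our convention so that `None` remains a storable value.

```python
NOT_FOUND = object()  # sentinel distinct from every storable value

def lookup(buckets, key):
    i = hash(key) % len(buckets)
    for k, v in buckets[i]:
        if k == key:      # compare real keys, never just hash codes
            return v
    return NOT_FOUND
```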

Insertion checks whether the key already exists in its bucket. If it does, insertion replaces the stored value. If it does not, insertion appends a new entry and increments the size counter.

```text
insert(table, key, value):
    i = hash(key) mod table.capacity

    for each entry in table.buckets[i]:
        if entry.key == key:
            entry.value = value
            return

    append (key, value) to table.buckets[i]
    table.size = table.size + 1
```
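
The corresponding Python sketch. It returns `True` when a new entry was added so the caller can maintain a size counter; resizing is deferred to the Discussion.

```python
def insert(buckets, key, value):
    i = hash(key) % len(buckets)
    bucket = buckets[i]
    for j, (k, _) in enumerate(bucket):
        if k == key:
            bucket[j] = (key, value)  # key exists: replace the value
            return False
    bucket.append((key, value))       # new key: append an entry
    return True
```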

Deletion removes the matching entry from the bucket.

```text
delete(table, key):
    i = hash(key) mod table.capacity

    for each entry in table.buckets[i]:
        if entry.key == key:
            remove entry
            table.size = table.size - 1
            return true

    return false
```
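
And the matching deletion sketch, followed by a short usage run of the helpers defined above.

```python
def delete(buckets, key):
    i = hash(key) % len(buckets)
    bucket = buckets[i]
    for j, (k, _) in enumerate(bucket):
        if k == key:
            bucket.pop(j)   # remove the matching entry
            return True
    return False

table = make_table()
assert insert(table, "cat", 1)
assert insert(table, "dog", 2)
assert lookup(table, "cat") == 1
assert delete(table, "dog")
assert lookup(table, "dog") is NOT_FOUND
```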

## Discussion

The hash table is built around a simple invariant: every stored key must be placed in the bucket selected by the same indexing rule used for lookup. If insertion uses `hash(key) mod capacity`, then lookup and deletion must use exactly the same expression with the same capacity. When the table is resized, all stored keys must be reinserted or reindexed because their bucket positions may change.

The role of the hash function is to spread keys across the table. A table with many buckets performs well only when entries are distributed reasonably evenly. If many keys land in the same bucket, lookup becomes a scan through that bucket. In the worst case, a hash table can degrade to linear search.

The capacity of the table matters. A small capacity causes many collisions. A large capacity wastes memory. Implementations usually monitor the load factor:

```text
load_factor = size / capacity
```

When the load factor grows past a chosen threshold, the table allocates a larger bucket array and moves all existing entries into the new array. This operation is expensive when it happens, but it keeps future operations fast.
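
One way to express the resize in this running sketch: allocate a fresh bucket array and re-place every entry under the new capacity. Simply copying buckets over would break the placement invariant.

```python
def resize(buckets, new_capacity):
    # Bucket positions depend on capacity, so every entry must be
    # rehashed into the new array, not copied bucket-for-bucket.
    new_buckets = [[] for _ in range(new_capacity)]
    for bucket in buckets:
        for key, value in bucket:
            new_buckets[hash(key) % new_capacity].append((key, value))
    return new_buckets
```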

Separate chaining is the simplest collision strategy. Each bucket stores a small collection of entries that share the same bucket index. This makes the algorithms easy to explain and robust under moderate collision rates. Production implementations may use linked lists, arrays, small vectors, trees, or hybrid layouts for the bucket storage.

A hash table involves two different comparisons. The hash function provides a quick numerical summary of a key, but equal hash codes do not prove that two keys are equal: two different keys may produce the same hash code or the same bucket index. Therefore, lookup must always compare the actual stored key with the requested key. Skipping this equality check produces incorrect results.
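
Small integers make this concrete. In CPython, a small non-negative integer hashes to itself, so the distinct keys `1` and `9` share a bucket in a table of capacity `8`:

```python
assert hash(1) % 8 == hash(9) % 8   # same bucket index...
assert 1 != 9                       # ...yet different keys
```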

## Correctness

The correctness of lookup follows from the placement invariant. Suppose a key `k` was inserted into the table while the table had capacity `c`. During insertion, the table computed `hash(k) mod c` and stored the entry in that bucket. If no resizing has occurred, lookup computes the same index for the same key and examines the same bucket. Since lookup compares keys directly, it returns the value associated with `k`.

After resizing, correctness is preserved only if every existing entry is moved according to the new capacity. Rehashing restores the same placement invariant for the new bucket array. Once that invariant holds again, the same lookup argument applies.

Deletion is correct for the same reason. It examines the only bucket in which the key can be stored. If the key is present, deletion removes the matching entry. If the key is absent from that bucket, the key is absent from the table.

## Complexity

Let `n` be the number of entries and `m` be the number of buckets. The load factor is `n / m`.

With a good hash function and a controlled load factor, insertion, lookup, and deletion take expected constant time. More precisely, under the standard assumption that the hash function spreads keys uniformly, the expected bucket length equals the load factor `n / m`. If the load factor is bounded by a constant, the expected cost of scanning one bucket is also bounded by a constant.

The worst case is different. If all keys collide into one bucket, lookup, insertion, and deletion may take `O(n)` time. Hash tables are therefore best understood as expected fast structures, not as unconditional constant time structures.

Resizing costs `O(n)` because every entry must be moved. Across a long sequence of insertions, this cost is amortized: if the table doubles its capacity when it grows, the average resizing cost per insertion remains constant.
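
The arithmetic behind that claim, for a table that doubles when it fills: successive resizes move `1`, then `2`, then `4`, and so on up to `n/2` entries, and the series telescopes.

```text
total moved entries = 1 + 2 + 4 + ... + n/2 = n - 1 < n
```

Spread over `n` insertions, resizing contributes less than one move per insertion on average.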

## Implementation Notes

Choose an initial capacity large enough for the expected workload when possible. This avoids repeated resizing during early growth.

Keep the load factor below a fixed threshold. For separate chaining, a threshold near `1.0` is common in simple implementations. For open addressing, the threshold is usually lower because long probe sequences appear quickly as the table fills.

Always store the original key, not only the hash code. Hash codes are not unique.

When keys are composite, hash all fields that participate in equality. If two records are considered equal by `(user_id, timestamp)`, then the hash function must use both `user_id` and `timestamp`.
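
In Python this is naturally expressed by hashing a tuple of exactly the fields that define equality. The `Event` record here is hypothetical, chosen to mirror the `(user_id, timestamp)` example.

```python
class Event:
    def __init__(self, user_id, timestamp):
        self.user_id = user_id
        self.timestamp = timestamp

    def __eq__(self, other):
        if not isinstance(other, Event):
            return NotImplemented
        return (self.user_id, self.timestamp) == \
               (other.user_id, other.timestamp)

    def __hash__(self):
        # Hash the same fields that participate in equality.
        return hash((self.user_id, self.timestamp))
```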

Avoid deriving bucket indexes by taking low bits from a weak hash function unless the table capacity and hash function are designed together. Poor mixing can create severe clustering.

## Example

Suppose the table has `8` buckets and the following simplified hash results.

| Key     | Hash | Bucket |
| ------- | ---: | -----: |
| `"cat"` |   19 |      3 |
| `"dog"` |   11 |      3 |
| `"ant"` |   20 |      4 |
| `"owl"` |   29 |      5 |

Both `"cat"` and `"dog"` land in bucket `3` because their hashes have the same remainder modulo `8`.

```text
bucket 0: empty
bucket 1: empty
bucket 2: empty
bucket 3: ("cat", value), ("dog", value)
bucket 4: ("ant", value)
bucket 5: ("owl", value)
bucket 6: empty
bucket 7: empty
```

A lookup for `"dog"` computes bucket `3`, scans that bucket, compares `"dog"` with each stored key, and returns the value attached to `"dog"`.
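
The bucket column can be checked with the simplified hash values from the table; these numbers are illustrative, not the output of any real hash function.

```python
simplified_hash = {"cat": 19, "dog": 11, "ant": 20, "owl": 29}
for key, h in simplified_hash.items():
    print(key, "-> bucket", h % 8)
# cat -> bucket 3, dog -> bucket 3, ant -> bucket 4, owl -> bucket 5
```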

## Common Mistakes

Do not assume that equal hash codes imply equal keys. The table must still compare keys.

Do not forget to rehash after resizing. Copying buckets directly into a larger array breaks lookup because bucket indexes depend on capacity.

Do not let the load factor grow without bound. A table that never resizes eventually behaves like a collection of long lists.

Do not use a hash function that ignores part of the key. If equality uses multiple fields but hashing uses only one, collisions become more likely and performance becomes unstable.

Do not benchmark only random inputs. Random inputs often hide clustering problems that appear in real keys, such as paths, timestamps, user identifiers, and sequential strings.

