# 5.15 Hash Joins

# 5.15 Hash Joins

## Problem

You need to join two collections by matching keys.

A nested loop join compares every item from the first collection with every item from the second collection. This is simple, but it costs `O(nm)` for collections of sizes `n` and `m`. A hash join reduces this cost by indexing one side in a hash map.

## Solution

Build a hash map from the smaller collection, then scan the larger collection and probe the map.

```text id="rrtgzx"
hash_join(left, right, key_left, key_right):
    table = empty map from key to list of right items

    for r in right:
        k = key_right(r)
        append r to table[k]

    output = empty list

    for l in left:
        k = key_left(l)
        matches = table[k]

        for r in matches:
            append (l, r) to output

    return output
```

If the join key is unique on the indexed side, the map can store a single item instead of a list.

```text id="rpd57a"
table: Map<Key, Item>
```

## Discussion

A hash join has two phases. The build phase creates a map from join key to matching rows. The probe phase scans the other collection and performs expected constant time lookups.

The indexed side should usually be the smaller side because the map must fit in memory. If the smaller side does not fit, the join can be partitioned by hash into smaller chunks and processed one partition at a time.

Duplicate keys require attention. If several rows share the same key, the map value must be a list. During probing, each matching pair must be emitted. This can make the output much larger than either input.

Hash joins are widely used in databases, data processing systems, and in-memory algorithms. The same pattern appears in ordinary code whenever one list must be matched against another by id.

## Correctness

For each item `r` in the indexed collection, the build phase stores `r` under key `key_right(r)`.

For each item `l` in the probed collection, the probe phase computes `key_left(l)` and retrieves exactly the indexed items with the same key. It then emits all pairs `(l, r)` whose keys match.

Every emitted pair is correct because both items came from the same map key. Every valid pair is emitted because the indexed item was inserted under that key and the probed item later searches that same key.

## Complexity

Let `n` and `m` be the input sizes.

The build phase takes expected `O(m)` time if the right side is indexed.

The probe phase takes expected `O(n + z)` time, where `z` is the number of emitted pairs.

Total expected time is `O(n + m + z)`.

Space usage is `O(m)` for the hash map, plus output space.

Worst case time can degrade if many keys collide or if one join key has many matching rows. The output size itself may be `O(nm)` when all rows share the same key.

## Example

Join users with orders.

```text id="39ejgk"
users = [
    {id: 1, name: "Ada"},
    {id: 2, name: "Grace"}
]

orders = [
    {user_id: 1, item: "book"},
    {user_id: 1, item: "pen"},
    {user_id: 2, item: "notebook"}
]
```

Build a map from `user_id` to orders:

```text id="2mkbnc"
1 -> [{user_id: 1, item: "book"}, {user_id: 1, item: "pen"}]
2 -> [{user_id: 2, item: "notebook"}]
```

Probe users by `id`:

```text id="lr8s90"
("Ada", "book")
("Ada", "pen")
("Grace", "notebook")
```

## Implementation Notes

Index the smaller side when memory is limited.

Store lists for duplicate keys.

Preallocate the hash map if the input size is known.

Use stable key extraction functions. Both sides must derive keys in the same logical format.

For large joins, stream output instead of storing all pairs.

## Common Mistakes

Storing only one item per key when duplicate keys are possible.

Forgetting that output size can dominate runtime.

Building the map on the larger side when the smaller side would fit better.

Using inconsistent key normalization across the two inputs.

Assuming hash join preserves input order unless explicitly designed to do so.