# Hash Join Lookup

# Hash Join Lookup

Hash join lookup is the probe phase of a hash join. One relation is first scanned and stored in a hash table by join key. The other relation is then scanned, and each row probes the hash table to find matching rows.

You use it when joining two datasets and the join key can be hashed efficiently.

## Problem

Given two relations $R$ and $S$, find all pairs of rows whose join keys are equal:

$$
r.key = s.key
$$

A hash join lookup assumes that one relation, usually the smaller one, has already been indexed by key.

## Model

Build a hash table $H$ from relation $R$:

$$
H[r.key] \gets H[r.key] \cup {r}
$$

Then, for each row $s \in S$, lookup:

$$
H[s.key]
$$

Every row in that bucket is a candidate join match.

## Algorithm

```text id="fz0jqa"
hash_join_lookup(H, s):
    bucket = H[s.key]

    for each row r in bucket:
        if r.key == s.key:
            emit join_pair(r, s)
```

The explicit equality check is still needed because different keys may collide into the same hash bucket.

## Example

Build side $R$:

| id | name  |
| -: | ----- |
|  1 | Ada   |
|  2 | Linus |
|  3 | Grace |

Probe side $S$:

| id | order |
| -: | ----- |
|  2 | Book  |
|  3 | Pen   |
|  4 | Bag   |

Hash table from $R$:

| key | rows  |
| --: | ----- |
|   1 | Ada   |
|   2 | Linus |
|   3 | Grace |

Probe each row from $S$:

| probe key | result                |
| --------: | --------------------- |
|         2 | Linus joins with Book |
|         3 | Grace joins with Pen  |
|         4 | no match              |

The output contains the matching pairs for keys 2 and 3.

## Correctness

The build phase stores every row $r \in R$ in the bucket for its join key. During lookup, a probe row $s$ checks the bucket for $s.key$.

If a row $r$ satisfies $r.key = s.key$, then it is stored in exactly the bucket inspected by $s$. Therefore the lookup will find it. If a row in the bucket has a different key due to a hash collision, the equality check rejects it.

## Complexity

Let $|R|$ be the build size and $|S|$ be the probe size.

| phase | expected time |   |       |   |       |
| ----- | ------------: | - | ----- | - | ----- |
| build |           $O( | R | )$    |   |       |
| probe |           $O( | S | + z)$ |   |       |
| total |           $O( | R | +     | S | + z)$ |

Here $z$ is the number of emitted join pairs.

In the worst case, poor hashing can put many rows in the same bucket, causing quadratic behavior.

## Space

$$
O(|R|)
$$

The hash table stores the build relation, usually the smaller input.

## When to Use

Hash join lookup is appropriate when:

* the join condition is equality
* the build side fits in memory
* input data is unsorted
* many key lookups are required
* output order is not important

For sorted inputs, sort merge join may be preferable. For indexed tables, index nested loop join may be better.

## Implementation

```python id="zyq43v"
from collections import defaultdict

def build_hash_table(rows, key):
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def hash_join(left, right, key):
    table = build_hash_table(left, key)
    result = []

    for r in right:
        for l in table.get(r[key], []):
            if l[key] == r[key]:
                result.append((l, r))

    return result
```

```go id="qe0xs1"
type Row map[string]string

func BuildHashTable(rows []Row, key string) map[string][]Row {
	table := make(map[string][]Row)

	for _, row := range rows {
		k := row[key]
		table[k] = append(table[k], row)
	}

	return table
}

func HashJoin(left []Row, right []Row, key string) [][2]Row {
	table := BuildHashTable(left, key)
	out := make([][2]Row, 0)

	for _, r := range right {
		k := r[key]
		for _, l := range table[k] {
			if l[key] == r[key] {
				out = append(out, [2]Row{l, r})
			}
		}
	}

	return out
}
```

