Find matching records during a join by probing a hash table built from one relation.
Hash join lookup is the probe phase of a hash join. One relation is first scanned and stored in a hash table by join key. The other relation is then scanned, and each row probes the hash table to find matching rows.
You use it when joining two datasets and the join key can be hashed efficiently.
Problem
Given two relations and , find all pairs of rows whose join keys are equal:
A hash join lookup assumes that one relation, usually the smaller one, has already been indexed by key.
Model
Build a hash table from relation :
Then, for each row , lookup:
Every row in that bucket is a candidate join match.
Algorithm
hash_join_lookup(H, s):
bucket = H[s.key]
for each row r in bucket:
if r.key == s.key:
emit join_pair(r, s)The explicit equality check is still needed because different keys may collide into the same hash bucket.
Example
Build side :
| id | name |
|---|---|
| 1 | Ada |
| 2 | Linus |
| 3 | Grace |
Probe side :
| id | order |
|---|---|
| 2 | Book |
| 3 | Pen |
| 4 | Bag |
Hash table from :
| key | rows |
|---|---|
| 1 | Ada |
| 2 | Linus |
| 3 | Grace |
Probe each row from :
| probe key | result |
|---|---|
| 2 | Linus joins with Book |
| 3 | Grace joins with Pen |
| 4 | no match |
The output contains the matching pairs for keys 2 and 3.
Correctness
The build phase stores every row in the bucket for its join key. During lookup, a probe row checks the bucket for .
If a row satisfies , then it is stored in exactly the bucket inspected by . Therefore the lookup will find it. If a row in the bucket has a different key due to a hash collision, the equality check rejects it.
Complexity
Let be the build size and be the probe size.
| phase | expected time | ||||
|---|---|---|---|---|---|
| build | $O( | R | )$ | ||
| probe | $O( | S | + z)$ | ||
| total | $O( | R | + | S | + z)$ |
Here is the number of emitted join pairs.
In the worst case, poor hashing can put many rows in the same bucket, causing quadratic behavior.
Space
The hash table stores the build relation, usually the smaller input.
When to Use
Hash join lookup is appropriate when:
- the join condition is equality
- the build side fits in memory
- input data is unsorted
- many key lookups are required
- output order is not important
For sorted inputs, sort merge join may be preferable. For indexed tables, index nested loop join may be better.
Implementation
from collections import defaultdict
def build_hash_table(rows, key):
table = defaultdict(list)
for row in rows:
table[row[key]].append(row)
return table
def hash_join(left, right, key):
table = build_hash_table(left, key)
result = []
for r in right:
for l in table.get(r[key], []):
if l[key] == r[key]:
result.append((l, r))
return resulttype Row map[string]string
func BuildHashTable(rows []Row, key string) map[string][]Row {
table := make(map[string][]Row)
for _, row := range rows {
k := row[key]
table[k] = append(table[k], row)
}
return table
}
func HashJoin(left []Row, right []Row, key string) [][2]Row {
table := BuildHashTable(left, key)
out := make([][2]Row, 0)
for _, r := range right {
k := r[key]
for _, l := range table[k] {
if l[key] == r[key] {
out = append(out, [2]Row{l, r})
}
}
}
return out
}