Skip to content

Hash Join Lookup

Find matching records during a join by probing a hash table built from one relation.

Hash join lookup is the probe phase of a hash join. One relation is first scanned and stored in a hash table by join key. The other relation is then scanned, and each row probes the hash table to find matching rows.

You use it when joining two datasets and the join key can be hashed efficiently.

Problem

Given two relations RR and SS, find all pairs of rows whose join keys are equal:

r.key=s.key r.key = s.key

A hash join lookup assumes that one relation, usually the smaller one, has already been indexed by key.

Model

Build a hash table HH from relation RR:

H[r.key]H[r.key]r H[r.key] \gets H[r.key] \cup {r}

Then, for each row sSs \in S, lookup:

H[s.key] H[s.key]

Every row in that bucket is a candidate join match.

Algorithm

hash_join_lookup(H, s):
    bucket = H[s.key]

    for each row r in bucket:
        if r.key == s.key:
            emit join_pair(r, s)

The explicit equality check is still needed because different keys may collide into the same hash bucket.

Example

Build side RR:

idname
1Ada
2Linus
3Grace

Probe side SS:

idorder
2Book
3Pen
4Bag

Hash table from RR:

keyrows
1Ada
2Linus
3Grace

Probe each row from SS:

probe keyresult
2Linus joins with Book
3Grace joins with Pen
4no match

The output contains the matching pairs for keys 2 and 3.

Correctness

The build phase stores every row rRr \in R in the bucket for its join key. During lookup, a probe row ss checks the bucket for s.keys.key.

If a row rr satisfies r.key=s.keyr.key = s.key, then it is stored in exactly the bucket inspected by ss. Therefore the lookup will find it. If a row in the bucket has a different key due to a hash collision, the equality check rejects it.

Complexity

Let R|R| be the build size and S|S| be the probe size.

phaseexpected time
build$O(R)$
probe$O(S+ z)$
total$O(R+S+ z)$

Here zz is the number of emitted join pairs.

In the worst case, poor hashing can put many rows in the same bucket, causing quadratic behavior.

Space

O(R) O(|R|)

The hash table stores the build relation, usually the smaller input.

When to Use

Hash join lookup is appropriate when:

  • the join condition is equality
  • the build side fits in memory
  • input data is unsorted
  • many key lookups are required
  • output order is not important

For sorted inputs, sort merge join may be preferable. For indexed tables, index nested loop join may be better.

Implementation

from collections import defaultdict

def build_hash_table(rows, key):
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def hash_join(left, right, key):
    table = build_hash_table(left, key)
    result = []

    for r in right:
        for l in table.get(r[key], []):
            if l[key] == r[key]:
                result.append((l, r))

    return result
type Row map[string]string

func BuildHashTable(rows []Row, key string) map[string][]Row {
	table := make(map[string][]Row)

	for _, row := range rows {
		k := row[key]
		table[k] = append(table[k], row)
	}

	return table
}

func HashJoin(left []Row, right []Row, key string) [][2]Row {
	table := BuildHashTable(left, key)
	out := make([][2]Row, 0)

	for _, r := range right {
		k := r[key]
		for _, l := range table[k] {
			if l[key] == r[key] {
				out = append(out, [2]Row{l, r})
			}
		}
	}

	return out
}