The sorting stage used before a sort-merge join when one or both inputs are not already ordered by the join key.
The sort phase of sort-merge join orders each input relation by the join key before the merge phase begins. If an input is already ordered by a suitable index or previous query operator, the database can skip sorting that input.
This phase is important because the merge join itself requires ordered streams. Without sorted inputs, the join cannot advance both sides efficiently.
Problem
Given two relations:
and
with join keys:
and
produce both inputs ordered by their join keys so the merge join phase can compare them sequentially.
Core Idea
Sort each unsorted input by its join key:
SELECT *
FROM R
JOIN S
ON R.a = S.b;requires:
sort R by R.a
sort S by S.b
merge matching key groupsThe sort phase may use in-memory sort or external sort depending on input size and memory limits.
Algorithm
sort_merge_join_sort_phase(R, S, key_R, key_S, memory_limit):
if R is not ordered by key_R:
R_sorted = external_sort(R, key_R, memory_limit)
else:
R_sorted = R
if S is not ordered by key_S:
S_sorted = external_sort(S, key_S, memory_limit)
else:
S_sorted = S
return R_sorted, S_sortedThe sorted streams are then passed to the merge phase.
merge_join(R_sorted, S_sorted):
advance through both streams by join key
when keys match:
emit all pairs from matching key groups
when one key is smaller:
advance that sideExample
Relation R:
| id | a |
|---|---|
| 1 | 30 |
| 2 | 10 |
| 3 | 20 |
Relation S:
| id | b |
|---|---|
| 8 | 20 |
| 9 | 10 |
| 10 | 40 |
Sort R by a:
| id | a |
|---|---|
| 2 | 10 |
| 3 | 20 |
| 1 | 30 |
Sort S by b:
| id | b |
|---|---|
| 9 | 10 |
| 8 | 20 |
| 10 | 40 |
Now the merge join can scan both tables in order and match keys 10 and 20.
Correctness
The sort phase preserves all rows and orders each input by its join key. Once both inputs are sorted, equal join keys appear in contiguous groups.
The merge phase can then compare the current key from each side. If one key is smaller, no later row on the other side can match it, so that side can advance safely. If the keys are equal, all rows in both equal-key groups are paired. Therefore the sorted inputs are sufficient for a correct merge join.
Complexity
Let:
- be the number of rows in
- be the number of rows in
- be the block size
- be available memory
- be merge fan-in
Sorting cost:
in the comparison model.
External I/O cost:
The later merge join scan is linear:
plus output size.
Database Optimizations
| optimization | effect |
|---|---|
| existing index order | skips sorting one input |
| clustered table order | may provide natural ordering |
| partial order | may allow incremental sort |
| projection before sort | reduces row width |
| sort sharing | reuses order for later operators |
| spill files | handles memory overflow |
When to Use
The sort phase is needed when a sort-merge join is selected and one or both inputs lack the required ordering.
It is often chosen when:
- join inputs are large
- ordered output is also needed
- useful indexes already sort one side
- hash join memory would be too large
- equality join keys have many duplicates
Implementation Sketch
def sort_merge_join_sort_phase(R, S, key_R, key_S):
R_sorted = sorted(R, key=key_R)
S_sorted = sorted(S, key=key_S)
return R_sorted, S_sorteddef merge_join(R, S, key_R, key_S):
i = 0
j = 0
out = []
while i < len(R) and j < len(S):
kr = key_R(R[i])
ks = key_S(S[j])
if kr < ks:
i += 1
elif kr > ks:
j += 1
else:
r_group = []
s_group = []
key = kr
while i < len(R) and key_R(R[i]) == key:
r_group.append(R[i])
i += 1
while j < len(S) and key_S(S[j]) == key:
s_group.append(S[j])
j += 1
for r in r_group:
for s in s_group:
out.append((r, s))
return outNotes
The sort phase is the expensive preparation step of sort-merge join. Its value comes from turning two unordered relations into ordered streams, after which the join can proceed by sequential scanning.