Sort Merge Join Sort Phase

The sort phase of sort-merge join orders each input relation by the join key before the merge phase begins. If an input is already ordered by a suitable index or previous query operator, the database can skip sorting that input.

This phase is important because the merge join itself requires ordered streams. Without sorted inputs, the join cannot advance both sides efficiently.

Problem

Given two relations:

R

and

S

with join keys:

R.a

and

S.b

produce both inputs ordered by their join keys so the merge join phase can compare them sequentially.

Core Idea

Sort each unsorted input by its join key:

SELECT *
FROM R
JOIN S
ON R.a = S.b;

requires:

sort R by R.a
sort S by S.b
merge matching key groups

The sort phase may use in-memory sort or external sort depending on input size and memory limits.

Algorithm

sort_merge_join_sort_phase(R, S, key_R, key_S, memory_limit):
    if R is not ordered by key_R:
        R_sorted = external_sort(R, key_R, memory_limit)
    else:
        R_sorted = R

    if S is not ordered by key_S:
        S_sorted = external_sort(S, key_S, memory_limit)
    else:
        S_sorted = S

    return R_sorted, S_sorted

The sorted streams are then passed to the merge phase.

merge_join(R_sorted, S_sorted):
    advance through both streams by join key

    when keys match:
        emit all pairs from matching key groups

    when one key is smaller:
        advance that side

Example

Relation R:

id	a
1	30
2	10
3	20

Relation S:

id	b
8	20
9	10
10	40

Sort R by a:

id	a
2	10
3	20
1	30

Sort S by b:

id	b
9	10
8	20
10	40

Now the merge join can scan both tables in order and match keys 10 and 20.

Correctness

The sort phase preserves all rows and orders each input by its join key. Once both inputs are sorted, equal join keys appear in contiguous groups.

The merge phase can then compare the current key from each side. If one key is smaller, no later row on the other side can match it, so that side can advance safely. If the keys are equal, all rows in both equal-key groups are paired. Therefore the sorted inputs are sufficient for a correct merge join.

Complexity

Let:

$N_R$ be the number of rows in $R$
$N_S$ be the number of rows in $S$
$B$ be the block size
$M$ be available memory
$k$ be merge fan-in

Sorting cost:

O(N_R \log N_R + N_S \log N_S)

in the comparison model.

External I/O cost:

O\left(\frac{N_R}{B} \log_k \frac{N_R}{M}\right) + O\left(\frac{N_S}{B} \log_k \frac{N_S}{M}\right)

The later merge join scan is linear:

O(N_R + N_S)

plus output size.

Database Optimizations

optimization	effect
existing index order	skips sorting one input
clustered table order	may provide natural ordering
partial order	may allow incremental sort
projection before sort	reduces row width
sort sharing	reuses order for later operators
spill files	handles memory overflow

When to Use

The sort phase is needed when a sort-merge join is selected and one or both inputs lack the required ordering.

It is often chosen when:

join inputs are large
ordered output is also needed
useful indexes already sort one side
hash join memory would be too large
equality join keys have many duplicates

Implementation Sketch

def sort_merge_join_sort_phase(R, S, key_R, key_S):
    R_sorted = sorted(R, key=key_R)
    S_sorted = sorted(S, key=key_S)
    return R_sorted, S_sorted

def merge_join(R, S, key_R, key_S):
    i = 0
    j = 0
    out = []

    while i < len(R) and j < len(S):
        kr = key_R(R[i])
        ks = key_S(S[j])

        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            r_group = []
            s_group = []
            key = kr

            while i < len(R) and key_R(R[i]) == key:
                r_group.append(R[i])
                i += 1

            while j < len(S) and key_S(S[j]) == key:
                s_group.append(S[j])
                j += 1

            for r in r_group:
                for s in s_group:
                    out.append((r, s))

    return out

Notes

The sort phase is the expensive preparation step of sort-merge join. Its value comes from turning two unordered relations into ordered streams, after which the join can proceed by sequential scanning.