Skip to content

Sort Merge Join Sort Phase

The sorting stage used before a sort-merge join when one or both inputs are not already ordered by the join key.

The sort phase of sort-merge join orders each input relation by the join key before the merge phase begins. If an input is already ordered by a suitable index or previous query operator, the database can skip sorting that input.

This phase is important because the merge join itself requires ordered streams. Without sorted inputs, the join cannot advance both sides efficiently.

Problem

Given two relations:

R R

and

S S

with join keys:

R.a R.a

and

S.b S.b

produce both inputs ordered by their join keys so the merge join phase can compare them sequentially.

Core Idea

Sort each unsorted input by its join key:

SELECT *
FROM R
JOIN S
ON R.a = S.b;

requires:

sort R by R.a
sort S by S.b
merge matching key groups

The sort phase may use in-memory sort or external sort depending on input size and memory limits.

Algorithm

sort_merge_join_sort_phase(R, S, key_R, key_S, memory_limit):
    if R is not ordered by key_R:
        R_sorted = external_sort(R, key_R, memory_limit)
    else:
        R_sorted = R

    if S is not ordered by key_S:
        S_sorted = external_sort(S, key_S, memory_limit)
    else:
        S_sorted = S

    return R_sorted, S_sorted

The sorted streams are then passed to the merge phase.

merge_join(R_sorted, S_sorted):
    advance through both streams by join key

    when keys match:
        emit all pairs from matching key groups

    when one key is smaller:
        advance that side

Example

Relation R:

ida
130
210
320

Relation S:

idb
820
910
1040

Sort R by a:

ida
210
320
130

Sort S by b:

idb
910
820
1040

Now the merge join can scan both tables in order and match keys 10 and 20.

Correctness

The sort phase preserves all rows and orders each input by its join key. Once both inputs are sorted, equal join keys appear in contiguous groups.

The merge phase can then compare the current key from each side. If one key is smaller, no later row on the other side can match it, so that side can advance safely. If the keys are equal, all rows in both equal-key groups are paired. Therefore the sorted inputs are sufficient for a correct merge join.

Complexity

Let:

  • NRN_R be the number of rows in RR
  • NSN_S be the number of rows in SS
  • BB be the block size
  • MM be available memory
  • kk be merge fan-in

Sorting cost:

O(NRlogNR+NSlogNS) O(N_R \log N_R + N_S \log N_S)

in the comparison model.

External I/O cost:

O(NRBlogkNRM)+O(NSBlogkNSM) O\left(\frac{N_R}{B} \log_k \frac{N_R}{M}\right) + O\left(\frac{N_S}{B} \log_k \frac{N_S}{M}\right)

The later merge join scan is linear:

O(NR+NS) O(N_R + N_S)

plus output size.

Database Optimizations

optimizationeffect
existing index orderskips sorting one input
clustered table ordermay provide natural ordering
partial ordermay allow incremental sort
projection before sortreduces row width
sort sharingreuses order for later operators
spill fileshandles memory overflow

When to Use

The sort phase is needed when a sort-merge join is selected and one or both inputs lack the required ordering.

It is often chosen when:

  • join inputs are large
  • ordered output is also needed
  • useful indexes already sort one side
  • hash join memory would be too large
  • equality join keys have many duplicates

Implementation Sketch

def sort_merge_join_sort_phase(R, S, key_R, key_S):
    R_sorted = sorted(R, key=key_R)
    S_sorted = sorted(S, key=key_S)
    return R_sorted, S_sorted
def merge_join(R, S, key_R, key_S):
    i = 0
    j = 0
    out = []

    while i < len(R) and j < len(S):
        kr = key_R(R[i])
        ks = key_S(S[j])

        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            r_group = []
            s_group = []
            key = kr

            while i < len(R) and key_R(R[i]) == key:
                r_group.append(R[i])
                i += 1

            while j < len(S) and key_S(S[j]) == key:
                s_group.append(S[j])
                j += 1

            for r in r_group:
                for s in s_group:
                    out.append((r, s))

    return out

Notes

The sort phase is the expensive preparation step of sort-merge join. Its value comes from turning two unordered relations into ordered streams, after which the join can proceed by sequential scanning.