# Sort Merge Join Sort Phase

# Sort Merge Join Sort Phase

The sort phase of sort-merge join orders each input relation by the join key before the merge phase begins. If an input is already ordered by a suitable index or previous query operator, the database can skip sorting that input.

This phase is important because the merge join itself requires ordered streams. Without sorted inputs, the join cannot advance both sides efficiently.

## Problem

Given two relations:

$$
R
$$

and

$$
S
$$

with join keys:

$$
R.a
$$

and

$$
S.b
$$

produce both inputs ordered by their join keys so the merge join phase can compare them sequentially.

## Core Idea

Sort each unsorted input by its join key:

```sql
SELECT *
FROM R
JOIN S
ON R.a = S.b;
```

requires:

```text
sort R by R.a
sort S by S.b
merge matching key groups
```

The sort phase may use in-memory sort or external sort depending on input size and memory limits.

## Algorithm

```text id="0m8b7q"
sort_merge_join_sort_phase(R, S, key_R, key_S, memory_limit):
    if R is not ordered by key_R:
        R_sorted = external_sort(R, key_R, memory_limit)
    else:
        R_sorted = R

    if S is not ordered by key_S:
        S_sorted = external_sort(S, key_S, memory_limit)
    else:
        S_sorted = S

    return R_sorted, S_sorted
```

The sorted streams are then passed to the merge phase.

```text id="45gkvi"
merge_join(R_sorted, S_sorted):
    advance through both streams by join key

    when keys match:
        emit all pairs from matching key groups

    when one key is smaller:
        advance that side
```

## Example

Relation R:

| id |  a |
| -: | -: |
|  1 | 30 |
|  2 | 10 |
|  3 | 20 |

Relation S:

| id |  b |
| -: | -: |
|  8 | 20 |
|  9 | 10 |
| 10 | 40 |

Sort R by `a`:

| id |  a |
| -: | -: |
|  2 | 10 |
|  3 | 20 |
|  1 | 30 |

Sort S by `b`:

| id |  b |
| -: | -: |
|  9 | 10 |
|  8 | 20 |
| 10 | 40 |

Now the merge join can scan both tables in order and match keys 10 and 20.

## Correctness

The sort phase preserves all rows and orders each input by its join key. Once both inputs are sorted, equal join keys appear in contiguous groups.

The merge phase can then compare the current key from each side. If one key is smaller, no later row on the other side can match it, so that side can advance safely. If the keys are equal, all rows in both equal-key groups are paired. Therefore the sorted inputs are sufficient for a correct merge join.

## Complexity

Let:

* $N_R$ be the number of rows in $R$
* $N_S$ be the number of rows in $S$
* $B$ be the block size
* $M$ be available memory
* $k$ be merge fan-in

Sorting cost:

$$
O(N_R \log N_R + N_S \log N_S)
$$

in the comparison model.

External I/O cost:

$$
O\left(\frac{N_R}{B} \log_k \frac{N_R}{M}\right)
+
O\left(\frac{N_S}{B} \log_k \frac{N_S}{M}\right)
$$

The later merge join scan is linear:

$$
O(N_R + N_S)
$$

plus output size.

## Database Optimizations

| optimization           | effect                           |
| ---------------------- | -------------------------------- |
| existing index order   | skips sorting one input          |
| clustered table order  | may provide natural ordering     |
| partial order          | may allow incremental sort       |
| projection before sort | reduces row width                |
| sort sharing           | reuses order for later operators |
| spill files            | handles memory overflow          |

## When to Use

The sort phase is needed when a sort-merge join is selected and one or both inputs lack the required ordering.

It is often chosen when:

* join inputs are large
* ordered output is also needed
* useful indexes already sort one side
* hash join memory would be too large
* equality join keys have many duplicates

## Implementation Sketch

```python id="n2w9vz"
def sort_merge_join_sort_phase(R, S, key_R, key_S):
    R_sorted = sorted(R, key=key_R)
    S_sorted = sorted(S, key=key_S)
    return R_sorted, S_sorted
```

```python id="wa91kv"
def merge_join(R, S, key_R, key_S):
    i = 0
    j = 0
    out = []

    while i < len(R) and j < len(S):
        kr = key_R(R[i])
        ks = key_S(S[j])

        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            r_group = []
            s_group = []
            key = kr

            while i < len(R) and key_R(R[i]) == key:
                r_group.append(R[i])
                i += 1

            while j < len(S) and key_S(S[j]) == key:
                s_group.append(S[j])
                j += 1

            for r in r_group:
                for s in s_group:
                    out.append((r, s))

    return out
```

## Notes

The sort phase is the expensive preparation step of sort-merge join. Its value comes from turning two unordered relations into ordered streams, after which the join can proceed by sequential scanning.

