# Two Phase Multiway Merge Sort

# Two Phase Multiway Merge Sort

Two Phase Multiway Merge Sort (TPMMS) is a practical external sorting algorithm used in database systems. It assumes that all sorted runs can be merged in a single pass using available memory buffers.

The method minimizes the number of disk passes, which dominates cost in external memory settings.

## Model

Let:

* $N$ = total number of elements
* $M$ = memory capacity
* $B$ = block size

Memory is divided into buffers:

* 1 output buffer
* $k$ input buffers
* Total buffers $\leq M / B$

## High-Level Structure

TPMMS consists of exactly two phases:

1. **Run generation**
2. **Single multiway merge**

The key constraint:

$$
\frac{N}{M} \leq k
$$

so all runs can be merged in one pass.

## Phase 1: Run Generation

Split input into chunks that fit into memory, sort each chunk, and write back as runs.

```text id="j3a9f1"
generate_runs(file):
    runs = []
    while not end_of_file:
        chunk = read_next_M_elements(file)
        sort(chunk)
        run_file = write_to_disk(chunk)
        runs.append(run_file)
    return runs
```

Number of runs:

$$
R = \left\lceil \frac{N}{M} \right\rceil
$$

## Phase 2: Multiway Merge

Merge all runs simultaneously using $k$ input buffers and one output buffer.

```text id="6x2q0m"
multiway_merge(runs):
    initialize k input buffers
    initialize output buffer
    build min-heap with first element from each run

    while heap not empty:
        (value, r) = extract_min(heap)
        append value to output buffer

        if output buffer full:
            flush to disk

        if run r has more data:
            read next element from buffer
            push into heap
```

## Example

Suppose:

* Memory holds 3 elements
* Input: [8, 3, 5, 1, 9, 2, 7]

### Phase 1

Runs:

* [3, 5, 8]
* [1, 2, 9]
* [7]

### Phase 2

Merge all runs in one pass:

Result:

[1, 2, 3, 5, 7, 8, 9]

## Complexity

### I/O Cost

Phase 1 reads and writes all data once:

$$
2 \cdot \frac{N}{B}
$$

Phase 2 reads and writes once:

$$
2 \cdot \frac{N}{B}
$$

Total:

$$
O\left(\frac{N}{B}\right)
$$

with **two full passes**.

### CPU Cost

Heap operations:

$$
O(N \log R)
$$

where $R$ is number of runs.

## Constraint for Two-Phase Property

To ensure only one merge pass:

$$
R \leq k
\quad \Rightarrow \quad
\frac{N}{M} \leq \frac{M}{B}
$$

which gives:

$$
N \leq \frac{M^2}{B}
$$

This is the classic TPMMS condition.

## Design Notes

| aspect              | implication                        |
| ------------------- | ---------------------------------- |
| fewer runs          | larger memory improves performance |
| larger merge fan-in | fewer passes                       |
| buffer size         | affects I/O efficiency             |

## When to Use

TPMMS is appropriate when:

* dataset satisfies $N \leq M^2 / B$
* you want exactly two passes over data
* you build database operators like ORDER BY or sort-merge join

If the condition fails, additional merge passes are required.

## Implementation Sketch

```python id="v9p4ke"
import heapq

def multiway_merge(runs):
    heap = []
    iters = [iter(r) for r in runs]

    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass

    result = []

    while heap:
        val, i = heapq.heappop(heap)
        result.append(val)
        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

    return result
```

## Notes

TPMMS is the standard model for:

* relational database sorting
* external GROUP BY and ORDER BY
* sort-merge join preparation

Its importance comes from the guarantee of minimal passes when memory is sufficient.

