Skip to content

Two Phase Multiway Merge Sort

External sorting algorithm that performs run generation and a single multiway merge using limited memory.

Two Phase Multiway Merge Sort (TPMMS) is a practical external sorting algorithm used in database systems. It assumes that all sorted runs can be merged in a single pass using available memory buffers.

The method minimizes the number of disk passes, which dominates cost in external memory settings.

Model

Let:

  • NN = total number of elements
  • MM = memory capacity
  • BB = block size

Memory is divided into buffers:

  • 1 output buffer
  • kk input buffers
  • Total buffers M/B\leq M / B

High-Level Structure

TPMMS consists of exactly two phases:

  1. Run generation
  2. Single multiway merge

The key constraint:

NMk \frac{N}{M} \leq k

so all runs can be merged in one pass.

Phase 1: Run Generation

Split input into chunks that fit into memory, sort each chunk, and write back as runs.

generate_runs(file):
    runs = []
    while not end_of_file:
        chunk = read_next_M_elements(file)
        sort(chunk)
        run_file = write_to_disk(chunk)
        runs.append(run_file)
    return runs

Number of runs:

R=NM R = \left\lceil \frac{N}{M} \right\rceil

Phase 2: Multiway Merge

Merge all runs simultaneously using kk input buffers and one output buffer.

multiway_merge(runs):
    initialize k input buffers
    initialize output buffer
    build min-heap with first element from each run

    while heap not empty:
        (value, r) = extract_min(heap)
        append value to output buffer

        if output buffer full:
            flush to disk

        if run r has more data:
            read next element from buffer
            push into heap

Example

Suppose:

  • Memory holds 3 elements
  • Input: [8, 3, 5, 1, 9, 2, 7]

Phase 1

Runs:

  • [3, 5, 8]
  • [1, 2, 9]
  • [7]

Phase 2

Merge all runs in one pass:

Result:

[1, 2, 3, 5, 7, 8, 9]

Complexity

I/O Cost

Phase 1 reads and writes all data once:

2NB 2 \cdot \frac{N}{B}

Phase 2 reads and writes once:

2NB 2 \cdot \frac{N}{B}

Total:

O(NB) O\left(\frac{N}{B}\right)

with two full passes.

CPU Cost

Heap operations:

O(NlogR) O(N \log R)

where RR is number of runs.

Constraint for Two-Phase Property

To ensure only one merge pass:

RkNMMB R \leq k \quad \Rightarrow \quad \frac{N}{M} \leq \frac{M}{B}

which gives:

NM2B N \leq \frac{M^2}{B}

This is the classic TPMMS condition.

Design Notes

aspectimplication
fewer runslarger memory improves performance
larger merge fan-infewer passes
buffer sizeaffects I/O efficiency

When to Use

TPMMS is appropriate when:

  • dataset satisfies NM2/BN \leq M^2 / B
  • you want exactly two passes over data
  • you build database operators like ORDER BY or sort-merge join

If the condition fails, additional merge passes are required.

Implementation Sketch

import heapq

def multiway_merge(runs):
    heap = []
    iters = [iter(r) for r in runs]

    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass

    result = []

    while heap:
        val, i = heapq.heappop(heap)
        result.append(val)
        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

    return result

Notes

TPMMS is the standard model for:

  • relational database sorting
  • external GROUP BY and ORDER BY
  • sort-merge join preparation

Its importance comes from the guarantee of minimal passes when memory is sufficient.