Polyphase Merge Sort

Polyphase merge sort is an external sorting algorithm for merging many sorted runs using a limited number of files or tapes. It avoids the rigid redistribution step used by balanced merge sort. Instead, it places different numbers of runs on different input files so each merge phase naturally prepares the next phase.

This method was especially important for tape drives, where rewinding and redistribution were expensive. The same idea still explains how careful run scheduling can reduce external merge overhead.

Problem

Given many sorted runs stored outside memory, merge them into one sorted output while using only a small number of external files.

The constraint is that not all runs can be merged at once, and only a few files are available.

Core Idea

Polyphase merge sort keeps one file empty as the output file and stores runs unevenly on the other files.

After a merge phase:

the output file receives newly merged runs
one input file becomes empty
that empty file becomes the next output file
the previous output file becomes an input file

The algorithm rotates file roles instead of redistributing runs evenly after every phase.

Run Distribution

For a 3-file polyphase merge, the number of runs is commonly chosen from Fibonacci numbers.

Example distribution:

file	initial runs
F1	5
F2	3
F3	0

Here, F3 starts empty and acts as the output file.

After merging 3 pairs of runs from F1 and F2:

file	runs after phase
F1	2
F2	0
F3	3

Now F2 is empty, so it becomes the next output file.

Algorithm

polyphase_merge_sort(runs, files):
    distribute runs unevenly across files
    choose one empty output file

    while more than one run remains:
        choose output file with zero runs
        choose input files with positive runs

        t = minimum run count among input files

        repeat t times:
            merge one run from each input file
            write merged run to output file

        update run counts
        rotate file roles

    return final remaining run

Example

Suppose we have 8 sorted runs and 3 files.

Initial distribution:

file	runs
A	5
B	3
C	0

Phase 1

Merge 3 runs from A and B into C.

file	runs
A	2
B	0
C	3

Phase 2

Merge 2 runs from A and C into B.

file	runs
A	0
B	2
C	1

Phase 3

Merge 1 run from B and C into A.

file	runs
A	1
B	1
C	0

Phase 4

Merge 1 run from A and B into C.

file	runs
A	0
B	0
C	1

The final sorted run is on C.

Correctness

Each phase merges sorted input runs into larger sorted output runs. A merge of sorted runs preserves sorted order because the next output element is always the smallest available head element among the input runs.

The run count decreases after each phase. Since every phase replaces multiple input runs with fewer larger runs, the process eventually leaves exactly one run. That final run contains all input elements, and it is sorted.

Complexity

Let $N$ be the number of elements and $R$ be the number of initial runs.

The CPU cost is dominated by merging:

O(N \log k)

where $k$ is the number of input files merged at once.

The I/O cost depends on the number of merge phases. In asymptotic form:

O\left(\frac{N}{B} \log_k R\right)

where $B$ is the block size.

Polyphase merge sort mainly improves constants and file scheduling. Its purpose is to reduce wasted passes and redistribution overhead, especially when the number of files is small.

When to Use

Use polyphase merge sort when:

data is too large for memory
only a small number of external files are available
redistribution between merge passes is expensive
sequential I/O dominates cost

For modern systems with many file handles and large memory, balanced k-way merge is often simpler. Polyphase merging remains useful as a model for external merge scheduling.

Implementation Sketch

import heapq

def merge_runs(*runs):
    heap = []
    iters = [iter(run) for run in runs]

    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass

    out = []

    while heap:
        value, i = heapq.heappop(heap)
        out.append(value)

        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

    return out

def polyphase_counts(a, b, c):
    # One step of 3-file polyphase count rotation.
    # c is assumed to be the empty output file.
    t = min(a, b)

    a -= t
    b -= t
    c += t

    return a, b, c

Notes

Polyphase merge sort is less about a different comparison rule and more about an external storage schedule. It uses the same merge primitive as merge sort, but arranges runs so that each phase naturally creates the inputs for the next one.