Skip to content

Polyphase Merge Sort

External sorting algorithm that distributes sorted runs unevenly across files so repeated merges need fewer empty files and less redistribution.

Polyphase merge sort is an external sorting algorithm for merging many sorted runs using a limited number of files or tapes. It avoids the rigid redistribution step used by balanced merge sort. Instead, it places different numbers of runs on different input files so each merge phase naturally prepares the next phase.

This method was especially important for tape drives, where rewinding and redistribution were expensive. The same idea still explains how careful run scheduling can reduce external merge overhead.

Problem

Given many sorted runs stored outside memory, merge them into one sorted output while using only a small number of external files.

The constraint is that not all runs can be merged at once, and only a few files are available.

Core Idea

Polyphase merge sort keeps one file empty as the output file and stores runs unevenly on the other files.

After a merge phase:

  • the output file receives newly merged runs
  • one input file becomes empty
  • that empty file becomes the next output file
  • the previous output file becomes an input file

The algorithm rotates file roles instead of redistributing runs evenly after every phase.

Run Distribution

For a 3-file polyphase merge, the number of runs is commonly chosen from Fibonacci numbers.

Example distribution:

fileinitial runs
F15
F23
F30

Here, F3 starts empty and acts as the output file.

After merging 3 pairs of runs from F1 and F2:

fileruns after phase
F12
F20
F33

Now F2 is empty, so it becomes the next output file.

Algorithm

polyphase_merge_sort(runs, files):
    distribute runs unevenly across files
    choose one empty output file

    while more than one run remains:
        choose output file with zero runs
        choose input files with positive runs

        t = minimum run count among input files

        repeat t times:
            merge one run from each input file
            write merged run to output file

        update run counts
        rotate file roles

    return final remaining run

Example

Suppose we have 8 sorted runs and 3 files.

Initial distribution:

fileruns
A5
B3
C0

Phase 1

Merge 3 runs from A and B into C.

fileruns
A2
B0
C3

Phase 2

Merge 2 runs from A and C into B.

fileruns
A0
B2
C1

Phase 3

Merge 1 run from B and C into A.

fileruns
A1
B1
C0

Phase 4

Merge 1 run from A and B into C.

fileruns
A0
B0
C1

The final sorted run is on C.

Correctness

Each phase merges sorted input runs into larger sorted output runs. A merge of sorted runs preserves sorted order because the next output element is always the smallest available head element among the input runs.

The run count decreases after each phase. Since every phase replaces multiple input runs with fewer larger runs, the process eventually leaves exactly one run. That final run contains all input elements, and it is sorted.

Complexity

Let NN be the number of elements and RR be the number of initial runs.

The CPU cost is dominated by merging:

O(Nlogk) O(N \log k)

where kk is the number of input files merged at once.

The I/O cost depends on the number of merge phases. In asymptotic form:

O(NBlogkR) O\left(\frac{N}{B} \log_k R\right)

where BB is the block size.

Polyphase merge sort mainly improves constants and file scheduling. Its purpose is to reduce wasted passes and redistribution overhead, especially when the number of files is small.

When to Use

Use polyphase merge sort when:

  • data is too large for memory
  • only a small number of external files are available
  • redistribution between merge passes is expensive
  • sequential I/O dominates cost

For modern systems with many file handles and large memory, balanced k-way merge is often simpler. Polyphase merging remains useful as a model for external merge scheduling.

Implementation Sketch

import heapq

def merge_runs(*runs):
    heap = []
    iters = [iter(run) for run in runs]

    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass

    out = []

    while heap:
        value, i = heapq.heappop(heap)
        out.append(value)

        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

    return out
def polyphase_counts(a, b, c):
    # One step of 3-file polyphase count rotation.
    # c is assumed to be the empty output file.
    t = min(a, b)

    a -= t
    b -= t
    c += t

    return a, b, c

Notes

Polyphase merge sort is less about a different comparison rule and more about an external storage schedule. It uses the same merge primitive as merge sort, but arranges runs so that each phase naturally creates the inputs for the next one.