External sorting algorithm that distributes sorted runs unevenly across files so repeated merges need fewer empty files and less redistribution.
Polyphase merge sort is an external sorting algorithm for merging many sorted runs using a limited number of files or tapes. It avoids the rigid redistribution step used by balanced merge sort. Instead, it places different numbers of runs on different input files so each merge phase naturally prepares the next phase.
This method was especially important for tape drives, where rewinding and redistribution were expensive. The same idea still explains how careful run scheduling can reduce external merge overhead.
Problem
Given many sorted runs stored outside memory, merge them into one sorted output while using only a small number of external files.
The constraint is that not all runs can be merged at once, and only a few files are available.
Core Idea
Polyphase merge sort keeps one file empty as the output file and stores runs unevenly on the other files.
After a merge phase:
- the output file receives newly merged runs
- one input file becomes empty
- that empty file becomes the next output file
- the previous output file becomes an input file
The algorithm rotates file roles instead of redistributing runs evenly after every phase.
Run Distribution
For a 3-file polyphase merge, the number of runs is commonly chosen from Fibonacci numbers.
Example distribution:
| file | initial runs |
|---|---|
| F1 | 5 |
| F2 | 3 |
| F3 | 0 |
Here, F3 starts empty and acts as the output file.
After merging 3 pairs of runs from F1 and F2:
| file | runs after phase |
|---|---|
| F1 | 2 |
| F2 | 0 |
| F3 | 3 |
Now F2 is empty, so it becomes the next output file.
Algorithm
polyphase_merge_sort(runs, files):
distribute runs unevenly across files
choose one empty output file
while more than one run remains:
choose output file with zero runs
choose input files with positive runs
t = minimum run count among input files
repeat t times:
merge one run from each input file
write merged run to output file
update run counts
rotate file roles
return final remaining runExample
Suppose we have 8 sorted runs and 3 files.
Initial distribution:
| file | runs |
|---|---|
| A | 5 |
| B | 3 |
| C | 0 |
Phase 1
Merge 3 runs from A and B into C.
| file | runs |
|---|---|
| A | 2 |
| B | 0 |
| C | 3 |
Phase 2
Merge 2 runs from A and C into B.
| file | runs |
|---|---|
| A | 0 |
| B | 2 |
| C | 1 |
Phase 3
Merge 1 run from B and C into A.
| file | runs |
|---|---|
| A | 1 |
| B | 1 |
| C | 0 |
Phase 4
Merge 1 run from A and B into C.
| file | runs |
|---|---|
| A | 0 |
| B | 0 |
| C | 1 |
The final sorted run is on C.
Correctness
Each phase merges sorted input runs into larger sorted output runs. A merge of sorted runs preserves sorted order because the next output element is always the smallest available head element among the input runs.
The run count decreases after each phase. Since every phase replaces multiple input runs with fewer larger runs, the process eventually leaves exactly one run. That final run contains all input elements, and it is sorted.
Complexity
Let be the number of elements and be the number of initial runs.
The CPU cost is dominated by merging:
where is the number of input files merged at once.
The I/O cost depends on the number of merge phases. In asymptotic form:
where is the block size.
Polyphase merge sort mainly improves constants and file scheduling. Its purpose is to reduce wasted passes and redistribution overhead, especially when the number of files is small.
When to Use
Use polyphase merge sort when:
- data is too large for memory
- only a small number of external files are available
- redistribution between merge passes is expensive
- sequential I/O dominates cost
For modern systems with many file handles and large memory, balanced k-way merge is often simpler. Polyphase merging remains useful as a model for external merge scheduling.
Implementation Sketch
import heapq
def merge_runs(*runs):
heap = []
iters = [iter(run) for run in runs]
for i, it in enumerate(iters):
try:
heapq.heappush(heap, (next(it), i))
except StopIteration:
pass
out = []
while heap:
value, i = heapq.heappop(heap)
out.append(value)
try:
heapq.heappush(heap, (next(iters[i]), i))
except StopIteration:
pass
return outdef polyphase_counts(a, b, c):
# One step of 3-file polyphase count rotation.
# c is assumed to be the empty output file.
t = min(a, b)
a -= t
b -= t
c += t
return a, b, cNotes
Polyphase merge sort is less about a different comparison rule and more about an external storage schedule. It uses the same merge primitive as merge sort, but arranges runs so that each phase naturally creates the inputs for the next one.