External sorting algorithm that distributes sorted runs across sequential storage tapes and repeatedly merges them.
Tape merge sort is an external sorting algorithm designed for sequential storage. It was historically used with magnetic tapes, where random access was expensive or unavailable. The algorithm writes sorted runs to multiple tapes, then repeatedly merges runs from input tapes onto output tapes.
The same idea still matters for sequential external storage: make long sequential reads and writes, avoid random seeks, and reduce the number of full passes over the data.
Problem
Given a dataset too large to fit in memory, sort it using only sequential reads and writes.
The input is divided into sorted runs. These runs are then merged until one sorted run remains.
Core Idea
Tape merge sort uses two phases:
Run generation Read chunks into memory, sort them, and write sorted runs to output tapes.
Tape merging Merge runs from input tapes to output tapes. After each pass, swap input and output tape roles.
Algorithm
tape_merge_sort(input, memory_size, tapes):
runs = generate_sorted_runs(input, memory_size)
distribute runs across input tapes
while more than one run remains:
merge runs from input tapes onto output tapes
rewind tapes
swap input tapes and output tapes
return final runA two-way tape merge uses two input tapes and one output tape.
merge_pass(input_tape_1, input_tape_2, output_tape):
while both input tapes have runs:
merge one run from input_tape_1
with one run from input_tape_2
write merged run to output_tapeExample
Suppose initial sorted runs are:
Distribute them across two input tapes:
| tape | runs |
|---|---|
| T1 | |
| T2 |
First merge pass:
| output tape | merged runs |
|---|---|
| T3 | merge(), merge() |
Now there are two larger runs. A second pass merges those into one final run.
Correctness
Each initial run is sorted. A merge of two sorted runs produces a larger sorted run by repeatedly selecting the smaller current head element.
Each merge pass preserves all records and reduces the number of runs. Since every pass produces sorted runs and the process stops only when one run remains, the final run contains all input records in sorted order.
Complexity
Let:
- be the number of records
- be the number of initial runs
- be the block size
- be the merge fan-in
Number of merge passes:
I/O cost:
CPU cost for k-way merging:
per merge level when using a heap.
Tape Scheduling
Tape merge sort differs from ordinary external merge sort mainly in scheduling.
| schedule | idea |
|---|---|
| balanced merge | distribute runs evenly across tapes |
| polyphase merge | distribute runs unevenly to reduce idle tapes |
| cascade merge | pass merged runs through staged levels |
For real tape devices, minimizing rewinds and redistribution can dominate performance.
When to Use
Tape merge sort is useful when:
- storage supports fast sequential access but poor random access
- data is much larger than memory
- runs must be processed in passes
- input and output streams are naturally sequential
It is mostly historical as a named method, but its I/O discipline remains relevant for external sorting and streaming pipelines.
Implementation Sketch
def merge_two_runs(a, b):
i = 0
j = 0
out = []
while i < len(a) and j < len(b):
if a[i] <= b[j]:
out.append(a[i])
i += 1
else:
out.append(b[j])
j += 1
out.extend(a[i:])
out.extend(b[j:])
return outdef tape_merge_sort_runs(runs):
while len(runs) > 1:
next_runs = []
for i in range(0, len(runs), 2):
if i + 1 < len(runs):
next_runs.append(merge_two_runs(runs[i], runs[i + 1]))
else:
next_runs.append(runs[i])
runs = next_runs
return runs[0] if runs else []Notes
Tape merge sort is best understood as merge sort under sequential-storage constraints. The sorting rule is ordinary merging. The important part is how runs are placed, read, written, rewound, and redistributed across sequential devices.