External sorting algorithm that performs run generation and a single multiway merge using limited memory.
Two Phase Multiway Merge Sort (TPMMS) is a practical external sorting algorithm used in database systems. It assumes that all sorted runs can be merged in a single pass using available memory buffers.
The method minimizes the number of disk passes, which dominates cost in external memory settings.
Model
Let:
- = total number of elements
- = memory capacity
- = block size
Memory is divided into buffers:
- 1 output buffer
- input buffers
- Total buffers
High-Level Structure
TPMMS consists of exactly two phases:
- Run generation
- Single multiway merge
The key constraint:
so all runs can be merged in one pass.
Phase 1: Run Generation
Split input into chunks that fit into memory, sort each chunk, and write back as runs.
generate_runs(file):
runs = []
while not end_of_file:
chunk = read_next_M_elements(file)
sort(chunk)
run_file = write_to_disk(chunk)
runs.append(run_file)
return runsNumber of runs:
Phase 2: Multiway Merge
Merge all runs simultaneously using input buffers and one output buffer.
multiway_merge(runs):
initialize k input buffers
initialize output buffer
build min-heap with first element from each run
while heap not empty:
(value, r) = extract_min(heap)
append value to output buffer
if output buffer full:
flush to disk
if run r has more data:
read next element from buffer
push into heapExample
Suppose:
- Memory holds 3 elements
- Input: [8, 3, 5, 1, 9, 2, 7]
Phase 1
Runs:
- [3, 5, 8]
- [1, 2, 9]
- [7]
Phase 2
Merge all runs in one pass:
Result:
[1, 2, 3, 5, 7, 8, 9]
Complexity
I/O Cost
Phase 1 reads and writes all data once:
Phase 2 reads and writes once:
Total:
with two full passes.
CPU Cost
Heap operations:
where is number of runs.
Constraint for Two-Phase Property
To ensure only one merge pass:
which gives:
This is the classic TPMMS condition.
Design Notes
| aspect | implication |
|---|---|
| fewer runs | larger memory improves performance |
| larger merge fan-in | fewer passes |
| buffer size | affects I/O efficiency |
When to Use
TPMMS is appropriate when:
- dataset satisfies
- you want exactly two passes over data
- you build database operators like ORDER BY or sort-merge join
If the condition fails, additional merge passes are required.
Implementation Sketch
import heapq
def multiway_merge(runs):
heap = []
iters = [iter(r) for r in runs]
for i, it in enumerate(iters):
try:
heapq.heappush(heap, (next(it), i))
except StopIteration:
pass
result = []
while heap:
val, i = heapq.heappop(heap)
result.append(val)
try:
heapq.heappush(heap, (next(iters[i]), i))
except StopIteration:
pass
return resultNotes
TPMMS is the standard model for:
- relational database sorting
- external GROUP BY and ORDER BY
- sort-merge join preparation
Its importance comes from the guarantee of minimal passes when memory is sufficient.