Generate long initial runs for external sorting using replacement selection with disk-based streams.
External replacement selection is the disk-oriented version of replacement selection. It generates long sorted runs from data that does not fit in memory, forming the first phase of external merge sort.
You use it when sorting large datasets stored on disk.
Problem
Given a dataset too large to fit into memory and a memory buffer of size ( M ), generate sorted runs on disk such that:
- each run is sorted
- runs are as long as possible
- disk I/O is minimized
Idea
Maintain a min heap in memory. Repeatedly output the smallest element and replace it with the next input element. If the new element is smaller than the last output, defer it to the next run.
This produces runs that are typically longer than the memory size.
Algorithm
external_replacement_selection(stream, M):
heap = load_first_M_elements(stream)
build_min_heap(heap)
last_output = -infinity
next_run = []
while heap not empty:
x = heap.pop()
write_to_disk(x)
last_output = x
if stream not empty:
y = read_next(stream)
if y >= last_output:
heap.push(y)
else:
next_run.append(y)
if heap empty:
write_run_boundary()
heap = next_run
build_min_heap(heap)
next_run = []
last_output = -infinityExample
Memory size:
[ M = 4 ]
Input stream:
[ [6, 2, 9, 3, 1, 8, 4, 7, 5] ]
First run might become:
[ [2, 3, 6, 8, 9] ]
Deferred elements:
[ [1, 4, 7, 5] ]
Next run processes deferred elements.
Correctness
The heap always contains candidates for the current run. Any element smaller than the last output would violate sorted order, so it is deferred.
Thus:
- each run is sorted
- all elements are eventually included in some run
Run Length
Expected run length under random input:
[ \approx 2M ]
This reduces the number of runs and improves merge efficiency.
Complexity
| operation | cost |
|---|---|
| heap operations | ( O(\log M) ) |
| total time | ( O(n \log M) ) |
| disk I/O | sequential reads and writes |
Space:
[ O(M) ]
Practical Considerations
- disk access dominates cost, so sequential I/O is critical
- buffering should be used for reads and writes
- run boundaries must be recorded for merging
When to Use
Use external replacement selection when:
- data does not fit in memory
- building initial runs for external merge sort
- minimizing number of merge passes matters
It is a core technique in database systems and large-scale data processing.