Generate long sorted runs from a stream using a heap, commonly used in external sorting.
Replacement selection is a run generation technique used in external sorting. It produces sorted runs that are often longer than the available memory size by using a priority queue.
You use it when sorting data that does not fit in memory.
Problem
Given a stream of elements and limited memory of size ( M ), produce sorted runs on disk such that:
- each run is sorted
- runs are as long as possible
These runs are later merged.
Idea
Maintain a min heap of size ( M ). At each step:
- output the smallest element
- replace it with a new element from the input
- if the new element is smaller than the last output, defer it to the next run
This allows runs to grow beyond ( M ).
Algorithm
replacement_selection(stream, M):
heap = read_first_M_elements(stream)
build_min_heap(heap)
last_output = -infinity
next_run = []
while heap not empty:
x = heap.pop()
output(x)
last_output = x
if stream not empty:
y = next(stream)
if y >= last_output:
heap.push(y)
else:
next_run.append(y)
if heap empty:
write_run_boundary()
heap = next_run
build_min_heap(heap)
next_run = []
last_output = -infinityExample
Memory size ( M = 3 )
Input:
( [5, 2, 8, 1, 7, 3, 6, 4] )
Process:
- initial heap: [2, 5, 8]
- output 2 → insert 1 (deferred)
- output 5 → insert 7
- output 7 → insert 3 (deferred)
- output 8 → insert 6 (deferred)
First run:
( [2, 5, 7, 8] )
Deferred elements form next run.
Correctness
The heap always contains candidates that can extend the current run. Any element smaller than the last output cannot belong to the current sorted sequence, so it is deferred. This guarantees that each run remains sorted.
Run Length
Expected run length:
[ \approx 2M ]
under random input assumptions. This is significantly larger than memory size.
Complexity
| operation | cost |
|---|---|
| heap operations | ( O(\log M) ) |
| total | ( O(n \log M) ) |
Space:
[ O(M) ]
When to Use
Replacement selection is useful when:
- sorting very large datasets
- building initial runs for external merge sort
- disk I/O dominates cost
It reduces the number of runs, which lowers the number of merge passes required later.