External sorting technique that uses a heap to produce initial sorted runs longer than the available memory.
Replacement selection run generation is the first phase of an external sort. It produces sorted runs from an input stream using a heap. Unlike simple chunk sorting, which creates runs of about memory size, replacement selection often creates runs about twice as large when input order is random.
Longer runs reduce the number of later merge passes.
Problem
Given an input stream too large for memory, produce sorted runs using memory for only records.
The output is not one final sorted file yet. It is a sequence of sorted runs that will later be merged.
Core Idea
Keep a min-heap of active records. Repeatedly output the smallest active record. When a new input record arrives:
- if it is at least as large as the last output record, insert it into the current run
- otherwise freeze it for the next run
When all active records are exhausted, start a new run using the frozen records.
Algorithm
replacement_selection(input, M):
heap = first M records from input
frozen = empty heap
runs = []
current_run = []
last_output = -infinity
while heap is not empty:
x = extract_min(heap)
append x to current_run
last_output = x
if input has next record:
y = read next record
if y >= last_output:
insert y into heap
else:
insert y into frozen
if heap is empty:
append current_run to runs
current_run = []
heap = frozen
frozen = empty heap
last_output = -infinity
return runsExample
Let memory hold 3 records.
Input:
Initial heap:
Process:
| output | new input | action |
|---|---|---|
| 1 | 7 | active, since 7 >= 1 |
| 4 | 2 | frozen, since 2 < 4 |
| 7 | 9 | active, since 9 >= 7 |
| 8 | 3 | frozen, since 3 < 8 |
| 9 | 6 | frozen, since 6 < 9 |
Current run:
Frozen records become the next heap:
Next run:
Correctness
The active heap always contains records that can still extend the current run. Since the algorithm extracts the minimum active record, every emitted record is at least as large as the previous emitted record.
Records smaller than the last output cannot appear in the current run without breaking sorted order, so they are frozen for the next run. Every input record is inserted into exactly one heap and eventually emitted. Therefore each produced run is sorted and all records are preserved.
Complexity
Let:
- be the number of records
- be the heap capacity
Each record is inserted into and removed from a heap once.
CPU time:
Memory usage:
I/O is sequential:
records read and written, excluding later merge phases.
Expected Run Length
For random input, replacement selection often produces runs with expected length about:
This is an average-case heuristic under common assumptions. Actual run length depends on input order.
| input pattern | run behavior |
|---|---|
| already sorted | one very long run |
| random | about per run |
| reverse sorted | about per run |
When to Use
Use replacement selection when:
- building initial runs for external merge sort
- input is streamed
- memory is limited
- reducing merge passes matters
- input is not strongly reverse sorted
For modern systems with fast SSDs and large memory, simple chunk sorting may be easier. Replacement selection remains important in database and external sorting theory.
Implementation Sketch
import heapq
import math
def replacement_selection(data, memory_size):
it = iter(data)
heap = []
frozen = []
runs = []
current = []
last_output = -math.inf
for _ in range(memory_size):
try:
heapq.heappush(heap, next(it))
except StopIteration:
break
while heap:
x = heapq.heappop(heap)
current.append(x)
last_output = x
try:
y = next(it)
if y >= last_output:
heapq.heappush(heap, y)
else:
heapq.heappush(frozen, y)
except StopIteration:
pass
if not heap:
runs.append(current)
current = []
heap, frozen = frozen, []
last_output = -math.inf
return runsNotes
Replacement selection improves external sort by making the first runs longer. Longer runs mean fewer runs to merge, which can reduce the number of expensive external merge passes.