Sort data that does not fit in memory by dividing it into sorted runs and merging them using external storage.
External merge sort is a disk-based sorting algorithm designed for datasets that exceed main memory. It minimizes random I/O and relies on sequential reads and writes, which are efficient on disks and object storage.
You use it when the input size is much larger than RAM, for example multi-gigabyte or terabyte scale data.
Model
Assume:
- Input size: elements
- Memory capacity: elements
- Block size: elements per I/O
The goal is to sort using minimal disk passes.
High-Level Idea
The algorithm has two phases:
Run generation Split data into chunks that fit in memory, sort each chunk, and write sorted runs to disk.
Merge phase Merge multiple sorted runs into one final sorted sequence.
Algorithm
Phase 1: Run Generation
Read chunks of size , sort in memory, and write back.
generate_runs(file):
runs = []
while not end_of_file:
chunk = read_next_M_elements(file)
sort(chunk)
run_file = write_to_disk(chunk)
runs.append(run_file)
return runsPhase 2: K-way Merge
Merge multiple runs simultaneously using a min-heap.
merge_runs(runs):
create min_heap
open all run files
for each run:
read first element and push (value, run_id) into heap
while heap not empty:
(value, r) = extract_min(heap)
output value
if run r has more elements:
read next element from r
push into heapExample
Suppose:
- Memory holds 3 elements
- Input: [8, 3, 5, 1, 9, 2, 7]
Step 1: Runs
Chunks:
- [8, 3, 5] → [3, 5, 8]
- [1, 9, 2] → [1, 2, 9]
- [7] → [7]
Step 2: Merge
Merge:
[3,5,8], [1,2,9], [7]
Result:
[1,2,3,5,7,8,9]
Complexity
I/O Complexity
Number of passes:
where is number of runs merged at once.
Total I/O:
CPU Complexity
Each element participates in heap operations:
Design Choices
| parameter | effect |
|---|---|
| run size | larger runs reduce merge passes |
| merge degree | larger reduces passes but increases heap cost |
| block size | larger blocks reduce I/O overhead |
Optimizations
Replacement selection Produces runs larger than memory, often about .
Buffered I/O Use read/write buffers to reduce system calls.
Multiway merge Merge many runs at once instead of pairwise merging.
Loser tree Replace heap with tournament tree for faster merging.
When to Use
External merge sort is appropriate when:
- data does not fit in RAM
- storage is disk, SSD, or object store
- sequential I/O is much cheaper than random access
It is the standard sorting method in databases, large-scale data processing systems, and distributed frameworks.
Implementation (simplified Python)
import heapq
def merge_sorted_files(files):
heap = []
iters = [iter(f) for f in files]
for i, it in enumerate(iters):
try:
value = next(it)
heapq.heappush(heap, (value, i))
except StopIteration:
pass
result = []
while heap:
value, i = heapq.heappop(heap)
result.append(value)
try:
nxt = next(iters[i])
heapq.heappush(heap, (nxt, i))
except StopIteration:
pass
return resultNotes
External merge sort underlies many systems:
- database sort operators
- MapReduce shuffle phase
- large-scale log processing
The key principle is simple: keep memory usage bounded, push bulk data to sequential disk operations, and merge efficiently.