20.22 Building an External Merge Sort
Design a sorting tool that can sort data larger than available memory.
20.22 Building an External Merge Sort
Problem
Design a sorting tool that can sort data larger than available memory.
External sorting appears in databases, log processors, search indexing pipelines, ETL jobs, compilers, analytics engines, and storage systems. In-memory sorting is straightforward when all records fit in RAM. When the input is larger than memory, the algorithm must use disk.
Suppose you need to sort this file:
42
7
19
3
88
For a small file, you can load all values, sort them, and write them back. For a 500 GB file on a machine with 8 GB of memory, that approach fails.
The core algorithmic problem is sorting under a memory limit. The engineering problem is managing files, buffering, temporary runs, and cleanup safely.
Solution
Use external merge sort.
External merge sort has two phases:
Phase 1: read chunks, sort each chunk in memory, write sorted runs
Phase 2: merge sorted runs into one sorted output
The structure is the same as merge sort, but the sorted subarrays are files rather than memory slices.
Phase 1: Creating Sorted Runs
Read as many records as fit in memory. Sort them. Write the sorted chunk to a temporary file.
import heapq
import os
import tempfile
def create_sorted_runs(input_path, max_records_per_run):
run_paths = []
with open(input_path, "r", encoding="utf-8") as input_file:
while True:
records = []
for _ in range(max_records_per_run):
line = input_file.readline()
if not line:
break
records.append(parse_record(line))
if not records:
break
records.sort()
run_path = write_run(records)
run_paths.append(run_path)
return run_paths
Record parsing:
def parse_record(line):
return int(line.strip())
Run writing:
def write_run(records):
temp = tempfile.NamedTemporaryFile(
mode="w",
encoding="utf-8",
delete=False,
)
try:
for record in records:
temp.write(f"{record}\n")
return temp.name
finally:
temp.close()
At the end of this phase, each temporary file is sorted.
Example:
input: 42 7 19 3 88
chunk size: 2
run 1: 7 42
run 2: 3 19
run 3: 88
Phase 2: K-Way Merge
To merge many sorted runs, open each run and keep its current smallest record in a heap.
class RunReader:
def __init__(self, path):
self.path = path
self.file = open(path, "r", encoding="utf-8")
def read_next(self):
line = self.file.readline()
if not line:
return None
return parse_record(line)
def close(self):
self.file.close()
Merge:
def merge_runs(run_paths, output_path):
readers = []
heap = []
try:
for index, path in enumerate(run_paths):
reader = RunReader(path)
readers.append(reader)
record = reader.read_next()
if record is not None:
heapq.heappush(heap, (record, index))
with open(output_path, "w", encoding="utf-8") as output_file:
while heap:
record, index = heapq.heappop(heap)
output_file.write(f"{record}\n")
next_record = readers[index].read_next()
if next_record is not None:
heapq.heappush(heap, (next_record, index))
finally:
for reader in readers:
reader.close()
The heap always contains the next available record from each run. The smallest heap element is the next sorted output.
End-to-End Sort
Combine the phases and clean up temporary files.
def external_sort(input_path, output_path, max_records_per_run):
run_paths = create_sorted_runs(
input_path,
max_records_per_run,
)
try:
merge_runs(run_paths, output_path)
finally:
for path in run_paths:
try:
os.remove(path)
except FileNotFoundError:
pass
Use it:
external_sort(
input_path="numbers.txt",
output_path="sorted_numbers.txt",
max_records_per_run=1_000_000,
)
This works for inputs much larger than memory because only one chunk and one record per run need to be held at a time.
Why K-Way Merge Works
Each run is already sorted. The next output record must be the smallest among the current heads of all runs.
The heap maintains exactly those current heads.
For runs:
R1: 7 42
R2: 3 19
R3: 88
Initial heap:
3 from R2
7 from R1
88 from R3
Pop 3, then read the next value from R2, which is 19.
Heap:
7 from R1
19 from R2
88 from R3
Repeat until all runs are exhausted.
This is the same invariant as normal merge sort, generalized from two runs to many runs.
Complexity
Let:
nbe the total number of records.rbe the number of sorted runs.
Run creation reads all records, sorts each chunk, and writes all records once.
The merge phase reads all records again and writes all records again.
| Phase | Complexity |
|---|---|
| Create runs | O(n log m) |
| Merge runs | O(n log r) |
| Disk reads | O(n) per pass |
| Disk writes | O(n) per pass |
| Memory | O(m + r) |
Here m is the number of records per memory chunk.
If r is too large, the heap is large and too many files are open. In that case, merge in multiple passes.
Multi-Pass Merge
Operating systems limit the number of files a process can open. Also, a huge merge heap may be inefficient.
Merge runs in groups.
def merge_in_passes(run_paths, max_open_runs):
current_runs = run_paths
while len(current_runs) > 1:
next_runs = []
for i in range(0, len(current_runs), max_open_runs):
group = current_runs[i:i + max_open_runs]
output_run = tempfile.NamedTemporaryFile(delete=False).name
merge_runs(group, output_run)
next_runs.append(output_run)
for path in group:
os.remove(path)
current_runs = next_runs
return current_runs[0]
Then copy or rename the final run to the output path.
def external_sort_multipass(input_path, output_path, max_records_per_run, max_open_runs):
run_paths = create_sorted_runs(
input_path,
max_records_per_run,
)
if not run_paths:
open(output_path, "w", encoding="utf-8").close()
return
final_run = merge_in_passes(run_paths, max_open_runs)
os.replace(final_run, output_path)
Multi-pass merging trades additional disk I/O for lower file-handle pressure.
Sorting Records by Key
Real records are usually structured.
Example:
user_id,timestamp,event
42,1003,view
17,1001,click
42,1000,login
Sort by (user_id, timestamp).
def parse_csv_record(line):
user_id, timestamp, event = line.rstrip("\n").split(",")
return {
"user_id": int(user_id),
"timestamp": int(timestamp),
"event": event,
"raw": line.rstrip("\n"),
}
def record_key(record):
return (
record["user_id"],
record["timestamp"],
)
Sort chunks by key:
records.sort(key=record_key)
For merging, heap entries should include the key and a tie-breaker.
heapq.heappush(
heap,
(record_key(record), index, record)
)
The tie-breaker prevents Python from trying to compare dictionary records when keys are equal.
Stable Sorting
A stable sort preserves input order for equal keys.
This matters when sorting logs by timestamp but preserving original order within the same timestamp.
Add a global sequence number during run creation.
def create_sorted_runs_stable(input_path, max_records_per_run):
run_paths = []
sequence = 0
with open(input_path, "r", encoding="utf-8") as input_file:
while True:
records = []
for _ in range(max_records_per_run):
line = input_file.readline()
if not line:
break
record = parse_csv_record(line)
records.append((record_key(record), sequence, record))
sequence += 1
if not records:
break
records.sort(key=lambda item: (item[0], item[1]))
run_path = write_stable_run(records)
run_paths.append(run_path)
return run_paths
The sequence number becomes part of the sort key. Equal keys are emitted in original order.
Buffering
Reading one line at a time is simple but may be slow. File objects already buffer internally, but large external sorts often use explicit buffering.
Important buffering concerns:
| Buffer | Purpose |
|---|---|
| Input buffer | Read chunks efficiently |
| Output buffer | Avoid small writes |
| Run buffer | Keep more than one record per run in memory |
| Compression buffer | Reduce disk I/O at CPU cost |
The algorithmic structure remains the same. Buffering changes constant factors.
Temporary File Management
Temporary files must be cleaned up even if sorting fails.
Use try and finally.
def safe_external_sort(input_path, output_path, max_records_per_run):
run_paths = []
try:
run_paths = create_sorted_runs(
input_path,
max_records_per_run,
)
merge_runs(run_paths, output_path)
finally:
for path in run_paths:
if os.path.exists(path):
os.remove(path)
For production systems, use a dedicated temporary directory with enough disk space.
tempfile.NamedTemporaryFile(dir="/data/tmp-sort", delete=False)
Sorting fails badly when temporary storage fills up. Estimate space before starting.
Disk Space Estimate
External merge sort usually needs temporary space roughly equal to the input size, sometimes more during multi-pass merging.
A conservative estimate:
temporary space >= input size × 2
If compression is enabled, this may be lower. If multi-pass merging overlaps input and output runs, this may be higher.
A tool should check available space before starting.
import shutil
def ensure_temp_space(temp_dir, required_bytes):
usage = shutil.disk_usage(temp_dir)
if usage.free < required_bytes:
raise RuntimeError(
f"not enough temporary space: need {required_bytes}, "
f"have {usage.free}"
)
Replacement Selection
A more advanced run creation strategy is replacement selection. It uses a heap to produce runs larger than memory on partially ordered input.
Conceptually:
read memory-sized heap
repeatedly output smallest item
replace it with next input item if it preserves order
otherwise hold it for next run
Average run length can approach twice available memory for random input.
Replacement selection improves performance by reducing the number of runs, but it is more complex than chunk sorting. Use it when external sort performance is central.
Parallelism
Run creation is easy to parallelize. Different chunks can be sorted independently.
reader -> chunks -> worker sorters -> run files -> merge
The merge phase can also be parallelized by merging groups of runs concurrently, then merging the merged outputs.
Parallelism is limited by disk bandwidth. On many systems, adding CPU workers helps until storage becomes the bottleneck.
Testing
Test a small numeric file.
def test_external_sort_numbers(tmp_path):
input_path = tmp_path / "input.txt"
output_path = tmp_path / "output.txt"
input_path.write_text("42\n7\n19\n3\n", encoding="utf-8")
external_sort(
str(input_path),
str(output_path),
max_records_per_run=2,
)
assert output_path.read_text(encoding="utf-8") == "3\n7\n19\n42\n"
Test empty input.
def test_external_sort_empty_file(tmp_path):
input_path = tmp_path / "input.txt"
output_path = tmp_path / "output.txt"
input_path.write_text("", encoding="utf-8")
external_sort_multipass(
str(input_path),
str(output_path),
max_records_per_run=2,
max_open_runs=2,
)
assert output_path.read_text(encoding="utf-8") == ""
Test duplicates.
def test_external_sort_duplicates(tmp_path):
input_path = tmp_path / "input.txt"
output_path = tmp_path / "output.txt"
input_path.write_text("2\n1\n2\n1\n", encoding="utf-8")
external_sort(
str(input_path),
str(output_path),
max_records_per_run=2,
)
assert output_path.read_text(encoding="utf-8") == "1\n1\n2\n2\n"
Use small run sizes in tests to force multiple runs and exercise the merge logic.
Common Bugs
The most common external sort bug is loading too much data into memory during merging. The merge should keep only the current record from each run, not entire run files.
Another common bug is exceeding the open-file limit. Use multi-pass merging when the number of runs is large.
Temporary files are often left behind after failures. Always clean them in finally.
Sort keys can be inconsistent between run creation and merging. Use the same key function in both phases.
Duplicates can break heap comparisons if records themselves are not comparable. Include a tie-breaker.
Empty input must produce an empty output file.
Stable sorting requires carrying original sequence numbers across runs.
Temporary disk exhaustion should be detected early when possible.
Recipe
Build external merge sort in layers.
Start with run creation: read a memory-bounded chunk, sort it, and write a sorted temporary file. Add k-way merge with a heap that stores one current record per run. Clean up temporary files reliably. Add multi-pass merging when open-file limits matter. Generalize from numbers to records with a key function. Add tie-breakers for duplicate keys. Add stable sorting if equal-key order matters. Check temporary disk space and benchmark with realistic input sizes.
The main lesson is that external sorting is ordinary sorting adapted to the memory hierarchy. The algorithm minimizes memory use by turning one large sort into many in-memory sorts plus sequential merges. Disk I/O, buffering, temporary storage, and cleanup are as important as the comparison logic.