Sort datasets that exceed main memory by organizing the algorithm around sequential disk access, merge passes, and minimizing I/O operations.
External sorting handles datasets that do not fit in main memory. The cost model shifts from CPU time to I/O. The goal is to minimize disk reads and writes by organizing the algorithm around sequential access and large transfers.
Problem
You have a dataset too large to fit into memory. You want to sort it using limited RAM and disk storage.
Model
Assume:
N = number of records
M = memory capacity in records
B = block size in records per disk transfer

The dominant cost is the number of block transfers between disk and memory.
Solution Overview
External sorting uses two phases:
1. Run generation
2. Multiway merge

The algorithm produces sorted chunks that fit in memory, writes them to disk, and then merges those chunks into a single sorted file.
Phase 1. Run Generation
Read up to M records into memory, sort them using an internal sorting algorithm, and write the sorted block to disk as a run.
generate_runs(input):
    while not end of input:
        read up to M records into buffer
        sort buffer in memory
        write buffer as a run to disk

After this phase, you have approximately:
R = ceil(N / M)

sorted runs on disk.
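The run-generation phase can be sketched in Python. For illustration the runs are returned as in-memory lists rather than written to disk; `generate_runs` and its parameter names are chosen here for the example, not taken from any library:

```python
def generate_runs(records, M):
    """Split the input into sorted runs of at most M records.

    Runs are returned as in-memory lists for illustration;
    a real implementation would write each run to disk.
    """
    runs = []
    for start in range(0, len(records), M):
        buffer = records[start:start + M]  # read up to M records
        buffer.sort()                      # internal sort
        runs.append(buffer)                # "write" the run
    return runs
```

With N = 10 records and M = 4, this produces ceil(10 / 4) = 3 runs, matching the formula above.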
Phase 2. Multiway Merge
Merge multiple runs at once using a priority queue.
merge_runs(runs):
    open each run with a buffer
    heap = empty min-heap
    for each run i:
        read first block
        push (value, run_id) into heap
    while heap not empty:
        (value, i) = pop heap
        write value to output
        if run i has more data:
            read next value from run i
            push (value, i) into heap

Each run contributes its smallest remaining element to the heap. The heap ensures the next output element is globally smallest.
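The merge pseudocode maps directly onto Python's `heapq` module. This sketch merges in-memory lists; a disk-based version would refill a per-run buffer block by block instead of indexing into a list:

```python
import heapq

def merge_runs(runs):
    """k-way merge of sorted runs using a min-heap.

    Each heap entry is (value, run_id, position); run_id breaks
    ties so positions are never compared on equal values.
    """
    heap = []
    for i, run in enumerate(runs):
        if run:  # seed the heap with each run's first element
            heapq.heappush(heap, (run[0], i, 0))
    output = []
    while heap:
        value, i, pos = heapq.heappop(heap)
        output.append(value)
        if pos + 1 < len(runs[i]):  # run i has more data
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return output
```

Each pop removes the globally smallest remaining element, and each push replaces it with the next element from the same run.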
Merge Degree
The number of runs merged at once depends on memory.
If each run being merged needs one input buffer of B records, and one buffer is reserved for output, the number of runs merged in one pass is approximately:

k ≈ M / B - 1

A higher merge degree reduces the number of merge passes.
Number of Passes
The total number of merge passes is:
ceil(log_k R)

where:

R = number of initial runs
k = merge degree

Each pass reads and writes the entire dataset once.
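The pass-count arithmetic can be computed directly from these definitions. The sketch below assumes one output buffer is reserved (so k = M/B - 1) and clamps k to at least 2; `merge_passes` is an illustrative name, not a standard function:

```python
import math

def merge_passes(N, M, B):
    """Merge passes needed for N records, M memory, B block size."""
    R = math.ceil(N / M)       # initial runs from run generation
    k = max(2, M // B - 1)     # merge degree: one buffer kept for output
    if R <= 1:
        return 0               # a single run needs no merging
    return math.ceil(math.log(R, k))
```

For example, with N = 10^9 records, M = 10^4 records of memory, and B = 10^3 records per block, run generation leaves R = 10^5 runs, the merge degree is k = 9, and ceil(log_9 10^5) = 6 passes are needed.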
Total I/O Cost
Each pass processes all N records:
O(N / B) block transfers per pass

Total I/O cost:

O((N / B) * log_k (N / M))

The algorithm is efficient when k is large, which requires large memory relative to block size.
Replacement Selection
Run generation can be improved using replacement selection. Instead of producing runs of size at most M, it can produce runs of average size about 2M.
The idea is to maintain a heap of size M:
replacement_selection(input):
    heap = first M elements
    last_output = -infinity
    while heap not empty:
        x = extract_min(heap)
        output x
        last_output = x
        if input has next value y:
            if y >= last_output:
                insert y into heap
            else:
                mark y for next run
        when heap is empty, close the current run and rebuild the heap from the deferred elements

Elements that would break the sorted order are deferred to the next run. This increases average run length and reduces the number of merge passes.
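A minimal in-memory version of replacement selection, assuming the input contains no `None` values (used here as an end-of-input sentinel):

```python
import heapq

def replacement_selection(records, M):
    """Generate runs by replacement selection with a heap of size M.

    Returns runs as in-memory lists; on random input the average
    run length tends toward 2 * M.
    """
    it = iter(records)
    heap = []
    for x in it:                    # fill the heap with the first M elements
        heap.append(x)
        if len(heap) == M:
            break
    heapq.heapify(heap)
    runs, current, deferred = [], [], []
    while heap:
        x = heapq.heappop(heap)
        current.append(x)           # output x to the current run
        y = next(it, None)
        if y is not None:
            if y >= x:
                heapq.heappush(heap, y)   # y can extend the current run
            else:
                deferred.append(y)        # y must wait for the next run
        if not heap:                # current run is complete
            runs.append(current)
            current = []
            heap = deferred         # start the next run from deferred elements
            heapq.heapify(heap)
            deferred = []
    return runs
```

With M = 2 and the input `[5, 1, 6, 2, 7, 3, 8, 4]`, this yields two runs of length 4, twice the heap size.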
Correctness
Each run is sorted by internal sorting or by replacement selection.
The merge phase maintains the invariant:
The heap contains the smallest unmerged element from each run.

At each step, the smallest element across all runs is extracted and written to output. Since each run is sorted, and the heap always exposes the smallest candidate, the output is globally sorted.
The permutation condition holds because every input element is written exactly once during merging.
Stability
External merge sort is stable if:
runs are generated stably
merge uses stable tie-breaking

When equal keys appear from different runs, the merge step should preserve the order in which runs were created or use explicit tie-breaking keys.
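One way to get stable tie-breaking is to include the run index, which reflects run creation order, in the heap key, so equal keys are emitted in run order. A sketch, assuming records are `(key, payload)` pairs:

```python
import heapq

def stable_merge(runs):
    """Merge runs of (key, payload) records, breaking key ties by
    run creation order so equal keys keep their original order.
    """
    heap = []
    for i, run in enumerate(runs):
        if run:
            # (key, run_id, position) is a unique sort key, so the
            # payload itself is never compared.
            heapq.heappush(heap, (run[0][0], i, 0, run[0]))
    out = []
    while heap:
        _, i, pos, rec = heapq.heappop(heap)
        out.append(rec)
        if pos + 1 < len(runs[i]):
            nxt = runs[i][pos + 1]
            heapq.heappush(heap, (nxt[0], i, pos + 1, nxt))
    return out
```

Because runs generated stably preserve input order within a run, and earlier runs hold earlier input, this ordering keeps equal keys in their original relative order.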
Implementation Notes
Sequential access is critical. Always read and write data in blocks of size B.
Use buffering to overlap I/O and computation. Double buffering allows one buffer to be filled from disk while another is being processed.
Use binary files rather than text when performance matters. Parsing text can dominate runtime.
Choose k as large as memory allows. A higher merge degree reduces the number of passes and therefore reduces total I/O.
Practical Variants
Two-phase multiway merge sort is common when the number of runs is small enough to merge in one pass.
Polyphase merge uses uneven run sizes to reduce the number of temporary files.
Parallel external sorting distributes runs and merges across multiple disks or machines.
Common Bugs
A common bug is using buffers that are too small, which increases I/O overhead.
Another bug is performing random disk access. External sorting must use sequential access to be efficient.
A third bug is forgetting to flush output buffers, leading to incomplete output.
A fourth bug is not handling end-of-run correctly during merging, causing incorrect termination.
Takeaway
External sorting replaces CPU-bound work with I/O-efficient structure. It relies on run generation and multiway merging, and its performance depends on minimizing passes over the data and maximizing sequential access.