External sorting algorithm that performs run generation and merging concurrently across multiple processors and disks.
Parallel external merge sort extends external merge sort by using multiple processors, threads, and storage devices to increase throughput. It parallelizes both run generation and merging while keeping I/O mostly sequential.
The algorithm is used in large-scale systems where sorting terabytes of data requires distributing work across CPUs and disks.
Problem
Given a dataset much larger than memory, sort it efficiently using multiple processors and parallel I/O.
The goal is to reduce wall-clock time by exploiting concurrency while maintaining the same correctness guarantees as external merge sort.
Core Idea
The algorithm has the same structure as external merge sort, but both phases are parallelized:
Parallel run generation Split the input into chunks and sort them independently on different workers.
Parallel merging Merge sorted runs in parallel, either by partitioning the key space or by dividing merge tasks.
parallel_external_merge_sort(input, chunk_size, workers):
runs = parallel_generate_runs(input, chunk_size, workers)
return parallel_merge_runs(runs, workers)Phase 1: Parallel Run Generation
Each worker processes a disjoint portion of the input.
parallel_generate_runs(input, chunk_size, workers):
partition input into chunks
assign chunks to workers
in parallel:
for each chunk:
sort chunk
write run to disk
return all runsEach worker performs in-memory sorting independently.
Phase 2: Parallel Merge
There are several strategies for parallel merging.
Range Partitioning
Divide the key space into disjoint ranges and assign each range to a worker.
parallel_merge_by_range(runs, workers):
determine splitters for key ranges
in parallel:
for each worker:
merge elements belonging to its rangeMerge Partitioning
Split merging tasks directly:
parallel_merge(runs):
while more than one run:
in parallel:
merge pairs or groups of runsExample
Suppose:
- input size: 12 records
- memory per worker: 3 records
- workers: 2
Run Generation
Worker 1:
- [8, 3, 5] → [3, 5, 8]
- [1, 9, 2] → [1, 2, 9]
Worker 2:
- [7, 4, 6] → [4, 6, 7]
- [0, 11, 10] → [0, 10, 11]
Merge
Workers merge runs in parallel, either pairwise or by range, producing the final sorted output.
Correctness
Each worker produces sorted runs. Parallel merging combines sorted runs using standard merge rules. Each merge operation preserves sorted order by always selecting the smallest available element.
Parallel execution does not change the logical behavior. It only changes when and where comparisons occur. As long as each merge step respects ordering, the final output is sorted and contains all elements.
Complexity
Let:
- be the number of elements
- be the number of workers
- be memory per worker
- runs
- be merge fan-in
Run generation time:
Merge time per level:
I/O complexity remains:
but wall-clock time decreases with parallel I/O and computation.
Design Challenges
| issue | effect |
|---|---|
| load balancing | uneven chunk sizes slow workers |
| I/O contention | multiple workers compete for disks |
| synchronization | merging phases require coordination |
| skewed keys | range partitioning may be unbalanced |
| memory allocation | each worker needs sufficient buffer space |
When to Use
Use parallel external merge sort when:
- data is very large
- multiple CPU cores or nodes are available
- storage supports parallel access
- throughput is more important than simplicity
- sorting is a major bottleneck
It is common in distributed systems, big data frameworks, and high-performance databases.
Implementation Sketch
import heapq
def merge_runs(runs):
heap = []
iters = [iter(run) for run in runs]
for i, it in enumerate(iters):
try:
heapq.heappush(heap, (next(it), i))
except StopIteration:
pass
out = []
while heap:
val, i = heapq.heappop(heap)
out.append(val)
try:
heapq.heappush(heap, (next(iters[i]), i))
except StopIteration:
pass
return outdef parallel_generate_runs(data, chunk_size):
runs = []
for i in range(0, len(data), chunk_size):
runs.append(sorted(data[i:i + chunk_size]))
return runsdef parallel_external_merge_sort_model(data, chunk_size):
runs = parallel_generate_runs(data, chunk_size)
return merge_runs(runs)Notes
Parallel external merge sort preserves the structure of external merge sort but distributes work across resources. The main gains come from parallel run generation and overlapping computation with I/O.