Parallel External Merge Sort

Parallel external merge sort extends external merge sort by using multiple processors, threads, and storage devices to increase throughput. It parallelizes both run generation and merging while keeping I/O mostly sequential.

The algorithm is used in large-scale systems where sorting terabytes of data requires distributing work across CPUs and disks.

Problem

Given a dataset much larger than memory, sort it efficiently using multiple processors and parallel I/O.

The goal is to reduce wall-clock time by exploiting concurrency while maintaining the same correctness guarantees as external merge sort.

Core Idea

The algorithm has the same structure as external merge sort, but both phases are parallelized:

Parallel run generation Split the input into chunks and sort them independently on different workers.
Parallel merging Merge sorted runs in parallel, either by partitioning the key space or by dividing merge tasks.

parallel_external_merge_sort(input, chunk_size, workers):
    runs = parallel_generate_runs(input, chunk_size, workers)
    return parallel_merge_runs(runs, workers)

Phase 1: Parallel Run Generation

Each worker processes a disjoint portion of the input.

parallel_generate_runs(input, chunk_size, workers):
    partition input into chunks
    assign chunks to workers

    in parallel:
        for each chunk:
            sort chunk
            write run to disk

    return all runs

Each worker performs in-memory sorting independently.

Phase 2: Parallel Merge

There are several strategies for parallel merging.

Range Partitioning

Divide the key space into disjoint ranges and assign each range to a worker.

parallel_merge_by_range(runs, workers):
    determine splitters for key ranges

    in parallel:
        for each worker:
            merge elements belonging to its range

Merge Partitioning

Split merging tasks directly:

parallel_merge(runs):
    while more than one run:
        in parallel:
            merge pairs or groups of runs

Example

Suppose:

input size: 12 records
memory per worker: 3 records
workers: 2

Run Generation

Worker 1:

[8, 3, 5] → [3, 5, 8]
[1, 9, 2] → [1, 2, 9]

Worker 2:

[7, 4, 6] → [4, 6, 7]
[0, 11, 10] → [0, 10, 11]

Merge

Workers merge runs in parallel, either pairwise or by range, producing the final sorted output.

Correctness

Each worker produces sorted runs. Parallel merging combines sorted runs using standard merge rules. Each merge operation preserves sorted order by always selecting the smallest available element.

Parallel execution does not change the logical behavior. It only changes when and where comparisons occur. As long as each merge step respects ordering, the final output is sorted and contains all elements.

Complexity

Let:

$N$ be the number of elements
$P$ be the number of workers
$M$ be memory per worker
$R = \lceil N / M \rceil$ runs
$k$ be merge fan-in

Run generation time:

O\left(\frac{N}{P} \log M\right)

Merge time per level:

O\left(\frac{N}{P} \log k\right)

I/O complexity remains:

O\left(\frac{N}{B} \log_k R\right)

but wall-clock time decreases with parallel I/O and computation.

Design Challenges

issue	effect
load balancing	uneven chunk sizes slow workers
I/O contention	multiple workers compete for disks
synchronization	merging phases require coordination
skewed keys	range partitioning may be unbalanced
memory allocation	each worker needs sufficient buffer space

When to Use

Use parallel external merge sort when:

data is very large
multiple CPU cores or nodes are available
storage supports parallel access
throughput is more important than simplicity
sorting is a major bottleneck

It is common in distributed systems, big data frameworks, and high-performance databases.

Implementation Sketch

import heapq

def merge_runs(runs):
    heap = []
    iters = [iter(run) for run in runs]

    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass

    out = []

    while heap:
        val, i = heapq.heappop(heap)
        out.append(val)
        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

    return out

def parallel_generate_runs(data, chunk_size):
    runs = []
    for i in range(0, len(data), chunk_size):
        runs.append(sorted(data[i:i + chunk_size]))
    return runs

def parallel_external_merge_sort_model(data, chunk_size):
    runs = parallel_generate_runs(data, chunk_size)
    return merge_runs(runs)

Notes

Parallel external merge sort preserves the structure of external merge sort but distributes work across resources. The main gains come from parallel run generation and overlapping computation with I/O.