# Parallel External Merge Sort

Parallel external merge sort extends external merge sort by using multiple processors, threads, and storage devices to increase throughput. It parallelizes both run generation and merging while keeping I/O mostly sequential.

The algorithm is used in large-scale systems where sorting terabytes of data requires distributing work across CPUs and disks.

## Problem

Given a dataset much larger than memory, sort it efficiently using multiple processors and parallel I/O.

The goal is to reduce wall-clock time by exploiting concurrency while maintaining the same correctness guarantees as external merge sort.

## Core Idea

The algorithm has the same structure as external merge sort, but both phases are parallelized:

1. **Parallel run generation**
   Split the input into chunks and sort them independently on different workers.

2. **Parallel merging**
   Merge sorted runs in parallel, either by partitioning the key space or by dividing merge tasks.

```text id="f9u2sd"
parallel_external_merge_sort(input, chunk_size, workers):
    runs = parallel_generate_runs(input, chunk_size, workers)
    return parallel_merge_runs(runs, workers)
```

## Phase 1: Parallel Run Generation

Each worker processes a disjoint portion of the input.

```text id="9o2n1c"
parallel_generate_runs(input, chunk_size, workers):
    partition input into chunks
    assign chunks to workers

    in parallel:
        for each chunk:
            sort chunk
            write run to disk

    return all runs
```

Each worker performs in-memory sorting independently.
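A minimal in-memory sketch of this phase, assuming the input is a Python list; a thread pool stands in for the worker pool (a real implementation would use processes and write each run to disk), and `generate_runs_parallel` is an illustrative name:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_runs_parallel(data, chunk_size, workers=2):
    # Partition the input into disjoint chunks of at most chunk_size records.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Sort the chunks concurrently; each sorted chunk is one run.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sorted, chunks))
```

Because the chunks are disjoint, the workers need no coordination during this phase.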

## Phase 2: Parallel Merge

There are several strategies for parallel merging.

### Range Partitioning

Divide the key space into disjoint ranges and assign each range to a worker.

```text id="2o8vsz"
parallel_merge_by_range(runs, workers):
    determine splitters for key ranges

    in parallel:
        for each worker:
            merge elements belonging to its range
```
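A sketch of range-partitioned merging, assuming the splitters are already chosen (in practice they would come from sampling the runs); `slice_run` and `merge_by_range` are illustrative names:

```python
import heapq
from bisect import bisect_left

def slice_run(run, lo, hi):
    # Elements of a sorted run falling in [lo, hi), found by binary search.
    start = 0 if lo is None else bisect_left(run, lo)
    end = len(run) if hi is None else bisect_left(run, hi)
    return run[start:end]

def merge_by_range(runs, splitters):
    # Ranges: (-inf, s0), [s0, s1), ..., [s_last, +inf).
    edges = [None] + list(splitters) + [None]
    pieces = []
    for lo, hi in zip(edges, edges[1:]):
        # Each range is independent, so each could go to its own worker.
        pieces.append(list(heapq.merge(*(slice_run(r, lo, hi) for r in runs))))
    # Concatenating per-range outputs in range order yields the sorted whole.
    return [x for piece in pieces for x in piece]
```

Since the ranges are disjoint, workers need no coordination beyond agreeing on the splitters.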

### Merge Partitioning

Split merging tasks directly:

```text id="q4k9z1"
parallel_merge(runs):
    while more than one run:
        in parallel:
            merge pairs or groups of runs
```
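A sequential model of this strategy (each pairwise merge within a level is independent and could run on a different worker; `pairwise_merge` is an illustrative name):

```python
import heapq

def pairwise_merge(runs):
    # Each level merges pairs of runs, halving the run count;
    # pairs within a level could be handled by separate workers.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + 2]))
                for i in range(0, len(runs), 2)]
    return runs[0] if runs else []
```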

## Example

Suppose:

* input size: 12 records
* memory per worker: 3 records
* workers: 2

### Run Generation

Worker 1:

* [8, 3, 5] → [3, 5, 8]
* [1, 9, 2] → [1, 2, 9]

Worker 2:

* [7, 4, 6] → [4, 6, 7]
* [0, 11, 10] → [0, 10, 11]

### Merge

Workers merge runs in parallel, either pairwise or by range, producing the final sorted output.
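Tracing the pairwise strategy on these runs gives a two-level merge, modeled here in memory:

```python
import heapq

runs = [[3, 5, 8], [1, 2, 9], [4, 6, 7], [0, 10, 11]]

# Level 1: two workers, each merging one pair of runs.
level1 = [list(heapq.merge(runs[0], runs[1])),   # [1, 2, 3, 5, 8, 9]
          list(heapq.merge(runs[2], runs[3]))]   # [0, 4, 6, 7, 10, 11]

# Level 2: a final merge of the two intermediate runs.
result = list(heapq.merge(*level1))              # 0 through 11, sorted
```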

## Correctness

Each worker produces sorted runs. Parallel merging combines sorted runs using standard merge rules. Each merge operation preserves sorted order by always selecting the smallest available element.

Parallel execution does not change the logical behavior. It only changes when and where comparisons occur. As long as each merge step respects ordering, the final output is sorted and contains all elements.
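These guarantees are easy to spot-check with a small property test (the chunk size of 8 and input size of 50 are arbitrary):

```python
import heapq
import random

random.seed(0)  # deterministic input for the check
data = [random.randint(0, 100) for _ in range(50)]

# Sorted runs, as run generation would produce them.
runs = [sorted(data[i:i + 8]) for i in range(0, len(data), 8)]

# Merging the runs must reproduce a full sort of the input.
out = list(heapq.merge(*runs))
assert out == sorted(data)
```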

## Complexity

Let:

* $N$ be the number of elements
* $P$ be the number of workers
* $M$ be the memory (in records) per worker
* $B$ be the block size (records transferred per I/O)
* $R = \lceil N / M \rceil$ be the number of initial runs
* $k$ be the merge fan-in

Run generation time:

$$
O\left(\frac{N}{P} \log M\right)
$$

Merge time per level:

$$
O\left(\frac{N}{P} \log k\right)
$$

I/O complexity remains:

$$
O\left(\frac{N}{B} \log_k R\right)
$$

but wall-clock time decreases with parallel I/O and computation.
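As a concrete check of the pass count (the figures are illustrative): with $N = 10^9$, $M = 10^6$, there are $R = 1000$ initial runs, and at fan-in $k = 10$ merging takes $\lceil \log_{10} 1000 \rceil = 3$ passes. A small helper makes the arithmetic explicit (`merge_passes` is a hypothetical name):

```python
def merge_passes(N, M, k):
    # R = ceil(N / M) initial runs; each pass shrinks the run count by k.
    R = -(-N // M)
    passes = 0
    while R > 1:
        R = -(-R // k)  # ceil division: runs remaining after one merge pass
        passes += 1
    return passes
```

For example, `merge_passes(10**9, 10**6, 10)` evaluates to 3.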

## Design Challenges

| issue             | effect                                    |
| ----------------- | ----------------------------------------- |
| load balancing    | uneven chunk sizes slow workers           |
| I/O contention    | multiple workers compete for disks        |
| synchronization   | merging phases require coordination       |
| skewed keys       | range partitioning may be unbalanced      |
| memory allocation | each worker needs sufficient buffer space |

## When to Use

Use parallel external merge sort when:

* data is very large
* multiple CPU cores or nodes are available
* storage supports parallel access
* throughput is more important than simplicity
* sorting is a major bottleneck

It is common in distributed systems, big data frameworks, and high-performance databases.

## Implementation Sketch

```python id="l7k2qs"
import heapq

def merge_runs(runs):
    """k-way merge of sorted runs using a min-heap of (value, run index)."""
    heap = []
    iters = [iter(run) for run in runs]

    # Seed the heap with the first element of each non-empty run.
    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass

    out = []

    # Repeatedly emit the smallest element, then refill from the same run.
    while heap:
        val, i = heapq.heappop(heap)
        out.append(val)
        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

    return out
```

```python id="p3z9rt"
def parallel_generate_runs(data, chunk_size):
    """Model of run generation: each chunk becomes one sorted run.

    Sequential here for clarity; in practice each chunk would be
    sorted by a separate worker, as in Phase 1.
    """
    runs = []
    for i in range(0, len(data), chunk_size):
        runs.append(sorted(data[i:i + chunk_size]))
    return runs
```

```python id="z6c8xv"
def parallel_external_merge_sort_model(data, chunk_size):
    """End-to-end model: generate sorted runs, then merge them all."""
    runs = parallel_generate_runs(data, chunk_size)
    return merge_runs(runs)
```

## Notes

Parallel external merge sort preserves the structure of external merge sort but distributes work across resources. The main gains come from parallel run generation and overlapping computation with I/O.

