# External Merge Sort

External merge sort is a disk-based sorting algorithm designed for datasets that exceed main memory. It minimizes random I/O and relies on sequential reads and writes, which are efficient on disks and object storage.

Use it when the input is much larger than available RAM, for example multi-gigabyte or terabyte-scale datasets.

## Model

Assume:

* Input size: $N$ elements
* Memory capacity: $M$ elements
* Block size: $B$ elements per I/O

The goal is to sort using minimal disk passes.

## High-Level Idea

The algorithm has two phases:

1. **Run generation**
   Split data into chunks that fit in memory, sort each chunk, and write sorted runs to disk.

2. **Merge phase**
   Merge multiple sorted runs into one final sorted sequence.

## Algorithm

### Phase 1: Run Generation

Read chunks of up to $M$ elements, sort each chunk in memory, and write the sorted run to disk.

```text
generate_runs(file):
    runs = []
    while not end_of_file:
        chunk = read_next_M_elements(file)
        sort(chunk)
        run_file = write_to_disk(chunk)
        runs.append(run_file)
    return runs
```
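A minimal Python sketch of this phase, assuming the input is a text file with one integer per line and `chunk_size` plays the role of $M$ (both the file format and the helper name are illustrative, not from the text):

```python
import tempfile

def generate_runs(path, chunk_size):
    """Split the input file into sorted runs of at most chunk_size elements."""
    runs = []
    with open(path) as f:
        while True:
            # Read the next chunk of up to chunk_size values.
            chunk = [int(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()  # in-memory sort of one run
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(f"{v}\n" for v in chunk)
            run.close()
            runs.append(run.name)
    return runs
```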

### Phase 2: K-way Merge

Merge multiple runs simultaneously using a min-heap.

```text
merge_runs(runs):
    create min_heap
    open all run files

    for each run:
        read first element and push (value, run_id) into heap

    while heap not empty:
        (value, r) = extract_min(heap)
        output value

        if run r has more elements:
            read next element from r
            push into heap
```

## Example

Suppose:

* Memory holds 3 elements
* Input: [8, 3, 5, 1, 9, 2, 7]

### Step 1: Runs

Chunks:

* [8, 3, 5] → [3, 5, 8]
* [1, 9, 2] → [1, 2, 9]
* [7] → [7]

### Step 2: Merge

Merge:

[3,5,8], [1,2,9], [7]

Result:

[1,2,3,5,7,8,9]
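The merge step above can be reproduced directly with the standard library, since `heapq.merge` implements exactly this heap-based k-way merge:

```python
import heapq

# heapq.merge streams one element at a time from each sorted run,
# always yielding the smallest value currently at a run head.
runs = [[3, 5, 8], [1, 2, 9], [7]]
merged = list(heapq.merge(*runs))
print(merged)  # [1, 2, 3, 5, 7, 8, 9]
```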

## Complexity

### I/O Complexity

Number of passes:

$$
O\left(\log_k \frac{N}{M}\right)
$$

where $k$ is the number of runs merged at once.

Total I/O:

$$
O\left(\frac{N}{B} \log_k \frac{N}{M}\right)
$$

### CPU Complexity

During each merge pass, every element passes through a heap of size $k$, costing $O(\log k)$ per element:

$$
O(N \log k)
$$
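To make the pass count concrete, here is a back-of-the-envelope calculation with illustrative figures (assumed, not from the text): a 1 TiB input, 1 GiB of sort memory, and a merge fan-in of $k = 100$:

```python
import math

N = 2**40   # input size in bytes (1 TiB)
M = 2**30   # memory available for sorting, in bytes (1 GiB)
k = 100     # runs merged per pass

initial_runs = math.ceil(N / M)                      # 1024 sorted runs
merge_passes = math.ceil(math.log(initial_runs, k))  # log_100(1024) rounds up to 2
print(initial_runs, merge_passes)  # 1024 2
```

With a large enough fan-in, almost any practical input sorts in one run-generation pass plus one or two merge passes.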

## Design Choices

| parameter        | effect                                            |
| ---------------- | ------------------------------------------------- |
| run size $M$     | larger runs reduce merge passes                   |
| merge degree $k$ | larger $k$ reduces passes but increases heap cost |
| block size $B$   | larger blocks reduce I/O overhead                 |

## Optimizations

* **Replacement selection**
  Produces runs larger than memory, often about $2M$.

* **Buffered I/O**
  Use read/write buffers to reduce system calls.

* **Multiway merge**
  Merge many runs at once instead of pairwise merging.

* **Loser tree**
  Replace heap with tournament tree for faster merging.
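Replacement selection, the first optimization above, can be sketched with a heap. This is a simplified in-memory version (the real algorithm streams from and to disk); the function name and list-based I/O are illustrative:

```python
import heapq

def replacement_selection(values, memory_size):
    """Split `values` into sorted runs via replacement selection.

    On random input, runs average about twice `memory_size`.
    """
    it = iter(values)
    heap = [v for _, v in zip(range(memory_size), it)]  # fill memory
    heapq.heapify(heap)
    pending = []            # elements deferred to the next run
    runs, current = [], []
    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)        # emit to the current run
        nxt = next(it, None)
        if nxt is not None:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)  # still fits in this run
            else:
                pending.append(nxt)        # must wait for the next run
        if not heap:                       # current run is finished
            runs.append(current)
            current = []
            heap = pending
            heapq.heapify(heap)
            pending = []
    return runs
```

On the earlier example with memory for 3 elements, this produces `[[3, 5, 8, 9], [1, 2, 7]]`: the first run holds 4 elements, more than would fit in memory at once.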

## When to Use

External merge sort is appropriate when:

* data does not fit in RAM
* storage is disk, SSD, or object store
* sequential I/O is much cheaper than random access

It is the standard sorting method in databases, large-scale data processing systems, and distributed frameworks.

## Implementation (simplified Python)

```python
import heapq

def merge_sorted_files(files):
    """K-way merge of sorted iterables into one sorted list."""
    heap = []
    iters = [iter(f) for f in files]

    # Seed the heap with the first element of each run.
    # The run index i breaks ties between equal values.
    for i, it in enumerate(iters):
        try:
            value = next(it)
            heapq.heappush(heap, (value, i))
        except StopIteration:
            pass  # empty run

    result = []

    while heap:
        # The heap root is the smallest value across all run heads.
        value, i = heapq.heappop(heap)
        result.append(value)
        try:
            # Refill from the run that just produced a value.
            nxt = next(iters[i])
            heapq.heappush(heap, (nxt, i))
        except StopIteration:
            pass  # run i is exhausted

    return result

# Works on any iterables of sorted values, e.g.
# merge_sorted_files([[3, 5, 8], [1, 2, 9], [7]])
```

## Notes

External merge sort underlies many systems:

* database sort operators
* MapReduce shuffle phase
* large-scale log processing

The key principle is simple: keep memory usage bounded, push bulk data to sequential disk operations, and merge efficiently.

