Skip to content

External Merge Sort

Sort data that does not fit in memory by dividing it into sorted runs and merging them using external storage.

External merge sort is a disk-based sorting algorithm designed for datasets that exceed main memory. It minimizes random I/O and relies on sequential reads and writes, which are efficient on disks and object storage.

You use it when the input size is much larger than RAM, for example multi-gigabyte or terabyte scale data.

Model

Assume:

  • Input size: NN elements
  • Memory capacity: MM elements
  • Block size: BB elements per I/O

The goal is to sort using minimal disk passes.

High-Level Idea

The algorithm has two phases:

  1. Run generation Split data into chunks that fit in memory, sort each chunk, and write sorted runs to disk.

  2. Merge phase Merge multiple sorted runs into one final sorted sequence.

Algorithm

Phase 1: Run Generation

Read chunks of size MM, sort in memory, and write back.

generate_runs(file):
    runs = []
    while not end_of_file:
        chunk = read_next_M_elements(file)
        sort(chunk)
        run_file = write_to_disk(chunk)
        runs.append(run_file)
    return runs

Phase 2: K-way Merge

Merge multiple runs simultaneously using a min-heap.

merge_runs(runs):
    create min_heap
    open all run files

    for each run:
        read first element and push (value, run_id) into heap

    while heap not empty:
        (value, r) = extract_min(heap)
        output value

        if run r has more elements:
            read next element from r
            push into heap

Example

Suppose:

  • Memory holds 3 elements
  • Input: [8, 3, 5, 1, 9, 2, 7]

Step 1: Runs

Chunks:

  • [8, 3, 5] → [3, 5, 8]
  • [1, 9, 2] → [1, 2, 9]
  • [7] → [7]

Step 2: Merge

Merge:

[3,5,8], [1,2,9], [7]

Result:

[1,2,3,5,7,8,9]

Complexity

I/O Complexity

Number of passes:

O(logkNM) O\left(\log_k \frac{N}{M}\right)

where kk is number of runs merged at once.

Total I/O:

O(NBlogkNM) O\left(\frac{N}{B} \log_k \frac{N}{M}\right)

CPU Complexity

Each element participates in heap operations:

O(Nlogk) O(N \log k)

Design Choices

parametereffect
run size MMlarger runs reduce merge passes
merge degree kklarger kk reduces passes but increases heap cost
block size BBlarger blocks reduce I/O overhead

Optimizations

  • Replacement selection Produces runs larger than memory, often about 2M2M.

  • Buffered I/O Use read/write buffers to reduce system calls.

  • Multiway merge Merge many runs at once instead of pairwise merging.

  • Loser tree Replace heap with tournament tree for faster merging.

When to Use

External merge sort is appropriate when:

  • data does not fit in RAM
  • storage is disk, SSD, or object store
  • sequential I/O is much cheaper than random access

It is the standard sorting method in databases, large-scale data processing systems, and distributed frameworks.

Implementation (simplified Python)

import heapq

def merge_sorted_files(files):
    heap = []
    iters = [iter(f) for f in files]

    for i, it in enumerate(iters):
        try:
            value = next(it)
            heapq.heappush(heap, (value, i))
        except StopIteration:
            pass

    result = []

    while heap:
        value, i = heapq.heappop(heap)
        result.append(value)
        try:
            nxt = next(iters[i])
            heapq.heappush(heap, (nxt, i))
        except StopIteration:
            pass

    return result

Notes

External merge sort underlies many systems:

  • database sort operators
  • MapReduce shuffle phase
  • large-scale log processing

The key principle is simple: keep memory usage bounded, push bulk data to sequential disk operations, and merge efficiently.