Sort datasets that exceed main memory by organizing the algorithm around sequential disk access, merge passes, and minimizing I/O operations.
External sorting handles datasets that do not fit in main memory. The cost model shifts from CPU time to I/O. The goal is to minimize disk reads and writes by organizing the algorithm around sequential access and large transfers.
Problem
You have a dataset too large to fit into memory. You want to sort it using limited RAM and disk storage.
Model
Assume:
N = number of records
M = memory capacity in records
B = block size in records per disk transfer

The dominant cost is the number of block transfers between disk and memory.
Solution Overview
External sorting uses two phases:
1. Run generation
2. Multiway merge

The algorithm produces sorted chunks that fit in memory, writes them to disk, and then merges those chunks into a single sorted file.
Phase 1. Run Generation
Read up to M records into memory, sort them using an internal sorting algorithm, and write the sorted block to disk as a run.
generate_runs(input):
    while not end of input:
        read up to M records into buffer
        sort buffer in memory
        write buffer as a run to disk

After this phase, you have approximately:
R = ceil(N / M)

sorted runs on disk.
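The run-generation phase can be sketched in Python. For illustration the runs are returned as in-memory lists rather than written to disk; `generate_runs` and its parameter names are chosen here for the example, not taken from any library:

```python
def generate_runs(records, M):
    """Split the input into sorted runs of at most M records.

    Runs are returned as in-memory lists for illustration;
    a real implementation would write each run to disk.
    """
    runs = []
    for start in range(0, len(records), M):
        buffer = records[start:start + M]  # read up to M records
        buffer.sort()                      # internal sort
        runs.append(buffer)                # "write" the run
    return runs
```

With N = 10 records and M = 4, this produces ceil(10 / 4) = 3 runs, matching the formula above.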
Phase 2. Multiway Merge
Merge multiple runs at once using a priority queue.
merge_runs(runs):
    open each run with a buffer
    heap = empty min-heap
    for each run i:
        read first block
        push (value, run_id) into heap
    while heap not empty:
        (value, i) = pop heap
        write value to output
        if run i has more data:
            read next value from run i
            push (value, i) into heap

Each run contributes its smallest remaining element to the heap. The heap ensures the next output element is globally smallest.
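The merge pseudocode maps directly onto Python's `heapq` module. This sketch merges in-memory lists; a disk-based version would refill a per-run buffer block by block instead of indexing into a list:

```python
import heapq

def merge_runs(runs):
    """k-way merge of sorted runs using a min-heap.

    Each heap entry is (value, run_id, position); run_id breaks
    ties so positions are never compared on equal values.
    """
    heap = []
    for i, run in enumerate(runs):
        if run:  # seed the heap with each run's first element
            heapq.heappush(heap, (run[0], i, 0))
    output = []
    while heap:
        value, i, pos = heapq.heappop(heap)
        output.append(value)
        if pos + 1 < len(runs[i]):  # run i has more data
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return output
```

Each pop removes the globally smallest remaining element, and each push replaces it with the next element from the same run.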
Merge Degree
The number of runs merged at once depends on memory.
If each run being merged needs one input buffer of B records, and one buffer is reserved for output, the number of runs merged in one pass is approximately:

k ≈ M / B - 1

A higher merge degree reduces the number of merge passes.
Number of Passes
The total number of merge passes is:
ceil(log_k R)

where:

R = number of initial runs
k = merge degree

Each pass reads and writes the entire dataset once.
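The pass-count arithmetic can be computed directly from these definitions. The sketch below assumes one output buffer is reserved (so k = M/B - 1) and clamps k to at least 2; `merge_passes` is an illustrative name, not a standard function:

```python
import math

def merge_passes(N, M, B):
    """Merge passes needed for N records, M memory, B block size."""
    R = math.ceil(N / M)       # initial runs from run generation
    k = max(2, M // B - 1)     # merge degree: one buffer kept for output
    if R <= 1:
        return 0               # a single run needs no merging
    return math.ceil(math.log(R, k))
```

For example, with N = 10^9 records, M = 10^4 records of memory, and B = 10^3 records per block, run generation leaves R = 10^5 runs, the merge degree is k = 9, and ceil(log_9 10^5) = 6 passes are needed.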
Total I/O Cost
Each pass processes all N records:
O(N / B) block transfers per pass

Total I/O cost:

O((N / B) * log_k (N / M))

The algorithm is efficient when k is large, which requires large memory relative to block size.
Replacement Selection
Run generation can be improved using replacement selection. Instead of producing runs of size at most M, it can produce runs of average size about 2M.
The idea is to maintain a heap of size M:
replacement_selection(input):
    heap = first M elements
    last_output = -infinity
    while heap not empty:
        x = extract_min(heap)
        output x
        last_output = x
        if input has next value y:
            if y >= last_output:
                insert y into heap
            else:
                mark y for next run
        when heap is empty, close the current run and rebuild the heap from the deferred elements

Elements that would break the sorted order are deferred to the next run. This increases average run length and reduces the number of merge passes.
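A minimal in-memory version of replacement selection, assuming the input contains no `None` values (used here as an end-of-input sentinel):

```python
import heapq

def replacement_selection(records, M):
    """Generate runs by replacement selection with a heap of size M.

    Returns runs as in-memory lists; on random input the average
    run length tends toward 2 * M.
    """
    it = iter(records)
    heap = []
    for x in it:                    # fill the heap with the first M elements
        heap.append(x)
        if len(heap) == M:
            break
    heapq.heapify(heap)
    runs, current, deferred = [], [], []
    while heap:
        x = heapq.heappop(heap)
        current.append(x)           # output x to the current run
        y = next(it, None)
        if y is not None:
            if y >= x:
                heapq.heappush(heap, y)   # y can extend the current run
            else:
                deferred.append(y)        # y must wait for the next run
        if not heap:                # current run is complete
            runs.append(current)
            current = []
            heap = deferred         # start the next run from deferred elements
            heapq.heapify(heap)
            deferred = []
    return runs
```

With M = 2 and the input `[5, 1, 6, 2, 7, 3, 8, 4]`, this yields two runs of length 4, twice the heap size.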
Correctness
Each run is sorted by internal sorting or by replacement selection.
The merge phase maintains the invariant:
The heap contains the smallest unmerged element from each run.

At each step, the smallest element across all runs is extracted and written to output. Since each run is sorted, and the heap always exposes the smallest candidate, the output is globally sorted.
The permutation condition holds because every input element is written exactly once during merging.
Stability
External merge sort is stable if:
runs are generated stably
merge uses stable tie-breaking

When equal keys appear from different runs, the merge step should preserve the order in which runs were created or use explicit tie-breaking keys.
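One way to get stable tie-breaking is to include the run index, which reflects run creation order, in the heap key, so equal keys are emitted in run order. A sketch, assuming records are `(key, payload)` pairs:

```python
import heapq

def stable_merge(runs):
    """Merge runs of (key, payload) records, breaking key ties by
    run creation order so equal keys keep their original order.
    """
    heap = []
    for i, run in enumerate(runs):
        if run:
            # (key, run_id, position) is a unique sort key, so the
            # payload itself is never compared.
            heapq.heappush(heap, (run[0][0], i, 0, run[0]))
    out = []
    while heap:
        _, i, pos, rec = heapq.heappop(heap)
        out.append(rec)
        if pos + 1 < len(runs[i]):
            nxt = runs[i][pos + 1]
            heapq.heappush(heap, (nxt[0], i, pos + 1, nxt))
    return out
```

Because runs generated stably preserve input order within a run, and earlier runs hold earlier input, this ordering keeps equal keys in their original relative order.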
Implementation Notes
Sequential access is critical. Always read and write data in blocks of size B.
Use buffering to overlap I/O and computation. Double buffering allows one buffer to be filled from disk while another is being processed.
Use binary files rather than text when performance matters. Parsing text can dominate runtime.
Choose k as large as memory allows. A higher merge degree reduces the number of passes and therefore reduces total I/O.
Practical Variants
Two-phase multiway merge sort is common when the number of runs is small enough to merge in one pass.
Polyphase merge uses uneven run sizes to reduce the number of temporary files.
Parallel external sorting distributes runs and merges across multiple disks or machines.
Common Bugs
A common bug is using buffers that are too small, which increases I/O overhead.
Another bug is performing random disk access. External sorting must use sequential access to be efficient.
A third bug is forgetting to flush output buffers, leading to incomplete output.
A fourth bug is not handling end-of-run correctly during merging, causing incorrect termination.
Takeaway
External sorting replaces CPU-bound work with I/O-efficient structure. It relies on run generation and multiway merging, and its performance depends on minimizing passes over the data and maximizing sequential access.