Skip to content

External Replacement Selection

Generate long initial runs for external sorting using replacement selection with disk-based streams.

External replacement selection is the disk-oriented version of replacement selection. It generates long sorted runs from data that does not fit in memory, forming the first phase of external merge sort.

You use it when sorting large datasets stored on disk.

Problem

Given a dataset too large to fit into memory and a memory buffer of size ( M ), generate sorted runs on disk such that:

  • each run is sorted
  • runs are as long as possible
  • disk I/O is minimized

Idea

Maintain a min heap in memory. Repeatedly output the smallest element and replace it with the next input element. If the new element is smaller than the last output, defer it to the next run.

This produces runs that are typically longer than the memory size.

Algorithm

external_replacement_selection(stream, M):
    heap = load_first_M_elements(stream)
    build_min_heap(heap)

    last_output = -infinity
    next_run = []

    while heap not empty:
        x = heap.pop()
        write_to_disk(x)
        last_output = x

        if stream not empty:
            y = read_next(stream)

            if y >= last_output:
                heap.push(y)
            else:
                next_run.append(y)

        if heap empty:
            write_run_boundary()
            heap = next_run
            build_min_heap(heap)
            next_run = []
            last_output = -infinity

Example

Memory size:

[ M = 4 ]

Input stream:

[ [6, 2, 9, 3, 1, 8, 4, 7, 5] ]

First run might become:

[ [2, 3, 6, 8, 9] ]

Deferred elements:

[ [1, 4, 7, 5] ]

Next run processes deferred elements.

Correctness

The heap always contains candidates for the current run. Any element smaller than the last output would violate sorted order, so it is deferred.

Thus:

  • each run is sorted
  • all elements are eventually included in some run

Run Length

Expected run length under random input:

[ \approx 2M ]

This reduces the number of runs and improves merge efficiency.

Complexity

operationcost
heap operations( O(\log M) )
total time( O(n \log M) )
disk I/Osequential reads and writes

Space:

[ O(M) ]

Practical Considerations

  • disk access dominates cost, so sequential I/O is critical
  • buffering should be used for reads and writes
  • run boundaries must be recorded for merging

When to Use

Use external replacement selection when:

  • data does not fit in memory
  • building initial runs for external merge sort
  • minimizing number of merge passes matters

It is a core technique in database systems and large-scale data processing.