Skip to content

Replacement Selection

Generate long sorted runs from a stream using a heap, commonly used in external sorting.

Replacement selection is a run generation technique used in external sorting. It produces sorted runs that are often longer than the available memory size by using a priority queue.

You use it when sorting data that does not fit in memory.

Problem

Given a stream of elements and limited memory of size ( M ), produce sorted runs on disk such that:

  • each run is sorted
  • runs are as long as possible

These runs are later merged.

Idea

Maintain a min heap of size ( M ). At each step:

  • output the smallest element
  • replace it with a new element from the input
  • if the new element is smaller than the last output, defer it to the next run

This allows runs to grow beyond ( M ).

Algorithm

replacement_selection(stream, M):
    heap = read_first_M_elements(stream)
    build_min_heap(heap)

    last_output = -infinity
    next_run = []

    while heap not empty:
        x = heap.pop()
        output(x)
        last_output = x

        if stream not empty:
            y = next(stream)

            if y >= last_output:
                heap.push(y)
            else:
                next_run.append(y)

        if heap empty:
            write_run_boundary()
            heap = next_run
            build_min_heap(heap)
            next_run = []
            last_output = -infinity

Example

Memory size ( M = 3 )

Input:

( [5, 2, 8, 1, 7, 3, 6, 4] )

Process:

  • initial heap: [2, 5, 8]
  • output 2 → insert 1 (deferred)
  • output 5 → insert 7
  • output 7 → insert 3 (deferred)
  • output 8 → insert 6 (deferred)

First run:

( [2, 5, 7, 8] )

Deferred elements form next run.

Correctness

The heap always contains candidates that can extend the current run. Any element smaller than the last output cannot belong to the current sorted sequence, so it is deferred. This guarantees that each run remains sorted.

Run Length

Expected run length:

[ \approx 2M ]

under random input assumptions. This is significantly larger than memory size.

Complexity

operationcost
heap operations( O(\log M) )
total( O(n \log M) )

Space:

[ O(M) ]

When to Use

Replacement selection is useful when:

  • sorting very large datasets
  • building initial runs for external merge sort
  • disk I/O dominates cost

It reduces the number of runs, which lowers the number of merge passes required later.