Skip to content

Replacement Selection Run Generation

External sorting technique that uses a heap to produce initial sorted runs longer than the available memory.

Replacement selection run generation is the first phase of an external sort. It produces sorted runs from an input stream using a heap. Unlike simple chunk sorting, which creates runs of about memory size, replacement selection often creates runs about twice as large when input order is random.

Longer runs reduce the number of later merge passes.

Problem

Given an input stream too large for memory, produce sorted runs using memory for only MM records.

The output is not one final sorted file yet. It is a sequence of sorted runs that will later be merged.

Core Idea

Keep a min-heap of active records. Repeatedly output the smallest active record. When a new input record arrives:

  • if it is at least as large as the last output record, insert it into the current run
  • otherwise freeze it for the next run

When all active records are exhausted, start a new run using the frozen records.

Algorithm

replacement_selection(input, M):
    heap = first M records from input
    frozen = empty heap
    runs = []
    current_run = []
    last_output = -infinity

    while heap is not empty:
        x = extract_min(heap)
        append x to current_run
        last_output = x

        if input has next record:
            y = read next record

            if y >= last_output:
                insert y into heap
            else:
                insert y into frozen

        if heap is empty:
            append current_run to runs
            current_run = []
            heap = frozen
            frozen = empty heap
            last_output = -infinity

    return runs

Example

Let memory hold 3 records.

Input:

[4,8,1,7,2,9,3,6] [4, 8, 1, 7, 2, 9, 3, 6]

Initial heap:

[1,4,8] [1, 4, 8]

Process:

outputnew inputaction
17active, since 7 >= 1
42frozen, since 2 < 4
79active, since 9 >= 7
83frozen, since 3 < 8
96frozen, since 6 < 9

Current run:

[1,4,7,8,9] [1, 4, 7, 8, 9]

Frozen records become the next heap:

[2,3,6] [2, 3, 6]

Next run:

[2,3,6] [2, 3, 6]

Correctness

The active heap always contains records that can still extend the current run. Since the algorithm extracts the minimum active record, every emitted record is at least as large as the previous emitted record.

Records smaller than the last output cannot appear in the current run without breaking sorted order, so they are frozen for the next run. Every input record is inserted into exactly one heap and eventually emitted. Therefore each produced run is sorted and all records are preserved.

Complexity

Let:

  • NN be the number of records
  • MM be the heap capacity

Each record is inserted into and removed from a heap once.

CPU time:

O(NlogM) O(N \log M)

Memory usage:

O(M) O(M)

I/O is sequential:

O(N) O(N)

records read and written, excluding later merge phases.

Expected Run Length

For random input, replacement selection often produces runs with expected length about:

2M 2M

This is an average-case heuristic under common assumptions. Actual run length depends on input order.

input patternrun behavior
already sortedone very long run
randomabout 2M2M per run
reverse sortedabout MM per run

When to Use

Use replacement selection when:

  • building initial runs for external merge sort
  • input is streamed
  • memory is limited
  • reducing merge passes matters
  • input is not strongly reverse sorted

For modern systems with fast SSDs and large memory, simple chunk sorting may be easier. Replacement selection remains important in database and external sorting theory.

Implementation Sketch

import heapq
import math

def replacement_selection(data, memory_size):
    it = iter(data)

    heap = []
    frozen = []
    runs = []
    current = []
    last_output = -math.inf

    for _ in range(memory_size):
        try:
            heapq.heappush(heap, next(it))
        except StopIteration:
            break

    while heap:
        x = heapq.heappop(heap)
        current.append(x)
        last_output = x

        try:
            y = next(it)
            if y >= last_output:
                heapq.heappush(heap, y)
            else:
                heapq.heappush(frozen, y)
        except StopIteration:
            pass

        if not heap:
            runs.append(current)
            current = []
            heap, frozen = frozen, []
            last_output = -math.inf

    return runs

Notes

Replacement selection improves external sort by making the first runs longer. Longer runs mean fewer runs to merge, which can reduce the number of expensive external merge passes.