Skip to content

Spill Sort

Sorting method that keeps data in memory until a memory limit is reached, then spills sorted runs to external storage and merges them.

Spill sort is a practical external sorting pattern used when an operator begins as an in-memory sort but exceeds its memory budget. When memory fills, the current data is sorted and spilled to temporary storage as a run. After all input has been read, the spilled runs are merged.

This is common in databases, query engines, log processors, and stream processors.

Problem

Given an input stream and a memory limit, sort all records even when the full input does not fit in memory.

The algorithm must use memory when possible and external storage only when necessary.

Core Idea

Accumulate records in memory. If the memory budget is reached, sort the in-memory buffer and write it as a sorted run. Continue until the input ends. Then merge all sorted runs, including any remaining in-memory run.

spill_sort(input, memory_limit):
    buffer = []
    runs = []

    for record in input:
        buffer.append(record)

        if memory_used(buffer) >= memory_limit:
            sort(buffer)
            runs.append(spill_to_disk(buffer))
            buffer = []

    if buffer not empty:
        sort(buffer)
        runs.append(buffer)

    return merge_runs(runs)

Example

Input:

[8,3,5,1,9,2,7,4] [8, 3, 5, 1, 9, 2, 7, 4]

Memory limit:

3 3

Spilled runs:

buffersorted run
[8, 3, 5][3, 5, 8]
[1, 9, 2][1, 2, 9]
[7, 4][4, 7]

Merge:

[3,5,8],[1,2,9],[4,7] [3,5,8], [1,2,9], [4,7]

Final result:

[1,2,3,4,5,7,8,9] [1,2,3,4,5,7,8,9]

Correctness

Each spill run is sorted before it is written. The final merge repeatedly emits the smallest current element among all runs.

Since each run is sorted, no later element in a run can be smaller than that run’s current head. Therefore each emitted element is the smallest remaining element overall. Every input record is placed into exactly one run and emitted once, so the output is a sorted permutation of the input.

Complexity

Let:

  • NN be the number of records
  • MM be the memory limit in records
  • R=N/MR = \lceil N/M \rceil be the number of runs
  • kk be the merge fan-in
  • BB be the block size

Run sorting cost:

O(NlogM) O(N \log M)

Merge cost per merge level:

O(Nlogk) O(N \log k)

I/O cost:

O(NBlogkR) O\left(\frac{N}{B} \log_k R\right)

In the best case, when all records fit in memory, no spill occurs and the algorithm is just an in-memory sort.

Spill Policy

policybehavior
hard memory limitspill as soon as budget is reached
soft memory limitspill when pressure becomes high
adaptive memory grantgrow or shrink available memory
early spillavoids memory exhaustion
late spillmay reduce number of runs

When to Use

Use spill sort when:

  • input size is unknown in advance
  • memory use must be bounded
  • in-memory sorting is preferred when possible
  • external storage is available for temporary runs
  • the system needs graceful behavior under memory pressure

It is a common implementation strategy for database ORDER BY, GROUP BY, DISTINCT, and window operators.

Implementation Sketch

import heapq

def merge_runs(runs):
    heap = []
    iters = [iter(run) for run in runs]

    for run_id, it in enumerate(iters):
        try:
            value = next(it)
            heapq.heappush(heap, (value, run_id))
        except StopIteration:
            pass

    out = []

    while heap:
        value, run_id = heapq.heappop(heap)
        out.append(value)

        try:
            nxt = next(iters[run_id])
            heapq.heappush(heap, (nxt, run_id))
        except StopIteration:
            pass

    return out
def spill_sort(data, memory_limit):
    runs = []
    buffer = []

    for x in data:
        buffer.append(x)

        if len(buffer) >= memory_limit:
            runs.append(sorted(buffer))
            buffer = []

    if buffer:
        runs.append(sorted(buffer))

    return merge_runs(runs)

Notes

Spill sort is less a separate mathematical sorting algorithm and more an execution strategy. It starts with the optimistic case of in-memory sorting and falls back to external merge sorting only when memory pressure requires it.