# Spill Sort

# Spill Sort

Spill sort is a practical external sorting pattern used when an operator begins as an in-memory sort but exceeds its memory budget. When memory fills, the current data is sorted and spilled to temporary storage as a run. After all input has been read, the spilled runs are merged.

This is common in databases, query engines, log processors, and stream processors.

## Problem

Given an input stream and a memory limit, sort all records even when the full input does not fit in memory.

The algorithm must use memory when possible and external storage only when necessary.

## Core Idea

Accumulate records in memory. If the memory budget is reached, sort the in-memory buffer and write it as a sorted run. Continue until the input ends. Then merge all sorted runs, including any remaining in-memory run.

```text id="e20kpg"
spill_sort(input, memory_limit):
    buffer = []
    runs = []

    for record in input:
        buffer.append(record)

        if memory_used(buffer) >= memory_limit:
            sort(buffer)
            runs.append(spill_to_disk(buffer))
            buffer = []

    if buffer not empty:
        sort(buffer)
        runs.append(buffer)

    return merge_runs(runs)
```

## Example

Input:

$$
[8, 3, 5, 1, 9, 2, 7, 4]
$$

Memory limit:

$$
3
$$

Spilled runs:

| buffer    | sorted run |
| --------- | ---------- |
| [8, 3, 5] | [3, 5, 8]  |
| [1, 9, 2] | [1, 2, 9]  |
| [7, 4]    | [4, 7]     |

Merge:

$$
[3,5,8], [1,2,9], [4,7]
$$

Final result:

$$
[1,2,3,4,5,7,8,9]
$$

## Correctness

Each spill run is sorted before it is written. The final merge repeatedly emits the smallest current element among all runs.

Since each run is sorted, no later element in a run can be smaller than that run's current head. Therefore each emitted element is the smallest remaining element overall. Every input record is placed into exactly one run and emitted once, so the output is a sorted permutation of the input.

## Complexity

Let:

* $N$ be the number of records
* $M$ be the memory limit in records
* $R = \lceil N/M \rceil$ be the number of runs
* $k$ be the merge fan-in
* $B$ be the block size

Run sorting cost:

$$
O(N \log M)
$$

Merge cost per merge level:

$$
O(N \log k)
$$

I/O cost:

$$
O\left(\frac{N}{B} \log_k R\right)
$$

In the best case, when all records fit in memory, no spill occurs and the algorithm is just an in-memory sort.

## Spill Policy

| policy                | behavior                           |
| --------------------- | ---------------------------------- |
| hard memory limit     | spill as soon as budget is reached |
| soft memory limit     | spill when pressure becomes high   |
| adaptive memory grant | grow or shrink available memory    |
| early spill           | avoids memory exhaustion           |
| late spill            | may reduce number of runs          |

## When to Use

Use spill sort when:

* input size is unknown in advance
* memory use must be bounded
* in-memory sorting is preferred when possible
* external storage is available for temporary runs
* the system needs graceful behavior under memory pressure

It is a common implementation strategy for database `ORDER BY`, `GROUP BY`, `DISTINCT`, and window operators.

## Implementation Sketch

```python id="o6bmj4"
import heapq

def merge_runs(runs):
    heap = []
    iters = [iter(run) for run in runs]

    for run_id, it in enumerate(iters):
        try:
            value = next(it)
            heapq.heappush(heap, (value, run_id))
        except StopIteration:
            pass

    out = []

    while heap:
        value, run_id = heapq.heappop(heap)
        out.append(value)

        try:
            nxt = next(iters[run_id])
            heapq.heappush(heap, (nxt, run_id))
        except StopIteration:
            pass

    return out
```

```python id="qa3b0p"
def spill_sort(data, memory_limit):
    runs = []
    buffer = []

    for x in data:
        buffer.append(x)

        if len(buffer) >= memory_limit:
            runs.append(sorted(buffer))
            buffer = []

    if buffer:
        runs.append(sorted(buffer))

    return merge_runs(runs)
```

## Notes

Spill sort is less a separate mathematical sorting algorithm and more an execution strategy. It starts with the optimistic case of in-memory sorting and falls back to external merge sorting only when memory pressure requires it.

