# Database External Sort

# Database External Sort

Database external sort is the disk-based sorting procedure used inside relational database query engines. It appears in query plans for operations such as `ORDER BY`, `GROUP BY`, `DISTINCT`, window functions, and sort-merge joins.

The method is usually a variant of external merge sort with database-specific concerns: rows, tuple comparison, memory quotas, spill files, collation rules, NULL ordering, and stable execution under limited working memory.

## Problem

Given a stream of database rows and a sort key, produce rows ordered by that key.

The input may exceed the memory granted to the query operator, so the sort must spill intermediate data to disk.

## Core Idea

The database sort operator has two main modes:

1. **In-memory sort**
   If all rows fit in the memory grant, sort directly and emit results.

2. **External sort**
   If rows exceed memory, sort chunks into temporary runs, write them to disk, then merge the runs.

```text id="j0eyfz"
database_external_sort(rows, key, memory_limit):
    runs = []

    while more rows:
        chunk = read rows until memory_limit
        sort chunk by key
        write chunk as temp run
        runs.append(temp run)

    if runs is empty:
        return sorted in-memory chunk

    return merge_runs(runs, key)
```

## Run Generation

Rows are read from the child operator until the memory limit is reached. The engine sorts those rows in memory and writes a sorted run to temporary storage.

```text id="t9k1ub"
generate_runs(input, key, memory_limit):
    runs = []

    while input has rows:
        chunk = read rows until memory_limit
        sort chunk using database comparator
        runs.append(write_temp_run(chunk))

    return runs
```

The comparator must respect database semantics:

* ascending or descending order
* NULLS FIRST or NULLS LAST
* collation for strings
* multiple key columns
* tie-breaking if stable order is required

## Merge Phase

The merge phase performs a k-way merge over sorted runs.

```text id="ssbl0k"
merge_runs(runs, key):
    heap = empty priority queue

    for each run:
        read first row
        push row into heap

    while heap not empty:
        row = extract smallest row
        emit row

        read next row from same run
        if row exists:
            push row into heap
```

## Example

Input rows:

| id | name | score |
| -: | ---- | ----: |
|  1 | Ada  |    91 |
|  2 | Ken  |    75 |
|  3 | Lin  |    88 |
|  4 | Bea  |    75 |
|  5 | Sol  |    99 |

Sort key:

```sql id="cozlt8"
ORDER BY score ASC, name ASC
```

If memory holds only three rows, run generation may produce:

| run | rows                            |
| --- | ------------------------------- |
| R1  | (75, Ken), (88, Lin), (91, Ada) |
| R2  | (75, Bea), (99, Sol)            |

The merge compares run heads:

| step | candidates     | emitted |
| ---: | -------------- | ------- |
|    1 | Ken 75, Bea 75 | Bea 75  |
|    2 | Ken 75, Sol 99 | Ken 75  |
|    3 | Lin 88, Sol 99 | Lin 88  |
|    4 | Ada 91, Sol 99 | Ada 91  |
|    5 | Sol 99         | Sol 99  |

Final order:

| id | name | score |
| -: | ---- | ----: |
|  4 | Bea  |    75 |
|  2 | Ken  |    75 |
|  3 | Lin  |    88 |
|  1 | Ada  |    91 |
|  5 | Sol  |    99 |

## Correctness

Each generated run is sorted according to the database comparator. The merge phase repeatedly emits the smallest available row among all run heads.

Since each run is sorted, no later row in a run can compare smaller than that run head. Therefore each emitted row is the smallest remaining row globally. After all runs are exhausted, every input row has been emitted once in sorted order.

## Complexity

Let:

* $N$ be the number of rows
* $M$ be the memory grant in rows
* $R = \lceil N / M \rceil$ be the number of runs
* $B$ be the block size
* $k$ be merge fan-in

Run generation costs:

$$
O(N \log M)
$$

CPU time for in-memory sorting.

Merge CPU cost:

$$
O(N \log k)
$$

per merge level.

I/O cost:

$$
O\left(\frac{N}{B} \log_k R\right)
$$

block transfers, plus the initial read and final write or output stream.

## Database-Specific Details

| concern       | effect                                     |
| ------------- | ------------------------------------------ |
| memory grant  | determines whether sort spills             |
| temp storage  | holds sorted runs                          |
| row width     | affects number of rows per run             |
| collation     | affects string comparison cost             |
| NULL ordering | affects comparator logic                   |
| LIMIT         | may allow top-k sort instead of full sort  |
| DISTINCT      | may remove duplicates during merge         |
| GROUP BY      | may aggregate while scanning sorted groups |

## When to Use

Database external sort is used when a query requires ordered rows and no suitable index already provides the needed order.

Common cases:

* `ORDER BY`
* sort-based `GROUP BY`
* `DISTINCT`
* window functions with ordering
* sort-merge join
* index creation

## Implementation Sketch

```python id="mfy0dx"
import heapq

def database_external_sort(rows, key, memory_limit):
    runs = []

    for i in range(0, len(rows), memory_limit):
        chunk = rows[i:i + memory_limit]
        chunk.sort(key=key)
        runs.append(chunk)

    return merge_sorted_runs(runs, key)
```

```python id="64nz98"
def merge_sorted_runs(runs, key):
    heap = []
    iters = [iter(run) for run in runs]

    for run_id, it in enumerate(iters):
        try:
            row = next(it)
            heapq.heappush(heap, (key(row), run_id, row))
        except StopIteration:
            pass

    out = []

    while heap:
        _, run_id, row = heapq.heappop(heap)
        out.append(row)

        try:
            nxt = next(iters[run_id])
            heapq.heappush(heap, (key(nxt), run_id, nxt))
        except StopIteration:
            pass

    return out
```

## Notes

Database external sort is an execution operator, not only a textbook algorithm. Its performance depends heavily on memory grants, row representation, temporary storage, and whether the optimizer can avoid sorting by using an existing index order.

