# Buffer Tree Sort

# Buffer Tree Sort

Buffer tree sort uses a buffer tree to sort data in the external-memory model. Instead of sorting all records directly, it inserts records into a high fanout search tree whose internal nodes hold buffers. When a buffer fills, records are pushed downward in batches.

The purpose is to reduce random I/O. Records move through the tree in large blocks, which makes the algorithm efficient when data is stored on disk or another block device.

## Model

Assume:

* $N$ records
* memory capacity $M$
* block size $B$
* external storage where block transfers dominate cost

A buffer tree usually has fanout about $M / B$, so each internal node has many children and a buffer large enough to flush many records at once.

## Core Idea

Insert all records into the root buffer. When a buffer becomes full, sort or partition its records by child range and flush them downward in large batches. After all insertions finish, flush all remaining buffers. The leaves then contain records in key order.

Finally, scan the leaves from left to right to produce the sorted output.

## Algorithm

```text id="6szflt"
buffer_tree_sort(records):
    T = empty buffer tree

    for record in records:
        insert record into root buffer

        if root buffer is full:
            flush(root)

    flush all remaining buffers

    return scan leaves from left to right
```

The flush operation moves a node buffer downward in batches.

```text id="60k89n"
flush(node):
    group buffered records by child key range

    for each child group:
        append group to child buffer

        if child buffer is full:
            flush(child)

    clear node buffer
```

## Example

Suppose a buffer tree has three leaf ranges:

| leaf | key range       |
| ---- | --------------- |
| L0   | keys < 30       |
| L1   | 30 <= keys < 70 |
| L2   | keys >= 70      |

Input:

$$
[42, 7, 90, 18, 55, 71]
$$

Records first enter the root buffer. When flushed, they are distributed:

| leaf | records |
| ---- | ------- |
| L0   | 7, 18   |
| L1   | 42, 55  |
| L2   | 90, 71  |

Each leaf is sorted locally:

| leaf | sorted records |
| ---- | -------------- |
| L0   | 7, 18          |
| L1   | 42, 55         |
| L2   | 71, 90         |

Scanning leaves gives:

$$
[7, 18, 42, 55, 71, 90]
$$

## Correctness

Each internal node partitions records by ordered key ranges. When a record is flushed from a node to a child, it is sent to the unique child whose range contains the record key.

After all buffers are flushed, every record reaches a leaf whose key range contains it. Leaves appear in increasing key range order, and records within each leaf are sorted. Therefore scanning the leaves from left to right returns all records in sorted order.

## Complexity

Buffer tree sort achieves the external sorting bound:

$$
O\left(\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}\right)
$$

I/O operations under the standard external-memory assumptions.

The CPU cost depends on the cost of partitioning and sorting buffered batches. In practice, batching improves locality and reduces per-record overhead.

## Why Buffering Helps

A plain search tree may move one record at a time through many nodes. That causes many small and random I/O operations.

A buffer tree accumulates many records at a node and moves them together. This changes the access pattern from fine-grained updates to block-sized transfers.

## Comparison

| method                | main operation                  | I/O pattern      |
| --------------------- | ------------------------------- | ---------------- |
| External merge sort   | generate runs, then merge       | sequential       |
| Buffer tree sort      | batch insert into buffered tree | batched tree I/O |
| B-tree bulk load sort | sort first, then build tree     | sequential build |

## When to Use

Use buffer tree sort when:

* data is too large for memory
* many records are inserted before sorted output is needed
* batched external-memory updates are natural
* you want asymptotically optimal external sorting I/O
* the data may later remain in a searchable tree structure

It is less common as a simple standalone file sort, where external merge sort is usually easier to implement.

## Implementation Sketch

```python id="e1d4kl"
class BufferNode:
    def __init__(self, splitters=None, children=None):
        self.splitters = splitters or []
        self.children = children or []
        self.buffer = []
        self.records = []

    def is_leaf(self):
        return len(self.children) == 0
```

```python id="dg70pn"
from bisect import bisect_left

def insert(node, record, capacity):
    node.buffer.append(record)

    if len(node.buffer) >= capacity:
        flush(node, capacity)

def flush(node, capacity):
    if node.is_leaf():
        node.records.extend(node.buffer)
        node.buffer.clear()
        return

    groups = [[] for _ in node.children]

    for record in node.buffer:
        i = bisect_left(node.splitters, record)
        groups[i].append(record)

    node.buffer.clear()

    for child, group in zip(node.children, groups):
        child.buffer.extend(group)

        if len(child.buffer) >= capacity:
            flush(child, capacity)
```

```python id="pr8j0u"
def scan(node):
    if node.is_leaf():
        return sorted(node.records + node.buffer)

    out = []
    for child in node.children:
        out.extend(scan(child))
    return out
```

## Notes

Buffer tree sort is important because it shows how sorting can be expressed as batched ordered insertion. The tree structure preserves key order, while buffers make the movement of records efficient in external memory.

