Buffer Tree Sort

Buffer tree sort uses a buffer tree to sort data in the external-memory model. Instead of sorting all records directly, it inserts records into a high fanout search tree whose internal nodes hold buffers. When a buffer fills, records are pushed downward in batches.

The purpose is to reduce random I/O. Records move through the tree in large blocks, which makes the algorithm efficient when data is stored on disk or another block device.

Model

Assume:

$N$ records
memory capacity $M$
block size $B$
external storage where block transfers dominate cost

A buffer tree usually has fanout about $M / B$, so each internal node has many children and a buffer large enough to flush many records at once.

Core Idea

Insert all records into the root buffer. When a buffer becomes full, sort or partition its records by child range and flush them downward in large batches. After all insertions finish, flush all remaining buffers. The leaves then contain records in key order.

Finally, scan the leaves from left to right to produce the sorted output.

Algorithm

buffer_tree_sort(records):
    T = empty buffer tree

    for record in records:
        insert record into root buffer

        if root buffer is full:
            flush(root)

    flush all remaining buffers

    return scan leaves from left to right

The flush operation moves a node buffer downward in batches.

flush(node):
    group buffered records by child key range

    for each child group:
        append group to child buffer

        if child buffer is full:
            flush(child)

    clear node buffer

Example

Suppose a buffer tree has three leaf ranges:

leaf	key range
L0	keys < 30
L1	30 <= keys < 70
L2	keys >= 70

Input:

$$ [42, 7, 90, 18, 55, 71] $$

Records first enter the root buffer. When flushed, they are distributed:

leaf	records
L0	7, 18
L1	42, 55
L2	90, 71

Each leaf is sorted locally:

leaf	sorted records
L0	7, 18
L1	42, 55
L2	71, 90

Scanning leaves gives:

$$ [7, 18, 42, 55, 71, 90] $$

Correctness

Each internal node partitions records by ordered key ranges. When a record is flushed from a node to a child, it is sent to the unique child whose range contains the record key.

After all buffers are flushed, every record reaches a leaf whose key range contains it. Leaves appear in increasing key range order, and records within each leaf are sorted. Therefore scanning the leaves from left to right returns all records in sorted order.

Complexity

Buffer tree sort achieves the external sorting bound:

$$ O\left(\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}\right) $$

I/O operations under the standard external-memory assumptions.

The CPU cost depends on the cost of partitioning and sorting buffered batches. In practice, batching improves locality and reduces per-record overhead.

Why Buffering Helps

A plain search tree may move one record at a time through many nodes. That causes many small and random I/O operations.

A buffer tree accumulates many records at a node and moves them together. This changes the access pattern from fine-grained updates to block-sized transfers.

Comparison

method	main operation	I/O pattern
External merge sort	generate runs, then merge	sequential
Buffer tree sort	batch insert into buffered tree	batched tree I/O
B-tree bulk load sort	sort first, then build tree	sequential build

When to Use

Use buffer tree sort when:

data is too large for memory
many records are inserted before sorted output is needed
batched external-memory updates are natural
you want asymptotically optimal external sorting I/O
the data may later remain in a searchable tree structure

It is less common as a simple standalone file sort, where external merge sort is usually easier to implement.

Implementation Sketch

class BufferNode:
    def __init__(self, splitters=None, children=None):
        self.splitters = splitters or []
        self.children = children or []
        self.buffer = []
        self.records = []

    def is_leaf(self):
        return len(self.children) == 0

from bisect import bisect_left

def insert(node, record, capacity):
    node.buffer.append(record)

    if len(node.buffer) >= capacity:
        flush(node, capacity)

def flush(node, capacity):
    if node.is_leaf():
        node.records.extend(node.buffer)
        node.buffer.clear()
        return

    groups = [[] for _ in node.children]

    for record in node.buffer:
        i = bisect_left(node.splitters, record)
        groups[i].append(record)

    node.buffer.clear()

    for child, group in zip(node.children, groups):
        child.buffer.extend(group)

        if len(child.buffer) >= capacity:
            flush(child, capacity)

def scan(node):
    if node.is_leaf():
        return sorted(node.records + node.buffer)

    out = []
    for child in node.children:
        out.extend(scan(child))
    return out

Notes

Buffer tree sort is important because it shows how sorting can be expressed as batched ordered insertion. The tree structure preserves key order, while buffers make the movement of records efficient in external memory.