Skip to content

Buffer Tree Sort

External-memory sorting method that inserts records into a buffered tree and extracts them in sorted order.

Buffer tree sort uses a buffer tree to sort data in the external-memory model. Instead of sorting all records directly, it inserts records into a high fanout search tree whose internal nodes hold buffers. When a buffer fills, records are pushed downward in batches.

The purpose is to reduce random I/O. Records move through the tree in large blocks, which makes the algorithm efficient when data is stored on disk or another block device.

Model

Assume:

  • NN records
  • memory capacity MM
  • block size BB
  • external storage where block transfers dominate cost

A buffer tree usually has fanout about M/BM / B, so each internal node has many children and a buffer large enough to flush many records at once.

Core Idea

Insert all records into the root buffer. When a buffer becomes full, sort or partition its records by child range and flush them downward in large batches. After all insertions finish, flush all remaining buffers. The leaves then contain records in key order.

Finally, scan the leaves from left to right to produce the sorted output.

Algorithm

buffer_tree_sort(records):
    T = empty buffer tree

    for record in records:
        insert record into root buffer

        if root buffer is full:
            flush(root)

    flush all remaining buffers

    return scan leaves from left to right

The flush operation moves a node buffer downward in batches.

flush(node):
    group buffered records by child key range

    for each child group:
        append group to child buffer

        if child buffer is full:
            flush(child)

    clear node buffer

Example

Suppose a buffer tree has three leaf ranges:

leafkey range
L0keys < 30
L130 <= keys < 70
L2keys >= 70

Input:

[42,7,90,18,55,71] [42, 7, 90, 18, 55, 71]

Records first enter the root buffer. When flushed, they are distributed:

leafrecords
L07, 18
L142, 55
L290, 71

Each leaf is sorted locally:

leafsorted records
L07, 18
L142, 55
L271, 90

Scanning leaves gives:

[7,18,42,55,71,90] [7, 18, 42, 55, 71, 90]

Correctness

Each internal node partitions records by ordered key ranges. When a record is flushed from a node to a child, it is sent to the unique child whose range contains the record key.

After all buffers are flushed, every record reaches a leaf whose key range contains it. Leaves appear in increasing key range order, and records within each leaf are sorted. Therefore scanning the leaves from left to right returns all records in sorted order.

Complexity

Buffer tree sort achieves the external sorting bound:

O(NBlogMBNB) O\left(\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}\right)

I/O operations under the standard external-memory assumptions.

The CPU cost depends on the cost of partitioning and sorting buffered batches. In practice, batching improves locality and reduces per-record overhead.

Why Buffering Helps

A plain search tree may move one record at a time through many nodes. That causes many small and random I/O operations.

A buffer tree accumulates many records at a node and moves them together. This changes the access pattern from fine-grained updates to block-sized transfers.

Comparison

methodmain operationI/O pattern
External merge sortgenerate runs, then mergesequential
Buffer tree sortbatch insert into buffered treebatched tree I/O
B-tree bulk load sortsort first, then build treesequential build

When to Use

Use buffer tree sort when:

  • data is too large for memory
  • many records are inserted before sorted output is needed
  • batched external-memory updates are natural
  • you want asymptotically optimal external sorting I/O
  • the data may later remain in a searchable tree structure

It is less common as a simple standalone file sort, where external merge sort is usually easier to implement.

Implementation Sketch

class BufferNode:
    def __init__(self, splitters=None, children=None):
        self.splitters = splitters or []
        self.children = children or []
        self.buffer = []
        self.records = []

    def is_leaf(self):
        return len(self.children) == 0
from bisect import bisect_left

def insert(node, record, capacity):
    node.buffer.append(record)

    if len(node.buffer) >= capacity:
        flush(node, capacity)

def flush(node, capacity):
    if node.is_leaf():
        node.records.extend(node.buffer)
        node.buffer.clear()
        return

    groups = [[] for _ in node.children]

    for record in node.buffer:
        i = bisect_left(node.splitters, record)
        groups[i].append(record)

    node.buffer.clear()

    for child, group in zip(node.children, groups):
        child.buffer.extend(group)

        if len(child.buffer) >= capacity:
            flush(child, capacity)
def scan(node):
    if node.is_leaf():
        return sorted(node.records + node.buffer)

    out = []
    for child in node.children:
        out.extend(scan(child))
    return out

Notes

Buffer tree sort is important because it shows how sorting can be expressed as batched ordered insertion. The tree structure preserves key order, while buffers make the movement of records efficient in external memory.