Skip to content

B Tree Bulk Load Sort

Sorting method that builds a B-tree directly from sorted data using bulk loading to minimize I/O operations.

B tree bulk load sort constructs a B-tree from a dataset by first sorting the data and then building the tree in a single pass. The sorting phase produces ordered records, and the bulk loading phase organizes them into tree nodes efficiently.

This method avoids inserting records one by one into a B-tree, which would cause many random I/O operations.

Problem

Given a large dataset, build a B-tree index in sorted order with minimal external-memory cost.

The input may not fit in memory, so both sorting and tree construction must use sequential I/O.

Core Idea

Sort the data first, then construct the B-tree level by level:

  • fill leaf nodes sequentially with sorted records
  • build upper levels by grouping child nodes
  • continue until the root is formed

Because the input is already sorted, nodes can be filled without backtracking or rebalancing.

Algorithm

b_tree_bulk_load(records, node_capacity):
    sorted_records = external_sort(records)

    leaves = []
    current_leaf = empty node

    for record in sorted_records:
        append record to current_leaf

        if current_leaf is full:
            write current_leaf to disk
            leaves.append(current_leaf)
            current_leaf = empty node

    if current_leaf not empty:
        write it and append to leaves

    return build_upper_levels(leaves, node_capacity)

Building upper levels:

build_upper_levels(nodes, node_capacity):
    while length(nodes) > 1:
        parents = []
        current = empty node

        for node in nodes:
            append pointer to node into current

            if current is full:
                write current
                parents.append(current)
                current = empty node

        if current not empty:
            write current
            parents.append(current)

        nodes = parents

    return nodes[0]   # root

Example

Suppose:

  • leaf capacity = 3
  • input (after sorting):
[7,12,18,30,42,55,71,90] [7, 12, 18, 30, 42, 55, 71, 90]

Step 1: Build Leaves

leafrecords
L07, 12, 18
L130, 42, 55
L271, 90

Step 2: Build Parent

nodechildren
P0L0, L1, L2

This becomes the root.

Correctness

The sorted input ensures that keys in each leaf node are ordered. Leaf nodes are written in sequence, so all keys in a leaf are less than or equal to all keys in the next leaf.

Each internal node stores pointers to child nodes in sorted order. The tree structure maintains the B-tree property that all keys in a subtree lie within a specific range.

Since all elements are placed in leaves and internal nodes only organize pointers, the resulting structure is a valid B-tree representing the sorted order of the input.

Complexity

Let:

  • NN be the number of records
  • BB be the block size

Sorting cost:

O(NBlogMBNB) O\left(\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}\right)

Bulk loading cost:

O(NB) O\left(\frac{N}{B}\right)

since nodes are written sequentially.

Total cost is dominated by the sorting phase.

Comparison with Incremental Insertion

methodbehavior
incremental B-tree insertionmany random writes, node splits
bulk loadingsequential writes, no splits

Bulk loading is significantly faster for large datasets.

When to Use

Use B-tree bulk load sort when:

  • building an index from a large dataset
  • input data can be sorted first
  • minimizing random I/O is critical
  • constructing database indexes or search trees

It is standard in database systems when creating indexes on existing tables.

Implementation Sketch

def bulk_load(sorted_data, leaf_size):
    leaves = []
    for i in range(0, len(sorted_data), leaf_size):
        leaves.append(sorted_data[i:i + leaf_size])
    return leaves
def build_tree(nodes, fanout):
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), fanout):
            parents.append(nodes[i:i + fanout])
        nodes = parents
    return nodes[0]

Notes

B-tree bulk loading separates sorting from indexing. The sort establishes global order, and the bulk load constructs the tree efficiently from that order. This approach is widely used in storage engines and database systems for initial index creation.