# B Tree Bulk Load Sort

# B Tree Bulk Load Sort

B tree bulk load sort constructs a B-tree from a dataset by first sorting the data and then building the tree in a single pass. The sorting phase produces ordered records, and the bulk loading phase organizes them into tree nodes efficiently.

This method avoids inserting records one by one into a B-tree, which would cause many random I/O operations.

## Problem

Given a large dataset, build a B-tree index in sorted order with minimal external-memory cost.

The input may not fit in memory, so both sorting and tree construction must use sequential I/O.

## Core Idea

Sort the data first, then construct the B-tree level by level:

* fill leaf nodes sequentially with sorted records
* build upper levels by grouping child nodes
* continue until the root is formed

Because the input is already sorted, nodes can be filled without backtracking or rebalancing.

## Algorithm

```text id="z8v6lm"
b_tree_bulk_load(records, node_capacity):
    sorted_records = external_sort(records)

    leaves = []
    current_leaf = empty node

    for record in sorted_records:
        append record to current_leaf

        if current_leaf is full:
            write current_leaf to disk
            leaves.append(current_leaf)
            current_leaf = empty node

    if current_leaf not empty:
        write it and append to leaves

    return build_upper_levels(leaves, node_capacity)
```

Building upper levels:

```text id="u7p4tk"
build_upper_levels(nodes, node_capacity):
    while length(nodes) > 1:
        parents = []
        current = empty node

        for node in nodes:
            append pointer to node into current

            if current is full:
                write current
                parents.append(current)
                current = empty node

        if current not empty:
            write current
            parents.append(current)

        nodes = parents

    return nodes[0]   # root
```

## Example

Suppose:

* leaf capacity = 3
* input (after sorting):

$$
[7, 12, 18, 30, 42, 55, 71, 90]
$$

### Step 1: Build Leaves

| leaf | records    |
| ---- | ---------- |
| L0   | 7, 12, 18  |
| L1   | 30, 42, 55 |
| L2   | 71, 90     |

### Step 2: Build Parent

| node | children   |
| ---- | ---------- |
| P0   | L0, L1, L2 |

This becomes the root.

## Correctness

The sorted input ensures that keys in each leaf node are ordered. Leaf nodes are written in sequence, so all keys in a leaf are less than or equal to all keys in the next leaf.

Each internal node stores pointers to child nodes in sorted order. The tree structure maintains the B-tree property that all keys in a subtree lie within a specific range.

Since all elements are placed in leaves and internal nodes only organize pointers, the resulting structure is a valid B-tree representing the sorted order of the input.

## Complexity

Let:

* $N$ be the number of records
* $B$ be the block size

Sorting cost:

$$
O\left(\frac{N}{B} \log_{\frac{M}{B}} \frac{N}{B}\right)
$$

Bulk loading cost:

$$
O\left(\frac{N}{B}\right)
$$

since nodes are written sequentially.

Total cost is dominated by the sorting phase.

## Comparison with Incremental Insertion

| method                       | behavior                        |
| ---------------------------- | ------------------------------- |
| incremental B-tree insertion | many random writes, node splits |
| bulk loading                 | sequential writes, no splits    |

Bulk loading is significantly faster for large datasets.

## When to Use

Use B-tree bulk load sort when:

* building an index from a large dataset
* input data can be sorted first
* minimizing random I/O is critical
* constructing database indexes or search trees

It is standard in database systems when creating indexes on existing tables.

## Implementation Sketch

```python id="z3q9pf"
def bulk_load(sorted_data, leaf_size):
    leaves = []
    for i in range(0, len(sorted_data), leaf_size):
        leaves.append(sorted_data[i:i + leaf_size])
    return leaves
```

```python id="g7t2qa"
def build_tree(nodes, fanout):
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), fanout):
            parents.append(nodes[i:i + fanout])
        nodes = parents
    return nodes[0]
```

## Notes

B-tree bulk loading separates sorting from indexing. The sort establishes global order, and the bulk load constructs the tree efficiently from that order. This approach is widely used in storage engines and database systems for initial index creation.

