# External Bucket Sort

External bucket sort extends bucket sort to data that does not fit in memory. It partitions records into ordered buckets stored on external storage, sorts each bucket independently, and concatenates the buckets in key order.

The method works well when the key distribution is known well enough to create buckets of manageable size.

## Problem

Given a large dataset and a key function, sort all records using bounded memory.

The algorithm assumes that records can be assigned to ordered buckets:

$$
B_0, B_1, \ldots, B_{k-1}
$$

where every key in $B_i$ is less than or equal to every key in $B_j$ whenever $i < j$.

## Core Idea

Create external buckets, scan the input once, and write each record to its bucket. Then sort each bucket separately. Since bucket ranges are ordered, the final output is the concatenation of sorted buckets.

```text id="b9r2kd"
external_bucket_sort(input, bucket_ranges):
    buckets = create_external_buckets(bucket_ranges)

    for record in input:
        i = bucket_index(record.key, bucket_ranges)
        append record to buckets[i]

    for bucket in buckets:
        sort_bucket(bucket)

    return concatenate buckets in order
```
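The pseudocode above can be sketched with actual bucket files. The snippet below is a minimal file-backed version, assuming integer keys, a list of inclusive splitter upper bounds, and one temporary file per bucket (the function name and splitter convention are illustrative, not from the original):

```python
import os
import tempfile

def external_bucket_sort_files(records, splitters):
    # One temp file per bucket; splitters[i] is the inclusive upper
    # bound of bucket i, with a final catch-all bucket at the end.
    tmpdir = tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, f"bucket_{i}.txt")
             for i in range(len(splitters) + 1)]

    # Distribution pass: one sequential scan, appending each key
    # to the file of its bucket.
    files = [open(p, "w") for p in paths]
    for key in records:
        i = next((j for j, s in enumerate(splitters) if key <= s),
                 len(splitters))
        files[i].write(f"{key}\n")
    for f in files:
        f.close()

    # Sort each bucket independently and emit in bucket order.
    out = []
    for p in paths:
        with open(p) as f:
            bucket = [int(line) for line in f]
        out.extend(sorted(bucket))
        os.remove(p)
    os.rmdir(tmpdir)
    return out
```

A real implementation would stream bucket contents rather than read each file into a list, but the structure (distribute, sort per bucket, concatenate) is the same.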

## Bucket Assignment

For fixed numeric ranges, bucket assignment can be computed directly.

```text id="j62qfd"
bucket_index(x, min_key, max_key, k):
    width = (max_key - min_key + 1) / k
    return floor((x - min_key) / width)
```

For arbitrary splitters, use binary search.

```text id="q4am8s"
bucket_index_by_splitters(key, splitters):
    # binary search over the sorted splitters
    return smallest i such that key <= splitters[i]
```
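In Python, this binary search is exactly what `bisect_left` computes over a sorted list of inclusive upper bounds (a sketch; the catch-all convention for keys above the last splitter is an assumption):

```python
from bisect import bisect_left

def bucket_index_by_splitters(key, splitters):
    # splitters is a sorted list of inclusive upper bounds; bisect_left
    # returns the first index i with key <= splitters[i], in O(log k).
    # Keys above the last splitter fall into a final catch-all bucket.
    return bisect_left(splitters, key)
```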

## Example

Input:

$$
[42, 7, 18, 93, 55, 12, 71, 30]
$$

Use ranges:

| bucket | range    |
| ------ | -------- |
| $B_0$  | 0 to 24  |
| $B_1$  | 25 to 49 |
| $B_2$  | 50 to 74 |
| $B_3$  | 75 to 99 |

Distribution:

| bucket | values    |
| ------ | --------- |
| $B_0$  | 7, 18, 12 |
| $B_1$  | 42, 30    |
| $B_2$  | 55, 71    |
| $B_3$  | 93        |

Sort buckets:

| bucket | sorted    |
| ------ | --------- |
| $B_0$  | 7, 12, 18 |
| $B_1$  | 30, 42    |
| $B_2$  | 55, 71    |
| $B_3$  | 93        |

Concatenate:

$$
[7, 12, 18, 30, 42, 55, 71, 93]
$$
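The worked example above can be replayed in a few lines; this sketch uses the equal-width bucket index with `min_key = 0`, `max_key = 99`, and four buckets, matching the ranges in the table:

```python
def bucket_index(x, min_key, max_key, k):
    # Equal-width integer ranges over [min_key, max_key].
    width = (max_key - min_key + 1) / k
    return int((x - min_key) // width)

data = [42, 7, 18, 93, 55, 12, 71, 30]
buckets = [[] for _ in range(4)]
for x in data:
    buckets[bucket_index(x, 0, 99, 4)].append(x)

# Sort each bucket and concatenate in bucket order.
result = [v for bucket in buckets for v in sorted(bucket)]
```

After distribution, `buckets` holds `[7, 18, 12]`, `[42, 30]`, `[55, 71]`, and `[93]`, and `result` is the sorted sequence shown above.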

## Correctness

The bucket ranges are ordered and non-overlapping. Each record is assigned to exactly one bucket whose range contains its key.

Each bucket is sorted internally. Since every key in an earlier bucket is less than or equal to every key in a later bucket, concatenating buckets in bucket order produces a globally sorted sequence.

Every record is written to one bucket and emitted once, so the result is a sorted permutation of the input.

## Complexity

Let:

* $N$ be the number of records
* $k$ be the number of buckets
* $B$ be the block size
* $N_i$ be the size of bucket $i$

Distribution I/O:

$$
O\left(\frac{N}{B}\right)
$$

Sorting cost, where $T(n)$ is the cost of sorting a single bucket of $n$ records:

$$
\sum_{i=0}^{k-1} T(N_i)
$$

With $k$ balanced buckets of size $N/k$ each and a comparison sort inside every bucket, the CPU time is

$$
k \cdot O\!\left(\frac{N}{k} \log \frac{N}{k}\right) = O(N \log (N/k))
$$

If buckets fit in memory, each bucket can be sorted internally and written once.

## Design Choices

| choice              | effect                    |
| ------------------- | ------------------------- |
| number of buckets   | controls bucket size      |
| bucket ranges       | determines balance        |
| buffering           | reduces small writes      |
| recursive bucketing | handles oversized buckets |
| stable bucket sort  | preserves equal-key order |
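The buffering row deserves a concrete shape. Below is a minimal sketch of a buffered bucket writer (the class name, `capacity` parameter, and line-per-key file format are all illustrative assumptions): appends accumulate in memory and are written to the bucket file in batches, turning many small writes into fewer large sequential ones.

```python
import os
import tempfile

class BufferedBucketWriter:
    # Batches appends in memory and writes them to the bucket file
    # in one call, so each flush is a single sequential write.
    def __init__(self, path, capacity=1024):
        self.path = path
        self.capacity = capacity
        self.buffer = []

    def append(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        if self.buffer:
            with open(self.path, "a") as f:
                f.writelines(f"{k}\n" for k in self.buffer)
            self.buffer.clear()
```

A final `flush()` after the distribution pass is required so the tail of the buffer reaches disk.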

## When to Use

Use external bucket sort when:

* keys have a known or estimable range
* bucket boundaries can be chosen well
* buckets can be sorted independently
* sequential writes to bucket files are efficient
* parallel bucket processing is useful

Avoid it when the key distribution is highly skewed and bucket ranges cannot be balanced.

## Implementation Sketch

```python id="ts6f2n"
def bucket_index(x, min_key, max_key, bucket_count):
    # Clamp the maximum key into the last bucket so rounding
    # can never push it past the end.
    if x == max_key:
        return bucket_count - 1

    # Equal-width ranges over [min_key, max_key]; the +1 counts
    # both endpoints of the inclusive key range.
    width = (max_key - min_key + 1) / bucket_count
    return int((x - min_key) // width)
```

```python id="xv0u1e"
def external_bucket_sort_model(data, bucket_count):
    # In-memory model of the algorithm: the bucket lists stand in
    # for external bucket files.
    if not data:
        return []

    min_key = min(data)
    max_key = max(data)

    buckets = [[] for _ in range(bucket_count)]

    # Distribution pass: one scan, each record appended to its bucket.
    for x in data:
        i = bucket_index(x, min_key, max_key, bucket_count)
        buckets[i].append(x)

    # Sort each bucket independently, then concatenate in bucket order.
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))

    return out
```

## Notes

External bucket sort replaces repeated merging with ordered partitioning. Its performance depends on bucket balance. With good buckets, the algorithm can sort large data using mostly sequential scans and independent local sorts.

