External Bucket Sort

External bucket sort extends bucket sort to data that does not fit in memory. It partitions records into ordered buckets stored on external storage, sorts each bucket independently, and concatenates the buckets in key order.

The method works well when the key distribution is known well enough to create buckets of manageable size.

Problem

Given a large dataset and a key function, sort all records using bounded memory.

The algorithm assumes that records can be assigned to ordered buckets:

B_0, B_1, \ldots, B_{k-1}

where every key in $B_i$ is less than or equal to every key in $B_j$ whenever $i < j$ .

Core Idea

Create external buckets, scan the input once, and write each record to its bucket. Then sort each bucket separately. Since bucket ranges are ordered, the final output is the concatenation of sorted buckets.

external_bucket_sort(input, bucket_ranges):
    buckets = create_external_buckets(bucket_ranges)

    for record in input:
        i = bucket_index(record.key, bucket_ranges)
        append record to buckets[i]

    for bucket in buckets:
        sort_bucket(bucket)

    return concatenate buckets in order

Bucket Assignment

For fixed numeric ranges, bucket assignment can be computed directly.

bucket_index(x, min_key, max_key, k):
    width = (max_key - min_key + 1) / k
    return floor((x - min_key) / width)

For arbitrary splitters, use binary search.

bucket_index_by_splitters(key, splitters):
    return first i such that key <= splitters[i]

Example

Input:

[42, 7, 18, 93, 55, 12, 71, 30]

Use ranges:

bucket	range
$B_0$	0 to 24
$B_1$	25 to 49
$B_2$	50 to 74
$B_3$	75 to 99

Distribution:

bucket	values
$B_0$	7, 18, 12
$B_1$	42, 30
$B_2$	55, 71
$B_3$	93

Sort buckets:

bucket	sorted
$B_0$	7, 12, 18
$B_1$	30, 42
$B_2$	55, 71
$B_3$	93

Concatenate:

[7, 12, 18, 30, 42, 55, 71, 93]

Correctness

The bucket ranges are ordered and non-overlapping. Each record is assigned to exactly one bucket whose range contains its key.

Each bucket is sorted internally. Since every key in an earlier bucket is less than or equal to every key in a later bucket, concatenating buckets in bucket order produces a globally sorted sequence.

Every record is written to one bucket and emitted once, so the result is a sorted permutation of the input.

Complexity

Let:

$N$ be the number of records
$k$ be the number of buckets
$B$ be the block size
$N_i$ be the size of bucket $i$

Distribution I/O:

O\left(\frac{N}{B}\right)

Sorting cost:

\sum_{i=0}^{k-1} T(N_i)

With balanced buckets and comparison sorting inside buckets:

O(N \log (N/k))

CPU time.

If buckets fit in memory, each bucket can be sorted internally and written once.

Design Choices

choice	effect
number of buckets	controls bucket size
bucket ranges	determines balance
buffering	reduces small writes
recursive bucketing	handles oversized buckets
stable bucket sort	preserves equal-key order

When to Use

Use external bucket sort when:

keys have a known or estimable range
bucket boundaries can be chosen well
buckets can be sorted independently
sequential writes to bucket files are efficient
parallel bucket processing is useful

Avoid it when the key distribution is highly skewed and bucket ranges cannot be balanced.

Implementation Sketch

def bucket_index(x, min_key, max_key, bucket_count):
    if x == max_key:
        return bucket_count - 1

    width = (max_key - min_key + 1) / bucket_count
    return int((x - min_key) // width)

def external_bucket_sort_model(data, bucket_count):
    if not data:
        return []

    min_key = min(data)
    max_key = max(data)

    buckets = [[] for _ in range(bucket_count)]

    for x in data:
        i = bucket_index(x, min_key, max_key, bucket_count)
        buckets[i].append(x)

    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))

    return out

Notes

External bucket sort replaces repeated merging with ordered partitioning. Its performance depends on bucket balance. With good buckets, the algorithm can sort large data using mostly sequential scans and independent local sorts.