Skip to content

External Bucket Sort

External-memory bucket sort that partitions records into ordered buckets stored on disk and sorts each bucket separately.

External bucket sort extends bucket sort to data that does not fit in memory. It partitions records into ordered buckets stored on external storage, sorts each bucket independently, and concatenates the buckets in key order.

The method works well when the key distribution is known well enough to create buckets of manageable size.

Problem

Given a large dataset and a key function, sort all records using bounded memory.

The algorithm assumes that records can be assigned to ordered buckets:

B0,B1,,Bk1 B_0, B_1, \ldots, B_{k-1}

where every key in BiB_i is less than or equal to every key in BjB_j whenever i<ji < j.

Core Idea

Create external buckets, scan the input once, and write each record to its bucket. Then sort each bucket separately. Since bucket ranges are ordered, the final output is the concatenation of sorted buckets.

external_bucket_sort(input, bucket_ranges):
    buckets = create_external_buckets(bucket_ranges)

    for record in input:
        i = bucket_index(record.key, bucket_ranges)
        append record to buckets[i]

    for bucket in buckets:
        sort_bucket(bucket)

    return concatenate buckets in order

Bucket Assignment

For fixed numeric ranges, bucket assignment can be computed directly.

bucket_index(x, min_key, max_key, k):
    width = (max_key - min_key + 1) / k
    return floor((x - min_key) / width)

For arbitrary splitters, use binary search.

bucket_index_by_splitters(key, splitters):
    return first i such that key <= splitters[i]

Example

Input:

[42,7,18,93,55,12,71,30] [42, 7, 18, 93, 55, 12, 71, 30]

Use ranges:

bucketrange
B0B_00 to 24
B1B_125 to 49
B2B_250 to 74
B3B_375 to 99

Distribution:

bucketvalues
B0B_07, 18, 12
B1B_142, 30
B2B_255, 71
B3B_393

Sort buckets:

bucketsorted
B0B_07, 12, 18
B1B_130, 42
B2B_255, 71
B3B_393

Concatenate:

[7,12,18,30,42,55,71,93] [7, 12, 18, 30, 42, 55, 71, 93]

Correctness

The bucket ranges are ordered and non-overlapping. Each record is assigned to exactly one bucket whose range contains its key.

Each bucket is sorted internally. Since every key in an earlier bucket is less than or equal to every key in a later bucket, concatenating buckets in bucket order produces a globally sorted sequence.

Every record is written to one bucket and emitted once, so the result is a sorted permutation of the input.

Complexity

Let:

  • NN be the number of records
  • kk be the number of buckets
  • BB be the block size
  • NiN_i be the size of bucket ii

Distribution I/O:

O(NB) O\left(\frac{N}{B}\right)

Sorting cost:

i=0k1T(Ni) \sum_{i=0}^{k-1} T(N_i)

With balanced buckets and comparison sorting inside buckets:

O(Nlog(N/k)) O(N \log (N/k))

CPU time.

If buckets fit in memory, each bucket can be sorted internally and written once.

Design Choices

choiceeffect
number of bucketscontrols bucket size
bucket rangesdetermines balance
bufferingreduces small writes
recursive bucketinghandles oversized buckets
stable bucket sortpreserves equal-key order

When to Use

Use external bucket sort when:

  • keys have a known or estimable range
  • bucket boundaries can be chosen well
  • buckets can be sorted independently
  • sequential writes to bucket files are efficient
  • parallel bucket processing is useful

Avoid it when the key distribution is highly skewed and bucket ranges cannot be balanced.

Implementation Sketch

def bucket_index(x, min_key, max_key, bucket_count):
    if x == max_key:
        return bucket_count - 1

    width = (max_key - min_key + 1) / bucket_count
    return int((x - min_key) // width)
def external_bucket_sort_model(data, bucket_count):
    if not data:
        return []

    min_key = min(data)
    max_key = max(data)

    buckets = [[] for _ in range(bucket_count)]

    for x in data:
        i = bucket_index(x, min_key, max_key, bucket_count)
        buckets[i].append(x)

    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))

    return out

Notes

External bucket sort replaces repeated merging with ordered partitioning. Its performance depends on bucket balance. With good buckets, the algorithm can sort large data using mostly sequential scans and independent local sorts.