Skip to content

External Sample Sort

External-memory sorting algorithm that uses sampling to partition data into balanced buckets, then sorts each bucket independently.

External sample sort is a distribution-based external sorting algorithm. It selects a sample of the input to determine splitters, partitions the data into balanced buckets using those splitters, and then sorts each bucket independently.

The method reduces the number of merge passes by ensuring that buckets are roughly equal in size.

Problem

Given a large dataset that does not fit in memory, sort it efficiently while minimizing I/O passes and balancing work across partitions.

Core Idea

  1. Take a sample of the data
  2. sort the sample and choose splitters
  3. partition the full dataset into buckets using splitters
  4. sort each bucket independently
  5. concatenate buckets in order
external_sample_sort(input):
    sample = sample_records(input)
    sort(sample)
    splitters = choose_splitters(sample)

    buckets = partition(input, splitters)

    for bucket in buckets:
        sort(bucket)

    return concatenate(buckets)

Sampling

Sampling estimates the distribution of keys.

sample_records(input):
    take random or regular samples
    return sample set

Splitters are chosen from the sorted sample so that each bucket receives roughly equal numbers of records.

Partitioning

Each record is assigned to a bucket based on splitters.

partition(input, splitters):
    create k buckets

    for record in input:
        i = bucket_index(record.key, splitters)
        append record to bucket i

    return buckets

The bucket index can be found using binary search.

Example

Input:

[42,7,18,93,55,12,71,30] [42, 7, 18, 93, 55, 12, 71, 30]

Sample:

[7,18,42,71] [7, 18, 42, 71]

Choose splitters:

25,50 25, 50

Buckets:

bucketvalues
B0B_07, 18, 12
B1B_142, 30
B2B_293, 55, 71

Sort each bucket:

bucketsorted
B0B_07, 12, 18
B1B_130, 42
B2B_255, 71, 93

Concatenate:

[7,12,18,30,42,55,71,93] [7, 12, 18, 30, 42, 55, 71, 93]

Correctness

The splitters define ordered key ranges. Each record is placed into exactly one bucket whose range contains its key.

Each bucket is sorted independently. Since all keys in bucket ii are less than or equal to all keys in bucket jj for i<ji < j, concatenating buckets in order produces a globally sorted sequence.

Complexity

Let:

  • NN be the number of records
  • kk be the number of buckets
  • BB be the block size

Sampling cost:

O(NB) O\left(\frac{N}{B}\right)

Partitioning cost:

O(NB) O\left(\frac{N}{B}\right)

Sorting cost per bucket:

i=1kO(NilogNi) \sum_{i=1}^{k} O(N_i \log N_i)

With balanced buckets:

O(Nlog(N/k)) O(N \log (N/k))

Total I/O:

O(NB) O\left(\frac{N}{B}\right)

for partitioning plus bucket sorting I/O.

Design Choices

choiceeffect
sample sizelarger sample improves balance
number of buckets kkcontrols bucket size
splitter selectionaffects load distribution
recursive partitioninghandles oversized buckets
bufferingreduces I/O overhead

When to Use

Use external sample sort when:

  • key distribution can be estimated
  • balanced partitions are important
  • parallel processing of buckets is desired
  • merge passes should be minimized
  • dataset is very large

It is widely used in distributed systems and large-scale sorting frameworks.

Implementation Sketch

from bisect import bisect_left

def partition(data, splitters):
    buckets = [[] for _ in range(len(splitters) + 1)]

    for x in data:
        i = bisect_left(splitters, x)
        buckets[i].append(x)

    return buckets
def external_sample_sort(data):
    sample = sorted(data[::max(1, len(data)//4)])
    splitters = sample[1:-1]

    buckets = partition(data, splitters)

    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))

    return result

Notes

External sample sort shifts work from merging to partitioning. Its performance depends heavily on good sampling. When buckets are balanced, sorting becomes efficient and parallelizable.