External Sample Sort

External sample sort is a distribution-based external sorting algorithm. It selects a sample of the input to determine splitters, partitions the data into balanced buckets using those splitters, and then sorts each bucket independently.

The method reduces the number of merge passes by ensuring that buckets are roughly equal in size.

Problem

Given a large dataset that does not fit in memory, sort it efficiently while minimizing I/O passes and balancing work across partitions.

Core Idea

Take a sample of the data
sort the sample and choose splitters
partition the full dataset into buckets using splitters
sort each bucket independently
concatenate buckets in order

external_sample_sort(input):
    sample = sample_records(input)
    sort(sample)
    splitters = choose_splitters(sample)

    buckets = partition(input, splitters)

    for bucket in buckets:
        sort(bucket)

    return concatenate(buckets)

Sampling

Sampling estimates the distribution of keys.

sample_records(input):
    take random or regular samples
    return sample set

Splitters are chosen from the sorted sample so that each bucket receives roughly equal numbers of records.

Partitioning

Each record is assigned to a bucket based on splitters.

partition(input, splitters):
    create k buckets

    for record in input:
        i = bucket_index(record.key, splitters)
        append record to bucket i

    return buckets

The bucket index can be found using binary search.

Example

Input:

[42, 7, 18, 93, 55, 12, 71, 30]

Sample:

[7, 18, 42, 71]

Choose splitters:

25, 50

Buckets:

bucket	values
$B_0$	7, 18, 12
$B_1$	42, 30
$B_2$	93, 55, 71

Sort each bucket:

bucket	sorted
$B_0$	7, 12, 18
$B_1$	30, 42
$B_2$	55, 71, 93

Concatenate:

[7, 12, 18, 30, 42, 55, 71, 93]

Correctness

The splitters define ordered key ranges. Each record is placed into exactly one bucket whose range contains its key.

Each bucket is sorted independently. Since all keys in bucket $i$ are less than or equal to all keys in bucket $j$ for $i < j$ , concatenating buckets in order produces a globally sorted sequence.

Complexity

Let:

$N$ be the number of records
$k$ be the number of buckets
$B$ be the block size

Sampling cost:

O\left(\frac{N}{B}\right)

Partitioning cost:

O\left(\frac{N}{B}\right)

Sorting cost per bucket:

\sum_{i=1}^{k} O(N_i \log N_i)

With balanced buckets:

O(N \log (N/k))

Total I/O:

O\left(\frac{N}{B}\right)

for partitioning plus bucket sorting I/O.

Design Choices

choice	effect
sample size	larger sample improves balance
number of buckets $k$	controls bucket size
splitter selection	affects load distribution
recursive partitioning	handles oversized buckets
buffering	reduces I/O overhead

When to Use

Use external sample sort when:

key distribution can be estimated
balanced partitions are important
parallel processing of buckets is desired
merge passes should be minimized
dataset is very large

It is widely used in distributed systems and large-scale sorting frameworks.

Implementation Sketch

from bisect import bisect_left

def partition(data, splitters):
    buckets = [[] for _ in range(len(splitters) + 1)]

    for x in data:
        i = bisect_left(splitters, x)
        buckets[i].append(x)

    return buckets

def external_sample_sort(data):
    sample = sorted(data[::max(1, len(data)//4)])
    splitters = sample[1:-1]

    buckets = partition(data, splitters)

    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))

    return result

Notes

External sample sort shifts work from merging to partitioning. Its performance depends heavily on good sampling. When buckets are balanced, sorting becomes efficient and parallelizable.