External-memory sorting algorithm that uses sampling to partition data into balanced buckets, then sorts each bucket independently.
External sample sort is a distribution-based external sorting algorithm. It selects a sample of the input to determine splitters, partitions the data into balanced buckets using those splitters, and then sorts each bucket independently.
The method reduces the number of merge passes by ensuring that buckets are roughly equal in size.
Problem
Given a large dataset that does not fit in memory, sort it efficiently while minimizing I/O passes and balancing work across partitions.
Core Idea
- Take a sample of the data
- sort the sample and choose splitters
- partition the full dataset into buckets using splitters
- sort each bucket independently
- concatenate buckets in order
external_sample_sort(input):
sample = sample_records(input)
sort(sample)
splitters = choose_splitters(sample)
buckets = partition(input, splitters)
for bucket in buckets:
sort(bucket)
return concatenate(buckets)Sampling
Sampling estimates the distribution of keys.
sample_records(input):
take random or regular samples
return sample setSplitters are chosen from the sorted sample so that each bucket receives roughly equal numbers of records.
Partitioning
Each record is assigned to a bucket based on splitters.
partition(input, splitters):
create k buckets
for record in input:
i = bucket_index(record.key, splitters)
append record to bucket i
return bucketsThe bucket index can be found using binary search.
Example
Input:
Sample:
Choose splitters:
Buckets:
| bucket | values |
|---|---|
| 7, 18, 12 | |
| 42, 30 | |
| 93, 55, 71 |
Sort each bucket:
| bucket | sorted |
|---|---|
| 7, 12, 18 | |
| 30, 42 | |
| 55, 71, 93 |
Concatenate:
Correctness
The splitters define ordered key ranges. Each record is placed into exactly one bucket whose range contains its key.
Each bucket is sorted independently. Since all keys in bucket are less than or equal to all keys in bucket for , concatenating buckets in order produces a globally sorted sequence.
Complexity
Let:
- be the number of records
- be the number of buckets
- be the block size
Sampling cost:
Partitioning cost:
Sorting cost per bucket:
With balanced buckets:
Total I/O:
for partitioning plus bucket sorting I/O.
Design Choices
| choice | effect |
|---|---|
| sample size | larger sample improves balance |
| number of buckets | controls bucket size |
| splitter selection | affects load distribution |
| recursive partitioning | handles oversized buckets |
| buffering | reduces I/O overhead |
When to Use
Use external sample sort when:
- key distribution can be estimated
- balanced partitions are important
- parallel processing of buckets is desired
- merge passes should be minimized
- dataset is very large
It is widely used in distributed systems and large-scale sorting frameworks.
Implementation Sketch
from bisect import bisect_left
def partition(data, splitters):
buckets = [[] for _ in range(len(splitters) + 1)]
for x in data:
i = bisect_left(splitters, x)
buckets[i].append(x)
return bucketsdef external_sample_sort(data):
sample = sorted(data[::max(1, len(data)//4)])
splitters = sample[1:-1]
buckets = partition(data, splitters)
result = []
for bucket in buckets:
result.extend(sorted(bucket))
return resultNotes
External sample sort shifts work from merging to partitioning. Its performance depends heavily on good sampling. When buckets are balanced, sorting becomes efficient and parallelizable.