External-memory sorting method that partitions data into key ranges, sorts each range, and concatenates the sorted partitions.
Distribution sweeping sort is an external-memory sorting method based on partitioning. Instead of repeatedly merging sorted runs, it divides the input into key ranges, places each element into the correct range, sorts each range separately, and then concatenates the sorted ranges.
The method is useful when range partitioning can be done efficiently and each partition can later be sorted with bounded memory.
Problem
Given a dataset too large to fit in memory, sort it by key using external storage.
The algorithm assumes that keys can be partitioned into ordered buckets:
such that every key in bucket is less than or equal to every key in bucket whenever .
Core Idea
Choose splitters that divide the key space into ordered ranges. Sweep through the input once and write each record to its corresponding range partition. Then sort each partition independently.
Since the partitions are ordered by key range, the final result is obtained by concatenating them in bucket order.
Algorithm
distribution_sweeping_sort(input, splitters):
buckets = create_external_buckets(splitters)
for record in input:
i = bucket_index(record.key, splitters)
append record to buckets[i]
for bucket in buckets:
sort_bucket(bucket)
output = concatenate buckets in increasing key order
return outputThe bucket index can be found by binary search over splitters.
bucket_index(key, splitters):
return first index i such that key <= splitters[i]Example
Suppose the input is:
Choose splitters:
This creates four buckets:
| bucket | key range | values |
|---|---|---|
| 7, 18, 12 | ||
| 26 to 50 | 42, 30 | |
| 51 to 75 | 55, 71 | |
| 93 |
Sort each bucket:
| bucket | sorted values |
|---|---|
| 7, 12, 18 | |
| 30, 42 | |
| 55, 71 | |
| 93 |
Concatenate:
Correctness
The splitters define ordered key ranges. Every record is assigned to exactly one bucket, and all keys in an earlier bucket are less than or equal to all keys in a later bucket.
Each bucket is sorted internally. Therefore, within each bucket the output order is sorted. Across buckets, the splitter order guarantees that no later bucket contains a key smaller than an earlier bucket. Concatenating the sorted buckets in range order gives a globally sorted sequence.
Complexity
Let:
- be the number of elements
- be the number of buckets
- be the block size
- be the size of bucket
The distribution pass costs:
I/O plus bucket writes.
If each bucket is sorted independently, the total sorting cost is:
With balanced buckets and comparison sorting inside buckets:
CPU time, plus external I/O for bucket sorting.
Design Choices
| choice | effect |
|---|---|
| good splitters | balanced buckets |
| too few buckets | large bucket sorts |
| too many buckets | many files and buffers |
| sampling | improves splitter quality |
| recursive partitioning | handles oversized buckets |
When to Use
Use distribution sweeping sort when:
- the key distribution can be estimated
- ordered range partitioning is natural
- partitions can be sorted independently
- concatenation is cheaper than repeated merging
- parallel or distributed bucket sorting is useful
It is less attractive when the key distribution is highly skewed and splitters are poor.
Implementation Sketch
from bisect import bisect_left
def distribution_sweeping_sort(a, splitters):
buckets = [[] for _ in range(len(splitters) + 1)]
for x in a:
i = bisect_left(splitters, x)
buckets[i].append(x)
out = []
for bucket in buckets:
bucket.sort()
out.extend(bucket)
return outNotes
Distribution sweeping sort is closely related to bucket sort, sample sort, and range-partitioned external sorting. Its main idea is to use ordering information before sorting, so the global order is mostly handled by partition placement rather than by repeated merge passes.