Distribution Sweeping Sort

Distribution sweeping sort is an external-memory sorting method based on partitioning. Instead of repeatedly merging sorted runs, it divides the input into key ranges, places each element into the correct range, sorts each range separately, and then concatenates the sorted ranges.

The method is useful when range partitioning can be done efficiently and each partition can later be sorted with bounded memory.

Problem

Given a dataset too large to fit in memory, sort it by key using external storage.

The algorithm assumes that keys can be partitioned into ordered buckets:

B_0, B_1, \ldots, B_{k-1}

such that every key in bucket $B_i$ is less than or equal to every key in bucket $B_j$ whenever $i < j$ .

Core Idea

Choose splitters that divide the key space into ordered ranges. Sweep through the input once and write each record to its corresponding range partition. Then sort each partition independently.

Since the partitions are ordered by key range, the final result is obtained by concatenating them in bucket order.

Algorithm

distribution_sweeping_sort(input, splitters):
    buckets = create_external_buckets(splitters)

    for record in input:
        i = bucket_index(record.key, splitters)
        append record to buckets[i]

    for bucket in buckets:
        sort_bucket(bucket)

    output = concatenate buckets in increasing key order
    return output

The bucket index can be found by binary search over splitters.

bucket_index(key, splitters):
    return first index i such that key <= splitters[i]

Example

Suppose the input is:

[42, 7, 18, 93, 55, 12, 71, 30]

Choose splitters:

25, 50, 75

This creates four buckets:

bucket	key range	values
$B_0$	$\leq 25$	7, 18, 12
$B_1$	26 to 50	42, 30
$B_2$	51 to 75	55, 71
$B_3$	$> 75$	93

Sort each bucket:

bucket	sorted values
$B_0$	7, 12, 18
$B_1$	30, 42
$B_2$	55, 71
$B_3$	93

Concatenate:

[7, 12, 18, 30, 42, 55, 71, 93]

Correctness

The splitters define ordered key ranges. Every record is assigned to exactly one bucket, and all keys in an earlier bucket are less than or equal to all keys in a later bucket.

Each bucket is sorted internally. Therefore, within each bucket the output order is sorted. Across buckets, the splitter order guarantees that no later bucket contains a key smaller than an earlier bucket. Concatenating the sorted buckets in range order gives a globally sorted sequence.

Complexity

Let:

$N$ be the number of elements
$k$ be the number of buckets
$B$ be the block size
$N_i$ be the size of bucket $i$

The distribution pass costs:

O\left(\frac{N}{B}\right)

I/O plus bucket writes.

If each bucket is sorted independently, the total sorting cost is:

\sum_{i=0}^{k-1} T(N_i)

With balanced buckets and comparison sorting inside buckets:

O(N \log (N/k))

CPU time, plus external I/O for bucket sorting.

Design Choices

choice	effect
good splitters	balanced buckets
too few buckets	large bucket sorts
too many buckets	many files and buffers
sampling	improves splitter quality
recursive partitioning	handles oversized buckets

When to Use

Use distribution sweeping sort when:

the key distribution can be estimated
ordered range partitioning is natural
partitions can be sorted independently
concatenation is cheaper than repeated merging
parallel or distributed bucket sorting is useful

It is less attractive when the key distribution is highly skewed and splitters are poor.

Implementation Sketch

from bisect import bisect_left

def distribution_sweeping_sort(a, splitters):
    buckets = [[] for _ in range(len(splitters) + 1)]

    for x in a:
        i = bisect_left(splitters, x)
        buckets[i].append(x)

    out = []
    for bucket in buckets:
        bucket.sort()
        out.extend(bucket)

    return out

Notes

Distribution sweeping sort is closely related to bucket sort, sample sort, and range-partitioned external sorting. Its main idea is to use ordering information before sorting, so the global order is mostly handled by partition placement rather than by repeated merge passes.