Skip to content

Distribution Sweeping Sort

External-memory sorting method that partitions data into key ranges, sorts each range, and concatenates the sorted partitions.

Distribution sweeping sort is an external-memory sorting method based on partitioning. Instead of repeatedly merging sorted runs, it divides the input into key ranges, places each element into the correct range, sorts each range separately, and then concatenates the sorted ranges.

The method is useful when range partitioning can be done efficiently and each partition can later be sorted with bounded memory.

Problem

Given a dataset too large to fit in memory, sort it by key using external storage.

The algorithm assumes that keys can be partitioned into ordered buckets:

B0,B1,,Bk1 B_0, B_1, \ldots, B_{k-1}

such that every key in bucket BiB_i is less than or equal to every key in bucket BjB_j whenever i<ji < j.

Core Idea

Choose splitters that divide the key space into ordered ranges. Sweep through the input once and write each record to its corresponding range partition. Then sort each partition independently.

Since the partitions are ordered by key range, the final result is obtained by concatenating them in bucket order.

Algorithm

distribution_sweeping_sort(input, splitters):
    buckets = create_external_buckets(splitters)

    for record in input:
        i = bucket_index(record.key, splitters)
        append record to buckets[i]

    for bucket in buckets:
        sort_bucket(bucket)

    output = concatenate buckets in increasing key order
    return output

The bucket index can be found by binary search over splitters.

bucket_index(key, splitters):
    return first index i such that key <= splitters[i]

Example

Suppose the input is:

[42,7,18,93,55,12,71,30] [42, 7, 18, 93, 55, 12, 71, 30]

Choose splitters:

25,50,75 25, 50, 75

This creates four buckets:

bucketkey rangevalues
B0B_025\leq 257, 18, 12
B1B_126 to 5042, 30
B2B_251 to 7555, 71
B3B_3>75> 7593

Sort each bucket:

bucketsorted values
B0B_07, 12, 18
B1B_130, 42
B2B_255, 71
B3B_393

Concatenate:

[7,12,18,30,42,55,71,93] [7, 12, 18, 30, 42, 55, 71, 93]

Correctness

The splitters define ordered key ranges. Every record is assigned to exactly one bucket, and all keys in an earlier bucket are less than or equal to all keys in a later bucket.

Each bucket is sorted internally. Therefore, within each bucket the output order is sorted. Across buckets, the splitter order guarantees that no later bucket contains a key smaller than an earlier bucket. Concatenating the sorted buckets in range order gives a globally sorted sequence.

Complexity

Let:

  • NN be the number of elements
  • kk be the number of buckets
  • BB be the block size
  • NiN_i be the size of bucket ii

The distribution pass costs:

O(NB) O\left(\frac{N}{B}\right)

I/O plus bucket writes.

If each bucket is sorted independently, the total sorting cost is:

i=0k1T(Ni) \sum_{i=0}^{k-1} T(N_i)

With balanced buckets and comparison sorting inside buckets:

O(Nlog(N/k)) O(N \log (N/k))

CPU time, plus external I/O for bucket sorting.

Design Choices

choiceeffect
good splittersbalanced buckets
too few bucketslarge bucket sorts
too many bucketsmany files and buffers
samplingimproves splitter quality
recursive partitioninghandles oversized buckets

When to Use

Use distribution sweeping sort when:

  • the key distribution can be estimated
  • ordered range partitioning is natural
  • partitions can be sorted independently
  • concatenation is cheaper than repeated merging
  • parallel or distributed bucket sorting is useful

It is less attractive when the key distribution is highly skewed and splitters are poor.

Implementation Sketch

from bisect import bisect_left

def distribution_sweeping_sort(a, splitters):
    buckets = [[] for _ in range(len(splitters) + 1)]

    for x in a:
        i = bisect_left(splitters, x)
        buckets[i].append(x)

    out = []
    for bucket in buckets:
        bucket.sort()
        out.extend(bucket)

    return out

Notes

Distribution sweeping sort is closely related to bucket sort, sample sort, and range-partitioned external sorting. Its main idea is to use ordering information before sorting, so the global order is mostly handled by partition placement rather than by repeated merge passes.