# Distribution Sweeping Sort

# Distribution Sweeping Sort

Distribution sweeping sort is an external-memory sorting method based on partitioning. Instead of repeatedly merging sorted runs, it divides the input into key ranges, places each element into the correct range, sorts each range separately, and then concatenates the sorted ranges.

The method is useful when range partitioning can be done efficiently and each partition can later be sorted with bounded memory.

## Problem

Given a dataset too large to fit in memory, sort it by key using external storage.

The algorithm assumes that keys can be partitioned into ordered buckets:

$$
B_0, B_1, \ldots, B_{k-1}
$$

such that every key in bucket $B_i$ is less than or equal to every key in bucket $B_j$ whenever $i < j$.

## Core Idea

Choose splitters that divide the key space into ordered ranges. Sweep through the input once and write each record to its corresponding range partition. Then sort each partition independently.

Since the partitions are ordered by key range, the final result is obtained by concatenating them in bucket order.

## Algorithm

```text id="hiu7sx"
distribution_sweeping_sort(input, splitters):
    buckets = create_external_buckets(splitters)

    for record in input:
        i = bucket_index(record.key, splitters)
        append record to buckets[i]

    for bucket in buckets:
        sort_bucket(bucket)

    output = concatenate buckets in increasing key order
    return output
```

The bucket index can be found by binary search over splitters.

```text id="7n6qyo"
bucket_index(key, splitters):
    return first index i such that key <= splitters[i]
```

## Example

Suppose the input is:

$$
[42, 7, 18, 93, 55, 12, 71, 30]
$$

Choose splitters:

$$
25, 50, 75
$$

This creates four buckets:

| bucket | key range | values    |
| ------ | --------- | --------- |
| $B_0$  | $\leq 25$ | 7, 18, 12 |
| $B_1$  | 26 to 50  | 42, 30    |
| $B_2$  | 51 to 75  | 55, 71    |
| $B_3$  | $> 75$    | 93        |

Sort each bucket:

| bucket | sorted values |
| ------ | ------------- |
| $B_0$  | 7, 12, 18     |
| $B_1$  | 30, 42        |
| $B_2$  | 55, 71        |
| $B_3$  | 93            |

Concatenate:

$$
[7, 12, 18, 30, 42, 55, 71, 93]
$$

## Correctness

The splitters define ordered key ranges. Every record is assigned to exactly one bucket, and all keys in an earlier bucket are less than or equal to all keys in a later bucket.

Each bucket is sorted internally. Therefore, within each bucket the output order is sorted. Across buckets, the splitter order guarantees that no later bucket contains a key smaller than an earlier bucket. Concatenating the sorted buckets in range order gives a globally sorted sequence.

## Complexity

Let:

* $N$ be the number of elements
* $k$ be the number of buckets
* $B$ be the block size
* $N_i$ be the size of bucket $i$

The distribution pass costs:

$$
O\left(\frac{N}{B}\right)
$$

I/O plus bucket writes.

If each bucket is sorted independently, the total sorting cost is:

$$
\sum_{i=0}^{k-1} T(N_i)
$$

With balanced buckets and comparison sorting inside buckets:

$$
O(N \log (N/k))
$$

CPU time, plus external I/O for bucket sorting.

## Design Choices

| choice                 | effect                    |
| ---------------------- | ------------------------- |
| good splitters         | balanced buckets          |
| too few buckets        | large bucket sorts        |
| too many buckets       | many files and buffers    |
| sampling               | improves splitter quality |
| recursive partitioning | handles oversized buckets |

## When to Use

Use distribution sweeping sort when:

* the key distribution can be estimated
* ordered range partitioning is natural
* partitions can be sorted independently
* concatenation is cheaper than repeated merging
* parallel or distributed bucket sorting is useful

It is less attractive when the key distribution is highly skewed and splitters are poor.

## Implementation Sketch

```python id="m5xd2q"
from bisect import bisect_left

def distribution_sweeping_sort(a, splitters):
    buckets = [[] for _ in range(len(splitters) + 1)]

    for x in a:
        i = bisect_left(splitters, x)
        buckets[i].append(x)

    out = []
    for bucket in buckets:
        bucket.sort()
        out.extend(bucket)

    return out
```

## Notes

Distribution sweeping sort is closely related to bucket sort, sample sort, and range-partitioned external sorting. Its main idea is to use ordering information before sorting, so the global order is mostly handled by partition placement rather than by repeated merge passes.

