# External Sample Sort

# External Sample Sort

External sample sort is a distribution-based external sorting algorithm. It selects a sample of the input to determine splitters, partitions the data into balanced buckets using those splitters, and then sorts each bucket independently.

The method reduces the number of merge passes by ensuring that buckets are roughly equal in size.

## Problem

Given a large dataset that does not fit in memory, sort it efficiently while minimizing I/O passes and balancing work across partitions.

## Core Idea

1. Take a sample of the data
2. sort the sample and choose splitters
3. partition the full dataset into buckets using splitters
4. sort each bucket independently
5. concatenate buckets in order

```text id="9s6k3q"
external_sample_sort(input):
    sample = sample_records(input)
    sort(sample)
    splitters = choose_splitters(sample)

    buckets = partition(input, splitters)

    for bucket in buckets:
        sort(bucket)

    return concatenate(buckets)
```

## Sampling

Sampling estimates the distribution of keys.

```text id="3u7y1p"
sample_records(input):
    take random or regular samples
    return sample set
```

Splitters are chosen from the sorted sample so that each bucket receives roughly equal numbers of records.

## Partitioning

Each record is assigned to a bucket based on splitters.

```text id="5r8k0m"
partition(input, splitters):
    create k buckets

    for record in input:
        i = bucket_index(record.key, splitters)
        append record to bucket i

    return buckets
```

The bucket index can be found using binary search.

## Example

Input:

$$
[42, 7, 18, 93, 55, 12, 71, 30]
$$

Sample:

$$
[7, 18, 42, 71]
$$

Choose splitters:

$$
25, 50
$$

Buckets:

| bucket | values     |
| ------ | ---------- |
| $B_0$  | 7, 18, 12  |
| $B_1$  | 42, 30     |
| $B_2$  | 93, 55, 71 |

Sort each bucket:

| bucket | sorted     |
| ------ | ---------- |
| $B_0$  | 7, 12, 18  |
| $B_1$  | 30, 42     |
| $B_2$  | 55, 71, 93 |

Concatenate:

$$
[7, 12, 18, 30, 42, 55, 71, 93]
$$

## Correctness

The splitters define ordered key ranges. Each record is placed into exactly one bucket whose range contains its key.

Each bucket is sorted independently. Since all keys in bucket $i$ are less than or equal to all keys in bucket $j$ for $i < j$, concatenating buckets in order produces a globally sorted sequence.

## Complexity

Let:

* $N$ be the number of records
* $k$ be the number of buckets
* $B$ be the block size

Sampling cost:

$$
O\left(\frac{N}{B}\right)
$$

Partitioning cost:

$$
O\left(\frac{N}{B}\right)
$$

Sorting cost per bucket:

$$
\sum_{i=1}^{k} O(N_i \log N_i)
$$

With balanced buckets:

$$
O(N \log (N/k))
$$

Total I/O:

$$
O\left(\frac{N}{B}\right)
$$

for partitioning plus bucket sorting I/O.

## Design Choices

| choice                 | effect                         |
| ---------------------- | ------------------------------ |
| sample size            | larger sample improves balance |
| number of buckets $k$  | controls bucket size           |
| splitter selection     | affects load distribution      |
| recursive partitioning | handles oversized buckets      |
| buffering              | reduces I/O overhead           |

## When to Use

Use external sample sort when:

* key distribution can be estimated
* balanced partitions are important
* parallel processing of buckets is desired
* merge passes should be minimized
* dataset is very large

It is widely used in distributed systems and large-scale sorting frameworks.

## Implementation Sketch

```python id="k8v3zt"
from bisect import bisect_left

def partition(data, splitters):
    buckets = [[] for _ in range(len(splitters) + 1)]

    for x in data:
        i = bisect_left(splitters, x)
        buckets[i].append(x)

    return buckets
```

```python id="g2m7qx"
def external_sample_sort(data):
    sample = sorted(data[::max(1, len(data)//4)])
    splitters = sample[1:-1]

    buckets = partition(data, splitters)

    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))

    return result
```

## Notes

External sample sort shifts work from merging to partitioning. Its performance depends heavily on good sampling. When buckets are balanced, sorting becomes efficient and parallelizable.

