# Parallel Sample Sort

# Parallel Sample Sort

Parallel sample sort is a parallel sorting algorithm based on partitioning by sampled splitters. It first chooses representative keys from the input, uses them to divide the data into buckets, sorts each bucket independently, then concatenates the buckets in order.

The method is useful when many processors are available because most work after partitioning is independent. Its quality depends on how well the samples represent the input distribution.

## Problem

Given an array $A$ of size $n$, sort all elements in nondecreasing order using $p$ parallel workers.

## Algorithm

Choose samples from the input, sort the samples, and select splitters. The splitters define bucket boundaries. Each element is assigned to one bucket. Buckets are then sorted independently.

```text id="2u2m08"
parallel_sample_sort(A, p):
    choose sample keys from A
    sort the samples

    choose p - 1 splitters from the sorted samples

    create p empty buckets

    parallel for each element x in A:
        b = bucket_index(x, splitters)
        append x to bucket b

    parallel for each bucket:
        sort bucket

    concatenate buckets in splitter order
```

The splitters must satisfy:

$$
s_1 \le s_2 \le \cdots \le s_{p-1}
$$

Elements in bucket $0$ are at most $s_1$. Elements in bucket $p-1$ are greater than or equal to $s_{p-1}$.

## Bucket Lookup

A bucket can be found by binary search over the splitter array.

```text id="sljfr8"
bucket_index(x, splitters):
    return upper_bound(splitters, x)
```

For $p$ buckets, this costs:

$$
O(\log p)
$$

per element.

## Complexity

| measure             | value                                         |
| ------------------- | --------------------------------------------- |
| sampling            | $O(p \log p)$ or smaller                      |
| partitioning        | $O(n \log p)$ with binary search              |
| bucket sorting      | about $O((n/p)\log(n/p))$ per balanced bucket |
| total work          | $O(n \log n)$                                 |
| ideal parallel time | $O((n/p)\log(n/p) + n/p)$                     |

When buckets are well balanced, the algorithm scales well. When buckets are skewed, one worker may receive a much larger bucket and dominate runtime.

## Correctness

The splitters impose a global order on the buckets. Every element in an earlier bucket is less than or equal to every element in a later bucket. Each bucket is sorted locally. Concatenating locally sorted buckets in splitter order therefore produces one globally sorted array.

## Practical Considerations

* Oversampling improves bucket balance.
* Duplicate heavy inputs need careful bucket rules.
* Buckets should be built with prefix sums or per worker buffers to avoid contention.
* Local bucket sorting can use quicksort, mergesort, radix sort, or introsort.
* Communication cost matters in distributed implementations.

## When to Use

Use parallel sample sort when:

* the input is large
* many workers are available
* the data can be redistributed efficiently
* approximate balance can be achieved by sampling

Avoid it when the input is small, communication is expensive, or the data distribution causes severe bucket imbalance.

## Implementation (Go, simplified)

```go id="33brzu"
func ParallelSampleSort(a []int, workers int) []int {
	if len(a) <= 1 || workers <= 1 {
		out := append([]int(nil), a...)
		sort.Ints(out)
		return out
	}

	splitters := chooseSplitters(a, workers)

	buckets := make([][]int, workers)
	for _, x := range a {
		b := sort.Search(len(splitters), func(i int) bool {
			return x <= splitters[i]
		})
		buckets[b] = append(buckets[b], x)
	}

	done := make(chan struct{}, workers)

	for i := range buckets {
		go func(i int) {
			sort.Ints(buckets[i])
			done <- struct{}{}
		}(i)
	}

	for range buckets {
		<-done
	}

	out := make([]int, 0, len(a))
	for _, b := range buckets {
		out = append(out, b...)
	}

	return out
}

func chooseSplitters(a []int, workers int) []int {
	sample := append([]int(nil), a...)
	sort.Ints(sample)

	splitters := make([]int, 0, workers-1)
	for i := 1; i < workers; i++ {
		idx := i * len(sample) / workers
		splitters = append(splitters, sample[idx])
	}
	return splitters
}
```

