Parallel Sample Sort

Choose splitters from samples, partition the input into buckets, sort buckets independently, then concatenate the sorted buckets.

Parallel sample sort is a parallel sorting algorithm based on partitioning by sampled splitters. It first chooses representative keys from the input, uses them to divide the data into buckets, sorts each bucket independently, then concatenates the buckets in order.

The method is useful when many processors are available because most work after partitioning is independent. Its quality depends on how well the samples represent the input distribution.

Problem

Given an array A of size n, sort all elements in nondecreasing order using p parallel workers.

Algorithm

Choose samples from the input, sort the samples, and select splitters. The splitters define bucket boundaries. Each element is assigned to one bucket. Buckets are then sorted independently.

parallel_sample_sort(A, p):
    choose sample keys from A
    sort the samples

    choose p - 1 splitters from the sorted samples

    create p empty buckets

    parallel for each element x in A:
        b = bucket_index(x, splitters)
        append x to bucket b

    parallel for each bucket:
        sort bucket

    concatenate buckets in splitter order

The splitters must satisfy:

s_1 \le s_2 \le \cdots \le s_{p-1}

Elements in bucket 0 are at most s_1. Elements in bucket p-1 are greater than s_{p-1}. In general, bucket b for 0 < b < p-1 holds keys x with s_b < x \le s_{b+1}. For example, with p = 3 and splitters 3 and 7, bucket 0 holds keys at most 3, bucket 1 holds keys in (3, 7], and bucket 2 holds keys greater than 7.

Bucket Lookup

A bucket can be found by binary search over the splitter array: the bucket index is the position of the first splitter greater than or equal to x, so keys equal to a splitter go to the earlier bucket.

bucket_index(x, splitters):
    return lower_bound(splitters, x)

For p buckets, this costs

O(\log p)

per element.
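In Go, this lookup maps directly onto sort.Search from the standard library, using an x <= splitters[i] predicate that finds the first splitter not less than x. A standalone sketch (the name bucketIndex is illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// bucketIndex returns the bucket that should hold x, by binary
// search over the ascending splitter slice: the first splitter >= x
// marks the bucket, and keys greater than every splitter fall into
// the last bucket (index len(splitters)).
func bucketIndex(x int, splitters []int) int {
	return sort.Search(len(splitters), func(i int) bool {
		return x <= splitters[i]
	})
}

func main() {
	splitters := []int{3, 7} // p = 3 buckets
	for _, x := range []int{1, 3, 5, 9} {
		fmt.Println(x, "->", bucketIndex(x, splitters))
		// 1 -> 0, 3 -> 0, 5 -> 1, 9 -> 2
	}
}
```

Note that a key equal to a splitter (here 3) lands in the earlier bucket, matching the bucket-boundary rule above.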

Complexity

measure               value
sampling              O(p \log p) or smaller
partitioning          O(n \log p) with binary search
bucket sorting        about O((n/p) \log(n/p)) per balanced bucket
total work            O(n \log n)
ideal parallel time   O((n/p) \log(n/p) + n/p)

When buckets are well balanced, the algorithm scales well. When buckets are skewed, one worker may receive a much larger bucket and dominate runtime.

Correctness

The splitters impose a global order on the buckets. Every element in an earlier bucket is less than or equal to every element in a later bucket. Each bucket is sorted locally. Concatenating locally sorted buckets in splitter order therefore produces one globally sorted array.
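The argument can be checked mechanically on a small instance. This throwaway sketch (the helper name sortViaBuckets is illustrative) scatters keys by the splitters, sorts each bucket locally, and concatenates in splitter order:

```go
package main

import (
	"fmt"
	"sort"
)

// sortViaBuckets scatters input into buckets delimited by the
// ascending splitters, sorts each bucket locally, and concatenates
// the buckets in splitter order.
func sortViaBuckets(input, splitters []int) []int {
	buckets := make([][]int, len(splitters)+1)
	for _, x := range input {
		// First splitter >= x picks the bucket.
		b := sort.Search(len(splitters), func(i int) bool {
			return x <= splitters[i]
		})
		buckets[b] = append(buckets[b], x)
	}
	out := make([]int, 0, len(input))
	for _, bkt := range buckets {
		sort.Ints(bkt)
		out = append(out, bkt...)
	}
	return out
}

func main() {
	out := sortViaBuckets([]int{5, 1, 9, 3, 7, 2, 8, 4}, []int{3, 7})
	fmt.Println(out) // [1 2 3 4 5 7 8 9]
}
```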

Practical Considerations

  • Oversampling improves bucket balance.
  • Duplicate heavy inputs need careful bucket rules.
  • Buckets should be built with prefix sums or per worker buffers to avoid contention.
  • Local bucket sorting can use quicksort, mergesort, radix sort, or introsort.
  • Communication cost matters in distributed implementations.
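The contention point is the shared bucket slices during partitioning. One common remedy is per-worker buffers: each worker scatters its own contiguous chunk into private buckets with no locks or shared appends, and the private buckets are joined per bucket index afterwards. A sketch under the same splitter convention as the implementation below (the helper partitionParallel and its chunking scheme are illustrative):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// partitionParallel splits a into one contiguous chunk per worker.
// Each worker scatters its chunk into private buckets, so no two
// goroutines ever append to the same slice. local[w][b] holds
// worker w's keys destined for bucket b.
func partitionParallel(a, splitters []int, workers int) [][]int {
	local := make([][][]int, workers)
	chunk := (len(a) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		local[w] = make([][]int, len(splitters)+1)
		lo, hi := w*chunk, (w+1)*chunk
		if lo > len(a) {
			lo = len(a)
		}
		if hi > len(a) {
			hi = len(a)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			for _, x := range a[lo:hi] {
				b := sort.Search(len(splitters), func(i int) bool {
					return x <= splitters[i]
				})
				local[w][b] = append(local[w][b], x)
			}
		}(w, lo, hi)
	}
	wg.Wait()

	// Join: bucket b is worker 0's part, then worker 1's, and so on.
	buckets := make([][]int, len(splitters)+1)
	for b := range buckets {
		for w := 0; w < workers; w++ {
			buckets[b] = append(buckets[b], local[w][b]...)
		}
	}
	return buckets
}

func main() {
	fmt.Println(partitionParallel([]int{5, 1, 9, 3, 7, 2, 8}, []int{3, 7}, 2))
	// [[1 3 2] [5 7] [9 8]]
}
```

In a production version the per-worker bucket sizes would typically be counted first and combined with a prefix sum, so the scatter can write directly into one preallocated output array.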

When to Use

Use parallel sample sort when:

  • the input is large
  • many workers are available
  • the data can be redistributed efficiently
  • approximate balance can be achieved by sampling

Avoid it when the input is small, communication is expensive, or the data distribution causes severe bucket imbalance.

Implementation (Go, simplified)

package samplesort

import "sort"

// ParallelSampleSort returns a sorted copy of a, using up to
// `workers` goroutines for the bucket-sorting phase. The input
// slice is left unmodified.
func ParallelSampleSort(a []int, workers int) []int {
	// Trivial cases: sort a copy sequentially.
	if len(a) <= 1 || workers <= 1 {
		out := append([]int(nil), a...)
		sort.Ints(out)
		return out
	}

	splitters := chooseSplitters(a, workers)

	// Scatter: the first splitter >= x picks the bucket; keys
	// greater than every splitter land in the last bucket.
	buckets := make([][]int, workers)
	for _, x := range a {
		b := sort.Search(len(splitters), func(i int) bool {
			return x <= splitters[i]
		})
		buckets[b] = append(buckets[b], x)
	}

	// Sort the buckets concurrently, one goroutine per bucket.
	done := make(chan struct{}, workers)
	for i := range buckets {
		go func(i int) {
			sort.Ints(buckets[i])
			done <- struct{}{}
		}(i)
	}
	for range buckets {
		<-done
	}

	// Concatenate the sorted buckets in splitter order.
	out := make([]int, 0, len(a))
	for _, b := range buckets {
		out = append(out, b...)
	}
	return out
}

// chooseSplitters sorts a full copy of the input and picks p-1
// evenly spaced keys, which come out nondecreasing because the
// sample is sorted. This keeps the example short but costs
// O(n log n) sequentially; a real implementation samples far
// fewer keys, typically with oversampling.
func chooseSplitters(a []int, workers int) []int {
	sample := append([]int(nil), a...)
	sort.Ints(sample)

	splitters := make([]int, 0, workers-1)
	for i := 1; i < workers; i++ {
		idx := i * len(sample) / workers
		splitters = append(splitters, sample[idx])
	}
	return splitters
}