GPU Sample Sort

Sort data on a GPU by selecting splitters, partitioning into buckets in parallel, then sorting buckets independently.

GPU sample sort partitions the input into buckets using sampled splitters, then sorts each bucket independently on the GPU. It is a distribution-based algorithm similar to parallel sample sort, adapted to GPU execution with attention to memory coalescing and load balancing.

This method scales well for large inputs because, after partitioning, most of the work is independent across buckets.

Problem

Given an array A of size n on a GPU, sort the elements in nondecreasing order.

Algorithm

The algorithm proceeds in four stages: sampling, partitioning, bucket sorting, and concatenation.

gpu_sample_sort(A, p):
    select sample keys from A
    sort the samples
    choose p - 1 splitters from the sorted sample

    in parallel: assign each element to a bucket using the splitters

    count bucket sizes, then compute bucket offsets with a parallel prefix sum

    scatter elements into their bucket regions

    in parallel: sort each bucket

    return the concatenation of the buckets
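
The stages above can be written as a sequential reference implementation; the sketch below is plain C++ on the CPU, with evenly spaced sampling, an illustrative oversampling factor, and std::sort standing in for the per-bucket GPU sorts:

```cpp
#include <algorithm>
#include <vector>

// CPU reference for the four stages. A GPU version parallelizes the
// bucket assignment, the scatter, and the per-bucket sorts.
std::vector<int> sample_sort(std::vector<int> a, int p) {
    int n = (int)a.size();
    if (n <= 1 || p <= 1) { std::sort(a.begin(), a.end()); return a; }

    // 1. Sampling: take evenly spaced sample keys, sort them,
    //    and keep p - 1 splitters.
    int oversample = 4;  // oversampling factor (illustrative)
    std::vector<int> sample;
    for (int i = 0; i < p * oversample && i < n; ++i)
        sample.push_back(a[(size_t)i * n / (p * oversample)]);
    std::sort(sample.begin(), sample.end());
    std::vector<int> splitters;
    for (int i = 1; i < p; ++i)
        splitters.push_back(sample[(size_t)i * sample.size() / p]);

    // 2. Partition: assign each element to a bucket by binary search.
    std::vector<std::vector<int>> buckets(p);
    for (int x : a) {
        int b = (int)(std::upper_bound(splitters.begin(), splitters.end(), x)
                      - splitters.begin());
        buckets[b].push_back(x);
    }

    // 3. Sort each bucket independently.
    for (auto &bkt : buckets) std::sort(bkt.begin(), bkt.end());

    // 4. Concatenate the buckets in order.
    std::vector<int> out;
    for (auto &bkt : buckets) out.insert(out.end(), bkt.begin(), bkt.end());
    return out;
}
```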

Splitters define bucket boundaries:

s_1 \le s_2 \le \cdots \le s_{p-1}

Each element x belongs to bucket i such that (taking s_0 = -\infty and s_p = +\infty):

s_i \le x < s_{i+1}

Bucket Assignment

The bucket index is computed by binary search over the splitters.

bucket_index(x, splitters):
    return upper_bound(splitters, x)

On the GPU, this is often implemented as a warp-wide binary search or a small linear scan for performance.
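
On the CPU, the same computation is a single call to std::upper_bound; a minimal sketch:

```cpp
#include <algorithm>
#include <vector>

// Bucket index = number of splitters <= x, which is exactly the
// position returned by upper_bound. With sorted splitters
// s_1 <= ... <= s_{p-1}, element x lands in bucket i with
// s_i <= x < s_{i+1} (s_0 = -inf, s_p = +inf implied).
int bucket_index(int x, const std::vector<int> &splitters) {
    return (int)(std::upper_bound(splitters.begin(), splitters.end(), x)
                 - splitters.begin());
}
```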

Prefix Sum for Placement

After counting bucket sizes, compute offsets:

offset[i] = \sum_{j < i} count[j]

Each bucket occupies a contiguous region in the output array.

Complexity

measure          value
---------------  -------------------------------
sampling         O(p \log p)
partitioning     O(n \log p)
bucket sorting   O((n/p) \log(n/p)) expected
total work       O(n \log n)
parallel time    O((n/p) \log(n/p))

Balance depends on splitter quality.
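
Splitter quality can be measured directly: draw splitters from a sample of size p times an oversampling factor and record the largest bucket, which with ideal splitters would be about n / p. A small CPU measurement sketch (illustrative, deterministic pseudo-random sampling):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Pick splitters from a random sample of size p * oversample, assign
// every element to a bucket, and return the largest bucket size.
int max_bucket_size(const std::vector<int> &data, int p, int oversample) {
    std::mt19937 rng(12345);  // fixed seed for reproducibility
    std::vector<int> sample;
    for (int i = 0; i < p * oversample; ++i)
        sample.push_back(data[rng() % data.size()]);
    std::sort(sample.begin(), sample.end());

    std::vector<int> splitters;
    for (int i = 1; i < p; ++i)
        splitters.push_back(sample[(size_t)i * sample.size() / p]);

    std::vector<int> count(p, 0);
    for (int x : data) {
        int b = (int)(std::upper_bound(splitters.begin(), splitters.end(), x)
                      - splitters.begin());
        ++count[b];
    }
    return *std::max_element(count.begin(), count.end());
}
```

Increasing the oversampling factor tightens the gap between the largest bucket and n / p, at the cost of a larger sample sort in the first stage.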

Correctness

Splitters divide the input domain into ordered intervals. All elements in earlier buckets are less than or equal to elements in later buckets. Sorting each bucket produces locally sorted segments. Concatenating buckets in order yields a globally sorted array.

Practical Considerations

  • Oversampling improves bucket balance.
  • Use shared memory buffers for local partitioning.
  • Avoid contention when writing bucket sizes by using per-block counts.
  • Load imbalance occurs if data distribution is skewed.
  • Bucket sorting can use radix sort or bitonic sort depending on size.
  • Communication between blocks must be minimized.
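
The per-block counting idea from the list above can be sketched on the CPU: each "block" counts a contiguous slice into its own private histogram, and a reduction sums the rows afterwards, so no two blocks ever write the same counter:

```cpp
#include <vector>

// Each block handles one contiguous slice of bucket_idx and counts
// into a private histogram; the final counts are the column sums.
std::vector<int> per_block_counts(const std::vector<int> &bucket_idx,
                                  int p, int num_blocks) {
    int n = (int)bucket_idx.size();
    std::vector<std::vector<int>> local(num_blocks, std::vector<int>(p, 0));
    for (int blk = 0; blk < num_blocks; ++blk) {
        int lo = (int)((long long)blk * n / num_blocks);
        int hi = (int)((long long)(blk + 1) * n / num_blocks);
        for (int i = lo; i < hi; ++i)
            ++local[blk][bucket_idx[i]];   // no cross-block contention
    }
    std::vector<int> global(p, 0);         // reduction over block histograms
    for (auto &row : local)
        for (int b = 0; b < p; ++b)
            global[b] += row[b];
    return global;
}
```

On a real GPU the private histograms would live in shared memory and the reduction would be a second kernel or a segmented scan.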

When to Use

Use GPU sample sort when:

  • input is very large
  • multiple GPU blocks can be used effectively
  • data distribution can be approximated by sampling
  • high throughput sorting is required

Avoid it when:

  • input is small
  • distribution is highly skewed
  • simpler GPU radix sort is applicable

Implementation Sketch

__global__ void assign_buckets(
    const int *in,        // input keys, length n
    int *bucket_idx,      // output: bucket index per element
    const int *splitters, // p - 1 sorted splitters
    int p,                // number of buckets
    int n
) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int x = in[i];

    // Binary search for the first splitter >= x; elements greater than
    // every splitter fall into the last bucket (index p - 1).
    int lo = 0, hi = p - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (x <= splitters[mid]) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }

    bucket_idx[i] = lo;
}

__global__ void scatter_to_buckets(
    const int *in,        // input keys, length n
    int *out,             // output array, grouped by bucket
    const int *bucket_idx,
    int *offsets,         // exclusive prefix sums of bucket counts;
                          // consumed in place by atomicAdd
    int n
) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int b = bucket_idx[i];
    // Claim the next free slot in bucket b. Note this scatter is not
    // stable, and afterwards offsets[] holds end-of-bucket positions.
    int pos = atomicAdd(&offsets[b], 1);
    out[pos] = in[i];
}