# GPU Sample Sort

GPU sample sort partitions the input into buckets using sampled splitters, then sorts each bucket independently on the GPU. It is a distribution-based algorithm in the style of parallel sample sort, adapted to GPU execution with attention to memory coalescing and load balancing.

This method scales well for large inputs because most work after partitioning becomes independent across buckets.

## Problem

Given an array $A$ of size $n$ on a GPU, sort the elements in nondecreasing order.

## Algorithm

The algorithm proceeds in four stages: sampling, partitioning, bucket sorting, and concatenation.

```text id="v9m2kc"
gpu_sample_sort(A, p):
    select sample keys from A
    sort the samples
    choose p - 1 splitters

    parallel assign each element to a bucket using splitters

    compute bucket sizes using parallel prefix sums

    scatter elements into bucket regions

    parallel sort each bucket

    return concatenation of buckets
```
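The sampling stage above can be sketched on the host. The following C++ sketch (the function name `choose_splitters`, the oversampling factor `a`, and the fixed RNG seed are illustrative choices, not part of any fixed API) draws $a \cdot p$ random keys, sorts them, and keeps every $a$-th one as a splitter:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Pick p - 1 splitters from A by oversampling: draw a * p random keys,
// sort them, and keep every a-th element. A larger factor a gives
// better-balanced buckets at the cost of a larger sample sort.
std::vector<int> choose_splitters(const std::vector<int>& A, int p, int a) {
    std::mt19937 rng(42);  // fixed seed for reproducibility
    std::uniform_int_distribution<size_t> pick(0, A.size() - 1);

    std::vector<int> sample(static_cast<size_t>(a) * p);
    for (int& s : sample) s = A[pick(rng)];
    std::sort(sample.begin(), sample.end());

    std::vector<int> splitters(p - 1);
    for (int i = 1; i < p; ++i) splitters[i - 1] = sample[i * a];
    return splitters;
}
```

Because the sample is sorted before selection, the resulting splitters are automatically nondecreasing, as the splitter condition below requires.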

Splitters define bucket boundaries:

$$
s_1 \le s_2 \le \cdots \le s_{p-1}
$$

Each element $x$ belongs to bucket $i$, with the sentinel conventions $s_0 = -\infty$ and $s_p = +\infty$, such that:

$$
s_i \le x < s_{i+1}
$$

## Bucket Assignment

The bucket index of each element is computed by binary search over the sorted splitter array.

```text id="9skl1r"
bucket_index(x, splitters):
    return upper_bound(splitters, x)
```
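On the host, the pseudocode above maps directly onto `std::upper_bound`. A minimal C++ sketch (the helper name is illustrative):

```cpp
#include <algorithm>
#include <vector>

// Bucket index of x: the number of splitters <= x. std::upper_bound
// returns the first splitter strictly greater than x, so the distance
// from the beginning is exactly that count.
int bucket_index(int x, const std::vector<int>& splitters) {
    return static_cast<int>(
        std::upper_bound(splitters.begin(), splitters.end(), x)
        - splitters.begin());
}
```

With splitters $\{10, 20, 30\}$, an element equal to a splitter (say $x = 10$) lands in the bucket after that splitter, matching the $s_i \le x < s_{i+1}$ convention.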

On the GPU, this is often implemented as a warp-wide binary search, or as a small linear scan when the splitter count is low, for performance.

## Prefix Sum for Placement

After counting bucket sizes, compute offsets:

$$
\mathrm{offset}[i] = \sum_{j < i} \mathrm{count}[j]
$$

Each bucket occupies a contiguous region in the output array.
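A minimal host-side C++ sketch of this exclusive prefix sum, using C++17 `std::exclusive_scan` (the function name `bucket_offsets` is illustrative):

```cpp
#include <numeric>
#include <vector>

// Exclusive prefix sum over bucket counts: offset[i] is the start of
// bucket i's contiguous region in the output array.
std::vector<int> bucket_offsets(const std::vector<int>& count) {
    std::vector<int> offset(count.size());
    std::exclusive_scan(count.begin(), count.end(), offset.begin(), 0);
    return offset;
}
```

On the device, the same computation would use a parallel scan primitive rather than a sequential pass.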

## Complexity

| measure        | value                        |
| -------------- | ---------------------------- |
| sampling       | $O(p \log p)$                |
| partitioning   | $O(n \log p)$                |
| bucket sorting | $O((n/p)\log(n/p))$ expected per bucket |
| total work     | $O(n \log n)$                |
| parallel time  | $O((n/p)\log(n/p))$          |

Bucket balance, and hence the parallel-time bound, depends on splitter quality; oversampling keeps buckets close to size $n/p$ with high probability.

## Correctness

Splitters divide the input domain into ordered intervals. All elements in earlier buckets are less than or equal to elements in later buckets. Sorting each bucket produces locally sorted segments. Concatenating buckets in order yields a globally sorted array.

## Practical Considerations

* Oversampling improves bucket balance.
* Use shared memory buffers for local partitioning.
* Avoid contention when writing bucket sizes by using per-block counts.
* Load imbalance occurs if data distribution is skewed.
* Bucket sorting can use radix sort or bitonic sort depending on size.
* Communication between blocks must be minimized.

## When to Use

Use GPU sample sort when:

* input is very large
* multiple GPU blocks can be used effectively
* data distribution can be approximated by sampling
* high throughput sorting is required

Avoid it when:

* input is small
* distribution is highly skewed
* simpler GPU radix sort is applicable

## Implementation Sketch

```cuda id="q4b8je"
__global__ void assign_buckets(
    const int *in,
    int *bucket_idx,
    const int *splitters,  // p - 1 sorted splitter keys
    int p,
    int n
) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int x = in[i];

    // Binary search for the first splitter strictly greater than x
    // (upper bound); its index is the bucket of x, matching the
    // s_i <= x < s_{i+1} convention. If every splitter is <= x,
    // lo ends at p - 1, the last bucket.
    int lo = 0, hi = p - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (x < splitters[mid]) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }

    bucket_idx[i] = lo;
}
```

```cuda id="3v8ywd"
__global__ void scatter_to_buckets(
    const int *in,
    int *out,
    const int *bucket_idx,
    int *offsets,  // mutated: each slot advances past placed elements
    int n
) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Claim the next free slot in this element's bucket region.
    // The atomicAdd consumes the offsets array and does not preserve
    // input order within a bucket; this unstable scatter is fine
    // because each bucket is sorted afterwards.
    int b = bucket_idx[i];
    int pos = atomicAdd(&offsets[b], 1);
    out[pos] = in[i];
}
```
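To make the end-to-end data flow concrete, here is a sequential host-side model of the whole pipeline in C++. It is illustrative only: a real implementation would launch the kernels above instead of looping on the CPU, and the function name is an assumption.

```cpp
#include <algorithm>
#include <vector>

// Sequential model of GPU sample sort: assign buckets, count, scan,
// scatter, then sort each contiguous bucket region in place.
std::vector<int> sample_sort(const std::vector<int>& in,
                             const std::vector<int>& splitters) {
    int p = static_cast<int>(splitters.size()) + 1;
    int n = static_cast<int>(in.size());

    // Stage 1: bucket assignment (one "thread" per element).
    std::vector<int> bucket(n);
    for (int i = 0; i < n; ++i)
        bucket[i] = static_cast<int>(
            std::upper_bound(splitters.begin(), splitters.end(), in[i])
            - splitters.begin());

    // Stage 2: bucket counts and exclusive prefix sum.
    std::vector<int> count(p, 0);
    for (int b : bucket) ++count[b];
    std::vector<int> offset(p, 0);
    for (int i = 1; i < p; ++i) offset[i] = offset[i - 1] + count[i - 1];

    // Stage 3: scatter into contiguous bucket regions. The cursor copy
    // plays the role of the atomicAdd-consumed offsets array.
    std::vector<int> out(n);
    std::vector<int> cursor = offset;
    for (int i = 0; i < n; ++i) out[cursor[bucket[i]]++] = in[i];

    // Stage 4: sort each bucket independently.
    for (int b = 0; b < p; ++b) {
        int lo = offset[b];
        int hi = (b + 1 < p) ? offset[b + 1] : n;
        std::sort(out.begin() + lo, out.begin() + hi);
    }
    return out;
}
```

Concatenation is implicit here: because each bucket occupies a contiguous region and buckets are ordered by splitter, the output array is globally sorted once stage 4 completes.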

