Sort data on a GPU by selecting splitters, partitioning into buckets in parallel, then sorting buckets independently.
GPU sample sort partitions the input into buckets using sampled splitters, then sorts each bucket independently on the GPU. It is a distribution-based algorithm similar to parallel sample sort, adapted to GPU execution with attention to memory coalescing and load balancing.
This method scales well for large inputs because most work after partitioning becomes independent across buckets.
Problem
Given an array A of n elements in GPU memory, sort the elements in nondecreasing order.
Algorithm
The algorithm proceeds in four stages: sampling, partitioning, bucket sorting, and concatenation.
```
gpu_sample_sort(A, p):
    select sample keys from A
    sort the samples
    choose p - 1 splitters
    parallel: assign each element to a bucket using the splitters
    compute bucket offsets using a parallel prefix sum over bucket sizes
    scatter elements into their bucket regions
    parallel: sort each bucket
    return the concatenation of the buckets
```

Splitters s_1 <= s_2 <= ... <= s_{p-1} define the bucket boundaries. Each element x belongs to bucket i (1 <= i <= p) such that s_{i-1} < x <= s_i, with the sentinels s_0 = -infinity and s_p = +infinity.
Bucket Assignment
Bucket index is computed using binary search over splitters.
```
bucket_index(x, splitters):
    return lower_bound(splitters, x)   # first index i with x <= splitters[i]
```

On the GPU, this is often implemented using a warp-wide binary search, or small linear scans when the splitter count is small, for performance.
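As a CPU reference, the bucket index can be computed with `std::lower_bound`, which matches the kernel's `x <= splitters[mid]` convention (keys equal to a splitter go to the lower bucket); a minimal sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU reference: first index i with x <= splitters[i]; returns
// splitters.size() (the last bucket) when x exceeds every splitter.
int bucket_index(int x, const std::vector<int>& splitters) {
    return static_cast<int>(
        std::lower_bound(splitters.begin(), splitters.end(), x)
        - splitters.begin());
}
```

For example, with splitters {10, 20, 30}, the keys 5 and 10 land in bucket 0, 11 in bucket 1, and 99 in bucket 3.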
Prefix Sum for Placement
After counting bucket sizes, compute each bucket's starting offset as an exclusive prefix sum over the counts: offset[b] = count[0] + count[1] + ... + count[b-1], with offset[0] = 0.
Each bucket then occupies the contiguous region [offset[b], offset[b] + count[b]) in the output array.
Complexity
| measure | value |
|---|---|
| sampling | O(s log s) for sample size s |
| partitioning | O(n log p) |
| bucket sorting | O(n log(n/p)) expected |
| total work | O(n log n) expected |
| parallel time | O((n log n) / p) expected |
Balance depends on splitter quality.
Correctness
Splitters divide the input domain into ordered intervals. All elements in earlier buckets are less than or equal to elements in later buckets. Sorting each bucket produces locally sorted segments. Concatenating buckets in order yields a globally sorted array.
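This argument can be checked against a CPU reference that mirrors the pseudocode above (strided sampling, no oversampling, for brevity; all names are illustrative, not part of the GPU implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU reference of sample sort: sample, choose splitters, bucket, sort, concat.
std::vector<int> sample_sort_ref(std::vector<int> a, int p) {
    if (a.size() < 2 || p < 2) { std::sort(a.begin(), a.end()); return a; }
    // Sample roughly p keys with a fixed stride and sort the samples.
    std::vector<int> sample;
    for (size_t i = 0; i < a.size(); i += std::max<size_t>(1, a.size() / p))
        sample.push_back(a[i]);
    std::sort(sample.begin(), sample.end());
    // Choose p - 1 splitters evenly from the sorted sample.
    std::vector<int> splitters;
    for (int i = 1; i < p; ++i)
        splitters.push_back(sample[i * sample.size() / p]);
    // Partition: bucket b holds x with splitters[b-1] < x <= splitters[b].
    std::vector<std::vector<int>> buckets(p);
    for (int x : a) {
        int b = static_cast<int>(
            std::lower_bound(splitters.begin(), splitters.end(), x)
            - splitters.begin());
        buckets[b].push_back(x);
    }
    // Sort each bucket locally, then concatenate in bucket order.
    std::vector<int> out;
    for (auto& bkt : buckets) {
        std::sort(bkt.begin(), bkt.end());
        out.insert(out.end(), bkt.begin(), bkt.end());
    }
    return out;
}
```

The result is globally sorted regardless of splitter quality; splitter quality only affects how evenly the buckets are sized.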
Practical Considerations
- Oversampling improves bucket balance.
- Use shared memory buffers for local partitioning.
- Avoid contention when writing bucket sizes by using per-block counts.
- Load imbalance occurs if data distribution is skewed.
- Bucket sorting can use radix sort or bitonic sort depending on size.
- Communication between blocks must be minimized.
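The per-block counting idea can be sketched on the CPU by giving each "block" (modeled here as a chunk of the input) a private histogram and reducing the histograms afterwards, so no two blocks ever contend on the same counter; names are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Each chunk counts into its own histogram; the histograms are then summed
// into global bucket counts, avoiding contention on shared counters.
std::vector<int> count_per_block(const std::vector<int>& bucket_idx,
                                 int p, int chunk) {
    int n = static_cast<int>(bucket_idx.size());
    int nchunks = (n + chunk - 1) / chunk;
    std::vector<std::vector<int>> local(nchunks, std::vector<int>(p, 0));
    for (int c = 0; c < nchunks; ++c)                   // one "block" per chunk
        for (int i = c * chunk; i < std::min((c + 1) * chunk, n); ++i)
            ++local[c][bucket_idx[i]];
    // Reduce the per-chunk histograms into global bucket counts.
    std::vector<int> counts(p, 0);
    for (const auto& h : local)
        for (int b = 0; b < p; ++b) counts[b] += h[b];
    return counts;
}
```

On the GPU the per-block histograms would typically live in shared memory, with a single atomic-free write of each block's counts to global memory at the end.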
When to Use
Use GPU sample sort when:
- input is very large
- multiple GPU blocks can be used effectively
- data distribution can be approximated by sampling
- high throughput sorting is required
Avoid it when:
- input is small
- distribution is highly skewed
- simpler GPU radix sort is applicable
Implementation Sketch
```cuda
// One thread per element: binary-search the splitters to find the bucket.
__global__ void assign_buckets(
    const int *in,
    int *bucket_idx,
    const int *splitters,   // p - 1 sorted splitter keys
    int p,
    int n
) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int x = in[i];
    // lower_bound: first index lo with x <= splitters[lo]; lo ends at
    // p - 1 (the last bucket) when x exceeds every splitter.
    int lo = 0, hi = p - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (x <= splitters[mid]) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    bucket_idx[i] = lo;
}

// One thread per element: claim a slot in the bucket's region with an
// atomic counter. Note that this consumes (mutates) the offsets array and
// places elements within a bucket in arbitrary order, which is acceptable
// because each bucket is sorted afterwards.
__global__ void scatter_to_buckets(
    const int *in,
    int *out,
    const int *bucket_idx,
    int *offsets,           // exclusive prefix sums of bucket counts
    int n
) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int b = bucket_idx[i];
    int pos = atomicAdd(&offsets[b], 1);
    out[pos] = in[i];
}
```