GPU bitonic sort is a GPU implementation of bitonic sort. It uses a fixed compare and exchange pattern that fits SIMD style execution well. The algorithm performs more comparisons than quicksort or merge sort, but its memory access and control flow are regular.
This makes it useful for small to medium arrays, block local sorting, graphics workloads, and GPU kernels where predictable execution is more important than asymptotically minimal comparison count.
Problem
Given an array of length , sort it on a GPU in nondecreasing order.
For the standard network form, assume is a power of two.
Algorithm
The algorithm repeatedly compares pairs of elements whose indices differ by a fixed stride. The stride and direction are determined by the current network stage.
gpu_bitonic_sort(A):
for size = 2; size <= n; size *= 2:
for stride = size / 2; stride > 0; stride /= 2:
parallel for i from 0 to n - 1:
j = i xor stride
if j > i:
ascending = (i & size) == 0
if ascending and A[i] > A[j]:
swap A[i], A[j]
if not ascending and A[i] < A[j]:
swap A[i], A[j]The expression
j = i xor strideselects the partner index for the current compare and exchange stage.
Compare Direction
For each subsequence of length size, one half is arranged ascending and the other descending during construction. The condition
ascending = (i & size) == 0chooses the compare direction for each region.
Complexity
| measure | value |
|---|---|
| comparisons | |
| stages | |
| extra space | in place |
| control flow | data independent |
The GPU kernel count is commonly proportional to the number of stages, unless several stages are fused inside one kernel.
Correctness
Bitonic sort builds bitonic sequences and repeatedly applies bitonic merge. Each compare and exchange stage moves smaller elements toward the ascending side and larger elements toward the descending side for the current subsequence. The recursive structure ensures that each subsequence becomes sorted. After the final stage, the whole array is sorted.
Practical Considerations
- Best performance comes from sorting tiles in shared memory.
- Global memory traffic is expensive, so kernels often load a block, sort locally, then write back.
- Power of two sizes are easiest; other sizes require padding or boundary checks.
- The fixed schedule avoids branch divergence caused by input dependent partitioning.
- For very large arrays, GPU radix sort is usually faster for integer keys.
- Bitonic sort remains useful as a small sort primitive inside larger GPU algorithms.
When to Use
Use GPU bitonic sort when:
- sorting small to medium batches
- sorting inside GPU blocks
- keys are already resident on the GPU
- branch regularity matters
- a sorting network is useful
Avoid it for very large integer arrays when GPU radix sort is available.
Implementation Sketch
__global__ void bitonic_stage(int *a, int n, int stride, int size) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= n) return;
int j = i ^ stride;
if (j > i && j < n) {
bool ascending = (i & size) == 0;
int x = a[i];
int y = a[j];
if ((ascending && x > y) || (!ascending && x < y)) {
a[i] = y;
a[j] = x;
}
}
}void gpu_bitonic_sort(int *a, int n) {
for (int size = 2; size <= n; size <<= 1) {
for (int stride = size >> 1; stride > 0; stride >>= 1) {
bitonic_stage<<<grid, block>>>(a, n, stride, size);
}
}
}