Sort a small set of values inside one CUDA warp using shuffle operations and compare exchange steps.
CUDA warp sort sorts a small group of values within one warp. A warp is a fixed group of GPU threads that execute together. On NVIDIA GPUs, a warp commonly contains 32 lanes.
The algorithm is usually implemented with compare and exchange steps using warp shuffle instructions. Since all lanes execute the same instruction stream, warp sort works best with regular sorting networks such as bitonic sort or odd-even merge sort.
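To make the shuffle primitive concrete before the full network, here is a minimal sketch of a single compare and exchange step between neighboring lanes, assuming a full 32-lane warp; the helper name `exchange_step` and the fixed stride of 1 are illustrative, not part of any library.

```cuda
// One compare and exchange step at stride 1: each lane reads the value
// held by its partner (lane XOR 1), then even lanes keep the smaller
// value and odd lanes keep the larger one.
__device__ int exchange_step(int x) {
    int lane = threadIdx.x & 31;
    int y = __shfl_xor_sync(0xffffffff, x, 1);  // partner lane's value
    return (lane & 1) == 0 ? min(x, y) : max(x, y);
}
```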
Problem
Given one key per lane in a warp, sort the warp’s keys in nondecreasing order.
For the standard version, assume the warp has 32 active lanes.
Algorithm
A common implementation uses a bitonic sorting network. Each lane exchanges values with a partner lane chosen by XORing its lane index with the current stride.
```
cuda_warp_sort(x, lane):
    for size = 2; size <= warp_size; size *= 2:
        for stride = size / 2; stride > 0; stride /= 2:
            y = shuffle_xor(x, stride)
            ascending = (lane & size) == 0
            x = warp_compare_exchange(x, y, lane, stride, ascending)
    return x
```

The partner lane is `lane XOR stride`; for example, at stride 4, lane 5 exchanges with lane 1 (5 XOR 4 = 1). Each lane keeps either the smaller or larger value depending on the sorting direction and its position in the network, as defined by `warp_compare_exchange` below.
Compare and Exchange
```
warp_compare_exchange(x, y, lane, stride, ascending):
    lower_lane = (lane & stride) == 0
    if ascending:
        if lower_lane:
            return min(x, y)
        return max(x, y)
    if lower_lane:
        return max(x, y)
    return min(x, y)
```

No shared memory is required for the basic version. The values move between lanes through shuffle operations.
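The same decision fits in a small device helper; this is a sketch mirroring the pseudocode (the boolean `keep_min` folds the four cases into one comparison), not a library API.

```cuda
// Compare and exchange decision from the pseudocode above. A lane keeps
// the minimum when its side of the pair matches the sort direction.
__device__ int warp_compare_exchange(int x, int y, int lane,
                                     int stride, bool ascending) {
    bool lower_lane = (lane & stride) == 0;
    bool keep_min = (lower_lane == ascending);
    return keep_min ? min(x, y) : max(x, y);
}
```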
Complexity
| measure | value |
|---|---|
| values sorted | one warp, usually 32 |
| compare and exchange stages | log₂ w (log₂ w + 1) / 2, i.e. 15 for w = 32 |
| extra memory | O(1) per lane |
| synchronization | implicit within warp |
Here w is the warp size.
For w = 32, the number of network stages is small and fixed.
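As a worked check, the stage count comes from running log₂ w merge phases, where phase k performs k compare and exchange steps:

$$
\sum_{k=1}^{\log_2 w} k \;=\; \frac{\log_2 w \,(\log_2 w + 1)}{2} \;=\; \frac{5 \cdot 6}{2} \;=\; 15 \quad (w = 32).
$$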
Correctness
The bitonic network first forms bitonic subsequences, then merges them into sorted subsequences. At each compare and exchange step, the smaller and larger values move to the correct side of the current subsequence. After all stages complete, the whole warp is sorted.
Because the network schedule is fixed, correctness does not depend on the input distribution.
Practical Considerations
- Works well for 32 or fewer values.
- Avoids global memory and shared memory for the core sort.
- Useful as a building block inside larger GPU sorts.
- Requires all participating lanes to follow the same control flow.
- Inactive lanes need sentinel values or masks; for an ascending sort, padding with the maximum representable key works.
- For key-value pairs, shuffle both key and payload together, as in the sketch after this list.
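A minimal sketch of the key-value variant, assuming a full 32-lane warp with `int` keys and payloads; the name `warp_bitonic_sort_kv` is illustrative, not a library function.

```cuda
// Sketch: key-value warp bitonic sort (ascending). Key and payload
// travel through the same shuffles, so each pair stays together.
// Assumes all 32 lanes are active; pad unused lanes with key = INT_MAX.
__device__ void warp_bitonic_sort_kv(int &key, int &val) {
    int lane = threadIdx.x & 31;
    for (int size = 2; size <= 32; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            int other_key = __shfl_xor_sync(0xffffffff, key, stride);
            int other_val = __shfl_xor_sync(0xffffffff, val, stride);
            bool ascending = (lane & size) == 0;
            bool keep_min = ((lane & stride) == 0) == ascending;
            // Adopt the partner's pair only when it belongs on this side.
            bool keep_other = keep_min ? (other_key < key) : (other_key > key);
            if (keep_other) { key = other_key; val = other_val; }
        }
    }
}
```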
When to Use
Use CUDA warp sort when:
- sorting one small tile per warp
- building a larger GPU sorting algorithm
- local rank ordering is needed inside a kernel
- keys already live in registers
- low latency matters
Avoid it for large arrays on its own: it is a primitive, not a complete sorting strategy for full arrays.
Implementation Sketch
```cuda
// Sort one int per lane across a full 32-lane warp, ascending.
__device__ int warp_bitonic_sort(int x) {
    int lane = threadIdx.x & 31;
    for (int size = 2; size <= 32; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            // Fetch the partner lane's value (lane XOR stride).
            int y = __shfl_xor_sync(0xffffffff, x, stride);
            bool ascending = (lane & size) == 0;
            bool lower_lane = (lane & stride) == 0;
            if (ascending) {
                x = lower_lane ? min(x, y) : max(x, y);
            } else {
                x = lower_lane ? max(x, y) : min(x, y);
            }
        }
    }
    return x;
}

// Each block sorts one 32-element tile: load, sort in registers, store.
__global__ void sort_one_warp_per_block(const int *in, int *out) {
    int lane = threadIdx.x & 31;
    int block = blockIdx.x;
    int x = in[block * 32 + lane];
    x = warp_bitonic_sort(x);
    out[block * 32 + lane] = x;
}
```
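For completeness, a minimal host-side harness, assuming the kernel above lives in the same `.cu` file and the input length is a multiple of 32; all names and test values here are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int num_warps = 4, n = num_warps * 32;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (i * 37) % 101;  // arbitrary test keys

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // One 32-thread block per tile; each block sorts its own tile.
    sort_one_warp_per_block<<<num_warps, 32>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Each 32-element tile should now be in nondecreasing order.
    for (int w = 0; w < num_warps; ++w)
        for (int i = 1; i < 32; ++i)
            if (h_out[w * 32 + i - 1] > h_out[w * 32 + i])
                printf("tile %d unsorted\n", w);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```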