CUDA Warp Sort

Sort a small set of values inside one CUDA warp using shuffle operations and compare exchange steps.

CUDA warp sort sorts a small group of values within one warp. A warp is a fixed group of GPU threads that execute together. On NVIDIA GPUs, a warp commonly contains 32 lanes.

The algorithm is usually implemented with compare and exchange steps using warp shuffle instructions. Since all lanes execute the same instruction stream, warp sort works best with regular sorting networks such as bitonic sort or odd-even merge sort.

Problem

Given one key per lane in a warp, sort the warp’s keys in nondecreasing order.

For the standard version, assume the warp has 32 active lanes.

Algorithm

A common implementation uses a bitonic sorting network. Each lane exchanges values with another lane chosen by XOR.

cuda_warp_sort(x, lane):
    for size = 2; size <= warp_size; size *= 2:
        for stride = size / 2; stride > 0; stride /= 2:
            y = shuffle_xor(x, stride)

            ascending = (lane & size) == 0

            x = warp_compare_exchange(x, y, lane, stride, ascending)

    return x

The partner lane is:

partner = lane ⊕ stride

Each lane keeps either the smaller or larger value depending on the sorting direction and its position in the network.
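The network can be checked without a GPU by simulating the 32 lanes on the host: an array stands in for the per-lane registers, and indexing with `lane ^ stride` plays the role of the XOR shuffle. This is a sketch, and the helper name `simulate_warp_bitonic_sort` is illustrative, not part of any CUDA API:

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Host-side simulation of the warp bitonic network. lanes[] stands in for
// the 32 per-lane registers; lanes[lane ^ stride] plays the role of
// shuffle_xor(x, stride).
std::array<int, 32> simulate_warp_bitonic_sort(std::array<int, 32> lanes) {
    for (int size = 2; size <= 32; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            std::array<int, 32> next = lanes;   // all lanes exchange in lockstep
            for (int lane = 0; lane < 32; ++lane) {
                int y = lanes[lane ^ stride];        // partner lane's value
                bool ascending = (lane & size) == 0; // direction of this block
                bool lower_lane = (lane & stride) == 0;
                if (ascending)
                    next[lane] = lower_lane ? std::min(lanes[lane], y)
                                            : std::max(lanes[lane], y);
                else
                    next[lane] = lower_lane ? std::max(lanes[lane], y)
                                            : std::min(lanes[lane], y);
            }
            lanes = next;
        }
    }
    return lanes;
}
```

Running this on any permutation of 0..31 returns the values in nondecreasing order, which matches the device-side behavior stage for stage.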

Compare and Exchange

warp_compare_exchange(x, y, lane, stride, ascending):
    lower_lane = (lane & stride) == 0

    if ascending:
        if lower_lane:
            return min(x, y)
        return max(x, y)

    if lower_lane:
        return max(x, y)
    return min(x, y)

No shared memory is required for the basic version. The values move between lanes through shuffle operations.

Complexity

measure                   value
values sorted             one warp, usually 32
compare exchange stages   O(log² W)
extra memory              O(1) per lane
synchronization           implicit within warp

Here W is the warp size.

For W = 32, the number of network stages is small and fixed.
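The stage count follows directly from the loop bounds in the Algorithm section: the outer loop runs log₂ W times, and its k-th pass performs k compare-exchange steps, giving log₂ W · (log₂ W + 1) / 2 stages in total. A small host-side counter (the function name is illustrative) mirrors those loops:

```cpp
#include <cassert>

// Count the compare-exchange stages executed by the nested network loops
// for a power-of-two warp size w. Each stage is one shuffle plus one
// min/max per lane.
int bitonic_stage_count(int w) {
    int stages = 0;
    for (int size = 2; size <= w; size <<= 1)
        for (int stride = size >> 1; stride > 0; stride >>= 1)
            ++stages;
    return stages;
}
```

For W = 32 this yields 1 + 2 + 3 + 4 + 5 = 15 stages, i.e. 15 shuffles per lane, consistent with the O(log² W) bound in the table.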

Correctness

The bitonic network first forms bitonic subsequences, then merges them into sorted subsequences. At each compare and exchange step, the smaller and larger values move to the correct side of the current subsequence. After all stages complete, the whole warp is sorted.

Because the network schedule is fixed, correctness does not depend on the input distribution.
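That input independence can be verified exhaustively with the zero-one principle: a comparison network sorts every input if and only if it sorts every sequence of 0s and 1s. A width-8 instance of the same network has only 2⁸ = 256 such inputs, so a host-side check (sketch; the function name is illustrative) covers them all:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Zero-one principle check for a width-w instance of the bitonic network
// (w must be a power of two). Returns true iff every 0/1 input is sorted.
bool network_sorts_all_01_inputs(int w) {
    for (unsigned bits = 0; bits < (1u << w); ++bits) {
        std::vector<int> lanes(w);
        for (int i = 0; i < w; ++i) lanes[i] = (bits >> i) & 1;
        for (int size = 2; size <= w; size <<= 1) {
            for (int stride = size >> 1; stride > 0; stride >>= 1) {
                std::vector<int> next = lanes;
                for (int lane = 0; lane < w; ++lane) {
                    int y = lanes[lane ^ stride];
                    bool ascending = (lane & size) == 0;
                    bool lower = (lane & stride) == 0;
                    // Keep the minimum exactly when direction and position agree.
                    next[lane] = (ascending == lower)
                                     ? std::min(lanes[lane], y)
                                     : std::max(lanes[lane], y);
                }
                lanes = next;
            }
        }
        if (!std::is_sorted(lanes.begin(), lanes.end())) return false;
    }
    return true;
}
```

The same loop schedule with w = 32 is the warp network above, so passing this check for small power-of-two widths is strong evidence the construction is right.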

Practical Considerations

  • Works well for 32 or fewer values.
  • Avoids global memory and shared memory for the core sort.
  • Useful as a building block inside larger GPU sorts.
  • Requires all participating lanes to follow the same control flow.
  • Inactive lanes need sentinel values or masks.
  • For key value pairs, shuffle both key and payload together.
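The key-value case from the last bullet can be simulated on the host as well: whenever a lane adopts its partner's key, it adopts the payload with it (on the device this means issuing one shuffle for the key and one for the payload at every step). A sketch, with the illustrative helper name `simulate_warp_sort_pairs`:

```cpp
#include <array>
#include <cassert>

// Key-value variant of the warp sort, simulated on the host. keys[i] and
// vals[i] stand in for one lane's two registers; every exchange moves the
// payload together with its key, as two shuffles would on the device.
void simulate_warp_sort_pairs(std::array<int, 32>& keys,
                              std::array<int, 32>& vals) {
    for (int size = 2; size <= 32; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            auto k = keys, v = vals;   // snapshot before the exchange
            for (int lane = 0; lane < 32; ++lane) {
                int ky = k[lane ^ stride];   // partner's key
                int vy = v[lane ^ stride];   // partner's payload
                bool ascending = (lane & size) == 0;
                bool lower = (lane & stride) == 0;
                bool keep_min = (ascending == lower);
                // Compare keys only; the payload rides along.
                bool take_partner = keep_min ? (ky < k[lane]) : (ky > k[lane]);
                if (take_partner) { keys[lane] = ky; vals[lane] = vy; }
            }
        }
    }
}
```

On ties the lane keeps its own pair, so equal keys never lose their payloads.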

When to Use

Use CUDA warp sort when:

  • sorting one small tile per warp
  • building a larger GPU sorting algorithm
  • local rank ordering is needed inside a kernel
  • keys already live in registers
  • low latency matters

Avoid it for large arrays by itself. It is a primitive, not a complete sorting strategy for full arrays.

Implementation Sketch

__device__ int warp_bitonic_sort(int x) {
    int lane = threadIdx.x & 31;  // lane index within the warp

    for (int size = 2; size <= 32; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            // Fetch the partner lane's value (partner = lane ^ stride).
            int y = __shfl_xor_sync(0xffffffff, x, stride);

            bool ascending = (lane & size) == 0;    // direction of this block
            bool lower_lane = (lane & stride) == 0; // lower half of the pair

            if (ascending) {
                x = lower_lane ? min(x, y) : max(x, y);
            } else {
                x = lower_lane ? max(x, y) : min(x, y);
            }
        }
    }

    return x;
}

__global__ void sort_one_warp_per_block(const int *in, int *out) {
    int lane = threadIdx.x & 31;  // assumes blockDim.x == 32
    int block = blockIdx.x;

    // Each block sorts one 32-element tile entirely in registers.
    int x = in[block * 32 + lane];
    x = warp_bitonic_sort(x);

    out[block * 32 + lane] = x;
}