CUDA Warp Sort

Sort a small set of values inside one CUDA warp using shuffle operations and compare exchange steps.

CUDA warp sort sorts a small group of values within one warp. A warp is a fixed group of GPU threads that execute together. On NVIDIA GPUs, a warp commonly contains 32 lanes.

The algorithm is usually implemented with compare and exchange steps using warp shuffle instructions. Since all lanes execute the same instruction stream, warp sort works best with regular sorting networks such as bitonic sort or odd-even merge sort.

Problem

Given one key per lane in a warp, sort the warp’s keys in nondecreasing order.

For the standard version, assume the warp has 32 active lanes.

Algorithm

A common implementation uses a bitonic sorting network. Each lane exchanges values with another lane chosen by XOR.

cuda_warp_sort(x, lane):
    for size = 2; size <= warp_size; size *= 2:
        for stride = size / 2; stride > 0; stride /= 2:
            y = shuffle_xor(x, stride)

            ascending = (lane & size) == 0

            x = warp_compare_exchange(x, y, lane, stride, ascending)

    return x

The partner lane is:

partner = lane ⊕ stride

Each lane keeps either the smaller or larger value depending on the sorting direction and its position in the network.
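The network can be checked without a GPU by simulating the 32 lanes on the host: an array stands in for the per-lane registers, and indexing with `lane ^ stride` plays the role of the XOR shuffle. This is a sketch, and the helper name `simulate_warp_bitonic_sort` is illustrative, not part of any CUDA API:

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Host-side simulation of the warp bitonic network. lanes[] stands in for
// the 32 per-lane registers; lanes[lane ^ stride] plays the role of
// shuffle_xor(x, stride).
std::array<int, 32> simulate_warp_bitonic_sort(std::array<int, 32> lanes) {
    for (int size = 2; size <= 32; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            std::array<int, 32> next = lanes;   // all lanes exchange in lockstep
            for (int lane = 0; lane < 32; ++lane) {
                int y = lanes[lane ^ stride];        // partner lane's value
                bool ascending = (lane & size) == 0; // direction of this block
                bool lower_lane = (lane & stride) == 0;
                if (ascending)
                    next[lane] = lower_lane ? std::min(lanes[lane], y)
                                            : std::max(lanes[lane], y);
                else
                    next[lane] = lower_lane ? std::max(lanes[lane], y)
                                            : std::min(lanes[lane], y);
            }
            lanes = next;
        }
    }
    return lanes;
}
```

Running this on any permutation of 0..31 returns the values in nondecreasing order, which matches the device-side behavior stage for stage.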

Compare and Exchange

warp_compare_exchange(x, y, lane, stride, ascending):
    lower_lane = (lane & stride) == 0

    if ascending:
        if lower_lane:
            return min(x, y)
        return max(x, y)

    if lower_lane:
        return max(x, y)
    return min(x, y)

No shared memory is required for the basic version. The values move between lanes through shuffle operations.

Complexity

measure                   value
values sorted             one warp, usually 32
compare exchange stages   O(log² W)
extra memory              O(1) per lane
synchronization           implicit within warp

Here W is the warp size.

For W = 32, the number of network stages is small and fixed.
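The stage count follows directly from the loop bounds in the Algorithm section: the outer loop runs log₂ W times, and its k-th pass performs k compare-exchange steps, giving log₂ W · (log₂ W + 1) / 2 stages in total. A small host-side counter (the function name is illustrative) mirrors those loops:

```cpp
#include <cassert>

// Count the compare-exchange stages executed by the nested network loops
// for a power-of-two warp size w. Each stage is one shuffle plus one
// min/max per lane.
int bitonic_stage_count(int w) {
    int stages = 0;
    for (int size = 2; size <= w; size <<= 1)
        for (int stride = size >> 1; stride > 0; stride >>= 1)
            ++stages;
    return stages;
}
```

For W = 32 this yields 1 + 2 + 3 + 4 + 5 = 15 stages, i.e. 15 shuffles per lane, consistent with the O(log² W) bound in the table.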

Correctness

The bitonic network first forms bitonic subsequences, then merges them into sorted subsequences. At each compare and exchange step, the smaller and larger values move to the correct side of the current subsequence. After all stages complete, the whole warp is sorted.

Because the network schedule is fixed, correctness does not depend on the input distribution.
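That input independence can be verified exhaustively with the zero-one principle: a comparison network sorts every input if and only if it sorts every sequence of 0s and 1s. A width-8 instance of the same network has only 2⁸ = 256 such inputs, so a host-side check (sketch; the function name is illustrative) covers them all:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Zero-one principle check for a width-w instance of the bitonic network
// (w must be a power of two). Returns true iff every 0/1 input is sorted.
bool network_sorts_all_01_inputs(int w) {
    for (unsigned bits = 0; bits < (1u << w); ++bits) {
        std::vector<int> lanes(w);
        for (int i = 0; i < w; ++i) lanes[i] = (bits >> i) & 1;
        for (int size = 2; size <= w; size <<= 1) {
            for (int stride = size >> 1; stride > 0; stride >>= 1) {
                std::vector<int> next = lanes;
                for (int lane = 0; lane < w; ++lane) {
                    int y = lanes[lane ^ stride];
                    bool ascending = (lane & size) == 0;
                    bool lower = (lane & stride) == 0;
                    // Keep the minimum exactly when direction and position agree.
                    next[lane] = (ascending == lower)
                                     ? std::min(lanes[lane], y)
                                     : std::max(lanes[lane], y);
                }
                lanes = next;
            }
        }
        if (!std::is_sorted(lanes.begin(), lanes.end())) return false;
    }
    return true;
}
```

The same loop schedule with w = 32 is the warp network above, so passing this check for small power-of-two widths is strong evidence the construction is right.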

Practical Considerations

  • Works well for 32 or fewer values.
  • Avoids global memory and shared memory for the core sort.
  • Useful as a building block inside larger GPU sorts.
  • Requires all participating lanes to follow the same control flow.
  • Inactive lanes need sentinel values or masks.
  • For key value pairs, shuffle both key and payload together.
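The key-value case from the last bullet can be simulated on the host as well: whenever a lane adopts its partner's key, it adopts the payload with it (on the device this means issuing one shuffle for the key and one for the payload at every step). A sketch, with the illustrative helper name `simulate_warp_sort_pairs`:

```cpp
#include <array>
#include <cassert>

// Key-value variant of the warp sort, simulated on the host. keys[i] and
// vals[i] stand in for one lane's two registers; every exchange moves the
// payload together with its key, as two shuffles would on the device.
void simulate_warp_sort_pairs(std::array<int, 32>& keys,
                              std::array<int, 32>& vals) {
    for (int size = 2; size <= 32; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            auto k = keys, v = vals;   // snapshot before the exchange
            for (int lane = 0; lane < 32; ++lane) {
                int ky = k[lane ^ stride];   // partner's key
                int vy = v[lane ^ stride];   // partner's payload
                bool ascending = (lane & size) == 0;
                bool lower = (lane & stride) == 0;
                bool keep_min = (ascending == lower);
                // Compare keys only; the payload rides along.
                bool take_partner = keep_min ? (ky < k[lane]) : (ky > k[lane]);
                if (take_partner) { keys[lane] = ky; vals[lane] = vy; }
            }
        }
    }
}
```

On ties the lane keeps its own pair, so equal keys never lose their payloads.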

When to Use

Use CUDA warp sort when:

  • sorting one small tile per warp
  • building a larger GPU sorting algorithm
  • local rank ordering is needed inside a kernel
  • keys already live in registers
  • low latency matters

Avoid it for large arrays by itself. It is a primitive, not a complete sorting strategy for full arrays.

Implementation Sketch

__device__ int warp_bitonic_sort(int x) {
    int lane = threadIdx.x & 31;  // lane index within the warp

    for (int size = 2; size <= 32; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            // Fetch the partner lane's value (partner = lane ^ stride).
            int y = __shfl_xor_sync(0xffffffff, x, stride);

            bool ascending = (lane & size) == 0;    // direction of this block
            bool lower_lane = (lane & stride) == 0; // lower half of the pair

            if (ascending) {
                x = lower_lane ? min(x, y) : max(x, y);
            } else {
                x = lower_lane ? max(x, y) : min(x, y);
            }
        }
    }

    return x;
}

__global__ void sort_one_warp_per_block(const int *in, int *out) {
    int lane = threadIdx.x & 31;  // assumes blockDim.x == 32
    int block = blockIdx.x;

    // Each block sorts one 32-element tile entirely in registers.
    int x = in[block * 32 + lane];
    x = warp_bitonic_sort(x);

    out[block * 32 + lane] = x;
}