# CUDA Warp Sort

CUDA warp sort sorts a small group of values within one warp. A warp is a fixed group of GPU threads that execute together. On NVIDIA GPUs, a warp commonly contains 32 lanes.

The algorithm is usually implemented as a sequence of compare-and-exchange steps using warp shuffle instructions. Since all lanes execute the same instruction stream, warp sort works best with fixed, data-oblivious sorting networks such as bitonic sort or odd-even merge sort.

## Problem

Given one key per lane in a warp, sort the warp's keys in nondecreasing order.

For the standard version, assume the warp has 32 active lanes.

## Algorithm

A common implementation uses a bitonic sorting network. In each step, every lane exchanges values with a partner lane chosen by XOR-ing its own lane index with the current stride.

```text id="cckg8v"
cuda_warp_sort(x, lane):
    for size = 2; size <= warp_size; size *= 2:
        for stride = size / 2; stride > 0; stride /= 2:
            y = shuffle_xor(x, stride)

            ascending = (lane & size) == 0

            x = warp_compare_exchange(x, y, lane, stride, ascending)

    return x
```

The partner lane is:

$$
\text{partner} = \text{lane} \oplus \text{stride}
$$

Each lane keeps either the smaller or larger value depending on the sorting direction and its position in the network.
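The data movement is easiest to see by simulating the network on the CPU. A minimal C++ sketch, assuming a 32-element array where the array index plays the role of the lane and an array read stands in for the XOR shuffle:

```cpp
#include <algorithm>
#include <array>

// CPU simulation of the warp bitonic network. data[lane] plays the role
// of each lane's register; reading data[lane ^ stride] stands in for
// shuffle_xor. All lanes update in lockstep via a snapshot copy.
std::array<int, 32> simulate_warp_bitonic(std::array<int, 32> data) {
    const int warp_size = 32;
    for (int size = 2; size <= warp_size; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            std::array<int, 32> next = data;
            for (int lane = 0; lane < warp_size; ++lane) {
                int y = data[lane ^ stride];          // partner's value
                bool ascending = (lane & size) == 0;  // direction of this run
                bool lower = (lane & stride) == 0;    // lower lane of the pair
                if (ascending)
                    next[lane] = lower ? std::min(data[lane], y)
                                       : std::max(data[lane], y);
                else
                    next[lane] = lower ? std::max(data[lane], y)
                                       : std::min(data[lane], y);
            }
            data = next;
        }
    }
    return data;
}
```

The snapshot copy (`next`) mirrors the fact that on the GPU every lane reads its partner's value before any lane overwrites its own register.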

## Compare and Exchange

```text id="vh47vq"
warp_compare_exchange(x, y, lane, stride, ascending):
    lower_lane = (lane & stride) == 0

    if ascending:
        if lower_lane:
            return min(x, y)
        return max(x, y)

    if lower_lane:
        return max(x, y)
    return min(x, y)
```

No shared memory is required for the basic version. The values move between lanes through shuffle operations.
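The keep/discard rule above can be exercised on the host as a plain function; a minimal C++ sketch of the same decision logic:

```cpp
#include <algorithm>

// Host-side sketch of the compare-exchange rule: given its own value x
// and the partner's value y, a lane keeps the min or the max depending
// on the sort direction and whether it is the lower lane of the pair.
int warp_compare_exchange(int x, int y, int lane, int stride, bool ascending) {
    bool lower_lane = (lane & stride) == 0;
    if (ascending)
        return lower_lane ? std::min(x, y) : std::max(x, y);
    return lower_lane ? std::max(x, y) : std::min(x, y);
}
```

Note that both lanes of a pair call this with swapped `x` and `y`, so between them they keep both values; nothing is lost, values only change lanes.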

## Complexity

| measure                 | value                |
| ----------------------- | -------------------- |
| values sorted           | one warp, usually 32 |
| compare-exchange stages | $\log_2 W \,(\log_2 W + 1)/2 = O(\log^2 W)$ |
| extra memory            | $O(1)$ per lane      |
| synchronization         | implicit within warp |

Here $W$ is the warp size.

For $W = 32$, the network has exactly $5 \cdot 6 / 2 = 15$ compare-exchange stages, fixed regardless of the input.
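The stage count follows directly from the loop structure: the outer loop runs $\log_2 W$ times and the inner loop runs $\log_2 \text{size}$ times per pass. A small C++ check, counting iterations of the same loop nest:

```cpp
// Count compare-exchange stages of the bitonic network for a given warp
// size: the outer loop contributes log2(W) passes and pass k performs k
// stages, for log2(W) * (log2(W) + 1) / 2 stages in total.
int bitonic_stage_count(int warp_size) {
    int stages = 0;
    for (int size = 2; size <= warp_size; size <<= 1)
        for (int stride = size >> 1; stride > 0; stride >>= 1)
            ++stages;
    return stages;
}
```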

## Correctness

The bitonic network first forms bitonic subsequences, then merges them into sorted subsequences. At each compare and exchange step, the smaller and larger values move to the correct side of the current subsequence. After all stages complete, the whole warp is sorted.

Because the network schedule is fixed, correctness does not depend on the input distribution.
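Because the schedule is data-oblivious, the zero-one principle applies: a network that sorts every 0/1 input sorts all inputs. This makes exhaustive verification cheap on a scaled-down network; a C++ sketch checking a hypothetical 8-lane version of the same schedule:

```cpp
#include <algorithm>
#include <vector>

// Zero-one-principle check: run the bitonic schedule on every possible
// 0/1 input of the given (power-of-two) width and verify the output is
// sorted. If all 0/1 inputs sort correctly, all inputs do.
bool network_sorts_all_01_inputs(int warp_size) {
    for (unsigned mask = 0; mask < (1u << warp_size); ++mask) {
        std::vector<int> data(warp_size);
        for (int lane = 0; lane < warp_size; ++lane)
            data[lane] = (mask >> lane) & 1;
        // Same bitonic schedule as the warp version.
        for (int size = 2; size <= warp_size; size <<= 1) {
            for (int stride = size >> 1; stride > 0; stride >>= 1) {
                std::vector<int> next = data;
                for (int lane = 0; lane < warp_size; ++lane) {
                    int y = data[lane ^ stride];
                    bool asc = (lane & size) == 0;
                    bool low = (lane & stride) == 0;
                    // asc == low means this lane keeps the smaller value.
                    next[lane] = (asc == low) ? std::min(data[lane], y)
                                              : std::max(data[lane], y);
                }
                data = next;
            }
        }
        if (!std::is_sorted(data.begin(), data.end()))
            return false;
    }
    return true;
}
```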

## Practical Considerations

* Works well for 32 or fewer values.
* Avoids global memory and shared memory for the core sort.
* Useful as a building block inside larger GPU sorts.
* Requires all participating lanes to follow the same control flow.
* Inactive lanes need sentinel values or masks.
* For key value pairs, shuffle both key and payload together.
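For the key-value case, the payload must travel with its key through every exchange. A CPU sketch of this (again with array reads standing in for shuffles, and a hypothetical `(key, payload)` pair per lane):

```cpp
#include <algorithm>
#include <array>
#include <utility>

// Key-value warp sort sketch: when a lane takes the partner's key, it
// must take the partner's payload as well, so the pair stays together.
// Exchanges use strict comparisons so ties leave both lanes unchanged.
std::array<std::pair<int, int>, 32>
simulate_warp_bitonic_pairs(std::array<std::pair<int, int>, 32> data) {
    const int warp_size = 32;
    for (int size = 2; size <= warp_size; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            auto next = data;
            for (int lane = 0; lane < warp_size; ++lane) {
                auto other = data[lane ^ stride];  // key AND payload shuffled
                bool asc = (lane & size) == 0;
                bool low = (lane & stride) == 0;
                bool keep_smaller = (asc == low);
                bool take_other = keep_smaller
                    ? (other.first < data[lane].first)
                    : (other.first > data[lane].first);
                if (take_other)
                    next[lane] = other;  // whole pair moves together
            }
            data = next;
        }
    }
    return data;
}
```

On a real GPU this costs one extra `__shfl_xor_sync` per step for the payload; wider payloads are usually carried as an index into memory rather than shuffled directly.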

## When to Use

Use CUDA warp sort when:

* sorting one small tile per warp
* building a larger GPU sorting algorithm
* local rank ordering is needed inside a kernel
* keys already live in registers
* low latency matters

Avoid it for large arrays by itself. It is a primitive, not a complete sorting strategy for full arrays.

## Implementation Sketch

```cuda id="v749sx"
__device__ int warp_bitonic_sort(int x) {
    int lane = threadIdx.x & 31;

    // Each `size` pass merges bitonic runs of that length into sorted
    // runs; the final pass (size == 32) sorts the whole warp ascending.
    for (int size = 2; size <= 32; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            // Read the partner's value from lane (lane XOR stride).
            int y = __shfl_xor_sync(0xffffffff, x, stride);

            bool ascending = (lane & size) == 0;    // direction of this run
            bool lower_lane = (lane & stride) == 0; // lower lane of the pair

            if (ascending) {
                x = lower_lane ? min(x, y) : max(x, y);
            } else {
                x = lower_lane ? max(x, y) : min(x, y);
            }
        }
    }

    return x;
}
```

```cuda id="eqhebs"
__global__ void sort_one_warp_per_block(const int *in, int *out) {
    int lane = threadIdx.x & 31;
    int block = blockIdx.x;

    // Launched with 32 threads per block: each block sorts one
    // 32-element tile entirely in registers.
    int x = in[block * 32 + lane];
    x = warp_bitonic_sort(x);

    out[block * 32 + lane] = x;
}
```

