GPU Bitonic Sort

GPU bitonic sort is a GPU implementation of bitonic sort. It uses a fixed compare and exchange pattern that fits SIMD style execution well. The algorithm performs more comparisons than quicksort or merge sort, but its memory access and control flow are regular.

This makes it useful for small to medium arrays, block local sorting, graphics workloads, and GPU kernels where predictable execution is more important than asymptotically minimal comparison count.

Problem

Given an array $A$ of length $n$, sort it on a GPU in nondecreasing order.

For the standard network form, assume $n$ is a power of two.

Algorithm

The algorithm repeatedly compares pairs of elements whose indices differ by a fixed stride. The stride and direction are determined by the current network stage.

gpu_bitonic_sort(A):
    for size = 2; size <= n; size *= 2:
        for stride = size / 2; stride > 0; stride /= 2:
            parallel for i from 0 to n - 1:
                j = i xor stride

                if j > i:
                    ascending = (i & size) == 0

                    if ascending and A[i] > A[j]:
                        swap A[i], A[j]

                    if not ascending and A[i] < A[j]:
                        swap A[i], A[j]

The expression

j = i xor stride

selects the partner index for the current compare and exchange stage.

Compare Direction

For each subsequence of length size, one half is arranged ascending and the other descending during construction. The condition

ascending = (i & size) == 0

chooses the compare direction for each region.

Complexity

measure	value
comparisons	$O(n \log^2 n)$
stages	$O(\log^2 n)$
extra space	$O(1)$ in place
control flow	data independent

The GPU kernel count is commonly proportional to the number of stages, unless several stages are fused inside one kernel.

Correctness

Bitonic sort builds bitonic sequences and repeatedly applies bitonic merge. Each compare and exchange stage moves smaller elements toward the ascending side and larger elements toward the descending side for the current subsequence. The recursive structure ensures that each subsequence becomes sorted. After the final stage, the whole array is sorted.

Practical Considerations

Best performance comes from sorting tiles in shared memory.
Global memory traffic is expensive, so kernels often load a block, sort locally, then write back.
Power of two sizes are easiest; other sizes require padding or boundary checks.
The fixed schedule avoids branch divergence caused by input dependent partitioning.
For very large arrays, GPU radix sort is usually faster for integer keys.
Bitonic sort remains useful as a small sort primitive inside larger GPU algorithms.

When to Use

Use GPU bitonic sort when:

sorting small to medium batches
sorting inside GPU blocks
keys are already resident on the GPU
branch regularity matters
a sorting network is useful

Avoid it for very large integer arrays when GPU radix sort is available.

Implementation Sketch

__global__ void bitonic_stage(int *a, int n, int stride, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int j = i ^ stride;

    if (j > i && j < n) {
        bool ascending = (i & size) == 0;

        int x = a[i];
        int y = a[j];

        if ((ascending && x > y) || (!ascending && x < y)) {
            a[i] = y;
            a[j] = x;
        }
    }
}

void gpu_bitonic_sort(int *a, int n) {
    for (int size = 2; size <= n; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            bitonic_stage<<<grid, block>>>(a, n, stride, size);
        }
    }
}