CUDA block sort sorts a tile of data using all threads in a block. It extends warp-level sorting to larger sizes by using shared memory and synchronization within a thread block.
Each block loads a chunk of the array into shared memory, sorts it cooperatively, then writes it back. This is often used as a building block for full GPU sorting algorithms such as merge sort or radix sort.
Problem
Given a tile of elements assigned to one CUDA block, sort the tile in nondecreasing order.
Algorithm
A typical implementation uses a sorting network such as bitonic sort over shared memory.
cuda_block_sort(A_block):
    load A_block into shared memory S
    for size = 2; size <= m; size *= 2:
        for stride = size / 2; stride > 0; stride /= 2:
            synchronize threads
            for each thread i in parallel:
                j = i xor stride
                if j > i:
                    ascending = (i & size) == 0
                    compare_exchange(S[i], S[j], ascending)
    synchronize threads
    write S back to global memory

All threads cooperate on the same shared memory array.
Shared Memory Layout
The array is stored in shared memory:
shared int S[m]

Each thread typically handles one or more elements.
Compare and Exchange
compare_exchange(x, y, ascending):
if ascending and x > y:
swap x, y
if not ascending and x < y:
swap x, y

Because shared memory is visible to all threads in the block, synchronization is required between stages.
Complexity
| measure | value |
|---|---|
| tile size | m elements per block |
| comparisons | O(m log² m) |
| parallel depth | O(log² m) stages |
| memory | O(m) shared memory |
The tile size is limited by shared memory capacity and block size.
Correctness
The sorting network ensures that after each stage, elements move toward their correct positions. The bitonic construction and merge guarantee that the final arrangement is sorted.
Since all threads operate on the same shared array with synchronization at each stage, no race conditions occur.
Practical Considerations
- Shared memory bandwidth is much higher than global memory.
- Synchronization via `__syncthreads()` is required between stages.
- Bank conflicts should be minimized by proper memory layout.
- Tile size is constrained by shared memory limits.
- Often combined with global merge or radix passes for full array sorting.
- Each thread may handle multiple elements for better occupancy.
When to Use
Use CUDA block sort when:
- sorting moderate-sized chunks on the GPU
- building a hierarchical GPU sorting algorithm
- minimizing global memory traffic
- high-speed shared memory access is beneficial
Avoid using it alone for very large arrays. It is designed as a local sorting primitive.
Implementation Sketch
#include <limits.h>  // INT_MAX

__global__ void block_bitonic_sort(int *data, int n) {
    extern __shared__ int S[];
    int tid = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    // Load the tile; pad out-of-range slots with INT_MAX so the network
    // never mixes uninitialized shared memory into valid positions.
    S[tid] = (base + tid < n) ? data[base + tid] : INT_MAX;
    __syncthreads();

    for (int size = 2; size <= blockDim.x; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            int j = tid ^ stride;
            if (j > tid) {
                bool ascending = (tid & size) == 0;
                int x = S[tid];
                int y = S[j];
                if ((ascending && x > y) || (!ascending && x < y)) {
                    S[tid] = y;
                    S[j] = x;
                }
            }
            __syncthreads();  // all pairs finish before the next stage
        }
    }

    if (base + tid < n) {
        data[base + tid] = S[tid];
    }
}

void launch_block_sort(int *d_data, int n) {
    int blockSize = 256;  // must be a power of two for the bitonic network
    int gridSize = (n + blockSize - 1) / blockSize;
    block_bitonic_sort<<<gridSize, blockSize, blockSize * sizeof(int)>>>(d_data, n);
}