# CUDA Block Sort

CUDA block sort sorts a tile of data using all threads in a block. It extends warp-level sorting to larger sizes by using shared memory and synchronization within a thread block.

Each block loads a chunk of the array into shared memory, sorts it cooperatively, then writes it back. This is often used as a building block for full GPU sorting algorithms such as merge sort or radix sort.

## Problem

Given a tile of $m$ elements assigned to one CUDA block, sort the tile in nondecreasing order.

## Algorithm

A typical implementation uses a sorting network such as bitonic sort over shared memory.

```text id="2f7t9c"
cuda_block_sort(A_block):
    load A_block into shared memory S

    for size = 2; size <= m; size *= 2:
        for stride = size / 2; stride > 0; stride /= 2:
            synchronize threads

            for each thread i in parallel:
                j = i xor stride

                if j > i:
                    ascending = (i & size) == 0

                    compare_exchange(S[i], S[j], ascending)

    synchronize threads
    write S back to global memory
```

All threads cooperate on the same shared memory array.
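
For example, with $m = 8$ the outer loop takes size $= 2, 4, 8$ and the inner loop runs strides $1$; then $2, 1$; then $4, 2, 1$, giving six compare-exchange stages separated by barriers.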

## Shared Memory Layout

The tile is stored in a shared memory array:

```text id="3sbn1u"
shared int S[m]
```

Each thread typically handles one or more elements.
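
A minimal sketch of the two usual ways to reserve the tile, assuming a fixed block size of 256 for the static form; the kernel sketch later on this page uses the dynamic form:

```cuda
// Static allocation: the tile size is fixed at compile time.
__global__ void sort_static_tile(int *data) {
    __shared__ int S[256];
    S[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    // ... sorting stages ...
}

// Dynamic allocation: the tile size is supplied at launch via the third
// <<<grid, block, sharedBytes>>> argument.
__global__ void sort_dynamic_tile(int *data) {
    extern __shared__ int S[];
    S[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    // ... sorting stages ...
}
```

The dynamic form keeps the kernel independent of the block size, at the cost of passing `blockDim.x * sizeof(int)` at every launch.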

## Compare and Exchange

```text id="2bpf54"
compare_exchange(x, y, ascending):
    if ascending and x > y:
        swap x, y

    if not ascending and x < y:
        swap x, y
```

Because shared memory is visible to all threads in the block, synchronization is required between stages.
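
In CUDA this maps naturally onto a small device helper; the sketch below uses the illustrative name `compare_exchange` and swaps two shared memory slots in place:

```cuda
// Conditionally swap S[i] and S[j] so the pair respects the requested
// direction. Called by at most one thread per (i, j) pair per stage.
__device__ void compare_exchange(int *S, int i, int j, bool ascending) {
    int x = S[i];
    int y = S[j];
    if ((ascending && x > y) || (!ascending && x < y)) {
        S[i] = y;
        S[j] = x;
    }
}
```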

## Complexity

| measure        | value                |
| -------------- | -------------------- |
| tile size      | $m$                  |
| comparisons    | $O(m \log^2 m)$      |
| parallel depth | $O(\log^2 m)$        |
| memory         | $O(m)$ shared memory |

The tile size $m$ is limited by shared memory capacity and block size.
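
For example, with $m = 1024$ the network runs $\log_2 m = 10$ phases containing $\tfrac{10 \cdot 11}{2} = 55$ compare-exchange stages in total; each stage performs $m/2 = 512$ comparisons, so the block executes $55 \cdot 512 = 28{,}160$ comparisons with a parallel depth of $55$ synchronized steps.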

## Correctness

The bitonic construction first builds bitonic sequences of increasing length and then merges them; by the zero-one principle, a compare-exchange network that sorts every 0/1 input sorts every input, so the final tile is in nondecreasing order.

Within a stage, each pair of positions is read and written by exactly one thread (the lower-indexed partner), and a barrier separates stages, so no data races occur.

## Practical Considerations

* Shared memory bandwidth is much higher than global memory.
* Synchronization via `__syncthreads()` is required between stages.
* Bank conflicts should be minimized by proper memory layout.
* Tile size is constrained by shared memory limits.
* Often combined with global merge or radix passes for full array sorting.
* Each thread may handle multiple elements, so a block can sort a tile larger than its thread count (see the sketch after this list).
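
A minimal sketch of a strided multi-element load, assuming an illustrative compile-time constant `ITEMS_PER_THREAD` (the compare-exchange stages would also have to loop over the extra slots, which is omitted here):

```cuda
#include <climits>

#define ITEMS_PER_THREAD 4   // illustrative; tile = blockDim.x * ITEMS_PER_THREAD

__global__ void load_tile_multi(const int *data, int n) {
    extern __shared__ int S[];

    int tile = blockDim.x * ITEMS_PER_THREAD;
    int base = blockIdx.x * tile;

    // Strided loads keep global memory accesses coalesced across the warp.
    for (int k = 0; k < ITEMS_PER_THREAD; ++k) {
        int idx = threadIdx.x + k * blockDim.x;
        S[idx] = (base + idx < n) ? data[base + idx] : INT_MAX;
    }
    __syncthreads();

    // ... bitonic stages over the full tile, then the strided write-back ...
}
```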

## When to Use

Use CUDA block sort when:

* sorting moderate-sized chunks on the GPU
* building a hierarchical GPU sorting algorithm
* minimizing global memory traffic
* high-speed shared memory access is beneficial

Avoid using it alone for very large arrays. It is designed as a local sorting primitive.

## Implementation Sketch

```cuda id="6z0xk2"
#include <climits>

// Sorts one blockDim.x-sized tile of `data` in place with a bitonic network.
// Assumes blockDim.x is a power of two and one element per thread.
__global__ void block_bitonic_sort(int *data, int n) {
    extern __shared__ int S[];

    int tid = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    // Pad out-of-range slots with INT_MAX so a partial last tile still sorts
    // correctly and the padding stays at the end.
    S[tid] = (base + tid < n) ? data[base + tid] : INT_MAX;

    __syncthreads();

    // Outer loop grows bitonic sequences; inner loop merges them.
    for (int size = 2; size <= blockDim.x; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {

            int j = tid ^ stride;

            // Only the lower-indexed partner of each pair does the exchange,
            // so no two threads write the same slot within a stage.
            if (j > tid) {
                bool ascending = (tid & size) == 0;

                int x = S[tid];
                int y = S[j];

                if ((ascending && x > y) || (!ascending && x < y)) {
                    S[tid] = y;
                    S[j] = x;
                }
            }

            // Every thread must finish this stage before the next one reads S.
            __syncthreads();
        }
    }

    // Write back only the real elements, skipping the padding.
    if (base + tid < n) {
        data[base + tid] = S[tid];
    }
}
```

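A matching launch helper computes the grid size and passes the dynamic shared memory size; the block size must be a power of two for the bitonic network to pair every index.
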
```cuda id="5nv9qp"
// Launches one block per 256-element tile. blockSize must be a power of two
// because the bitonic network pairs indices via tid ^ stride.
void launch_block_sort(int *d_data, int n) {
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;   // one block per tile

    // Third launch parameter reserves blockSize ints of dynamic shared memory.
    block_bitonic_sort<<<gridSize, blockSize, blockSize * sizeof(int)>>>(d_data, n);
}
```
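
A possible host-side driver, assuming the kernel and launcher above live in the same translation unit (error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int n = 1000;
    int h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = n - i;   // reverse-sorted input

    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    launch_block_sort(d_data, n);   // each 256-element tile is sorted independently

    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    // The first tile (elements 0..255) is now in nondecreasing order.
    printf("%d %d %d ...\n", h_data[0], h_data[1], h_data[2]);
    return 0;
}
```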

