CUDA block sort sorts a tile of data using all threads in a block. It extends warp-level sorting to larger sizes by using shared memory and synchronization within a thread block.
Each block loads a chunk of the array into shared memory, sorts it cooperatively, then writes it back. This is often used as a building block for full GPU sorting algorithms such as merge sort or radix sort.
Problem
Given a tile of elements assigned to one CUDA block, sort the tile in nondecreasing order.
Algorithm
A typical implementation uses a sorting network such as bitonic sort over shared memory.
cuda_block_sort(A_block):
    load A_block into shared memory S
    for size = 2; size <= m; size *= 2:
        for stride = size / 2; stride > 0; stride /= 2:
            synchronize threads
            for each thread i in parallel:
                j = i xor stride
                if j > i:
                    ascending = (i & size) == 0
                    compare_exchange(S[i], S[j], ascending)
    synchronize threads
    write S back to global memory

All threads cooperate on the same shared memory array.
Shared Memory Layout
The array is stored in shared memory:
shared int S[m]

Each thread typically handles one or more elements.
Compare and Exchange
compare_exchange(x, y, ascending):
if ascending and x > y:
swap x, y
if not ascending and x < y:
swap x, y

Because shared memory is visible to all threads in the block, synchronization is required between stages.
Complexity
| measure | value |
|---|---|
| tile size | m elements per block |
| comparisons | O(m log² m) |
| parallel depth | O(log² m) stages |
| memory | O(m) shared memory |
The tile size is limited by shared memory capacity and block size.
Correctness
The sorting network ensures that after each stage, elements move toward their correct positions. The bitonic construction and merge guarantee that the final arrangement is sorted.
Since all threads operate on the same shared array with synchronization at each stage, no race conditions occur.
Practical Considerations
- Shared memory bandwidth is much higher than global memory.
- Synchronization via `__syncthreads()` is required between stages.
- Bank conflicts should be minimized by proper memory layout.
- Tile size is constrained by shared memory limits.
- Often combined with global merge or radix passes for full array sorting.
- Each thread may handle multiple elements for better occupancy.
When to Use
Use CUDA block sort when:
- sorting moderate-sized chunks on the GPU
- building a hierarchical GPU sorting algorithm
- minimizing global memory traffic
- high-speed shared memory access is beneficial
Avoid using it alone for very large arrays. It is designed as a local sorting primitive.
Implementation Sketch
#include <limits.h>  // INT_MAX

__global__ void block_bitonic_sort(int *data, int n) {
    extern __shared__ int S[];
    int tid = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    // Load the tile; pad out-of-range slots with INT_MAX so the network
    // never mixes uninitialized shared memory into valid positions.
    S[tid] = (base + tid < n) ? data[base + tid] : INT_MAX;
    __syncthreads();

    for (int size = 2; size <= blockDim.x; size <<= 1) {
        for (int stride = size >> 1; stride > 0; stride >>= 1) {
            int j = tid ^ stride;
            if (j > tid) {
                bool ascending = (tid & size) == 0;
                int x = S[tid];
                int y = S[j];
                if ((ascending && x > y) || (!ascending && x < y)) {
                    S[tid] = y;
                    S[j] = x;
                }
            }
            __syncthreads();  // all pairs finish before the next stage
        }
    }

    if (base + tid < n) {
        data[base + tid] = S[tid];
    }
}

void launch_block_sort(int *d_data, int n) {
    int blockSize = 256;  // must be a power of two for the bitonic network
    int gridSize = (n + blockSize - 1) / blockSize;
    block_bitonic_sort<<<gridSize, blockSize, blockSize * sizeof(int)>>>(d_data, n);
}