Sort data on a GPU by recursively merging sorted runs using parallel merge primitives.
GPU merge sort adapts merge sort to GPU execution by replacing recursive merging with parallel merge kernels. The algorithm operates in passes, merging sorted runs of increasing size.
Unlike GPU radix sort, this method works for arbitrary comparable keys. Its performance depends on efficient parallel merge and memory throughput.
Problem
Given an array of size n stored on a GPU, sort it in nondecreasing order.
Algorithm
Instead of recursive calls, use iterative bottom-up merging. Each pass doubles the run size.
```
gpu_merge_sort(A):
    B = new array of length n
    for width = 1; width < n; width *= 2:
        parallel for i = 0 to n - 1 step 2 * width:
            merge A[i : i+width] and A[i+width : i+2*width] into B[i : i+2*width]
        swap A and B
    return A
```

Each pass merges adjacent sorted segments of length width.
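The pass structure above can be sketched sequentially on the CPU; this is a stand-in only (the inner loop over segments is exactly what a GPU would run in parallel), using `std::merge` for the per-segment merge:

```cpp
#include <algorithm>
#include <vector>

// Bottom-up merge sort: each pass merges adjacent runs of length `width`.
// The segments of the inner loop are independent, which is what lets each
// pass become a single parallel kernel launch on a GPU.
std::vector<int> bottom_up_merge_sort(std::vector<int> a) {
    size_t n = a.size();
    if (n < 2) return a;
    std::vector<int> b(n);
    for (size_t width = 1; width < n; width *= 2) {
        for (size_t i = 0; i < n; i += 2 * width) {
            size_t mid = std::min(i + width, n);
            size_t end = std::min(i + 2 * width, n);
            std::merge(a.begin() + i, a.begin() + mid,
                       a.begin() + mid, a.begin() + end,
                       b.begin() + i);
        }
        std::swap(a, b);  // this pass's output becomes the next pass's input
    }
    return a;
}
```

Note the buffer swap: each pass reads from one array and writes to the other, mirroring the ping-pong between `A` and `B` in the pseudocode.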
Parallel Merge
A naive merge is sequential. GPU merge splits work using partitioning.
```
parallel_merge(A, B, C):
    partition the output range C into chunks
    for each chunk, in parallel:
        find the corresponding split points in A and B using binary search
        merge the subranges independently
```

Each thread block merges a portion of the output. Partitioning ensures balanced work.
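A CPU sketch of this scheme (the names `co_rank` and `partitioned_merge` are illustrative): for each output chunk, a binary search determines how many elements come from each input, after which all chunks can be merged independently.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Binary search for the split point: how many of the first k merged
// elements come from `a` (the remaining k - i come from `b`).
size_t co_rank(size_t k, const std::vector<int>& a, const std::vector<int>& b) {
    size_t i = std::min(k, a.size());
    size_t j = k - i;
    size_t i_low = k > b.size() ? k - b.size() : 0;
    size_t j_low = k > a.size() ? k - a.size() : 0;
    while (true) {
        if (i > 0 && j < b.size() && a[i - 1] > b[j]) {
            size_t delta = (i - i_low + 1) / 2;   // took too many from a
            j_low = j; j += delta; i -= delta;
        } else if (j > 0 && i < a.size() && b[j - 1] >= a[i]) {
            size_t delta = (j - j_low + 1) / 2;   // took too many from b
            i_low = i; i += delta; j -= delta;
        } else {
            return i;
        }
    }
}

// Merge in independent chunks; on a GPU each chunk maps to a thread block.
std::vector<int> partitioned_merge(const std::vector<int>& a,
                                   const std::vector<int>& b,
                                   size_t chunks) {
    size_t total = a.size() + b.size();
    std::vector<int> out(total);
    for (size_t c = 0; c < chunks; ++c) {
        size_t k0 = total * c / chunks, k1 = total * (c + 1) / chunks;
        size_t i0 = co_rank(k0, a, b), i1 = co_rank(k1, a, b);
        std::merge(a.begin() + i0, a.begin() + i1,
                   b.begin() + (k0 - i0), b.begin() + (k1 - i1),
                   out.begin() + k0);
    }
    return out;
}
```

Ties are broken in favor of `a` (`>` versus `>=` in the two tests), which keeps the merge stable.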
Merge Path Idea
For two sorted arrays A (length m) and B (length n), the merged output corresponds to a monotone path on an m × n grid. A diagonal partition splits the merge into independent regions.

Given diagonal index d, find indices i and j with i + j = d such that:

A[i-1] <= B[j] and B[j-1] < A[i]

where out-of-range comparisons count as satisfied. These conditions define a valid partition point.
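A minimal sketch of the diagonal search (the function name `merge_path` is illustrative): binary search along diagonal d for the i that satisfies both conditions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Binary search along diagonal d for the partition point (i, j) with
// i + j = d, A[i-1] <= B[j], and B[j-1] < A[i]; out-of-range terms count
// as satisfied. Returns i; the matching j is d - i.
size_t merge_path(size_t d, const std::vector<int>& A, const std::vector<int>& B) {
    size_t lo = d > B.size() ? d - B.size() : 0;  // feasible range for i
    size_t hi = std::min(d, A.size());
    while (lo < hi) {
        size_t i = (lo + hi) / 2;
        size_t j = d - i;
        if (j > 0 && i < A.size() && B[j - 1] >= A[i])
            lo = i + 1;          // B[j-1] < A[i] violated: i is too small
        else if (i > 0 && j < B.size() && A[i - 1] > B[j])
            hi = i;              // A[i-1] <= B[j] violated: i is too large
        else
            return i;
    }
    return lo;
}
```

For example, with A = {1, 3, 5, 7} and B = {2, 4, 6, 8}, diagonal d = 4 yields i = 2, j = 2: the first four merged elements {1, 2, 3, 4} take two from each array.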
Complexity
| measure | value |
|---|---|
| passes | ⌈log₂ n⌉ |
| work per pass | O(n) |
| total work | O(n log n) |
| parallel time | O(log² n) with sufficient processors |
| extra space | O(n) |
Performance depends on efficient partitioning and memory access.
Correctness
Each pass merges pairs of sorted runs into larger sorted runs. Since merging preserves order and each pass doubles the run size, after ⌈log₂ n⌉ passes the entire array is sorted.
Parallel partitioning divides merge work without altering merge correctness because each partition corresponds to a disjoint portion of the output.
Practical Considerations
- Memory bandwidth dominates performance.
- Coalesced memory access is critical.
- Partitioning must balance load across blocks.
- Small runs can be sorted using shared memory before global merging.
- Requires an auxiliary array of size n.
- Often combined with block-level sorting primitives.
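The last two points combine into a common hybrid: pre-sort fixed-size blocks first (here `std::sort` stands in for a shared-memory block sort; `BLOCK` and `hybrid_sort` are illustrative), then start the merge passes at width = BLOCK, skipping the smallest and least efficient passes.

```cpp
#include <algorithm>
#include <vector>

// Pre-sort fixed-size blocks (on a GPU: one shared-memory sort per thread
// block), then merge bottom-up starting at width = BLOCK instead of 1.
std::vector<int> hybrid_sort(std::vector<int> a) {
    const size_t BLOCK = 4;  // illustrative; GPU block sorts use e.g. 512-2048
    size_t n = a.size();
    for (size_t i = 0; i < n; i += BLOCK)
        std::sort(a.begin() + i, a.begin() + std::min(i + BLOCK, n));
    std::vector<int> b(n);
    for (size_t width = BLOCK; width < n; width *= 2) {
        for (size_t i = 0; i < n; i += 2 * width) {
            size_t mid = std::min(i + width, n);
            size_t end = std::min(i + 2 * width, n);
            std::merge(a.begin() + i, a.begin() + mid,
                       a.begin() + mid, a.begin() + end,
                       b.begin() + i);
        }
        std::swap(a, b);
    }
    return a;
}
```

Starting at width = BLOCK removes log₂ BLOCK global-memory passes, which is where the bandwidth savings come from.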
When to Use
Use GPU merge sort when:
- keys require custom comparison
- stability is required
- data size is large
- radix sort is not applicable
Avoid it when integer keys allow faster radix sorting or when data transfer cost dominates.
Implementation Sketch
```c
__global__ void merge_pass(
    const int *in,
    int *out,
    int n,
    int width
) {
    // One thread merges one pair of adjacent runs of length `width`.
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * (2 * width);
    if (start >= n) return;
    int mid = min(start + width, n);
    int end = min(start + 2 * width, n);
    int i = start;
    int j = mid;
    int k = start;
    while (i < mid && j < end) {
        if (in[i] <= in[j]) {   // <= keeps the sort stable
            out[k++] = in[i++];
        } else {
            out[k++] = in[j++];
        }
    }
    while (i < mid) out[k++] = in[i++];
    while (j < end) out[k++] = in[j++];
}

void gpu_merge_sort(int *d_a, int n) {
    int *d_b;
    cudaMalloc(&d_b, n * sizeof(int));
    int *orig = d_a;
    for (int width = 1; width < n; width <<= 1) {
        int segments = (n + (2 * width) - 1) / (2 * width);
        int threads = 256;
        int blocks = (segments + threads - 1) / threads;
        merge_pass<<<blocks, threads>>>(d_a, d_b, n, width);
        int *tmp = d_a;
        d_a = d_b;
        d_b = tmp;
    }
    // After an odd number of passes the sorted data sits in the scratch
    // buffer; copy it back so the caller's pointer sees the result.
    if (d_a != orig)
        cudaMemcpy(orig, d_a, n * sizeof(int), cudaMemcpyDeviceToDevice);
    cudaFree(d_a == orig ? d_b : d_a);
}
```