GPU Merge Sort

Sort data on a GPU by recursively merging sorted runs using parallel merge primitives.

GPU merge sort adapts merge sort to GPU execution by replacing recursive merging with parallel merge kernels. The algorithm operates in passes, merging sorted runs of increasing size.

Unlike GPU radix sort, this method works for arbitrary comparable keys. Its performance depends on efficient parallel merge and memory throughput.

Problem

Given an array A of size n stored on a GPU, sort it in nondecreasing order.

Algorithm

Instead of recursive calls, use iterative bottom-up merging. Each pass doubles the run size.

gpu_merge_sort(A):
    B = new array of length n

    for width = 1; width < n; width *= 2:
        parallel for i = 0 to n - 1 step 2 * width:
            merge A[i : i+width] and A[i+width : i+2*width] into B

        swap A and B

    return A

Each pass merges adjacent sorted segments of length width.
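As a reference point, the pass structure can be written sequentially in plain C++ (the names `merge_pass_cpu` and `bottom_up_merge_sort` are illustrative, not from any library); the GPU version parallelizes the inner loop:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// One bottom-up pass: merge adjacent sorted runs of length `width`
// from `in` into `out` (the pseudocode's parallel loop, run serially).
void merge_pass_cpu(const std::vector<int>& in, std::vector<int>& out,
                    std::size_t width) {
    std::size_t n = in.size();
    for (std::size_t i = 0; i < n; i += 2 * width) {
        std::size_t mid = std::min(i + width, n);
        std::size_t end = std::min(i + 2 * width, n);
        std::merge(in.begin() + i,   in.begin() + mid,
                   in.begin() + mid, in.begin() + end,
                   out.begin() + i);
    }
}

std::vector<int> bottom_up_merge_sort(std::vector<int> a) {
    std::vector<int> b(a.size());
    for (std::size_t width = 1; width < a.size(); width *= 2) {
        merge_pass_cpu(a, b, width);
        std::swap(a, b);  // this pass's output feeds the next pass
    }
    return a;
}
```

Each pass reads from one buffer and writes into the other, which is why the algorithm needs an auxiliary array of the same size.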

Parallel Merge

A naive merge is inherently sequential. A GPU merge splits the work across threads by partitioning the output.

parallel_merge(A, B, C):
    partition output range into chunks

    for each chunk:
        find corresponding split points in A and B using binary search
        merge subranges independently

Each thread block merges a portion of the output. Partitioning ensures balanced work.
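A minimal CPU sketch of this chunked scheme, cutting A into fixed-size chunks and binary-searching the matching split point in B (function and variable names are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Merge sorted A and B into C by cutting A into fixed-size chunks and
// binary-searching the matching split point in B, so that each pair of
// subranges can be merged independently. Serial here; a GPU version
// would hand one chunk to each thread block.
void partitioned_merge(const std::vector<int>& A,
                       const std::vector<int>& B,
                       std::vector<int>& C,
                       std::size_t chunk) {
    std::size_t bLo = 0, cPos = 0;
    for (std::size_t aLo = 0; aLo < A.size(); aLo += chunk) {
        std::size_t aHi = std::min(aLo + chunk, A.size());
        // All elements of B strictly less than A[aHi] belong before it.
        std::size_t bHi = (aHi < A.size())
            ? static_cast<std::size_t>(
                  std::lower_bound(B.begin() + bLo, B.end(), A[aHi]) - B.begin())
            : B.size();
        std::merge(A.begin() + aLo, A.begin() + aHi,
                   B.begin() + bLo, B.begin() + bHi,
                   C.begin() + cPos);
        cPos += (aHi - aLo) + (bHi - bLo);
        bLo = bHi;
    }
    // If A is empty the loop never runs; copy any leftover of B.
    std::copy(B.begin() + bLo, B.end(), C.begin() + cPos);
}
```

Note that this simple split balances the elements of A across chunks but not the output size; the merge path scheme described next balances the output itself.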

Merge Path Idea

For two sorted arrays A and B, the merged output lies on a grid. A diagonal partition splits the merge into independent regions.

Given diagonal index d, find indices i and j such that

i + j = d

and

A[i - 1] ≤ B[j]   and   B[j - 1] ≤ A[i]

These conditions define a valid partition point.
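A CPU sketch of the diagonal search (the function name is illustrative, and d ≤ |A| + |B| is assumed): binary-search for the smallest i on the diagonal that satisfies both conditions, treating out-of-range comparisons as satisfied.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Find the split index i on diagonal d (with j = d - i) such that
// A[i-1] <= B[j] and B[j-1] <= A[i], taking out-of-range terms as
// satisfied. Assumes 0 <= d <= |A| + |B|.
std::size_t merge_path_partition(const std::vector<int>& A,
                                 const std::vector<int>& B,
                                 std::size_t d) {
    std::size_t lo = d > B.size() ? d - B.size() : 0;  // j must fit in B
    std::size_t hi = std::min(d, A.size());            // i must fit in A
    while (lo < hi) {
        std::size_t i = lo + (hi - lo) / 2;
        std::size_t j = d - i;
        if (A[i] < B[j - 1])
            lo = i + 1;  // i too small: A[i] must be >= B[j-1]
        else
            hi = i;
    }
    return lo;  // smallest i satisfying both conditions
}
```

On a GPU, each thread block runs this search for its own diagonal, so all blocks can locate their input subranges independently before merging.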

Complexity

measure                      value
passes                       O(log n)
work per pass                O(n)
total work                   O(n log n)
parallel time (p processors) O((n/p) log n)
extra space                  O(n)

Performance depends on efficient partitioning and memory access.

Correctness

Each pass merges pairs of sorted runs into larger sorted runs. Since merging preserves order, and each pass doubles run size, after log n passes the entire array is sorted.

Parallel partitioning divides merge work without altering merge correctness because each partition corresponds to a disjoint portion of the output.

Practical Considerations

  • Memory bandwidth dominates performance.
  • Coalesced memory access is critical.
  • Partitioning must balance load across blocks.
  • Small runs can be sorted using shared memory before global merging.
  • Requires an auxiliary array of size n.
  • Often combined with block-level sorting primitives.

When to Use

Use GPU merge sort when:

  • keys require custom comparison
  • stability is required
  • data size is large
  • radix sort is not applicable

Avoid it when integer keys allow faster radix sorting or when data transfer cost dominates.

Implementation Sketch

__global__ void merge_pass(
    const int *in,
    int *out,
    int n,
    int width
) {
    // Each thread merges one pair of adjacent runs sequentially
    // (the naive merge; a partitioned merge-path kernel scales better).
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * (2 * width);

    if (start >= n) return;

    int mid = min(start + width, n);
    int end = min(start + 2 * width, n);

    int i = start;
    int j = mid;
    int k = start;

    while (i < mid && j < end) {
        if (in[i] <= in[j]) {
            out[k++] = in[i++];
        } else {
            out[k++] = in[j++];
        }
    }

    while (i < mid) out[k++] = in[i++];
    while (j < end) out[k++] = in[j++];
}

void gpu_merge_sort(int *d_a, int n) {
    int *d_b;
    cudaMalloc(&d_b, n * sizeof(int));

    int *in = d_a, *out = d_b;
    for (int width = 1; width < n; width <<= 1) {
        int blocks = (n + (2 * width) - 1) / (2 * width);
        merge_pass<<<blocks, 1>>>(in, out, n, width);

        // The output of this pass is the input of the next.
        int *tmp = in;
        in = out;
        out = tmp;
    }

    // After an odd number of passes the sorted data sits in d_b;
    // copy it back so the caller's buffer holds the result.
    if (in != d_a)
        cudaMemcpy(d_a, in, n * sizeof(int), cudaMemcpyDeviceToDevice);

    cudaFree(d_b);
}