# GPU Merge Sort

GPU merge sort adapts merge sort to GPU execution by replacing recursive merging with parallel merge kernels. The algorithm operates in passes, merging sorted runs of increasing size.

Unlike GPU radix sort, this method works for arbitrary comparable keys. Its performance depends on efficient parallel merge and memory throughput.

## Problem

Given an array $A$ of size $n$ stored on a GPU, sort it in nondecreasing order.

## Algorithm

Instead of recursive calls, use iterative bottom-up merging. Each pass doubles the run size.

```text id="x2c9rk"
gpu_merge_sort(A):
    n = length(A)
    B = new array of length n

    for width = 1; width < n; width *= 2:
        parallel for i = 0 to n - 1 step 2 * width:
            merge A[i : min(i+width, n)] and A[min(i+width, n) : min(i+2*width, n)] into B

        swap A and B

    return A
```

Each pass merges adjacent sorted segments of length `width`.
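For reference, the same bottom-up pass structure runs sequentially on the CPU. The sketch below (the names `cpu_merge_sort` and `merge_run` are illustrative, not from the text) is useful for validating GPU results:

```c
#include <stdlib.h>
#include <string.h>

// Merge the sorted runs in[lo:mid] and in[mid:hi] into out[lo:hi].
static void merge_run(const int *in, int *out, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) out[k++] = (in[i] <= in[j]) ? in[i++] : in[j++];
    while (i < mid) out[k++] = in[i++];
    while (j < hi)  out[k++] = in[j++];
}

void cpu_merge_sort(int *a, int n) {
    int *b = malloc(n * sizeof(int));
    int *src = a, *dst = b;

    // Bottom-up passes: run length doubles each iteration.
    for (int width = 1; width < n; width *= 2) {
        for (int i = 0; i < n; i += 2 * width) {
            int mid = i + width < n ? i + width : n;
            int hi  = i + 2 * width < n ? i + 2 * width : n;
            merge_run(src, dst, i, mid, hi);
        }
        int *tmp = src; src = dst; dst = tmp;
    }

    // After an odd number of passes the result is in the scratch buffer.
    if (src != a) memcpy(a, src, n * sizeof(int));
    free(b);
}
```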

## Parallel Merge

A naive merge of two runs is inherently sequential: each output element depends on the previous comparison. A GPU merge splits the output range into independent chunks using partitioning.

```text id="g1w0tr"
parallel_merge(A, B, C):
    partition output range into chunks

    for each chunk:
        find corresponding split points in A and B using binary search
        merge subranges independently
```

Each thread block merges a portion of the output. Partitioning ensures balanced work.

## Merge Path Idea

For two sorted arrays $A$ and $B$, the merge corresponds to a monotone path through an $|A| \times |B|$ grid: each step consumes the next element of either $A$ or $B$. Cutting this path along evenly spaced diagonals splits the merge into independent regions of equal output size.

Given diagonal index $d$, find indices $i$ and $j$ such that:

$$
i + j = d
$$

and

$$
A[i - 1] \le B[j] \quad \text{and} \quad B[j - 1] \le A[i]
$$

These conditions define a valid partition point, with out-of-range terms treated as $-\infty$ on the left and $+\infty$ on the right.
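
A minimal device-side sketch of this search, assuming integer keys; `merge_path_partition` is an illustrative name. Given a diagonal $d$ over runs of length `na` and `nb`, it binary searches for the split index $i$ (with $j = d - i$) satisfying the conditions above:

```cuda
// Returns i in [max(0, d - nb), min(d, na)] such that, with j = d - i,
// A[i-1] <= B[j] and B[j-1] <= A[i] (out-of-range terms act as +/- infinity).
__device__ int merge_path_partition(const int *a, int na,
                                    const int *b, int nb, int d)
{
    int lo = max(0, d - nb);
    int hi = min(d, na);

    while (lo < hi) {
        int i = (lo + hi) / 2;
        int j = d - i;

        if (a[i] < b[j - 1]) {
            // a[i] is still below the crossover: move the split right in A.
            lo = i + 1;
        } else {
            hi = i;
        }
    }
    return lo;
}
```

Each thread block computes the split points for its own diagonals and then merges the resulting subranges of $A$ and $B$ sequentially into its slice of the output.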

## Complexity

| measure       | value            |
| ------------- | ---------------- |
| passes        | $O(\log n)$      |
| work per pass | $O(n)$           |
| total work    | $O(n \log n)$    |
| parallel time | $O((n/p)\log n)$ |
| extra space   | $O(n)$           |

Performance depends on efficient partitioning and memory access.

## Correctness

Each pass merges pairs of sorted runs into larger sorted runs. Since merging preserves order and each pass doubles the run length, after $\lceil \log_2 n \rceil$ passes the entire array is a single sorted run.

Parallel partitioning divides merge work without altering merge correctness because each partition corresponds to a disjoint portion of the output.

## Practical Considerations

* Memory bandwidth dominates performance.
* Coalesced memory access is critical.
* Partitioning must balance load across blocks.
* Small runs can be sorted in shared memory before global merging (see the sketch after this list).
* Requires auxiliary array of size $n$.
* Often combined with block-level sorting primitives.
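
As a sketch of that idea (assuming a power-of-two tile size; `TILE` and `block_sort` are illustrative names), each block can sort one tile in shared memory with odd-even transposition sort, producing sorted runs of length `TILE` that let the global passes start at `width = TILE`:

```cuda
#include <limits.h>

#define TILE 256  // assumed power-of-two tile size, one element per thread

__global__ void block_sort(int *data, int n) {
    __shared__ int tile[TILE];

    int tid = threadIdx.x;
    int idx = blockIdx.x * TILE + tid;

    // Load one element per thread; pad the tail with +infinity sentinels.
    tile[tid] = (idx < n) ? data[idx] : INT_MAX;
    __syncthreads();

    // Odd-even transposition sort: TILE phases of compare-and-swap on
    // alternating disjoint pairs. Simple and regular, though O(TILE^2) work.
    for (int phase = 0; phase < TILE; ++phase) {
        int i = 2 * tid + (phase & 1);
        if (i + 1 < TILE && tile[i] > tile[i + 1]) {
            int t = tile[i];
            tile[i] = tile[i + 1];
            tile[i + 1] = t;
        }
        __syncthreads();
    }

    // Write the sorted tile back; each TILE-sized run is now sorted.
    if (idx < n) data[idx] = tile[tid];
}
```

Launched as `block_sort<<<(n + TILE - 1) / TILE, TILE>>>(d_a, n)`, this pre-pass removes the smallest merge passes from the global loop.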

## When to Use

Use GPU merge sort when:

* keys require custom comparison
* stability is required
* data size is large
* radix sort is not applicable

Avoid it when integer keys allow faster radix sorting or when data transfer cost dominates.

## Implementation Sketch

```cuda id="8s9f0n"
// One thread merges one pair of adjacent sorted runs of length `width`.
__global__ void merge_pass(
    const int *in,
    int *out,
    int n,
    int width
) {
    // Global thread index selects which pair of runs to merge.
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * (2 * width);

    if (start >= n) return;

    // Clamp run boundaries at the end of the array.
    int mid = min(start + width, n);
    int end = min(start + 2 * width, n);

    int i = start;  // cursor in the left run
    int j = mid;    // cursor in the right run
    int k = start;  // cursor in the output

    // Two-way merge; <= keeps the sort stable.
    while (i < mid && j < end) {
        if (in[i] <= in[j]) {
            out[k++] = in[i++];
        } else {
            out[k++] = in[j++];
        }
    }

    // Copy whatever remains of either run.
    while (i < mid) out[k++] = in[i++];
    while (j < end) out[k++] = in[j++];
}
```

```c id="0k3qhp"
void gpu_merge_sort(int *d_a, int n) {
    int *d_b;
    cudaMalloc(&d_b, n * sizeof(int));

    int *src = d_a;
    int *dst = d_b;

    for (int width = 1; width < n; width <<= 1) {
        // One thread per pair of runs; a tuned version would use many
        // threads per block with a partitioned merge.
        int blocks = (n + (2 * width) - 1) / (2 * width);
        merge_pass<<<blocks, 1>>>(src, dst, n, width);

        // The merged output becomes the input of the next pass.
        int *tmp = src;
        src = dst;
        dst = tmp;
    }

    // After an odd number of passes the sorted data sits in the scratch
    // buffer; copy it back so the caller's buffer holds the result.
    if (src != d_a) {
        cudaMemcpy(d_a, src, n * sizeof(int), cudaMemcpyDeviceToDevice);
    }

    cudaFree(d_b);
}
```
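
A minimal host-side usage sketch, assuming the data starts in host memory; `sort_on_gpu` is an illustrative wrapper and error checking is omitted:

```c
#include <cuda_runtime.h>

// Copy host data to the device, sort it in place there, and copy it back.
void sort_on_gpu(int *host, int n) {
    int *d_a;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMemcpy(d_a, host, n * sizeof(int), cudaMemcpyHostToDevice);

    gpu_merge_sort(d_a, n);

    cudaMemcpy(host, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
}
```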

