# SIMD Bitonic Sort

# SIMD Bitonic Sort

SIMD bitonic sort sorts a small fixed number of elements using vector registers. It maps a bitonic sorting network onto SIMD lanes, replacing scalar compare and exchange with vector min and max operations plus lane shuffles.

The schedule is fixed. No branches depend on data. This property fits SIMD execution where all lanes execute the same instruction.

## Problem

Given a vector of $W$ elements held in SIMD registers, sort them in nondecreasing order.

Typical values of $W$ are 4, 8, 16, or 32 depending on the instruction set.

## Algorithm

Use a bitonic network where each stage performs lane wise compare and exchange with a permuted partner vector.

```text id="7v9gqk"
simd_bitonic_sort(v):
    for each (perm, mask) stage in network:
        w = permute(v, perm)

        lo = min(v, w)
        hi = max(v, w)

        v = blend(lo, hi, mask)

    return v
```

* `permute` rearranges lanes to align comparison partners
* `min` and `max` operate lane wise
* `blend` selects from `lo` or `hi` per lane according to a mask

The network stages depend only on $W$.

## Example for 8 lanes

For $W = 8$, a common sequence uses XOR style pairings:

```text id="1u5d1b"
stage 1: compare distance 1
stage 2: compare distance 2
stage 3: compare distance 1
stage 4: compare distance 4
stage 5: compare distance 2
stage 6: compare distance 1
```

Each stage uses a different permutation and blend mask.

## Compare and Exchange

Given vectors `v` and `w`:

```text id="3qk0bf"
lo = min(v, w)
hi = max(v, w)
v  = blend(lo, hi, mask)
```

The mask encodes which lanes keep the smaller value and which keep the larger value.

## Complexity

| measure     | value         |
| ----------- | ------------- |
| elements    | $W$           |
| stages      | $O(\log^2 W)$ |
| comparisons | fixed         |
| branches    | none          |

All operations are vector instructions.

## Correctness

The bitonic network first creates bitonic subsequences, then merges them into sorted order. Each stage moves smaller values toward the low index lanes and larger values toward the high index lanes. After all stages, the vector is sorted.

Because the schedule is fixed, correctness does not depend on input order.

## Practical Considerations

* Efficient for small arrays that fit in registers.
* Avoids branch misprediction.
* Requires efficient lane permutation instructions.
* Often used as a base case inside larger sorting algorithms.
* Key value pairs require synchronized permutation of both arrays.
* Performance depends on instruction set such as SSE, AVX, AVX512, or NEON.

## When to Use

Use SIMD bitonic sort when:

* sorting very small arrays
* building a hybrid sorting algorithm
* implementing vectorized kernels
* branch free execution is important

Avoid it for large arrays alone. It is a local primitive.

## Implementation Sketch (AVX2, 8 integers)

```c id="5b9l2q"
#include <immintrin.h>

__m256i simd_bitonic8(__m256i v) {
    __m256i w, lo, hi;

    // stage 1
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(6,7,4,5,2,3,0,1));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b10101010);

    // stage 2
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(5,4,7,6,1,0,3,2));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b11001100);

    // stage 3
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(7,5,6,4,3,1,2,0));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b11110000);

    return v;
}
```

```c id="8j2w6r"
void sort8(int *a) {
    __m256i v = _mm256_loadu_si256((__m256i*)a);
    v = simd_bitonic8(v);
    _mm256_storeu_si256((__m256i*)a, v);
}
```

