SIMD Bitonic Sort

Sort small vectors using SIMD registers with a fixed bitonic compare-exchange network.

SIMD bitonic sort sorts a small, fixed number of elements using vector registers. It maps a bitonic sorting network onto SIMD lanes, replacing scalar compare-and-exchange steps with vector min and max operations plus lane shuffles.

The schedule is fixed: no branches depend on the data. This suits SIMD execution, where all lanes execute the same instruction.

Problem

Given a vector of W elements held in SIMD registers, sort them in nondecreasing order.

Typical values of W are 4, 8, 16, or 32, depending on the instruction set.

Algorithm

Use a bitonic network in which each stage performs a lane-wise compare-and-exchange with a permuted partner vector.

simd_bitonic_sort(v):
    for each (perm, mask) stage in network:
        w = permute(v, perm)

        lo = min(v, w)
        hi = max(v, w)

        v = blend(lo, hi, mask)

    return v
  • permute rearranges lanes to align comparison partners
  • min and max operate lane-wise
  • blend selects from lo or hi per lane according to a mask

The network stages depend only on W.
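The pseudocode above can be modeled in portable scalar C, with plain arrays standing in for SIMD lanes. The sketch below is illustrative, not from any library: the helper names and the hard-coded permutation/mask tables for W = 8 are assumptions of this model.

```c
#define W 8

// One network stage: permute partners, take lane-wise min/max, blend.
// mask[i] != 0 means lane i keeps the larger value.
static void stage(int v[W], const int perm[W], const unsigned char mask[W]) {
    int w[W], lo[W], hi[W];
    for (int i = 0; i < W; i++) w[i] = v[perm[i]];
    for (int i = 0; i < W; i++) {
        lo[i] = v[i] < w[i] ? v[i] : w[i];
        hi[i] = v[i] < w[i] ? w[i] : v[i];
    }
    for (int i = 0; i < W; i++) v[i] = mask[i] ? hi[i] : lo[i];
}

// All six stages for W = 8: XOR pairings at distances 1, 2, 1, 4, 2, 1,
// with blend masks encoding each stage's sort directions.
static void scalar_bitonic8(int v[W]) {
    static const int dist[6] = {1, 2, 1, 4, 2, 1};
    static const unsigned char masks[6][W] = {
        {0,1,1,0, 0,1,1,0},
        {0,0,1,1, 1,1,0,0},
        {0,1,0,1, 1,0,1,0},
        {0,0,0,0, 1,1,1,1},
        {0,0,1,1, 0,0,1,1},
        {0,1,0,1, 0,1,0,1},
    };
    for (int s = 0; s < 6; s++) {
        int perm[W];
        for (int i = 0; i < W; i++) perm[i] = i ^ dist[s];  // XOR pairing
        stage(v, perm, masks[s]);
    }
}
```

Running `scalar_bitonic8` on {3, 1, 4, 1, 5, 9, 2, 6} leaves the array sorted as {1, 1, 2, 3, 4, 5, 6, 9}.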

Example for 8 lanes

For W = 8, a common sequence uses XOR-style pairings:

stage 1: compare distance 1
stage 2: compare distance 2
stage 3: compare distance 1
stage 4: compare distance 4
stage 5: compare distance 2
stage 6: compare distance 1

Each stage uses a different permutation and blend mask.
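The distance sequence above follows the classic two-loop bitonic schedule: for each block size k = 2, 4, …, W, compare distances run from k/2 down to 1. A small helper (hypothetical, power-of-two W only) reproduces it:

```c
// Writes the compare-distance schedule for w lanes into out;
// returns the number of stages. Assumes w is a power of two.
static int bitonic_schedule(int w, int out[], int cap) {
    int n = 0;
    for (int k = 2; k <= w; k *= 2)       // block sizes 2, 4, ..., w
        for (int j = k / 2; j >= 1; j /= 2)  // distances k/2 down to 1
            if (n < cap) out[n++] = j;
    return n;
}
```

For w = 8 this yields the six distances 1, 2, 1, 4, 2, 1 listed above.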

Compare and Exchange

Given vectors v and w:

lo = min(v, w)
hi = max(v, w)
v  = blend(lo, hi, mask)

The mask encodes which lanes keep the smaller value and which keep the larger value.
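A concrete scalar sketch of this step (the function name is illustrative):

```c
// Masked compare-exchange over n lanes, modeled in scalar C.
// mask[i] != 0 means lane i keeps the larger of v[i] and w[i].
static void compare_exchange(int *v, const int *w,
                             const unsigned char *mask, int n) {
    for (int i = 0; i < n; i++) {
        int lo = v[i] < w[i] ? v[i] : w[i];
        int hi = v[i] < w[i] ? w[i] : v[i];
        v[i] = mask[i] ? hi : lo;
    }
}
```

For example, with v = {5, 2}, w = {2, 5}, and mask = {0, 1}, lane 0 keeps the minimum and lane 1 the maximum, leaving v = {2, 5}.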

Complexity

measure       value
elements      W
stages        O(log² W)
comparisons   fixed
branches      none

All operations are vector instructions.
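The exact stage count is log2(W) · (log2(W) + 1) / 2, which for W = 8 gives the six stages shown earlier. A quick sketch, assuming a power-of-two W:

```c
// Exact number of compare-exchange stages in a bitonic network on
// w lanes: lg(w) * (lg(w) + 1) / 2, for power-of-two w.
static int bitonic_stage_count(int w) {
    int lg = 0;
    while ((1 << lg) < w) lg++;   // lg = log2(w)
    return lg * (lg + 1) / 2;
}
```

Each stage performs W lane-wise comparisons in a constant number of vector instructions, so the instruction count grows only with the stage count.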

Correctness

The bitonic network first creates bitonic subsequences, then merges them into sorted order. Each stage moves smaller values toward the low index lanes and larger values toward the high index lanes. After all stages, the vector is sorted.

Because the schedule is fixed, correctness does not depend on input order.
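Because bitonic sort is a comparator network, the zero-one principle applies: the network sorts every input if and only if it sorts all 2^W inputs of zeros and ones. A scalar model of the classic network (loop bounds hard-coded to 8 lanes for brevity) can therefore be checked exhaustively:

```c
// Scalar model of the 8-lane bitonic network using the classic loops;
// the direction of each comparator is given by (i & k).
static void bitonic8_scalar(int a[8]) {
    for (int k = 2; k <= 8; k *= 2)
        for (int j = k / 2; j >= 1; j /= 2)
            for (int i = 0; i < 8; i++) {
                int l = i ^ j;
                int up = (i & k) == 0;
                if (l > i && ((up && a[i] > a[l]) || (!up && a[i] < a[l]))) {
                    int t = a[i]; a[i] = a[l]; a[l] = t;
                }
            }
}

// Zero-one principle check: returns 1 if all 256 zero-one inputs sort.
static int check_zero_one(void) {
    for (int m = 0; m < 256; m++) {
        int a[8];
        for (int i = 0; i < 8; i++) a[i] = (m >> i) & 1;
        bitonic8_scalar(a);
        for (int i = 0; i < 7; i++)
            if (a[i] > a[i + 1]) return 0;
    }
    return 1;
}
```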

Practical Considerations

  • Efficient for small arrays that fit in registers.
  • Avoids branch misprediction.
  • Requires efficient lane permutation instructions.
  • Often used as a base case inside larger sorting algorithms.
  • Key-value pairs require applying the same permutations to both arrays.
  • Performance depends on instruction set such as SSE, AVX, AVX512, or NEON.

When to Use

Use SIMD bitonic sort when:

  • sorting very small arrays
  • building a hybrid sorting algorithm
  • implementing vectorized kernels
  • branch free execution is important

Avoid it for large arrays alone. It is a local primitive.

Implementation Sketch (AVX2, 8 integers)

#include <immintrin.h>

// Sorts the 8 int32 lanes of v into nondecreasing order.
// Six stages, matching the distance sequence 1, 2, 1, 4, 2, 1 above.
__m256i simd_bitonic8(__m256i v) {
    __m256i w, lo, hi;

    // stage 1: distance 1, alternating directions (forms bitonic pairs)
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(6,7,4,5,2,3,0,1));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b01100110);

    // stage 2: distance 2 (merges pairs into bitonic runs of four)
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(5,4,7,6,1,0,3,2));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b00111100);

    // stage 3: distance 1
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(6,7,4,5,2,3,0,1));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b01011010);

    // stage 4: distance 4 (final merge of the full bitonic sequence)
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(3,2,1,0,7,6,5,4));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b11110000);

    // stage 5: distance 2
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(5,4,7,6,1,0,3,2));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b11001100);

    // stage 6: distance 1 (every pair ascending)
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(6,7,4,5,2,3,0,1));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b10101010);

    return v;
}

// Sorts 8 contiguous ints in place.
void sort8(int *a) {
    __m256i v = _mm256_loadu_si256((__m256i*)a);
    v = simd_bitonic8(v);
    _mm256_storeu_si256((__m256i*)a, v);
}