SIMD Bitonic Sort

Sort small vectors using SIMD registers with a fixed bitonic compare-exchange network.

SIMD bitonic sort sorts a small, fixed number of elements using vector registers. It maps a bitonic sorting network onto SIMD lanes, replacing scalar compare-and-exchange steps with vector min and max operations plus lane shuffles.

The schedule is fixed: no branches depend on the data. This suits SIMD execution, where all lanes execute the same instruction.

Problem

Given a vector of W elements held in SIMD registers, sort them in nondecreasing order.

Typical values of W are 4, 8, 16, or 32, depending on the instruction set.

Algorithm

Use a bitonic network in which each stage performs a lane-wise compare-and-exchange with a permuted partner vector.

simd_bitonic_sort(v):
    for each (perm, mask) stage in network:
        w = permute(v, perm)

        lo = min(v, w)
        hi = max(v, w)

        v = blend(lo, hi, mask)

    return v
  • permute rearranges lanes to align comparison partners
  • min and max operate lane-wise
  • blend selects from lo or hi per lane according to a mask

The network stages depend only on W.
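The pseudocode above can be modeled in portable scalar C, with plain arrays standing in for SIMD lanes. The sketch below is illustrative, not from any library: the helper names and the hard-coded permutation/mask tables for W = 8 are assumptions of this model.

```c
#define W 8

// One network stage: permute partners, take lane-wise min/max, blend.
// mask[i] != 0 means lane i keeps the larger value.
static void stage(int v[W], const int perm[W], const unsigned char mask[W]) {
    int w[W], lo[W], hi[W];
    for (int i = 0; i < W; i++) w[i] = v[perm[i]];
    for (int i = 0; i < W; i++) {
        lo[i] = v[i] < w[i] ? v[i] : w[i];
        hi[i] = v[i] < w[i] ? w[i] : v[i];
    }
    for (int i = 0; i < W; i++) v[i] = mask[i] ? hi[i] : lo[i];
}

// All six stages for W = 8: XOR pairings at distances 1, 2, 1, 4, 2, 1,
// with blend masks encoding each stage's sort directions.
static void scalar_bitonic8(int v[W]) {
    static const int dist[6] = {1, 2, 1, 4, 2, 1};
    static const unsigned char masks[6][W] = {
        {0,1,1,0, 0,1,1,0},
        {0,0,1,1, 1,1,0,0},
        {0,1,0,1, 1,0,1,0},
        {0,0,0,0, 1,1,1,1},
        {0,0,1,1, 0,0,1,1},
        {0,1,0,1, 0,1,0,1},
    };
    for (int s = 0; s < 6; s++) {
        int perm[W];
        for (int i = 0; i < W; i++) perm[i] = i ^ dist[s];  // XOR pairing
        stage(v, perm, masks[s]);
    }
}
```

Running `scalar_bitonic8` on {3, 1, 4, 1, 5, 9, 2, 6} leaves the array sorted as {1, 1, 2, 3, 4, 5, 6, 9}.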

Example for 8 lanes

For W = 8, a common sequence uses XOR-style pairings:

stage 1: compare distance 1
stage 2: compare distance 2
stage 3: compare distance 1
stage 4: compare distance 4
stage 5: compare distance 2
stage 6: compare distance 1

Each stage uses a different permutation and blend mask.
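The distance sequence above follows the classic two-loop bitonic schedule: for each block size k = 2, 4, …, W, compare distances run from k/2 down to 1. A small helper (hypothetical, power-of-two W only) reproduces it:

```c
// Writes the compare-distance schedule for w lanes into out;
// returns the number of stages. Assumes w is a power of two.
static int bitonic_schedule(int w, int out[], int cap) {
    int n = 0;
    for (int k = 2; k <= w; k *= 2)       // block sizes 2, 4, ..., w
        for (int j = k / 2; j >= 1; j /= 2)  // distances k/2 down to 1
            if (n < cap) out[n++] = j;
    return n;
}
```

For w = 8 this yields the six distances 1, 2, 1, 4, 2, 1 listed above.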

Compare and Exchange

Given vectors v and w:

lo = min(v, w)
hi = max(v, w)
v  = blend(lo, hi, mask)

The mask encodes which lanes keep the smaller value and which keep the larger value.
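A concrete scalar sketch of this step (the function name is illustrative):

```c
// Masked compare-exchange over n lanes, modeled in scalar C.
// mask[i] != 0 means lane i keeps the larger of v[i] and w[i].
static void compare_exchange(int *v, const int *w,
                             const unsigned char *mask, int n) {
    for (int i = 0; i < n; i++) {
        int lo = v[i] < w[i] ? v[i] : w[i];
        int hi = v[i] < w[i] ? w[i] : v[i];
        v[i] = mask[i] ? hi : lo;
    }
}
```

For example, with v = {5, 2}, w = {2, 5}, and mask = {0, 1}, lane 0 keeps the minimum and lane 1 the maximum, leaving v = {2, 5}.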

Complexity

measure       value
elements      W
stages        O(log² W)
comparisons   fixed
branches      none

All operations are vector instructions.
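The exact stage count is log2(W) · (log2(W) + 1) / 2, which for W = 8 gives the six stages shown earlier. A quick sketch, assuming a power-of-two W:

```c
// Exact number of compare-exchange stages in a bitonic network on
// w lanes: lg(w) * (lg(w) + 1) / 2, for power-of-two w.
static int bitonic_stage_count(int w) {
    int lg = 0;
    while ((1 << lg) < w) lg++;   // lg = log2(w)
    return lg * (lg + 1) / 2;
}
```

Each stage performs W lane-wise comparisons in a constant number of vector instructions, so the instruction count grows only with the stage count.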

Correctness

The bitonic network first creates bitonic subsequences, then merges them into sorted order. Each stage moves smaller values toward the low index lanes and larger values toward the high index lanes. After all stages, the vector is sorted.

Because the schedule is fixed, correctness does not depend on input order.
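Because bitonic sort is a comparator network, the zero-one principle applies: the network sorts every input if and only if it sorts all 2^W inputs of zeros and ones. A scalar model of the classic network (loop bounds hard-coded to 8 lanes for brevity) can therefore be checked exhaustively:

```c
// Scalar model of the 8-lane bitonic network using the classic loops;
// the direction of each comparator is given by (i & k).
static void bitonic8_scalar(int a[8]) {
    for (int k = 2; k <= 8; k *= 2)
        for (int j = k / 2; j >= 1; j /= 2)
            for (int i = 0; i < 8; i++) {
                int l = i ^ j;
                int up = (i & k) == 0;
                if (l > i && ((up && a[i] > a[l]) || (!up && a[i] < a[l]))) {
                    int t = a[i]; a[i] = a[l]; a[l] = t;
                }
            }
}

// Zero-one principle check: returns 1 if all 256 zero-one inputs sort.
static int check_zero_one(void) {
    for (int m = 0; m < 256; m++) {
        int a[8];
        for (int i = 0; i < 8; i++) a[i] = (m >> i) & 1;
        bitonic8_scalar(a);
        for (int i = 0; i < 7; i++)
            if (a[i] > a[i + 1]) return 0;
    }
    return 1;
}
```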

Practical Considerations

  • Efficient for small arrays that fit in registers.
  • Avoids branch misprediction.
  • Requires efficient lane permutation instructions.
  • Often used as a base case inside larger sorting algorithms.
  • Key-value pairs require applying the same permutations to both arrays.
  • Performance depends on instruction set such as SSE, AVX, AVX512, or NEON.

When to Use

Use SIMD bitonic sort when:

  • sorting very small arrays
  • building a hybrid sorting algorithm
  • implementing vectorized kernels
  • branch free execution is important

Avoid it for large arrays alone. It is a local primitive.

Implementation Sketch (AVX2, 8 integers)

#include <immintrin.h>

// Sorts the 8 int32 lanes of v into nondecreasing order.
// Six stages, matching the distance sequence 1, 2, 1, 4, 2, 1 above.
__m256i simd_bitonic8(__m256i v) {
    __m256i w, lo, hi;

    // stage 1: distance 1, alternating directions (forms bitonic pairs)
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(6,7,4,5,2,3,0,1));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b01100110);

    // stage 2: distance 2 (merges pairs into bitonic runs of four)
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(5,4,7,6,1,0,3,2));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b00111100);

    // stage 3: distance 1
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(6,7,4,5,2,3,0,1));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b01011010);

    // stage 4: distance 4 (final merge of the full bitonic sequence)
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(3,2,1,0,7,6,5,4));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b11110000);

    // stage 5: distance 2
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(5,4,7,6,1,0,3,2));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b11001100);

    // stage 6: distance 1 (every pair ascending)
    w  = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(6,7,4,5,2,3,0,1));
    lo = _mm256_min_epi32(v, w);
    hi = _mm256_max_epi32(v, w);
    v  = _mm256_blend_epi32(lo, hi, 0b10101010);

    return v;
}

// Sorts 8 contiguous ints in place.
void sort8(int *a) {
    __m256i v = _mm256_loadu_si256((__m256i*)a);
    v = simd_bitonic8(v);
    _mm256_storeu_si256((__m256i*)a, v);
}