Sort small vectors using SIMD registers with a fixed bitonic compare exchange network.
SIMD bitonic sort sorts a small fixed number of elements using vector registers. It maps a bitonic sorting network onto SIMD lanes, replacing scalar compare and exchange with vector min and max operations plus lane shuffles.
The schedule is fixed. No branches depend on data. This property fits SIMD execution where all lanes execute the same instruction.
Problem
Given a vector of elements held in SIMD registers, sort them in nondecreasing order.
Typical values of are 4, 8, 16, or 32 depending on the instruction set.
Algorithm
Use a bitonic network where each stage performs lane wise compare and exchange with a permuted partner vector.
simd_bitonic_sort(v):
for each (perm, mask) stage in network:
w = permute(v, perm)
lo = min(v, w)
hi = max(v, w)
v = blend(lo, hi, mask)
return vpermuterearranges lanes to align comparison partnersminandmaxoperate lane wiseblendselects fromloorhiper lane according to a mask
The network stages depend only on .
Example for 8 lanes
For , a common sequence uses XOR style pairings:
stage 1: compare distance 1
stage 2: compare distance 2
stage 3: compare distance 1
stage 4: compare distance 4
stage 5: compare distance 2
stage 6: compare distance 1Each stage uses a different permutation and blend mask.
Compare and Exchange
Given vectors v and w:
lo = min(v, w)
hi = max(v, w)
v = blend(lo, hi, mask)The mask encodes which lanes keep the smaller value and which keep the larger value.
Complexity
| measure | value |
|---|---|
| elements | |
| stages | |
| comparisons | fixed |
| branches | none |
All operations are vector instructions.
Correctness
The bitonic network first creates bitonic subsequences, then merges them into sorted order. Each stage moves smaller values toward the low index lanes and larger values toward the high index lanes. After all stages, the vector is sorted.
Because the schedule is fixed, correctness does not depend on input order.
Practical Considerations
- Efficient for small arrays that fit in registers.
- Avoids branch misprediction.
- Requires efficient lane permutation instructions.
- Often used as a base case inside larger sorting algorithms.
- Key value pairs require synchronized permutation of both arrays.
- Performance depends on instruction set such as SSE, AVX, AVX512, or NEON.
When to Use
Use SIMD bitonic sort when:
- sorting very small arrays
- building a hybrid sorting algorithm
- implementing vectorized kernels
- branch free execution is important
Avoid it for large arrays alone. It is a local primitive.
Implementation Sketch (AVX2, 8 integers)
#include <immintrin.h>
__m256i simd_bitonic8(__m256i v) {
__m256i w, lo, hi;
// stage 1
w = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(6,7,4,5,2,3,0,1));
lo = _mm256_min_epi32(v, w);
hi = _mm256_max_epi32(v, w);
v = _mm256_blend_epi32(lo, hi, 0b10101010);
// stage 2
w = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(5,4,7,6,1,0,3,2));
lo = _mm256_min_epi32(v, w);
hi = _mm256_max_epi32(v, w);
v = _mm256_blend_epi32(lo, hi, 0b11001100);
// stage 3
w = _mm256_permutevar8x32_epi32(v, _mm256_set_epi32(7,5,6,4,3,1,2,0));
lo = _mm256_min_epi32(v, w);
hi = _mm256_max_epi32(v, w);
v = _mm256_blend_epi32(lo, hi, 0b11110000);
return v;
}void sort8(int *a) {
__m256i v = _mm256_loadu_si256((__m256i*)a);
v = simd_bitonic8(v);
_mm256_storeu_si256((__m256i*)a, v);
}