Sort a fixed size vector by mapping a data independent sorting network to SIMD compare, shuffle, and blend operations.
A SIMD sorting network sorts a fixed number of elements using a predetermined sequence of compare and exchange operations. Each stage can be mapped to vector instructions such as lane permutation, vector minimum, vector maximum, and blend.
Unlike comparison sorts with branches, the schedule does not depend on input values. This makes the method useful for short arrays, database kernels, vectorized partitioning, and base cases inside larger sorting routines.
Problem
Given values stored in SIMD lanes, sort them in nondecreasing order.
The value of is fixed by the vector width and element type. For example, AVX2 can hold eight 32 bit integers in one 256 bit register.
Algorithm
A sorting network is a list of comparator stages. Each comparator connects two lanes.
simd_sorting_network(v):
for each network stage:
w = permute(v, stage.partner_order)
lo = min(v, w)
hi = max(v, w)
v = blend(lo, hi, stage.keep_high_mask)
return vThe stage tells each lane which partner it compares with and whether the lane should keep the lower or higher value.
Comparator
For a pair of lanes with , the compare and exchange operation is:
if A[i] > A[j]:
swap A[i], A[j]In SIMD form, many independent comparators are performed at once:
lo = min(v, partner)
hi = max(v, partner)
v = blend(lo, hi, mask)Example Network
For four values, one valid sorting network is:
compare (0, 1), (2, 3)
compare (0, 2), (1, 3)
compare (1, 2)This fixed sequence sorts every input of length four.
Complexity
| measure | value |
|---|---|
| input size | fixed |
| comparator count | network dependent |
| depth | network dependent |
| branches | none |
| extra memory | usually |
For small fixed widths, constants matter more than asymptotic notation.
Correctness
A sorting network is correct if its comparator sequence sorts every possible input. By the zero one principle, it is enough to verify correctness on all binary inputs. If the network sorts every sequence of zeros and ones, it sorts every totally ordered input.
SIMD execution preserves the same comparator sequence. Vectorization changes how comparisons are executed, not which comparisons occur.
Practical Considerations
- Good for fixed small sizes.
- Avoids branch misprediction.
- Performs predictable memory and register operations.
- Requires efficient shuffle or permute instructions.
- Different networks are optimal for different widths.
- Often used as a base case before merging larger sorted blocks.
When to Use
Use a SIMD sorting network when:
- the input size is small and fixed
- values fit in SIMD registers
- branch free execution is important
- predictable latency matters
- the sort is part of a larger vectorized algorithm
Avoid it for large variable length arrays as a complete sort. Use it as a kernel or base case.
Implementation Sketch
#define SORT2(a, b) \
do { \
int x = (a); \
int y = (b); \
(a) = x < y ? x : y; \
(b) = x < y ? y : x; \
} while (0)
void network_sort4(int a[4]) {
SORT2(a[0], a[1]);
SORT2(a[2], a[3]);
SORT2(a[0], a[2]);
SORT2(a[1], a[3]);
SORT2(a[1], a[2]);
}#include <immintrin.h>
__m128i simd_network_sort4(__m128i v) {
__m128i w, lo, hi;
// compare (0,1), (2,3)
w = _mm_shuffle_epi32(v, _MM_SHUFFLE(2,3,0,1));
lo = _mm_min_epi32(v, w);
hi = _mm_max_epi32(v, w);
v = _mm_blend_epi16(lo, hi, 0b11001100);
// compare (0,2), (1,3)
w = _mm_shuffle_epi32(v, _MM_SHUFFLE(1,0,3,2));
lo = _mm_min_epi32(v, w);
hi = _mm_max_epi32(v, w);
v = _mm_blend_epi16(lo, hi, 0b11110000);
// compare (1,2)
w = _mm_shuffle_epi32(v, _MM_SHUFFLE(3,1,2,0));
lo = _mm_min_epi32(v, w);
hi = _mm_max_epi32(v, w);
v = _mm_blend_epi16(lo, hi, 0b00110000);
return v;
}