SIMD Sorting Network

A SIMD sorting network sorts a fixed number of elements using a predetermined sequence of compare and exchange operations. Each stage can be mapped to vector instructions such as lane permutation, vector minimum, vector maximum, and blend.

Unlike comparison sorts with branches, the schedule does not depend on input values. This makes the method useful for short arrays, database kernels, vectorized partitioning, and base cases inside larger sorting routines.

Problem

Given $W$ values stored in SIMD lanes, sort them in nondecreasing order.

The value of $W$ is fixed by the vector width and element type. For example, AVX2 can hold eight 32 bit integers in one 256 bit register.

Algorithm

A sorting network is a list of comparator stages. Each comparator connects two lanes.

simd_sorting_network(v):
    for each network stage:
        w = permute(v, stage.partner_order)

        lo = min(v, w)
        hi = max(v, w)

        v = blend(lo, hi, stage.keep_high_mask)

    return v

The stage tells each lane which partner it compares with and whether the lane should keep the lower or higher value.

Comparator

For a pair of lanes $(i, j)$ with $i < j$ , the compare and exchange operation is:

if A[i] > A[j]:
    swap A[i], A[j]

In SIMD form, many independent comparators are performed at once:

lo = min(v, partner)
hi = max(v, partner)
v  = blend(lo, hi, mask)

Example Network

For four values, one valid sorting network is:

compare (0, 1), (2, 3)
compare (0, 2), (1, 3)
compare (1, 2)

This fixed sequence sorts every input of length four.

Complexity

measure	value
input size	fixed $W$
comparator count	network dependent
depth	network dependent
branches	none
extra memory	usually $O(1)$

For small fixed widths, constants matter more than asymptotic notation.

Correctness

A sorting network is correct if its comparator sequence sorts every possible input. By the zero one principle, it is enough to verify correctness on all binary inputs. If the network sorts every sequence of zeros and ones, it sorts every totally ordered input.

SIMD execution preserves the same comparator sequence. Vectorization changes how comparisons are executed, not which comparisons occur.

Practical Considerations

Good for fixed small sizes.
Avoids branch misprediction.
Performs predictable memory and register operations.
Requires efficient shuffle or permute instructions.
Different networks are optimal for different widths.
Often used as a base case before merging larger sorted blocks.

When to Use

Use a SIMD sorting network when:

the input size is small and fixed
values fit in SIMD registers
branch free execution is important
predictable latency matters
the sort is part of a larger vectorized algorithm

Avoid it for large variable length arrays as a complete sort. Use it as a kernel or base case.

Implementation Sketch

#define SORT2(a, b)       \
    do {                  \
        int x = (a);      \
        int y = (b);      \
        (a) = x < y ? x : y; \
        (b) = x < y ? y : x; \
    } while (0)

void network_sort4(int a[4]) {
    SORT2(a[0], a[1]);
    SORT2(a[2], a[3]);

    SORT2(a[0], a[2]);
    SORT2(a[1], a[3]);

    SORT2(a[1], a[2]);
}

#include <immintrin.h>

__m128i simd_network_sort4(__m128i v) {
    __m128i w, lo, hi;

    // compare (0,1), (2,3)
    w  = _mm_shuffle_epi32(v, _MM_SHUFFLE(2,3,0,1));
    lo = _mm_min_epi32(v, w);
    hi = _mm_max_epi32(v, w);
    v  = _mm_blend_epi16(lo, hi, 0b11001100);

    // compare (0,2), (1,3)
    w  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1,0,3,2));
    lo = _mm_min_epi32(v, w);
    hi = _mm_max_epi32(v, w);
    v  = _mm_blend_epi16(lo, hi, 0b11110000);

    // compare (1,2)
    w  = _mm_shuffle_epi32(v, _MM_SHUFFLE(3,1,2,0));
    lo = _mm_min_epi32(v, w);
    hi = _mm_max_epi32(v, w);
    v  = _mm_blend_epi16(lo, hi, 0b00110000);

    return v;
}