Process multiple array elements per instruction using SIMD-friendly layout and loops.
Array vectorization processes multiple elements at once using wide CPU instructions. Instead of applying an operation to one value per instruction, the CPU applies it to a vector of values.
Use it when arrays are large, element types are uniform, and the same operation applies to many consecutive elements.
Problem
Given arrays $A$ and $B$ of length $n$, compute an output array $C$ such that:
$$C[i] = A[i] + B[i]$$
for every valid index $i$.
Structure
Vectorization works best when data is contiguous, with elements stored at adjacent memory addresses.
A vector register can then load several adjacent values with a single instruction.
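As a rough illustration of what contiguous means here, the sketch below uses Python's standard `array` module, which keeps its elements back to back in one buffer. The helper `lane_offsets` and the width of 4 are assumptions made for this example, not part of any SIMD API; the helper only lists the byte offsets that one vector-wide load would cover.

```python
from array import array


def lane_offsets(view, start, width):
    """Return the byte offset of each element a width-wide load at `start` would touch."""
    return [(start + lane) * view.itemsize for lane in range(width)]


values = array("i", [1, 2, 3, 4, 5, 6])  # typed array: one contiguous buffer
view = memoryview(values)

print(view.c_contiguous)         # True: elements are stored back to back
print(view.strides)              # distance in bytes between neighbouring elements
print(lane_offsets(view, 0, 4))  # four adjacent, evenly spaced offsets
```

A plain Python `list` holds references to separately allocated objects instead, which is the pointer-heavy layout that vector loads cannot use directly.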
Algorithm
Process elements in fixed-width chunks, then handle the remainder.

    vector_add(A, B):
        n = length(A)
        C = allocate(n)
        width = vector_width()
        i = 0

        while i + width <= n:
            va = load_vector(A, i)
            vb = load_vector(B, i)
            vc = va + vb
            store_vector(C, i, vc)
            i += width

        while i < n:
            C[i] = A[i] + B[i]
            i += 1

        return C

Example
Let
$$A = [1, 2, 3, 4, 5, 6]$$
and
$$B = [10, 20, 30, 40, 50, 60]$$
With vector width $4$:
| step | indices | operation | result |
|---|---|---|---|
| 1 | 0 to 3 | [1, 2, 3, 4] + [10, 20, 30, 40] | [11, 22, 33, 44] |
| 2 | 4 to 5 | scalar remainder | [55, 66] |
Final result:
$$C = [11, 22, 33, 44, 55, 66]$$
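The trace in the table can be reproduced with a small simulation. This is a minimal sketch, not real SIMD code: list slices stand in for vector loads and stores, the name `vector_add_chunked` is made up for this example, and the default width of 4 is an assumption chosen to match the table. The inputs are six-element arrays consistent with the two table rows.

```python
def vector_add_chunked(a, b, width=4):
    """Add two equal-length lists block by block, mimicking a SIMD loop."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    if width < 1:
        raise ValueError("width must be positive")
    n = len(a)
    c = [0] * n
    i = 0
    # Vector loop: each slice stands in for one vector load, add, and store.
    while i + width <= n:
        va = a[i:i + width]
        vb = b[i:i + width]
        c[i:i + width] = [x + y for x, y in zip(va, vb)]
        i += width
    # Scalar loop: handle the leftover elements (fewer than `width`).
    while i < n:
        c[i] = a[i] + b[i]
        i += 1
    return c


print(vector_add_chunked([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
# [11, 22, 33, 44, 55, 66]
```

With six elements and a width of 4, the first loop runs once for indices 0 to 3 and the second loop handles indices 4 and 5.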
Correctness
The vector loop processes disjoint contiguous blocks of width vector_width(). For each block, corresponding elements from $A$ and $B$ are loaded, added, and stored at the same positions in $C$.
The scalar loop handles every remaining index after the last full vector block. Since the vector blocks and remainder cover exactly the range $[0, n)$, every output value satisfies $C[i] = A[i] + B[i]$.
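The coverage argument can also be checked mechanically for small sizes. The sketch below is an illustrative test, not a proof: the helper `visited_indices` is a hypothetical name, and it records which indices each of the two loops would touch, using the same loop conditions as the algorithm.

```python
def visited_indices(n, width):
    """List the indices touched by the vector loop, then by the scalar loop."""
    vector_part, scalar_part = [], []
    i = 0
    while i + width <= n:
        vector_part.extend(range(i, i + width))
        i += width
    while i < n:
        scalar_part.append(i)
        i += 1
    return vector_part, scalar_part


for n in range(0, 20):
    for width in (1, 2, 4, 8):
        vec, rest = visited_indices(n, width)
        # Together the two loops visit 0..n-1 exactly once, in order.
        assert vec + rest == list(range(n))
        # The remainder is always shorter than one vector.
        assert len(rest) < width
```

If either loop condition were off by one, the first assertion would fail for some small `n`, which makes this a cheap regression check for a hand-written vector loop.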
Complexity
| operation | time |
|---|---|
| vector add | $O(n)$ |
Vectorization does not change asymptotic complexity. It improves constant factors by doing more work per instruction.
Space usage: $O(n)$ for the output array.
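To make the constant-factor claim concrete, the sketch below counts add operations for the chunked loop under a simplified model in which one vector add costs the same as one scalar add. The function name `add_operation_counts` is made up for this example, and the model ignores loads, stores, and memory bandwidth, which often dominate in practice.

```python
def add_operation_counts(n, width):
    """Return (vector adds, scalar adds) issued by the chunked loop."""
    return n // width, n % width


vector_ops, scalar_ops = add_operation_counts(6, 4)
print(vector_ops, scalar_ops)   # 1 2
print(vector_ops + scalar_ops)  # 3
```

For $n = 6$ and width 4, the loop issues 3 add operations instead of 6. In general the count is $\lfloor n / w \rfloor + (n \bmod w)$ for width $w$, which is still linear in $n$ for any fixed $w$.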
When to Use
Array vectorization is appropriate when:
- data is contiguous
- operations are simple and uniform
- arrays are large enough to amortize setup cost
- branches inside the loop are minimal
It is less suitable when:
- access is irregular
- each element needs different control flow
- data contains pointer-heavy objects
- memory alignment or layout prevents efficient vector loads
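The point about branches can be illustrated with two ways of writing the same loop. This is only a sketch of the loop shapes: the function names and the choice of clamping negative values to zero are assumptions made for the example, and Python itself does not vectorize either version.

```python
def relu_branchy(xs):
    """Per-element branch: each value picks its own control-flow path."""
    out = []
    for x in xs:
        if x > 0:
            out.append(x)
        else:
            out.append(0)
    return out


def relu_select(xs):
    """Same result with one uniform operation per element (max against 0)."""
    return [max(x, 0) for x in xs]


data = [-2, 0, 3, -1, 5]
print(relu_branchy(data))  # [0, 0, 3, 0, 5]
print(relu_select(data))   # [0, 0, 3, 0, 5]
```

In a compiled language, the second form tends to map more directly onto vector instructions, because every element goes through the same operation and nothing in the loop body depends on a per-element jump.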
Implementation
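Python:

    def vector_add(a, b):
        # Both inputs must have the same length.
        if len(a) != len(b):
            raise ValueError("length mismatch")
        c = [0] * len(a)
        # Simple element-wise loop over contiguous indices.
        for i in range(len(a)):
            c[i] = a[i] + b[i]
        return c

Go:

    // VectorAdd returns the element-wise sum of a and b.
    // The boolean is false when the lengths differ.
    func VectorAdd(a, b []int) ([]int, bool) {
    	if len(a) != len(b) {
    		return nil, false
    	}
    	c := make([]int, len(a))
    	for i := 0; i < len(a); i++ {
    		c[i] = a[i] + b[i]
    	}
    	return c, true
    }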