Array Vectorization

Process multiple array elements per instruction using SIMD-friendly layout and loops.

Array vectorization processes multiple elements at once using wide CPU instructions. Instead of applying an operation to one value per instruction, the CPU applies it to a vector of values.

Use it when arrays are large, element types are uniform, and the same operation applies to many consecutive elements.

Problem

Given arrays $A$ and $B$ of length $n$, compute an output array $C$ such that:

$$C[i] = A[i] + B[i]$$

for every valid index $i$.

Structure

Vectorization works best when data is contiguous:

$$A = [a_0, a_1, a_2, \dots, a_{n-1}]$$

A vector register can load several adjacent values at once:

$$[a_i, a_{i+1}, a_{i+2}, a_{i+3}]$$
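As a rough illustration of loading adjacent values, the sketch below uses Python's standard `array` module as a stand-in for a contiguous buffer. `WIDTH` and `load_block` are hypothetical names chosen for this example. A slice only models a vector load and does not use SIMD hardware.

```python
from array import array

# Hypothetical lane count; real widths depend on the CPU and element size.
WIDTH = 4


def load_block(values, i, width=WIDTH):
    """Return the `width` adjacent elements starting at index i."""
    return values[i:i + width]


a = array("i", range(10, 20))  # contiguous buffer holding 10, 11, ..., 19
block = load_block(a, 4)       # elements a[4], a[5], a[6], a[7]
print(list(block))             # [14, 15, 16, 17]
```

Because the elements sit next to each other in memory, one block corresponds to a single contiguous range of addresses, which is the case a vector load handles well.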

Algorithm

Process elements in fixed-width chunks, then handle the remainder.

vector_add(A, B):
    n = length(A)
    C = allocate(n)

    width = vector_width()

    i = 0
    while i + width <= n:
        va = load_vector(A, i)
        vb = load_vector(B, i)
        vc = va + vb
        store_vector(C, i, vc)
        i += width

    while i < n:
        C[i] = A[i] + B[i]
        i += 1

    return C
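The pseudocode above can be sketched as runnable Python. This is a model of the control flow only: list slices stand in for `load_vector` and `store_vector`, and the `width` parameter (default 4) is an assumption of this example, not a property of any particular CPU.

```python
def vector_add(a, b, width=4):
    """Chunked element-wise add: full blocks first, then a scalar tail.

    Slices emulate vector loads and stores; no real SIMD instructions
    are issued by this code.
    """
    if len(a) != len(b):
        raise ValueError("length mismatch")

    n = len(a)
    c = [0] * n

    i = 0
    # Vector loop: runs only while a full block of `width` elements remains.
    while i + width <= n:
        va = a[i:i + width]                    # emulated load_vector(A, i)
        vb = b[i:i + width]                    # emulated load_vector(B, i)
        vc = [x + y for x, y in zip(va, vb)]   # lane-wise add
        c[i:i + width] = vc                    # emulated store_vector(C, i, vc)
        i += width

    # Scalar loop: handles the fewer-than-`width` leftover elements.
    while i < n:
        c[i] = a[i] + b[i]
        i += 1

    return c


print(vector_add([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
# [11, 22, 33, 44, 55, 66]
```

The loop condition `i + width <= n` is what keeps the emulated loads from reading past the end of the arrays.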

Example

Let

$$A = [1, 2, 3, 4, 5, 6]$$

and

$$B = [10, 20, 30, 40, 50, 60]$$

With vector width $4$:

| step | indices | operation | result |
|------|---------|-----------|--------|
| 1 | 0 to 3 | [1, 2, 3, 4] + [10, 20, 30, 40] | [11, 22, 33, 44] |
| 2 | 4 to 5 | scalar remainder | [55, 66] |

Final result:

$$C = [11, 22, 33, 44, 55, 66]$$

Correctness

The vector loop processes disjoint contiguous blocks of width vector_width(). For each block, corresponding elements from $A$ and $B$ are loaded, added, and stored at the same positions in $C$.

The scalar loop handles every remaining index after the last full vector block. Since the vector blocks and remainder cover exactly the range $[0, n)$, every output value satisfies $C[i] = A[i] + B[i]$.
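The coverage claim can be checked mechanically for small sizes. The sketch below is a sanity check, not a proof. It enumerates the index ranges touched by the two loops and confirms that each index in $[0, n)$ appears exactly once. The helper names `covered_indices` and `covers_exactly_once` are hypothetical.

```python
def covered_indices(n, width):
    """Return (vector_blocks, tail): the index ranges each loop would touch."""
    blocks = []
    i = 0
    while i + width <= n:
        blocks.append(range(i, i + width))
        i += width
    tail = range(i, n)  # indices left for the scalar loop
    return blocks, tail


def covers_exactly_once(n, width):
    """True if blocks followed by the tail list 0..n-1 with no gaps or repeats."""
    blocks, tail = covered_indices(n, width)
    seen = [j for block in blocks for j in block] + list(tail)
    return seen == list(range(n))


# Exhaustive check over a small grid of lengths and widths.
assert all(covers_exactly_once(n, w) for n in range(50) for w in range(1, 9))
```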

Complexity

| operation | time |
|-----------|------|
| vector add | $O(n)$ |

Vectorization does not change asymptotic complexity. It improves constant factors by doing more work per instruction.
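To make the constant-factor claim concrete, the sketch below counts loop iterations for the chunked algorithm. The function name `iteration_counts` is hypothetical. Iteration counts are only a proxy: actual speedups depend on memory bandwidth, alignment, and the hardware.

```python
def iteration_counts(n, width):
    """Return (vector_iterations, scalar_iterations) for length n."""
    # floor(n / width) full blocks, then n mod width leftover elements.
    vector_iters, scalar_iters = divmod(n, width)
    return vector_iters, scalar_iters


print(iteration_counts(6, 4))          # (1, 2)
print(iteration_counts(1_000_000, 8))  # (125000, 0)
```

A purely scalar loop would run $n$ iterations, while the chunked version runs $\lfloor n / w \rfloor + (n \bmod w)$ for width $w$. Both are linear in $n$, which is why the asymptotic bound is unchanged.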

Space usage:

$$O(n)$$

for the output array.

When to Use

Array vectorization is appropriate when:

  • data is contiguous
  • operations are simple and uniform
  • arrays are large enough to amortize setup cost
  • branches inside the loop are minimal

It is less suitable when:

  • access is irregular
  • each element needs different control flow
  • data contains pointer-heavy objects
  • memory alignment or layout prevents efficient vector loads
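One way to move data from the second list toward the first is to change its layout. The sketch below, an illustration under assumed names (`Particle`, `to_columns`, and `advance` are all hypothetical), copies fields out of a list of objects into separate contiguous arrays so that one uniform operation runs over adjacent values. The Python loop itself is still scalar; the point is the layout, not the speed.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Particle:  # hypothetical record type, for illustration only
    x: float
    v: float


def to_columns(particles):
    """Copy each field into its own contiguous array of doubles."""
    xs = array("d", (p.x for p in particles))
    vs = array("d", (p.v for p in particles))
    return xs, vs


def advance(xs, vs, dt):
    """Apply the same update to every element: x += v * dt."""
    for i in range(len(xs)):
        xs[i] += vs[i] * dt


particles = [Particle(0.0, 1.0), Particle(2.0, 0.5), Particle(4.0, -1.0)]
xs, vs = to_columns(particles)
advance(xs, vs, 2.0)
print(list(xs))  # [2.0, 3.0, 2.0]
```

In the list-of-objects form, each `x` lives inside a separately allocated object. In the column form, all `x` values are adjacent, which matches the contiguity condition above.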

Implementation

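Both versions below are scalar reference implementations of the element-wise addition. They define the expected result; neither issues vector instructions explicitly.

Python:

def vector_add(a, b):
    # Element-wise addition is only defined for equal-length inputs.
    if len(a) != len(b):
        raise ValueError("length mismatch")

    # Preallocate the output, then fill it one element at a time.
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]

    return c

Go:

// VectorAdd returns the element-wise sum of a and b.
// The boolean result is false when the lengths differ.
func VectorAdd(a, b []int) ([]int, bool) {
    if len(a) != len(b) {
        return nil, false
    }

    // Preallocate the output, then fill it one element at a time.
    c := make([]int, len(a))
    for i := 0; i < len(a); i++ {
        c[i] = a[i] + b[i]
    }

    return c, true
}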