Array Vectorization

Array vectorization processes multiple elements at once using wide CPU instructions, commonly called SIMD (single instruction, multiple data). Instead of applying an operation to one value per instruction, the CPU applies it to a whole vector of values.

Use it when arrays are large, element types are uniform, and the same operation applies to many consecutive elements.

Problem

Given arrays $A$ and $B$ of length $n$, compute an output array $C$ such that:

$$ C[i] = A[i] + B[i] $$

for every valid index $i$.

Structure

Vectorization works best when data is contiguous:

$$ A = [a_0, a_1, a_2, \dots, a_{n-1}] $$

A vector register can load several adjacent values at once. With four lanes, for example:

$$ [a_i, a_{i+1}, a_{i+2}, a_{i+3}] $$
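The contiguous layout described above can be sketched with Python's standard `array` module, which stores typed values back to back in one buffer. This is only an illustration of the memory layout, not real SIMD; the helper name `lanes` and the constant `WIDTH` are hypothetical.

```python
from array import array

# A typed array keeps its elements adjacent in a single buffer,
# unlike a list, which holds references to separate objects.
a = array("i", [1, 2, 3, 4, 5, 6])

WIDTH = 4  # hypothetical lane count, chosen to match the 4-wide window above


def lanes(buf, i, width=WIDTH):
    """Return the `width` adjacent values starting at index i."""
    view = memoryview(buf)[i:i + width]  # window over the buffer, no copy yet
    return view.tolist()                 # copy out only the selected values


first = lanes(a, 0)
print(first)
```

The slice `[i:i + width]` plays the role of a vector load: one request that names a starting index and yields several neighbouring values.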

Algorithm

Process elements in fixed-width chunks, then handle the remainder.

vector_add(A, B):
    n = length(A)
    C = allocate(n)

    width = vector_width()

    i = 0
    while i + width <= n:
        va = load_vector(A, i)
        vb = load_vector(B, i)
        vc = va + vb
        store_vector(C, i, vc)
        i += width

    while i < n:
        C[i] = A[i] + B[i]
        i += 1

    return C
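The pseudocode above can be mirrored in plain Python as a minimal sketch. List slices stand in for `load_vector` and `store_vector`, and the `width` parameter replaces `vector_width()`. The function name `vector_add_chunked` is an assumption, and the interpreter does not issue SIMD instructions for this code; the sketch only shows the control flow of the main loop and the remainder loop.

```python
def vector_add_chunked(a, b, width=4):
    """Add two equal-length sequences in blocks of `width`, then finish the tail."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    if width < 1:
        raise ValueError("width must be positive")

    n = len(a)
    c = [0] * n
    i = 0

    # Main loop: each iteration stands in for one vector load/add/store.
    while i + width <= n:
        va = a[i:i + width]                                # "load" a block of A
        vb = b[i:i + width]                                # "load" a block of B
        c[i:i + width] = [x + y for x, y in zip(va, vb)]   # add and "store"
        i += width

    # Remainder loop: fewer than `width` elements are left.
    while i < n:
        c[i] = a[i] + b[i]
        i += 1

    return c


result = vector_add_chunked([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60])
print(result)
```

Keeping the remainder in a separate scalar loop means the main loop never reads past the end of either input.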

Example

Let

$$ A = [1, 2, 3, 4, 5, 6] $$

and

$$ B = [10, 20, 30, 40, 50, 60] $$

With vector width $4$:

step	indices	operation	result
1	0 to 3	[1, 2, 3, 4] + [10, 20, 30, 40]	[11, 22, 33, 44]
2	4 to 5	scalar remainder	[55, 66]

Final result:

$$ C = [11, 22, 33, 44, 55, 66] $$

Correctness

The vector loop processes disjoint contiguous blocks of width vector_width(). For each block, corresponding elements from $A$ and $B$ are loaded, added, and stored at the same positions in $C$.

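When the vector loop exits, $i$ is the largest multiple of the vector width that does not exceed $n$, so the scalar loop handles the remaining indices in $[i, n)$, of which there are fewer than one full block. Together the vector blocks and the remainder cover each index in $[0, n)$ exactly once, so every output value satisfies $C[i] = A[i] + B[i]$.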
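The coverage argument can also be checked mechanically. The sketch below records which indices the two loops would visit and confirms that they form the sequence $0, 1, \dots, n-1$ with no gaps or repeats. The helper names `block_ranges` and `covers_exactly_once` are hypothetical, and a check over small sizes is a sanity test, not a proof.

```python
def block_ranges(n, width):
    """Return (vector_blocks, tail): the index ranges visited by each loop."""
    blocks = []
    i = 0
    while i + width <= n:        # same condition as the vector loop
        blocks.append(range(i, i + width))
        i += width
    tail = range(i, n)           # indices left for the scalar loop
    return blocks, tail


def covers_exactly_once(n, width):
    """True if the blocks plus the tail visit 0..n-1 once each, in order."""
    blocks, tail = block_ranges(n, width)
    visited = [j for blk in blocks for j in blk] + list(tail)
    return visited == list(range(n))


# Try every length below 50 with widths 1 through 8.
ok = all(covers_exactly_once(n, w) for n in range(50) for w in range(1, 9))
print(ok)
```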

Complexity

operation	time
vector add	$O(n)$

Vectorization does not change the asymptotic complexity: every element is still read once and written once. It improves the constant factor, because each vector instruction does the work of several scalar ones.
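As a rough count, assume one vector instruction adds $w$ lanes, and ignore loads, stores, and loop overhead. Here $w$ is shorthand for vector_width() and is not a symbol used elsewhere in this text. Under those assumptions the number of add instructions drops from $n$ to:

$$ \left\lfloor \frac{n}{w} \right\rfloor + (n \bmod w) $$

For the earlier example with $n = 6$ and $w = 4$, this gives $1 + 2 = 3$ adds instead of $6$. Measured speedups are often smaller than this ratio suggests, since large arrays tend to be limited by memory bandwidth.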

Space usage:

$$ O(n) $$

for the output array.

When to Use

Array vectorization is appropriate when:

data is contiguous
operations are simple and uniform
arrays are large enough to amortize setup cost
branches inside the loop are minimal

It is less suitable when:

access is irregular
each element needs different control flow
data contains pointer-heavy objects
memory alignment or layout prevents efficient vector loads
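To illustrate the point about branches, the sketch below clamps negative values to zero in two ways. The first uses per-element control flow; the second applies one uniform operation, `max`, to every element. On common SIMD instruction sets an element-wise maximum is typically available as a vector operation, so the second shape is the one a vectorizing compiler can usually map onto lanes. Both function names are hypothetical, and Python itself runs neither version as SIMD.

```python
def clamp_branchy(xs):
    """Clamp negatives to zero with a branch per element."""
    out = []
    for x in xs:
        if x < 0:            # control flow differs from element to element
            out.append(0)
        else:
            out.append(x)
    return out


def clamp_uniform(xs):
    """Clamp negatives to zero with the same operation for every element."""
    return [max(x, 0) for x in xs]


data = [3, -1, 0, -7, 5]
print(clamp_branchy(data), clamp_uniform(data))
```

Both functions return the same values; only the shape of the loop body differs.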

Implementation

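def vector_add(a, b):
    """Return the element-wise sum of two equal-length sequences.

    This is a scalar reference version: the interpreter adds one pair
    of elements at a time and does not vectorize list operations.
    """
    if len(a) != len(b):
        raise ValueError("length mismatch")

    # zip pairs corresponding elements, so no manual indexing is needed
    return [x + y for x, y in zip(a, b)]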

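// VectorAdd returns the element-wise sum of a and b.
// The second result is false if the slices differ in length.
// This is a scalar loop; whether it is compiled to vector
// instructions depends on the compiler and target.
func VectorAdd(a, b []int) ([]int, bool) {
    if len(a) != len(b) {
        return nil, false
    }

    c := make([]int, len(a))
    for i := range a {
        c[i] = a[i] + b[i]
    }

    return c, true
}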