15.11 Cache-Oblivious Algorithms

Problem

An algorithm can have good asymptotic complexity and still run poorly on real hardware.

Modern machines do not access all memory at the same speed. Data may live in registers, L1 cache, L2 cache, L3 cache, main memory, or disk. Moving data between these levels can cost more than arithmetic.

A conventional algorithm often assumes this model:

memory access costs O(1)

Real machines behave more like this:

nearby memory access is cheap
faraway memory access is expensive

Can divide and conquer improve locality without explicitly tuning the algorithm to a particular cache size?

Solution

Use cache-oblivious design.

A cache-oblivious algorithm does not know the cache size or cache-line length. It is not manually blocked with constants such as:

tile size = 64
block size = 256

Instead, it recursively decomposes the problem until subproblems naturally fit into every level of the memory hierarchy.

The same code adapts to L1 cache, L2 cache, L3 cache, and even external memory.

The Core Idea

Divide a problem into smaller subproblems.

Then divide those subproblems again.

Eventually, some recursive subproblem becomes small enough to fit in cache.

The algorithm does not need to know when this happens. The hardware cache captures the benefit automatically.

This is why the design is called cache-oblivious.

It is aware that locality matters, but oblivious to the specific cache parameters.

Cache-Aware vs Cache-Oblivious

A cache-aware algorithm chooses explicit block sizes.

Example:

process matrix in 64x64 tiles

This can be very fast, but the tile size depends on hardware.

A cache-oblivious algorithm uses recursive subdivision instead.

Example:

process matrix quadrant
process subquadrant
process subsubquadrant
...

Eventually the active region becomes small enough to fit into cache.

Approach	Knows Cache Size?	Typical Technique
Cache-aware	Yes	Fixed blocking
Cache-oblivious	No	Recursive decomposition

Cache-aware code is often easier to tune for one machine. Cache-oblivious code is often more portable across machines.

A Simple Example: Summing an Array

A plain loop is already cache-friendly:

func Sum(nums []int) int {
    total := 0

    for _, x := range nums {
        total += x
    }

    return total
}

Sequential traversal uses spatial locality well.

A recursive version is usually not faster:

func RecursiveSum(nums []int) int {
    if len(nums) == 0 {
        return 0
    }

    if len(nums) == 1 {
        return nums[0]
    }

    mid := len(nums) / 2

    return RecursiveSum(nums[:mid]) +
        RecursiveSum(nums[mid:])
}

This example is useful because it shows a boundary. Cache-oblivious design helps most when the natural access pattern is not already linear.

Arrays are easy. Matrices, trees, graphs, and divide-and-conquer transforms are where locality becomes more interesting.

Matrix Transpose

Consider transposing an (n \times n) matrix.

A naive implementation:

func Transpose(a [][]int, b [][]int) {
    n := len(a)

    for i := 0; i < n; i++ {
        for j := 0; j < n; j++ {
            b[j][i] = a[i][j]
        }
    }
}

The read from a[i][j] is row-major and cache-friendly if rows are contiguous.

The write to b[j][i] walks down columns. In row-major storage, that creates poor locality.

A cache-oblivious transpose divides the matrix into quadrants.

A =
[ A11 A12 ]
[ A21 A22 ]

Transpose recursively:

transpose A11
transpose A22
transpose A12 into opposite block
transpose A21 into opposite block

As submatrices get smaller, both reads and writes fit into cache.

Recursive Transpose Sketch

Assume a flat row-major matrix.

type Matrix struct {
    Data []int
    N    int
}

func (m Matrix) At(i, j int) int {
    return m.Data[i*m.N+j]
}

func (m Matrix) Set(i, j, v int) {
    m.Data[i*m.N+j] = v
}

Recursive transpose:

func TransposeRec(src, dst Matrix,
    r0, c0 int,
    r1, c1 int,
    outR, outC int,
) {
    rows := r1 - r0
    cols := c1 - c0

    if rows == 0 || cols == 0 {
        return
    }

    if rows*cols <= 1024 {
        for i := 0; i < rows; i++ {
            for j := 0; j < cols; j++ {
                dst.Set(outR+j, outC+i,
                    src.At(r0+i, c0+j))
            }
        }
        return
    }

    if rows >= cols {
        mid := r0 + rows/2

        TransposeRec(src, dst,
            r0, c0,
            mid, c1,
            outR, outC)

        TransposeRec(src, dst,
            mid, c0,
            r1, c1,
            outR, outC+(mid-r0))
    } else {
        mid := c0 + cols/2

        TransposeRec(src, dst,
            r0, c0,
            r1, mid,
            outR, outC)

        TransposeRec(src, dst,
            r0, mid,
            r1, c1,
            outR+(mid-c0), outC)
    }
}

The base-case threshold is still explicit here, but it is not a hardware cache parameter. It is an engineering cutoff to avoid function-call overhead for tiny subproblems.

Matrix Multiplication

Classical matrix multiplication often has poor cache behavior.

The naive loop order:

for i := 0; i < n; i++ {
    for j := 0; j < n; j++ {
        for k := 0; k < n; k++ {
            c[i][j] += a[i][k] * b[k][j]
        }
    }
}

Access to a[i][k] is sequential.

Access to b[k][j] walks down a column.

That is often cache-unfriendly in row-major layout.

A cache-aware implementation uses explicit tiling:

for ii in blocks
    for jj in blocks
        for kk in blocks
            multiply block

A cache-oblivious implementation recursively splits matrices into quadrants and multiplies blocks until the active submatrices become small.

Recursive Matrix Multiplication Structure

Split:

$$ A= \begin{bmatrix} A_{11} & A_{12}\\nA_{21} & A_{22} \end{bmatrix} $$

$$ B= \begin{bmatrix} B_{11} & B_{12}\\nB_{21} & B_{22} \end{bmatrix} $$

Then:

$$ C_{11}=A_{11}B_{11}+A_{12}B_{21} $$

$$ C_{12}=A_{11}B_{12}+A_{12}B_{22} $$

$$ C_{21}=A_{21}B_{11}+A_{22}B_{21} $$

$$ C_{22}=A_{21}B_{12}+A_{22}B_{22} $$

This is not Strassen. It still performs eight recursive multiplications. The point here is locality, not reducing arithmetic count.

The asymptotic arithmetic remains:

$$ O(n^3) $$

but memory behavior improves because each recursive block eventually fits into cache.

Why Recursion Improves Locality

Suppose a matrix block is too large for cache.

The recursive algorithm splits it.

Eventually a subproblem contains:

small block of A
small block of B
small block of C

that all fit in cache at the same time.

The computation then reuses those cached values many times before returning to larger blocks.

A naive row-column loop may evict data before it is reused.

Recursive blocking turns spatial structure into temporal locality.

Cache-Oblivious Search Trees

Divide-and-conquer layout can also improve search structures.

A balanced binary search tree stored as pointer-linked nodes has poor locality. A search may jump unpredictably through memory:

root -> child -> grandchild -> ...

Each step may touch a different cache line.

A cache-oblivious layout stores nodes in recursive order, such as van Emde Boas layout.

The idea:

Store the top part of the tree.
Recursively store each bottom subtree.
Repeat.

This places nodes likely to be visited together near each other in memory, without choosing a fixed cache block size.

Funnelsort and Cache-Oblivious Sorting

Sorting also has cache-oblivious variants.

Funnelsort is a theoretical cache-oblivious sorting algorithm that achieves optimal memory-transfer bounds under standard cache models.

In practice, production sorting libraries usually use engineered hybrids such as quicksort, introsort, radix sort, or tiled merge variants.

Still, funnelsort is important because it shows that cache-oblivious algorithms can be asymptotically optimal not only in CPU operations but also in memory movement.

The Ideal Cache Model

Cache-oblivious analysis often uses the ideal cache model.

It assumes:

cache size (M)
cache line size (B)
fully associative cache
optimal replacement policy

The algorithm is not allowed to know (M) or (B).

The analysis counts memory transfers between slow and fast memory.

Even though real hardware differs from this model, it provides a useful way to reason about locality independent of a specific processor.

Practical Engineering Cutoffs

Real implementations still need base cases.

A purely recursive algorithm that recurses down to single cells may suffer from overhead.

Typical base cases use thresholds such as:

if subproblem is small enough
    run simple iterative kernel

This threshold is not the same as cache blocking. It is usually chosen to reduce recursion overhead and enable compiler optimization.

For high-performance matrix multiplication, the base kernel may be a carefully optimized microkernel.

When Cache-Oblivious Design Helps

Cache-oblivious techniques are most useful when:

data is multidimensional
access patterns are nonsequential
subproblems can be recursively partitioned
the same data is reused many times
portability across cache hierarchies matters

Common examples:

matrix transpose
matrix multiplication
FFT
sorting
search trees
dynamic programming tables

When It Does Not Help

Cache-oblivious recursion is not magic.

It may not help when:

the access pattern is already sequential
the data does not fit well into recursive partitions
pointer chasing dominates
branch misprediction dominates
recursion overhead overwhelms locality gains
compiler vectorization is inhibited

A simple loop over an array is already close to ideal for cache behavior.

Do not replace it with recursion unless measurement supports the change.

Common Mistakes

Confusing Cache-Oblivious with Cache-Ignorant

Cache-oblivious algorithms are designed around locality. They simply avoid hard-coding cache parameters.

Ignoring locality entirely is different.

Recursing Too Far

Recursive calls have overhead.

Use a base-case kernel.

Using Slices That Allocate

In languages where subarray operations allocate or copy data, recursive decomposition may create hidden overhead.

Use index ranges or views.

Assuming Better Big O

Cache-oblivious design usually improves memory transfers, not arithmetic complexity.

Matrix multiplication remains (O(n^3)) unless paired with an algorithm such as Strassen.

Ignoring Layout

Recursive access patterns help only if the data layout supports locality.

Pointer-heavy layouts can still perform poorly.

Discussion

Cache-oblivious algorithms show that divide and conquer is not only a tool for reducing operation counts. It can also control data movement. On modern machines, that distinction is crucial. A theoretically optimal number of arithmetic operations may still perform badly if it moves data through memory inefficiently.

Recursive decomposition gives a portable way to expose locality. As subproblems shrink, they eventually fit into each cache level without the algorithm knowing the hardware parameters. The same structure can benefit L1 cache, L2 cache, L3 cache, and external memory.

The practical lesson is measured restraint. Use cache-oblivious design where locality is the bottleneck and recursive partitioning matches the data structure. Keep base cases simple. Avoid hidden allocations. Benchmark against well-tuned iterative code. When used well, cache-oblivious algorithms make divide and conquer relevant not only to asymptotic analysis, but to the physical cost model of real computers.