Higher-Dimensional Tangent Spaces

So far, forward mode has propagated a single tangent direction:

x \mapsto (x, \dot{x}),

where

\dot{x} \in \mathbb{R}.

This computes one directional derivative:

J_f(x)v.

However, many applications require several directional derivatives simultaneously. Instead of propagating one tangent scalar, we can propagate an entire tangent vector space.

Each variable becomes

x \mapsto (x, \dot{x}), \qquad \dot{x} \in \mathbb{R}^k.

Now every variable carries k tangent components at once.

The resulting computation produces

J_f(x)V,

where

V \in \mathbb{R}^{n \times k}

contains k seed directions.

This is called higher-dimensional forward mode, vector forward mode, or multidirectional forward mode.

Tangent spaces

For a function

f : \mathbb{R}^n \to \mathbb{R}^m,

the derivative at x is the linear map

J_f(x) : \mathbb{R}^n \to \mathbb{R}^m.

The vector space

\mathbb{R}^n

acts as the tangent space at x. A tangent vector represents an infinitesimal perturbation direction.

Scalar forward mode propagates one tangent vector:

v \in \mathbb{R}^n.

Higher-dimensional forward mode propagates a collection of tangent vectors simultaneously:

v_1, v_2, \ldots, v_k.

Equivalently, it propagates a tangent matrix

V = \begin{bmatrix} | & | & & | \\ v_1 & v_2 & \cdots & v_k \\ | & | & & | \end{bmatrix}.

The output is

J_f(x)V.

Each column of the result is one JVP.

From scalar tangents to vector tangents

In scalar forward mode:

x_i \mapsto (x_i, \dot{x}_i), \qquad \dot{x}_i \in \mathbb{R}.

In vector forward mode:

x_i \mapsto (x_i, \dot{x}_i), \qquad \dot{x}_i \in \mathbb{R}^k.

For addition:

z = x + y,

the tangent rule becomes

\dot{z} = \dot{x} + \dot{y},

where all tangents are vectors in \mathbb{R}^k.

For multiplication:

z = xy,

the tangent rule becomes

\dot{z} = y\dot{x} + x\dot{y}.

The scalars x and y multiply every tangent component.

Thus each primitive lifts naturally from scalar tangents to vector tangents.
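As a concrete sketch of this lifting (type and function names are illustrative, assuming tangents are stored as plain slices), the two rules above might look like:

```go
package main

import "fmt"

// Dual pairs a primal value with a k-dimensional tangent vector.
type Dual struct {
	Val float64
	Dot []float64 // one tangent component per seed direction
}

// Add implements z = x + y with tangent rule zdot = xdot + ydot.
func Add(x, y Dual) Dual {
	dot := make([]float64, len(x.Dot))
	for i := range dot {
		dot[i] = x.Dot[i] + y.Dot[i]
	}
	return Dual{x.Val + y.Val, dot}
}

// Mul implements z = x*y with tangent rule zdot = y*xdot + x*ydot.
func Mul(x, y Dual) Dual {
	dot := make([]float64, len(x.Dot))
	for i := range dot {
		dot[i] = y.Val*x.Dot[i] + x.Val*y.Dot[i]
	}
	return Dual{x.Val * y.Val, dot}
}

func main() {
	// k = 2 tangent directions, seeded with the standard basis.
	x := Dual{3, []float64{1, 0}}
	y := Dual{5, []float64{0, 1}}
	fmt.Println(Mul(x, y).Dot) // tangent of xy is [y, x]: [5 3]
	fmt.Println(Add(x, y).Dot) // tangent of x+y: [1 1]
}
```

Note that the primal arithmetic is unchanged; only the tangent slice is threaded through each primitive.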

Example: two tangent directions

Consider

f(x,y) = \begin{bmatrix} xy \\ x+y \end{bmatrix}.

Suppose we want derivatives in two directions:

v_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.

These are the standard basis directions.

The tangent matrix is

V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.

Seed the inputs:

\dot{x} = [1,0], \qquad \dot{y} = [0,1].

Now propagate.

First output:

f_1 = xy.

Its tangent:

\dot{f}_1 = y\dot{x} + x\dot{y}.

Substitute:

\dot{f}_1 = [y, x].

Second output:

f_2 = x+y.

Its tangent:

\dot{f}_2 = \dot{x} + \dot{y} = [1,1].

Collect results:

J_f(x,y)V = \begin{bmatrix} y & x \\ 1 & 1 \end{bmatrix}.

Because the seed matrix was the identity, the output equals the full Jacobian.
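The hand propagation above can be checked numerically. A minimal slice-based sketch (the function name is illustrative) evaluating the example at x = 3, y = 5:

```go
package main

import "fmt"

// jacobianTimesSeed propagates the identity seed through
// f(x, y) = (x*y, x+y) and returns the two tangent rows.
func jacobianTimesSeed(x, y float64) ([2]float64, [2]float64) {
	xdot := [2]float64{1, 0} // seed row of V = I for x
	ydot := [2]float64{0, 1} // seed row of V = I for y
	var f1dot, f2dot [2]float64
	for i := 0; i < 2; i++ {
		f1dot[i] = y*xdot[i] + x*ydot[i] // tangent of f1 = x*y
		f2dot[i] = xdot[i] + ydot[i]     // tangent of f2 = x+y
	}
	return f1dot, f2dot
}

func main() {
	r1, r2 := jacobianTimesSeed(3, 5)
	fmt.Println(r1) // [5 3], i.e. [y, x]
	fmt.Println(r2) // [1 1]
}
```

The two printed rows are exactly the rows of the Jacobian at (3, 5).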

Full Jacobians in one pass

If

V = I_n,

then

J_f(x)V = J_f(x).

Thus vector forward mode can compute the full Jacobian in one pass.

However, every variable now carries an n-dimensional tangent vector. If the input dimension is large, this becomes expensive.

The memory cost becomes

O(nM_f),

and the arithmetic cost becomes

O(nC_f).

So this strategy is practical only when n is moderate or when the Jacobian has exploitable structure.

Matrix interpretation

Forward propagation with k-dimensional tangents can be viewed as propagating a local linear map.

Suppose a primitive operation has local Jacobian

A.

Instead of multiplying a vector,

Av,

we now multiply a matrix:

AV.

Each primitive therefore propagates several tangent directions simultaneously.

The entire computation graph becomes a sequence of matrix propagations:

V \mapsto A_1V \mapsto A_2A_1V \mapsto \cdots \mapsto J_f(x)V.

Scalar forward mode is the special case k=1.
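This chain of matrix propagations can be sketched with small dense matrices (a toy 2x2 example with made-up local Jacobians, not a full implementation):

```go
package main

import "fmt"

// matmul returns the product of two 2x2 matrices.
func matmul(a, b [2][2]float64) [2][2]float64 {
	var c [2][2]float64
	for i := 0; i < 2; i++ {
		for j := 0; j < 2; j++ {
			for l := 0; l < 2; l++ {
				c[i][j] += a[i][l] * b[l][j]
			}
		}
	}
	return c
}

func main() {
	// Local Jacobians of two chained primitives.
	a1 := [2][2]float64{{2, 0}, {0, 3}}
	a2 := [2][2]float64{{1, 1}, {0, 1}}
	// Identity seed: propagate V -> A1 V -> A2 A1 V.
	v := [2][2]float64{{1, 0}, {0, 1}}
	jv := matmul(a2, matmul(a1, v))
	fmt.Println(jv) // equals the composed Jacobian A2 A1
}
```

With the identity seed, the final matrix is the Jacobian of the whole chain, mirroring the full-Jacobian construction above.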

SIMD and batched execution

Vector forward mode maps naturally onto modern hardware.

If tangent vectors are packed into contiguous arrays, many tangent operations become vectorizable:

Operation                  SIMD behavior
addition                   vector add
multiplication             fused vector multiply-add
transcendental functions   batched evaluation
tensor primitives          batched kernels

For example, if

\dot{x} \in \mathbb{R}^4,

a CPU SIMD register may compute all four tangent components simultaneously.

On GPUs, tangent dimensions can often be batched across tensor operations.

Thus higher-dimensional tangent spaces may achieve better hardware utilization than repeated scalar forward passes.

Sparse tangent spaces

Large tangent vectors are often sparse.

Suppose each intermediate value depends on only a few inputs. Then many tangent components remain zero throughout the computation.

Example:

f(x_1,\ldots,x_n) = x_i x_j.

Only tangent components for x_i and x_j contribute.

Instead of storing dense tangent vectors, a sparse representation stores only nonzero entries:

type SparseTangent struct {
    Indices []int     // positions of the nonzero tangent components
    Values  []float64 // the corresponding nonzero values
}

This can reduce memory and arithmetic cost dramatically for sparse derivative structures.
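A hedged sketch of one lifted primitive on this representation: addition becomes a merge of the two index lists (assuming Indices are kept sorted; the function name is illustrative):

```go
package main

import "fmt"

type SparseTangent struct {
	Indices []int
	Values  []float64
}

// addSparse merges two sparse tangents with sorted index lists,
// summing values where an index appears in both.
func addSparse(x, y SparseTangent) SparseTangent {
	var z SparseTangent
	i, j := 0, 0
	for i < len(x.Indices) || j < len(y.Indices) {
		switch {
		case j >= len(y.Indices) || (i < len(x.Indices) && x.Indices[i] < y.Indices[j]):
			z.Indices = append(z.Indices, x.Indices[i])
			z.Values = append(z.Values, x.Values[i])
			i++
		case i >= len(x.Indices) || y.Indices[j] < x.Indices[i]:
			z.Indices = append(z.Indices, y.Indices[j])
			z.Values = append(z.Values, y.Values[j])
			j++
		default: // same index: tangent components add
			z.Indices = append(z.Indices, x.Indices[i])
			z.Values = append(z.Values, x.Values[i]+y.Values[j])
			i++
			j++
		}
	}
	return z
}

func main() {
	x := SparseTangent{Indices: []int{0, 7}, Values: []float64{1, 2}}
	y := SparseTangent{Indices: []int{7, 9}, Values: []float64{3, 4}}
	fmt.Println(addSparse(x, y)) // {[0 7 9] [1 5 4]}
}
```

Cost is proportional to the number of nonzeros rather than the full tangent dimension k.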

Sparse forward mode is especially important for:

  • sparse Jacobians,
  • PDE systems,
  • graph computations,
  • circuit simulation,
  • large optimization problems.

Block tangent propagation

Some systems use block tangents instead of individual tangent vectors.

Suppose variables are partitioned into blocks:

x = (x^{(1)}, x^{(2)}, \ldots).

Each block carries its own tangent subspace.

This gives block-Jacobian propagation:

J_f(x) = \begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix}.

Block methods improve locality and reduce overhead when derivatives naturally cluster into subsystems.

Examples:

Domain         Natural blocks
robotics       joints or limbs
PDE solvers    spatial regions
graphics       object groups
optimization   parameter groups
databases      partitioned relations

Hyper-dual interpretation

Higher-dimensional tangents can also be expressed algebraically.

Scalar forward mode uses dual numbers:

x + \epsilon \dot{x}, \qquad \epsilon^2 = 0.

Vector forward mode introduces multiple nilpotent generators:

x + \sum_{i=1}^{k} \epsilon_i \dot{x}_i,

with

\epsilon_i^2 = 0, \qquad \epsilon_i\epsilon_j = 0.

Each generator corresponds to one tangent direction.

This algebra represents a first-order tangent space with k independent basis directions.

More advanced systems relax the cross-term condition and allow:

\epsilon_i\epsilon_j \ne 0.

Those structures lead to hyper-dual numbers and higher-order differentiation.

Tangent dimension explosion

A major limitation of vector forward mode is tangent growth.

If every variable carries an n-dimensional tangent vector, memory traffic can dominate runtime.

Suppose the primal program stores

10^8

floating point values.

If each value carries a tangent vector of dimension 1000, the tangent storage grows to 10^{11} values, roughly 800 GB at double precision.

This causes:

Problem              Effect
cache pressure       poor locality
memory bandwidth     bottleneck
register pressure    spilling
GPU occupancy loss   reduced parallel efficiency
tensor expansion     large intermediate allocations

Therefore large tangent dimensions require careful engineering.

Compression techniques

Several techniques reduce tangent overhead.

Directional batching

Instead of propagating all directions simultaneously, split them into batches:

V = [V_1 \mid V_2 \mid \cdots].

Each pass computes only a subset of tangent directions.
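A minimal sketch of batching, reusing the earlier example f(x, y) = (xy, x+y) with an analytic JVP (the helper name and batch size are illustrative):

```go
package main

import "fmt"

// jvp computes J_f(x)v analytically for f(x0, x1) = (x0*x1, x0+x1).
func jvp(x, v [2]float64) [2]float64 {
	return [2]float64{x[1]*v[0] + x[0]*v[1], v[0] + v[1]}
}

func main() {
	x := [2]float64{3, 5}
	// Four seed directions, processed in batches of two, so each
	// pass only ever carries two tangent components per variable.
	seeds := [][2]float64{{1, 0}, {0, 1}, {1, 1}, {2, -1}}
	const batch = 2
	for start := 0; start < len(seeds); start += batch {
		end := start + batch
		if end > len(seeds) {
			end = len(seeds)
		}
		for _, v := range seeds[start:end] {
			fmt.Println(jvp(x, v))
		}
	}
}
```

The trade-off is straightforward: smaller batches lower peak tangent storage but require more passes over the primal program.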

Sparse compression

Store only active tangent components.

Graph coloring

Exploit Jacobian sparsity to combine independent seed directions into fewer passes.

If two columns of the Jacobian never contribute to the same output row, they can share a seed vector.

This reduces the number of required tangent dimensions.
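For instance, if f(x)_i = x_i^2 the Jacobian is diagonal, so no two columns touch the same output row and all n seed directions can share the single combined seed v = (1, ..., 1). A sketch of recovering the diagonal in one pass (function name illustrative, sparsity pattern assumed known):

```go
package main

import "fmt"

// diagJVP computes the JVP of f(x)_i = x_i^2 with seed v.
func diagJVP(x, v []float64) []float64 {
	out := make([]float64, len(x))
	for i := range x {
		out[i] = 2 * x[i] * v[i] // tangent of x_i^2
	}
	return out
}

func main() {
	x := []float64{2, 3, 4}
	// One combined seed (1,1,1) replaces n basis seeds: each
	// entry of the JVP is exactly one Jacobian diagonal entry.
	fmt.Println(diagJVP(x, []float64{1, 1, 1})) // [4 6 8]
}
```

General sparsity patterns are handled the same way, with a graph coloring of the column-intersection graph deciding which columns may share a seed.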

Low-rank approximation

Some systems approximate tangent spaces using low-rank projections:

V \approx UV_r.

This is useful when sensitivities lie near a low-dimensional manifold.

Nested tangent spaces

Higher-dimensional tangent spaces compose naturally.

Suppose each tangent component is itself a dual number:

(x + \epsilon_1 a) + \epsilon_2(b + \epsilon_1 c).

This structure propagates higher-order derivatives.

Nested forward mode uses tangent spaces of tangent spaces.

Examples:

Nesting               Result
dual of dual          second derivatives
vector dual of dual   Hessian-vector products
nested vector duals   higher-order tensors

This compositional structure is one reason forward mode is mathematically elegant.
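A hedged sketch of the dual-of-dual case (type and function names are illustrative): nesting one scalar dual inside another and applying the product rule at both levels yields second derivatives.

```go
package main

import "fmt"

// Dual is a first-order dual number.
type Dual struct{ Val, Dot float64 }

// DDual nests duals: each component of the outer dual is itself a Dual,
// representing x + e1*a + e2*b + e1*e2*c with e1^2 = e2^2 = 0.
type DDual struct{ Val, Dot Dual }

// mul applies the product rule at both levels of nesting.
func mul(x, y DDual) DDual {
	return DDual{
		Val: Dual{
			x.Val.Val * y.Val.Val,
			x.Val.Val*y.Val.Dot + x.Val.Dot*y.Val.Val,
		},
		Dot: Dual{
			x.Val.Val*y.Dot.Val + x.Dot.Val*y.Val.Val,
			x.Val.Val*y.Dot.Dot + x.Val.Dot*y.Dot.Val +
				x.Dot.Val*y.Val.Dot + x.Dot.Dot*y.Val.Val,
		},
	}
}

func main() {
	// Lift x = 3, seeding both the inner and outer tangents to 1.
	x := DDual{Val: Dual{3, 1}, Dot: Dual{1, 0}}
	y := mul(mul(x, x), x) // y = x^3
	fmt.Println(y.Val.Val) // 27: x^3
	fmt.Println(y.Val.Dot) // 27: 3x^2, first derivative
	fmt.Println(y.Dot.Dot) // 18: 6x, second derivative
}
```

The e1*e2 coefficient carries the second derivative because each nesting level differentiates once.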

Tangent spaces on manifolds

In Euclidean space, tangents are ordinary vectors.

For manifolds, tangent spaces become geometric objects attached to points.

Example:

Manifold               Tangent space
sphere S^2             tangent plane
rotation group SO(3)   skew-symmetric matrices
probability simplex    constrained vectors

Forward mode generalizes naturally if primitives define how tangent vectors transform between manifolds.

This becomes important in:

  • robotics,
  • computer graphics,
  • geometric optimization,
  • physics simulation,
  • Lie-group dynamics.

Summary

Higher-dimensional tangent spaces generalize forward mode from one directional derivative to many simultaneous directional derivatives. Each variable carries a tangent vector rather than a scalar tangent. The resulting computation propagates a matrix of directions through the computation graph:

J_f(x)V.

This allows efficient batched JVPs, full Jacobian construction for moderate input dimensions, and exploitation of hardware vectorization and derivative sparsity. The main challenge is tangent dimension growth, which increases arithmetic cost, memory usage, and bandwidth pressure.