Complexity of Higher Orders

Higher-order automatic differentiation faces a fundamental problem: derivative structure grows combinatorially with order.

For a function

f : \mathbb{R}^n \to \mathbb{R},

the first derivative is a vector, the second derivative is a matrix, and higher derivatives become tensors.

The size growth is severe:

| Order | Object | Dense size |
| --- | --- | --- |
| 1 | gradient | n |
| 2 | Hessian | n^2 |
| 3 | third-order tensor | n^3 |
| 4 | fourth-order tensor | n^4 |

This growth is not an implementation artifact. It is an intrinsic property of higher-order multilinear structure.

Efficient higher-order AD therefore depends on avoiding explicit tensor materialization whenever possible.

Derivative Tensors

The k-th derivative of a scalar function is a symmetric multilinear map:

D^k f(x) : (\mathbb{R}^n)^k \to \mathbb{R}.

In coordinates:

(D^k f(x))_{i_1 \cdots i_k} = \frac{ \partial^k f }{ \partial x_{i_1}\cdots\partial x_{i_k} }.

Without symmetry, this tensor has:

n^k

entries.

For even moderate values:

| n | k | Entries |
| --- | --- | --- |
| 100 | 2 | 10^4 |
| 100 | 3 | 10^6 |
| 100 | 4 | 10^8 |
| 1000 | 3 | 10^9 |

Explicit storage rapidly becomes infeasible.
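
As a concrete illustration, here is a minimal JAX sketch (the function `f` and the size `n` are illustrative) that materializes the first few derivative tensors by nesting forward-mode Jacobians; the resulting shapes make the n^k growth visible directly.

```python
import jax
import jax.numpy as jnp

# Illustrative smooth scalar function f : R^n -> R.
def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

n = 50
x = jnp.linspace(0.0, 1.0, n)

grad    = jax.jacfwd(f)(x)                          # shape (n,)
hessian = jax.jacfwd(jax.jacfwd(f))(x)              # shape (n, n)
third   = jax.jacfwd(jax.jacfwd(jax.jacfwd(f)))(x)  # shape (n, n, n)

print(grad.shape, hessian.shape, third.shape)  # (50,) (50, 50) (50, 50, 50)
# The third-order tensor already holds n^3 = 125,000 entries; at n = 10^4
# the same object would need 10^12 entries.
```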

Symmetry Reduces but Does Not Eliminate Growth

For smooth scalar functions, higher-order derivative tensors are symmetric under permutation of indices.

The number of unique entries becomes:

\binom{n+k-1}{k}.

Examples:

| Order | Symmetric entries |
| --- | --- |
| 2 | \frac{n(n+1)}{2} |
| 3 | \binom{n+2}{3} |
| 4 | \binom{n+3}{4} |

For fixed k, this scales approximately as:

O(n^k / k!).

Symmetry helps substantially, but the growth remains polynomial of degree kk.
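
A quick way to see the two counts side by side is to compare n^k with the binomial formula above; this small Python snippet (values chosen to match the earlier table) does exactly that.

```python
from math import comb

def dense_entries(n, k):
    # all index tuples (i_1, ..., i_k)
    return n ** k

def symmetric_entries(n, k):
    # unique entries of a symmetric order-k tensor: C(n + k - 1, k)
    return comb(n + k - 1, k)

for n, k in [(100, 2), (100, 3), (100, 4)]:
    print(n, k, dense_entries(n, k), symmetric_entries(n, k))
# For n = 100, k = 3: 1,000,000 dense entries vs 171,700 unique ones --
# a useful constant-factor saving, but still O(n^k / k!) growth.
```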

Forward Mode Complexity

Forward mode propagates tangent information alongside primal values.

For first-order derivatives:

| Quantity | Cost |
| --- | --- |
| one directional derivative | O(C) |
| full Jacobian | O(nC) |

where C is the cost of evaluating the primal program.

For higher orders, nested forward mode introduces higher-dimensional tangent structure.

A naïve k-th order forward representation may require:

O(n^k)

derivative components per intermediate value.

This creates both computation and storage growth.
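
As a sketch of the two first-order costs in the table above (the function `g` and the sizes are illustrative), JAX exposes the single-direction case as `jax.jvp`, while `jax.jacfwd` builds the full Jacobian by sweeping forward passes over all input directions.

```python
import jax
import jax.numpy as jnp

def g(x):
    # vector-valued program; its evaluation cost plays the role of C
    return jnp.tanh(x) * jnp.sum(x ** 2)

n = 1000
x = jnp.ones(n)
v = jnp.zeros(n).at[0].set(1.0)   # a single probe direction

# One Jacobian-vector product: O(C), no n x n matrix is ever formed.
y, Jv = jax.jvp(g, (x,), (v,))

# The full Jacobian needs one forward pass per input direction: O(nC).
J = jax.jacfwd(g)(x)              # shape (n, n)
```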

Taylor Mode Complexity

Taylor mode computes truncated series coefficients.

Suppose the Taylor truncation order is:

p.

Each intermediate variable stores:

a_0, a_1, \ldots, a_p.

For multiplication, coefficients satisfy convolution recurrences:

c_k = \sum_{i=0}^k a_i b_{k-i}.

A single multiplication therefore costs:

O(p^2).

If the primal program costs C, Taylor mode typically costs approximately:

O(p^2 C).

Memory cost is:

O(pN),

where N is the number of intermediates.

This is much better than explicit tensor construction for directional higher derivatives.
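
A minimal sketch of the multiplication recurrence above, in plain Python with illustrative coefficient lists, shows where the O(p^2) cost per operation comes from; JAX ships a Taylor-mode transform along these lines in `jax.experimental.jet`.

```python
def taylor_mul(a, b):
    # a, b: Taylor coefficients [a_0, ..., a_p] of two intermediates along
    # the same curve x(t).  The product's coefficients follow the Cauchy
    # convolution c_k = sum_i a_i * b_{k-i}, costing O(p^2) per multiply.
    p = len(a) - 1
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(p + 1)]

# Example: a(t) = 1 + t, b(t) = 2 + 3t + t^2, truncated at order p = 2.
print(taylor_mul([1.0, 1.0, 0.0], [2.0, 3.0, 1.0]))   # [2.0, 5.0, 4.0]
# (1 + t)(2 + 3t + t^2) = 2 + 5t + 4t^2 + t^3, truncated to order 2.
```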

Reverse Mode Complexity

Reverse mode computes gradients efficiently for scalar-output functions.

For first order:

| Quantity | Cost |
| --- | --- |
| gradient | O(C) |
| memory | proportional to saved intermediates |

Higher-order reverse mode is harder.

Differentiating the backward pass introduces:

| Source | Complexity growth |
| --- | --- |
| nested tapes | additional storage |
| adjoint differentiation | larger programs |
| saved residuals | increased memory |
| recomputation | increased runtime |
| mutation tracking | bookkeeping overhead |

In the worst case, higher-order reverse mode may grow exponentially in derivative order if implemented naïvely.

Practical systems rely heavily on checkpointing, operator formulations, and mixed-mode transforms.
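
A short JAX sketch (illustrative `loss` and sizes) of the first-order case and of what nesting looks like: the gradient needs a single reverse sweep, while a second derivative level wraps another transform around the already-transformed backward program.

```python
import jax
import jax.numpy as jnp

def loss(x):
    # scalar-output program; the cost of one evaluation plays the role of C
    return jnp.sum(jnp.sin(x) ** 2)

x = jnp.ones(1000)

g = jax.grad(loss)(x)                  # full gradient in O(C): one reverse sweep

# Higher order means differentiating the backward pass itself.  Each extra
# level nests another transform around the already-transformed program:
hess_fn = jax.jacfwd(jax.grad(loss))   # forward over reverse
H = hess_fn(x)                         # materializes an n x n Hessian
```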

Hessian Complexity

The Hessian is the first practically important higher-order object.

Dense Hessian complexity:

| Quantity | Complexity |
| --- | --- |
| storage | O(n^2) |
| explicit construction | O(n^2) output |
| Hessian-vector product | O(C) |

The gap between explicit Hessians and Hessian-vector products is critical.

Suppose:

n = 10^6.

Then:

| Object | Approximate size |
| --- | --- |
| gradient | 10^6 entries |
| Hessian | 10^{12} entries |

The Hessian is impossible to materialize densely. But Hessian-vector products remain practical.

This motivates matrix-free methods.
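
The standard matrix-free pattern is forward-over-reverse: differentiate the gradient function along a direction. The sketch below (illustrative `loss` and `n`) never forms an n x n object.

```python
import jax
import jax.numpy as jnp

def loss(x):
    return jnp.sum(jnp.log(1.0 + x ** 2))

def hvp(x, v):
    # Hessian-vector product via forward-over-reverse: differentiate the
    # gradient function along direction v.  Cost is O(C); storage is O(n).
    return jax.jvp(jax.grad(loss), (x,), (v,))[1]

n = 1_000_000                    # a dense Hessian would need 10^12 entries
x = jnp.ones(n)
v = jnp.ones(n) / n

Hv = hvp(x, v)                   # perfectly feasible: one vector in, one out
```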

Operator Complexity

Most scalable higher-order methods expose derivative operators rather than tensors.

Examples:

| Operator | Meaning |
| --- | --- |
| Jv | Jacobian-vector product |
| v^\top J | vector-Jacobian product |
| Hv | Hessian-vector product |
| D^k f[v, \ldots, v] | directional k-th derivative |

The complexity depends on the number of directions rather than the full tensor size.

This often replaces the n^k growth of explicit tensors with cost that is only linear or polynomial in the number of probed directions.
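
For instance, a vector-Jacobian product probes a single output direction of an arbitrarily large Jacobian; a brief JAX sketch with an illustrative map `g`:

```python
import jax
import jax.numpy as jnp

def g(x):
    # map R^n -> R^2; neither the 2 x n nor n x 2 Jacobian is ever built
    return jnp.stack([jnp.sum(x ** 2), jnp.sum(jnp.sin(x))])

x = jnp.linspace(0.1, 1.0, 10_000)

y, pullback = jax.vjp(g, x)          # reverse-mode operator: v -> v^T J
u = jnp.array([1.0, 0.0])            # one output direction
(vJ,) = pullback(u)                  # shape (n,), roughly one extra pass

y2, Jv = jax.jvp(g, (x,), (jnp.ones_like(x),))   # forward-mode operator: J v
```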

Complexity of Directional Derivatives

A directional derivative compresses tensor structure into one direction.

For example:

D^k f(x)[v,\ldots,v].

This scalar quantity avoids storing the full tensor.

Taylor mode computes repeated directional derivatives efficiently:

| Quantity | Approximate cost |
| --- | --- |
| order-p directional expansion | O(p^2 C) |

This is far smaller than:

O(n^p)

tensor construction.
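
One way to compute such directional derivatives, sketched here with an illustrative `f`, is simply to nest forward-mode JVPs k times; each nesting level adds a constant-factor cost rather than a factor of n, and Taylor mode amortizes the repeated work further.

```python
import jax
import jax.numpy as jnp

def directional(f, k):
    """Return g(x, v) = D^k f(x)[v, ..., v] by nesting k forward-mode JVPs."""
    if k == 0:
        return lambda x, v: f(x)
    lower = directional(f, k - 1)
    return lambda x, v: jax.jvp(lambda y: lower(y, v), (x,), (v,))[1]

f = lambda x: jnp.sum(jnp.exp(x))
x = jnp.zeros(3)
v = jnp.ones(3)

# Scalars, never an n^k tensor: D^k f(x)[v,...,v] for k = 1, 2, 3.
print([float(directional(f, k)(x, v)) for k in (1, 2, 3)])   # [3.0, 3.0, 3.0]
```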

Sparse Higher-Order Structure

Many programs have sparse derivative structure.

Suppose variables interact only locally in the computational graph.

Then many higher-order tensor entries vanish.

Sparse methods reduce complexity using:

| Method | Benefit |
| --- | --- |
| dependency analysis | avoid zero entries |
| graph coloring | reduce directional passes |
| compressed recovery | reconstruct sparse tensors |
| block decomposition | localized interactions |
| symbolic sparsity propagation | structural inference |

For structured scientific problems, sparsity often changes the practical complexity class completely.
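
As a toy illustration of graph coloring (the chain-coupled objective and sizes are invented for the example), a tridiagonal Hessian can be recovered from just three Hessian-vector products, independent of n.

```python
import jax
import jax.numpy as jnp

def f(x):
    # chain-coupled objective: each term touches only neighbouring variables,
    # so the Hessian is tridiagonal
    return jnp.sum(x[:-1] * x[1:]) + jnp.sum(x ** 4)

def hvp(x, v):
    # one Hessian-vector product via forward-over-reverse
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

n = 9
x = jnp.arange(1.0, n + 1)

# Distance-2 coloring of a tridiagonal pattern needs only 3 colors: columns j
# with the same j % 3 never share a nonzero row, so they can be probed together.
probes = [(jnp.arange(n) % 3 == c).astype(x.dtype) for c in range(3)]
compressed = [hvp(x, p) for p in probes]        # 3 HVPs instead of n

# Recovery: H[i, j] = (H p_{j % 3})[i] for the structurally nonzero |i - j| <= 1.
H = jnp.zeros((n, n))
for j in range(n):
    for i in range(max(0, j - 1), min(n, j + 2)):
        H = H.at[i, j].set(compressed[j % 3][i])

assert jnp.allclose(H, jax.hessian(f)(x))        # dense check, feasible only at tiny n
```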

Complexity of Mixed Modes

Mixed-mode AD combines forward and reverse transforms strategically.

Example:

| Operation | Efficient mode |
| --- | --- |
| gradient | reverse |
| Jacobian-vector product | forward |
| Hessian-vector product | forward-over-reverse |
| vector-Hessian-vector | nested directional |
| directional k-th derivative | Taylor mode |

The complexity depends strongly on the input-output geometry.

Suppose:

f : \mathbb{R}^n \to \mathbb{R}^m.

Then:

| Mode | Cost scaling |
| --- | --- |
| forward | proportional to input directions |
| reverse | proportional to output directions |

Efficient higher-order systems choose the nesting that minimizes active derivative dimensionality.
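
A brief JAX illustration of that rule, with invented input and output sizes: a wide map favours reverse mode, a tall map favours forward mode, and higher-order nestings compose the same choices.

```python
import jax
import jax.numpy as jnp

# "Wide" map: many inputs, few outputs -> reverse mode (few output sweeps).
wide = lambda x: jnp.stack([jnp.sum(x ** 2), jnp.sum(jnp.sin(x))])   # R^10000 -> R^2
J_wide = jax.jacrev(wide)(jnp.ones(10_000))     # 2 reverse passes, shape (2, 10000)

# "Tall" map: few inputs, many outputs -> forward mode (few input sweeps).
tall = lambda t: jnp.sin(jnp.linspace(0.0, 1.0, 10_000) * t)         # R -> R^10000
J_tall = jax.jacfwd(tall)(2.0)                  # 1 forward pass, shape (10000,)

# Higher order composes the same decisions, e.g. forward-over-reverse for HVPs.
```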

Memory Complexity

Higher-order AD often becomes memory-bound before compute-bound.

A reverse-mode system must save intermediate values for the backward pass.

Higher-order reverse mode may need:

| Memory object | Purpose |
| --- | --- |
| primal values | backward rules |
| tangents | nested forward levels |
| cotangents | nested reverse levels |
| tapes | replay structure |
| residuals | saved intermediates |
| checkpoint metadata | recomputation planning |

Memory can grow rapidly with nesting depth.

Checkpointing reduces storage at the cost of recomputation.

Checkpointing Complexity

Suppose a program has length:

N.

Naïve reverse mode stores all intermediates:

O(N)

memory.

Checkpointing trades storage for recomputation.

For higher-order AD, checkpointing becomes recursive.

The optimal tradeoff depends on:

| Parameter | Influence |
| --- | --- |
| nesting depth | derivative levels |
| recomputation cost | replay overhead |
| tape size | memory pressure |
| residual size | saved state |
| hardware bandwidth | memory throughput |

Advanced checkpointing schedules, such as binomial checkpointing, can reduce memory from O(N) to logarithmic in N at the cost of a comparable logarithmic factor of recomputation.
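
In JAX, the basic building block for such schedules is `jax.checkpoint` (also exposed as `jax.remat`); the sketch below, with an invented block structure, marks a sub-program whose intermediates are recomputed during the backward pass rather than stored.

```python
import jax
import jax.numpy as jnp

def block(x):
    # an illustrative sub-program whose intermediates would otherwise be taped
    for _ in range(10):
        x = jnp.tanh(x) + 0.1 * x
    return x

def loss(x):
    # jax.checkpoint drops each block's intermediates after the forward pass
    # and replays the block during the backward pass: less memory, more compute.
    for _ in range(20):
        x = jax.checkpoint(block)(x)
    return jnp.sum(x ** 2)

g = jax.grad(loss)(jnp.ones(512))
```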

Curse of Dimensionality

Higher-order derivatives suffer from the curse of dimensionality.

For dimension n and derivative order k:

\text{tensor size} \sim n^k.

This is unavoidable for arbitrary dense tensors.

Efficient methods therefore rely on structure:

| Structure | Compression |
| --- | --- |
| symmetry | remove permutations |
| sparsity | remove zeros |
| low rank | factorize tensors |
| directional evaluation | avoid explicit tensors |
| locality | block decomposition |
| approximate geometry | compressed representations |

Without structure, high-order derivatives become computationally impossible at scale.

Neural Networks

Large neural networks illustrate these limits clearly.

Suppose a model has:

10^9

parameters.

Then:

| Quantity | Feasibility |
| --- | --- |
| gradient | feasible |
| Hessian diagonal | difficult |
| dense Hessian | impossible |
| third-order tensor | computationally meaningless |

Practical deep learning therefore uses:

| Technique | Reason |
| --- | --- |
| gradients | scalable |
| Hessian-vector products | scalable curvature |
| low-rank approximations | compressed geometry |
| Fisher approximations | probabilistic structure |
| diagonal approximations | memory reduction |

Full higher-order tensors are almost never constructed.

Complexity Lower Bounds

Some costs are unavoidable.

A dense Hessian has:

O(n^2)

output size.

No algorithm can output all entries in subquadratic time.

Similarly, a dense k-th derivative tensor requires:

O(n^k)

output size.

Efficient AD methods do not defeat these lower bounds. They avoid them by changing the computational goal.

Instead of asking for the tensor itself, they ask for:

| Reduced goal | Complexity |
| --- | --- |
| tensor contraction | lower |
| directional derivative | lower |
| operator application | lower |
| sparse recovery | structure-dependent |
| local approximation | compressed |

This shift is fundamental.

Practical Complexity Rules

A useful engineering summary is:

| Task | Scalable? | Preferred representation |
| --- | --- | --- |
| gradient | yes | vector |
| Jacobian | sometimes | sparse / operator |
| Hessian | rarely dense | operator |
| third-order tensor | usually no | contractions |
| repeated directional derivatives | yes | Taylor mode |
| implicit higher-order derivatives | often | linear solves |

Efficient higher-order AD is primarily about selecting the correct representation.

Conceptual Perspective

Higher-order derivatives encode local multilinear geometry.

The geometry itself grows combinatorially with dimension and order. Automatic differentiation cannot eliminate that mathematical reality.

What AD can do is preserve only the derivative information actually needed by downstream algorithms.

Scalable higher-order AD therefore depends less on raw differentiation mechanics and more on representation theory:

Which parts of the multilinear structure must be made explicit?

That question determines the true complexity of higher-order automatic differentiation.