Complexity of Higher Orders

Higher-order automatic differentiation faces a fundamental problem: derivative structure grows combinatorially with order.

For a function

f : \mathbb{R}^n \to \mathbb{R},

the first derivative is a vector, the second derivative is a matrix, and higher derivatives become tensors.

The size growth is severe:

| Order | Object | Dense size |
| --- | --- | --- |
| 1 | gradient | n |
| 2 | Hessian | n^2 |
| 3 | third-order tensor | n^3 |
| 4 | fourth-order tensor | n^4 |

This growth is not an implementation artifact. It is an intrinsic property of higher-order multilinear structure.

Efficient higher-order AD therefore depends on avoiding explicit tensor materialization whenever possible.

Derivative Tensors

The k-th derivative of a scalar function is a symmetric multilinear map:

D^k f(x) : (\mathbb{R}^n)^k \to \mathbb{R}.

In coordinates:

(D^k f(x))_{i_1 \cdots i_k} = \frac{ \partial^k f }{ \partial x_{i_1}\cdots\partial x_{i_k} }.

Without symmetry, this tensor has:

n^k

entries.

For even moderate values:

| n | k | Entries |
| --- | --- | --- |
| 100 | 2 | 10^4 |
| 100 | 3 | 10^6 |
| 100 | 4 | 10^8 |
| 1000 | 3 | 10^9 |

Explicit storage rapidly becomes infeasible.
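
As a concrete illustration, here is a minimal JAX sketch (the function `f` and the size `n` are illustrative) that materializes the first few derivative tensors by nesting forward-mode Jacobians; the resulting shapes make the n^k growth visible directly.

```python
import jax
import jax.numpy as jnp

# Illustrative smooth scalar function f : R^n -> R.
def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

n = 50
x = jnp.linspace(0.0, 1.0, n)

grad    = jax.jacfwd(f)(x)                          # shape (n,)
hessian = jax.jacfwd(jax.jacfwd(f))(x)              # shape (n, n)
third   = jax.jacfwd(jax.jacfwd(jax.jacfwd(f)))(x)  # shape (n, n, n)

print(grad.shape, hessian.shape, third.shape)  # (50,) (50, 50) (50, 50, 50)
# The third-order tensor already holds n^3 = 125,000 entries; at n = 10^4
# the same object would need 10^12 entries.
```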

Symmetry Reduces but Does Not Eliminate Growth

For smooth scalar functions, higher-order derivative tensors are symmetric under permutation of indices.

The number of unique entries becomes:

\binom{n+k-1}{k}.

Examples:

| Order | Symmetric entries |
| --- | --- |
| 2 | \frac{n(n+1)}{2} |
| 3 | \binom{n+2}{3} |
| 4 | \binom{n+3}{4} |

For fixed k, this scales approximately as:

O(n^k / k!).

Symmetry helps substantially, but the growth remains polynomial of degree kk.
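
A quick way to see the two counts side by side is to compare n^k with the binomial formula above; this small Python snippet (values chosen to match the earlier table) does exactly that.

```python
from math import comb

def dense_entries(n, k):
    # all index tuples (i_1, ..., i_k)
    return n ** k

def symmetric_entries(n, k):
    # unique entries of a symmetric order-k tensor: C(n + k - 1, k)
    return comb(n + k - 1, k)

for n, k in [(100, 2), (100, 3), (100, 4)]:
    print(n, k, dense_entries(n, k), symmetric_entries(n, k))
# For n = 100, k = 3: 1,000,000 dense entries vs 171,700 unique ones --
# a useful constant-factor saving, but still O(n^k / k!) growth.
```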

Forward Mode Complexity

Forward mode propagates tangent information alongside primal values.

For first-order derivatives:

| Quantity | Cost |
| --- | --- |
| one directional derivative | O(C) |
| full Jacobian | O(nC) |

where C is the cost of evaluating the primal program.

For higher orders, nested forward mode introduces higher-dimensional tangent structure.

A naïve k-th order forward representation may require:

O(n^k)

derivative components per intermediate value.

This creates both computation and storage growth.
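
As a sketch of the two first-order costs in the table above (the function `g` and the sizes are illustrative), JAX exposes the single-direction case as `jax.jvp`, while `jax.jacfwd` builds the full Jacobian by sweeping forward passes over all input directions.

```python
import jax
import jax.numpy as jnp

def g(x):
    # vector-valued program; its evaluation cost plays the role of C
    return jnp.tanh(x) * jnp.sum(x ** 2)

n = 1000
x = jnp.ones(n)
v = jnp.zeros(n).at[0].set(1.0)   # a single probe direction

# One Jacobian-vector product: O(C), no n x n matrix is ever formed.
y, Jv = jax.jvp(g, (x,), (v,))

# The full Jacobian needs one forward pass per input direction: O(nC).
J = jax.jacfwd(g)(x)              # shape (n, n)
```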

Taylor Mode Complexity

Taylor mode computes truncated series coefficients.

Suppose the Taylor truncation order is:

p.

Each intermediate variable stores:

a_0, a_1, \ldots, a_p.

For multiplication, coefficients satisfy convolution recurrences:

c_k = \sum_{i=0}^k a_i b_{k-i}.

A single multiplication therefore costs:

O(p^2).

If the primal program costs C, Taylor mode typically costs approximately:

O(p^2 C).

Memory cost is:

O(pN),

where N is the number of intermediates.

This is much better than explicit tensor construction for directional higher derivatives.
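
A minimal sketch of the multiplication recurrence above, in plain Python with illustrative coefficient lists, shows where the O(p^2) cost per operation comes from; JAX ships a Taylor-mode transform along these lines in `jax.experimental.jet`.

```python
def taylor_mul(a, b):
    # a, b: Taylor coefficients [a_0, ..., a_p] of two intermediates along
    # the same curve x(t).  The product's coefficients follow the Cauchy
    # convolution c_k = sum_i a_i * b_{k-i}, costing O(p^2) per multiply.
    p = len(a) - 1
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(p + 1)]

# Example: a(t) = 1 + t, b(t) = 2 + 3t + t^2, truncated at order p = 2.
print(taylor_mul([1.0, 1.0, 0.0], [2.0, 3.0, 1.0]))   # [2.0, 5.0, 4.0]
# (1 + t)(2 + 3t + t^2) = 2 + 5t + 4t^2 + t^3, truncated to order 2.
```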

Reverse Mode Complexity

Reverse mode computes gradients efficiently for scalar-output functions.

For first order:

| Quantity | Cost |
| --- | --- |
| gradient | O(C) |
| memory | proportional to saved intermediates |

Higher-order reverse mode is harder.

Differentiating the backward pass introduces:

| Source | Complexity growth |
| --- | --- |
| nested tapes | additional storage |
| adjoint differentiation | larger programs |
| saved residuals | increased memory |
| recomputation | increased runtime |
| mutation tracking | bookkeeping overhead |

In the worst case, higher-order reverse mode may grow exponentially in derivative order if implemented naïvely.

Practical systems rely heavily on checkpointing, operator formulations, and mixed-mode transforms.
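
A short JAX sketch (illustrative `loss` and sizes) of the first-order case and of what nesting looks like: the gradient needs a single reverse sweep, while a second derivative level wraps another transform around the already-transformed backward program.

```python
import jax
import jax.numpy as jnp

def loss(x):
    # scalar-output program; the cost of one evaluation plays the role of C
    return jnp.sum(jnp.sin(x) ** 2)

x = jnp.ones(1000)

g = jax.grad(loss)(x)                  # full gradient in O(C): one reverse sweep

# Higher order means differentiating the backward pass itself.  Each extra
# level nests another transform around the already-transformed program:
hess_fn = jax.jacfwd(jax.grad(loss))   # forward over reverse
H = hess_fn(x)                         # materializes an n x n Hessian
```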

Hessian Complexity

The Hessian is the first practically important higher-order object.

Dense Hessian complexity:

| Quantity | Complexity |
| --- | --- |
| storage | O(n^2) |
| explicit construction | O(n^2) output |
| Hessian-vector product | O(C) |

The gap between explicit Hessians and Hessian-vector products is critical.

Suppose:

n = 10^6.

Then:

| Object | Approximate size |
| --- | --- |
| gradient | 10^6 entries |
| Hessian | 10^{12} entries |

The Hessian is impossible to materialize densely. But Hessian-vector products remain practical.

This motivates matrix-free methods.
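
The standard matrix-free pattern is forward-over-reverse: differentiate the gradient function along a direction. The sketch below (illustrative `loss` and `n`) never forms an n x n object.

```python
import jax
import jax.numpy as jnp

def loss(x):
    return jnp.sum(jnp.log(1.0 + x ** 2))

def hvp(x, v):
    # Hessian-vector product via forward-over-reverse: differentiate the
    # gradient function along direction v.  Cost is O(C); storage is O(n).
    return jax.jvp(jax.grad(loss), (x,), (v,))[1]

n = 1_000_000                    # a dense Hessian would need 10^12 entries
x = jnp.ones(n)
v = jnp.ones(n) / n

Hv = hvp(x, v)                   # perfectly feasible: one vector in, one out
```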

Operator Complexity

Most scalable higher-order methods expose derivative operators rather than tensors.

Examples:

| Operator | Meaning |
| --- | --- |
| Jv | Jacobian-vector product |
| v^\top J | vector-Jacobian product |
| Hv | Hessian-vector product |
| D^k f[v, \ldots, v] | directional k-th derivative |

The complexity depends on the number of directions rather than the full tensor size.

This often replaces the n^k growth of explicit tensors with cost that is only linear or polynomial in the number of probed directions.
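
For instance, a vector-Jacobian product probes a single output direction of an arbitrarily large Jacobian; a brief JAX sketch with an illustrative map `g`:

```python
import jax
import jax.numpy as jnp

def g(x):
    # map R^n -> R^2; neither the 2 x n nor n x 2 Jacobian is ever built
    return jnp.stack([jnp.sum(x ** 2), jnp.sum(jnp.sin(x))])

x = jnp.linspace(0.1, 1.0, 10_000)

y, pullback = jax.vjp(g, x)          # reverse-mode operator: v -> v^T J
u = jnp.array([1.0, 0.0])            # one output direction
(vJ,) = pullback(u)                  # shape (n,), roughly one extra pass

y2, Jv = jax.jvp(g, (x,), (jnp.ones_like(x),))   # forward-mode operator: J v
```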

Complexity of Directional Derivatives

A directional derivative compresses tensor structure into one direction.

For example:

D^k f(x)[v,\ldots,v].

This scalar quantity avoids storing the full tensor.

Taylor mode computes repeated directional derivatives efficiently:

| Quantity | Approximate cost |
| --- | --- |
| order-p directional expansion | O(p^2 C) |

This is far smaller than:

O(n^p)

tensor construction.
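
One way to compute such directional derivatives, sketched here with an illustrative `f`, is simply to nest forward-mode JVPs k times; each nesting level adds a constant-factor cost rather than a factor of n, and Taylor mode amortizes the repeated work further.

```python
import jax
import jax.numpy as jnp

def directional(f, k):
    """Return g(x, v) = D^k f(x)[v, ..., v] by nesting k forward-mode JVPs."""
    if k == 0:
        return lambda x, v: f(x)
    lower = directional(f, k - 1)
    return lambda x, v: jax.jvp(lambda y: lower(y, v), (x,), (v,))[1]

f = lambda x: jnp.sum(jnp.exp(x))
x = jnp.zeros(3)
v = jnp.ones(3)

# Scalars, never an n^k tensor: D^k f(x)[v,...,v] for k = 1, 2, 3.
print([float(directional(f, k)(x, v)) for k in (1, 2, 3)])   # [3.0, 3.0, 3.0]
```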

Sparse Higher-Order Structure

Many programs have sparse derivative structure.

Suppose variables interact only locally in the computational graph.

Then many higher-order tensor entries vanish.

Sparse methods reduce complexity using:

| Method | Benefit |
| --- | --- |
| dependency analysis | avoid zero entries |
| graph coloring | reduce directional passes |
| compressed recovery | reconstruct sparse tensors |
| block decomposition | localized interactions |
| symbolic sparsity propagation | structural inference |

For structured scientific problems, sparsity often changes the practical complexity class completely.
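
As a toy illustration of graph coloring (the chain-coupled objective and sizes are invented for the example), a tridiagonal Hessian can be recovered from just three Hessian-vector products, independent of n.

```python
import jax
import jax.numpy as jnp

def f(x):
    # chain-coupled objective: each term touches only neighbouring variables,
    # so the Hessian is tridiagonal
    return jnp.sum(x[:-1] * x[1:]) + jnp.sum(x ** 4)

def hvp(x, v):
    # one Hessian-vector product via forward-over-reverse
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

n = 9
x = jnp.arange(1.0, n + 1)

# Distance-2 coloring of a tridiagonal pattern needs only 3 colors: columns j
# with the same j % 3 never share a nonzero row, so they can be probed together.
probes = [(jnp.arange(n) % 3 == c).astype(x.dtype) for c in range(3)]
compressed = [hvp(x, p) for p in probes]        # 3 HVPs instead of n

# Recovery: H[i, j] = (H p_{j % 3})[i] for the structurally nonzero |i - j| <= 1.
H = jnp.zeros((n, n))
for j in range(n):
    for i in range(max(0, j - 1), min(n, j + 2)):
        H = H.at[i, j].set(compressed[j % 3][i])

assert jnp.allclose(H, jax.hessian(f)(x))        # dense check, feasible only at tiny n
```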

Complexity of Mixed Modes

Mixed-mode AD combines forward and reverse transforms strategically.

Example:

| Operation | Efficient mode |
| --- | --- |
| gradient | reverse |
| Jacobian-vector product | forward |
| Hessian-vector product | forward-over-reverse |
| vector-Hessian-vector | nested directional |
| directional k-th derivative | Taylor mode |

The complexity depends strongly on the input-output geometry.

Suppose:

f : \mathbb{R}^n \to \mathbb{R}^m.

Then:

| Mode | Cost scaling |
| --- | --- |
| forward | proportional to input directions |
| reverse | proportional to output directions |

Efficient higher-order systems choose the nesting that minimizes active derivative dimensionality.
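
A brief JAX illustration of that rule, with invented input and output sizes: a wide map favours reverse mode, a tall map favours forward mode, and higher-order nestings compose the same choices.

```python
import jax
import jax.numpy as jnp

# "Wide" map: many inputs, few outputs -> reverse mode (few output sweeps).
wide = lambda x: jnp.stack([jnp.sum(x ** 2), jnp.sum(jnp.sin(x))])   # R^10000 -> R^2
J_wide = jax.jacrev(wide)(jnp.ones(10_000))     # 2 reverse passes, shape (2, 10000)

# "Tall" map: few inputs, many outputs -> forward mode (few input sweeps).
tall = lambda t: jnp.sin(jnp.linspace(0.0, 1.0, 10_000) * t)         # R -> R^10000
J_tall = jax.jacfwd(tall)(2.0)                  # 1 forward pass, shape (10000,)

# Higher order composes the same decisions, e.g. forward-over-reverse for HVPs.
```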

Memory Complexity

Higher-order AD often becomes memory-bound before compute-bound.

A reverse-mode system must save intermediate values for the backward pass.

Higher-order reverse mode may need:

| Memory object | Purpose |
| --- | --- |
| primal values | backward rules |
| tangents | nested forward levels |
| cotangents | nested reverse levels |
| tapes | replay structure |
| residuals | saved intermediates |
| checkpoint metadata | recomputation planning |

Memory can grow rapidly with nesting depth.

Checkpointing reduces storage at the cost of recomputation.

Checkpointing Complexity

Suppose a program has length:

N.

Naïve reverse mode stores all intermediates:

O(N)

memory.

Checkpointing trades storage for recomputation.

For higher-order AD, checkpointing becomes recursive.

The optimal tradeoff depends on:

| Parameter | Influence |
| --- | --- |
| nesting depth | derivative levels |
| recomputation cost | replay overhead |
| tape size | memory pressure |
| residual size | saved state |
| hardware bandwidth | memory throughput |

Advanced checkpointing schedules, such as binomial checkpointing, can reduce memory from O(N) to logarithmic in N at the cost of a comparable logarithmic factor of recomputation.
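
In JAX, the basic building block for such schedules is `jax.checkpoint` (also exposed as `jax.remat`); the sketch below, with an invented block structure, marks a sub-program whose intermediates are recomputed during the backward pass rather than stored.

```python
import jax
import jax.numpy as jnp

def block(x):
    # an illustrative sub-program whose intermediates would otherwise be taped
    for _ in range(10):
        x = jnp.tanh(x) + 0.1 * x
    return x

def loss(x):
    # jax.checkpoint drops each block's intermediates after the forward pass
    # and replays the block during the backward pass: less memory, more compute.
    for _ in range(20):
        x = jax.checkpoint(block)(x)
    return jnp.sum(x ** 2)

g = jax.grad(loss)(jnp.ones(512))
```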

Curse of Dimensionality

Higher-order derivatives suffer from the curse of dimensionality.

For dimension n and derivative order k:

\text{tensor size} \sim n^k.

This is unavoidable for arbitrary dense tensors.

Efficient methods therefore rely on structure:

| Structure | Compression |
| --- | --- |
| symmetry | remove permutations |
| sparsity | remove zeros |
| low rank | factorize tensors |
| directional evaluation | avoid explicit tensors |
| locality | block decomposition |
| approximate geometry | compressed representations |

Without structure, high-order derivatives become computationally impossible at scale.

Neural Networks

Large neural networks illustrate these limits clearly.

Suppose a model has:

10^9

parameters.

Then:

| Quantity | Feasibility |
| --- | --- |
| gradient | feasible |
| Hessian diagonal | difficult |
| dense Hessian | impossible |
| third-order tensor | computationally meaningless |

Practical deep learning therefore uses:

| Technique | Reason |
| --- | --- |
| gradients | scalable |
| Hessian-vector products | scalable curvature |
| low-rank approximations | compressed geometry |
| Fisher approximations | probabilistic structure |
| diagonal approximations | memory reduction |

Full higher-order tensors are almost never constructed.

Complexity Lower Bounds

Some costs are unavoidable.

A dense Hessian has:

O(n^2)

output size.

No algorithm can output all entries in subquadratic time.

Similarly, a dense k-th derivative tensor requires:

O(n^k)

output size.

Efficient AD methods do not defeat these lower bounds. They avoid them by changing the computational goal.

Instead of asking for the tensor itself, they ask for:

| Reduced goal | Complexity |
| --- | --- |
| tensor contraction | lower |
| directional derivative | lower |
| operator application | lower |
| sparse recovery | structure-dependent |
| local approximation | compressed |

This shift is fundamental.

Practical Complexity Rules

A useful engineering summary is:

| Task | Scalable? | Preferred representation |
| --- | --- | --- |
| gradient | yes | vector |
| Jacobian | sometimes | sparse / operator |
| Hessian | rarely dense | operator |
| third-order tensor | usually no | contractions |
| repeated directional derivatives | yes | Taylor mode |
| implicit higher-order derivatives | often | linear solves |

Efficient higher-order AD is primarily about selecting the correct representation.

Conceptual Perspective

Higher-order derivatives encode local multilinear geometry.

The geometry itself grows combinatorially with dimension and order. Automatic differentiation cannot eliminate that mathematical reality.

What AD can do is preserve only the derivative information actually needed by downstream algorithms.

Scalable higher-order AD therefore depends less on raw differentiation mechanics and more on representation theory:

Which parts of the multilinear structure must be made explicit?

That question determines the true complexity of higher-order automatic differentiation.