Taylor Expansions

Differentiation describes how a function changes locally. A Taylor expansion extends this idea by approximating a function with a polynomial around a point.

Automatic differentiation is fundamentally a first-order method in most applications, but higher-order AD is closely connected to Taylor expansions. Forward mode can be generalized to propagate higher-order coefficients, and many derivative identities become clearer through the Taylor viewpoint.

Local Polynomial Approximation

Let

$$ f : \mathbb{R} \to \mathbb{R} $$

be sufficiently smooth near a point $x$. The Taylor expansion around $x$ is

$$ f(x + h) = f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 + \frac{1}{6}f^{(3)}(x)h^3 + \cdots $$

More compactly,

$$ f(x+h) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x)}{k!}\,h^k $$

The coefficients are derivatives evaluated at the expansion point.

The first-order approximation is

$$ f(x+h) \approx f(x) + f'(x)h $$

This is the linearization used by ordinary AD.

The second-order approximation adds curvature:

$$ f(x+h) \approx f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 $$

Higher-order terms capture increasingly fine local structure.
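These two approximations are easy to compare numerically. The sketch below (illustrative helper names, not from any library) approximates $\sin$ near a point with the linear and quadratic truncations and prints the errors:

```python
import math

def taylor1(f, df, x, h):
    """First-order (linear) approximation of f(x + h)."""
    return f(x) + df(x) * h

def taylor2(f, df, d2f, x, h):
    """Second-order approximation: adds the curvature term."""
    return taylor1(f, df, x, h) + 0.5 * d2f(x) * h**2

# Example: f(x) = sin(x) around x = 1.0 with a small step.
x, h = 1.0, 0.1
exact = math.sin(x + h)
lin = taylor1(math.sin, math.cos, x, h)
quad = taylor2(math.sin, math.cos, lambda t: -math.sin(t), x, h)

print(abs(exact - lin))   # first-order error, O(h^2)
print(abs(exact - quad))  # second-order error, O(h^3)
```

Halving $h$ should shrink the first-order error by roughly 4x and the second-order error by roughly 8x, matching the $O(h^2)$ and $O(h^3)$ remainders.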

Taylor Expansion in Multiple Variables

For

$$ f : \mathbb{R}^n \to \mathbb{R} $$

the expansion around $x$ in direction $\Delta x$ begins as

$$ f(x + \Delta x) = f(x) + \nabla f(x)^T\Delta x + \frac{1}{2} \Delta x^T H_f(x)\Delta x + \cdots $$

The terms have geometric meaning:

| Term | Meaning |
| --- | --- |
| $f(x)$ | base value |
| $\nabla f(x)^T\Delta x$ | first-order directional change |
| $\frac{1}{2}\Delta x^T H_f(x)\Delta x$ | second-order curvature |
| higher-order terms | finer local structure |

The first-order term is linear in $\Delta x$. The second-order term is quadratic.

Forward mode computes first-order directional information directly. Higher-order AD computes additional coefficients of this expansion.

Example

Consider

$$ f(x) = e^x $$

Its derivatives satisfy

$$ f^{(k)}(x) = e^x $$

for all $k$.

The Taylor expansion around $x = 0$ is

$$ e^h = 1 + h + \frac{1}{2}h^2 + \frac{1}{6}h^3 + \cdots $$

For small $h$, truncating after a few terms gives an accurate approximation.
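A quick numeric check of this series (plain Python, illustrative helper name) shows the error dropping sharply with each additional term:

```python
import math

def exp_taylor(h, order):
    """Partial sum of the Taylor series of e^h around 0."""
    return sum(h**k / math.factorial(k) for k in range(order + 1))

h = 0.1
for p in (1, 2, 3, 4):
    # Error of the order-p truncation; shrinks by roughly a factor of h
    # with each additional term.
    print(p, abs(math.exp(h) - exp_taylor(h, p)))
```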

Now consider

$$ f(x) = \sin x $$

At $x = 0$,

$$ \sin h = h - \frac{1}{6}h^3 + \frac{1}{120}h^5 - \cdots $$

The derivative information determines the full local series structure.

Truncated Series

Practical Taylor methods use truncated series.

A truncated series of order $p$ has the form

$$ a_0 + a_1h + a_2h^2 + \cdots + a_ph^p $$

Terms above degree $p$ are discarded.

For example, second-order truncation gives

$$ f(x+h) \approx f(x) + f'(x)h + \frac{1}{2}f''(x)h^2 $$

Truncated series form a closed algebra: they can be added, multiplied, composed, and differentiated, with terms above the truncation order discarded at every step.

This algebraic structure is the basis of higher-order forward AD.

Dual Numbers as First-Order Taylor Series

Ordinary dual numbers use an infinitesimal $\varepsilon$ satisfying

$$ \varepsilon^2 = 0 $$

A dual number has the form

$$ a + b\varepsilon $$

This is exactly a truncated first-order Taylor expansion.

If

$$ x \mapsto x + \varepsilon $$

then

$$ f(x+\varepsilon) = f(x) + f'(x)\varepsilon $$

because higher-order terms vanish.

Forward mode AD with dual numbers therefore computes first-order Taylor coefficients automatically.
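A minimal dual-number sketch makes this concrete (the `Dual` class and `dsin` helper below are illustrative names, not from any particular AD library):

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # a: function value
    eps: float   # b: coefficient of epsilon (the derivative)

    def __add__(self, other):
        return Dual(self.val + other.val, self.eps + other.eps)

    def __mul__(self, other):
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps^2 = 0
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

def dsin(x: Dual) -> Dual:
    # Lift sin via f(x + eps) = f(x) + f'(x) eps
    return Dual(math.sin(x.val), math.cos(x.val) * x.eps)

# Seed eps = 1 to differentiate with respect to x.
x = Dual(0.5, 1.0)
y = dsin(x) * dsin(x)   # y = sin(x)^2
print(y.val, y.eps)     # value and derivative 2 sin(x) cos(x)
```

Every primitive applies the same lifting rule, so the derivative coefficient is carried through the whole computation automatically.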

Higher-Order Truncated Algebras

To obtain higher derivatives, we generalize the nilpotent rule.

Suppose

$$ \varepsilon^{p+1} = 0 $$

Then a truncated series becomes

$$ a_0 + a_1\varepsilon + a_2\varepsilon^2 + \cdots + a_p\varepsilon^p $$

Applying a smooth function produces higher-order derivative coefficients automatically.

For example, using third-order truncation,

$$ f(x+\varepsilon) = f(x) + f'(x)\varepsilon + \frac{1}{2}f''(x)\varepsilon^2 + \frac{1}{6}f^{(3)}(x)\varepsilon^3 $$

This idea leads to Taylor-mode AD.

Taylor Mode Automatic Differentiation

Taylor mode propagates polynomial coefficients rather than single tangents.

Instead of storing

$$ (x, \dot{x}) $$

we store

$$ (x_0, x_1, x_2, \ldots, x_p) $$

where

| Coefficient | Meaning |
| --- | --- |
| $x_0$ | primal value |
| $x_1$ | first derivative coefficient |
| $x_2$ | second derivative coefficient |
| $\vdots$ | higher-order coefficients |

Primitive operations are lifted to operate on truncated series.

For addition:

$$ (a_0 + a_1\varepsilon + \cdots) + (b_0 + b_1\varepsilon + \cdots) $$

gives

$$ (a_0+b_0) + (a_1+b_1)\varepsilon + \cdots $$

For multiplication:

$$ (a_0 + a_1\varepsilon + \cdots)(b_0 + b_1\varepsilon + \cdots) $$

the coefficients combine through polynomial convolution: $c_k = \sum_{i=0}^{k} a_i b_{k-i}$, with terms above degree $p$ discarded.

This allows one forward pass to compute multiple derivative orders simultaneously.
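This coefficient arithmetic can be sketched as a tiny truncated-series class in Python (the `Taylor` name and representation are illustrative, not any library's API):

```python
import math

class Taylor:
    """Truncated Taylor series: coeffs[k] is the coefficient of eps^k."""
    def __init__(self, coeffs):
        self.coeffs = list(coeffs)

    def __add__(self, other):
        # Coefficient-wise addition.
        return Taylor([a + b for a, b in zip(self.coeffs, other.coeffs)])

    def __mul__(self, other):
        p = len(self.coeffs) - 1
        out = [0.0] * (p + 1)
        # Polynomial convolution, discarding terms above degree p.
        for i, a in enumerate(self.coeffs):
            for j, b in enumerate(other.coeffs):
                if i + j <= p:
                    out[i + j] += a * b
        return Taylor(out)

# x = x0 + eps, truncated at order 3: coefficients (x0, 1, 0, 0).
x0 = 2.0
x = Taylor([x0, 1.0, 0.0, 0.0])
y = x * x * x   # y = x^3

# coeffs[k] = f^{(k)}(x0) / k!, so derivatives are k! * coeffs[k].
print([math.factorial(k) * c for k, c in enumerate(y.coeffs)])
```

One pass through the computation yields the value and the first three derivatives of $x^3$ at $x_0$ at once.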

Composition and Taylor Series

The chain rule appears naturally inside Taylor expansions.

Suppose

$$ f(x) = h(g(x)) $$

Expand $g$ around $x$ with increment $\delta$ (written $\delta$ here to avoid clashing with the function $h$):

$$ g(x+\delta) = g(x) + g'(x)\delta + \cdots $$

Substitute into the Taylor expansion of $h$:

$$ h(g(x+\delta)) = h(g(x)) + h'(g(x))\,g'(x)\,\delta + \cdots $$

The first-order term already contains the chain rule:

$$ (h \circ g)'(x) = h'(g(x))\,g'(x) $$

Higher-order terms produce generalized chain-rule formulas such as Faà di Bruno expansions.

Taylor methods therefore provide another perspective on derivative composition.
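At second order, this expansion yields the identity $(h \circ g)''(x) = h''(g(x))\,g'(x)^2 + h'(g(x))\,g''(x)$, the simplest Faà di Bruno case. A quick numeric sanity check, using $h = \exp$ and $g = \sin$ (the point $x = 0.3$ and the finite-difference step are arbitrary choices for illustration):

```python
import math

# Second-order chain rule: (h o g)'' = h''(g) g'^2 + h'(g) g''
x = 0.3
g0, dg, d2g = math.sin(x), math.cos(x), -math.sin(x)
eg = math.exp(g0)   # exp equals all of its own derivatives
formula = eg * dg**2 + eg * d2g

# Compare against a central finite difference of f(x) = exp(sin(x)).
f = lambda t: math.exp(math.sin(t))
e = 1e-4
fd = (f(x + e) - 2 * f(x) + f(x - e)) / e**2
print(formula, fd)   # should agree to several digits
```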

Hessians from Taylor Expansions

The Hessian appears as the quadratic coefficient in multivariate expansions.

For

$$ f(x+\Delta x) = f(x) + \nabla f(x)^T\Delta x + \frac{1}{2}\Delta x^T H_f(x)\Delta x + \cdots $$

the Hessian determines second-order behavior.

If we choose a direction $v$ and define

$$ g(t) = f(x+tv) $$

then

$$ g''(0) = v^T H_f(x)\,v $$

This is the second directional derivative along $v$.
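This identity is easy to verify on a small example (the function, point, and direction below are made up for illustration; $g''(0)$ is approximated with a central difference):

```python
# f(x, y) = x^2 y has Hessian [[2y, 2x], [2x, 0]].
def f(p):
    x, y = p
    return x * x * y

x0 = (1.0, 2.0)
v = (0.5, -1.0)

def g(t):
    """Restriction of f to the line through x0 in direction v."""
    return f((x0[0] + t * v[0], x0[1] + t * v[1]))

# Central difference approximates g''(0).
eps = 1e-4
g2 = (g(eps) - 2 * g(0.0) + g(-eps)) / eps**2

# Analytic v^T H v with H = [[4, 2], [2, 0]] at (1, 2).
H = [[4.0, 2.0], [2.0, 0.0]]
vHv = sum(v[i] * H[i][j] * v[j] for i in range(2) for j in range(2))
print(g2, vHv)   # both equal the second directional derivative
```

In a real system the second derivative of $g$ would come from second-order Taylor propagation rather than finite differences; the point is that only the scalar restriction along $v$ is needed, never the full Hessian.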

Many higher-order AD algorithms operate through directional Taylor coefficients rather than explicit Hessian matrices.

Taylor Expansions and Numerical Analysis

Taylor expansions are central in numerical methods.

| Area | Use of Taylor expansions |
| --- | --- |
| Optimization | local quadratic models |
| ODE solvers | local time stepping |
| Stability analysis | perturbation growth |
| Root finding | Newton and higher-order methods |
| Error estimation | truncation analysis |
| Sensitivity analysis | local parameter dependence |

AD provides exact derivative information needed for these methods without symbolic differentiation.

Radius of Validity

A Taylor expansion is local.

The approximation quality depends on:

  1. Distance from the expansion point.
  2. Smoothness of the function.
  3. Order of truncation.

Near singularities or discontinuities, the expansion may fail.

For example,

$$ f(x) = \frac{1}{1-x} $$

has Taylor expansion around $x = 0$:

$$ 1 + x + x^2 + x^3 + \cdots $$

This converges only for

$$ |x| < 1 $$

Outside that radius, the series diverges.
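A quick numeric sketch of this behavior, using partial sums of the geometric series:

```python
def partial_sum(x, p):
    """Partial sum 1 + x + ... + x^p of the series for 1/(1-x)."""
    return sum(x**k for k in range(p + 1))

target = lambda x: 1.0 / (1.0 - x)

# Inside the radius of convergence the error shrinks geometrically...
print(abs(target(0.5) - partial_sum(0.5, 30)))   # tiny

# ...outside it the partial sums grow without bound.
print(partial_sum(1.5, 10), partial_sum(1.5, 30))
```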

AD computes derivatives locally. It does not guarantee that low-order local approximations remain accurate globally.

Taylor Expansions and Floating Point

Taylor coefficients are mathematically exact derivatives, but practical computation uses floating point arithmetic.

Higher-order derivatives can become numerically unstable:

| Issue | Effect |
| --- | --- |
| cancellation | loss of precision |
| large factorial scaling | overflow or underflow |
| repeated differentiation | amplification of rounding error |
| high-order tensors | large memory growth |

For this reason, many AD systems focus primarily on first-order reverse mode and selected second-order products rather than arbitrary high-order expansions.

Relationship to Reverse Mode

Forward Taylor propagation generalizes naturally to higher orders. Reverse mode becomes much more complicated at higher orders because reverse propagation already represents a transposed linearized computation.

Second-order reverse mode requires differentiating reverse computations themselves. The resulting systems involve nested pullbacks, higher-order adjoints, and large intermediate structures.

This is one reason higher-order forward techniques remain important even in systems dominated by reverse-mode machine learning.

Conceptual View

Taylor expansions provide a unifying interpretation of differentiation.

A differentiable function can be viewed locally as a polynomial approximation:

$$ f(x+h) = \text{constant term} + \text{linear term} + \text{quadratic term} + \cdots $$

Automatic differentiation computes the coefficients of this local expansion algorithmically from the program structure.

Ordinary forward mode computes the linear term.

Higher-order AD computes additional coefficients.

Reverse mode computes transposed actions of these local linear structures.

In all cases, the program is treated as a composition of primitive operations, and local expansion rules are propagated through the computation.