Linearization is the operation of replacing a nonlinear function by its best local linear approximation at a chosen point. Automatic differentiation can be understood as a machine for computing this local linear approximation, or products involving it, for programs.
Let

$$f : \mathbb{R}^n \to \mathbb{R}^m.$$

At a point $x \in \mathbb{R}^n$, the value of the function is $f(x)$.

If the input is perturbed by a small vector $v \in \mathbb{R}^n$, the output changes by $f(x + v) - f(x)$.

For differentiable $f$, the first-order approximation is

$$f(x + v) \approx f(x) + J_f(x)\, v.$$

The linear map

$$v \mapsto J_f(x)\, v$$

is the linearization of $f$ at $x$.
## Local Linear Models
A nonlinear function may have complicated global behavior. Near one point, however, a differentiable function behaves like an affine map:

$$f(x + v) \approx f(x) + J_f(x)\, v.$$

The affine approximation has two parts:

| Part | Meaning |
|---|---|
| $f(x)$ | base value |
| $J_f(x)\, v$ | first-order change around the base value |

The Jacobian term $J_f(x)\, v$ is linear in the perturbation $v$, but the full approximation includes the offset $f(x)$. This is why we distinguish between a linear map and an affine approximation.
For scalar functions $f : \mathbb{R} \to \mathbb{R}$, linearization gives the familiar tangent line:

$$f(x + v) \approx f(x) + f'(x)\, v.$$

For vector functions, the tangent line generalizes to a tangent linear map.
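For example, with the illustrative choice $f(x) = x^2$ and base point $x = 3$:

$$f(3 + v) \approx f(3) + f'(3)\, v = 9 + 6v.$$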
## Linearization of a Program
Consider the program

```
u = x * y
v = sin(u)
z = v + x
```

The corresponding function is

$$f(x, y) = \sin(x y) + x.$$

Linearization introduces a perturbation for every value: $dx$ and $dy$ for the inputs, and $du$, $dv$, $dz$ for the computed values.

The tangent program is obtained by differentiating each primitive operation locally.
The pair of programs,

```
u = x * y
du = y * dx + x * dy
v = sin(u)
dv = cos(u) * du
z = v + x
dz = dv + dx
```

computes both the original value and its first-order change. This is forward mode AD in its most direct form.
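The paired program above runs as ordinary Python. A minimal sketch (the function name and the inputs are chosen for illustration):

```python
import math

def f_and_tangent(x, y, dx, dy):
    # Primal and tangent statements interleaved, one pair per primitive.
    u = x * y
    du = y * dx + x * dy
    v = math.sin(u)
    dv = math.cos(u) * du
    z = v + x
    dz = dv + dx
    return z, dz

# Seeding (dx, dy) = (1, 0) yields the partial derivative with respect to x.
z, dz_dx = f_and_tangent(2.0, 3.0, 1.0, 0.0)
```

Here `dz_dx` equals the analytic partial $\partial f / \partial x = y \cos(x y) + 1$ evaluated at the same point.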
## Pushforward
The linearization of a function at a point is also called its pushforward.

For

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

the pushforward at $x$ maps tangent vectors at $x$ to tangent vectors at $f(x)$:

$$df_x : T_x \mathbb{R}^n \to T_{f(x)} \mathbb{R}^m.$$

In Euclidean spaces, this is just multiplication by the Jacobian:

$$df_x(v) = J_f(x)\, v.$$

Forward mode AD computes pushforwards. Given a primal value $x$ and a tangent vector $v$, it computes

$$(f(x),\ J_f(x)\, v).$$

This is the value of the function and the pushed-forward tangent.
## Linearization as a Program Transform
Linearization can be described as a transformation on programs.
Given a program that computes

$$y = f(x),$$

the linearized program computes

$$(y,\ dy),$$

where

$$dy = J_f(x)\, dx.$$
This transformed program runs alongside the original computation. Each primitive is replaced by a paired primal-and-tangent rule.
For example, addition becomes:

$$z = x + y, \qquad dz = dx + dy.$$

Multiplication becomes:

$$z = x y, \qquad dz = y\, dx + x\, dy.$$

Sine becomes:

$$z = \sin(x), \qquad dz = \cos(x)\, dx.$$

Exponential becomes:

$$z = e^x, \qquad dz = e^x\, dx.$$
The transformed program remains executable ordinary code. This is one reason AD fits naturally into compilers and programming languages.
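One way to make these paired rules executable is a dual-number type. A minimal sketch, where the `Dual` class and the `sin`/`exp` wrappers are illustrative rather than any particular library's API:

```python
import math

class Dual:
    """A primal value paired with its tangent (first-order perturbation)."""
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent

    def __add__(self, other):
        # z = x + y,  dz = dx + dy
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    def __mul__(self, other):
        # z = x * y,  dz = y dx + x dy
        return Dual(self.primal * other.primal,
                    other.primal * self.tangent + self.primal * other.tangent)

def sin(d):
    # z = sin(x),  dz = cos(x) dx
    return Dual(math.sin(d.primal), math.cos(d.primal) * d.tangent)

def exp(d):
    # z = exp(x),  dz = exp(x) dx
    return Dual(math.exp(d.primal), math.exp(d.primal) * d.tangent)

# Running the earlier program sin(x*y) + x on Dual values, seeded for d/dx:
x, y = Dual(2.0, 1.0), Dual(3.0, 0.0)
result = sin(x * y) + x
```

Evaluating a program on `Dual` values executes the primal and tangent rules together, which is exactly the paired transform described above.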
## Linearization and Approximation Error
Linearization keeps only first-order terms. The approximation error is higher order in the perturbation.
For a scalar function with sufficient smoothness,

$$f(x + v) = f(x) + f'(x)\, v + O(v^2).$$

For a vector function,

$$f(x + v) = f(x) + J_f(x)\, v + O(\|v\|^2).$$

The notation $O(\|v\|^2)$ means that the ignored error shrinks quadratically as the perturbation norm goes to zero.
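As a quick check, for $f(x) = x^2$ the expansion terminates and the remainder is exactly quadratic:

$$f(x + v) = (x + v)^2 = x^2 + 2x\, v + v^2 = f(x) + f'(x)\, v + v^2.$$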
AD computes the first-order term exactly according to the executed operations and floating point arithmetic. It does not estimate the derivative by trying small perturbations. That separates AD from finite differences.
## Linearization vs Finite Differences
Finite differences approximate derivatives by evaluating the function at nearby points:

$$J_f(x)\, v \approx \frac{f(x + h v) - f(x)}{h}.$$

This approximation depends on the step size $h$. If $h$ is too large, truncation error dominates. If $h$ is too small, floating point cancellation dominates.

Forward mode AD computes the same directional derivative quantity,

$$J_f(x)\, v,$$

but without subtracting nearly equal function values. It propagates the derivative algebra through each primitive operation.
This makes AD much more accurate than finite differences for many computations, while retaining a program-execution model rather than symbolic expression manipulation.
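The step-size dilemma is easy to observe. A sketch comparing a forward difference against the exact derivative that forward mode would propagate (the function and step sizes are illustrative):

```python
import math

def f(x):
    return math.sin(x) * math.exp(x)

def df_exact(x):
    # The derivative forward mode would propagate, up to rounding:
    # d/dx [sin(x) exp(x)] = (cos(x) + sin(x)) exp(x)
    return (math.cos(x) + math.sin(x)) * math.exp(x)

x = 1.0
errors = {h: abs((f(x + h) - f(x)) / h - df_exact(x))
          for h in (1e-4, 1e-8, 1e-12)}
# Large h: truncation error dominates.  Tiny h: cancellation dominates.
```

Neither regime recovers full precision, while the propagated derivative is exact up to ordinary floating point rounding.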
## Linearization and Composition
Linearization respects composition.
If

$$h = g \circ f,$$

then

$$J_h(x) = J_g(f(x))\, J_f(x).$$

Applying this to a perturbation $v$,

$$J_h(x)\, v = J_g(f(x))\, \big( J_f(x)\, v \big).$$
This is exactly the operational behavior of forward mode:
- Push $v$ through $f$.
- Push the resulting tangent through $g$.
So the linearization transform is compositional. A large program can be linearized by linearizing its parts and wiring the tangent values through the same dependency graph.
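The two-step behavior can be checked with plain scalar functions. A sketch using the illustrative choices $f(x) = \sin x$ and $g(u) = u^2$:

```python
import math

def f(x):          # primal f
    return math.sin(x)

def push_f(x, v):  # pushforward of f at x
    return math.cos(x) * v

def g(u):          # primal g
    return u * u

def push_g(u, w):  # pushforward of g at u
    return 2.0 * u * w

def push_h(x, v):
    # Forward mode for h = g o f: push v through f, then through g.
    return push_g(f(x), push_f(x, v))

x, v = 0.7, 1.0
composed = push_h(x, v)
analytic = 2.0 * math.sin(x) * math.cos(x) * v  # chain rule by hand
```

No Jacobians are multiplied explicitly; the composition falls out of wiring one pushforward's output into the next.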
## Linearization of Control Flow
For executed control flow, linearization follows the path taken by the primal computation.
Consider

```
if x > 0:
    y = x * x
else:
    y = -x
```

For $x > 0$, the executed branch gives

$$y = x^2,$$

and the tangent rule is

$$dy = 2x\, dx.$$

For $x \le 0$, the executed branch gives

$$y = -x,$$

and the tangent rule is

$$dy = -dx.$$
At $x = 0$, the function has a kink, so the classical derivative does not exist. An AD system may still return a value, depending on which branch executes and how the primitive is defined.
This is an important practical point. AD differentiates the executed program path. It does not automatically replace non-smooth code with generalized calculus unless such rules are explicitly built into the system.
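A tangent version of the branching program above makes the point concrete (the function name is illustrative): at the kink $x = 0$ the `else` branch executes, so the system reports the derivative of `-x` even though no classical derivative exists there.

```python
def branchy_with_tangent(x, dx):
    # Differentiate the branch the primal computation actually executes.
    if x > 0:
        y = x * x
        dy = 2.0 * x * dx
    else:
        y = -x
        dy = -dx
    return y, dy

# At x = 0 the else branch runs, so the reported derivative is -1.
y0, dy0 = branchy_with_tangent(0.0, 1.0)
```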
## Linearization of Loops
Loops can be linearized by applying tangent propagation to each iteration.
Suppose a recurrence is written as

$$s_{t+1} = f(s_t, \theta),$$

where $s_t$ is the state and $\theta$ is a parameter vector. The tangent recurrence is

$$ds_{t+1} = \frac{\partial f}{\partial s}(s_t, \theta)\, ds_t + \frac{\partial f}{\partial \theta}(s_t, \theta)\, d\theta.$$
Forward mode propagates this tangent state through time together with the primal state.
This is useful for sensitivity analysis. If $\theta$ has a small number of components, forward propagation can compute how the trajectory changes under parameter perturbations.
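As a concrete sketch, take the illustrative scalar recurrence $s_{t+1} = \theta\, s_t (1 - s_t)$ and propagate the sensitivity $ds_t/d\theta$ together with the state:

```python
def trajectory_and_sensitivity(s0, theta, steps):
    # Primal state s_t and tangent state ds_t = ds_t/dtheta, stepped together.
    s, ds = s0, 0.0  # the initial state does not depend on theta
    for _ in range(steps):
        # df/ds = theta * (1 - 2s),  df/dtheta = s * (1 - s)
        s, ds = (theta * s * (1.0 - s),
                 theta * (1.0 - 2.0 * s) * ds + s * (1.0 - s))
    return s, ds

s_final, ds_dtheta = trajectory_and_sensitivity(0.2, 2.5, 10)
```

The tuple assignment uses the old state `s` on the right-hand side of both expressions, matching the recurrence's dependence on $s_t$ rather than $s_{t+1}$.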
## Matrix-Free Linearization
In large systems, the Jacobian may be too large to materialize.
For example, if

$$f : \mathbb{R}^{10^6} \to \mathbb{R}^{10^6},$$

then the Jacobian has $10^{12}$ entries. Storing it explicitly is infeasible.

But computing

$$J_f(x)\, v$$

may still be feasible. The result has only $10^6$ entries, the same shape as the output. Forward mode gives a matrix-free representation of the Jacobian action.
Many numerical algorithms need only this action. Krylov methods, sensitivity methods, implicit solvers, and stability analysis often work through products rather than explicit derivative matrices.
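For an elementwise map the idea is easy to see in code; this sketch uses `tanh` as an arbitrary illustrative nonlinearity and never forms the Jacobian:

```python
import math

def f(x):
    # Elementwise map: output i depends only on input i, so the Jacobian
    # is diagonal -- but it is never materialized.
    return [math.tanh(xi) for xi in x]

def jvp_f(x, v):
    # J_f(x) v computed entry by entry: d tanh(u)/du = 1 - tanh(u)^2.
    return [(1.0 - math.tanh(xi) ** 2) * vi for xi, vi in zip(x, v)]

n = 1_000_000                 # a dense Jacobian would need 10^12 entries
x = [1e-6 * i for i in range(n)]
v = [1.0] * n
jv = jvp_f(x, v)              # same shape as the output: 10^6 entries
```

Memory stays $O(n)$: one pass over the inputs produces the Jacobian action directly.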
## Linearization in AD Systems
Different AD systems expose linearization in different ways.
A library may provide a function like:

```
jvp(f, x, v) -> (f(x), J_f(x) v)
```

A compiler may transform code into a paired primal-tangent program.
A tracing system may record a graph and attach tangent propagation rules to graph nodes.
Despite implementation differences, the mathematical object is the same:

$$(x, v) \mapsto (f(x),\ J_f(x)\, v).$$

This object is the forward-mode derivative operator.
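A minimal `jvp` in the style of the signature above can be built from the dual-number idea. All names here are illustrative, not any particular library's API:

```python
class Dual:
    # (primal, tangent) pair with just the operations this sketch needs.
    def __init__(self, p, t):
        self.p, self.t = p, t
    def __add__(self, o):
        return Dual(self.p + o.p, self.t + o.t)
    def __mul__(self, o):
        return Dual(self.p * o.p, o.p * self.t + self.p * o.t)

def jvp(f, x, v):
    """Return (f(x), J_f(x) v) for a scalar program f built from Dual ops."""
    out = f(Dual(x, v))
    return out.p, out.t

# The program x -> x*x + x has derivative 2x + 1.
primal, tangent = jvp(lambda d: d * d + d, 3.0, 1.0)
```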
## Role in the Rest of AD
Linearization is the foundation for several later ideas.
Forward mode is direct evaluation of a linearized program. Reverse mode can be understood as transposing a linearized program. Higher-order AD differentiates linearized programs again. Implicit differentiation uses linearized equations. Numerical stability analysis studies how perturbations move through the linearized computation.
The important distinction is this:
The original program maps values to values. The linearized program maps values and input perturbations to values and output perturbations.
Once this distinction is clear, many AD mechanisms become systematic rather than mysterious.