Linearization

Linearization is the operation of replacing a nonlinear function by its best local linear approximation at a chosen point. Automatic differentiation can be understood as a machine for computing this local linear approximation, or products involving it, for programs.

Let

f : \mathbb{R}^n \to \mathbb{R}^m

At a point x, the value of the function is

y = f(x)

If the input is perturbed by a small vector Δx, the output changes by

\Delta y = f(x + \Delta x) - f(x)

For differentiable f, the first-order approximation is

f(x + \Delta x) \approx f(x) + J_f(x)\Delta x

The linear map

\Delta x \mapsto J_f(x)\Delta x

is the linearization of f at x.

Local Linear Models

A nonlinear function may have complicated global behavior. Near one point, however, a differentiable function behaves like an affine map:

x + \Delta x \mapsto f(x) + J_f(x)\Delta x

The affine approximation has two parts:

Part           Meaning
f(x)           base value
J_f(x) Δx      first-order change around the base value

The term J_f(x)Δx is linear in the perturbation Δx, but the full approximation also includes the offset f(x). This is why we distinguish between a linear map and an affine approximation.

For scalar functions,

f : \mathbb{R} \to \mathbb{R}

linearization gives the familiar tangent line:

f(x + \Delta x) \approx f(x) + f'(x)\Delta x
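
For example, with f(x) = x² at x = 3, the slope is f'(3) = 6, and the tangent line reads

f(3 + \Delta x) \approx 9 + 6\,\Delta x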

For vector functions, the tangent line generalizes to a tangent linear map.

Linearization of a Program

Consider the program

u = x * y
v = sin(u)
z = v + x

The corresponding function is

z = \sin(xy) + x

Linearization introduces perturbations for every value:

x \mapsto x + \dot{x}, \quad y \mapsto y + \dot{y}, \quad u \mapsto u + \dot{u}, \quad v \mapsto v + \dot{v}, \quad z \mapsto z + \dot{z}

The tangent program is obtained by differentiating each primitive operation locally:

u = xy, \quad \dot{u} = y\dot{x} + x\dot{y}
v = \sin u, \quad \dot{v} = \cos(u)\dot{u}
z = v + x, \quad \dot{z} = \dot{v} + \dot{x}

The pair of programs,

u  = x * y
du = y * dx + x * dy

v  = sin(u)
dv = cos(u) * du

z  = v + x
dz = dv + dx

computes both the original value and its first-order change. This is forward mode AD in its most direct form.
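
A minimal runnable rendering of this pair in Python (the function name f_and_tangent is illustrative):

from math import sin, cos

def f_and_tangent(x, y, dx, dy):
    # primal and tangent statements interleaved, as in the listing above
    u  = x * y
    du = y * dx + x * dy

    v  = sin(u)
    dv = cos(u) * du

    z  = v + x
    dz = dv + dx
    return z, dz

# directional derivative along (dx, dy) = (1, 0): dz/dx at (x, y) = (2, 3)
print(f_and_tangent(2.0, 3.0, 1.0, 0.0))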

Pushforward

The linearization of a function at a point is also called its pushforward.

For

f : X \to Y

the pushforward at x maps tangent vectors at x to tangent vectors at f(x):

f_{*x} : T_x X \to T_{f(x)} Y

In Euclidean spaces, this is just multiplication by the Jacobian:

f_{*x}(v) = J_f(x)v

Forward mode AD computes pushforwards. Given a primal value x and a tangent vector v, it computes

(f(x), J_f(x)v)

This is the value of the function and the pushed-forward tangent.
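
As a concrete worked instance (an illustrative example, not tied to any particular system), take f(x_1, x_2) = (x_1 x_2, \sin x_1). Then

J_f(x) = \begin{pmatrix} x_2 & x_1 \\ \cos x_1 & 0 \end{pmatrix}, \qquad J_f(x)v = \begin{pmatrix} x_2 v_1 + x_1 v_2 \\ (\cos x_1)\, v_1 \end{pmatrix}

and forward mode returns the pair (f(x), J_f(x)v) in one pass.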

Linearization as a Program Transform

Linearization can be described as a transformation on programs.

Given a program that computes

y=f(x) y = f(x)

the linearized program computes

(y, \dot{y}) = \operatorname{lin}(f)(x, \dot{x})

where

\dot{y} = J_f(x)\dot{x}

This transformed program runs alongside the original computation. Each primitive is replaced by a paired primal-and-tangent rule.

For example, addition becomes:

z = x + y, \quad \dot{z} = \dot{x} + \dot{y}

Multiplication becomes:

z = xy, \quad \dot{z} = y\dot{x} + x\dot{y}

Sine becomes:

z = \sin x, \quad \dot{z} = \cos(x)\dot{x}

Exponential becomes:

z = e^x, \quad \dot{z} = e^x\dot{x}

The transformed program remains executable ordinary code. This is one reason AD fits naturally into compilers and programming languages.
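
One way to realize these paired rules is operator overloading. The sketch below is a minimal illustration in plain Python; the Dual class and the d_sin and d_exp helpers are hypothetical names, not any particular library's API.

import math

class Dual:
    # a (primal, tangent) pair; each operation applies its paired rule
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent

    def __add__(self, other):                        # z = x + y
        return Dual(self.primal + other.primal,
                    self.tangent + other.tangent)    # dz = dx + dy

    def __mul__(self, other):                        # z = x * y
        return Dual(self.primal * other.primal,
                    other.primal * self.tangent
                    + self.primal * other.tangent)   # dz = y dx + x dy

def d_sin(x):                                        # z = sin x
    return Dual(math.sin(x.primal),
                math.cos(x.primal) * x.tangent)      # dz = cos(x) dx

def d_exp(x):                                        # z = exp x
    return Dual(math.exp(x.primal),
                math.exp(x.primal) * x.tangent)      # dz = exp(x) dx

# tangent of z = sin(x*y) + x with respect to x, at (x, y) = (2, 3)
x, y = Dual(2.0, 1.0), Dual(3.0, 0.0)
z = d_sin(x * y) + x
print(z.primal, z.tangent)  # sin(6) + 2 and 3 cos(6) + 1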

Linearization and Approximation Error

Linearization keeps only first-order terms. The approximation error is higher order in the perturbation.

For a scalar function with sufficient smoothness,

f(x + \Delta x) = f(x) + f'(x)\Delta x + O((\Delta x)^2)

For a vector function,

f(x + \Delta x) = f(x) + J_f(x)\Delta x + O(\|\Delta x\|^2)

The notation O(\|\Delta x\|^2) means that the ignored error shrinks quadratically as the perturbation norm goes to zero.

AD computes the first-order term exactly, up to floating point rounding in the operations actually executed. It does not estimate the derivative by probing the function with small perturbations. That is what separates AD from finite differences.

Linearization vs Finite Differences

Finite differences approximate derivatives by evaluating the function at nearby points:

\frac{f(x + \epsilon v) - f(x)}{\epsilon} \approx J_f(x)v

This approximation depends on the step size ε. If ε is too large, truncation error dominates. If ε is too small, floating point cancellation dominates.

Forward mode AD computes the same directional derivative quantity,

J_f(x)v

but without subtracting nearly equal function values. It propagates the derivative algebra through each primitive operation.

This makes AD much more accurate than finite differences for many computations, while retaining a program-execution model rather than symbolic expression manipulation.
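
A small experiment makes the contrast concrete. The sketch below (plain Python, with a hand-written reference derivative) shows the finite-difference error first falling with the step size and then rising again from cancellation:

import math

def f(x):
    return math.sin(x) * x

def fprime(x):
    # hand-written reference: d/dx [x sin x] = sin x + x cos x
    return math.sin(x) + x * math.cos(x)

x, v = 1.5, 1.0
exact = fprime(x) * v
for eps in (1e-1, 1e-5, 1e-9, 1e-13):
    fd = (f(x + eps * v) - f(x)) / eps
    print(f"eps={eps:.0e}  abs error={abs(fd - exact):.2e}")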

Linearization and Composition

Linearization respects composition.

If

f = h \circ g

then

J_f(x) = J_h(g(x))\,J_g(x)

Applying this to a perturbation v,

J_f(x)v = J_h(g(x)) \bigl( J_g(x)v \bigr)

This is exactly the operational behavior of forward mode:

  1. Push v through g.
  2. Push the resulting tangent through h.

So the linearization transform is compositional. A large program can be linearized by linearizing its parts and wiring the tangent values through the same dependency graph.
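
The two steps above can be written directly as code. This is an illustrative sketch in plain Python with hand-written pushforwards for g(x) = x² and h(u) = sin u:

import math

def g_jvp(x, dx):
    # g(x) = x**2, pushforward J_g(x) v = 2 x v
    return x * x, 2.0 * x * dx

def h_jvp(u, du):
    # h(u) = sin u, pushforward J_h(u) v = cos(u) v
    return math.sin(u), math.cos(u) * du

def f_jvp(x, dx):
    # f = h ∘ g: push the tangent through g, then through h
    u, du = g_jvp(x, dx)
    return h_jvp(u, du)

y, dy = f_jvp(1.0, 1.0)
print(y, dy)  # sin(1) and 2 cos(1), i.e. J_h(g(x)) J_g(x) v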

Linearization of Control Flow

For executed control flow, linearization follows the path taken by the primal computation.

Consider

if x > 0:
    y = x * x
else:
    y = -x

For x > 0, the executed branch gives

y = x^2

and the tangent rule is

\dot{y} = 2x\dot{x}

For x < 0, the executed branch gives

y = -x

and the tangent rule is

\dot{y} = -\dot{x}

At x = 0, the function has a kink, so the classical derivative does not exist. An AD system may still return a value depending on which branch executes and how the primitive is defined.

This is an important practical point. AD differentiates the executed program path. It does not automatically replace non-smooth code with generalized calculus unless such rules are explicitly built into the system.
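
A small sketch (plain Python, hand-written tangent rules) shows this path-following behavior, including what happens at the kink:

def piecewise_jvp(x, dx):
    # each branch carries the tangent rule for the code it executes
    if x > 0:
        return x * x, 2.0 * x * dx   # y = x**2, dy = 2 x dx
    return -x, -dx                   # y = -x,   dy = -dx

print(piecewise_jvp(2.0, 1.0))   # (4.0, 4.0)
print(piecewise_jvp(-2.0, 1.0))  # (2.0, -1.0)
print(piecewise_jvp(0.0, 1.0))   # (-0.0, -1.0): the else branch runs at the kink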

Linearization of Loops

Loops can be linearized by applying tangent propagation to each iteration.

Suppose a recurrence is written as

s_{t+1} = F(s_t, \theta)

where s_t is the state and θ is a parameter vector. The tangent recurrence is

\dot{s}_{t+1} = \frac{\partial F}{\partial s}(s_t, \theta)\dot{s}_t + \frac{\partial F}{\partial \theta}(s_t, \theta)\dot{\theta}

Forward mode propagates this tangent state through time together with the primal state.

This is useful for sensitivity analysis. If θ has a small number of components, forward propagation can compute how the trajectory changes under parameter perturbations.
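
As an illustrative sketch (the recurrence and its partial derivatives below are hand-written for a made-up F, not a library call), the primal and tangent states can be advanced together:

def F(s, theta):
    # one illustrative step: s_{t+1} = theta * s_t * (1 - s_t)
    return theta * s * (1.0 - s)

def F_tangent(s, theta, ds, dtheta):
    # dF/ds = theta (1 - 2 s),  dF/dtheta = s (1 - s)
    return theta * (1.0 - 2.0 * s) * ds + s * (1.0 - s) * dtheta

s, ds = 0.3, 0.0            # the initial state does not depend on theta
theta, dtheta = 2.5, 1.0    # perturb theta in the direction dtheta = 1
for _ in range(50):
    s, ds = F(s, theta), F_tangent(s, theta, ds, dtheta)
print(s, ds)                # ds approximates d s_50 / d theta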

Matrix-Free Linearization

In large systems, the Jacobian may be too large to materialize.

For example, if

f : \mathbb{R}^{10^7} \to \mathbb{R}^{10^7}

then the Jacobian has 10^{14} entries. Storing it explicitly is infeasible.

But computing

J_f(x)v

may still be feasible. The result has only 10^7 entries, the same shape as the output. Forward mode gives a matrix-free representation of the Jacobian action.

Many numerical algorithms need only this action. Krylov methods, sensitivity methods, implicit solvers, and stability analysis often work through products rather than explicit derivative matrices.
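
For an elementwise map the point is easy to see in plain Python (an illustrative sketch; n is kept small so the demo runs quickly, but nothing changes in principle at 10^7):

import math

def f_jvp(x, v):
    # f(x)_i = x_i * sin(x_i); the Jacobian is diagonal and never formed
    y  = [xi * math.sin(xi) for xi in x]
    jv = [(math.sin(xi) + xi * math.cos(xi)) * vi for xi, vi in zip(x, v)]
    return y, jv

n = 10**5
x = [i / n for i in range(n)]
v = [1.0] * n
y, jv = f_jvp(x, v)
print(len(jv))  # n entries: the same shape as the output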

Linearization in AD Systems

Different AD systems expose linearization in different ways.

A library may provide a function like:

jvp(f, x, v) -> (f(x), J_f(x)v)

A compiler may transform code into a paired primal-tangent program.

A tracing system may record a graph and attach tangent propagation rules to graph nodes.

Despite implementation differences, the mathematical object is the same:

(x, v) \mapsto (f(x), J_f(x)v)

This object is the forward-mode derivative operator.
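
JAX is one concrete system that exposes this operator: jax.jvp takes the function, a tuple of primal inputs, and a tuple of tangents. Applied to the running example:

import jax
import jax.numpy as jnp

def f(x):
    # z = sin(x * y) + x, with x = x[0] and y = x[1]
    return jnp.sin(x[0] * x[1]) + x[0]

x = jnp.array([2.0, 3.0])
v = jnp.array([1.0, 0.0])       # perturb only the first input
y, jv = jax.jvp(f, (x,), (v,))  # returns (f(x), J_f(x) v)
print(y, jv)                    # sin(6) + 2 and 3 cos(6) + 1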

Role in the Rest of AD

Linearization is the foundation for several later ideas.

Forward mode is direct evaluation of a linearized program. Reverse mode can be understood as transposing a linearized program. Higher-order AD differentiates linearized programs again. Implicit differentiation uses linearized equations. Numerical stability analysis studies how perturbations move through the linearized computation.

The important distinction is this:

The original program maps values to values. The linearized program maps values and input perturbations to values and output perturbations.

Once this distinction is clear, many AD mechanisms become systematic rather than mysterious.