Linearization

Linearization is the operation of replacing a nonlinear function by its best local linear approximation at a chosen point. Automatic differentiation can be understood as a machine for computing this local linear approximation, or products involving it, for programs.

Let

f : \mathbb{R}^n \to \mathbb{R}^m

At a point x, the value of the function is

y = f(x)

If the input is perturbed by a small vector Δx, the output changes by

\Delta y = f(x + \Delta x) - f(x)

For differentiable f, the first-order approximation is

f(x + \Delta x) \approx f(x) + J_f(x)\Delta x

The linear map

\Delta x \mapsto J_f(x)\Delta x

is the linearization of f at x.

Local Linear Models

A nonlinear function may have complicated global behavior. Near one point, however, a differentiable function behaves like an affine map:

x + \Delta x \mapsto f(x) + J_f(x)\Delta x

The affine approximation has two parts:

Part           Meaning
f(x)           base value
J_f(x) Δx      first-order change around the base value

The term J_f(x)Δx is linear in the perturbation Δx, but the full approximation also includes the offset f(x). This is why we distinguish between a linear map and an affine approximation.

For scalar functions,

f : \mathbb{R} \to \mathbb{R}

linearization gives the familiar tangent line:

f(x + \Delta x) \approx f(x) + f'(x)\Delta x
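
For example, with f(x) = x² at x = 3, the slope is f'(3) = 6, and the tangent line reads

f(3 + \Delta x) \approx 9 + 6\,\Delta x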

For vector functions, the tangent line generalizes to a tangent linear map.

Linearization of a Program

Consider the program

u = x * y
v = sin(u)
z = v + x

The corresponding function is

z = \sin(xy) + x

Linearization introduces perturbations for every value:

x \mapsto x + \dot{x}, \quad y \mapsto y + \dot{y}, \quad u \mapsto u + \dot{u}, \quad v \mapsto v + \dot{v}, \quad z \mapsto z + \dot{z}

The tangent program is obtained by differentiating each primitive operation locally:

u = xy, \quad \dot{u} = y\dot{x} + x\dot{y}
v = \sin u, \quad \dot{v} = \cos(u)\dot{u}
z = v + x, \quad \dot{z} = \dot{v} + \dot{x}

The pair of programs,

u  = x * y
du = y * dx + x * dy

v  = sin(u)
dv = cos(u) * du

z  = v + x
dz = dv + dx

computes both the original value and its first-order change. This is forward mode AD in its most direct form.
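
A minimal runnable rendering of this pair in Python (the function name f_and_tangent is illustrative):

from math import sin, cos

def f_and_tangent(x, y, dx, dy):
    # primal and tangent statements interleaved, as in the listing above
    u  = x * y
    du = y * dx + x * dy

    v  = sin(u)
    dv = cos(u) * du

    z  = v + x
    dz = dv + dx
    return z, dz

# directional derivative along (dx, dy) = (1, 0): dz/dx at (x, y) = (2, 3)
print(f_and_tangent(2.0, 3.0, 1.0, 0.0))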

Pushforward

The linearization of a function at a point is also called its pushforward.

For

f : X \to Y

the pushforward at x maps tangent vectors at x to tangent vectors at f(x):

f_{*x} : T_x X \to T_{f(x)} Y

In Euclidean spaces, this is just multiplication by the Jacobian:

f_{*x}(v) = J_f(x)v

Forward mode AD computes pushforwards. Given a primal value x and a tangent vector v, it computes

(f(x), J_f(x)v)

This is the value of the function and the pushed-forward tangent.
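
As a concrete worked instance (an illustrative example, not tied to any particular system), take f(x_1, x_2) = (x_1 x_2, \sin x_1). Then

J_f(x) = \begin{pmatrix} x_2 & x_1 \\ \cos x_1 & 0 \end{pmatrix}, \qquad J_f(x)v = \begin{pmatrix} x_2 v_1 + x_1 v_2 \\ (\cos x_1)\, v_1 \end{pmatrix}

and forward mode returns the pair (f(x), J_f(x)v) in one pass.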

Linearization as a Program Transform

Linearization can be described as a transformation on programs.

Given a program that computes

y=f(x) y = f(x)

the linearized program computes

(y, \dot{y}) = \operatorname{lin}(f)(x, \dot{x})

where

\dot{y} = J_f(x)\dot{x}

This transformed program runs alongside the original computation. Each primitive is replaced by a paired primal-and-tangent rule.

For example, addition becomes:

z = x + y, \quad \dot{z} = \dot{x} + \dot{y}

Multiplication becomes:

z = xy, \quad \dot{z} = y\dot{x} + x\dot{y}

Sine becomes:

z = \sin x, \quad \dot{z} = \cos(x)\dot{x}

Exponential becomes:

z = e^x, \quad \dot{z} = e^x\dot{x}

The transformed program remains executable ordinary code. This is one reason AD fits naturally into compilers and programming languages.
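
One way to realize these paired rules is operator overloading. The sketch below is a minimal illustration in plain Python; the Dual class and the d_sin and d_exp helpers are hypothetical names, not any particular library's API.

import math

class Dual:
    # a (primal, tangent) pair; each operation applies its paired rule
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent

    def __add__(self, other):                        # z = x + y
        return Dual(self.primal + other.primal,
                    self.tangent + other.tangent)    # dz = dx + dy

    def __mul__(self, other):                        # z = x * y
        return Dual(self.primal * other.primal,
                    other.primal * self.tangent
                    + self.primal * other.tangent)   # dz = y dx + x dy

def d_sin(x):                                        # z = sin x
    return Dual(math.sin(x.primal),
                math.cos(x.primal) * x.tangent)      # dz = cos(x) dx

def d_exp(x):                                        # z = exp x
    return Dual(math.exp(x.primal),
                math.exp(x.primal) * x.tangent)      # dz = exp(x) dx

# tangent of z = sin(x*y) + x with respect to x, at (x, y) = (2, 3)
x, y = Dual(2.0, 1.0), Dual(3.0, 0.0)
z = d_sin(x * y) + x
print(z.primal, z.tangent)  # sin(6) + 2 and 3 cos(6) + 1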

Linearization and Approximation Error

Linearization keeps only first-order terms. The approximation error is higher order in the perturbation.

For a scalar function with sufficient smoothness,

f(x + \Delta x) = f(x) + f'(x)\Delta x + O((\Delta x)^2)

For a vector function,

f(x + \Delta x) = f(x) + J_f(x)\Delta x + O(\|\Delta x\|^2)

The notation O(\|\Delta x\|^2) means that the ignored error shrinks quadratically as the perturbation norm goes to zero.

AD computes the first-order term exactly, up to floating point rounding in the operations actually executed. It does not estimate the derivative by probing the function with small perturbations. That is what separates AD from finite differences.

Linearization vs Finite Differences

Finite differences approximate derivatives by evaluating the function at nearby points:

\frac{f(x + \epsilon v) - f(x)}{\epsilon} \approx J_f(x)v

This approximation depends on the step size ε. If ε is too large, truncation error dominates. If ε is too small, floating point cancellation dominates.

Forward mode AD computes the same directional derivative quantity,

J_f(x)v

but without subtracting nearly equal function values. It propagates the derivative algebra through each primitive operation.

This makes AD much more accurate than finite differences for many computations, while retaining a program-execution model rather than symbolic expression manipulation.
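
A small experiment makes the contrast concrete. The sketch below (plain Python, with a hand-written reference derivative) shows the finite-difference error first falling with the step size and then rising again from cancellation:

import math

def f(x):
    return math.sin(x) * x

def fprime(x):
    # hand-written reference: d/dx [x sin x] = sin x + x cos x
    return math.sin(x) + x * math.cos(x)

x, v = 1.5, 1.0
exact = fprime(x) * v
for eps in (1e-1, 1e-5, 1e-9, 1e-13):
    fd = (f(x + eps * v) - f(x)) / eps
    print(f"eps={eps:.0e}  abs error={abs(fd - exact):.2e}")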

Linearization and Composition

Linearization respects composition.

If

f = h \circ g

then

J_f(x) = J_h(g(x))\,J_g(x)

Applying this to a perturbation v,

J_f(x)v = J_h(g(x)) \bigl( J_g(x)v \bigr)

This is exactly the operational behavior of forward mode:

  1. Push v through g.
  2. Push the resulting tangent through h.

So the linearization transform is compositional. A large program can be linearized by linearizing its parts and wiring the tangent values through the same dependency graph.
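
The two steps above can be written directly as code. This is an illustrative sketch in plain Python with hand-written pushforwards for g(x) = x² and h(u) = sin u:

import math

def g_jvp(x, dx):
    # g(x) = x**2, pushforward J_g(x) v = 2 x v
    return x * x, 2.0 * x * dx

def h_jvp(u, du):
    # h(u) = sin u, pushforward J_h(u) v = cos(u) v
    return math.sin(u), math.cos(u) * du

def f_jvp(x, dx):
    # f = h ∘ g: push the tangent through g, then through h
    u, du = g_jvp(x, dx)
    return h_jvp(u, du)

y, dy = f_jvp(1.0, 1.0)
print(y, dy)  # sin(1) and 2 cos(1), i.e. J_h(g(x)) J_g(x) v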

Linearization of Control Flow

For executed control flow, linearization follows the path taken by the primal computation.

Consider

if x > 0:
    y = x * x
else:
    y = -x

For x > 0, the executed branch gives

y = x^2

and the tangent rule is

\dot{y} = 2x\dot{x}

For x < 0, the executed branch gives

y = -x

and the tangent rule is

\dot{y} = -\dot{x}

At x = 0, the function has a kink, so the classical derivative does not exist. An AD system may still return a value depending on which branch executes and how the primitive is defined.

This is an important practical point. AD differentiates the executed program path. It does not automatically replace non-smooth code with generalized calculus unless such rules are explicitly built into the system.
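
A small sketch (plain Python, hand-written tangent rules) shows this path-following behavior, including what happens at the kink:

def piecewise_jvp(x, dx):
    # each branch carries the tangent rule for the code it executes
    if x > 0:
        return x * x, 2.0 * x * dx   # y = x**2, dy = 2 x dx
    return -x, -dx                   # y = -x,   dy = -dx

print(piecewise_jvp(2.0, 1.0))   # (4.0, 4.0)
print(piecewise_jvp(-2.0, 1.0))  # (2.0, -1.0)
print(piecewise_jvp(0.0, 1.0))   # (-0.0, -1.0): the else branch runs at the kink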

Linearization of Loops

Loops can be linearized by applying tangent propagation to each iteration.

Suppose a recurrence is written as

s_{t+1} = F(s_t, \theta)

where s_t is the state and θ is a parameter vector. The tangent recurrence is

\dot{s}_{t+1} = \frac{\partial F}{\partial s}(s_t, \theta)\dot{s}_t + \frac{\partial F}{\partial \theta}(s_t, \theta)\dot{\theta}

Forward mode propagates this tangent state through time together with the primal state.

This is useful for sensitivity analysis. If θ has a small number of components, forward propagation can compute how the trajectory changes under parameter perturbations.
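
As an illustrative sketch (the recurrence and its partial derivatives below are hand-written for a made-up F, not a library call), the primal and tangent states can be advanced together:

def F(s, theta):
    # one illustrative step: s_{t+1} = theta * s_t * (1 - s_t)
    return theta * s * (1.0 - s)

def F_tangent(s, theta, ds, dtheta):
    # dF/ds = theta (1 - 2 s),  dF/dtheta = s (1 - s)
    return theta * (1.0 - 2.0 * s) * ds + s * (1.0 - s) * dtheta

s, ds = 0.3, 0.0            # the initial state does not depend on theta
theta, dtheta = 2.5, 1.0    # perturb theta in the direction dtheta = 1
for _ in range(50):
    s, ds = F(s, theta), F_tangent(s, theta, ds, dtheta)
print(s, ds)                # ds approximates d s_50 / d theta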

Matrix-Free Linearization

In large systems, the Jacobian may be too large to materialize.

For example, if

f : \mathbb{R}^{10^7} \to \mathbb{R}^{10^7}

then the Jacobian has 10^{14} entries. Storing it explicitly is infeasible.

But computing

J_f(x)v

may still be feasible. The result has only 10^7 entries, the same shape as the output. Forward mode gives a matrix-free representation of the Jacobian action.

Many numerical algorithms need only this action. Krylov methods, sensitivity methods, implicit solvers, and stability analysis often work through products rather than explicit derivative matrices.
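
For an elementwise map the point is easy to see in plain Python (an illustrative sketch; n is kept small so the demo runs quickly, but nothing changes in principle at 10^7):

import math

def f_jvp(x, v):
    # f(x)_i = x_i * sin(x_i); the Jacobian is diagonal and never formed
    y  = [xi * math.sin(xi) for xi in x]
    jv = [(math.sin(xi) + xi * math.cos(xi)) * vi for xi, vi in zip(x, v)]
    return y, jv

n = 10**5
x = [i / n for i in range(n)]
v = [1.0] * n
y, jv = f_jvp(x, v)
print(len(jv))  # n entries: the same shape as the output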

Linearization in AD Systems

Different AD systems expose linearization in different ways.

A library may provide a function like:

jvp(f, x, v) -> (f(x), J_f(x)v)

A compiler may transform code into a paired primal-tangent program.

A tracing system may record a graph and attach tangent propagation rules to graph nodes.

Despite implementation differences, the mathematical object is the same:

(x, v) \mapsto (f(x), J_f(x)v)

This object is the forward-mode derivative operator.
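
JAX is one concrete system that exposes this operator: jax.jvp takes the function, a tuple of primal inputs, and a tuple of tangents. Applied to the running example:

import jax
import jax.numpy as jnp

def f(x):
    # z = sin(x * y) + x, with x = x[0] and y = x[1]
    return jnp.sin(x[0] * x[1]) + x[0]

x = jnp.array([2.0, 3.0])
v = jnp.array([1.0, 0.0])       # perturb only the first input
y, jv = jax.jvp(f, (x,), (v,))  # returns (f(x), J_f(x) v)
print(y, jv)                    # sin(6) + 2 and 3 cos(6) + 1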

Role in the Rest of AD

Linearization is the foundation for several later ideas.

Forward mode is direct evaluation of a linearized program. Reverse mode can be understood as transposing a linearized program. Higher-order AD differentiates linearized programs again. Implicit differentiation uses linearized equations. Numerical stability analysis studies how perturbations move through the linearized computation.

The important distinction is this:

The original program maps values to values. The linearized program maps values and input perturbations to values and output perturbations.

Once this distinction is clear, many AD mechanisms become systematic rather than mysterious.