Chapter 18. Advanced Topics

Implicit Differentiation

Many programs do not compute their output by applying a fixed sequence of explicit operations. Instead, they define the output as the solution of another problem.

A linear solver returns x such that

$$Ax = b.$$

An optimizer returns z such that

$$\nabla_z L(z,\theta) = 0.$$

A fixed-point solver returns x such that

$$x = F(x,\theta).$$

In all three cases, the output depends on input parameters, but the dependence is indirect. We do not write

$$x = f(\theta)$$

as a closed-form expression. We define x by a condition that it must satisfy.

Implicit differentiation gives derivatives of such outputs without differentiating through every internal iteration of the solver.

Explicit vs Implicit Definitions

In explicit differentiation, we have a function

$$y = f(x)$$

and compute

$$\frac{dy}{dx}.$$

In implicit differentiation, the variable of interest is defined by an equation

$$F(y,x) = 0.$$

Here x is the input parameter, and y is the value chosen so that the equation holds.

Assume that y changes smoothly with x. Then there exists a local function

$$y = g(x)$$

such that

$$F(g(x), x) = 0.$$

Differentiate both sides with respect to x:

$$\frac{d}{dx} F(g(x), x) = 0.$$

By the chain rule,

$$F_y \frac{dy}{dx} + F_x = 0.$$

Therefore,

$$\frac{dy}{dx} = - F_y^{-1} F_x.$$

This is the central formula of implicit differentiation.

For scalar x and scalar y, this becomes

$$\frac{dy}{dx} = - \frac{\partial F/\partial x}{\partial F/\partial y}.$$

For vector-valued systems, F_y is a Jacobian matrix with respect to the solved variable, and F_x is a Jacobian with respect to the parameter.

Example: Scalar Equation

Suppose y is defined by

$$y^2 + x y - 1 = 0.$$

Let

$$F(y,x) = y^2 + x y - 1.$$

Then

$$F_y = 2y + x, \qquad F_x = y.$$

So

$$\frac{dy}{dx} = - \frac{y}{2y + x}.$$

We did not solve for y explicitly. The derivative is expressed in terms of the solution y and the input x.

This is the main advantage: once the solver gives us y, the derivative can often be computed by solving a linear system, not by replaying the whole solve process.
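
As a concrete check, here is a minimal sketch in JAX that evaluates the implicit formula at the solution and compares it with differentiating a closed-form root directly. The names `residual` and `solve_y` are illustrative, not from any particular library; the closed form is used only to verify the result.

```python
import jax
import jax.numpy as jnp

def residual(y, x):
    # F(y, x) = y^2 + x*y - 1, the condition that defines y implicitly.
    return y**2 + x * y - 1.0

def solve_y(x):
    # Explicit positive root of y^2 + x*y - 1 = 0, used only as a reference.
    return (-x + jnp.sqrt(x**2 + 4.0)) / 2.0

x = 0.7
y = solve_y(x)

# Implicit-differentiation formula: dy/dx = -F_x / F_y, evaluated at the solution.
F_y = jax.grad(residual, argnums=0)(y, x)
F_x = jax.grad(residual, argnums=1)(y, x)
dy_dx_implicit = -F_x / F_y

# Reference: differentiate the closed-form root directly.
dy_dx_explicit = jax.grad(solve_y)(x)

print(dy_dx_implicit, dy_dx_explicit)  # the two values should agree
```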

Vector Form

Let

$$F(z,\theta) = 0,$$

where

$$z \in \mathbb{R}^n, \qquad \theta \in \mathbb{R}^p, \qquad F : \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}^n.$$

Assume the Jacobian

$$F_z = \frac{\partial F}{\partial z}$$

is invertible at the solution.

Then

$$F_z \frac{\partial z}{\partial \theta} + F_\theta = 0.$$

Hence

$$\frac{\partial z}{\partial \theta} = - F_z^{-1} F_\theta.$$

In implementation, one usually avoids forming the inverse. Instead, solve the linear system

$$F_z X = -F_\theta.$$

The result X is the Jacobian of the solution with respect to the parameters.
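
The sketch below, in JAX, follows this pattern on a small made-up system: a hand-rolled Newton solver produces z, and the Jacobian of the solution is then obtained from one linear solve with F_z. The residual and the solver (`residual`, `newton_solve`) are illustrative, not a library API.

```python
import jax
import jax.numpy as jnp

def residual(z, theta):
    # A small strongly monotone system, chosen so Newton converges from zero.
    A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
    return A @ z + 0.1 * jnp.sin(z) - theta

def newton_solve(theta, z0, iters=20):
    z = z0
    for _ in range(iters):
        J = jax.jacfwd(residual, argnums=0)(z, theta)
        z = z - jnp.linalg.solve(J, residual(z, theta))
    return z

theta = jnp.array([1.0, -2.0])
z = newton_solve(theta, jnp.zeros(2))

# Jacobians of the residual at the solution, supplied by AD.
F_z = jax.jacfwd(residual, argnums=0)(z, theta)      # n x n
F_theta = jax.jacfwd(residual, argnums=1)(z, theta)  # n x p

# Solve F_z X = -F_theta instead of forming the inverse.
dz_dtheta = jnp.linalg.solve(F_z, -F_theta)
print(dz_dtheta)
```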

Reverse Mode Form

In machine learning and scientific computing, we often need gradients of a scalar loss

$$\ell(z,\theta)$$

where z is defined implicitly by

$$F(z,\theta) = 0.$$

Reverse mode avoids computing the full Jacobian

$$\frac{\partial z}{\partial \theta}.$$

Let

$$\bar{z} = \frac{\partial \ell}{\partial z}.$$

We need the contribution

$$\bar{\theta} = \bar{z} \frac{\partial z}{\partial \theta}.$$

From implicit differentiation,

$$\frac{\partial z}{\partial \theta} = - F_z^{-1} F_\theta.$$

So

$$\bar{\theta} = - \bar{z} F_z^{-1} F_\theta.$$

Define an adjoint variable λ by solving

$$F_z^T \lambda = \bar{z}^T.$$

Then

$$\bar{\theta} = - \lambda^T F_\theta.$$

This is the reverse-mode implicit differentiation rule.

It requires one linear solve involving

$$F_z^T.$$

It does not require differentiating through every solver iteration.
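
A minimal JAX sketch of the adjoint rule, reusing the illustrative residual from above: one transposed solve gives the adjoint, and one vector-Jacobian product gives the parameter gradient, without ever forming the Jacobian of z with respect to theta. The loss and residual here are placeholders.

```python
import jax
import jax.numpy as jnp

def residual(z, theta):
    A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
    return A @ z + 0.1 * jnp.sin(z) - theta

def loss(z):
    return jnp.sum(z**2)

theta = jnp.array([1.0, -2.0])

# Primal solve (Newton), so that residual(z, theta) is approximately zero.
z = jnp.zeros(2)
for _ in range(15):
    J = jax.jacfwd(residual, argnums=0)(z, theta)
    z = z - jnp.linalg.solve(J, residual(z, theta))

zbar = jax.grad(loss)(z)                         # dl/dz at the solution

# Adjoint solve: F_z^T lam = zbar.
F_z = jax.jacfwd(residual, argnums=0)(z, theta)
lam = jnp.linalg.solve(F_z.T, zbar)

# thetabar = -lam^T F_theta, computed as a vector-Jacobian product in theta.
_, vjp_theta = jax.vjp(lambda th: residual(z, th), theta)
thetabar = -vjp_theta(lam)[0]
print(thetabar)
```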

Fixed-Point Differentiation

A common implicit form is a fixed point:

$$z = G(z,\theta).$$

Rewrite it as

$$F(z,\theta) = z - G(z,\theta) = 0.$$

Then

$$F_z = I - G_z, \qquad F_\theta = -G_\theta.$$

Therefore,

$$\frac{\partial z}{\partial \theta} = (I - G_z)^{-1} G_\theta.$$

In reverse mode, solve

$$(I - G_z)^T \lambda = \bar{z}^T,$$

then compute

$$\bar{\theta} = \lambda^T G_\theta.$$

This appears in equilibrium models, recurrent computation at convergence, iterative refinement, and some differentiable physics systems.
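
Here is a small JAX sketch of the fixed-point case, assuming an illustrative contraction G (the matrix W is chosen small so plain iteration converges). The forward pass is ordinary fixed-point iteration; the backward pass uses one solve with (I - G_z)^T.

```python
import jax
import jax.numpy as jnp

W = jnp.array([[0.3, -0.2], [0.1, 0.25]])   # small weights, so G is a contraction

def G(z, theta):
    return jnp.tanh(W @ z + theta)

theta = jnp.array([0.4, -0.7])

# Primal: plain fixed-point iteration to convergence.
z = jnp.zeros(2)
for _ in range(100):
    z = G(z, theta)

zbar = 2.0 * z                               # d/dz of the loss sum(z**2), say

# Reverse mode: (I - G_z)^T lam = zbar, then thetabar = lam^T G_theta.
G_z = jax.jacfwd(G, argnums=0)(z, theta)
lam = jnp.linalg.solve((jnp.eye(2) - G_z).T, zbar)

_, vjp_theta = jax.vjp(lambda th: G(z, th), theta)
thetabar = vjp_theta(lam)[0]
print(thetabar)
```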

Optimization as an Implicit Layer

Suppose z is the minimizer of

$$L(z,\theta).$$

At a smooth local optimum,

$$\nabla_z L(z,\theta) = 0.$$

Define

$$F(z,\theta) = \nabla_z L(z,\theta).$$

Then

$$F_z = \nabla^2_{zz} L(z,\theta),$$

the Hessian with respect to z, and

$$F_\theta = \nabla^2_{z\theta} L(z,\theta).$$

Thus,

$$\frac{\partial z}{\partial \theta} = - \left(\nabla^2_{zz} L\right)^{-1} \nabla^2_{z\theta} L.$$

This is the basis of differentiating through optimization problems. It is used in bilevel optimization, meta-learning, differentiable control, and optimization layers.

The formula is valid only when the optimum is locally well behaved. If the Hessian is singular, or if the optimizer reaches a boundary or a non-smooth point, the derivative may be undefined or set-valued.
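
As a sanity check, the JAX sketch below differentiates through the minimizer of an illustrative quadratic objective, where the argmin has a closed form and the Hessian-based formula can be verified directly. The matrix A and the objective L are made up for this example.

```python
import jax
import jax.numpy as jnp

A = jnp.array([[2.0, 0.3], [0.3, 1.0]])     # symmetric positive definite

def L(z, theta):
    # Quadratic objective: minimizer is z = A^{-1} theta.
    return 0.5 * z @ (A @ z) - theta @ z

theta = jnp.array([1.0, -0.5])
z = jnp.linalg.solve(A, theta)              # exact minimizer of this quadratic

# Second derivatives at the optimum, supplied by AD.
H_zz = jax.hessian(L, argnums=0)(z, theta)                          # d^2 L / dz dz
H_ztheta = jax.jacfwd(jax.grad(L, argnums=0), argnums=1)(z, theta)  # d^2 L / dz dtheta

# dz/dtheta = -(H_zz)^{-1} H_ztheta.
dz_dtheta = -jnp.linalg.solve(H_zz, H_ztheta)
print(dz_dtheta)     # should equal A^{-1}, the Jacobian of z = A^{-1} theta
```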

Why Not Differentiate Through the Solver?

A solver may run for hundreds or thousands of iterations. Differentiating through every iteration has several costs.

First, reverse mode must retain or reconstruct intermediate states. This can consume a large amount of memory.

Second, the gradient depends on the exact sequence of iterations. If the solver is only an algorithm for finding the solution, not part of the mathematical model, this derivative may be the wrong abstraction.

Third, long iterative chains can produce unstable gradients. The derivative may vanish, explode, or reflect numerical artifacts of the solver rather than sensitivity of the solved system.

Implicit differentiation treats the solver as a way to satisfy an equation. The derivative comes from the equation, not from the path taken to solve it.

Computational Pattern

A typical implicit differentiation implementation has four parts.

  1. Run the primal solver and obtain z.
  2. Define the residual function F(z, θ).
  3. Use AD to compute Jacobian-vector products or vector-Jacobian products involving F_z and F_θ.
  4. Solve the required linear system for the tangent or adjoint.

For forward mode, solve

$$F_z \dot{z} = -F_\theta \dot{\theta}.$$

For reverse mode, solve

$$F_z^T \lambda = \bar{z}^T.$$

Then compute

$$\bar{\theta} = -\lambda^T F_\theta.$$

Large systems rarely form F_z explicitly. They use matrix-free methods such as conjugate gradient, GMRES, or custom linear solvers, where each matrix-vector product is computed by AD.
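
The following JAX sketch puts the four steps together in matrix-free form, under the same illustrative residual as before: the product F_z^T v is a vector-Jacobian product supplied by AD, and the adjoint system is solved with GMRES without ever materializing F_z.

```python
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import gmres

def residual(z, theta):
    A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
    return A @ z + 0.1 * jnp.sin(z) - theta

theta = jnp.array([1.0, -2.0])

# Step 1: run the primal solver (Newton) to obtain z.
z = jnp.zeros(2)
for _ in range(15):
    J = jax.jacfwd(residual, argnums=0)(z, theta)
    z = z - jnp.linalg.solve(J, residual(z, theta))

zbar = 2.0 * z                                  # cotangent from a downstream loss

# Steps 2-3: define the residual and an AD-supplied matvec v -> F_z^T v (a VJP in z).
_, vjp_z = jax.vjp(lambda zz: residual(zz, theta), z)
matvec = lambda v: vjp_z(v)[0]

# Step 4: matrix-free linear solve for the adjoint, then one VJP in theta.
lam, _ = gmres(matvec, zbar)
_, vjp_theta = jax.vjp(lambda th: residual(z, th), theta)
thetabar = -vjp_theta(lam)[0]
print(thetabar)
```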

Conditions for Validity

Implicit differentiation requires local regularity.

The equation

$$F(z,\theta) = 0$$

must have a locally unique solution z near the point of interest. A sufficient condition is that F is continuously differentiable and F_z is invertible at the solution.

When F_z is singular, several problems may occur:

  • the solution may not be unique,
  • the solution may change discontinuously,
  • the derivative may become infinite,
  • the derivative may not exist.

For constrained optimization, the same idea applies through the Karush-Kuhn-Tucker conditions, but the derivative depends on active constraints and constraint qualifications.

Practical Failure Modes

Implicit differentiation can fail silently if the residual equation is wrong. The residual must describe the mathematical condition satisfied by the solver output.

It can also fail when the primal solve is inaccurate. If

$$F(z,\theta) \neq 0$$

because the solver stopped early, the implicit gradient is the gradient of an approximate solution. Sometimes this is acceptable. Sometimes it causes biased gradients.

Linear solve accuracy also matters. A poor adjoint solve gives a poor parameter gradient, even when the primal solution is good.

Non-smooth operations require care. If the solution depends on argmax, sorting, clipping, contact events, or active-set changes, the classical derivative may exist only piecewise.

Role in AD Systems

Implicit differentiation extends AD beyond ordinary program traces.

A normal AD system differentiates the operations that were executed. Implicit differentiation differentiates the equation that the executed program solves.

This distinction is important for modern differentiable systems. Many useful layers are solvers:

$$\text{input} \to \text{solve} \to \text{solution} \to \text{loss}.$$

Examples include linear systems, nonlinear equations, convex optimization problems, equilibrium networks, ODE boundary-value problems, and simulation constraints.

An AD system that supports implicit differentiation can expose such solvers as differentiable primitives. The solver becomes a black box in the forward pass, but its derivative is supplied by a mathematically defined backward rule.
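
One way to express this in JAX is `jax.custom_vjp`: the sketch below wraps an illustrative fixed-point solver as a primitive whose backward rule is the implicit adjoint from the fixed-point section, rather than the unrolled iterations. The solver, W, and G are placeholders, not a library API.

```python
import jax
import jax.numpy as jnp

W = jnp.array([[0.3, -0.2], [0.1, 0.25]])

def G(z, theta):
    return jnp.tanh(W @ z + theta)

@jax.custom_vjp
def fixed_point_solve(theta):
    # Forward pass: the solver is a black box that returns the fixed point.
    z = jnp.zeros(2)
    for _ in range(100):
        z = G(z, theta)
    return z

def fixed_point_fwd(theta):
    z = fixed_point_solve(theta)
    return z, (z, theta)

def fixed_point_bwd(res, zbar):
    z, theta = res
    # Implicit backward rule: (I - G_z)^T lam = zbar, thetabar = lam^T G_theta.
    G_z = jax.jacfwd(G, argnums=0)(z, theta)
    lam = jnp.linalg.solve((jnp.eye(z.shape[0]) - G_z).T, zbar)
    _, vjp_theta = jax.vjp(lambda th: G(z, th), theta)
    return (vjp_theta(lam)[0],)

fixed_point_solve.defvjp(fixed_point_fwd, fixed_point_bwd)

def downstream_loss(theta):
    z = fixed_point_solve(theta)
    return jnp.sum(z**2)

theta = jnp.array([0.4, -0.7])
print(jax.grad(downstream_loss)(theta))   # gradient flows through the implicit rule
```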

This is one of the main techniques for scaling automatic differentiation from simple computational graphs to large numerical programs.