Implicit Differentiation
Many programs do not compute their output by applying a fixed sequence of explicit operations. Instead, they define the output as the solution of another problem.
A linear solver returns x such that

A(θ) x = b(θ).

An optimizer returns z such that

∇_z g(z, θ) = 0.

A fixed-point solver returns x such that

x = f(x, θ).

In all three cases, the output depends on the input parameters θ, but the dependence is indirect. We do not write x(θ) as a closed-form expression. We define x by a condition that it must satisfy.
Implicit differentiation gives derivatives of such outputs without differentiating through every internal iteration of the solver.
Explicit vs Implicit Definitions
In explicit differentiation, we have a function

y = f(x)

and compute

dy/dx = f'(x).

In implicit differentiation, the variable of interest is defined by an equation

F(x, y) = 0.
Here x is the input parameter, and y is the value chosen so that the equation holds.
Assume that y changes smoothly with x. Then there exists a local function

y = y(x)

such that

F(x, y(x)) = 0.

Differentiate both sides with respect to x:

d/dx F(x, y(x)) = 0.

By the chain rule,

F_x + F_y (dy/dx) = 0.

Therefore,

dy/dx = -(F_y)^{-1} F_x.
This is the central formula of implicit differentiation.
For scalar x and scalar y, this becomes

dy/dx = -F_x / F_y.
For vector-valued systems, F_y is a Jacobian matrix with respect to the solved variable, and F_x is a Jacobian with respect to the parameter.
Example: Scalar Equation
Suppose y is defined by

x^2 + y^2 = 1.

Let

F(x, y) = x^2 + y^2 - 1.

Then

F_x = 2x,   F_y = 2y.

So

dy/dx = -F_x / F_y = -x / y.
We did not solve for y explicitly. The derivative is expressed in terms of the solution y and the input x.
This is the main advantage: once the solver gives us y, the derivative can often be computed by solving a linear system, not by replaying the whole solve process.
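As a minimal sketch of this workflow, take the residual F(x, y) = x^2 + y^2 - 1 (an illustrative choice). Newton's method plays the role of the black-box solver, and the derivative then comes from a single evaluation of -F_x / F_y, checked here against a finite difference:

```python
import numpy as np

# Illustrative residual: F(x, y) = x^2 + y^2 - 1.
def F(x, y):
    return x**2 + y**2 - 1.0

def F_x(x, y):
    return 2.0 * x

def F_y(x, y):
    return 2.0 * y

# Solve F(x, y) = 0 for y with Newton's method (upper branch, y > 0).
def solve_y(x, y0=0.5, tol=1e-12):
    y = y0
    for _ in range(50):
        step = F(x, y) / F_y(x, y)
        y -= step
        if abs(step) < tol:
            break
    return y

x = 0.6
y = solve_y(x)                      # point on the unit circle
dy_dx = -F_x(x, y) / F_y(x, y)      # implicit rule: dy/dx = -F_x / F_y

# Finite-difference check against the solved function y(x).
h = 1e-6
fd = (solve_y(x + h) - solve_y(x - h)) / (2 * h)
```

The solver's internal iterations never enter the derivative; only the converged y does.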
Vector Form
Let

F(z, θ) = 0,

where

z ∈ R^n,   θ ∈ R^p,   F: R^n × R^p → R^n.

Assume the Jacobian

F_z = ∂F/∂z ∈ R^{n×n}

is invertible at the solution. Then differentiating F(z(θ), θ) = 0 with respect to θ gives

F_z (∂z/∂θ) + F_θ = 0.

Hence

∂z/∂θ = -(F_z)^{-1} F_θ.

In implementation, one usually avoids forming the inverse. Instead, solve the linear system

F_z X = -F_θ.

The result X is the Jacobian of the solution z with respect to the parameters θ.
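A small sketch of the vector case, using an assumed residual F(z, θ) = z³ + A z - θ (elementwise cube; A and the random seed are arbitrary choices for illustration). The primal solve is Newton's method; the Jacobian ∂z/∂θ then comes from one linear solve F_z X = -F_θ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
theta = rng.standard_normal(n)

# Assumed residual: F(z, theta) = z**3 + A @ z - theta (elementwise cube).
def F(z, theta):
    return z**3 + A @ z - theta

def F_z(z):
    return np.diag(3.0 * z**2) + A   # Jacobian with respect to z

F_theta = -np.eye(n)                 # Jacobian with respect to theta

# Primal solve: Newton's method on F(z, theta) = 0.
def solve(theta):
    z = np.zeros(n)
    for _ in range(50):
        step = np.linalg.solve(F_z(z), F(z, theta))
        z -= step
        if np.linalg.norm(step) < 1e-12:
            break
    return z

z = solve(theta)

# Implicit differentiation: one linear solve gives the full Jacobian dz/dtheta.
X = np.linalg.solve(F_z(z), -F_theta)

# Finite-difference check of one column.
h = 1e-6
e0 = np.zeros(n); e0[0] = h
fd_col0 = (solve(theta + e0) - solve(theta - e0)) / (2 * h)
```

Note that the Newton iteration count never appears in the derivative; only the converged z and the Jacobians at that point do.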
Reverse Mode Form
In machine learning and scientific computing, we often need gradients of a scalar loss

L = L(z(θ)),

where z is defined implicitly by

F(z, θ) = 0.

Reverse mode avoids computing the full Jacobian

∂z/∂θ ∈ R^{n×p}.

Let

g = ∇_z L.

We need the contribution

∇_θ L = (∂z/∂θ)^T g.

From implicit differentiation,

∂z/∂θ = -(F_z)^{-1} F_θ.

So

∇_θ L = -(F_θ)^T (F_z)^{-T} g.

Define an adjoint variable λ by solving

(F_z)^T λ = g.

Then

∇_θ L = -(F_θ)^T λ.
This is the reverse-mode implicit differentiation rule.
It requires one linear solve involving (F_z)^T.
It does not require differentiating through every solver iteration.
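The adjoint rule can be sketched numerically. The residual F(z, θ) = z³ + A z - θ and the loss L(z) = ½ zᵀz are assumptions made for illustration; with them, ∇_z L = z and F_θ = -I:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
theta = rng.standard_normal(n)

# Assumed residual and its Jacobian in z.
def F(z, theta):
    return z**3 + A @ z - theta

def F_z(z):
    return np.diag(3.0 * z**2) + A

F_theta = -np.eye(n)

def solve(theta):
    z = np.zeros(n)
    for _ in range(50):                      # Newton solve of F(z, theta) = 0
        z -= np.linalg.solve(F_z(z), F(z, theta))
    return z

z = solve(theta)
g = z                                        # grad_z L for L(z) = 0.5 * z @ z

# Adjoint solve: (F_z)^T lam = g, then grad_theta L = -(F_theta)^T lam.
lam = np.linalg.solve(F_z(z).T, g)
grad_theta = -F_theta.T @ lam                # equals lam here, since F_theta = -I

# Finite-difference check of one component of the gradient.
L = lambda th: 0.5 * np.dot(solve(th), solve(th))
h = 1e-6
e0 = np.zeros(n); e0[0] = h
fd0 = (L(theta + e0) - L(theta - e0)) / (2 * h)
```

One transposed linear solve replaces a backward pass through all Newton iterations.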
Fixed-Point Differentiation
A common implicit form is a fixed point:

z = f(z, θ).

Rewrite it as

F(z, θ) = z - f(z, θ) = 0.

Then

F_z = I - f_z,   F_θ = -f_θ.

Therefore,

∂z/∂θ = (I - f_z)^{-1} f_θ.

In reverse mode, solve

(I - f_z)^T λ = ∇_z L,

then compute

∇_θ L = (f_θ)^T λ.
This appears in equilibrium models, recurrent computation at convergence, iterative refinement, and some differentiable physics systems.
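A toy equilibrium layer illustrates the rule. The map f(z, θ) = tanh(W z + θ) is an assumed contraction (small W), iterated to convergence; the gradient of L(z) = Σ z_i is then recovered from one solve with (I - f_z)^T:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
W = 0.3 * rng.standard_normal((n, n))    # small weights keep f a contraction
theta = rng.standard_normal(n)

# Assumed fixed-point map: z = tanh(W z + theta).
def f(z, theta):
    return np.tanh(W @ z + theta)

def fixed_point(theta, iters=200):
    z = np.zeros(n)
    for _ in range(iters):               # plain fixed-point iteration
        z = f(z, theta)
    return z

z = fixed_point(theta)

# Jacobians of f at the converged point.
s = 1.0 - np.tanh(W @ z + theta)**2      # tanh' at the pre-activation
f_z = s[:, None] * W                     # df/dz
f_theta = np.diag(s)                     # df/dtheta

# Reverse mode for L(z) = sum(z): solve (I - f_z)^T lam = grad_z L,
# then grad_theta L = f_theta^T lam.
g = np.ones(n)
lam = np.linalg.solve((np.eye(n) - f_z).T, g)
grad_theta = f_theta.T @ lam

# Finite-difference check of one component.
Lsum = lambda th: fixed_point(th).sum()
h = 1e-6
e0 = np.zeros(n); e0[0] = h
fd0 = (Lsum(theta + e0) - Lsum(theta - e0)) / (2 * h)
```

The 200 forward iterations contribute nothing to the backward pass; the equation z = f(z, θ) does all the work.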
Optimization as an Implicit Layer
Suppose z is the minimizer of

g(z, θ).

At a smooth local optimum,

∇_z g(z, θ) = 0.

Define

F(z, θ) = ∇_z g(z, θ).

Then

F_z = ∇²_zz g,

the Hessian with respect to z, and

F_θ = ∇²_zθ g,

the matrix of mixed second derivatives. Thus,

∂z/∂θ = -(∇²_zz g)^{-1} ∇²_zθ g.
This is the basis of differentiating through optimization problems. It is used in bilevel optimization, meta-learning, differentiable control, and optimization layers.
The formula is valid only when the optimum is locally well behaved. If the Hessian is singular, or if the optimizer reaches a boundary or a non-smooth point, the derivative may be undefined or set-valued.
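The well-behaved case can be sketched with an assumed quadratic objective g(z, θ) = ½ zᵀQz - θᵀz, whose stationarity residual is Qz - θ = 0. Here the Hessian is Q, the mixed second derivative is -I, and the implicit formula gives ∂z/∂θ = Q⁻¹:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)          # symmetric positive definite Hessian
theta = rng.standard_normal(n)

# Minimizer of the assumed objective satisfies Q z - theta = 0.
z = np.linalg.solve(Q, theta)

# Implicit differentiation through the stationarity condition:
#   dz/dtheta = -(Hessian)^{-1} (mixed second derivative) = -Q^{-1} (-I).
H = Q                                 # d^2 g / dz^2
cross = -np.eye(n)                    # d(grad_z g) / dtheta
dz_dtheta = np.linalg.solve(H, -cross)

# Finite-difference check of column 0 against the closed-form minimizer.
h = 1e-6
e0 = np.zeros(n); e0[0] = h
fd_col0 = (np.linalg.solve(Q, theta + e0) - np.linalg.solve(Q, theta - e0)) / (2 * h)
```

The positive definite Q guarantees the invertible-Hessian condition that the formula requires.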
Why Not Differentiate Through the Solver?
A solver may run for hundreds or thousands of iterations. Differentiating through every iteration has several costs.
First, reverse mode must retain or reconstruct intermediate states. This can consume large memory.
Second, the gradient depends on the exact sequence of iterations. If the solver is only an algorithm for finding the solution, not part of the mathematical model, this derivative may be the wrong abstraction.
Third, long iterative chains can produce unstable gradients. The derivative may vanish, explode, or reflect numerical artifacts of the solver rather than sensitivity of the solved system.
Implicit differentiation treats the solver as a way to satisfy an equation. The derivative comes from the equation, not from the path taken to solve it.
Computational Pattern
A typical implicit differentiation implementation has four parts.
- Run the primal solver and obtain z.
- Define the residual function F(z, θ).
- Use AD to compute Jacobian-vector products or vector-Jacobian products involving F_z and F_θ.
- Solve the required linear system for the tangent or adjoint.
For forward mode, solve

F_z t = -F_θ v

for the tangent t = (∂z/∂θ) v in a given parameter direction v. For reverse mode, solve

(F_z)^T λ = ∇_z L.

Then compute

∇_θ L = -(F_θ)^T λ.
Large systems rarely form F_z explicitly. They use matrix-free methods such as conjugate gradient, GMRES, or custom linear solvers, where each matrix-vector product is computed by AD.
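A matrix-free adjoint solve can be sketched with a hand-written conjugate gradient loop. The matrix is only touched through a `matvec` callback, which in a real system would be an AD-supplied Jacobian-vector product; the symmetric positive definite test matrix here is an assumption standing in for F_z:

```python
import numpy as np

def cg(matvec, b, tol=1e-10, maxiter=200):
    """Solve matvec(x) = b by conjugate gradient, never forming the matrix."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(4)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # stands in for a symmetric F_z

g = rng.standard_normal(n)            # incoming gradient grad_z L
lam = cg(lambda v: A @ v, g)          # adjoint solve, one matvec per iteration
```

For nonsymmetric F_z one would use GMRES or BiCGSTAB instead, but the pattern is identical: only matrix-vector products are needed.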
Conditions for Validity
Implicit differentiation requires local regularity.
The equation

F(z, θ) = 0

must have a locally unique solution z near the point of interest. A sufficient condition is that F is continuously differentiable and F_z is invertible at the solution.
When F_z is singular, several problems may occur:
- the solution may not be unique,
- the solution may change discontinuously,
- the derivative may become infinite,
- the derivative may not exist.
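The blow-up case is easy to exhibit. Take again the illustrative circle residual F(x, y) = x² + y² - 1: at y = 0 (that is, at x = ±1) the factor F_y = 2y vanishes, and the implicit derivative -F_x / F_y diverges as the parameter approaches that point:

```python
import numpy as np

# Circle residual F(x, y) = x^2 + y^2 - 1; F_y = 2y is singular at y = 0.
def dy_dx(x):
    y = np.sqrt(1.0 - x**2)            # upper-branch solution y(x)
    return -x / y                      # implicit derivative -F_x / F_y

far = abs(dy_dx(0.5))                  # well-conditioned point
near = abs(dy_dx(0.999999))            # approaching the singular point x = 1
```

The derivative is moderate away from the singularity and grows without bound as F_y → 0.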
For constrained optimization, the same idea applies through the Karush-Kuhn-Tucker conditions, but the derivative depends on active constraints and constraint qualifications.
Practical Failure Modes
Implicit differentiation can fail silently if the residual equation is wrong. The residual must describe the mathematical condition satisfied by the solver output.
It can also fail when the primal solve is inaccurate. If

F(z, θ) ≠ 0

because the solver stopped early, the implicit gradient is the gradient of an approximate solution. Sometimes this is acceptable. Sometimes it causes biased gradients.
Linear solve accuracy also matters. A poor adjoint solve gives a poor parameter gradient, even when the primal solution is good.
Non-smooth operations require care. If the solution depends on argmax, sorting, clipping, contact events, or active-set changes, the classical derivative may exist only piecewise.
Role in AD Systems
Implicit differentiation extends AD beyond ordinary program traces.
A normal AD system differentiates the operations that were executed. Implicit differentiation differentiates the equation that the executed program solves.
This distinction is important for modern differentiable systems. Many useful layers are solvers: linear systems, nonlinear equations, convex optimization problems, equilibrium networks, ODE boundary-value problems, and simulation constraints.
An AD system that supports implicit differentiation can expose such solvers as differentiable primitives. The solver becomes a black box in the forward pass, but its derivative is supplied by a mathematically defined backward rule.
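A minimal sketch of such a primitive, under assumed names and shapes: the forward pass runs a black-box Newton solve, and the backward pass applies the adjoint rule (F_z)ᵀλ = g, ∇_θ L = -(F_θ)ᵀλ. Real frameworks register the backward rule with the AD engine (e.g. as a custom VJP); here the two passes are just methods:

```python
import numpy as np

class ImplicitLayer:
    """Solver-as-primitive sketch: forward root-find, adjoint backward rule."""

    def __init__(self, F, F_z, F_theta):
        self.F, self.F_z, self.F_theta = F, F_z, F_theta

    def forward(self, theta, z0, iters=50):
        z = z0
        for _ in range(iters):         # black-box Newton solve of F(z, theta) = 0
            z = z - np.linalg.solve(self.F_z(z, theta), self.F(z, theta))
        self.z, self.theta = z, theta  # cache for the backward pass
        return z

    def backward(self, g):
        # Adjoint rule: solve (F_z)^T lam = g, return -(F_theta)^T lam.
        lam = np.linalg.solve(self.F_z(self.z, self.theta).T, g)
        return -self.F_theta(self.z, self.theta).T @ lam

# Usage with an assumed decoupled residual F(z, theta) = z**3 + z - theta.
n = 3
layer = ImplicitLayer(
    F=lambda z, th: z**3 + z - th,
    F_z=lambda z, th: np.diag(3 * z**2 + 1),
    F_theta=lambda z, th: -np.eye(len(z)),
)
theta = np.array([0.5, -1.0, 2.0])
z = layer.forward(theta, np.zeros(n))
grad_theta = layer.backward(np.ones(n))   # gradient of sum(z) w.r.t. theta
```

From the AD engine's point of view, `forward` is opaque; only `backward`, derived from the equation, participates in differentiation.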
This is one of the main techniques for scaling automatic differentiation from simple computational graphs to large numerical programs.