A solver is a program that computes a value by search, iteration, or factorization. Instead of evaluating a closed-form expression, it finds a value that satisfies a condition.
Common examples include:
| Solver type | Output condition |
|---|---|
| Linear solver | Ax = b |
| Nonlinear root solver | F(x, θ) = 0 |
| Fixed-point solver | x = G(x, θ) |
| Optimizer | ∇_x L(x, θ) = 0 |
| ODE solver | trajectory satisfies differential equation |
| Projection operator | output lies in constraint set |
Automatic differentiation can treat solvers in two different ways.
The first method differentiates the solver algorithm. The second method differentiates the mathematical problem solved by the algorithm.
These two methods often give different derivatives.
Differentiating the Algorithm
Suppose a solver runs an iteration

x_{k+1} = G(x_k, θ)

for K steps. The program returns x_K.
A normal AD system differentiates exactly this computation: it propagates derivatives through all K steps and produces ∂x_K/∂θ.
This is called unrolled differentiation.
For example:

```
x = x0
for k in 1..K:
    x = G(x, theta)
return x
```

Reverse mode stores or reconstructs the intermediate states x0, x1, ..., xK and propagates adjoints backward through the loop.
This method is simple and general. It works with ordinary AD machinery. It also gives the derivative of the actual finite computation.
That matters when the number of iterations is part of the model. For example, an iterative neural network block with exactly ten steps should usually be differentiated as ten steps.
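Unrolled differentiation can also be written out by hand in forward mode: carry dx = ∂x_k/∂θ alongside x_k through every step. A minimal sketch, using the Babylonian square-root step G(x, θ) = (x + θ/x)/2 as an illustrative stand-in for G (this specific map is not from the text):

```python
# Forward-mode accumulation through an unrolled fixed-point loop.
# G(x, θ) = (x + θ/x)/2 is the Babylonian square-root step, an
# illustrative choice; its fixed point is √θ.
def unrolled(theta, x0=1.0, K=10):
    x, dx = x0, 0.0                       # dx tracks ∂x_k/∂θ
    for _ in range(K):
        gx = 0.5 * (1 - theta / x**2)     # ∂G/∂x at the current iterate
        gt = 0.5 / x                      # ∂G/∂θ at the current iterate
        x, dx = 0.5 * (x + theta / x), gx * dx + gt
    return x, dx

x, dx = unrolled(4.0)   # fixed point √4 = 2, derivative d√θ/dθ = 0.25
```

The derivative recursion dx_{k+1} = G_x·dx_k + G_θ is exactly what an AD system would build from the loop.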
Differentiating the Solved Problem
If the solver is only a numerical method for finding a solution, the finite iteration path may be irrelevant.
Suppose the intended output is the fixed point

x* = G(x*, θ).
The iteration count is an implementation detail. We want the derivative of the fixed point, not the derivative of the finite approximation.
In that case, define the residual

F(x, θ) = x − G(x, θ).

Then the solution x*(θ) satisfies F(x*, θ) = 0.
Implicit differentiation gives

∂x*/∂θ = (I − ∂G/∂x)⁻¹ ∂G/∂θ,

with the derivatives of G evaluated at x*.
This derivative depends on the local equation at the solution. It does not depend on the sequence of iterates used to reach the solution.
Unrolled vs Implicit Derivatives
The distinction is important.
For K unrolled steps, the derivative ∂x_K/∂θ measures sensitivity of the finite computation.
For the converged solution, the derivative ∂x*/∂θ measures sensitivity of the mathematical fixed point.
As K becomes large, these may approach the same value if the iteration converges smoothly and the derivatives are stable. In finite precision, they may still differ because convergence tolerances, stopping rules, and numerical conditioning matter.
| Method | Differentiates | Main cost | Main risk |
|---|---|---|---|
| Unrolled AD | solver execution trace | memory proportional to iterations | unstable long chains |
| Implicit AD | solution condition | linear solve at solution | invalid residual or singular Jacobian |
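The two derivatives can be compared numerically. Below, G(x, θ) = (x + θ/x)/2 is an illustrative fixed-point map (not from the text) whose fixed point is √θ; the unrolled derivative approaches the implicit one as K grows, and both recover d√θ/dθ = 1/(2√θ):

```python
import math

# Unrolled vs implicit derivative for the illustrative map
# G(x, θ) = (x + θ/x)/2, whose fixed point is x* = √θ.
def dG_dx(x, theta):                  # ∂G/∂x
    return 0.5 * (1 - theta / x**2)

def dG_dtheta(x, theta):              # ∂G/∂θ
    return 0.5 / x

def unrolled_grad(theta, x0=1.0, K=20):
    x, dx = x0, 0.0                   # differentiate the finite K-step loop
    for _ in range(K):
        x, dx = 0.5 * (x + theta / x), dG_dx(x, theta) * dx + dG_dtheta(x, theta)
    return dx

def implicit_grad(theta):
    x_star = math.sqrt(theta)         # the converged solution
    # ∂x*/∂θ = (1 − ∂G/∂x)⁻¹ ∂G/∂θ, scalar case
    return dG_dtheta(x_star, theta) / (1 - dG_dx(x_star, theta))
```

For θ = 4, both values agree with the exact derivative 0.25 once K is large enough.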
Example: Newton Solver
Consider a scalar equation

x² − θ = 0.

The solution is

x*(θ) = √θ

for positive θ.
Newton's method computes

x_{k+1} = x_k − (x_k² − θ) / (2 x_k).
Unrolled AD differentiates the finite sequence of Newton steps. The derivative depends on the initial value x0 and the number of iterations K.
Implicit differentiation uses the residual F(x, θ) = x² − θ, with F_x = 2x and F_theta = −1.
Therefore,

dx*/dθ = −F_theta / F_x = 1 / (2x*) = 1 / (2√θ).
At convergence, this equals the derivative of the true solution. It is independent of the Newton path.
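A minimal numerical check of this example, with an illustrative tolerance and initial guess:

```python
# Newton solve of x² − θ = 0, followed by the implicit derivative
# dx*/dθ = 1/(2x*). Tolerance and starting point are illustrative.
def newton_sqrt(theta, x0=1.0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        f, fx = x * x - theta, 2 * x   # residual and its x-derivative
        if abs(f) < tol:
            break
        x -= f / fx                     # Newton step
    return x

theta = 9.0
x = newton_sqrt(theta)                  # converges to 3
dx_dtheta = 1.0 / (2 * x)               # implicit rule: −F_theta / F_x
```

The derivative comes from the residual at the solution, not from differentiating the Newton steps themselves.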
Reverse Mode Through a Solver
Let a solver produce x, and let a scalar loss L depend on it.
If x satisfies

F(x, θ) = 0,

then reverse-mode implicit differentiation solves

F_x^T λ = xbar,

where xbar = ∂L/∂x is the upstream adjoint and F_x is the Jacobian of the residual at the solution.
Then the parameter gradient contribution is

thetabar = −λ^T F_theta.
This is the standard backward rule for a root-solving layer.
In implementation, the backward pass usually needs three operations:
```
vjp_x(v)     = v^T F_x
vjp_theta(v) = v^T F_theta
linear_solve(F_x^T, xbar)
```

The linear solve gives λ. The vector-Jacobian product with respect to θ gives the final gradient.
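These operations can be wired together on a small example. The residual F(x, θ) = A x − θ below is a hypothetical choice whose Jacobians are immediate (F_x = A, F_theta = −I), which makes the backward rule easy to verify against the closed-form solution map:

```python
import numpy as np

# Backward pass for a root-solving layer with the illustrative
# residual F(x, θ) = A x − θ, so the solution map is x(θ) = A⁻¹θ.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
theta = np.array([1.0, 4.0])
x = np.linalg.solve(A, theta)            # forward: F(x, θ) = 0

xbar = np.array([1.0, -1.0])             # upstream adjoint ∂L/∂x
lam = np.linalg.solve(A.T, xbar)         # linear_solve(F_x^T, xbar) gives λ
F_theta = -np.eye(2)                     # ∂F/∂θ
thetabar = -(F_theta.T @ lam)            # thetabar = −λ^T F_theta
```

Because x(θ) = A⁻¹θ here, the result must equal A^{−T} xbar, which is exactly what the adjoint solve produces.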
Linear Solvers
Linear systems are the simplest case.
Let

A(θ) x = b(θ).

The residual is

F(x, θ) = A x − b.

Differentiate with respect to θ:

A (∂x/∂θ) = ∂b/∂θ − (∂A/∂θ) x.

So

∂x/∂θ = A⁻¹ ( ∂b/∂θ − (∂A/∂θ) x ).

For reverse mode, given an upstream adjoint xbar, solve

A^T λ = xbar.

Then

bbar = λ

and

Abar = −λ x^T.
This rule is widely used in differentiable scientific computing. It avoids differentiating through Gaussian elimination, conjugate gradient iterations, or a library call such as LAPACK.
The derivative depends on the equation Ax=b, not on the particular algorithm used to solve it.
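The reverse-mode rule can be checked in a few lines against a finite difference. The matrix, right-hand side, and adjoint below are illustrative values:

```python
import numpy as np

# Reverse-mode rule for x = solve(A, b): solve A^T λ = xbar,
# then bbar = λ and Abar = −λ x^T.
A = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.linalg.solve(A, b)

xbar = np.array([1.0, 1.0])              # upstream adjoint
lam = np.linalg.solve(A.T, xbar)
bbar = lam
Abar = -np.outer(lam, x)

# Finite-difference check of Abar on the entry A[0, 1]
eps = 1e-6
A2 = A.copy()
A2[0, 1] += eps
fd = (xbar @ np.linalg.solve(A2, b) - xbar @ x) / eps
```

Note that the backward pass only needs one transpose solve, regardless of how the forward solve was performed.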
Iterative Linear Solvers
For large systems, A may never be formed explicitly. A conjugate-gradient solver may only need matrix-vector products:

```
y = matvec_A(x)
```

The backward pass can also be matrix-free. It needs products with A^T.
If A is symmetric, then A^T = A, and the same matrix-vector product can be reused.
For non-symmetric systems, the implementation must provide a transpose matvec or allow AD to construct one. Without this, reverse-mode differentiation may be impossible or inefficient.
The key interface is:
```
solve(A, b) -> x
transpose_solve(A, xbar) -> lambda
```

A production AD system should allow custom derivative rules for such solver primitives.
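A matrix-free version of this interface can be sketched with a small conjugate-gradient routine. For the symmetric matrix below, the same matvec serves both the forward solve and the adjoint solve; the matrix, tolerance, and vectors are illustrative:

```python
import numpy as np

# Matrix-free conjugate gradient. For symmetric A, A^T = A, so the
# forward solve and the adjoint solve reuse one matvec closure.
def cg(matvec, b, tol=1e-10, max_iter=100):
    x = np.zeros_like(b)
    r = b - matvec(x)                 # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
matvec = lambda v: A @ v
b = np.array([1.0, 2.0])
x = cg(matvec, b)                         # forward solve
xbar = np.array([1.0, 0.0])
lam = cg(matvec, xbar)                    # adjoint solve, same matvec
```

For a non-symmetric A, `lam` would instead require a separate transpose matvec, as the text notes.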
Nonlinear Solvers
For nonlinear equations

F(x, θ) = 0,

the forward pass may use Newton, quasi-Newton, trust-region methods, or fixed-point iteration.
The backward pass needs a linearized solve:

F_x^T λ = xbar.
The matrix F_x is the Jacobian of the residual at the final solution.
This has an important consequence: the nonlinear solver and backward linear solver are different computations. A robust forward nonlinear solve does not automatically imply a stable backward solve.
The derivative can be poor when F_x is ill-conditioned. Small perturbations in θ can cause large changes in x.
Stopping Rules
Solvers usually stop when a tolerance is reached:
```
while norm(F(x, theta)) > tol:
    x = step(x, theta)
```

This introduces a discontinuity in the execution trace. A tiny change in θ can change the number of iterations.
Unrolled AD differentiates the path that occurred. It usually ignores the derivative of the discrete stopping decision.
Implicit AD avoids this specific issue by differentiating the final residual equation. But it assumes the returned value is close enough to a true solution.
A practical rule is that the final residual norm norm(F(x, theta)) must be small enough that the implicit gradient is meaningful.
If the primal residual is large, the backward pass may report a precise derivative of the wrong object.
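The step-function behavior of the stopping rule is easy to observe: the iteration count below is piecewise constant in θ, so at some θ between 4 and 9 an arbitrarily small parameter change alters the executed trace. The fixed-point step and tolerance are illustrative choices:

```python
# Iteration count of a tolerance-based stopping rule, using the
# illustrative step x ← (x + θ/x)/2 (converging to √θ).
def iterations_to_converge(theta, tol=1e-6, x0=1.0, max_iter=100):
    x = x0
    for k in range(1, max_iter + 1):
        x = 0.5 * (x + theta / x)
        if abs(x * x - theta) <= tol:
            return k
    return max_iter
```

Unrolled AD differentiates whichever count actually occurred; the jump in the count itself has no derivative.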
Solver Accuracy and Gradient Accuracy
Forward accuracy and gradient accuracy are related but separate.
A solver may produce a value with small forward error but poor sensitivity when the problem is ill-conditioned. Conversely, a moderately inaccurate solution may yield acceptable gradients if the loss is insensitive to the error.
For implicit differentiation, three quantities matter:
| Quantity | Meaning |
|---|---|
| primal residual | how well x satisfies F(x, θ)=0 |
| Jacobian conditioning | sensitivity of the solution map |
| adjoint solve residual | how accurately the backward linear system is solved |
A useful implementation records all three. Silent gradient errors are common when only the forward solver status is checked.
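A sketch of recording all three quantities, shown for a linear residual A x − b where F_x = A; the matrix and vectors are illustrative:

```python
import numpy as np

# Diagnostics for an implicit backward pass: primal residual,
# Jacobian conditioning, and adjoint solve residual.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # F_x for the residual A x − b
b = np.array([1.0, 1.0])
x = np.linalg.solve(A, b)                # forward solve
xbar = np.array([1.0, 0.0])              # upstream adjoint
lam = np.linalg.solve(A.T, xbar)         # adjoint solve

diagnostics = {
    "primal_residual": np.linalg.norm(A @ x - b),
    "jacobian_cond": np.linalg.cond(A),
    "adjoint_residual": np.linalg.norm(A.T @ lam - xbar),
}
```

If any of the three is large, the reported gradient should not be trusted, even when the forward solver reports success.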
Differentiating Solver Libraries
Many solvers live outside the AD system: BLAS, LAPACK, PETSc, SUNDIALS, IPOPT, OSQP, custom CUDA kernels, or domain-specific simulators.
An AD system cannot inspect these calls unless they are written in traceable code. There are two common solutions.
The first is wrapping the solver as a primitive with a custom derivative rule.
The second is rewriting the solver in a differentiable language.
Wrapping is usually better for mature numerical libraries. The library already has stable algorithms. The AD system only needs a correct mathematical derivative.
Rewriting is useful when the solver itself contains differentiable modeling choices, learned parameters, or custom control flow.
Custom Derivative Rule
A solver primitive can be described as:

```
x = solve(theta)
```

with residual:

```
F(x, theta) = 0
```

The custom backward rule is:

```
backward(xbar):
    lambda = solve_transpose_jacobian(F, x, theta, xbar)
    thetabar = - vjp_theta(F, x, theta, lambda)
    return thetabar
```

This separates the primal solver from the derivative logic.
The forward solver may use one algorithm. The backward solve may use another algorithm. Both must correspond to the same residual equation.
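A concrete sketch of this separation, using the scalar residual F(x, θ) = x² − θ as an illustrative choice: the primal solver happens to use Newton's method, but the backward rule knows only the residual's derivatives at the solution.

```python
# Solver primitive with a custom backward rule. Primal and backward
# logic share only the residual F(x, θ) = x² − θ (illustrative).
def solve(theta):
    x = 1.0
    for _ in range(50):                # primal: Newton iteration
        x -= (x * x - theta) / (2 * x)
    return x

def backward(x, theta, xbar):
    F_x = 2 * x                         # ∂F/∂x at the solution
    F_theta = -1.0                      # ∂F/∂θ
    lam = xbar / F_x                    # scalar transpose-Jacobian solve
    return -lam * F_theta               # thetabar = −λ^T F_theta

x = solve(9.0)                          # converges to 3
thetabar = backward(x, 9.0, 1.0)        # d√θ/dθ = 1/(2·3) = 1/6
```

Swapping Newton for bisection in `solve` would leave `backward` unchanged, because both correspond to the same residual equation.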
Memory Behavior
Unrolled reverse mode stores intermediate states unless checkpointing is used. For long solvers, this can dominate memory.
Implicit differentiation stores only the final solution and enough data to evaluate the residual and its derivatives.
| Solver length | Unrolled reverse memory | Implicit reverse memory |
|---|---|---|
| 10 steps | small | small |
| 1,000 steps | large | small |
| 1,000,000 steps | usually infeasible | possible if residual products are cheap |
This is one of the main reasons implicit methods are used for equilibrium models, PDE solvers, and optimization layers.
When Unrolling Is Better
Implicit differentiation is not always the right abstraction.
Unrolling is better when the algorithm itself is the model. Examples include learned optimizers, recurrent networks with a fixed number of steps, iterative refinement blocks trained end-to-end, and procedures whose early stopping behavior is intentional.
Unrolling also works when the residual equation is hard to define or when the solver is not actually converging to a well-defined solution.
A practical test is simple:
If changing the solver algorithm should change the model, use unrolled differentiation.
If changing the solver algorithm should preserve the model, use implicit differentiation.
Summary
Differentiating through solvers requires choosing the right object of differentiation.
Unrolled AD differentiates the computation that was executed. It is general, direct, and often expensive.
Implicit AD differentiates the equation solved by the computation. It is usually more memory efficient and more mathematically stable, provided the residual is correct, the solution is accurate, and the linearized system is well conditioned.
For modern AD systems, solvers should often be treated as differentiable primitives with custom forward and backward rules. This gives the programmer control over the mathematical meaning of the derivative while preserving the performance of specialized numerical code.