
Differentiating Through Solvers

A solver is a program that computes a value by search, iteration, or factorization. Instead of evaluating a closed-form expression, it finds a value that satisfies a condition.

Common examples include:

| Solver type | Output condition |
| --- | --- |
| Linear solver | Ax = b |
| Nonlinear root solver | F(x, θ) = 0 |
| Fixed-point solver | x = G(x, θ) |
| Optimizer | ∇_x L(x, θ) = 0 |
| ODE solver | trajectory satisfies the differential equation |
| Projection operator | output lies in the constraint set |

Automatic differentiation can treat solvers in two different ways.

The first method differentiates the solver algorithm. The second method differentiates the mathematical problem solved by the algorithm.

These two methods often give different derivatives.

Differentiating the Algorithm

Suppose a solver runs an iteration

x_{k+1} = G(x_k, \theta)

for K steps. The program returns

x_K.

A normal AD system differentiates exactly this computation:

\theta \mapsto x_0 \mapsto x_1 \mapsto \cdots \mapsto x_K.

This is called unrolled differentiation.

For example:

x = x0
for k in range(K):
    x = G(x, theta)
return x

Reverse mode stores or reconstructs the intermediate states

x0, x1, ..., xK

and propagates adjoints backward through the loop.

This method is simple and general. It works with ordinary AD machinery. It also gives the derivative of the actual finite computation.

That matters when the number of iterations is part of the model. For example, an iterative neural network block with exactly ten steps should usually be differentiated as ten steps.
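As a concrete sketch of unrolled differentiation (hand-rolled forward mode in plain Python; the contraction G and the helper name `unrolled_derivative` are illustrative, not from any particular library), the tangent dx/dθ can be carried alongside the primal iterate:

```python
def unrolled_derivative(theta, x0=1.0, K=10):
    """Differentiate the finite K-step iteration x_{k+1} = G(x_k, theta)
    with respect to theta by propagating a tangent dx alongside x
    (forward-mode AD written out by hand).
    Illustrative G: one Babylonian step toward sqrt(theta)."""
    x, dx = x0, 0.0  # dx_0/dtheta = 0: the initial guess does not depend on theta
    for _ in range(K):
        # Chain rule for one step: dx_{k+1} = G_x(x_k) * dx_k + G_theta(x_k),
        # where G(x, theta) = 0.5 * (x + theta / x).
        dx = 0.5 * (1.0 - theta / x**2) * dx + 0.5 / x
        x = 0.5 * (x + theta / x)
    return x, dx
```

For theta = 4 and enough steps, x approaches 2 and dx approaches 1/(2·2) = 0.25, the derivative of the limit; for small K, dx is the derivative of the finite computation and still depends on x0 and K.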

Differentiating the Solved Problem

If the solver is only a numerical method for finding a solution, the finite iteration path may be irrelevant.

Suppose the intended output is the fixed point

x^* = G(x^*, \theta).

The iteration count is an implementation detail. We want the derivative of the fixed point, not the derivative of the finite approximation.

In that case, define

F(x, \theta) = x - G(x, \theta).

Then

F(x^*, \theta) = 0.

Implicit differentiation gives

\frac{\partial x^*}{\partial \theta} = (I - G_x)^{-1} G_\theta.

This derivative depends on the local equation at the solution. It does not depend on the sequence of iterates used to reach the solution.
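A minimal scalar sketch of this rule (the contraction G, whose fixed point is sqrt(theta), and the helper names are our own):

```python
def fixed_point(G, theta, x0=1.0, tol=1e-12, max_iter=100):
    """Iterate x = G(x, theta) until successive iterates agree to tol."""
    x = x0
    for _ in range(max_iter):
        x_new = G(x, theta)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def implicit_derivative(G_x, G_theta, x_star, theta):
    """Scalar case of dx*/dtheta = (I - G_x)^{-1} G_theta."""
    return G_theta(x_star, theta) / (1.0 - G_x(x_star, theta))

# Illustrative contraction whose fixed point is sqrt(theta):
G       = lambda x, t: 0.5 * (x + t / x)
G_x     = lambda x, t: 0.5 * (1.0 - t / x**2)   # dG/dx
G_theta = lambda x, t: 0.5 / x                  # dG/dtheta

x_star = fixed_point(G, 4.0)
grad = implicit_derivative(G_x, G_theta, x_star, 4.0)  # ≈ 1/(2·2) = 0.25
```

Note that only the final `x_star` enters the derivative; the iterates along the way never appear.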

Unrolled vs Implicit Derivatives

The distinction is important.

For K unrolled steps,

\frac{\partial x_K}{\partial \theta}

measures sensitivity of the finite computation.

For the converged solution,

\frac{\partial x^*}{\partial \theta}

measures sensitivity of the mathematical fixed point.

As K becomes large, these may approach the same value if the iteration converges smoothly and the derivatives are stable. In finite precision, they may still differ because convergence tolerances, stopping rules, and numerical conditioning matter.

| Method | Differentiates | Main cost | Main risk |
| --- | --- | --- | --- |
| Unrolled AD | solver execution trace | memory proportional to iterations | unstable long chains |
| Implicit AD | solution condition | linear solve at solution | invalid residual or singular Jacobian |

Example: Newton Solver

Consider a scalar equation

F(x, \theta) = x^2 - \theta = 0.

The solution is

x^* = \sqrt{\theta}

for positive θ.

Newton’s method computes

x_{k+1} = x_k - \frac{x_k^2 - \theta}{2 x_k}.

Unrolled AD differentiates the finite sequence of Newton steps. The derivative depends on the initial value x0 and the number of iterations K.

Implicit differentiation uses

F_x = 2x, \qquad F_\theta = -1.

Therefore,

\frac{dx^*}{d\theta} = -F_x^{-1} F_\theta = \frac{1}{2x^*}.

At convergence, this equals the derivative of the true solution. It is independent of the Newton path.
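This is easy to check numerically (a minimal sketch; the Newton loop below and the comparison against the analytic derivative are our own):

```python
import math

def newton_sqrt(theta, x0=1.0, n_steps=8):
    """Newton's method for F(x, theta) = x**2 - theta = 0."""
    x = x0
    for _ in range(n_steps):
        x = x - (x**2 - theta) / (2.0 * x)
    return x

theta = 9.0
x_star = newton_sqrt(theta)
implicit_grad = 1.0 / (2.0 * x_star)            # -F_theta / F_x at the solution
analytic_grad = 1.0 / (2.0 * math.sqrt(theta))  # derivative of the true solution
```

At convergence the two derivatives agree to machine precision, whereas the unrolled derivative of a short Newton run would still carry a dependence on x0 and the step count.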

Reverse Mode Through a Solver

Let a solver produce x, and let a scalar loss depend on it:

\ell = L(x, \theta).

If x satisfies

F(x, \theta) = 0,

then reverse-mode implicit differentiation solves

F_x^T \lambda = \bar{x}^T,

where

\bar{x} = \frac{\partial \ell}{\partial x}.

Then the parameter gradient contribution is

\bar{\theta} = -\lambda^T F_\theta.

This is the standard backward rule for a root-solving layer.

In implementation, the backward pass usually needs three operations:

vjp_x(v)     = v^T F_x
vjp_theta(v) = v^T F_theta
linear_solve(F_x^T, xbar)

The linear solve gives λ. The vector-Jacobian product with respect to θ gives the final gradient.
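In matrix form, the backward rule can be sketched with NumPy (the function name and calling convention are illustrative, not from any particular framework):

```python
import numpy as np

def root_layer_backward(F_x, F_theta, xbar):
    """Backward pass for a root-solving layer x = solve(theta)
    with residual F(x, theta) = 0 at the returned solution.

    F_x     : (n, n) Jacobian of F with respect to x
    F_theta : (n, p) Jacobian of F with respect to theta
    xbar    : (n,) upstream adjoint d(loss)/dx
    Returns thetabar = -lambda^T F_theta, where F_x^T lambda = xbar.
    """
    lam = np.linalg.solve(F_x.T, xbar)  # adjoint linear solve
    return -lam @ F_theta               # VJP with respect to theta
```

For a linear residual F(x, θ) = Ax − θ this reduces to θ̄ = A^{-T} x̄, which matches differentiating x = A^{-1}θ directly.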

Linear Solvers

Linear systems are the simplest case.

Let

Ax = b.

The residual is

F(x, A, b) = Ax - b.

Differentiate:

A\,dx + dA\,x - db = 0.

So

dx = A^{-1}(db - dA\,x).

For reverse mode, given an upstream adjoint xbar, solve

A^T \lambda = \bar{x}.

Then

\bar{b} = \lambda,

and

\bar{A} = -\lambda x^T.

This rule is widely used in differentiable scientific computing. It avoids differentiating through Gaussian elimination, conjugate gradient iterations, or a library call such as LAPACK.

The derivative depends on the equation Ax=b, not on the particular algorithm used to solve it.
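A NumPy sketch of this pair of rules (the function names are illustrative; the rule itself is the standard adjoint of a linear solve):

```python
import numpy as np

def solve_forward(A, b):
    """Primal solve x = A^{-1} b (any algorithm could be used here)."""
    return np.linalg.solve(A, b)

def solve_backward(A, x, xbar):
    """Adjoint of x = A^{-1} b: solve A^T lambda = xbar,
    then bbar = lambda and Abar = -lambda x^T."""
    lam = np.linalg.solve(A.T, xbar)
    Abar = -np.outer(lam, x)
    bbar = lam
    return Abar, bbar
```

A finite-difference check of d(c·x)/dA against `Abar` is a quick way to validate such a rule before trusting it in training.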

Iterative Linear Solvers

For large systems, A may never be formed explicitly. A conjugate-gradient solver may only need matrix-vector products:

y = matvec_A(x)

The backward pass can also be matrix-free. It needs products with A^T.

If A is symmetric, then A^T = A, and the same matrix-vector product can be reused.

For non-symmetric systems, the implementation must provide a transpose matvec or allow AD to construct one. Without this, reverse-mode differentiation may be impossible or inefficient.

The key interface is:

solve(A, b) -> x
transpose_solve(A, xbar) -> lambda

A production AD system should allow custom derivative rules for such solver primitives.
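A matrix-free conjugate-gradient sketch illustrates the interface (a minimal implementation of our own, assuming A is symmetric positive definite so the backward solve can reuse the same matvec):

```python
import numpy as np

def cg_solve(matvec, b, tol=1e-10, max_iter=1000):
    """Conjugate gradients for an SPD operator given only v -> A @ v."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()               # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Because A is symmetric, the backward solve A^T lambda = xbar
# reuses the very same matvec: lam = cg_solve(matvec, xbar).
```

For non-symmetric operators the same skeleton needs a separate transpose matvec, which is exactly the `transpose_solve` entry in the interface above.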

Nonlinear Solvers

For nonlinear equations,

F(x, \theta) = 0,

the forward pass may use Newton, quasi-Newton, trust-region methods, or fixed-point iteration.

The backward pass needs a linearized solve:

F_x^T \lambda = \bar{x}^T.

The matrix F_x is the Jacobian of the residual at the final solution.

This has an important consequence: the nonlinear solver and backward linear solver are different computations. A robust forward nonlinear solve does not automatically imply a stable backward solve.

The derivative can be poor when F_x is ill-conditioned: small perturbations in θ then cause large changes in x.

Stopping Rules

Solvers usually stop when a tolerance is reached:

while norm(F(x, theta)) > tol:
    x = step(x, theta)

This introduces a discontinuity in the execution trace. A tiny change in θ can change the number of iterations.

Unrolled AD differentiates the path that occurred. It usually ignores the derivative of the discrete stopping decision.

Implicit AD avoids this specific issue by differentiating the final residual equation. But it assumes the returned value is close enough to a true solution.

A practical rule is:

\|F(x, \theta)\|

must be small enough that the implicit gradient is meaningful.

If the primal residual is large, the backward pass may report a precise derivative of the wrong object.

Solver Accuracy and Gradient Accuracy

Forward accuracy and gradient accuracy are related but separate.

A solver may produce a value with small forward error but poor sensitivity when the problem is ill-conditioned. Conversely, a moderately inaccurate solution may yield acceptable gradients if the loss is insensitive to the error.

For implicit differentiation, three quantities matter:

| Quantity | Meaning |
| --- | --- |
| primal residual | how well x satisfies F(x, θ) = 0 |
| Jacobian conditioning | sensitivity of the solution map |
| adjoint solve residual | how accurately the backward linear system is solved |

A useful implementation records all three. Silent gradient errors are common when only the forward solver status is checked.
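A sketch of such bookkeeping (the function name and dictionary layout are illustrative):

```python
import numpy as np

def gradient_diagnostics(F, F_x, x, theta, lam, xbar):
    """Record the three quantities that govern implicit-gradient quality.

    F   : residual function, F(x, theta) -> (n,)
    F_x : Jacobian function, F_x(x, theta) -> (n, n)
    lam : adjoint vector from the backward linear solve
    """
    J = F_x(x, theta)
    return {
        "primal_residual": np.linalg.norm(F(x, theta)),   # how well F(x, theta) = 0 holds
        "jacobian_cond": np.linalg.cond(J),               # sensitivity of the solution map
        "adjoint_residual": np.linalg.norm(J.T @ lam - xbar),  # accuracy of the backward solve
    }
```

Logging these alongside the forward solver status makes silent gradient errors much easier to catch.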

Differentiating Solver Libraries

Many solvers live outside the AD system: BLAS, LAPACK, PETSc, SUNDIALS, IPOPT, OSQP, custom CUDA kernels, or domain-specific simulators.

An AD system cannot inspect these calls unless they are written in traceable code. There are two common solutions.

The first is wrapping the solver as a primitive with a custom derivative rule.

The second is rewriting the solver in a differentiable language.

Wrapping is usually better for mature numerical libraries. The library already has stable algorithms. The AD system only needs a correct mathematical derivative.

Rewriting is useful when the solver itself contains differentiable modeling choices, learned parameters, or custom control flow.

Custom Derivative Rule

A solver primitive can be described as:

x = solve(theta)

with residual:

F(x, theta) = 0

The custom backward rule is:

backward(xbar):
    lambda = solve_transpose_jacobian(F, x, theta, xbar)
    thetabar = - vjp_theta(F, x, theta, lambda)
    return thetabar

This separates the primal solver from the derivative logic.

The forward solver may use one algorithm. The backward solve may use another algorithm. Both must correspond to the same residual equation.
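One way to package this separation (an illustrative sketch; the class name and Jacobian callbacks are our own, and a real AD framework would register forward and backward through its own primitive API):

```python
import numpy as np

class SolverPrimitive:
    """Wrap a black-box solver as a differentiable primitive.
    solve(theta) may call any external library; the backward pass
    only needs the residual Jacobians at the returned solution."""

    def __init__(self, solve, jac_x, jac_theta):
        self.solve = solve          # theta -> x with F(x, theta) = 0
        self.jac_x = jac_x          # (x, theta) -> F_x
        self.jac_theta = jac_theta  # (x, theta) -> F_theta

    def forward(self, theta):
        x = self.solve(theta)            # any algorithm, any library
        self._x, self._theta = x, theta  # saved for the backward pass
        return x

    def backward(self, xbar):
        F_x = self.jac_x(self._x, self._theta)
        F_theta = self.jac_theta(self._x, self._theta)
        lam = np.linalg.solve(F_x.T, xbar)  # adjoint solve at the solution
        return -lam @ F_theta               # thetabar
```

For the residual F(x, θ) = x² − θ applied elementwise, `forward` can call a library square root while `backward` still returns x̄/(2x), independent of how the root was found.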

Memory Behavior

Unrolled reverse mode stores intermediate states unless checkpointing is used. For long solvers, this can dominate memory.

Implicit differentiation stores only the final solution and enough data to evaluate the residual and its derivatives.

| Solver length | Unrolled reverse memory | Implicit reverse memory |
| --- | --- | --- |
| 10 steps | small | small |
| 1,000 steps | large | small |
| 1,000,000 steps | usually infeasible | possible if residual products are cheap |

This is one of the main reasons implicit methods are used for equilibrium models, PDE solvers, and optimization layers.

When Unrolling Is Better

Implicit differentiation is not always the right abstraction.

Unrolling is better when the algorithm itself is the model. Examples include learned optimizers, recurrent networks with a fixed number of steps, iterative refinement blocks trained end-to-end, and procedures whose early stopping behavior is intentional.

Unrolling also works when the residual equation is hard to define or when the solver is not actually converging to a well-defined solution.

A practical test is simple:

If changing the solver algorithm should change the model, use unrolled differentiation.

If changing the solver algorithm should preserve the model, use implicit differentiation.

Summary

Differentiating through solvers requires choosing the right object of differentiation.

Unrolled AD differentiates the computation that was executed. It is general, direct, and often expensive.

Implicit AD differentiates the equation solved by the computation. It is usually more memory efficient and more mathematically stable, provided the residual is correct, the solution is accurate, and the linearized system is well conditioned.

For modern AD systems, solvers should often be treated as differentiable primitives with custom forward and backward rules. This gives the programmer control over the mathematical meaning of the derivative while preserving the performance of specialized numerical code.