# Differentiating Through Solvers

A solver is a program that computes a value by search, iteration, or factorization. Instead of evaluating a closed-form expression, it finds a value that satisfies a condition.

Common examples include:

| Solver type | Output condition |
|---|---|
| Linear solver | `Ax = b` |
| Nonlinear root solver | `F(x, θ) = 0` |
| Fixed-point solver | `x = G(x, θ)` |
| Optimizer | `∇_x L(x, θ) = 0` |
| ODE solver | trajectory satisfies differential equation |
| Projection operator | output lies in constraint set |

Automatic differentiation can treat solvers in two different ways.

The first method differentiates the solver algorithm. The second method differentiates the mathematical problem solved by the algorithm.

These two methods often give different derivatives.

## Differentiating the Algorithm

Suppose a solver runs an iteration

$$
x_{k+1} = G(x_k, \theta)
$$

for `K` steps. The program returns

$$
x_K.
$$

A normal AD system differentiates exactly this computation:

$$
\theta \mapsto x_0 \mapsto x_1 \mapsto \cdots \mapsto x_K.
$$

This is called unrolled differentiation.

For example:

```text
x = x0
for k in 1..K:
    x = G(x, theta)
return x
```

Reverse mode stores or reconstructs the intermediate states

```text
x0, x1, ..., xK
```

and propagates adjoints backward through the loop.

This method is simple and general. It works with ordinary AD machinery. It also gives the derivative of the actual finite computation.

That matters when the number of iterations is part of the model. For example, an iterative neural network block with exactly ten steps should usually be differentiated as ten steps.
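
As a concrete sketch, assuming JAX as the AD system and using an illustrative contraction `G` (Heron's iteration, whose fixed point is `√θ`), unrolled differentiation is just ordinary reverse mode applied to the loop:

```python
import jax

def G(x, theta):
    # Heron's iteration: the fixed point of G is sqrt(theta).
    return 0.5 * (x + theta / x)

def unrolled_solver(theta, x0=1.0, K=10):
    x = x0
    for _ in range(K):          # a fixed number of steps, all traced by AD
        x = G(x, theta)
    return x

# Derivative of the finite K-step computation, not of sqrt(theta) itself.
print(jax.grad(unrolled_solver)(2.0))   # ≈ 1 / (2 * sqrt(2)) for large enough K
```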

## Differentiating the Solved Problem

If the solver is only a numerical method for finding a solution, the finite iteration path may be irrelevant.

Suppose the intended output is the fixed point

$$
x^* = G(x^*, \theta).
$$

The iteration count is an implementation detail. We want the derivative of the fixed point, not the derivative of the finite approximation.

In that case, define

$$
F(x,\theta)=x-G(x,\theta).
$$

Then

$$
F(x^*,\theta)=0.
$$

Implicit differentiation gives

$$
\frac{\partial x^*}{\partial \theta} =
(I-G_x)^{-1}G_\theta.
$$

This derivative depends on the local equation at the solution. It does not depend on the sequence of iterates used to reach the solution.
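
A minimal sketch of this formula, assuming JAX and the same illustrative contraction `G(x, θ) = 0.5 (x + θ/x)` whose fixed point is `√θ`; in the scalar case the matrix inverse becomes a division:

```python
import jax
import jax.numpy as jnp

def G(x, theta):
    return 0.5 * (x + theta / x)     # fixed point: x* = sqrt(theta)

theta = 2.0
x_star = jnp.sqrt(theta)             # assume the forward solver returned this

G_x = jax.grad(G, argnums=0)(x_star, theta)
G_theta = jax.grad(G, argnums=1)(x_star, theta)

# Scalar version of (I - G_x)^{-1} G_theta.
print(G_theta / (1.0 - G_x))         # 1 / (2 * sqrt(theta)) ≈ 0.3536
```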

## Unrolled vs Implicit Derivatives

The distinction is important.

For `K` unrolled steps,

$$
\frac{\partial x_K}{\partial \theta}
$$

measures sensitivity of the finite computation.

For the converged solution,

$$
\frac{\partial x^*}{\partial \theta}
$$

measures sensitivity of the mathematical fixed point.

As `K` becomes large, these may approach the same value if the iteration converges smoothly and the derivatives are stable. In finite precision, they may still differ because convergence tolerances, stopping rules, and numerical conditioning matter.

| Method | Differentiates | Main cost | Main risk |
|---|---|---|---|
| Unrolled AD | solver execution trace | memory proportional to iterations | unstable long chains |
| Implicit AD | solution condition | linear solve at solution | invalid residual or singular Jacobian |

## Example: Newton Solver

Consider a scalar equation

$$
F(x,\theta)=x^2-\theta=0.
$$

The solution is

$$
x^*=\sqrt{\theta}
$$

for positive `θ`.

Newton's method computes

$$
x_{k+1} =
x_k -
\frac{x_k^2-\theta}{2x_k}.
$$

Unrolled AD differentiates the finite sequence of Newton steps. The derivative depends on the initial value `x0` and the number of iterations `K`.

Implicit differentiation uses

$$
F_x = 2x,
\qquad
F_\theta = -1.
$$

Therefore,

$$
\frac{dx^*}{d\theta} = -
F_x^{-1}F_\theta =
\frac{1}{2x^*}.
$$

At convergence, this equals the derivative of the true solution. It is independent of the Newton path.
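
The agreement can be checked directly. A small sketch, assuming JAX; the initial value and step count are illustrative:

```python
import jax

def newton_solver(theta, x0=1.0, K=6):
    x = x0
    for _ in range(K):                     # fixed, unrolled Newton steps
        x = x - (x**2 - theta) / (2.0 * x)
    return x

theta = 2.0
x_star = newton_solver(theta)

unrolled = jax.grad(newton_solver)(theta)  # derivative of the K-step computation
implicit = 1.0 / (2.0 * x_star)            # -F_x^{-1} F_theta at the converged x

print(unrolled, implicit)                  # both ≈ 1 / (2 * sqrt(2)) ≈ 0.35355
```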

## Reverse Mode Through a Solver

Let a solver produce `x`, and let a scalar loss depend on it:

$$
\ell = L(x,\theta).
$$

If `x` satisfies

$$
F(x,\theta)=0,
$$

then reverse-mode implicit differentiation solves

$$
F_x^T \lambda = \bar{x}^T,
$$

where

$$
\bar{x}=\frac{\partial \ell}{\partial x}.
$$

Then the parameter gradient contribution is

$$
\bar{\theta} = -
\lambda^T F_\theta.
$$

This is the standard backward rule for a root-solving layer.

In implementation, the backward pass usually needs three operations:

```text
vjp_x(v)     = v^T F_x
vjp_theta(v) = v^T F_theta
lambda       = linear_solve(F_x^T, xbar)
```

The linear solve gives `λ`. The vector-Jacobian product with respect to `θ` gives the final gradient.
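
A compact sketch of these three operations, assuming JAX and a system small enough to form the Jacobian explicitly; the function name `implicit_vjp` is illustrative:

```python
import jax
import jax.numpy as jnp

def implicit_vjp(F, x, theta, xbar):
    """Backward rule for x defined by F(x, theta) = 0 (dense sketch)."""
    # Solve F_x^T lambda = xbar using the full Jacobian at the solution.
    F_x = jax.jacobian(F, argnums=0)(x, theta)
    lam = jnp.linalg.solve(F_x.T, xbar)
    # thetabar = -lambda^T F_theta, as a vector-Jacobian product.
    _, vjp_theta = jax.vjp(lambda th: F(x, th), theta)
    (thetabar,) = vjp_theta(-lam)
    return thetabar
```

For large systems, the explicit Jacobian would be replaced by a matrix-free linear solve, as discussed below.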

## Linear Solvers

Linear systems are the simplest case.

Let

$$
Ax=b.
$$

The residual is

$$
F(x,A,b)=Ax-b.
$$

Differentiate:

$$
A\,dx + dA\,x - db = 0.
$$

So

$$
dx = A^{-1}(db-dA\,x).
$$

For reverse mode, given an upstream adjoint `xbar`, solve

$$
A^T \lambda = \bar{x}.
$$

Then

$$
\bar{b} = \lambda,
$$

and

$$
\bar{A} = -\lambda x^T.
$$

This rule is widely used in differentiable scientific computing. It avoids differentiating through Gaussian elimination, conjugate gradient iterations, or a library call such as LAPACK.

The derivative depends on the equation `Ax=b`, not on the particular algorithm used to solve it.
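
A sketch of this rule as a solver primitive with a custom derivative, assuming JAX's `jax.custom_vjp` mechanism; the dense `jnp.linalg.solve` stands in for whatever factorization a library would actually use:

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def linear_solve(A, b):
    return jnp.linalg.solve(A, b)

def linear_solve_fwd(A, b):
    x = jnp.linalg.solve(A, b)
    return x, (A, x)                     # save what the backward pass needs

def linear_solve_bwd(res, xbar):
    A, x = res
    lam = jnp.linalg.solve(A.T, xbar)    # A^T lambda = xbar
    return -jnp.outer(lam, x), lam       # Abar = -lambda x^T,  bbar = lambda

linear_solve.defvjp(linear_solve_fwd, linear_solve_bwd)

# Usage: gradients of a scalar loss with respect to A and b.
A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
b = jnp.array([1.0, 0.0])
loss = lambda A, b: jnp.sum(linear_solve(A, b) ** 2)
Abar, bbar = jax.grad(loss, argnums=(0, 1))(A, b)
```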

## Iterative Linear Solvers

For large systems, `A` may never be formed explicitly. A conjugate-gradient solver may only need matrix-vector products:

```text
y = matvec_A(x)
```

The backward pass can also be matrix-free. It needs products with `A^T`.

If `A` is symmetric, then `A^T = A`, and the same matrix-vector product can be reused.

For non-symmetric systems, the implementation must provide a transpose matvec or allow AD to construct one. Without this, reverse-mode differentiation may be impossible or inefficient.

The key interface is:

```text
solve(A, b) -> x
transpose_solve(A, xbar) -> lambda
```

A production AD system should allow custom derivative rules for such solver primitives.
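
A minimal matrix-free sketch, assuming JAX's conjugate-gradient routine `jax.scipy.sparse.linalg.cg` and an illustrative symmetric positive-definite operator:

```python
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def matvec_A(v):
    # Illustrative SPD operator: 4*I plus nearest-neighbour coupling.
    return 4.0 * v + jnp.roll(v, 1) + jnp.roll(v, -1)

b = jnp.ones(8)
x, _ = cg(matvec_A, b)           # forward solve:  A x = b

xbar = jnp.ones(8)               # upstream adjoint from the loss
lam, _ = cg(matvec_A, xbar)      # backward solve: A^T lambda = xbar (A symmetric)
```

For a non-symmetric operator, the backward call would need a separate transpose matvec.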

## Nonlinear Solvers

For nonlinear equations,

$$
F(x,\theta)=0,
$$

the forward pass may use Newton, quasi-Newton, trust-region methods, or fixed-point iteration.

The backward pass needs a linearized solve:

$$
F_x^T \lambda = \bar{x}^T.
$$

The matrix `F_x` is the Jacobian of the residual at the final solution.

This has an important consequence: the nonlinear solver and backward linear solver are different computations. A robust forward nonlinear solve does not automatically imply a stable backward solve.

The derivative can be poor when `F_x` is ill-conditioned. Small perturbations in `θ` can cause large changes in `x`.

## Stopping Rules

Solvers usually stop when a tolerance is reached:

```text
while norm(F(x, theta)) > tol:
    x = step(x, theta)
```

This introduces a discontinuity in the execution trace. A tiny change in `θ` can change the number of iterations.

Unrolled AD differentiates the path that occurred. It usually ignores the derivative of the discrete stopping decision.

Implicit AD avoids this specific issue by differentiating the final residual equation. But it assumes the returned value is close enough to a true solution.

A practical rule: the residual norm `‖F(x, θ)‖` at the returned point must be small enough that the implicit gradient is meaningful.

If the primal residual is large, the backward pass may report a precise derivative of the wrong object.

## Solver Accuracy and Gradient Accuracy

Forward accuracy and gradient accuracy are related but separate.

A solver may produce a value with small forward error but poor sensitivity when the problem is ill-conditioned. Conversely, a moderately inaccurate solution may yield acceptable gradients if the loss is insensitive to the error.

For implicit differentiation, three quantities matter:

| Quantity | Meaning |
|---|---|
| primal residual | how well `x` satisfies `F(x, θ)=0` |
| Jacobian conditioning | sensitivity of the solution map |
| adjoint solve residual | how accurately the backward linear system is solved |

A useful implementation records all three. Silent gradient errors are common when only the forward solver status is checked.
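
A sketch of such a check for a small dense problem, assuming JAX; the function name and return format are illustrative:

```python
import jax
import jax.numpy as jnp

def gradient_diagnostics(F, x, theta, lam, xbar):
    """The three quantities that determine implicit-gradient quality."""
    F_x = jax.jacobian(F, argnums=0)(x, theta)
    primal_residual = jnp.linalg.norm(F(x, theta))           # how well F(x, theta) = 0 holds
    jacobian_cond = jnp.linalg.cond(F_x)                      # sensitivity of the solution map
    adjoint_residual = jnp.linalg.norm(F_x.T @ lam - xbar)    # accuracy of the backward solve
    return primal_residual, jacobian_cond, adjoint_residual
```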

## Differentiating Solver Libraries

Many solvers live outside the AD system: BLAS, LAPACK, PETSc, SUNDIALS, IPOPT, OSQP, custom CUDA kernels, or domain-specific simulators.

An AD system cannot inspect these calls unless they are written in traceable code. There are two common solutions.

The first is wrapping the solver as a primitive with a custom derivative rule.

The second is rewriting the solver in a differentiable language.

Wrapping is usually better for mature numerical libraries. The library already has stable algorithms. The AD system only needs a correct mathematical derivative.

Rewriting is useful when the solver itself contains differentiable modeling choices, learned parameters, or custom control flow.

## Custom Derivative Rule

A solver primitive can be described as:

```text
x = solve(theta)
```

with residual:

```text
F(x, theta) = 0
```

The custom backward rule is:

```text
backward(xbar):
    lambda = solve_transpose_jacobian(F, x, theta, xbar)
    thetabar = - vjp_theta(F, x, theta, lambda)
    return thetabar
```

This separates the primal solver from the derivative logic.

The forward solver may use one algorithm. The backward solve may use another algorithm. Both must correspond to the same residual equation.
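
Putting the pieces together for the scalar Newton example, a sketch assuming JAX's `jax.custom_vjp`; the forward algorithm and iteration count are illustrative:

```python
import jax

def F(x, theta):
    return x**2 - theta                     # residual defining the solution

@jax.custom_vjp
def solve(theta):
    x = 1.0
    for _ in range(20):                     # forward algorithm: Newton's method
        x = x - F(x, theta) / (2.0 * x)
    return x

def solve_fwd(theta):
    x = solve(theta)
    return x, (x, theta)

def solve_bwd(res, xbar):
    x, theta = res
    F_x = jax.grad(F, argnums=0)(x, theta)
    lam = xbar / F_x                        # scalar case of F_x^T lambda = xbar
    _, vjp_theta = jax.vjp(lambda th: F(x, th), theta)
    (thetabar,) = vjp_theta(-lam)           # thetabar = -lambda^T F_theta
    return (thetabar,)

solve.defvjp(solve_fwd, solve_bwd)

print(jax.grad(solve)(2.0))                 # 1 / (2 * sqrt(2)) ≈ 0.35355
```

The backward rule never sees the twenty Newton steps; it only sees the residual equation at the returned solution.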

## Memory Behavior

Unrolled reverse mode stores intermediate states unless checkpointing is used. For long solvers, this can dominate memory.

Implicit differentiation stores only the final solution and enough data to evaluate the residual and its derivatives.

| Solver length | Unrolled reverse memory | Implicit reverse memory |
|---|---:|---:|
| 10 steps | small | small |
| 1,000 steps | large | small |
| 1,000,000 steps | usually infeasible | possible if residual products are cheap |

This is one of the main reasons implicit methods are used for equilibrium models, PDE solvers, and optimization layers.

## When Unrolling Is Better

Implicit differentiation is not always the right abstraction.

Unrolling is better when the algorithm itself is the model. Examples include learned optimizers, recurrent networks with a fixed number of steps, iterative refinement blocks trained end-to-end, and procedures whose early stopping behavior is intentional.

Unrolling also works when the residual equation is hard to define or when the solver is not actually converging to a well-defined solution.

A practical test is simple:

If changing the solver algorithm should change the model, use unrolled differentiation.

If changing the solver algorithm should preserve the model, use implicit differentiation.

## Summary

Differentiating through solvers requires choosing the right object of differentiation.

Unrolled AD differentiates the computation that was executed. It is general, direct, and often expensive.

Implicit AD differentiates the equation solved by the computation. It is usually more memory efficient and more mathematically stable, provided the residual is correct, the solution is accurate, and the linearized system is well conditioned.

For modern AD systems, solvers should often be treated as differentiable primitives with custom forward and backward rules. This gives the programmer control over the mathematical meaning of the derivative while preserving the performance of specialized numerical code.

