
Inverse Problems

An inverse problem asks for causes from effects. A forward model predicts observations from parameters. An inverse model tries to recover parameters from observations.

The usual form is

$$ y = F(\theta), $$

where $\theta$ is an unknown parameter vector and $y$ is the model output. In practice, we observe noisy data

$$ z \approx F(\theta). $$

The inverse problem is to estimate $\theta$ from $z$.

Examples include seismic imaging, medical tomography, material parameter estimation, source reconstruction, system identification, and calibration of physical simulations.

Forward and Inverse Maps

The forward map is usually well-defined:

$$ \theta \mapsto F(\theta). $$

The inverse map may be unstable, non-unique, or only partially defined.

| Problem | Forward direction | Inverse direction |
| --- | --- | --- |
| Heat equation | Initial temperature gives later temperature | Recover initial temperature from later temperature |
| CT scan | Tissue density gives projections | Recover density from projections |
| Seismic imaging | Earth model gives waveforms | Recover subsurface structure |
| Material fitting | Material parameters give deformation | Recover parameters from measured deformation |

Automatic differentiation is useful because inverse problems are often solved by optimization. The gradient of the mismatch between simulated and observed data gives the direction for improving the parameter estimate.

Least-Squares Formulation

A common formulation defines a residual

$$ r(\theta) = F(\theta) - z. $$

The loss is

$$ L(\theta) = \frac{1}{2}\|r(\theta)\|^2. $$

The gradient is

$$ \nabla_\theta L = J(\theta)^\top r(\theta), $$

where

$$ J(\theta) = \frac{\partial F}{\partial \theta}. $$

This equation explains why reverse-mode AD is central. We usually do not need the full Jacobian. We need the product $J^\top r$, which is a vector-Jacobian product.
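As a minimal JAX sketch, suppose `F` is a small made-up forward model and `z` is synthetic data (both are assumptions, not part of any particular application); then `jax.grad` evaluates $J^\top r$ with a single vector-Jacobian product:

```python
import jax
import jax.numpy as jnp

# Toy forward model and synthetic data (illustrative assumptions).
def F(theta):
    return jnp.stack([theta[0] ** 2, jnp.sin(theta[1]), theta[1] * theta[2]])

z = jnp.array([0.2, 0.5, 1.0])

def loss(theta):
    r = F(theta) - z                      # residual
    return 0.5 * jnp.dot(r, r)            # least-squares loss

theta0 = jnp.ones(3)

# jax.grad evaluates J(theta)^T r(theta) through one vector-Jacobian product;
# the full Jacobian is never formed.
g = jax.grad(loss)(theta0)

# The same product, written explicitly as a VJP:
r0, vjp_fn = jax.vjp(lambda t: F(t) - z, theta0)
g_explicit, = vjp_fn(r0)
```

The explicit `jax.vjp` form makes the Jacobian-free structure visible: the cotangent fed backward is simply the residual.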

Ill-Posedness

Many inverse problems are ill-posed. Small errors in the data can cause large errors in the recovered parameters.

A well-posed problem should have:

| Property | Meaning |
| --- | --- |
| Existence | A solution exists |
| Uniqueness | The solution is determined by the data |
| Stability | Small data changes cause small solution changes |

Inverse problems often violate uniqueness or stability. For example, many parameter settings may produce almost identical observations.

Automatic differentiation gives accurate gradients, but accurate gradients do not remove ill-posedness. The model, data, and objective must still be designed carefully.

Regularization

Regularization adds prior structure to the solution. Instead of minimizing only data mismatch,

$$ \frac{1}{2}\|F(\theta)-z\|^2, $$

we minimize

$$ L(\theta) = \frac{1}{2}\|F(\theta)-z\|^2 + \lambda R(\theta). $$

Here $R(\theta)$ penalizes undesirable solutions, and $\lambda$ controls the strength of the penalty.

Common regularizers include:

| Regularizer | Effect |
| --- | --- |
| $\Vert\theta\Vert^2$ | Prefers small parameters |
| $\Vert\nabla \theta\Vert^2$ | Prefers smooth fields |
| $\Vert\theta\Vert_1$ | Encourages sparsity |
| Total variation | Preserves edges while reducing noise |
| Physics constraints | Enforces known conservation laws |

AD computes gradients for both the forward mismatch and the regularization term, provided both are implemented as differentiable programs.
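For illustration, here is a sketch of a regularized objective, assuming the same kind of toy `F` and `z` as above and a hypothetical discrete smoothness penalty with weight `lam`:

```python
import jax
import jax.numpy as jnp

# Toy model, data, and penalty weight (illustrative assumptions).
F = lambda theta: jnp.stack([theta[0] ** 2, jnp.sin(theta[1]), theta[1] * theta[2]])
z = jnp.array([0.2, 0.5, 1.0])
lam = 1e-2

def R(theta):
    return jnp.sum(jnp.diff(theta) ** 2)   # discrete smoothness penalty

def objective(theta):
    r = F(theta) - z                        # data mismatch
    return 0.5 * jnp.dot(r, r) + lam * R(theta)

# One call differentiates both the mismatch and the regularizer.
grad_objective = jax.grad(objective)
```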

Adjoint Methods

Inverse problems often have many parameters and relatively few scalar objectives. Reverse-mode AD and adjoint methods are therefore natural.

Suppose the forward model is defined by a differential equation:

$$ G(u, \theta) = 0, $$

where $u$ is the state and $\theta$ is the parameter. The loss is

$$ L(u, \theta). $$

Direct differentiation gives

$$ G_u \frac{du}{d\theta} + G_\theta = 0. $$

So

$$ \frac{du}{d\theta} = -G_u^{-1} G_\theta. $$

Substituting into the derivative of $L$ would require solving one system per parameter. This is too expensive when $\theta$ is large.

The adjoint method avoids this. Define an adjoint variable $\lambda$ by

$$ G_u^\top \lambda = L_u^\top. $$

Then the gradient is

$$ \nabla_\theta L = L_\theta - G_\theta^\top \lambda. $$

This requires one adjoint solve per scalar loss, rather than one forward sensitivity solve per parameter.
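A scalar sketch makes the bookkeeping concrete. The state equation `G`, loss `L`, and Newton iteration below are illustrative stand-ins for a real forward solver:

```python
import jax

# Illustrative scalar state equation G(u, theta) = 0 and loss L(u, theta).
def G(u, theta):
    return u ** 3 + theta * u - 1.0

def L(u, theta):
    return (u - 0.5) ** 2 + 0.1 * theta ** 2

theta = 2.0

# Solve G(u, theta) = 0 with a few Newton steps (stand-in forward solver).
u = 0.5
for _ in range(20):
    u = u - G(u, theta) / jax.grad(G, argnums=0)(u, theta)

# Partial derivatives at the converged state.
G_u     = jax.grad(G, argnums=0)(u, theta)
G_theta = jax.grad(G, argnums=1)(u, theta)
L_u     = jax.grad(L, argnums=0)(u, theta)
L_theta = jax.grad(L, argnums=1)(u, theta)

lam = L_u / G_u                            # adjoint solve: G_u^T lam = L_u^T
dL_dtheta = L_theta - G_theta * lam        # adjoint gradient formula
```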

Discrete Inverse Problems

Many inverse problems are solved after discretization. The state becomes a vector $u$, the parameters become a vector $\theta$, and the governing equation becomes a finite-dimensional system.

For example:

$$ A(\theta)\, u = b. $$

The observation model might be

$$ F(\theta) = H u, $$

where $H$ selects measured components. The loss is

$$ L(\theta) = \frac{1}{2}\|Hu - z\|^2. $$

AD can differentiate the full computational path:

$$ \theta \to A(\theta) \to u = A(\theta)^{-1} b \to Hu \to L. $$

For efficiency, the linear solve should have a custom derivative rule. Reverse mode uses a transpose solve rather than differentiating through every iteration of an iterative solver.
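A small JAX sketch, assuming an illustrative parameterized matrix `A(theta)`, observation operator `H`, and data `z`. Here `jnp.linalg.solve` already carries its own derivative rule, so reverse mode effectively performs a transpose solve rather than differentiating the factorization step by step:

```python
import jax
import jax.numpy as jnp

# Illustrative discrete problem (all concrete choices are assumptions).
n = 4
b = jnp.ones(n)
H = jnp.eye(n)[:2]                         # observe the first two components
z = jnp.array([0.3, 0.1])

def A(theta):
    # Hypothetical operator: discrete Laplacian plus a diagonal perturbation.
    lap = 2.0 * jnp.eye(n) - jnp.eye(n, k=1) - jnp.eye(n, k=-1)
    return lap + jnp.diag(theta)

def loss(theta):
    u = jnp.linalg.solve(A(theta), b)      # linear solve with a built-in derivative rule
    r = H @ u - z
    return 0.5 * jnp.dot(r, r)

grad_loss = jax.grad(loss)
print(grad_loss(jnp.full(n, 0.5)))
```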

Differentiating Through Solvers

Inverse problems often contain numerical solvers:

| Solver type | Example |
| --- | --- |
| Linear solver | $Ax = b$ |
| Nonlinear solver | $F(x, \theta) = 0$ |
| ODE solver | Time integration |
| PDE solver | Finite element simulation |
| Optimization solver | Inner minimization |

There are two main differentiation strategies.

The first is unrolled differentiation. We differentiate through every solver iteration. This is simple and matches the implemented computation, but it can be memory-heavy and sensitive to iteration count.

The second is implicit differentiation. We differentiate the equation solved at convergence. This is often cleaner and cheaper, but it assumes the solver reached a meaningful fixed point.

| Strategy | Differentiates | Advantage | Cost |
| --- | --- | --- | --- |
| Unrolled AD | Actual iterations | Exact for the executed program | High memory for many iterations |
| Implicit AD | Converged equation | Avoids long tapes | Requires linearized solve |
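The implicit strategy can be sketched for a fixed-point solver with `jax.custom_vjp`. This is a minimal sketch, assuming the iteration converges; the contraction `f` at the end is a made-up example:

```python
import jax
import jax.numpy as jnp
from functools import partial

# Implicit differentiation of a fixed point x = f(x, theta); illustrative sketch.
@partial(jax.custom_vjp, nondiff_argnums=(0,))
def fixed_point(f, theta, x0):
    x = x0
    for _ in range(100):                   # plain forward iteration
        x = f(x, theta)
    return x

def fixed_point_fwd(f, theta, x0):
    x_star = fixed_point(f, theta, x0)
    return x_star, (theta, x_star)

def fixed_point_bwd(f, res, v):
    theta, x_star = res
    _, vjp_x = jax.vjp(lambda x: f(x, theta), x_star)
    # Solve w = v + (df/dx)^T w by iteration (transposed linearized solve).
    w = v
    for _ in range(100):
        w = v + vjp_x(w)[0]
    _, vjp_theta = jax.vjp(lambda t: f(x_star, t), theta)
    return vjp_theta(w)[0], jnp.zeros_like(x_star)   # no gradient through x0

fixed_point.defvjp(fixed_point_fwd, fixed_point_bwd)

# Example: x = tanh(W x + theta) for a fixed, made-up W.
W = 0.1 * jnp.eye(3)
f = lambda x, theta: jnp.tanh(W @ x + theta)
g = jax.grad(lambda th: fixed_point(f, th, jnp.zeros(3)).sum())(jnp.ones(3))
```

The backward pass only touches the converged state, so its cost does not grow with the number of forward iterations.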

Identifiability

Identifiability asks whether the parameters can be determined from the observations.

If two different parameters produce the same output,

$$ F(\theta_1) = F(\theta_2), $$

then the inverse problem cannot distinguish them.

Local identifiability is related to the rank of the Jacobian:

$$ J = \frac{\partial F}{\partial \theta}. $$

If $J$ has deficient rank, there are parameter directions that do not change the observations to first order.

These directions are called null directions:

$$ J v = 0. $$

Moving along such a direction leaves the output locally unchanged.

AD helps identify these directions through Jacobian-vector products, vector-Jacobian products, and approximate Hessian methods.
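As a numerical probe, consider a deliberately redundant toy model (an assumption): singular values of its Jacobian near zero flag locally unidentifiable directions, and the corresponding right singular vectors are the null directions.

```python
import jax
import jax.numpy as jnp

# Toy model in which only theta[0] + theta[1] enters the output (illustrative).
F = lambda theta: jnp.stack([theta[0] + theta[1],
                             theta[0] + theta[1],
                             theta[2] ** 2])
theta0 = jnp.ones(3)

J = jax.jacfwd(F)(theta0)                  # small problem: form J directly
U, s, Vt = jnp.linalg.svd(J, full_matrices=False)
print("singular values:", s)               # a near-zero value marks a null direction
print("least identifiable direction:", Vt[-1])

# For large problems, probe J with products instead of forming it:
v = Vt[-1]
_, Jv = jax.jvp(F, (theta0,), (v,))        # forward mode gives J v
```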

Gauss-Newton Methods

For nonlinear least squares,

$$ L(\theta) = \frac{1}{2}\|r(\theta)\|^2, $$

the Hessian is

$$ \nabla^2 L = J^\top J + \sum_i r_i \nabla^2 r_i. $$

Gauss-Newton drops the second term:

$$ H_{\text{GN}} = J^\top J. $$

The update solves

$$ J^\top J \, \Delta\theta = -J^\top r. $$

This method is powerful when residuals are small or the model is close to linear near the solution.

AD supports Gauss-Newton without materializing $J$. We only need products:

$$ J v $$

and

$$ J^\top w. $$

Forward mode gives $Jv$. Reverse mode gives $J^\top w$.
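A matrix-free Gauss-Newton step can be sketched with these products plus conjugate gradients; the toy `F` and `z` below are assumptions:

```python
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

# Toy residual model (illustrative assumptions).
F = lambda theta: jnp.stack([theta[0] ** 2, jnp.sin(theta[1]), theta[1] * theta[2]])
z = jnp.array([0.2, 0.5, 1.0])
residual = lambda theta: F(theta) - z

def gauss_newton_step(theta):
    r, vjp_fn = jax.vjp(residual, theta)            # reverse mode: w -> J^T w
    def JTJ(v):
        _, Jv = jax.jvp(residual, (theta,), (v,))   # forward mode: v -> J v
        return vjp_fn(Jv)[0]                        # J^T (J v), no explicit J
    rhs = -vjp_fn(r)[0]                             # -J^T r
    delta, _ = cg(JTJ, rhs)                         # solve J^T J Δθ = -J^T r
    return theta + delta

theta_new = gauss_newton_step(jnp.ones(3))
```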

Bayesian Inverse Problems

A Bayesian inverse problem treats parameters as random variables. Instead of returning one estimate, it returns a posterior distribution:

$$ p(\theta \mid z) \propto p(z \mid \theta)\, p(\theta). $$

The negative log posterior often becomes an optimization objective:

$$ L(\theta) = -\log p(z \mid \theta) - \log p(\theta). $$

AD provides gradients for sampling and variational methods, including:

| Method | Uses gradients for |
| --- | --- |
| Hamiltonian Monte Carlo | Simulating posterior dynamics |
| Langevin dynamics | Gradient-informed sampling |
| Variational inference | Optimizing approximate posterior |
| Laplace approximation | Computing local curvature |

In this setting, derivatives support uncertainty quantification, not only point estimation.
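As one illustration, an unadjusted Langevin sampler only needs the gradient of the negative log posterior. The Gaussian likelihood and prior, the toy `F` and `z`, and the step size below are all assumptions:

```python
import jax
import jax.numpy as jnp

# Toy model, data, and scales (illustrative assumptions).
F = lambda theta: jnp.stack([theta[0] ** 2, jnp.sin(theta[1]), theta[1] * theta[2]])
z = jnp.array([0.2, 0.5, 1.0])
sigma, tau = 0.1, 1.0                      # noise and prior standard deviations

def neg_log_posterior(theta):
    r = F(theta) - z
    return 0.5 * jnp.dot(r, r) / sigma**2 + 0.5 * jnp.dot(theta, theta) / tau**2

def langevin_step(theta, key, eps=1e-4):
    g = jax.grad(neg_log_posterior)(theta)          # gradient-informed proposal
    noise = jax.random.normal(key, theta.shape)
    return theta - eps * g + jnp.sqrt(2 * eps) * noise

key = jax.random.PRNGKey(0)
theta = jnp.ones(3)
for _ in range(100):
    key, sub = jax.random.split(key)
    theta = langevin_step(theta, sub)
```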

Noise and Observation Models

The loss function should match the noise model.

If observation noise is Gaussian,

$$ z = F(\theta) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I), $$

then least squares is natural.

If noise is not Gaussian, another likelihood may be better.

| Noise model | Typical loss |
| --- | --- |
| Gaussian | Squared error |
| Laplace | Absolute error |
| Poisson | Poisson negative log likelihood |
| Bernoulli | Cross entropy |
| Heavy-tailed | Robust losses |

AD makes it easy to change the loss, but the statistical meaning changes with it.
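For example, switching noise models is a local change to the objective. The toy `F` and `z` are assumptions, and the exponential link in the Poisson case is just one possible choice:

```python
import jax.numpy as jnp

# Toy model and data (illustrative assumptions).
F = lambda theta: jnp.stack([theta[0] ** 2, jnp.sin(theta[1]), theta[1] * theta[2]])
z = jnp.array([0.2, 0.5, 1.0])

gaussian_loss = lambda theta: 0.5 * jnp.sum((F(theta) - z) ** 2)   # Gaussian noise
laplace_loss  = lambda theta: jnp.sum(jnp.abs(F(theta) - z))       # Laplace noise

def poisson_nll(theta):
    rate = jnp.exp(F(theta))               # exponential link keeps rates positive
    return jnp.sum(rate - z * jnp.log(rate))   # Poisson negative log likelihood
```

All three are differentiable programs, but the absolute value is non-smooth at zero residual, and the statistical interpretation of the estimate changes with the loss.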

Constraints

Many inverse problems have constraints:

$$ \theta \in C. $$

Examples include positivity, conservation laws, bounds, smoothness, monotonicity, and geometric feasibility.

Constraints can be handled by:

| Method | Idea |
| --- | --- |
| Reparameterization | Write $\theta = g(\phi)$ so constraints hold automatically |
| Penalty methods | Add constraint violation to loss |
| Projected methods | Project updates back into feasible set |
| Barrier methods | Prevent crossing constraint boundaries |
| Constrained solvers | Solve KKT systems directly |

AD supplies derivatives for the objective and constraint functions. The optimization algorithm must still enforce feasibility.
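A reparameterization sketch for a positivity constraint, using softplus as one possible choice of $g$ and a toy `F`, `z` as assumptions:

```python
import jax
import jax.numpy as jnp

# Toy model and data (illustrative assumptions).
F = lambda theta: jnp.stack([theta[0] ** 2, jnp.sin(theta[1]), theta[1] * theta[2]])
z = jnp.array([0.2, 0.5, 1.0])

def objective(phi):
    theta = jax.nn.softplus(phi)           # theta = log(1 + exp(phi)) > 0
    r = F(theta) - z
    return 0.5 * jnp.dot(r, r)

# AD applies the chain rule through the transform; the optimizer only ever
# sees the unconstrained variable phi.
grad_phi = jax.grad(objective)(jnp.zeros(3))
```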

Failure Modes

Inverse problems fail in characteristic ways.

| Failure mode | Cause |
| --- | --- |
| Non-unique solution | Insufficient observations |
| Unstable solution | Ill-conditioned forward map |
| Overfitting noise | Weak regularization |
| Biased estimate | Wrong model class |
| Bad gradient | Discontinuous solver logic |
| Slow convergence | Poor scaling or conditioning |
| False confidence | Ignored uncertainty |

AD solves the derivative computation problem. It does not solve the modeling problem.

Practical Design Pattern

A practical differentiable inverse problem usually has this structure:

parameters theta
    -> constrained parameter transform
    -> physical or statistical forward model
    -> numerical solver
    -> observation operator
    -> residual
    -> regularized loss
    -> gradient by AD
    -> optimizer or sampler

Each stage should have clear derivative semantics. The most important design decision is where to use general AD and where to provide custom rules.

Good candidates for custom rules include:

| Component | Reason |
| --- | --- |
| Linear solves | Use transpose solves |
| Nonlinear fixed points | Avoid differentiating many iterations |
| ODE/PDE solvers | Control memory and stability |
| Interpolation | Define consistent boundary behavior |
| Discontinuous events | Expose piecewise derivative semantics |
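Assembling the stages, a compact and entirely illustrative JAX sketch of the pattern might look like the following; every concrete choice (softplus transform, tridiagonal operator, observation matrix, penalty weight, step size) is an assumption, not a prescribed implementation:

```python
import jax
import jax.numpy as jnp

# Illustrative end-to-end differentiable inverse problem.
H_obs = jnp.eye(4)[:2]                     # observation operator
b = jnp.ones(4)
z_obs = jnp.array([0.3, 0.1])              # synthetic data

def forward(phi):
    theta = jax.nn.softplus(phi)                                  # constrained transform
    A = (2.0 * jnp.eye(4) - jnp.eye(4, k=1) - jnp.eye(4, k=-1)
         + jnp.diag(theta))                                       # forward model
    u = jnp.linalg.solve(A, b)                                    # numerical solver
    return H_obs @ u                                              # observation operator

def loss(phi):
    r = forward(phi) - z_obs                                      # residual
    return 0.5 * jnp.dot(r, r) + 1e-3 * jnp.dot(phi, phi)         # regularized loss

# Gradient by AD, fed to a simple optimizer loop.
phi = jnp.zeros(4)
for _ in range(200):
    phi = phi - 0.1 * jax.grad(loss)(phi)
```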

Summary

Inverse problems recover hidden causes from observed effects. They are usually solved by minimizing a mismatch between simulated and observed data, often with regularization or Bayesian priors.

Automatic differentiation is central because it provides gradients, adjoints, Jacobian products, and Hessian approximations for complex forward models. However, inverse problems remain limited by identifiability, conditioning, noise, model error, and constraints.

The best AD implementations for inverse problems are solver-aware. They combine reverse mode, implicit differentiation, sparse linear algebra, and checkpointing rather than differentiating every low-level operation blindly.