An inverse problem asks for causes from effects. A forward model predicts observations from parameters; an inverse model tries to recover parameters from observations.
The usual form is

$$ y = F(\theta), $$

where $\theta$ is an unknown parameter vector and $y$ is the model output. In practice, we observe noisy data

$$ d = F(\theta) + \varepsilon. $$

The inverse problem is to estimate $\theta$ from $d$.
Examples include seismic imaging, medical tomography, material parameter estimation, source reconstruction, system identification, and calibration of physical simulations.
## Forward and Inverse Maps
The forward map is usually well-defined:

$$ F : \theta \mapsto y. $$
The inverse map may be unstable, non-unique, or only partially defined.
| Problem | Forward direction | Inverse direction |
|---|---|---|
| Heat equation | Initial temperature gives later temperature | Recover initial temperature from later temperature |
| CT scan | Tissue density gives projections | Recover density from projections |
| Seismic imaging | Earth model gives waveforms | Recover subsurface structure from waveforms |
| Material fitting | Material parameters give deformation | Recover parameters from measured deformation |
Automatic differentiation is useful because inverse problems are often solved by optimization. The gradient of the mismatch between simulated and observed data gives the direction for improving the parameter estimate.
## Least-Squares Formulation

A common formulation defines a residual

$$ r(\theta) = F(\theta) - d. $$

The loss is

$$ L(\theta) = \tfrac{1}{2}\,\|r(\theta)\|^2. $$

The gradient is

$$ \nabla L(\theta) = J(\theta)^\top r(\theta), $$

where

$$ J(\theta) = \frac{\partial F}{\partial \theta}. $$

This equation explains why reverse-mode AD is central. We usually do not need the full Jacobian. We need the product $J^\top r$, which is a vector-Jacobian product.
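As a concrete sketch in JAX (with a hypothetical toy `forward` model standing in for a real simulator), the gradient $J^\top r$ falls out of a single reverse-mode sweep:

```python
import jax
import jax.numpy as jnp

# Hypothetical toy forward model standing in for a real simulator.
def forward(theta):
    return jnp.sin(theta[0] * jnp.arange(5.0)) + theta[1]

def loss(theta, d):
    r = forward(theta) - d         # r(theta) = F(theta) - d
    return 0.5 * jnp.dot(r, r)     # L(theta) = 1/2 ||r||^2

theta, d = jnp.array([0.7, 0.1]), jnp.zeros(5)

# One reverse-mode sweep evaluates J^T r without forming J.
g = jax.grad(loss)(theta, d)

# The same quantity as an explicit vector-Jacobian product:
y, vjp_fn = jax.vjp(forward, theta)
(g_vjp,) = vjp_fn(y - d)           # pulls the residual back through F
```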
## Ill-Posedness
Many inverse problems are ill-posed. Small errors in the data can cause large errors in the recovered parameters.
A well-posed problem should have:
| Property | Meaning |
|---|---|
| Existence | A solution exists |
| Uniqueness | The solution is determined by the data |
| Stability | Small data changes cause small solution changes |
Inverse problems often violate uniqueness or stability. For example, many parameter settings may produce almost identical observations.
Automatic differentiation gives accurate gradients, but accurate gradients do not remove ill-posedness. The model, data, and objective must still be designed carefully.
## Regularization

Regularization adds prior structure to the solution. Instead of minimizing only the data mismatch

$$ L(\theta) = \tfrac{1}{2}\,\|F(\theta) - d\|^2, $$

we minimize

$$ L(\theta) = \tfrac{1}{2}\,\|F(\theta) - d\|^2 + \lambda\, R(\theta). $$

Here $R$ penalizes undesirable solutions, and $\lambda$ controls the strength of the penalty.
Common regularizers include:
| Regularizer | Effect |
|---|---|
| $\|\theta\|_2^2$ | Prefers small parameters |
| $\|\nabla \theta\|_2^2$ | Prefers smooth fields |
| $\|\theta\|_1$ | Encourages sparsity |
| Total variation | Preserves edges while reducing noise |
| Physics constraints | Enforces known conservation laws |
AD computes gradients for both the forward mismatch and the regularization term, provided both are implemented as differentiable programs.
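A minimal sketch, assuming a toy linear `forward` model and a Tikhonov (L2) penalty:

```python
import jax
import jax.numpy as jnp

# Hypothetical linear forward model with a Tikhonov penalty.
def forward(theta):
    return jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]) @ theta

def objective(theta, d, lam=1e-2):
    r = forward(theta) - d                                     # data mismatch
    return 0.5 * jnp.dot(r, r) + lam * jnp.dot(theta, theta)  # + penalty

# One grad call covers mismatch and regularizer together.
g = jax.grad(objective)(jnp.ones(2), jnp.zeros(3))
```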
## Adjoint Methods
Inverse problems often have many parameters and relatively few scalar objectives. Reverse-mode AD and adjoint methods are therefore natural.
Suppose the forward model is defined by an equation in residual form (for example, a discretized differential equation):

$$ G(u, \theta) = 0, $$

where $u$ is the state and $\theta$ is the parameter. The loss is

$$ L(\theta) = \ell(u(\theta)). $$

Direct differentiation gives

$$ G_u \frac{du}{d\theta} + G_\theta = 0, $$

so

$$ \frac{du}{d\theta} = -\,G_u^{-1} G_\theta. $$

Substituting into the derivative of $L$ would require solving one linear system per parameter. This is too expensive when $\theta$ is high-dimensional.

The adjoint method avoids this. Define an adjoint variable $\lambda$ by

$$ G_u^\top \lambda = \left(\frac{\partial \ell}{\partial u}\right)^{\!\top}. $$

Then the gradient is

$$ \frac{dL}{d\theta} = -\,\lambda^\top G_\theta. $$
This requires one adjoint solve per scalar loss, rather than one forward sensitivity solve per parameter.
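The same structure can be written out for a discrete constraint $G(u, \theta) = A(\theta)u - b = 0$. The sketch below is a hypothetical toy setup; the point is that one forward solve plus one transpose (adjoint) solve yields the whole gradient:

```python
import jax
import jax.numpy as jnp

# Toy discrete constraint: G(u, theta) = A(theta) u - b = 0, loss ell(u).
def A(theta):
    return jnp.diag(theta) + 0.1 * jnp.ones((3, 3))

b = jnp.ones(3)

def ell(u):
    return 0.5 * jnp.dot(u, u)

def adjoint_grad(theta):
    u = jnp.linalg.solve(A(theta), b)           # forward solve: G(u, theta) = 0
    dl_du = jax.grad(ell)(u)                    # (d ell / d u)^T
    lam = jnp.linalg.solve(A(theta).T, dl_du)   # adjoint solve: G_u^T lam = (d ell/d u)^T
    G = lambda th: A(th) @ u - b                # G at fixed u, varying theta
    _, vjp_fn = jax.vjp(G, theta)
    (g,) = vjp_fn(-lam)                         # dL/dtheta = -lam^T G_theta
    return g

theta = jnp.array([1.0, 2.0, 3.0])
# Agrees with reverse-mode AD straight through the solver:
full_loss = lambda th: ell(jnp.linalg.solve(A(th), b))
print(jnp.allclose(adjoint_grad(theta), jax.grad(full_loss)(theta)))  # True
```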
## Discrete Inverse Problems

Many inverse problems are solved after discretization. The state becomes a vector $u$, the parameters become a vector $\theta$, and the governing equation becomes a finite-dimensional system.

For example:

$$ A(\theta)\, u = b. $$

The observation model might be

$$ y = H u, $$

where $H$ selects measured components. The loss is

$$ L(\theta) = \tfrac{1}{2}\,\|H u(\theta) - d\|^2. $$

AD can differentiate the full computational path:

$$ \theta \;\rightarrow\; A(\theta) \;\rightarrow\; u \;\rightarrow\; H u \;\rightarrow\; L. $$
For efficiency, the linear solve should have a custom derivative rule. Reverse mode uses a transpose solve rather than differentiating through every iteration of an iterative solver.
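A sketch of such a rule with `jax.custom_vjp`, assuming a dense direct solver. JAX's built-in `jnp.linalg.solve` already carries an equivalent rule, so this only illustrates the structure a custom iterative solver would need to supply:

```python
import jax
import jax.numpy as jnp

# Structure of the transpose-solve rule for u = solve(A, b).
@jax.custom_vjp
def solve(A, b):
    return jnp.linalg.solve(A, b)

def solve_fwd(A, b):
    u = jnp.linalg.solve(A, b)
    return u, (A, u)

def solve_bwd(res, u_bar):
    A, u = res
    lam = jnp.linalg.solve(A.T, u_bar)   # one transpose solve, no taped iterations
    return -jnp.outer(lam, u), lam       # cotangents for A and b

solve.defvjp(solve_fwd, solve_bwd)
```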
## Differentiating Through Solvers
Inverse problems often contain numerical solvers:
| Solver type | Example |
|---|---|
| Linear solver | $A u = b$ |
| Nonlinear solver | $G(u, \theta) = 0$ |
| ODE solver | Time integration |
| PDE solver | Finite element simulation |
| Optimization solver | Inner minimization |
There are two main differentiation strategies.
The first is unrolled differentiation. We differentiate through every solver iteration. This is simple and matches the implemented computation, but it can be memory-heavy and sensitive to iteration count.
The second is implicit differentiation. We differentiate the equation solved at convergence. This is often cleaner and cheaper, but it assumes the solver reached a meaningful fixed point.
| Strategy | Differentiates | Advantage | Cost |
|---|---|---|---|
| Unrolled AD | Actual iterations | Exact for the executed program | High memory for many iterations |
| Implicit AD | Converged equation | Avoids long tapes | Requires linearized solve |
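A sketch of implicit differentiation for a contractive fixed point $u = f(u, \theta)$, with a hypothetical toy $f$; the backward pass solves the adjoint fixed-point equation by iteration instead of taping the forward iterations:

```python
import jax
import jax.numpy as jnp

# Scalar toy fixed point u = f(u, theta); assumed contractive near u*.
def f(u, theta):
    return jnp.tanh(theta * u + 0.5)

@jax.custom_vjp
def fixed_point(theta, u0):
    u = u0
    for _ in range(100):                 # plain iterations; never taped
        u = f(u, theta)
    return u

def fp_fwd(theta, u0):
    u = fixed_point(theta, u0)
    return u, (theta, u)

def fp_bwd(res, u_bar):
    theta, u = res
    # Adjoint fixed point: w = u_bar + f_u^T w, solved by Neumann iteration.
    _, vjp_u = jax.vjp(lambda uu: f(uu, theta), u)
    w = u_bar
    for _ in range(100):
        w = u_bar + vjp_u(w)[0]
    _, vjp_theta = jax.vjp(lambda th: f(u, th), theta)
    return vjp_theta(w)[0], jnp.zeros_like(u)   # u* is independent of u0

fixed_point.defvjp(fp_fwd, fp_bwd)

g = jax.grad(fixed_point)(0.5, 0.0)      # gradient w.r.t. theta only
```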
## Identifiability

Identifiability asks whether the parameters can be determined from the observations.

If two different parameters produce the same output,

$$ F(\theta_1) = F(\theta_2) \quad \text{with} \quad \theta_1 \neq \theta_2, $$

then the inverse problem cannot distinguish them.

Local identifiability is related to the rank of the Jacobian

$$ J(\theta) = \frac{\partial F}{\partial \theta}. $$

If $J$ has deficient rank, there are parameter directions that do not change the observations to first order. These directions are called null directions:

$$ J v = 0. $$

Moving along such a direction leaves the output locally unchanged.
AD helps identify these directions through Jacobian-vector products, vector-Jacobian products, and approximate Hessian methods.
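For example, a Jacobian-vector product can test a candidate null direction without forming $J$. The toy model below (hypothetical) observes only the sum of the parameters:

```python
import jax
import jax.numpy as jnp

# Hypothetical model that observes only theta[0] + theta[1]: the
# difference direction is unidentifiable.
def forward(theta):
    s = theta[0] + theta[1]
    return jnp.array([s, s ** 2])

theta = jnp.array([1.0, 2.0])
v = jnp.array([1.0, -1.0])                  # candidate null direction

_, jv = jax.jvp(forward, (theta,), (v,))
print(jv)                                   # ~[0, 0]: J v = 0
```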
## Gauss-Newton Methods

For nonlinear least squares

$$ L(\theta) = \tfrac{1}{2}\,\|r(\theta)\|^2, $$

the Hessian is

$$ \nabla^2 L = J^\top J + \sum_i r_i\, \nabla^2 r_i. $$

Gauss-Newton drops the second term:

$$ \nabla^2 L \approx J^\top J. $$

The update solves

$$ J^\top J\, \Delta\theta = -\,J^\top r. $$
This method is powerful when residuals are small or the model is close to linear near the solution.
AD supports Gauss-Newton without materializing $J$. We only need the products

$$ J v \quad \text{and} \quad J^\top w. $$

Forward mode gives $Jv$. Reverse mode gives $J^\top w$.
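A matrix-free Gauss-Newton step can then be assembled from one JVP and one VJP per conjugate-gradient iteration. A sketch, with a hypothetical `forward` model and a small damping term `mu`:

```python
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

# Hypothetical nonlinear forward model.
def forward(theta):
    return jnp.sin(jnp.outer(jnp.arange(1.0, 6.0), theta).sum(axis=1))

def gauss_newton_step(theta, d, mu=1e-6):
    y, vjp_fn = jax.vjp(forward, theta)
    r = y - d
    grad = vjp_fn(r)[0]                            # J^T r by reverse mode

    def jtj_mv(v):
        _, jv = jax.jvp(forward, (theta,), (v,))   # J v by forward mode
        return vjp_fn(jv)[0] + mu * v              # J^T (J v) + damping

    dtheta, _ = cg(jtj_mv, -grad)                  # (J^T J + mu I) dtheta = -J^T r
    return theta + dtheta

theta = gauss_newton_step(jnp.ones(3), jnp.zeros(5))
```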
## Bayesian Inverse Problems

A Bayesian inverse problem treats parameters as random variables. Instead of returning one estimate, it returns a posterior distribution:

$$ p(\theta \mid d) \;\propto\; p(d \mid \theta)\, p(\theta). $$

The negative log posterior often becomes an optimization objective:

$$ L(\theta) = -\log p(d \mid \theta) - \log p(\theta). $$
AD provides gradients for sampling and variational methods, including:
| Method | Uses gradients for |
|---|---|
| Hamiltonian Monte Carlo | Simulating posterior dynamics |
| Langevin dynamics | Gradient-informed sampling |
| Variational inference | Optimizing approximate posterior |
| Laplace approximation | Computing local curvature |
In this setting, derivatives support uncertainty quantification, not only point estimation.
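For instance, an unadjusted Langevin step needs only the gradient of the log posterior. A sketch with a hypothetical toy posterior `log_post`:

```python
import jax
import jax.numpy as jnp

# Hypothetical toy log posterior: Gaussian likelihood + Gaussian prior.
def log_post(theta, d):
    r = jnp.sin(theta) - d
    return -0.5 * jnp.dot(r, r) - 0.5 * jnp.dot(theta, theta)

def langevin_step(theta, d, key, eps=1e-3):
    # theta' = theta + (eps/2) * grad log p(theta|d) + sqrt(eps) * xi
    g = jax.grad(log_post)(theta, d)
    xi = jax.random.normal(key, theta.shape)
    return theta + 0.5 * eps * g + jnp.sqrt(eps) * xi

key = jax.random.PRNGKey(0)
theta = langevin_step(jnp.ones(3), jnp.zeros(3), key)
```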
## Noise and Observation Models

The loss function should match the noise model. If observation noise is Gaussian,

$$ d = F(\theta) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I), $$

then least squares is natural. If noise is not Gaussian, another likelihood may be better.
| Noise model | Typical loss |
|---|---|
| Gaussian | Squared error |
| Laplace | Absolute error |
| Poisson | Poisson negative log likelihood |
| Bernoulli | Cross entropy |
| Heavy-tailed | Robust losses |
AD makes it easy to change the loss, but the statistical meaning changes with it.
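For example, switching the implied noise model is a one-line change in a differentiable program:

```python
import jax.numpy as jnp

# Swapping the loss swaps the implied noise model; AD handles either,
# but the statistics of the estimate change.
def gaussian_loss(r):
    return 0.5 * jnp.sum(r ** 2)   # squared error <-> Gaussian noise

def laplace_loss(r):
    return jnp.sum(jnp.abs(r))     # absolute error <-> Laplace noise
```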
## Constraints

Many inverse problems have constraints:

$$ c(\theta) = 0 \quad \text{or} \quad c(\theta) \le 0. $$
Examples include positivity, conservation laws, bounds, smoothness, monotonicity, and geometric feasibility.
Constraints can be handled by:
| Method | Idea |
|---|---|
| Reparameterization | Write $\theta = g(\phi)$ so constraints hold automatically |
| Penalty methods | Add constraint violation to loss |
| Projected methods | Project updates back into feasible set |
| Barrier methods | Prevent crossing constraint boundaries |
| Constrained solvers | Solve KKT systems directly |
AD supplies derivatives for the objective and constraint functions. The optimization algorithm must still enforce feasibility.
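A sketch of the reparameterization approach, assuming a positivity constraint and a hypothetical exponential-decay model:

```python
import jax
import jax.numpy as jnp

# Positivity by reparameterization: optimize unconstrained phi, map to
# theta > 0. The decay model here is hypothetical.
def to_positive(phi):
    return jax.nn.softplus(phi)               # theta = log(1 + e^phi) > 0

def loss(phi, d):
    theta = to_positive(phi)
    r = jnp.exp(-theta * jnp.arange(5.0)) - d
    return 0.5 * jnp.dot(r, r)

g = jax.grad(loss)(0.3, jnp.ones(5))          # chain rule through g(phi) by AD
```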
## Failure Modes
Inverse problems fail in characteristic ways.
| Failure mode | Cause |
|---|---|
| Non-unique solution | Insufficient observations |
| Unstable solution | Ill-conditioned forward map |
| Overfitting noise | Weak regularization |
| Biased estimate | Wrong model class |
| Bad gradient | Discontinuous solver logic |
| Slow convergence | Poor scaling or conditioning |
| False confidence | Ignored uncertainty |
AD solves the derivative computation problem. It does not solve the modeling problem.
## Practical Design Pattern
A practical differentiable inverse problem usually has this structure:
```
parameters theta
  -> constrained parameter transform
  -> physical or statistical forward model
  -> numerical solver
  -> observation operator
  -> residual
  -> regularized loss
  -> gradient by AD
  -> optimizer or sampler
```

Each stage should have clear derivative semantics. The most important design decision is where to use general AD and where to provide custom rules.
Good candidates for custom rules include:
| Component | Reason |
|---|---|
| Linear solves | Use transpose solves |
| Nonlinear fixed points | Avoid differentiating many iterations |
| ODE/PDE solvers | Control memory and stability |
| Interpolation | Define consistent boundary behavior |
| Discontinuous events | Expose piecewise derivative semantics |
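Putting the pattern together, a minimal end-to-end sketch (all models and names here are illustrative):

```python
import jax
import jax.numpy as jnp

# theta -> positive coefficients -> linear solve -> observe -> residual
# -> regularized loss -> gradient.
def transform(phi):
    return jax.nn.softplus(phi)                      # constraints by construction

def solve_state(theta):
    A = jnp.diag(theta) + 0.1 * jnp.ones((4, 4))     # toy discretized operator
    return jnp.linalg.solve(A, jnp.ones(4))          # solver with built-in adjoint rule

def observe(u):
    return u[:2]                                     # observation operator H

def loss(phi, d, lam=1e-2):
    theta = transform(phi)
    u = solve_state(theta)
    r = observe(u) - d                               # residual
    return 0.5 * jnp.dot(r, r) + lam * jnp.dot(theta, theta)

g = jax.grad(loss)(jnp.zeros(4), jnp.array([0.5, 0.4]))  # gradient by AD
```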
## Summary
Inverse problems recover hidden causes from observed effects. They are usually solved by minimizing a mismatch between simulated and observed data, often with regularization or Bayesian priors.
Automatic differentiation is central because it provides gradients, adjoints, Jacobian products, and Hessian approximations for complex forward models. However, inverse problems remain limited by identifiability, conditioning, noise, model error, and constraints.
The best AD implementations for inverse problems are solver-aware. They combine reverse mode, implicit differentiation, sparse linear algebra, and checkpointing rather than differentiating every low-level operation blindly.