Chapter 14. Scientific Computing Applications

Differential Equations

Differential equations are one of the main reasons automatic differentiation matters in scientific computing. Many scientific models are not written as closed-form functions. They are written as equations describing how a system changes over time, space, or both.

A simple ordinary differential equation has the form

$$ \frac{dy(t)}{dt} = f(y(t), t, \theta), $$

where $y(t)$ is the state, $t$ is time, and $\theta$ is a vector of parameters. Given an initial condition

$$ y(t_0) = y_0, $$

a numerical solver computes an approximate trajectory

$$ y(t_0), y(t_1), \ldots, y(t_N). $$

In many applications, we want more than the trajectory itself. We also want derivatives of some quantity derived from it. For example, we may define a loss

$$ L(\theta) = \ell(y(t_N), \theta), $$

and ask for

$$ \frac{dL}{d\theta}. $$

This is the central problem: differentiate through the solution of a differential equation.

The Solver as a Program

A numerical ODE solver is a program. Automatic differentiation applies to programs, so in principle we can differentiate the solver directly.

For explicit Euler, one step is

$$ y_{k+1} = y_k + h f(y_k, t_k, \theta). $$

The full solver is the repeated composition

$$ y_N = \Phi_{N-1} \circ \cdots \circ \Phi_1 \circ \Phi_0(y_0, \theta), $$

where each $\Phi_k$ is one numerical step.

Forward mode propagates sensitivities alongside the state. If

$$ S_k = \frac{\partial y_k}{\partial \theta}, $$

then Euler gives

$$ S_{k+1} = S_k + h \left( \frac{\partial f}{\partial y}(y_k,t_k,\theta)\,S_k + \frac{\partial f}{\partial \theta}(y_k,t_k,\theta) \right). $$

This is called the sensitivity equation in discrete form. It is often the most direct way to compute parameter derivatives when the number of parameters is small.
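
As a concrete illustration, here is a minimal JAX sketch of this discrete recursion. The right-hand side `f`, the parameter values, and the step count are made up for the example; `jax.jacfwd` is used only to form the small per-step Jacobians $\partial f/\partial y$ and $\partial f/\partial \theta$.

```python
import jax
import jax.numpy as jnp

def f(y, t, theta):
    # Illustrative right-hand side: damped oscillator with theta = (k, c).
    k, c = theta
    return jnp.array([y[1], -k * y[0] - c * y[1]])

def euler_with_sensitivity(y0, theta, h, n_steps):
    """Propagate y_k and S_k = dy_k/dtheta with the discrete sensitivity recursion."""
    y = y0
    S = jnp.zeros((y0.shape[0], theta.shape[0]))   # S_0 = 0: y0 does not depend on theta
    for k in range(n_steps):
        t = k * h
        df_dy = jax.jacfwd(f, argnums=0)(y, t, theta)      # ∂f/∂y at (y_k, t_k)
        df_dtheta = jax.jacfwd(f, argnums=2)(y, t, theta)  # ∂f/∂θ at (y_k, t_k)
        S = S + h * (df_dy @ S + df_dtheta)
        y = y + h * f(y, t, theta)
    return y, S

yN, SN = euler_with_sensitivity(jnp.array([1.0, 0.0]), jnp.array([4.0, 0.3]), 0.01, 100)
print(SN)   # one column of dy_N/dtheta per parameter
```

Applying forward-mode AD to the whole solver gives the same quantity; writing the recursion out explicitly just makes the per-step cost visible.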

Continuous Sensitivity Equations

Instead of differentiating the numerical solver, we can differentiate the differential equation itself.

Starting from

$$ \frac{dy}{dt} = f(y,t,\theta), $$

differentiate both sides with respect to $\theta$. Let

$$ S(t) = \frac{\partial y(t)}{\partial \theta}. $$

Then

$$ \frac{dS}{dt} = \frac{\partial f}{\partial y}S + \frac{\partial f}{\partial \theta}. $$

The original state and its sensitivity can be solved together:

$$ \frac{d}{dt} \begin{bmatrix} y \\ S \end{bmatrix} = \begin{bmatrix} f(y,t,\theta) \\ f_y(y,t,\theta)S + f_\theta(y,t,\theta) \end{bmatrix}. $$

This method is useful when the state dimension is moderate and the parameter dimension is not too large. Its cost grows with the number of parameters, because $S$ has one column per parameter.
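
The same idea in continuous form: a minimal sketch that integrates the augmented system $[y, S]$ with `jax.experimental.ode.odeint` (assuming that solver is available). The logistic model and parameter value are purely illustrative.

```python
import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint

def f(y, t, theta):
    # Illustrative scalar model: logistic growth with rate theta.
    return theta * y * (1.0 - y)

def augmented_rhs(state, t, theta):
    """Right-hand side of d/dt [y, S] from the continuous sensitivity equation."""
    y, S = state
    df_dy = jax.grad(f, argnums=0)(y, t, theta)
    df_dtheta = jax.grad(f, argnums=2)(y, t, theta)
    return (f(y, t, theta), df_dy * S + df_dtheta)

y0 = jnp.array(0.1)
S0 = jnp.array(0.0)                      # y0 does not depend on theta
ts = jnp.linspace(0.0, 10.0, 50)
theta = jnp.array(0.8)

ys, Ss = odeint(augmented_rhs, (y0, S0), ts, theta)
print(Ss[-1])                            # dy(T)/dtheta
```

For a vector of parameters, `S` carries one column per parameter, which is exactly where the cost growth mentioned above comes from.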

Reverse Mode and Adjoint Equations

When the number of parameters is large and the loss is scalar, reverse mode is usually preferable. This is common in optimization and machine learning.

Suppose the loss depends on the final state:

$$ L = \ell(y(T)). $$

The continuous adjoint variable is

$$ a(t) = \frac{\partial L}{\partial y(t)}. $$

For the ODE

$$ \frac{dy}{dt} = f(y,t,\theta), $$

the adjoint equation runs backward in time:

$$ \frac{da}{dt} = -a^\top \frac{\partial f}{\partial y}. $$

The terminal condition is

$$ a(T) = \frac{\partial \ell}{\partial y(T)}. $$

The parameter gradient is accumulated as

$$ \frac{dL}{d\theta} = \int_{t_0}^{T} a(t)^\top \frac{\partial f}{\partial \theta}(y(t),t,\theta)\,dt. $$

This is the continuous analogue of reverse-mode AD. Instead of storing every intermediate operation in a tape, the adjoint method solves another differential equation backward.
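
A minimal sketch of this backward pass, assuming a fixed-step explicit Euler discretization and a made-up linear model; `jax.vjp` supplies the products $a^\top \partial f/\partial y$ and $a^\top \partial f/\partial \theta$ without forming any Jacobian. The backward integration is itself a simple first-order scheme, so the result approximates the continuous adjoint rather than the exact discrete gradient.

```python
import jax
import jax.numpy as jnp

def f(y, t, theta):
    # Illustrative linear model dy/dt = A(theta) y.
    A = jnp.array([[0.0, 1.0], [-theta[0], -theta[1]]])
    return A @ y

def loss_terminal(yT):
    return 0.5 * jnp.sum(yT ** 2)

def continuous_adjoint_gradient(y0, theta, h, n_steps):
    # Forward pass: explicit Euler, storing the trajectory (no checkpointing here).
    ys = [y0]
    for k in range(n_steps):
        ys.append(ys[-1] + h * f(ys[-1], k * h, theta))

    a = jax.grad(loss_terminal)(ys[-1])           # terminal condition a(T) = dl/dy(T)
    g = jnp.zeros_like(theta)

    # Backward pass: da/dt = -a^T df/dy, dL/dtheta = integral of a^T df/dtheta dt.
    for k in range(n_steps - 1, -1, -1):
        _, vjp_f = jax.vjp(lambda y, th: f(y, k * h, th), ys[k], theta)
        a_df_dy, a_df_dtheta = vjp_f(a)
        g = g + h * a_df_dtheta                   # accumulate the parameter gradient
        a = a + h * a_df_dy                       # step the adjoint backward in time
    return loss_terminal(ys[-1]), g

L, dL_dtheta = continuous_adjoint_gradient(jnp.array([1.0, 0.0]), jnp.array([4.0, 0.3]), 0.01, 200)
print(dL_dtheta)
```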

Discrete Adjoint vs Continuous Adjoint

There are two different objects that are often confused.

A discrete adjoint differentiates the numerical solver exactly. If the solver takes steps

$$ y_0 \to y_1 \to \cdots \to y_N, $$

then reverse-mode AD propagates adjoints backward through those exact steps.

A continuous adjoint differentiates the mathematical ODE first, then discretizes the adjoint equation.

These methods can give different gradients, because differentiation and discretization do not always commute.

| Method | What is differentiated | Main advantage | Main risk |
| --- | --- | --- | --- |
| Discrete adjoint | The numerical solver | Gradient matches computed solution | May require storing solver states |
| Continuous adjoint | The continuous ODE | Lower memory in some systems | Gradient may differ from solver gradient |
| Forward sensitivity | State equation plus sensitivity equation | Simple and stable for few parameters | Cost grows with parameter count |

For machine learning, the discrete adjoint is often the safer interpretation: it gives the derivative of the actual computation performed. For scientific modeling, the continuous adjoint may be more natural when the differential equation is the primary object.
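
In an AD system the discrete adjoint needs no extra machinery: reverse mode through the solver loop is the discrete adjoint. A minimal JAX sketch, with an illustrative model and fixed-step Euler:

```python
import jax
import jax.numpy as jnp

def f(y, t, theta):
    A = jnp.array([[0.0, 1.0], [-theta[0], -theta[1]]])   # illustrative linear model
    return A @ y

def solve_euler(y0, theta, h, n_steps):
    def step(y, k):
        return y + h * f(y, k * h, theta), None
    yN, _ = jax.lax.scan(step, y0, jnp.arange(n_steps))
    return yN

def loss(theta):
    yN = solve_euler(jnp.array([1.0, 0.0]), theta, 0.01, 200)
    return 0.5 * jnp.sum(yN ** 2)

# Reverse mode through the numerical steps themselves: the discrete adjoint.
print(jax.grad(loss)(jnp.array([4.0, 0.3])))
```

This gradient matches the computed trajectory exactly, at the price of storing (or recomputing) the states visited during the forward scan.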

Differentiating Through Adaptive Solvers

Many ODE solvers use adaptive step sizes. The solver chooses $h_k$ based on an error estimate. This creates extra complications.

A simplified adaptive solver does something like:

$$ h_{k+1} = g(h_k, \text{error}_k). $$

The step size now depends on the state and parameters. A fully discrete derivative must account for this dependence.

However, many practical AD systems treat solver control decisions as fixed during differentiation. That means the gradient is computed through the accepted numerical steps, but not through the logic that chose those steps. This is usually acceptable when step-size decisions are stable under small perturbations. It can fail near discontinuous accept/reject boundaries.
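
One common way to express "the controller is fixed for differentiation" in an AD system is to put the step-size update behind a stop-gradient. The toy step-doubling controller below is only a sketch of the idea, not a real adaptive solver.

```python
import jax
import jax.numpy as jnp

def adaptive_euler_step(f, y, t, h, theta, tol=1e-4):
    """One accepted step with a toy step-doubling error estimate (illustrative only)."""
    full = y + h * f(y, t, theta)
    half = y + 0.5 * h * f(y, t, theta)
    y_next = half + 0.5 * h * f(half, t + 0.5 * h, theta)
    err = jnp.linalg.norm(y_next - full)

    # The step-size decision is treated as a constant during differentiation:
    # gradients flow through the accepted update, not through the controller.
    h_next = jax.lax.stop_gradient(h * jnp.clip(jnp.sqrt(tol / (err + 1e-12)), 0.2, 5.0))
    return y_next, t + h, h_next
```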

Adaptive solvers therefore introduce three derivative layers:

| Layer | Meaning |
| --- | --- |
| Model derivative | Derivative of $f(y,t,\theta)$ |
| Solver derivative | Derivative through numerical updates |
| Control derivative | Derivative through step-size and branch decisions |

The first two are standard. The third is delicate because solver control flow is piecewise constant or discontinuous.

Stiff Differential Equations

A stiff ODE contains dynamics on very different time scales. Explicit methods may require extremely small steps for stability. Implicit solvers are often used instead.

An implicit Euler step is defined by

$$ y_{k+1} = y_k + h f(y_{k+1}, t_{k+1}, \theta). $$

Here $y_{k+1}$ is defined implicitly. The solver usually finds it with Newton iteration.

Differentiating this step can be done in two ways.

The first way is to differentiate through every Newton iteration. This is simple if the solver is implemented in an AD system, but it can be expensive and may expose numerical details that are not mathematically relevant.

The second way is implicit differentiation. Define

$$ F(y_{k+1}, y_k, \theta) = y_{k+1} - y_k - h f(y_{k+1}, t_{k+1}, \theta). $$

Since

$$ F(y_{k+1}, y_k, \theta) = 0, $$

we differentiate implicitly:

$$ \frac{\partial F}{\partial y_{k+1}} \frac{\partial y_{k+1}}{\partial \theta} + \frac{\partial F}{\partial \theta} = 0. $$

Thus

$$ \frac{\partial y_{k+1}}{\partial \theta} = -\left( \frac{\partial F}{\partial y_{k+1}} \right)^{-1} \frac{\partial F}{\partial \theta}. $$

This avoids differentiating through the internal iterations of the nonlinear solver. It treats the converged implicit step as the mathematical operation.
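
A minimal sketch of both views for a single implicit Euler step, with a made-up scalar model. The Newton loop finds the converged state; the implicit-function formula above then gives $\partial y_{k+1}/\partial \theta$ from one extra linear solve.

```python
import jax
import jax.numpy as jnp

def f(y, t, theta):
    return -theta * y                 # illustrative stiff-style model dy/dt = -theta y

def implicit_euler_step(y_k, t_next, h, theta, n_newton=20):
    """Solve F(y_{k+1}) = y_{k+1} - y_k - h f(y_{k+1}, t_{k+1}, theta) = 0 by Newton."""
    F = lambda y: y - y_k - h * f(y, t_next, theta)
    y = y_k
    for _ in range(n_newton):
        J = jax.jacfwd(F)(y)
        y = y - jnp.linalg.solve(J, F(y))
    return y

def step_sensitivity(y_k, t_next, h, theta):
    """dy_{k+1}/dtheta by implicit differentiation of the converged step."""
    y_next = implicit_euler_step(y_k, t_next, h, theta)
    F = lambda y, th: y - y_k - h * f(y, t_next, th)
    dF_dy = jax.jacfwd(F, argnums=0)(y_next, theta)
    dF_dtheta = jax.jacfwd(F, argnums=1)(y_next, theta)
    return -jnp.linalg.solve(dF_dy, dF_dtheta)

print(step_sensitivity(jnp.array([1.0]), 0.1, 0.1, jnp.array([50.0])))
```

Differentiating through the Newton loop itself would also work here, but the implicit formula gives the derivative at the converged point without replaying the iterations.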

PDEs and Discretized Systems

Partial differential equations introduce derivatives over both time and space. A PDE such as

$$ \frac{\partial u}{\partial t} = D \frac{\partial^2 u}{\partial x^2} $$

is usually discretized into a large system of ODEs. For example, after spatial discretization, we may get

$$ \frac{d\mathbf{u}}{dt} = A(D)\,\mathbf{u}. $$

Automatic differentiation then applies to the discretized program: mesh construction, finite difference stencils, finite element assembly, linear solves, time stepping, and loss evaluation.
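
A minimal sketch for the heat equation above: a periodic finite-difference discretization in space, explicit Euler in time, and a gradient of a made-up loss with respect to the diffusion coefficient $D$. The grid, step sizes, and target are all illustrative.

```python
import jax
import jax.numpy as jnp

n, dx, dt, n_steps = 64, 1.0 / 64, 1e-4, 500

def rhs(u, D):
    # Second-order central stencil for u_xx on a periodic grid.
    u_xx = (jnp.roll(u, -1) - 2.0 * u + jnp.roll(u, 1)) / dx**2
    return D * u_xx

def solve(D, u0):
    def step(u, _):
        return u + dt * rhs(u, D), None
    uT, _ = jax.lax.scan(step, u0, None, length=n_steps)
    return uT

x = jnp.linspace(0.0, 1.0, n, endpoint=False)
u0 = jnp.sin(2.0 * jnp.pi * x)
target = 0.5 * u0                                   # made-up observation

def loss(D):
    return 0.5 * jnp.sum((solve(D, u0) - target) ** 2)

print(jax.grad(loss)(0.05))                         # dL/dD through the whole time loop
```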

The challenge is scale. PDE solvers may contain millions or billions of state variables. Reverse-mode AD over the entire solver can be memory-heavy. Efficient systems use structure:

StructureAD consequence
Sparse matricesSparse Jacobian and adjoint operations
Local stencilsLocal derivative propagation
Linear solvesCustom VJP rules using transpose solves
Mesh hierarchyMultilevel adjoint methods
Time steppingCheckpointing and recomputation

The key is to avoid materializing dense Jacobians. Most scientific AD systems compute products such as $Jv$ or $v^\top J$ without constructing $J$ explicitly.
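
In JAX, for example, these two products correspond to `jax.jvp` and `jax.vjp`; the function and vectors below are illustrative.

```python
import jax
import jax.numpy as jnp

def g(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 2])

x = jnp.array([1.5, -0.5])
v_in = jnp.array([1.0, 0.0])           # tangent vector, same shape as x
v_out = jnp.array([0.0, 1.0, 0.0])     # cotangent vector, same shape as g(x)

_, Jv = jax.jvp(g, (x,), (v_in,))      # Jacobian-vector product, no J materialized
_, vjp_g = jax.vjp(g, x)
(vTJ,) = vjp_g(v_out)                  # vector-Jacobian product, no J materialized
```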

Differentiating Linear Solves

Linear solves appear constantly in differential equation solvers. Suppose

$$ x = A^{-1}b, $$

or equivalently

$$ Ax = b. $$

For a perturbation,

$$ A\,dx + dA\,x = db. $$

Therefore

$$ dx = A^{-1}(db - dA\,x). $$

In reverse mode, if the adjoint of $x$ is $\bar{x}$, then the adjoints are computed using

$$ A^\top \lambda = \bar{x}, $$

followed by

$$ \bar{b} = \lambda, $$

and

$$ \bar{A} = -\lambda x^\top. $$

This rule is central for differentiating implicit time steps, finite element methods, optimization layers, and constrained physical systems.
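
This rule translates directly into a custom reverse-mode rule. A minimal dense sketch using `jax.custom_vjp` (a sparse or factored solver would follow the same pattern):

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def linear_solve(A, b):
    return jnp.linalg.solve(A, b)

def linear_solve_fwd(A, b):
    x = jnp.linalg.solve(A, b)
    return x, (A, x)                       # save A and x for the backward pass

def linear_solve_bwd(residuals, x_bar):
    A, x = residuals
    lam = jnp.linalg.solve(A.T, x_bar)     # transpose solve: A^T lambda = x_bar
    return -jnp.outer(lam, x), lam         # (A_bar, b_bar) = (-lambda x^T, lambda)

linear_solve.defvjp(linear_solve_fwd, linear_solve_bwd)

A = jnp.array([[4.0, 1.0], [1.0, 3.0]])
b = jnp.array([1.0, 2.0])
loss = lambda A, b: jnp.sum(linear_solve(A, b) ** 2)
print(jax.grad(loss, argnums=(0, 1))(A, b))
```

The backward pass costs one extra solve with $A^\top$; in practice a factorization from the forward solve can be reused.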

Memory and Checkpointing

Reverse-mode differentiation through a long time integration requires access to previous states. A naive implementation stores every state:

$$ y_0, y_1, \ldots, y_N. $$

This has memory cost $O(N \cdot \dim y)$.

Checkpointing reduces memory by storing only selected states and recomputing missing states during the backward pass. This trades extra computation for lower memory.

| Strategy | Memory | Extra compute |
| --- | --- | --- |
| Store all states | High | Low |
| Store no states | Low | High |
| Checkpointing | Medium | Medium |
| Revolve-style schedules | Near optimal | Controlled |

For long simulations, checkpointing is often necessary. Without it, reverse-mode AD may be unusable even when the mathematical gradient is well-defined.
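
In JAX, one way to get this trade is `jax.checkpoint` (also known as `jax.remat`): wrap a segment of time steps, and the backward pass recomputes that segment's intermediates instead of storing them. The segment length and model below are illustrative.

```python
import jax
import jax.numpy as jnp

def f(y, theta):
    return jnp.tanh(theta * y)            # illustrative right-hand side

@jax.checkpoint
def run_segment(y, theta):
    # 100 Euler steps; intermediates inside this segment are recomputed, not stored.
    for _ in range(100):
        y = y + 0.01 * f(y, theta)
    return y

def solve(y0, theta, n_segments=10):
    y = y0
    for _ in range(n_segments):
        y = run_segment(y, theta)
    return y

loss = lambda theta: jnp.sum(solve(jnp.ones(3), theta) ** 2)
print(jax.grad(loss)(0.5))
```

Only the segment boundary states are kept, so memory scales with the number of segments rather than the number of steps, at the cost of roughly one extra forward sweep.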

Practical Design Rule

A robust differentiable differential equation system usually separates four layers:

| Layer | Responsibility |
| --- | --- |
| Model | Defines $f(y,t,\theta)$ |
| Solver | Advances the state |
| Linear algebra | Solves sparse or dense systems |
| AD rules | Define JVPs and VJPs for each primitive |

This separation matters. Differentiating through every low-level operation is general, but often inefficient. Scientific solvers usually need custom derivative rules for linear solves, nonlinear solves, interpolation, event handling, and time stepping.

Example: Parameter Estimation

Suppose a physical system is modeled by

$$ \frac{dy}{dt} = f(y,t,\theta). $$

We observe measurements

$$ z_i \approx y(t_i) $$

and define

$$ L(\theta) = \frac{1}{2} \sum_i \|y(t_i;\theta) - z_i\|^2. $$

To fit $\theta$, we need $\nabla_\theta L$. AD provides this gradient by differentiating the solve procedure. An optimizer then updates $\theta$:

$$ \theta_{k+1} = \theta_k - \eta \nabla_\theta L(\theta_k). $$

This gives a standard inverse-problem loop:

  1. solve the differential equation,
  2. compare the solution with observations,
  3. differentiate the loss,
  4. update the parameters.

The same pattern appears in system identification, climate modeling, robotics, molecular simulation, pharmacokinetics, and neural differential equations.
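
A minimal end-to-end sketch of this loop with a made-up oscillator model, synthetic observations, fixed-step Euler, and plain gradient descent; the step size and iteration count are hand-picked for the toy problem.

```python
import jax
import jax.numpy as jnp

def f(y, t, theta):
    k, c = theta                                   # illustrative: unknown stiffness and damping
    return jnp.array([y[1], -k * y[0] - c * y[1]])

def solve(theta, y0, h, n_steps):
    def step(y, i):
        y_next = y + h * f(y, i * h, theta)
        return y_next, y_next
    _, trajectory = jax.lax.scan(step, y0, jnp.arange(n_steps))
    return trajectory

y0, h, n_steps = jnp.array([1.0, 0.0]), 0.01, 300
true_theta = jnp.array([4.0, 0.3])
observations = solve(true_theta, y0, h, n_steps)   # synthetic data for the example

def loss(theta):
    return 0.5 * jnp.sum((solve(theta, y0, h, n_steps) - observations) ** 2)

theta = jnp.array([1.0, 1.0])
for _ in range(200):
    theta = theta - 1e-3 * jax.grad(loss)(theta)   # steps 1-4 of the loop, repeated
print(theta)                                       # estimates of (k, c)
```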

Summary

Differential equations turn AD from a local derivative tool into a method for differentiating entire simulations. The main issue is not whether derivatives exist. The main issue is how to compute them with acceptable accuracy, memory use, and runtime.

Forward sensitivity methods are simple and reliable when the parameter dimension is small. Reverse and adjoint methods are better when a scalar loss depends on many parameters. Discrete adjoints differentiate the actual numerical computation. Continuous adjoints differentiate the mathematical model and then solve the adjoint equation.

For production scientific computing, efficient AD requires more than applying reverse mode blindly. It needs solver-aware derivative rules, sparse linear algebra, checkpointing, and careful treatment of adaptive control flow.