Many systems evolve continuously over time rather than through discrete layers. A state variable changes according to a differential equation:

$$\frac{dz(t)}{dt} = f(z(t), t, \theta), \qquad z(t_0) = z_0.$$
The output is not a single value but an entire trajectory $z(t)$ for $t \in [t_0, t_1]$.
If a loss depends on the final state or on the full trajectory, we need derivatives with respect to parameters θ, initial conditions, or forcing terms.
A direct reverse-mode differentiation of every numerical integration step can be expensive in memory. Continuous-time adjoint methods provide an alternative formulation based on differential equations for the gradient itself.
These methods are central in optimal control, scientific computing, inverse problems, differentiable simulation, and neural differential equations.
Differential Equation as a Computational Graph
Consider the ordinary differential equation (ODE)

$$\frac{dz(t)}{dt} = f(z(t), t, \theta).$$
A numerical solver approximates the trajectory at discrete times:
z0 -> z1 -> z2 -> ... -> zN

Each step depends on the previous state.
For example, explicit Euler gives

$$z_{k+1} = z_k + h\, f(z_k, t_k, \theta),$$
where h is the step size.
A normal AD system can differentiate this discrete computation directly. Reverse mode propagates gradients backward through every integration step.
This is correct for the discrete numerical method, but it has two major costs:
| Issue | Consequence |
|---|---|
| storing all states | high memory usage |
| long reverse chain | numerical instability |
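To make the table concrete, here is a minimal sketch of reverse-mode differentiation through an explicit Euler loop. The scalar dynamics $dz/dt = -\theta z$ and the loss $L = z_N$ are illustrative choices, not taken from the text; the point is that the forward pass stores every intermediate state.

```python
def f(z, t, theta):
    # Toy scalar dynamics used only for illustration: dz/dt = -theta * z
    return -theta * z

def euler_forward(z0, theta, t0, t1, n_steps):
    """Explicit Euler, storing every intermediate state (the memory cost)."""
    h = (t1 - t0) / n_steps
    zs = [z0]
    for k in range(n_steps):
        zs.append(zs[-1] + h * f(zs[-1], t0 + k * h, theta))
    return zs, h

def reverse_mode_grad(zs, h, theta):
    """Reverse sweep over the stored states for the loss L = z_N.

    For this f, each Euler step has dz_{k+1}/dz_k = 1 - h*theta
    and dz_{k+1}/dtheta = -h * z_k.
    """
    zbar, theta_bar = 1.0, 0.0             # dL/dz_N = 1
    for k in reversed(range(len(zs) - 1)):
        theta_bar += zbar * (-h * zs[k])   # accumulate parameter gradient
        zbar *= (1.0 - h * theta)          # propagate through the step
    return theta_bar

zs, h = euler_forward(z0=1.0, theta=0.5, t0=0.0, t1=2.0, n_steps=1000)
print(reverse_mode_grad(zs, h, theta=0.5))
# Analytic reference: d/dtheta exp(-2*theta) = -2*exp(-1) ≈ -0.7358
```

The list `zs` grows with the number of steps, which is exactly the memory cost noted above.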
Continuous adjoint methods derive a differential equation for the backward pass itself.
Loss Depending on Final State
Assume the loss depends only on the terminal state:

$$L = L\big(z(t_1)\big).$$
We want the gradient

$$\frac{dL}{d\theta}.$$
The key idea is to introduce an adjoint state:

$$a(t) = \frac{\partial L}{\partial z(t)}.$$
This measures sensitivity of the final loss with respect to perturbations of the state at time t.
At the terminal time,

$$a(t_1) = \frac{\partial L}{\partial z(t_1)}.$$
The question is how this adjoint evolves backward in time.
Deriving the Adjoint Equation
A perturbation $\delta z(t)$ evolves according to the linearized dynamics:

$$\frac{d\,\delta z(t)}{dt} = \frac{\partial f}{\partial z}\,\delta z(t),$$
where $\partial f/\partial z$ is the Jacobian of $f$ with respect to the state, evaluated along the forward trajectory $z(t)$.
We want an equation for the adjoint a(t) such that the loss variation can be expressed in terms of parameter perturbations.
The continuous adjoint equation is

$$\frac{da(t)}{dt} = -\,a(t)\,\frac{\partial f}{\partial z},$$

with $a(t)$ treated as a row vector. Equivalently, in column-vector convention,

$$\frac{da(t)}{dt} = -\left(\frac{\partial f}{\partial z}\right)^{\top} a(t).$$
The terminal condition is

$$a(t_1) = \frac{\partial L}{\partial z(t_1)}.$$
Thus the adjoint is solved backward in time.
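The step connecting the perturbation dynamics to the adjoint equation can be made explicit with a standard invariance argument, sketched here in the section's notation: choose $a(t)$ so that the pairing $a(t)^{\top}\delta z(t)$ is constant in time.

$$\frac{d}{dt}\Big[a(t)^{\top}\delta z(t)\Big]
  = \dot a(t)^{\top}\delta z(t) + a(t)^{\top}\frac{\partial f}{\partial z}\,\delta z(t) = 0
  \quad \text{for all } \delta z(t)
  \;\Longrightarrow\;
  \dot a(t) = -\left(\frac{\partial f}{\partial z}\right)^{\top} a(t).$$

With this choice, $a(t_0)^{\top}\delta z(t_0) = a(t_1)^{\top}\delta z(t_1) = \frac{\partial L}{\partial z(t_1)}\,\delta z(t_1)$, so $a(t)$ at any earlier time carries the sensitivity of the final loss to the state at that time.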
Gradient Formula
Once the adjoint trajectory is known, the parameter gradient is

$$\frac{dL}{d\theta} = \int_{t_0}^{t_1} a(t)\,\frac{\partial f}{\partial \theta}\,dt.$$

In column-vector notation,

$$\frac{dL}{d\theta} = \int_{t_0}^{t_1} \left(\frac{\partial f}{\partial \theta}\right)^{\top} a(t)\,dt.$$
This is the central result of the continuous adjoint method.
The forward pass solves the primal ODE forward in time.
The backward pass solves the adjoint ODE backward in time.
The parameter gradient accumulates along the backward trajectory.
Structure of the Method
The complete computation has three stages.
Forward solve
Solve

$$\frac{dz(t)}{dt} = f(z(t), t, \theta), \qquad z(t_0) = z_0,$$
from t0 to t1.
Adjoint initialization
Set

$$a(t_1) = \frac{\partial L}{\partial z(t_1)}.$$
Backward solve
Solve backward:

$$\frac{da(t)}{dt} = -\left(\frac{\partial f}{\partial z}\right)^{\top} a(t).$$

Simultaneously accumulate:

$$\frac{dL}{d\theta} = \int_{t_0}^{t_1} \left(\frac{\partial f}{\partial \theta}\right)^{\top} a(t)\,dt.$$
This turns reverse-mode differentiation of a continuous dynamical system into another differential equation.
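A minimal sketch of these three stages in Python, using SciPy's `solve_ivp`. The scalar dynamics $dz/dt = \theta \sin z$ and the terminal-state loss $L = z(t_1)$ are illustrative assumptions, not from the text; the gradient is accumulated by integrating an augmented state $(z, a, g)$ backward in time.

```python
from scipy.integrate import solve_ivp
import numpy as np

# Toy scalar dynamics (an assumption for illustration).
def f(z, t, theta):
    return theta * np.sin(z)

def df_dz(z, t, theta):
    return theta * np.cos(z)

def df_dtheta(z, t, theta):
    return np.sin(z)

theta, z0, t0, t1 = 0.8, 1.0, 0.0, 2.0

# Stage 1: forward solve of the primal ODE from t0 to t1.
fwd = solve_ivp(lambda t, y: [f(y[0], t, theta)], (t0, t1), [z0],
                rtol=1e-10, atol=1e-12)
zT = fwd.y[0, -1]

# Stages 2-3: backward solve of the augmented system (z, a, g),
# with a(t1) = dL/dz(t1) = 1 and g accumulating dL/dtheta.
def augmented(t, state):
    z, a, g = state
    return [f(z, t, theta),
            -a * df_dz(z, t, theta),       # adjoint equation
            -a * df_dtheta(z, t, theta)]   # gradient accumulation

bwd = solve_ivp(augmented, (t1, t0), [zT, 1.0, 0.0], rtol=1e-10, atol=1e-12)
dL_dtheta = bwd.y[2, -1]

# Finite-difference check of dL/dtheta.
eps = 1e-6
zp = solve_ivp(lambda t, y: [f(y[0], t, theta + eps)], (t0, t1), [z0],
               rtol=1e-10, atol=1e-12).y[0, -1]
zm = solve_ivp(lambda t, y: [f(y[0], t, theta - eps)], (t0, t1), [z0],
               rtol=1e-10, atol=1e-12).y[0, -1]
print(dL_dtheta, (zp - zm) / (2 * eps))
```

Note that the backward solve re-integrates the state $z$ alongside the adjoint rather than storing the forward trajectory, which is the reconstruction strategy discussed later.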
Relationship to Reverse Mode AD
Continuous adjoints are the continuous analogue of reverse accumulation.
In discrete reverse mode:
$$x_0 \to x_1 \to x_2 \to \cdots \to x_N$$

the backward pass propagates adjoints

$$\bar{x}_N \to \bar{x}_{N-1} \to \cdots \to \bar{x}_0$$

using transpose Jacobians.
In continuous systems, the transpose Jacobian becomes the adjoint differential equation:

$$\frac{da(t)}{dt} = -\left(\frac{\partial f}{\partial z}\right)^{\top} a(t).$$
The continuous adjoint method is therefore a limiting form of reverse-mode AD for infinitely small steps.
Example: Linear ODE
Consider the scalar linear ODE

$$\frac{dz(t)}{dt} = \lambda z(t), \qquad z(t_0) = z_0.$$

The solution is

$$z(t) = z_0\, e^{\lambda (t - t_0)}.$$

Suppose the loss is the terminal state itself:

$$L = z(t_1).$$

Then

$$a(t_1) = \frac{\partial L}{\partial z(t_1)} = 1.$$

The Jacobian is

$$\frac{\partial f}{\partial z} = \lambda.$$

Thus the adjoint equation is

$$\frac{da(t)}{dt} = -\lambda\, a(t).$$

The backward solution is

$$a(t) = e^{\lambda (t_1 - t)}.$$

The parameter derivative uses

$$\frac{\partial f}{\partial \lambda} = z(t).$$

Hence

$$\frac{dL}{d\lambda} = \int_{t_0}^{t_1} a(t)\, z(t)\, dt
 = \int_{t_0}^{t_1} e^{\lambda (t_1 - t)}\, z_0\, e^{\lambda (t - t_0)}\, dt
 = z_0\, (t_1 - t_0)\, e^{\lambda (t_1 - t_0)},$$

which matches differentiating $z(t_1) = z_0 e^{\lambda (t_1 - t_0)}$ with respect to $\lambda$ directly.
The forward and backward trajectories combine to produce the gradient.
Memory Advantage
A major motivation for continuous adjoints is memory efficiency.
Discrete reverse mode stores every integration state:
| Number of steps | Stored states |
|---|---|
| 100 | 100 |
| 1,000,000 | 1,000,000 |
Continuous adjoint methods attempt to avoid storing the full trajectory. Instead, they reconstruct states during the backward solve.
The idealized memory complexity becomes approximately constant with respect to integration length.
This made continuous adjoints attractive for neural ODE systems.
Reconstruction Problem
In practice, the backward solve usually needs access to the forward trajectory:

$$\frac{da(t)}{dt} = -\left(\frac{\partial f}{\partial z}\big(z(t), t, \theta\big)\right)^{\top} a(t),$$

where the Jacobian must be evaluated at the forward states $z(t)$.
If states are not stored, they must be recomputed.
This creates several complications.
Numerical non-reversibility
ODE solvers are not perfectly reversible in floating point arithmetic.
Integrating backward from the terminal state $z(t_1)$ does not necessarily recover the exact forward trajectory.
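A minimal numerical illustration of this, with toy nonlinear dynamics chosen here for demonstration rather than taken from the text: run explicit Euler forward, then run the same scheme backward in time from the final state, and compare against the stored initial state. The backward step is not the exact inverse of the forward step, and floating point adds further error.

```python
import numpy as np

def f(z, t):
    # Toy nonlinear dynamics for illustration only.
    return np.sin(z) - 0.5 * z

z0, t0, t1, n = 1.3, 0.0, 5.0, 500
h = (t1 - t0) / n

# Forward explicit Euler.
z = z0
for k in range(n):
    z = z + h * f(z, t0 + k * h)
z_final = z

# Naive reversal: step the same scheme backward from the final state.
z = z_final
for k in reversed(range(n)):
    z = z - h * f(z, t0 + (k + 1) * h)

print(abs(z - z0))   # reconstruction error; generally nonzero
```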
Adaptive step mismatch
Adaptive solvers choose time steps dynamically. The backward integration may choose different steps from the forward pass.
Chaotic dynamics
Small reconstruction errors can grow exponentially in unstable systems.
As a result, many practical implementations checkpoint or partially store forward states instead of relying on perfect reconstruction.
Discrete vs Continuous Gradients
There is an important conceptual distinction.
A numerical ODE solver defines a discrete computation. Reverse-mode AD of the solver gives the exact gradient of that discrete computation.
The continuous adjoint method derives gradients from the continuous differential equation itself.
These gradients are not always identical.
| Method | Differentiates |
|---|---|
| discrete reverse AD | numerical integrator |
| continuous adjoint | underlying ODE |
As step size approaches zero, the two often converge. At finite precision and finite step size, they may differ substantially.
For scientific computing, the discrete derivative is often more accurate relative to the actual implemented computation.
For theoretical analysis, the continuous derivative is often cleaner.
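To make the distinction concrete, here is a small comparison using the scalar linear example from earlier and explicit Euler with a deliberately coarse step (the numbers are illustrative assumptions): the exact derivative of the discrete integrator versus the derivative predicted by the continuous formula.

```python
import numpy as np

z0, lam, t0, t1, n = 1.0, 1.2, 0.0, 1.0, 20
h = (t1 - t0) / n

# Discrete computation: z_N = z0 * (1 + h*lam)^n, differentiated exactly w.r.t. lam.
discrete_grad = z0 * n * h * (1.0 + h * lam) ** (n - 1)

# Continuous ODE: z(t1) = z0 * exp(lam*(t1 - t0)), so
# dL/dlam = z0 * (t1 - t0) * exp(lam*(t1 - t0)).
continuous_grad = z0 * (t1 - t0) * np.exp(lam * (t1 - t0))

print(discrete_grad, continuous_grad)  # close for small h, but not identical
```

Shrinking `h` drives the two values together; at coarse steps they differ noticeably, which is exactly the discrete-versus-continuous gap described above.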
Adjoint for General Losses
Suppose the loss depends on the full trajectory, with a terminal cost and a running cost:

$$L = \Phi\big(z(t_1)\big) + \int_{t_0}^{t_1} \ell\big(z(t), t, \theta\big)\,dt.$$

Then the adjoint equation becomes

$$\frac{da(t)}{dt} = -\left(\frac{\partial f}{\partial z}\right)^{\top} a(t) - \left(\frac{\partial \ell}{\partial z}\right)^{\top}.$$

The terminal condition remains

$$a(t_1) = \frac{\partial \Phi}{\partial z(t_1)}.$$

The parameter gradient is

$$\frac{dL}{d\theta} = \int_{t_0}^{t_1} \left[\left(\frac{\partial f}{\partial \theta}\right)^{\top} a(t) + \left(\frac{\partial \ell}{\partial \theta}\right)^{\top}\right] dt.$$
Trajectory-dependent objectives are common in optimal control and physical simulation.
PDE Adjoint Methods
The same principles extend to partial differential equations.
A PDE-constrained optimization problem may minimize

$$J(u, \theta)$$

subject to

$$\mathcal{N}(u, \theta) = 0,$$
where u is a field variable and 𝒩 is a differential operator.
Direct differentiation is expensive because the state dimension may be enormous.
Adjoint PDE methods solve a second PDE backward to compute gradients efficiently.
This is one of the main computational tools in:
| Domain | Application |
|---|---|
| fluid dynamics | shape optimization |
| geophysics | seismic inversion |
| climate modeling | parameter estimation |
| optics | inverse design |
| aerodynamics | optimal control |
The cost of one adjoint solve is often nearly independent of the number of parameters.
Optimal Control Interpretation
Continuous adjoints originated in optimal control theory.
A control system evolves as

$$\frac{dz(t)}{dt} = f\big(z(t), u(t), t\big),$$
where u(t) is a control input.
The objective may be

$$J = \Phi\big(z(t_1)\big) + \int_{t_0}^{t_1} \ell\big(z(t), u(t), t\big)\,dt.$$
Pontryagin’s maximum principle introduces a co-state variable, which is mathematically the same object as the adjoint state in reverse-mode differentiation.
Thus modern differentiable ODE systems and classical control theory share the same mathematical structure.
Continuous Adjoint in Neural ODEs
Neural ODEs define a neural network through continuous dynamics:

$$\frac{dz(t)}{dt} = f_\theta\big(z(t), t\big),$$

where $f_\theta$ is a neural network.
Instead of stacking discrete layers,
z0 -> layer1 -> layer2 -> layer3

the network integrates an ODE.
The continuous adjoint method was proposed as a memory-efficient backward pass.
The backward solve integrates:

$$\frac{da(t)}{dt} = -\left(\frac{\partial f_\theta}{\partial z}\right)^{\top} a(t), \qquad
\frac{dL}{d\theta} = \int_{t_0}^{t_1} \left(\frac{\partial f_\theta}{\partial \theta}\right)^{\top} a(t)\,dt,$$
while recomputing or reconstructing the forward state.
This allowed training models with very long effective depth.
However, later work showed practical issues:
| Issue | Effect |
|---|---|
| reconstruction error | incorrect gradients |
| stiffness | unstable backward integration |
| adaptive solver mismatch | inconsistent forward/backward trajectories |
| solver overhead | high runtime cost |
As a result, many modern systems combine checkpointing, discrete differentiation, and hybrid adjoint methods.
Stiff Systems
A stiff system has rapidly varying modes with very different time scales.
For such systems, even the forward solve of

$$\frac{dz(t)}{dt} = f(z(t), t, \theta)$$

may require implicit numerical solvers.
The adjoint system often inherits or amplifies stiffness.
Backward integration can become unstable unless carefully designed.
In stiff settings, discrete differentiation of the actual solver is often safer than idealized continuous adjoints.
Checkpointing
A practical compromise is checkpointing.
Instead of storing every state or recomputing everything, store selected states:
z0 ---- z100 ---- z200 ---- z300

During backward propagation, recompute intermediate states locally.
Checkpointing trades memory for recomputation in a controlled way.
This is widely used in large-scale differentiable simulations.
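The sketch below mirrors the earlier Euler example (same toy dynamics $dz/dt = -\theta z$ and loss $L = z_N$, both illustrative assumptions), but stores only every `segment`-th state and recomputes the rest during the backward sweep.

```python
def f(z, t, theta):
    # Same toy dynamics as in the earlier Euler sketch: dz/dt = -theta * z
    return -theta * z

def checkpointed_grad(z0, theta, t0, t1, n_steps, segment):
    """Reverse mode through explicit Euler, storing only checkpoints.

    States inside a segment are recomputed from the nearest checkpoint
    during the backward sweep. Assumes n_steps is a multiple of segment.
    """
    assert n_steps % segment == 0
    h = (t1 - t0) / n_steps

    # Forward pass: keep only checkpoints.
    checkpoints = {0: z0}
    z = z0
    for k in range(n_steps):
        z = z + h * f(z, t0 + k * h, theta)
        if (k + 1) % segment == 0:
            checkpoints[k + 1] = z

    # Backward pass, one segment at a time.
    zbar, theta_bar = 1.0, 0.0                 # dL/dz_N = 1 for L = z_N
    for start in range(n_steps - segment, -1, -segment):
        # Recompute the states of this segment from its checkpoint.
        zs = [checkpoints[start]]
        for k in range(start, start + segment):
            zs.append(zs[-1] + h * f(zs[-1], t0 + k * h, theta))
        # Local reverse sweep over the recomputed states.
        for k in reversed(range(segment)):
            theta_bar += zbar * (-h * zs[k])
            zbar *= (1.0 - h * theta)
    return theta_bar

print(checkpointed_grad(1.0, 0.5, 0.0, 2.0, n_steps=1000, segment=100))
# Matches the fully stored version (≈ -0.7358) while keeping ~n_steps/segment states.
```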
Hamiltonian Structure
Adjoint systems naturally form Hamiltonian structures.
Define the Hamiltonian

$$H(z, a, t, \theta) = a^{\top} f(z, t, \theta).$$

Then:

$$\frac{dz}{dt} = \frac{\partial H}{\partial a} = f(z, t, \theta), \qquad
\frac{da}{dt} = -\frac{\partial H}{\partial z} = -\left(\frac{\partial f}{\partial z}\right)^{\top} a.$$
The forward and backward equations form a coupled Hamiltonian system.
This connection explains many conservation and stability properties of adjoint dynamics.
Failure Modes
Continuous adjoint methods can fail for several reasons.
Non-smooth dynamics
Events, discontinuities, or switching systems break differentiability.
Chaotic systems
Tiny perturbations may grow exponentially, making gradients unreliable.
Numerical inconsistency
Backward reconstruction may not match the forward solve.
Ill-conditioned sensitivity
The adjoint norm may explode or vanish over long time horizons.
Solver mismatch
The backward pass may differentiate a continuous equation while the forward pass used a heavily discretized approximation.
These issues are central in differentiable physics and scientific machine learning.
Design Principle
A continuous-time adjoint method should specify:
| Component | Question |
|---|---|
| primal dynamics | What ODE or PDE defines the system? |
| discretization | Which numerical solver is used? |
| backward rule | Continuous adjoint or discrete reverse AD? |
| state storage | Full storage, recomputation, or checkpointing? |
| sensitivity accuracy | Exact, approximate, or surrogate gradients? |
Without these contracts, gradients may appear mathematically correct while differing from the implemented computation.
Summary
Continuous-time adjoint methods extend reverse-mode differentiation from discrete computation graphs to continuous dynamical systems.
The forward pass solves a differential equation forward in time. The backward pass solves an adjoint differential equation backward in time.
This gives efficient gradients for systems with long trajectories and many parameters. The method underlies differentiable simulation, optimal control, inverse problems, neural ODEs, and PDE-constrained optimization.
In practice, the distinction between continuous equations and discrete numerical solvers is fundamental. A correct implementation must decide whether gradients correspond to the mathematical system, the numerical integrator, or some hybrid approximation between them.