Neural ODEs

Classical neural networks apply a finite sequence of transformations:

h_{k+1} = f_k(h_k, \theta_k).

Depth corresponds to the number of layers.

A neural ordinary differential equation (Neural ODE) replaces this discrete stack with continuous dynamics:

\frac{dz(t)}{dt} = f_\theta(z(t), t).

Instead of evaluating a finite composition, the model integrates a differential equation from an initial state to a final time.

The output is

z(t_1),

obtained by solving the ODE beginning from

z(t_0) = z_0.

This transforms network depth into integration time.

Residual Networks and Continuous Limits

Neural ODEs emerged from the observation that residual networks resemble Euler discretizations.

A residual block has the form

h_{k+1} = h_k + f(h_k, \theta_k).

Introduce a step size h:

h_{k+1} = h_k + h\,f(h_k, \theta_k).

This is explicit Euler integration for

\frac{dz(t)}{dt} = f(z(t), \theta(t)).

As the number of layers increases and the step size decreases, the discrete residual network approaches a continuous dynamical system.

Neural ODEs therefore interpret deep networks as continuous-time flows.
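This correspondence is easy to verify numerically. The sketch below (a toy tanh field with a fixed weight matrix standing in for a learned layer; all names are illustrative) shows that a residual stack with step size h computes exactly the same thing as explicit Euler integration of the shared vector field:

```python
import numpy as np

def f(z, theta):
    # Toy "layer": a tanh applied to a fixed linear map (illustrative only).
    return np.tanh(theta @ z)

def resnet_forward(z, theta, n_layers, h):
    # Residual stack: h_{k+1} = h_k + h * f(h_k, theta).
    for _ in range(n_layers):
        z = z + h * f(z, theta)
    return z

def euler_integrate(z, theta, t0, t1, n_steps):
    # Explicit Euler for dz/dt = f(z, theta) on [t0, t1].
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        z = z + h * f(z, theta)
    return z

theta = np.array([[0.0, -1.0], [1.0, 0.0]])
z0 = np.array([1.0, 0.0])
out_resnet = resnet_forward(z0, theta, n_layers=100, h=0.01)
out_euler = euler_integrate(z0, theta, 0.0, 1.0, n_steps=100)
```

With 100 layers and h = 0.01 both loops perform identical updates, so the two outputs agree exactly.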

Continuous Dynamics

The core model is

\dot{z}(t) = f_\theta(z(t), t), \qquad z(t_0) = x.

The output prediction is

y = g(z(t_1)).

The vector field is usually parameterized by a neural network.

A typical implementation is:

dz/dt = neural_net(z, t, theta)

The forward pass numerically integrates the ODE:

z1 = ode_solve(f_theta, z0, t0, t1)

Integration may use fixed-step methods, adaptive Runge-Kutta schemes such as Dormand-Prince, implicit BDF methods for stiff problems, or other ODE solvers.
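A minimal forward pass can be sketched as follows (fixed-step RK4 is used for concreteness; `f_theta` is a stand-in for the learned vector field, and the weight matrix is arbitrary):

```python
import numpy as np

def f_theta(z, t, theta):
    # Stand-in for a learned vector field; t is unused here (autonomous dynamics).
    return np.tanh(theta @ z)

def rk4_step(f, z, t, h, theta):
    # One classical fourth-order Runge-Kutta step.
    k1 = f(z, t, theta)
    k2 = f(z + 0.5 * h * k1, t + 0.5 * h, theta)
    k3 = f(z + 0.5 * h * k2, t + 0.5 * h, theta)
    k4 = f(z + h * k3, t + h, theta)
    return z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def ode_solve(f, z0, t0, t1, theta, n_steps=50):
    # The forward pass: integrate dz/dt = f(z, t, theta) from t0 to t1.
    h = (t1 - t0) / n_steps
    z, t = z0, t0
    for _ in range(n_steps):
        z = rk4_step(f, z, t, h, theta)
        t += h
    return z

theta = 0.5 * np.eye(2)
z1 = ode_solve(f_theta, np.array([1.0, -1.0]), 0.0, 1.0, theta)
```

An adaptive solver would additionally adjust `h` from a local error estimate, but the overall structure is the same.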

Flow Interpretation

A Neural ODE defines a continuous transformation of space.

Each input point evolves according to the vector field:

\dot{z} = f_\theta(z, t).

The trajectory defines a flow map:

\phi_{t_0 \to t_1}(x).

Thus the network computes

z(t_1) = \phi_{t_0 \to t_1}(x).

Unlike ordinary feedforward networks, the transformation is constrained to arise from continuous dynamics.

This introduces geometric structure into the model.

Existence and Uniqueness

If f_\theta is sufficiently smooth and Lipschitz continuous in z, then the ODE has a unique local solution (by the Picard-Lindelöf theorem).

The learned model therefore defines a deterministic continuous transformation.

This differs from arbitrary discrete architectures, which may introduce discontinuities or undefined behavior.

The smoothness assumptions also affect differentiability of the loss with respect to parameters.

Training Neural ODEs

Suppose the loss is

\ell = L(z(t_1)).

Training requires gradients with respect to θ.

One option is discrete reverse-mode automatic differentiation through the operations of the numerical solver.

Another option is the continuous adjoint method.

The continuous adjoint introduces

a(t) = \frac{\partial \ell}{\partial z(t)}.

The adjoint equation is

\dot{a} = -f_z^T a,

with terminal condition

a(t_1) = L_z^T.

The parameter gradient is

\frac{d\ell}{d\theta} = \int_{t_0}^{t_1} \left(\frac{\partial f}{\partial \theta}\right)^{T} a \, dt.

This avoids storing every solver state explicitly.

Continuous Adjoint Implementation

A typical implementation has the following structure.

Forward pass

z1 = ode_solve(f_theta, z0, t0, t1)

Backward pass

Integrate backward:

da/dt = -f_z^T a

while accumulating:

dL/dtheta += (df/dtheta)^T a

This backward solve may reconstruct or recompute forward states.
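As a concrete sketch, assume a linear vector field f(z) = Az and the loss L(z(t₁)) = ½‖z(t₁)‖², chosen so that every Jacobian is explicit. The backward pass then integrates the state, the adjoint, and the parameter gradient jointly, mirroring the structure above:

```python
import numpy as np

def rk4(rhs, y, h, n):
    # Generic fixed-step RK4; a negative h integrates backward in time.
    for _ in range(n):
        k1 = rhs(y)
        k2 = rhs(y + 0.5 * h * k1)
        k3 = rhs(y + 0.5 * h * k2)
        k4 = rhs(y + h * k3)
        y = y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

def loss(z1):
    # L(z(t1)) = 0.5 * ||z(t1)||^2, so dL/dz(t1) = z(t1).
    return 0.5 * float(z1 @ z1)

def adjoint_grad(z0, A, T=1.0, n=400):
    d = z0.size
    h = T / n
    # Forward pass: dz/dt = A z.
    z1 = rk4(lambda z: A @ z, z0, h, n)
    # Backward pass on the augmented state (z, a, G):
    #   dz/dt = A z,   da/dt = -A^T a,   dG/dt = -outer(a, z),
    # integrated from t1 down to t0 with a(t1) = dL/dz(t1) and G(t1) = 0.
    def back_rhs(s):
        z, a = s[:d], s[d:2 * d]
        return np.concatenate([A @ z, -A.T @ a, -np.outer(a, z).ravel()])
    s1 = np.concatenate([z1, z1, np.zeros(d * d)])
    s0 = rk4(back_rhs, s1, -h, n)
    return loss(z1), s0[2 * d:].reshape(d, d)  # G(t0) = dL/dA

A = np.array([[-0.3, 1.0], [-1.0, -0.2]])
z0 = np.array([1.0, 0.5])
L_value, grad_A = adjoint_grad(z0, A)
```

For this field the adjoint equation is ȧ = -Aᵀa and the gradient integrand is the outer product of a and z; the forward state is recomputed during the backward solve rather than stored, which is the memory saving the method is known for.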

Adaptive Computation

A Neural ODE can adapt its effective depth dynamically.

Simple inputs may require few integration steps.

Complex trajectories may require many steps.

This differs from ordinary fixed-depth networks.

The computational cost becomes data dependent:

| Input difficulty | Solver work |
| --- | --- |
| smooth dynamics | few steps |
| rapidly changing dynamics | many steps |

Adaptive solvers therefore provide dynamic computation allocation.

Continuous Depth

Traditional networks define representations only at discrete layers:

h0 -> h1 -> h2 -> h3

Neural ODEs define representations at every time:

z(t).

Intermediate states can be queried continuously:

z(0.2), \quad z(0.5), \quad z(0.93).

This gives a continuous notion of feature evolution.
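For illustration, the sketch below (an invented time-dependent rotational field) queries the latent state at arbitrary times by integrating up to each query point:

```python
import numpy as np

def f(z, t):
    # Invented time-dependent rotational field (norm-preserving).
    return (1.0 + 0.1 * t) * np.array([-z[1], z[0]])

def z_at(t_query, z0, n_steps=200):
    # The representation exists at every time: integrate from 0 to any t_query.
    h = t_query / n_steps
    z, t = z0.copy(), 0.0
    for _ in range(n_steps):
        k1 = f(z, t)
        k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(z + h * k3, t + h)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return z

z0 = np.array([1.0, 0.0])
snapshots = {t: z_at(t, z0) for t in (0.2, 0.5, 0.93)}
```

Because this particular field is a pure rotation, every snapshot stays on the unit circle, which makes the example easy to sanity-check. Adaptive solvers expose the same idea more efficiently through dense output.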

Parameter Sharing

A residual network usually has different parameters at each layer.

Neural ODEs often share parameters across time:

\dot{z} = f_\theta(z, t).

The same vector field governs the entire trajectory.

This reduces parameter count and introduces temporal consistency.

However, it may reduce expressiveness compared with unconstrained layerwise architectures.

Stability

Continuous dynamics introduce new notions of stability.

The Jacobian

f_z = \frac{\partial f}{\partial z}

controls local expansion and contraction.

If eigenvalues of f_z have large positive real parts, trajectories may diverge rapidly.

If eigenvalues are strongly negative, trajectories may collapse or become stiff.

Thus training Neural ODEs often requires controlling dynamical stability.
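A linear example makes this concrete (the matrices are chosen purely for illustration): eigenvalues with real part +0.5 expand trajectories exponentially, while real part -0.5 contracts them.

```python
import numpy as np

def final_norm(A, z0, T=5.0, n=1000):
    # Fixed-step RK4 for the linear system dz/dt = A z; return ||z(T)||.
    h = T / n
    z = z0.copy()
    for _ in range(n):
        k1 = A @ z
        k2 = A @ (z + 0.5 * h * k1)
        k3 = A @ (z + 0.5 * h * k2)
        k4 = A @ (z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return np.linalg.norm(z)

z0 = np.array([1.0, 1.0])
A_expand = np.array([[0.5, -1.0], [1.0, 0.5]])      # eigenvalues 0.5 +/- i
A_contract = np.array([[-0.5, -1.0], [1.0, -0.5]])  # eigenvalues -0.5 +/- i

norm_expand = final_norm(A_expand, z0)
norm_contract = final_norm(A_contract, z0)
```

Over T = 5 the expanding field grows the state norm by roughly e^{2.5}, and the contracting field shrinks it by the same factor.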

Stiffness

Some learned vector fields become stiff.

A stiff system contains multiple time scales:

\dot{z} = f(z, t)

where some modes evolve much faster than others.

Consequences include:

| Problem | Effect |
| --- | --- |
| tiny stable step size | slow integration |
| unstable backward solve | poor gradients |
| adaptive solver explosion | excessive computation |

Stiffness is a major practical issue in Neural ODE training.

Implicit integrators may help but increase computational cost.
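The standard one-dimensional example shows the tradeoff (numbers chosen for illustration). For ż = λz with λ = -50 and a step size of 0.1, explicit Euler has amplification factor |1 + hλ| = 4 > 1 and blows up, while implicit Euler remains stable:

```python
lam = -50.0  # fast decaying mode: the source of stiffness
h = 0.1      # step size sized for the slow dynamics, not the fast mode
n = 20

z_explicit = 1.0
z_implicit = 1.0
for _ in range(n):
    # Explicit Euler: z_{k+1} = (1 + h*lam) * z_k = -4 * z_k here.
    z_explicit = z_explicit + h * lam * z_explicit
    # Implicit Euler: z_{k+1} = z_k + h*lam*z_{k+1}  =>  z_{k+1} = z_k / (1 - h*lam).
    z_implicit = z_implicit / (1.0 - h * lam)
```

The true solution e^{λt} decays to essentially zero; implicit Euler tracks that decay, while explicit Euler oscillates and explodes unless the step size shrinks to roughly 2/|λ|, which is exactly the "tiny stable step size" problem above.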

Reversibility

An ODE flow is locally invertible under smooth conditions.

Thus Neural ODEs naturally support reversible computation.

This motivated interest in memory-efficient training, because earlier states can theoretically be reconstructed from later states.

In practice, exact reversibility is limited by:

| Source | Problem |
| --- | --- |
| floating point error | reconstruction drift |
| adaptive solvers | path mismatch |
| chaotic dynamics | exponential sensitivity |

As a result, practical systems often use checkpointing instead of perfect reconstruction.
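The reconstruction drift is easy to demonstrate (explicit Euler is used deliberately here because it makes the mismatch visible; higher-order and adaptive solvers shrink the drift but do not eliminate it):

```python
lam = 2.0   # growth rate of the toy dynamics dz/dt = lam * z
h = 0.01
n = 100
z0 = 1.0

# Forward explicit Euler pass.
z = z0
for _ in range(n):
    z = z + h * lam * z

# "Reconstruct" the initial state by integrating the same ODE backward in time.
z_rec = z
for _ in range(n):
    z_rec = z_rec - h * lam * z_rec

# One forward step times one backward step multiplies the state by
# (1 + h*lam) * (1 - h*lam) = 1 - (h*lam)^2 != 1, so the reconstruction drifts.
drift = abs(z_rec - z0)
```

Here the recovered initial state is off by a few percent after only 100 steps, which is why checkpointing is often preferred over exact reconstruction.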

Expressiveness

A continuous flow imposes structural constraints.

Not every function can be represented as a smooth ODE flow over finite time.

For example, trajectories cannot cross in state space under deterministic smooth dynamics.

This limits representational power compared with arbitrary discrete mappings.

To increase expressiveness, extensions include:

| Method | Idea |
| --- | --- |
| augmented Neural ODEs | increase the latent dimension |
| stochastic differential equations | add noise |
| controlled differential equations | drive the dynamics with external signals |
| jump dynamics | allow discontinuities |

Continuous Normalizing Flows

A major application is density modeling.

Ordinary normalizing flows compose invertible discrete transformations:

x \to z_1 \to z_2 \to z_3.

Continuous normalizing flows instead evolve densities continuously.

The dynamics are

\dot{z} = f_\theta(z, t).

The log-density evolves according to

\frac{d}{dt} \log p(z(t)) = -\operatorname{tr}(f_z).

This avoids explicit Jacobian determinants of discrete transformations.

The likelihood becomes

\log p(z(t_1)) = \log p(z(t_0)) - \int_{t_0}^{t_1} \operatorname{tr}(f_z)\,dt.

This connects Neural ODEs with probabilistic modeling and transport theory.
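The trace identity can be checked numerically. The sketch below (an arbitrary tanh vector field; all names are illustrative) integrates the state and the log-density change as one augmented system, which is how CNF implementations are typically organized:

```python
import numpy as np

A = np.array([[0.2, -1.0], [1.0, 0.1]])

def f(z):
    # Illustrative vector field f(z) = tanh(A z).
    return np.tanh(A @ z)

def jac_f(z):
    # Jacobian of tanh(A z): diag(1 - tanh(A z)^2) @ A.
    s = 1.0 - np.tanh(A @ z) ** 2
    return s[:, None] * A

def rhs(s):
    # Augmented state (z, delta_logp); d(delta_logp)/dt = -tr(df/dz).
    z = s[:2]
    return np.concatenate([f(z), [-np.trace(jac_f(z))]])

def flow_aug(x, T=1.0, n=400):
    # RK4 on the augmented system from t = 0 to t = T.
    h = T / n
    s = np.concatenate([x, [0.0]])
    for _ in range(n):
        k1 = rhs(s)
        k2 = rhs(s + 0.5 * h * k1)
        k3 = rhs(s + 0.5 * h * k2)
        k4 = rhs(s + h * k3)
        s = s + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return s[:2], s[2]  # z(T) and log p(z(T)) - log p(z(0))

x = np.array([0.7, -0.3])
z1, dlogp = flow_aug(x)
```

The accumulated `dlogp` equals minus the log-determinant of the flow map's Jacobian, so no explicit determinant is ever formed; in high dimensions the trace itself is usually estimated stochastically (e.g. Hutchinson's estimator) rather than computed exactly as here.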

Latent ODE Models

Neural ODEs are useful for irregularly sampled time series.

Suppose observations occur at uneven times:

(t_1, x_1), \quad (t_2, x_2), \quad (t_3, x_3).

A latent ODE model learns hidden continuous dynamics between observations.

The latent state evolves continuously:

\dot{z} = f_\theta(z, t).

Observations are decoded from latent states at arbitrary times.

This gives a principled continuous-time model for asynchronous data.

Applications include:

| Domain | Example |
| --- | --- |
| healthcare | irregular clinical measurements |
| physics | sparse sensor data |
| finance | event-driven observations |
| robotics | asynchronous control streams |

Controlled Differential Equations

Real systems often depend on external signals:

\dot{z} = f(z, u(t), t).

Controlled differential equations generalize recurrent networks to continuous time.

The driving signal u(t) may represent:

| Signal | Meaning |
| --- | --- |
| sensor stream | physical measurement |
| text embedding | sequential input |
| market signal | external forcing |
| action stream | control input |

Neural controlled differential equations extend Neural ODE ideas to path-dependent systems.

Solver Dependence

The model output depends on the numerical solver.

Different solvers may produce different trajectories:

| Solver | Behavior |
| --- | --- |
| Euler | cheap but inaccurate |
| RK4 | accurate fixed-step |
| adaptive Runge-Kutta | dynamic precision |
| implicit BDF | stable for stiff problems |

Thus the numerical method becomes part of the effective model.

This differs from ordinary neural networks, where layer evaluation is usually deterministic and exact up to floating point arithmetic.
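A minimal comparison on ż = -z, whose exact solution is e^{-t}, makes the solver dependence concrete (both methods use the same number of steps):

```python
import numpy as np

def rhs(z):
    # dz/dt = -z, with exact solution z(t) = e^{-t} from z(0) = 1.
    return -z

h, n = 0.1, 10
exact = np.exp(-1.0)

z_euler = 1.0
for _ in range(n):
    z_euler = z_euler + h * rhs(z_euler)

z_rk4 = 1.0
for _ in range(n):
    k1 = rhs(z_rk4)
    k2 = rhs(z_rk4 + 0.5 * h * k1)
    k3 = rhs(z_rk4 + 0.5 * h * k2)
    k4 = rhs(z_rk4 + h * k3)
    z_rk4 = z_rk4 + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

err_euler = abs(z_euler - exact)
err_rk4 = abs(z_rk4 - exact)
```

With ten steps each, Euler is off by about two percent while RK4 is accurate to several more digits: two "models" built from the same vector field produce measurably different outputs.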

Continuous vs Discrete Gradients

Two gradients are possible.

Discrete gradient

Differentiate the actual numerical solver.

Continuous adjoint gradient

Differentiate the underlying continuous ODE.

The two coincide only in the limit of exact integration.

Discrete differentiation is often more faithful to the implemented computation. Continuous adjoints may be more memory efficient but less numerically accurate.

Modern systems increasingly favor hybrid approaches.

Computational Cost

A Neural ODE replaces fixed network depth with numerical integration cost.

Training cost depends on:

| Factor | Effect |
| --- | --- |
| vector field complexity | harder integration |
| stiffness | smaller step sizes |
| tighter tolerance | more function evaluations |
| backward method | memory/runtime tradeoff |

Function evaluations often dominate runtime.

A difficult trajectory may require hundreds or thousands of evaluations.

Failure Modes

Neural ODEs can fail in several ways.

Solver instability

The vector field may generate exploding trajectories.

Adjoint mismatch

Backward reconstruction may not match the forward trajectory.

Stiffness

Tiny required step sizes can make training impractical.

Excessive solver work

Adaptive solvers may spend large computation on difficult inputs.

Vanishing sensitivity

Long stable trajectories may suppress gradients.

Chaotic dynamics

Tiny perturbations may create unpredictable gradients.

These problems become severe in long-time integration.

Comparison with Residual Networks

| Residual Network | Neural ODE |
| --- | --- |
| finite layers | continuous dynamics |
| explicit composition | ODE integration |
| fixed computation | adaptive computation |
| separate layer weights | shared vector field |
| direct reverse AD | adjoint dynamics possible |
| simple implementation | solver-dependent behavior |

Residual networks are often easier to train and optimize.

Neural ODEs provide stronger geometric structure and continuous-time modeling capabilities.

Geometric Perspective

A Neural ODE defines a vector field on latent space.

Learning becomes the problem of shaping trajectories through that space.

This connects deep learning with:

| Field | Connection |
| --- | --- |
| dynamical systems | stability and flows |
| differential geometry | manifolds and transport |
| control theory | trajectory optimization |
| numerical analysis | integration accuracy |
| physics | continuous evolution |

The network is no longer only a function approximator. It becomes a learned dynamical process.

Summary

Neural ODEs reinterpret deep learning as continuous-time dynamical systems.

A vector field defines state evolution through an ordinary differential equation. The forward pass solves the ODE. The backward pass computes sensitivities through adjoint dynamics or discrete reverse differentiation.

This framework unifies neural networks, dynamical systems, numerical integration, and optimal control. It also introduces new computational and mathematical challenges involving stability, stiffness, solver dependence, and continuous-time sensitivity analysis.