Neural ODEs

Classical neural networks apply a finite sequence of transformations:

h_{k+1} = f_k(h_k, \theta_k).

Depth corresponds to the number of layers.

A neural ordinary differential equation (Neural ODE) replaces this discrete stack with continuous dynamics:

\frac{dz(t)}{dt} = f_\theta(z(t), t).

Instead of evaluating a finite composition, the model integrates a differential equation from an initial state to a final time.

The output is

z(t_1),

obtained by solving the ODE beginning from

z(t_0) = z_0.

This transforms network depth into integration time.

Residual Networks and Continuous Limits

Neural ODEs emerged from the observation that residual networks resemble Euler discretizations.

A residual block has the form

h_{k+1} = h_k + f(h_k, \theta_k).

Introduce a step size h:

h_{k+1} = h_k + h\,f(h_k, \theta_k).

This is explicit Euler integration for

\frac{dz(t)}{dt} = f(z(t), \theta(t)).

As the number of layers increases and the step size decreases, the discrete residual network approaches a continuous dynamical system.

Neural ODEs therefore interpret deep networks as continuous-time flows.
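This correspondence is easy to verify numerically. The sketch below (a toy tanh field with a fixed weight matrix standing in for a learned layer; all names are illustrative) shows that a residual stack with step size h computes exactly the same thing as explicit Euler integration of the shared vector field:

```python
import numpy as np

def f(z, theta):
    # Toy "layer": a tanh applied to a fixed linear map (illustrative only).
    return np.tanh(theta @ z)

def resnet_forward(z, theta, n_layers, h):
    # Residual stack: h_{k+1} = h_k + h * f(h_k, theta).
    for _ in range(n_layers):
        z = z + h * f(z, theta)
    return z

def euler_integrate(z, theta, t0, t1, n_steps):
    # Explicit Euler for dz/dt = f(z, theta) on [t0, t1].
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        z = z + h * f(z, theta)
    return z

theta = np.array([[0.0, -1.0], [1.0, 0.0]])
z0 = np.array([1.0, 0.0])
out_resnet = resnet_forward(z0, theta, n_layers=100, h=0.01)
out_euler = euler_integrate(z0, theta, 0.0, 1.0, n_steps=100)
```

With 100 layers and h = 0.01 both loops perform identical updates, so the two outputs agree exactly.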

Continuous Dynamics

The core model is

\dot{z}(t) = f_\theta(z(t), t), \qquad z(t_0) = x.

The output prediction is

y = g(z(t_1)).

The vector field is usually parameterized by a neural network.

A typical implementation is:

dz/dt = neural_net(z, t, theta)

The forward pass numerically integrates the ODE:

z1 = ode_solve(f_theta, z0, t0, t1)

Integration may use fixed-step methods, adaptive Runge-Kutta schemes such as Dormand-Prince, implicit BDF methods for stiff problems, or other ODE solvers.
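A minimal forward pass can be sketched as follows (fixed-step RK4 is used for concreteness; `f_theta` is a stand-in for the learned vector field, and the weight matrix is arbitrary):

```python
import numpy as np

def f_theta(z, t, theta):
    # Stand-in for a learned vector field; t is unused here (autonomous dynamics).
    return np.tanh(theta @ z)

def rk4_step(f, z, t, h, theta):
    # One classical fourth-order Runge-Kutta step.
    k1 = f(z, t, theta)
    k2 = f(z + 0.5 * h * k1, t + 0.5 * h, theta)
    k3 = f(z + 0.5 * h * k2, t + 0.5 * h, theta)
    k4 = f(z + h * k3, t + h, theta)
    return z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def ode_solve(f, z0, t0, t1, theta, n_steps=50):
    # The forward pass: integrate dz/dt = f(z, t, theta) from t0 to t1.
    h = (t1 - t0) / n_steps
    z, t = z0, t0
    for _ in range(n_steps):
        z = rk4_step(f, z, t, h, theta)
        t += h
    return z

theta = 0.5 * np.eye(2)
z1 = ode_solve(f_theta, np.array([1.0, -1.0]), 0.0, 1.0, theta)
```

An adaptive solver would additionally adjust `h` from a local error estimate, but the overall structure is the same.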

Flow Interpretation

A Neural ODE defines a continuous transformation of space.

Each input point evolves according to the vector field:

\dot{z} = f_\theta(z, t).

The trajectory defines a flow map:

\phi_{t_0 \to t_1}(x).

Thus the network computes

z(t_1) = \phi_{t_0 \to t_1}(x).

Unlike ordinary feedforward networks, the transformation is constrained to arise from continuous dynamics.

This introduces geometric structure into the model.

Existence and Uniqueness

If f_\theta is sufficiently smooth and Lipschitz continuous in z, then the ODE has a unique local solution (by the Picard-Lindelöf theorem).

The learned model therefore defines a deterministic continuous transformation.

This differs from arbitrary discrete architectures, which may introduce discontinuities or undefined behavior.

The smoothness assumptions also affect differentiability of the loss with respect to parameters.

Training Neural ODEs

Suppose the loss is

\ell = L(z(t_1)).

Training requires gradients with respect to θ.

One option is discrete reverse-mode automatic differentiation through the operations of the numerical solver.

Another option is the continuous adjoint method.

The continuous adjoint introduces

a(t) = \frac{\partial \ell}{\partial z(t)}.

The adjoint equation is

\dot{a} = -f_z^T a,

with terminal condition

a(t_1) = L_z^T.

The parameter gradient is

\frac{d\ell}{d\theta} = \int_{t_0}^{t_1} \left(\frac{\partial f}{\partial \theta}\right)^{T} a \, dt.

This avoids storing every solver state explicitly.

Continuous Adjoint Implementation

A typical implementation has the following structure.

Forward pass

z1 = ode_solve(f_theta, z0, t0, t1)

Backward pass

Integrate backward:

da/dt = -f_z^T a

while accumulating:

dL/dtheta += (df/dtheta)^T a

This backward solve may reconstruct or recompute forward states.
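As a concrete sketch, assume a linear vector field f(z) = Az and the loss L(z(t₁)) = ½‖z(t₁)‖², chosen so that every Jacobian is explicit. The backward pass then integrates the state, the adjoint, and the parameter gradient jointly, mirroring the structure above:

```python
import numpy as np

def rk4(rhs, y, h, n):
    # Generic fixed-step RK4; a negative h integrates backward in time.
    for _ in range(n):
        k1 = rhs(y)
        k2 = rhs(y + 0.5 * h * k1)
        k3 = rhs(y + 0.5 * h * k2)
        k4 = rhs(y + h * k3)
        y = y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

def loss(z1):
    # L(z(t1)) = 0.5 * ||z(t1)||^2, so dL/dz(t1) = z(t1).
    return 0.5 * float(z1 @ z1)

def adjoint_grad(z0, A, T=1.0, n=400):
    d = z0.size
    h = T / n
    # Forward pass: dz/dt = A z.
    z1 = rk4(lambda z: A @ z, z0, h, n)
    # Backward pass on the augmented state (z, a, G):
    #   dz/dt = A z,   da/dt = -A^T a,   dG/dt = -outer(a, z),
    # integrated from t1 down to t0 with a(t1) = dL/dz(t1) and G(t1) = 0.
    def back_rhs(s):
        z, a = s[:d], s[d:2 * d]
        return np.concatenate([A @ z, -A.T @ a, -np.outer(a, z).ravel()])
    s1 = np.concatenate([z1, z1, np.zeros(d * d)])
    s0 = rk4(back_rhs, s1, -h, n)
    return loss(z1), s0[2 * d:].reshape(d, d)  # G(t0) = dL/dA

A = np.array([[-0.3, 1.0], [-1.0, -0.2]])
z0 = np.array([1.0, 0.5])
L_value, grad_A = adjoint_grad(z0, A)
```

For this field the adjoint equation is ȧ = -Aᵀa and the gradient integrand is the outer product of a and z; the forward state is recomputed during the backward solve rather than stored, which is the memory saving the method is known for.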

Adaptive Computation

A Neural ODE can adapt its effective depth dynamically.

Simple inputs may require few integration steps.

Complex trajectories may require many steps.

This differs from ordinary fixed-depth networks.

The computational cost becomes data dependent:

| Input difficulty | Solver work |
| --- | --- |
| smooth dynamics | few steps |
| rapidly changing dynamics | many steps |

Adaptive solvers therefore provide dynamic computation allocation.

Continuous Depth

Traditional networks define representations only at discrete layers:

h0 -> h1 -> h2 -> h3

Neural ODEs define representations at every time:

z(t).

Intermediate states can be queried continuously:

z(0.2), \quad z(0.5), \quad z(0.93).

This gives a continuous notion of feature evolution.
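For illustration, the sketch below (an invented time-dependent rotational field) queries the latent state at arbitrary times by integrating up to each query point:

```python
import numpy as np

def f(z, t):
    # Invented time-dependent rotational field (norm-preserving).
    return (1.0 + 0.1 * t) * np.array([-z[1], z[0]])

def z_at(t_query, z0, n_steps=200):
    # The representation exists at every time: integrate from 0 to any t_query.
    h = t_query / n_steps
    z, t = z0.copy(), 0.0
    for _ in range(n_steps):
        k1 = f(z, t)
        k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(z + h * k3, t + h)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return z

z0 = np.array([1.0, 0.0])
snapshots = {t: z_at(t, z0) for t in (0.2, 0.5, 0.93)}
```

Because this particular field is a pure rotation, every snapshot stays on the unit circle, which makes the example easy to sanity-check. Adaptive solvers expose the same idea more efficiently through dense output.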

Parameter Sharing

A residual network usually has different parameters at each layer.

Neural ODEs often share parameters across time:

\dot{z} = f_\theta(z, t).

The same vector field governs the entire trajectory.

This reduces parameter count and introduces temporal consistency.

However, it may reduce expressiveness compared with unconstrained layerwise architectures.

Stability

Continuous dynamics introduce new notions of stability.

The Jacobian

f_z = \frac{\partial f}{\partial z}

controls local expansion and contraction.

If eigenvalues of f_z have large positive real parts, trajectories may diverge rapidly.

If eigenvalues are strongly negative, trajectories may collapse or become stiff.

Thus training Neural ODEs often requires controlling dynamical stability.
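A linear example makes this concrete (the matrices are chosen purely for illustration): eigenvalues with real part +0.5 expand trajectories exponentially, while real part -0.5 contracts them.

```python
import numpy as np

def final_norm(A, z0, T=5.0, n=1000):
    # Fixed-step RK4 for the linear system dz/dt = A z; return ||z(T)||.
    h = T / n
    z = z0.copy()
    for _ in range(n):
        k1 = A @ z
        k2 = A @ (z + 0.5 * h * k1)
        k3 = A @ (z + 0.5 * h * k2)
        k4 = A @ (z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return np.linalg.norm(z)

z0 = np.array([1.0, 1.0])
A_expand = np.array([[0.5, -1.0], [1.0, 0.5]])      # eigenvalues 0.5 +/- i
A_contract = np.array([[-0.5, -1.0], [1.0, -0.5]])  # eigenvalues -0.5 +/- i

norm_expand = final_norm(A_expand, z0)
norm_contract = final_norm(A_contract, z0)
```

Over T = 5 the expanding field grows the state norm by roughly e^{2.5}, and the contracting field shrinks it by the same factor.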

Stiffness

Some learned vector fields become stiff.

A stiff system contains multiple time scales:

\dot{z} = f(z, t)

where some modes evolve much faster than others.

Consequences include:

| Problem | Effect |
| --- | --- |
| tiny stable step size | slow integration |
| unstable backward solve | poor gradients |
| adaptive solver explosion | excessive computation |

Stiffness is a major practical issue in Neural ODE training.

Implicit integrators may help but increase computational cost.
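The standard one-dimensional example shows the tradeoff (numbers chosen for illustration). For ż = λz with λ = -50 and a step size of 0.1, explicit Euler has amplification factor |1 + hλ| = 4 > 1 and blows up, while implicit Euler remains stable:

```python
lam = -50.0  # fast decaying mode: the source of stiffness
h = 0.1      # step size sized for the slow dynamics, not the fast mode
n = 20

z_explicit = 1.0
z_implicit = 1.0
for _ in range(n):
    # Explicit Euler: z_{k+1} = (1 + h*lam) * z_k = -4 * z_k here.
    z_explicit = z_explicit + h * lam * z_explicit
    # Implicit Euler: z_{k+1} = z_k + h*lam*z_{k+1}  =>  z_{k+1} = z_k / (1 - h*lam).
    z_implicit = z_implicit / (1.0 - h * lam)
```

The true solution e^{λt} decays to essentially zero; implicit Euler tracks that decay, while explicit Euler oscillates and explodes unless the step size shrinks to roughly 2/|λ|, which is exactly the "tiny stable step size" problem above.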

Reversibility

An ODE flow is locally invertible under smooth conditions.

Thus Neural ODEs naturally support reversible computation.

This motivated interest in memory-efficient training, because earlier states can theoretically be reconstructed from later states.

In practice, exact reversibility is limited by:

| Source | Problem |
| --- | --- |
| floating point error | reconstruction drift |
| adaptive solvers | path mismatch |
| chaotic dynamics | exponential sensitivity |

As a result, practical systems often use checkpointing instead of perfect reconstruction.
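The reconstruction drift is easy to demonstrate (explicit Euler is used deliberately here because it makes the mismatch visible; higher-order and adaptive solvers shrink the drift but do not eliminate it):

```python
lam = 2.0   # growth rate of the toy dynamics dz/dt = lam * z
h = 0.01
n = 100
z0 = 1.0

# Forward explicit Euler pass.
z = z0
for _ in range(n):
    z = z + h * lam * z

# "Reconstruct" the initial state by integrating the same ODE backward in time.
z_rec = z
for _ in range(n):
    z_rec = z_rec - h * lam * z_rec

# One forward step times one backward step multiplies the state by
# (1 + h*lam) * (1 - h*lam) = 1 - (h*lam)^2 != 1, so the reconstruction drifts.
drift = abs(z_rec - z0)
```

Here the recovered initial state is off by a few percent after only 100 steps, which is why checkpointing is often preferred over exact reconstruction.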

Expressiveness

A continuous flow imposes structural constraints.

Not every function can be represented as a smooth ODE flow over finite time.

For example, trajectories cannot cross in state space under deterministic smooth dynamics.

This limits representational power compared with arbitrary discrete mappings.

To increase expressiveness, extensions include:

| Method | Idea |
| --- | --- |
| augmented Neural ODEs | increase the latent dimension |
| stochastic differential equations | add noise |
| controlled differential equations | drive the dynamics with external signals |
| jump dynamics | allow discontinuities |

Continuous Normalizing Flows

A major application is density modeling.

Ordinary normalizing flows compose invertible discrete transformations:

x \to z_1 \to z_2 \to z_3.

Continuous normalizing flows instead evolve densities continuously.

The dynamics are

\dot{z} = f_\theta(z, t).

The log-density evolves according to

\frac{d}{dt} \log p(z(t)) = -\operatorname{tr}(f_z).

This avoids explicit Jacobian determinants of discrete transformations.

The likelihood becomes

\log p(z(t_1)) = \log p(z(t_0)) - \int_{t_0}^{t_1} \operatorname{tr}(f_z)\,dt.

This connects Neural ODEs with probabilistic modeling and transport theory.
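The trace identity can be checked numerically. The sketch below (an arbitrary tanh vector field; all names are illustrative) integrates the state and the log-density change as one augmented system, which is how CNF implementations are typically organized:

```python
import numpy as np

A = np.array([[0.2, -1.0], [1.0, 0.1]])

def f(z):
    # Illustrative vector field f(z) = tanh(A z).
    return np.tanh(A @ z)

def jac_f(z):
    # Jacobian of tanh(A z): diag(1 - tanh(A z)^2) @ A.
    s = 1.0 - np.tanh(A @ z) ** 2
    return s[:, None] * A

def rhs(s):
    # Augmented state (z, delta_logp); d(delta_logp)/dt = -tr(df/dz).
    z = s[:2]
    return np.concatenate([f(z), [-np.trace(jac_f(z))]])

def flow_aug(x, T=1.0, n=400):
    # RK4 on the augmented system from t = 0 to t = T.
    h = T / n
    s = np.concatenate([x, [0.0]])
    for _ in range(n):
        k1 = rhs(s)
        k2 = rhs(s + 0.5 * h * k1)
        k3 = rhs(s + 0.5 * h * k2)
        k4 = rhs(s + h * k3)
        s = s + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return s[:2], s[2]  # z(T) and log p(z(T)) - log p(z(0))

x = np.array([0.7, -0.3])
z1, dlogp = flow_aug(x)
```

The accumulated `dlogp` equals minus the log-determinant of the flow map's Jacobian, so no explicit determinant is ever formed; in high dimensions the trace itself is usually estimated stochastically (e.g. Hutchinson's estimator) rather than computed exactly as here.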

Latent ODE Models

Neural ODEs are useful for irregularly sampled time series.

Suppose observations occur at uneven times:

(t_1, x_1), \quad (t_2, x_2), \quad (t_3, x_3).

A latent ODE model learns hidden continuous dynamics between observations.

The latent state evolves continuously:

\dot{z} = f_\theta(z, t).

Observations are decoded from latent states at arbitrary times.

This gives a principled continuous-time model for asynchronous data.

Applications include:

| Domain | Example |
| --- | --- |
| healthcare | irregular clinical measurements |
| physics | sparse sensor data |
| finance | event-driven observations |
| robotics | asynchronous control streams |

Controlled Differential Equations

Real systems often depend on external signals:

\dot{z} = f(z, u(t), t).

Controlled differential equations generalize recurrent networks to continuous time.

The driving signal u(t) may represent:

| Signal | Meaning |
| --- | --- |
| sensor stream | physical measurement |
| text embedding | sequential input |
| market signal | external forcing |
| action stream | control input |

Neural controlled differential equations extend Neural ODE ideas to path-dependent systems.

Solver Dependence

The model output depends on the numerical solver.

Different solvers may produce different trajectories:

| Solver | Behavior |
| --- | --- |
| Euler | cheap but inaccurate |
| RK4 | accurate fixed-step |
| adaptive Runge-Kutta | dynamic precision |
| implicit BDF | stable for stiff problems |

Thus the numerical method becomes part of the effective model.

This differs from ordinary neural networks, where layer evaluation is usually deterministic and exact up to floating point arithmetic.
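A minimal comparison on ż = -z, whose exact solution is e^{-t}, makes the solver dependence concrete (both methods use the same number of steps):

```python
import numpy as np

def rhs(z):
    # dz/dt = -z, with exact solution z(t) = e^{-t} from z(0) = 1.
    return -z

h, n = 0.1, 10
exact = np.exp(-1.0)

z_euler = 1.0
for _ in range(n):
    z_euler = z_euler + h * rhs(z_euler)

z_rk4 = 1.0
for _ in range(n):
    k1 = rhs(z_rk4)
    k2 = rhs(z_rk4 + 0.5 * h * k1)
    k3 = rhs(z_rk4 + 0.5 * h * k2)
    k4 = rhs(z_rk4 + h * k3)
    z_rk4 = z_rk4 + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

err_euler = abs(z_euler - exact)
err_rk4 = abs(z_rk4 - exact)
```

With ten steps each, Euler is off by about two percent while RK4 is accurate to several more digits: two "models" built from the same vector field produce measurably different outputs.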

Continuous vs Discrete Gradients

Two gradients are possible.

Discrete gradient

Differentiate the actual numerical solver.

Continuous adjoint gradient

Differentiate the underlying continuous ODE.

The two coincide only in the limit of exact integration.

Discrete differentiation is often more faithful to the implemented computation. Continuous adjoints may be more memory efficient but less numerically accurate.

Modern systems increasingly favor hybrid approaches.

Computational Cost

A Neural ODE replaces fixed network depth with numerical integration cost.

Training cost depends on:

| Factor | Effect |
| --- | --- |
| vector field complexity | harder integration |
| stiffness | smaller step sizes |
| tighter tolerance | more function evaluations |
| backward method | memory/runtime tradeoff |

Function evaluations often dominate runtime.

A difficult trajectory may require hundreds or thousands of evaluations.

Failure Modes

Neural ODEs can fail in several ways.

Solver instability

The vector field may generate exploding trajectories.

Adjoint mismatch

Backward reconstruction may not match the forward trajectory.

Stiffness

Tiny required step sizes can make training impractical.

Excessive solver work

Adaptive solvers may spend large computation on difficult inputs.

Vanishing sensitivity

Long stable trajectories may suppress gradients.

Chaotic dynamics

Tiny perturbations may create unpredictable gradients.

These problems become severe in long-time integration.

Comparison with Residual Networks

| Residual Network | Neural ODE |
| --- | --- |
| finite layers | continuous dynamics |
| explicit composition | ODE integration |
| fixed computation | adaptive computation |
| separate layer weights | shared vector field |
| direct reverse AD | adjoint dynamics possible |
| simple implementation | solver-dependent behavior |

Residual networks are often easier to train and optimize.

Neural ODEs provide stronger geometric structure and continuous-time modeling capabilities.

Geometric Perspective

A Neural ODE defines a vector field on latent space.

Learning becomes the problem of shaping trajectories through that space.

This connects deep learning with:

| Field | Connection |
| --- | --- |
| dynamical systems | stability and flows |
| differential geometry | manifolds and transport |
| control theory | trajectory optimization |
| numerical analysis | integration accuracy |
| physics | continuous evolution |

The network is no longer only a function approximator. It becomes a learned dynamical process.

Summary

Neural ODEs reinterpret deep learning as continuous-time dynamical systems.

A vector field defines state evolution through an ordinary differential equation. The forward pass solves the ODE. The backward pass computes sensitivities through adjoint dynamics or discrete reverse differentiation.

This framework unifies neural networks, dynamical systems, numerical integration, and optimal control. It also introduces new computational and mathematical challenges involving stability, stiffness, solver dependence, and continuous-time sensitivity analysis.