# Neural ODEs

## Overview

Classical neural networks apply a finite sequence of transformations:

$$
h_{k+1} = f_k(h_k,\theta_k).
$$

Depth corresponds to the number of layers.

A neural ordinary differential equation (Neural ODE) replaces this discrete stack with continuous dynamics:

$$
\frac{dz(t)}{dt} = f_\theta(z(t), t).
$$

Instead of evaluating a finite composition, the model integrates a differential equation from an initial state to a final time.

The output is

$$
z(t_1),
$$

obtained by solving the ODE beginning from

$$
z(t_0)=z_0.
$$

This transforms network depth into integration time.

## Residual Networks and Continuous Limits

Neural ODEs emerged from the observation that residual networks resemble Euler discretizations.

A residual block has the form

$$
h_{k+1}=h_k+f(h_k,\theta_k).
$$

Introduce a step size $\Delta t$:

$$
h_{k+1}=h_k+\Delta t\,f(h_k,\theta_k).
$$

This is explicit Euler integration for

$$
\frac{dz(t)}{dt}=f(z(t),\theta(t)).
$$

As the number of layers increases and the step size decreases, the discrete residual network approaches a continuous dynamical system.

Neural ODEs therefore interpret deep networks as continuous-time flows.
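
As a minimal sketch of this correspondence (the names `f` and `residual_net` and the tanh field are illustrative, not from any particular model), stacking residual updates with step size $1/N$ is exactly explicit Euler integration on $[0, 1]$:

```python
import numpy as np

# Illustrative only: "f" stands in for a learned residual block f(h, theta).
def f(h, t):
    return np.tanh(h)

def residual_net(h, n_layers):
    # n_layers residual updates with step 1/n_layers == explicit Euler on [0, 1]
    dt = 1.0 / n_layers
    for k in range(n_layers):
        h = h + dt * f(h, k * dt)   # h_{k+1} = h_k + dt * f(h_k, t_k)
    return h

h0 = np.array([1.0, -0.5])
print(residual_net(h0, 4))      # coarse discretization
print(residual_net(h0, 1000))   # approaches the continuous flow z(1)
```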

## Continuous Dynamics

The core model is

$$
\dot{z}(t)=f_\theta(z(t),t),
\qquad
z(t_0)=x.
$$

The output prediction is

$$
y=g(z(t_1)).
$$

The vector field $f_\theta$ is usually parameterized by a neural network.

A typical implementation is:

```text
dz/dt = neural_net(z, t, theta)
```

The forward pass numerically integrates the ODE:

```text
z1 = ode_solve(f_theta, z0, t0, t1)
```

The integration may use fixed-step or adaptive Runge-Kutta methods (Dormand-Prince is a common adaptive choice), implicit BDF methods, or other ODE solvers.
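
A minimal runnable version of this forward pass, assuming a hand-rolled one-layer tanh vector field and SciPy's `solve_ivp` as the solver (the parameters `W` and `b` stand in for $\theta$):

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
W, b = 0.5 * rng.normal(size=(2, 2)), np.zeros(2)   # stand-ins for theta

def f_theta(t, z):
    # one tanh layer as the learned vector field
    return np.tanh(W @ z + b)

z0 = np.array([1.0, -1.0])
sol = solve_ivp(f_theta, t_span=(0.0, 1.0), y0=z0, method="RK45")  # Dormand-Prince pair
z1 = sol.y[:, -1]   # z(t1): the output state before any readout g(.)
print(z1)
```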

## Flow Interpretation

A Neural ODE defines a continuous transformation of space.

Each input point evolves according to the vector field:

$$
\dot{z}=f_\theta(z,t).
$$

The trajectory defines a flow map:

$$
\phi_{t_0 \to t_1}(x).
$$

Thus the network computes

$$
z(t_1)=\phi_{t_0 \to t_1}(x).
$$

Unlike ordinary feedforward networks, the transformation is constrained to arise from continuous dynamics.

This introduces geometric structure into the model.

## Existence and Uniqueness

If $f_\theta$ is continuous in $t$ and Lipschitz continuous in $z$, then the ODE has a unique local solution (by the Picard-Lindelöf theorem).

The learned model therefore defines a deterministic continuous transformation.

This differs from arbitrary discrete architectures, which may introduce discontinuities or undefined behavior.

The smoothness assumptions also affect differentiability of the loss with respect to parameters.

## Training Neural ODEs

Suppose the loss is

$$
\ell = L(z(t_1)).
$$

Training requires gradients with respect to $\theta$.

One option is discrete reverse-mode AD through the numerical solver.

Another option is the continuous adjoint method.

The continuous adjoint introduces

$$
a(t)=\frac{\partial \ell}{\partial z(t)}.
$$

The adjoint equation (with $f_z = \partial f_\theta / \partial z$) is

$$
\dot{a}=-f_z^{T} a,
$$

with terminal condition

$$
a(t_1)=L_z^T.
$$

The parameter gradient is

$$
\frac{d\ell}{d\theta} =
\int_{t_0}^{t_1}
\left(\frac{\partial f}{\partial \theta}\right)^{\!T} a \, dt,
$$

where $\partial f/\partial\theta$ is the Jacobian of the vector field with respect to the parameters, evaluated along the trajectory.

This avoids storing every solver state explicitly.
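
One common way to organize this computation is to integrate a single augmented system backward from $t_1$ to $t_0$, carrying the state, the adjoint, and the accumulated parameter gradient together:

$$
\frac{d}{dt}
\begin{bmatrix} z \\ a \\ g \end{bmatrix}
=
\begin{bmatrix}
f_\theta(z,t) \\[2pt]
-f_z^{T} a \\[2pt]
-\left(\dfrac{\partial f}{\partial \theta}\right)^{\!T} a
\end{bmatrix},
\qquad
\begin{bmatrix} z(t_1) \\ a(t_1) \\ g(t_1) \end{bmatrix}
=
\begin{bmatrix} z_1 \\ L_z^{T} \\ 0 \end{bmatrix}.
$$

The value $g(t_0)$ obtained at the end of the backward solve is then $d\ell/d\theta$.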

## Continuous Adjoint Implementation

A typical implementation has the following structure.

### Forward pass

```text
z(t1) = ODESolve(f_theta, z0, t0, t1)
```

### Backward pass

Integrate backward:

```text
da/dt = -f_z^T a
```

while accumulating:

```text
dL/dtheta += (df/dtheta)^T a
```

This backward solve may reconstruct or recompute forward states.
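
A minimal sketch of one backward step of this scheme, assuming a toy field $f_\theta(z,t)=\tanh(zW)$ and using `torch.autograd.grad` for the vector-Jacobian products (the names `augmented_rhs` and `W` and the explicit Euler step are illustrative, not a library API):

```python
import torch

torch.manual_seed(0)
d = 2
W = torch.randn(d, d, requires_grad=True)   # stand-in for theta

def f(z, t):
    # toy learned vector field f_theta(z, t) = tanh(z W)
    return torch.tanh(z @ W)

def augmented_rhs(z, a, t):
    """Right-hand side of the backward adjoint system at time t."""
    z = z.detach().requires_grad_(True)
    fz = f(z, t)
    # vector-Jacobian products: a^T df/dz and a^T df/dtheta
    a_dfdz, a_dfdW = torch.autograd.grad(fz, (z, W), grad_outputs=a)
    return fz.detach(), -a_dfdz, -a_dfdW   # dz/dt, da/dt, dg/dt

# one explicit Euler step backward in time (dt < 0)
z = torch.randn(d)           # stands in for z(t1) from the forward solve
a = torch.ones(d)            # stands in for dL/dz(t1)
g = torch.zeros_like(W)      # accumulates dL/dtheta, with g(t1) = 0
dt = -0.01

dz, da, dg = augmented_rhs(z, a, t=1.0)
z, a, g = z + dt * dz, a + dt * da, g + dt * dg
# repeating this step until t reaches t0 yields g ≈ dL/dtheta
```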

## Adaptive Computation

A Neural ODE can adapt its effective depth dynamically.

Simple inputs may require few integration steps.

Complex trajectories may require many steps.

This differs from ordinary fixed-depth networks.

The computational cost becomes data dependent:

| Input difficulty | Solver work |
|---|---|
| smooth dynamics | few steps |
| rapidly changing dynamics | many steps |

Adaptive solvers therefore provide dynamic computation allocation.
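
A small illustration of this data dependence, using SciPy's adaptive RK45 and its reported evaluation count `sol.nfev` (the two toy fields below are assumptions chosen for contrast):

```python
import numpy as np
from scipy.integrate import solve_ivp

def slow(t, z):
    return -0.5 * z                              # smooth dynamics

def fast(t, z):
    return -0.5 * z + 5.0 * np.sin(50.0 * t)     # rapidly changing dynamics

for field in (slow, fast):
    sol = solve_ivp(field, (0.0, 1.0), np.array([1.0]), rtol=1e-6, atol=1e-8)
    print(field.__name__, sol.nfev)   # the harder trajectory needs more evaluations
```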

## Continuous Depth

Traditional networks define representations only at discrete layers:

```text
h0 -> h1 -> h2 -> h3
```

Neural ODEs define representations at every time:

$$
z(t).
$$

Intermediate states can be queried continuously:

$$
z(0.2),\quad z(0.5),\quad z(0.93).
$$

This gives a continuous notion of feature evolution.
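
For example, with SciPy's dense output one can query $z(t)$ at arbitrary times from a single solve (the $\tanh$ field is an illustrative stand-in for $f_\theta$):

```python
import numpy as np
from scipy.integrate import solve_ivp

sol = solve_ivp(lambda t, z: np.tanh(z), (0.0, 1.0), np.array([1.0, -0.5]),
                dense_output=True)

for t in (0.2, 0.5, 0.93):
    print(t, sol.sol(t))   # z(t) queried at arbitrary intermediate times
```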

## Parameter Sharing

A residual network usually has different parameters at each layer.

Neural ODEs often share parameters across time:

$$
\dot{z}=f_\theta(z,t).
$$

The same vector field governs the entire trajectory.

This reduces parameter count and introduces temporal consistency.

However, it may reduce expressiveness compared with unconstrained layerwise architectures.

## Stability

Continuous dynamics introduce new notions of stability.

The Jacobian

$$
f_z =
\frac{\partial f}{\partial z}
$$

controls local expansion and contraction.

If eigenvalues of $f_z$ have large positive real parts, trajectories may diverge rapidly.

If eigenvalues are strongly negative, trajectories may collapse or become stiff.

Thus training Neural ODEs often requires controlling dynamical stability.
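
The scalar linear case makes the role of the spectrum explicit:

$$
\dot{z}=\lambda z
\quad\Longrightarrow\quad
z(t)=e^{\lambda (t-t_0)}\,z(t_0),
$$

so eigenvalues of $f_z$ with positive real part produce exponential expansion along the corresponding directions, while strongly negative real parts produce rapid contraction.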

## Stiffness

Some learned vector fields become stiff.

A stiff system contains multiple time scales:

$$
\dot{z}=f(z,t)
$$

where some modes evolve much faster than others.

Consequences include:

| Problem | Effect |
|---|---|
| tiny stable step size | slow integration |
| unstable backward solve | poor gradients |
| adaptive solver explosion | excessive computation |

Stiffness is a major practical issue in Neural ODE training.

Implicit integrators may help but increase computational cost.
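
A small illustration with an assumed toy linear system containing a fast mode (rate $-1000$) and a slow mode (rate $-1$); the explicit solver is limited by stability, while the implicit BDF solver is not:

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.diag([-1000.0, -1.0])          # fast mode and slow mode
f = lambda t, z: A @ z
z0 = np.array([1.0, 1.0])

for method in ("RK45", "BDF"):
    sol = solve_ivp(f, (0.0, 10.0), z0, method=method)
    # the explicit method is stability-limited and needs far more evaluations
    print(method, sol.nfev)
```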

## Reversibility

An ODE flow is locally invertible under smooth conditions.

Thus Neural ODEs naturally support reversible computation.

This motivated interest in memory-efficient training, because earlier states can theoretically be reconstructed from later states.

In practice, exact reversibility is limited by:

| Source | Problem |
|---|---|
| floating point error | reconstruction drift |
| adaptive solvers | path mismatch |
| chaotic dynamics | exponential sensitivity |

As a result, practical systems often use checkpointing instead of perfect reconstruction.
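
A quick way to see the reconstruction drift is to integrate forward, then integrate the same field backward from the endpoint and compare with the original initial state (the toy field and tolerances below are assumptions):

```python
import numpy as np
from scipy.integrate import solve_ivp

f = lambda t, z: np.array([1.0, -2.0]) * np.tanh(z)
z0 = np.array([1.0, -0.5])

fwd = solve_ivp(f, (0.0, 5.0), z0, rtol=1e-6, atol=1e-9)   # forward to z(t1)
z1 = fwd.y[:, -1]
bwd = solve_ivp(f, (5.0, 0.0), z1, rtol=1e-6, atol=1e-9)   # same field, reversed time
print(np.abs(bwd.y[:, -1] - z0))   # small but nonzero reconstruction error
```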

## Expressiveness

A continuous flow imposes structural constraints.

Not every function can be represented as a smooth ODE flow over finite time.

For example, trajectories cannot cross in state space under deterministic smooth dynamics.

This limits representational power compared with arbitrary discrete mappings.

To increase expressiveness, extensions include:

| Method | Idea |
|---|---|
| augmented Neural ODEs | increase latent dimension |
| stochastic differential equations | add noise |
| controlled differential equations | driven external signals |
| jump dynamics | allow discontinuities |

## Continuous Normalizing Flows

A major application is density modeling.

Ordinary normalizing flows compose invertible discrete transformations:

$$
x \to z_1 \to z_2 \to z_3.
$$

Continuous normalizing flows instead evolve densities continuously.

The dynamics are

$$
\dot{z}=f_\theta(z,t).
$$

The log-density evolves according to

$$
\frac{d}{dt}\log p(z(t)) = -
\operatorname{tr}(f_z).
$$

This avoids explicit Jacobian determinants of discrete transformations.

The likelihood becomes

$$
\log p(z(t_1)) =
\log p(z(t_0)) -
\int_{t_0}^{t_1}
\operatorname{tr}(f_z)\,dt.
$$

This connects Neural ODEs with probabilistic modeling and transport theory.
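
A minimal sketch of the augmented dynamics for $(z, \log p)$, computing the exact Jacobian trace with `torch.autograd.functional.jacobian` (practical only in low dimension; larger models typically use stochastic trace estimators). The field $f(z)=\tanh(Wz)$ and the single Euler step are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
d = 2
W = 0.5 * torch.randn(d, d)   # stand-in for theta

def f(z):
    # time-independent toy field f(z) = tanh(W z)
    return torch.tanh(W @ z)

def cnf_rhs(state):
    """Augmented dynamics: dz/dt = f(z), d(log p)/dt = -tr(df/dz)."""
    z = state[:d]
    J = torch.autograd.functional.jacobian(f, z)        # exact (d, d) Jacobian
    return torch.cat([f(z), -torch.trace(J).reshape(1)])

# one explicit Euler step of (z, log p); log p starts at the base log-density
state = torch.cat([torch.randn(d), torch.zeros(1)])
state = state + 0.01 * cnf_rhs(state)
print(state)
```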

## Latent ODE Models

Neural ODEs are useful for irregularly sampled time series.

Suppose observations occur at uneven times:

$$
(t_1,x_1),\quad (t_2,x_2),\quad (t_3,x_3).
$$

A latent ODE model learns hidden continuous dynamics between observations.

The latent state evolves continuously:

$$
\dot{z}=f_\theta(z,t).
$$

Observations are decoded from latent states at arbitrary times.

This gives a principled continuous-time model for asynchronous data.

Applications include:

| Domain | Example |
|---|---|
| healthcare | irregular clinical measurements |
| physics | sparse sensor data |
| finance | event-driven observations |
| robotics | asynchronous control streams |

## Controlled Differential Equations

Real systems often depend on external signals:

$$
\dot{z}=f(z,u(t),t).
$$

Controlled differential equations generalize recurrent networks to continuous time.

The driving signal $u(t)$ may represent:

| Signal | Meaning |
|---|---|
| sensor stream | physical measurement |
| text embedding | sequential input |
| market signal | external forcing |
| action stream | control input |

Neural controlled differential equations extend Neural ODE ideas to path-dependent systems.

## Solver Dependence

The model output depends on the numerical solver.

Different solvers may produce different trajectories:

| Solver | Behavior |
|---|---|
| Euler | cheap but inaccurate |
| RK4 | accurate fixed-step |
| adaptive Runge-Kutta | dynamic precision |
| implicit BDF | stiff stability |

Thus the numerical method becomes part of the effective model.

This differs from ordinary neural networks, where layer evaluation is usually deterministic and exact up to floating point arithmetic.
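
For instance, integrating the same toy field with fixed-step Euler and with adaptive RK45 produces slightly different outputs, so reported model behavior should record the solver and its settings (the field and step count below are assumptions):

```python
import numpy as np
from scipy.integrate import solve_ivp

f = lambda t, z: np.tanh(z)
z0 = np.array([1.0, -0.5])

def euler(f, z, t0, t1, n=20):
    dt = (t1 - t0) / n
    for k in range(n):
        z = z + dt * f(t0 + k * dt, z)
    return z

z_euler = euler(f, z0, 0.0, 1.0)                               # fixed-step Euler
z_rk45 = solve_ivp(f, (0.0, 1.0), z0, method="RK45").y[:, -1]  # adaptive RK45
print(z_euler - z_rk45)   # small discrepancy: the solver is part of the model
```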

## Continuous vs Discrete Gradients

Two gradients are possible.

### Discrete gradient

Differentiate the actual numerical solver.

### Continuous adjoint gradient

Differentiate the underlying continuous ODE.

These coincide only in ideal limits.

Discrete differentiation is often more faithful to the implemented computation. Continuous adjoints may be more memory efficient but less numerically accurate.

Modern systems increasingly favor hybrid approaches.

## Computational Cost

A Neural ODE replaces fixed network depth with numerical integration cost.

Training cost depends on:

| Factor | Effect |
|---|---|
| vector field complexity | harder integration |
| stiffness | smaller step size |
| tolerance | more evaluations |
| backward method | memory/runtime tradeoff |

Function evaluations often dominate runtime.

A difficult trajectory may require hundreds or thousands of evaluations.

## Failure Modes

Neural ODEs can fail in several ways.

### Solver instability

The vector field may generate exploding trajectories.

### Adjoint mismatch

Backward reconstruction may not match the forward trajectory.

### Stiffness

Tiny required step sizes can make training impractical.

### Excessive solver work

Adaptive solvers may spend large computation on difficult inputs.

### Vanishing sensitivity

Long stable trajectories may suppress gradients.

### Chaotic dynamics

Tiny perturbations may create unpredictable gradients.

These problems become severe in long-time integration.

## Comparison with Residual Networks

| Residual Network | Neural ODE |
|---|---|
| finite layers | continuous dynamics |
| explicit composition | ODE integration |
| fixed computation | adaptive computation |
| separate layer weights | shared vector field |
| direct reverse AD | adjoint dynamics possible |
| simple implementation | solver-dependent behavior |

Residual networks are often easier to train and optimize.

Neural ODEs provide stronger geometric structure and continuous-time modeling capabilities.

## Geometric Perspective

A Neural ODE defines a vector field on latent space.

Learning becomes the problem of shaping trajectories through that space.

This connects deep learning with:

| Field | Connection |
|---|---|
| dynamical systems | stability and flows |
| differential geometry | manifolds and transport |
| control theory | trajectory optimization |
| numerical analysis | integration accuracy |
| physics | continuous evolution |

The network is no longer only a function approximator. It becomes a learned dynamical process.

## Summary

Neural ODEs reinterpret deep learning as continuous-time dynamical systems.

A vector field defines state evolution through an ordinary differential equation. The forward pass solves the ODE. The backward pass computes sensitivities through adjoint dynamics or discrete reverse differentiation.

This framework unifies neural networks, dynamical systems, numerical integration, and optimal control. It also introduces new computational and mathematical challenges involving stability, stiffness, solver dependence, and continuous-time sensitivity analysis.

