# Overflow and Underflow

Floating point systems represent numbers within a finite range. When a computed value exceeds the largest representable magnitude, overflow occurs. When a nonzero value becomes too small in magnitude to represent at full precision, underflow occurs.

Automatic differentiation inherits both phenomena from the primal computation and may amplify them during derivative propagation.

Overflow and underflow are not rare edge cases. They appear routinely in:

- deep neural networks,
- exponential models,
- probabilistic inference,
- scientific simulations,
- recurrent systems,
- optimization algorithms,
- and mixed-precision training.

Understanding these effects is essential for building stable differentiable systems.

### Floating Point Range

A floating point format has:

- finite precision,
- finite exponent range,
- and finite density of representable values.

For IEEE 754 float32:

| Quantity | Approximate value |
|---|---:|
| Largest finite value | $3.4 \times 10^{38}$ |
| Smallest normal positive value | $1.18 \times 10^{-38}$ |
| Smallest subnormal positive value | $1.4 \times 10^{-45}$ |

For float16:

| Quantity | Approximate value |
|---|---:|
| Largest finite value | $6.55 \times 10^4$ |
| Smallest normal positive value | $6.10 \times 10^{-5}$ |
| Smallest subnormal positive value | $5.96 \times 10^{-8}$ |

The much narrower range of float16 explains why mixed-precision systems are particularly vulnerable.
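
These limits can be read directly from NumPy; a minimal sketch (the `smallest_subnormal` attribute assumes NumPy 1.22 or later):

```python
import numpy as np

# Print the range limits of float32 and float16 straight from the library.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(dtype.__name__,
          "max:", info.max,                       # largest finite value
          "tiny:", info.tiny,                     # smallest normal positive value
          "subnormal:", info.smallest_subnormal)  # smallest subnormal positive value
```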

### Overflow

Overflow occurs when:

$$
|x| > x_{\max}.
$$

The result becomes:

- $+\infty$,
- $-\infty$,
- or sometimes NaN after later operations.

Overflow often spreads rapidly through computational graphs.

### Exponential Growth

The exponential function is the canonical example:

$$
f(x)=e^x
$$

In float32:

$$
e^{88.7} \approx 3.4 \times 10^{38},
$$

which is already near the largest representable value.

Thus:

$$
e^{100} = \infty
$$

in float32 arithmetic.

The derivative is identical:

$$
\frac{d}{dx} e^x = e^x.
$$

So the gradient also overflows.

This is common in softmax layers, partition functions, probabilistic normalization, and energy-based models.
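
A minimal NumPy sketch of this failure mode:

```python
import numpy as np

x = np.float32(100.0)
with np.errstate(over="ignore"):   # silence the RuntimeWarning
    y = np.exp(x)                  # e^100 ~ 2.7e43 is not representable in float32
print(y)                           # inf
# The gradient e^x is the same quantity, so it overflows identically.
```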

### Overflow in Reverse Mode

Reverse mode often amplifies overflow.

Suppose:

$$
y = e^x.
$$

The reverse rule is:

$$
\bar{x} \mathrel{+}= \bar{y} e^x.
$$

If $e^x$ overflowed during the forward pass, then the backward pass propagates infinities.

Even worse, later operations may produce:

$$
0 \cdot \infty,
$$

which yields NaN.

Once NaNs appear in a reverse graph, they frequently contaminate large portions of the gradient computation.
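
A hand-written forward and reverse pass makes the contamination concrete; this is an illustrative sketch, not any framework's actual tape:

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    # Forward pass: y = exp(x), z = 0 * y.
    x = np.float32(100.0)
    y = np.exp(x)                    # overflows to inf
    z = np.float32(0.0) * y          # 0 * inf = nan already in the forward pass

    # Reverse pass, seeded with z_bar = 1.
    z_bar = np.float32(1.0)
    y_bar = z_bar * np.float32(0.0)  # dz/dy = 0
    x_bar = y_bar * y                # 0 * inf = nan again

print(z, x_bar)                      # nan nan
```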

### Underflow

Underflow occurs when:

$$
0 < |x| < x_{\min}.
$$

The value becomes either:

- a subnormal number,
- or exactly zero.

Subnormal numbers extend the representable range below the normal threshold, but with reduced precision.

Eventually values collapse to zero entirely.

### Exponential Decay

Consider:

$$
f(x)=e^{-x}
$$

For large $x$:

$$
e^{-100} \approx 3.7 \times 10^{-44}.
$$

In float32, this value is already subnormal, with only a handful of significant bits remaining.

For larger $x$, the value underflows to zero.

The derivative behaves identically:

$$
\frac{d}{dx} e^{-x} = -e^{-x}.
$$

So gradients disappear.

This is one source of vanishing gradients.
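
A small sketch of the collapse:

```python
import numpy as np

with np.errstate(under="ignore"):
    for x in (100.0, 103.0, 110.0):
        print(x, np.exp(np.float32(-x)))
# 100.0 -> ~3.7e-44 (already subnormal)
# 103.0 -> ~1.4e-45 (the smallest subnormal)
# 110.0 -> 0.0      (exact zero; the gradient -e^{-x} is gone too)
```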

### Subnormal Numbers

Subnormal numbers fill the gap between $0$ and the smallest normal floating point value.

They provide gradual underflow.

Without subnormals, tiny values would abruptly jump to zero.

However, subnormals have costs:

| Issue | Description |
|---|---|
| Reduced precision | Fewer significant bits |
| Slow execution | Some hardware handles them slowly |
| Numerical instability | Relative error becomes large |

Many accelerators therefore flush subnormals to zero for performance reasons.

This improves throughput but worsens gradient preservation.
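
A short illustration of where gradual underflow ends:

```python
import numpy as np

d = np.nextafter(np.float32(0), np.float32(1))  # smallest positive subnormal
print(d)                                        # ~1.4e-45
with np.errstate(under="ignore"):
    print(d / np.float32(2))                    # 0.0: below this, values vanish
```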

### Overflow in Products

Repeated multiplication easily overflows.

Suppose:

$$
x_n = \prod_{k=1}^n a_k.
$$

If:

$$
|a_k| > 1,
$$

then magnitudes grow exponentially.

For example:

$$
2^{128} \approx 3.4 \times 10^{38},
$$

already near float32 overflow.

Reverse-mode differentiation through long multiplicative chains is therefore extremely unstable.

### Underflow in Products

Likewise:

$$
|a_k| < 1
$$

causes exponential decay.

Example:

$$
0.5^{150} \approx 7 \times 10^{-46}.
$$

This underflows in float32.

The corresponding gradients also collapse.

This is one reason why sigmoid and tanh recurrent networks historically suffered severe vanishing-gradient problems.
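
Both failure modes are easy to reproduce with a cumulative product; a minimal sketch covering the growing and the decaying case:

```python
import numpy as np

with np.errstate(over="ignore", under="ignore"):
    grow  = np.cumprod(np.full(150, 2.0, dtype=np.float32))
    decay = np.cumprod(np.full(150, 0.5, dtype=np.float32))

print(grow[-1])    # inf: 2^150 overflows float32
print(decay[-1])   # 0.0: 0.5^150 underflows float32
```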

### Overflow from Squaring

Squaring magnifies magnitude rapidly:

$$
x^2, x^4, x^8, \dots
$$

Even moderate values can overflow after repeated squaring.

Example:

$$
(10^{10})^4 = 10^{40},
$$

which exceeds float32 range.

Gradients grow similarly:

$$
\frac{d}{dx} x^n = nx^{n-1}.
$$

Large exponents create extreme sensitivity.

### Logarithmic Stabilization

Many unstable computations can be reformulated using logarithms.

Instead of:

$$
\prod_i p_i,
$$

compute:

$$
\sum_i \log p_i.
$$

Instead of:

$$
e^x,
$$

use log-domain representations whenever possible.

Probabilistic systems rely heavily on this principle.
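
For example, the joint probability of 200 events each with probability 0.01 underflows even in float64, while its logarithm is perfectly tame:

```python
import numpy as np

p = np.full(200, 0.01)            # float64 by default

print(np.prod(p))                 # 0.0: the true value 1e-400 underflows float64
print(np.sum(np.log(p)))          # ~ -921.03: comfortably in range
```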

### Stable Softmax

Naive softmax:

$$
\operatorname{softmax}(x_i) =
\frac{e^{x_i}}
{\sum_j e^{x_j}}
$$

is numerically dangerous.

Large positive logits make the exponentials overflow.

The stable form subtracts the maximum:

$$
m = \max_j x_j,
$$

$$
\operatorname{softmax}(x_i) =
\frac{e^{x_i-m}}
{\sum_j e^{x_j-m}}.
$$

This transformation preserves the mathematical result because:

$$
\frac{e^{x_i-m}}
{\sum_j e^{x_j-m}} =
\frac{e^{x_i}}
{\sum_j e^{x_j}}.
$$

But numerically, all exponentials are now bounded by:

$$
e^0 = 1.
$$

This dramatically improves both forward and backward stability.
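
A minimal NumPy implementation of the shifted form:

```python
import numpy as np

def stable_softmax(x):
    """Softmax with the max-subtraction trick; all exponents are <= 0."""
    x = np.asarray(x, dtype=np.float32)
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)
print(stable_softmax(logits))     # ~[0.090, 0.245, 0.665]
# The naive np.exp(logits) / np.exp(logits).sum() yields nan for these logits.
```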

### Log-Sum-Exp

The same principle applies to:

$$
\log \sum_i e^{x_i}.
$$

Direct evaluation may overflow.

The stable form is:

$$
m + \log \sum_i e^{x_i-m}.
$$
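
A direct implementation of this form (SciPy ships an equivalent as `scipy.special.logsumexp`):

```python
import numpy as np

def logsumexp(x):
    """Stable log-sum-exp: factor the maximum out of the sum."""
    x = np.asarray(x, dtype=np.float32)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 999.0], dtype=np.float32)
print(logsumexp(x))               # ~1000.313
# np.log(np.sum(np.exp(x))) overflows to inf before the log is ever taken.
```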

This operation appears constantly in:

- probabilistic models,
- partition functions,
- attention mechanisms,
- variational inference,
- and energy-based learning.

### Sigmoid Instability

The sigmoid function:

$$
\sigma(x) =
\frac{1}{1+e^{-x}}
$$

contains exponentials.

For large negative $x$:

$$
e^{-x}
$$

overflows.

A stable implementation uses branching:

$$
\sigma(x) =
\begin{cases}
\frac{1}{1+e^{-x}} & x \geq 0 \\
\frac{e^x}{1+e^x} & x < 0
\end{cases}
$$

This avoids overflow in both regions.
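
A branch-free NumPy variant of the two-case formula, which only ever exponentiates non-positive arguments:

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that never passes a positive argument to np.exp."""
    x = np.asarray(x, dtype=np.float32)
    z = np.exp(-np.abs(x))                 # always in (0, 1], never overflows
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

print(stable_sigmoid(np.array([-100.0, 0.0, 100.0])))
# [~3.7e-44  0.5  1.0], with no overflow from the negative branch
```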

### NaN Generation

Overflow and underflow often produce NaNs indirectly.

Common problematic operations include:

| Operation | Result |
|---|---|
| $\infty - \infty$ | NaN |
| $0 \cdot \infty$ | NaN |
| $\frac{0}{0}$ | NaN |
| $\frac{\infty}{\infty}$ | NaN |
| $\log 0$ | $-\infty$ |
| $\sqrt{-1}$ | NaN |

NaNs are particularly dangerous because they propagate silently through many tensor operations.

### Overflow in Hessians

Second-order derivatives amplify instability further.

Consider:

$$
f(x)=e^x.
$$

Then:

$$
f'(x)=e^x,
$$

$$
f''(x)=e^x.
$$

Every derivative level inherits the same exponential growth.

For more complex functions, higher derivatives often grow even faster.

This makes higher-order AD significantly more numerically fragile.

### Mixed Precision Systems

Mixed precision introduces additional overflow and underflow risks.

Float16 has very limited range:

| Type | Max value |
|---|---:|
| float16 | $6.55 \times 10^4$ |
| float32 | $3.4 \times 10^{38}$ |

Values safe in float32 may overflow immediately in float16.

Gradients safe in float32 may underflow to zero in float16.
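
A two-line check makes the asymmetry vivid:

```python
import numpy as np

with np.errstate(over="ignore"):
    print(np.float16(np.float32(70000.0)))  # inf: above float16's max of ~65504
print(np.float16(np.float32(1e-8)))         # 0.0: below float16's subnormal range
```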

This motivates:

- loss scaling,
- higher precision accumulators,
- fused stable kernels,
- and normalization strategies.

### Loss Scaling

Suppose gradients are extremely small:

$$
\nabla L \approx 10^{-12}.
$$

In float16, whose smallest positive subnormal is about $6 \times 10^{-8}$, they underflow to zero.

Loss scaling multiplies the loss by a factor $\alpha > 1$.

Backpropagation computes:

$$
\nabla (\alpha L) =
\alpha \nabla L.
$$

Gradients remain representable during propagation. They are divided by $\alpha$ afterward.

Dynamic loss scaling automatically adjusts $\alpha$ to balance overflow and underflow risk.
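
A minimal sketch of static loss scaling, with illustrative values rather than any framework's API:

```python
import numpy as np

scale = np.float32(2.0 ** 16)        # alpha, chosen large but not overflowing

grad = np.float32(1e-8)              # a gradient value that float16 cannot hold
print(np.float16(grad))              # 0.0: underflows float16 directly

scaled = np.float16(grad * scale)    # backprop carries alpha * grad instead
print(scaled)                        # ~6.55e-4: representable in float16
print(np.float32(scaled) / scale)    # ~1e-8: divided by alpha before the update
```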

### Stable Reduction Algorithms

Large reductions can overflow even when individual terms are safe.

Example:

$$
\sum_{i=1}^{10^9} 10^{30}.
$$

The exact sum is $10^{39}$, beyond the float32 range, so a float32 accumulator cannot represent the result even though every term is representable.

Stable reduction strategies include:

| Method | Purpose |
|---|---|
| Pairwise summation | Reduce accumulation error |
| Kahan summation | Compensate rounding |
| Block scaling | Normalize partial sums |
| Log-domain accumulation | Avoid magnitude explosion |
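
As one concrete example, a minimal Kahan (compensated) summation in float32:

```python
import numpy as np

def kahan_sum(values):
    """Compensated summation: track the rounding error of each addition."""
    total = np.float32(0.0)
    c = np.float32(0.0)                 # running compensation
    for v in values:
        y = np.float32(v) - c
        t = total + y
        c = (t - total) - y             # the low-order part the add discarded
        total = t
    return total

x = [np.float32(0.1)] * 100_000
print(kahan_sum(x))                     # ~10000.0, far closer than a naive loop
```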

### Gradient Clipping

Gradient clipping prevents overflow during optimization.

Given gradient $g$:

$$
g
\leftarrow
g \cdot
\min\left(
1,
\frac{\tau}{\|g\|}
\right).
$$

This limits gradient magnitude.

Clipping does not solve conditioning problems, but it prevents catastrophic updates.
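
A minimal implementation of clipping by global norm:

```python
import numpy as np

def clip_by_norm(g, tau):
    """Rescale g so its Euclidean norm never exceeds tau."""
    norm = np.linalg.norm(g)
    if norm > tau:
        g = g * (tau / norm)
    return g

g = np.array([3e4, 4e4], dtype=np.float32)   # ||g|| = 5e4
print(clip_by_norm(g, 1.0))                  # [0.6 0.8]: rescaled to norm 1
```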

### Overflow in Optimizers

Optimization algorithms also suffer overflow.

Example:

$$
v_t = \beta v_{t-1} + (1-\beta)g_t^2.
$$

Large gradients may overflow the squared term:

$$
g_t^2.
$$

Adaptive optimizers therefore require careful numerical design.

Even simple expressions like:

$$
\sqrt{v_t}
$$

may produce NaNs if $v_t$ becomes invalid.
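
The squared term overflows quickly at low precision; a quick sketch:

```python
import numpy as np

g = np.float16(300.0)
with np.errstate(over="ignore"):
    v = g * g                  # 9e4 exceeds float16's max of ~6.55e4
print(v)                       # inf
print(np.sqrt(v))              # inf: one subtraction away from inf - inf = nan
```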

### Hardware Effects

Different hardware handles overflow differently.

| Hardware | Common behavior |
|---|---|
| CPU | Full IEEE semantics |
| GPU | Sometimes reduced precision |
| TPU | Aggressive fused execution |
| AI accelerators | May flush subnormals |

Fused kernels may change rounding behavior.

Compiler optimizations may reorder arithmetic.

Two mathematically identical programs may therefore exhibit different overflow behavior across devices.

### Detecting Overflow and Underflow

Practical systems monitor:

| Signal | Meaning |
|---|---|
| NaN gradients | Invalid arithmetic |
| Inf activations | Overflow |
| Zero gradients | Possible underflow |
| Sudden loss spikes | Numerical instability |
| Divergent optimization | Exploding gradients |

Instrumentation is essential in large systems.
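
A minimal health check one might run on gradients or activations (names are illustrative):

```python
import numpy as np

def check_tensor(name, t):
    """Flag the three classic symptoms: NaN, Inf, and total collapse to zero."""
    t = np.asarray(t)
    if np.isnan(t).any():
        print(f"{name}: NaN detected (invalid arithmetic)")
    elif np.isinf(t).any():
        print(f"{name}: Inf detected (overflow)")
    elif t.size > 0 and not t.any():
        print(f"{name}: all zeros (possible underflow)")

check_tensor("grad", np.array([0.0, np.inf]))   # grad: Inf detected (overflow)
```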

### Stable Numerical Design

A stable differentiable program typically follows several principles:

| Principle | Purpose |
|---|---|
| Normalize magnitudes | Prevent extreme scales |
| Use log-domain algebra | Avoid exponential blowup |
| Avoid subtracting large close values | Reduce cancellation |
| Use stable fused operators | Improve local conditioning |
| Use bounded activations judiciously | Control gradient magnitude |
| Use residual pathways | Preserve gradient flow |
| Keep accumulators higher precision | Reduce rounding loss |

### Core Idea

Overflow and underflow are fundamental consequences of finite-range arithmetic. Automatic differentiation propagates and often amplifies these numerical effects because gradients inherit the scaling behavior of the primal computation. Stable differentiable systems therefore require explicit numerical engineering, including algebraic reformulation, scaling strategies, stable kernels, and precision management.

