Floating point systems represent numbers within a finite range. When a computed value exceeds the largest representable magnitude, overflow occurs. When a value becomes too small to represent accurately, underflow occurs.
Automatic differentiation inherits both phenomena from the primal computation and may amplify them during derivative propagation.
Overflow and underflow are not rare edge cases. They appear routinely in:
- deep neural networks,
- exponential models,
- probabilistic inference,
- scientific simulations,
- recurrent systems,
- optimization algorithms,
- and mixed-precision training.
Understanding these effects is essential for building stable differentiable systems.
Floating Point Range
A floating point format has:
- finite precision,
- finite exponent range,
- and finite density of representable values.
For IEEE 754 float32:
| Quantity | Approximate value |
|---|---|
| Largest finite value | $3.4 \times 10^{38}$ |
| Smallest normal positive value | $1.2 \times 10^{-38}$ |
| Smallest subnormal positive value | $1.4 \times 10^{-45}$ |
For float16:
| Quantity | Approximate value |
|---|---|
| Largest finite value | $65504$ |
| Smallest normal positive value | $6.1 \times 10^{-5}$ |
| Smallest subnormal positive value | $6.0 \times 10^{-8}$ |
The much narrower range of float16 explains why mixed-precision systems are particularly vulnerable.
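These limits can be queried directly; a minimal NumPy check (the `smallest_subnormal` attribute assumes NumPy 1.22 or later):

```python
# Query the IEEE 754 range limits with NumPy's finfo.
import numpy as np

for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    # max: largest finite value; tiny: smallest normal value;
    # smallest_subnormal: smallest positive subnormal (NumPy >= 1.22)
    print(dtype.__name__, info.max, info.tiny, info.smallest_subnormal)
```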
Overflow
Overflow occurs when a result exceeds the largest finite representable value $x_{\max}$:

$$|x| > x_{\max}.$$

The result becomes:
- $+\infty$,
- $-\infty$,
- or sometimes NaN after later operations.
Overflow often spreads rapidly through computational graphs.
Exponential Growth
The exponential function $e^x$ is the canonical example.
In float32:

$$e^{88} \approx 1.65 \times 10^{38},$$

which is already near the largest representable value.
Thus:

$$e^{89} = \infty$$

in float32 arithmetic.
The derivative is identical:

$$\frac{d}{dx} e^x = e^x.$$

So the gradient also overflows.
This is common in softmax layers, partition functions, probabilistic normalization, and energy-based models.
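The threshold is easy to observe; a quick NumPy sketch (the cutoff between 88 and 89 follows from $\ln(3.4 \times 10^{38}) \approx 88.7$):

```python
import numpy as np

# exp stays finite at x = 88 but overflows to inf at x = 89 in float32;
# since d/dx e^x = e^x, the gradient overflows at exactly the same point.
a = np.exp(np.float32(88.0))         # ~1.65e38, still finite
with np.errstate(over="ignore"):
    b = np.exp(np.float32(89.0))     # inf
print(a, b)
```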
Overflow in Reverse Mode
Reverse mode often amplifies overflow.
Suppose:

$$y = e^x.$$

The reverse rule is:

$$\bar{x} = \bar{y} \cdot e^x.$$

If $e^x$ overflowed during the forward pass, then the backward pass propagates infinities.
Even worse, later operations may produce:

$$\infty - \infty \quad \text{or} \quad 0 \cdot \infty,$$

which yields NaN.
Once NaNs appear in a reverse graph, they frequently contaminate large portions of the gradient computation.
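A minimal sketch of this failure mode, simulating one reverse-mode step by hand:

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    y = np.exp(np.float32(100.0))      # forward value overflows to inf
    grad = np.float32(1.0) * y         # backward multiply propagates inf
    masked = grad * np.float32(0.0)    # 0 * inf -> NaN
print(grad, masked)                    # inf nan
```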
Underflow
Underflow occurs when a nonzero result is smaller in magnitude than the smallest normal value $x_{\min}$:

$$0 < |x| < x_{\min}.$$

The value becomes either:
- a subnormal number,
- or exactly zero.
Subnormal numbers extend the representable range below the normal threshold, but with reduced precision.
Eventually values collapse to zero entirely.
Exponential Decay
Consider:

$$y = e^{-x}.$$

For large $x$:

$$e^{-87} \approx 1.6 \times 10^{-38}.$$

In float32, this is near the subnormal region.
For larger $x$, the value underflows to zero.
The derivative behaves identically:

$$\frac{d}{dx} e^{-x} = -e^{-x}.$$

So gradients disappear.
This is one source of vanishing gradients.
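Observed directly in NumPy (using $x = 110$ as an arbitrary point well past the subnormal range):

```python
import numpy as np

with np.errstate(under="ignore"):
    near = np.exp(np.float32(-87.0))     # ~1.6e-38, just above the subnormal range
    gone = np.exp(np.float32(-110.0))    # 0.0: value and gradient both vanish
print(near, gone)
```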
Subnormal Numbers
Subnormal numbers fill the gap between zero and the smallest normal floating point value.
They implement gradual underflow.
Without subnormals, tiny values would abruptly jump to zero.
However, subnormals have costs:
| Issue | Description |
|---|---|
| Reduced precision | Fewer significant bits |
| Slow execution | Some hardware handles them slowly |
| Numerical instability | Relative error becomes large |
Many accelerators therefore flush subnormals to zero for performance reasons.
This improves throughput but worsens gradient preservation.
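Gradual underflow can be walked through explicitly; a sketch that halves the smallest normal float32 until it collapses to zero:

```python
import numpy as np

x = np.finfo(np.float32).tiny            # smallest normal value, 2**-126
steps = 0
with np.errstate(under="ignore"):
    while x > 0:
        x = np.float32(x / 2)            # descends through the subnormals
        steps += 1
print(steps)  # 23 halvings through the subnormal range, then one final step to zero
```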
Overflow in Products
Repeated multiplication easily overflows.
Suppose:

$$y = \prod_{i=1}^{n} x_i.$$

If:

$$|x_i| > 1,$$

then magnitudes grow exponentially with $n$.
For example:

$$2^{128} \approx 3.4 \times 10^{38},$$

already near float32 overflow.
Reverse-mode differentiation through long multiplicative chains is therefore extremely unstable.
Underflow in Products
Likewise:

$$|x_i| < 1$$

causes exponential decay.
Example:

$$0.5^{150} \approx 7 \times 10^{-46}.$$

This underflows in float32.
The corresponding gradients also collapse.
This is one reason why sigmoid and tanh recurrent networks historically suffered severe vanishing-gradient problems.
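The decay is easy to reproduce: 150 sequential multiplications by 0.5 drop below the smallest float32 subnormal ($\approx 1.4 \times 10^{-45}$) and collapse to zero.

```python
import numpy as np

p = np.float32(1.0)
with np.errstate(under="ignore"):
    for _ in range(150):
        p = np.float32(p * np.float32(0.5))   # exact powers of two until 2**-149
print(p)   # 0.0
```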
Overflow from Squaring
Squaring magnifies magnitude rapidly: $n$ repeated squarings map $x$ to $x^{2^n}$.
Even moderate values can overflow after repeated squaring.
Example:

$$10 \;\rightarrow\; 10^2 \;\rightarrow\; 10^4 \;\rightarrow\; \cdots \;\rightarrow\; 10^{64},$$

which exceeds float32 range.
Gradients grow similarly:

$$\frac{d}{dx}\, x^{2^n} = 2^n \, x^{2^n - 1}.$$
Large exponents create extreme sensitivity.
Logarithmic Stabilization
Many unstable computations can be reformulated using logarithms.
Instead of:

$$p = \prod_i p_i,$$

compute:

$$\log p = \sum_i \log p_i.$$

Instead of manipulating exponentials directly, use log-domain representations whenever possible.
Probabilistic systems rely heavily on this principle.
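A sketch of the same product computed both ways; the log-domain form keeps the exponent exactly while the direct product underflows:

```python
import numpy as np

xs = np.full(150, 0.5, dtype=np.float32)
with np.errstate(under="ignore"):
    direct = np.prod(xs)              # underflows to 0.0 in float32
log_p = np.sum(np.log(xs))            # log of the product: ~ -103.97
print(direct, log_p)
```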
Stable Softmax
Naive softmax:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

is numerically dangerous.
Large positive logits overflow.
The stable form subtracts the maximum $m = \max_j x_j$:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}.$$

This transformation preserves the mathematical result because numerator and denominator are both scaled by the same factor $e^{-m}$:

$$\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{-m}\, e^{x_i}}{e^{-m} \sum_j e^{x_j}} = \frac{e^{x_i}}{\sum_j e^{x_j}}.$$

But numerically, all exponentials are now bounded by:

$$e^{x_i - m} \le e^0 = 1.$$
This dramatically improves both forward and backward stability.
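A minimal implementation of the shifted form:

```python
import numpy as np

def stable_softmax(x):
    z = x - np.max(x)        # largest shifted logit is exactly 0
    e = np.exp(z)            # every exponential lies in (0, 1]
    return e / np.sum(e)

logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)
print(stable_softmax(logits))   # finite; the naive form would return NaN
```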
Log-Sum-Exp
The same principle applies to:

$$\mathrm{LSE}(x) = \log \sum_i e^{x_i}.$$

Direct evaluation may overflow.
The stable form is:

$$\mathrm{LSE}(x) = m + \log \sum_i e^{x_i - m}, \qquad m = \max_i x_i.$$
This operation appears constantly in:
- probabilistic models,
- partition functions,
- attention mechanisms,
- variational inference,
- and energy-based learning.
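The shifted form in a few lines:

```python
import numpy as np

def logsumexp(x):
    m = np.max(x)                            # factor out the maximum
    return m + np.log(np.sum(np.exp(x - m))) # no exponential exceeds 1

x = np.array([1000.0, 1000.0])
print(logsumexp(x))   # 1000 + log(2) ~ 1000.693; direct evaluation overflows
```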
Sigmoid Instability
The sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

contains exponentials.
For large negative $x$, the term $e^{-x}$ overflows.
A stable implementation uses branching:

$$\sigma(x) = \begin{cases} \dfrac{1}{1 + e^{-x}} & x \ge 0, \\[1ex] \dfrac{e^{x}}{1 + e^{x}} & x < 0. \end{cases}$$

This avoids overflow in both regions.
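A sketch of the branching form; in each branch the exponential only ever receives a non-positive argument:

```python
import numpy as np

def stable_sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + np.exp(-x))    # exp of a non-positive number
    e = np.exp(x)                          # x < 0, so exp(x) < 1
    return e / (1.0 + e)

print(stable_sigmoid(-1000.0), stable_sigmoid(1000.0))   # 0.0 1.0
```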
NaN Generation
Overflow and underflow often produce NaNs indirectly.
Common invalid operations include:
| Operation | Result |
|---|---|
| $\infty - \infty$ | NaN |
| $0 \cdot \infty$ | NaN |
| $\infty / \infty$ | NaN |
| $0 / 0$ | NaN |
| $\sqrt{-1}$ | NaN |
NaNs are particularly dangerous because they propagate silently through many tensor operations.
Overflow in Hessians
Second-order derivatives amplify instability further.
Consider:

$$f(x) = e^x.$$

Then:

$$f'(x) = e^x, \qquad f''(x) = e^x.$$
Every derivative level inherits the same exponential growth.
For more complex functions, higher derivatives often grow even faster.
This makes higher-order AD significantly more numerically fragile.
Mixed Precision Systems
Mixed precision introduces additional overflow and underflow risks.
Float16 has very limited range:
| Type | Max value |
|---|---|
| float16 | $65504$ |
| float32 | $3.4 \times 10^{38}$ |
Values safe in float32 may overflow immediately in float16.
Gradients safe in float32 may underflow to zero in float16.
This motivates:
- loss scaling,
- higher precision accumulators,
- fused stable kernels,
- and normalization strategies.
Loss Scaling
Suppose gradients are extremely small, on the order of:

$$\|\nabla_\theta L\| \sim 10^{-8}.$$

In float16, they may underflow.
Loss scaling multiplies the loss by a scale factor:

$$S \gg 1, \quad \text{e.g. } S = 2^{14}.$$

Backpropagation computes:

$$\nabla_\theta (S \cdot L) = S \cdot \nabla_\theta L.$$

Gradients remain representable during propagation. They are divided by $S$ afterward.
Dynamic loss scaling automatically adjusts $S$ to balance overflow and underflow risk.
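A minimal sketch of the mechanism, with a hypothetical scale `S = 2**14` and an illustrative gradient of `1e-8` (below float16's smallest subnormal, roughly 6e-8):

```python
import numpy as np

S = np.float32(2.0 ** 14)              # hypothetical loss scale
true_grad = 1e-8                       # below float16's smallest subnormal
naive = np.float16(true_grad)          # underflows to 0.0
scaled = np.float16(true_grad * S)     # ~1.6e-4: representable in float16
recovered = np.float32(scaled) / S     # unscale in float32 afterward
print(naive, scaled, recovered)
```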
Stable Reduction Algorithms
Large reductions can overflow even when individual terms are safe.
Example:

$$S = \sum_{i=1}^{n} x_i.$$

Intermediate sums may exceed the representable range.
Stable reduction strategies include:
| Method | Purpose |
|---|---|
| Pairwise summation | Reduce accumulation error |
| Kahan summation | Compensate rounding |
| Block scaling | Normalize partial sums |
| Log-domain accumulation | Avoid magnitude explosion |
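Of these, Kahan summation is simple to sketch: a compensation term recovers the low-order bits lost at each addition.

```python
import numpy as np

def kahan_sum(xs):
    s = np.float32(0.0)
    c = np.float32(0.0)                 # running compensation term
    for x in xs:
        y = np.float32(x - c)
        t = np.float32(s + y)
        c = np.float32((t - s) - y)     # low-order bits lost in s + y
        s = t
    return s

xs = np.full(10**5, np.float32(0.1))
ref = float(np.sum(xs, dtype=np.float64))   # float64 reference value
print(kahan_sum(xs), ref)
```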
Gradient Clipping
Gradient clipping prevents overflow during optimization.
Given gradient $g$ and threshold $\tau$:

$$g \leftarrow g \cdot \min\!\left(1, \frac{\tau}{\|g\|}\right).$$

This limits the gradient norm to at most $\tau$.
Clipping does not solve conditioning problems, but it prevents catastrophic updates.
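The rule above in code (with $\tau = 1$ as an arbitrary illustrative threshold):

```python
import numpy as np

def clip_by_norm(g, tau):
    norm = np.linalg.norm(g)
    scale = min(1.0, tau / norm) if norm > 0 else 1.0   # shrink only if too large
    return g * scale

g = np.array([3.0, 4.0])               # norm 5
clipped = clip_by_norm(g, 1.0)         # rescaled to norm 1
print(clipped)                         # [0.6 0.8]
```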
Overflow in Optimizers
Optimization algorithms also suffer overflow.
Example: the second-moment accumulator used by adaptive methods such as Adam,

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2.$$

Large gradients may overflow the squared term:

$$g_t^2 = \infty.$$

Adaptive optimizers therefore require careful numerical design.
Even simple expressions like:

$$\theta_{t+1} = \theta_t - \alpha\, \frac{m_t}{\sqrt{v_t} + \epsilon}$$

may produce NaNs if $v_t$ becomes invalid.
Hardware Effects
Different hardware handles overflow differently.
| Hardware | Common behavior |
|---|---|
| CPU | Full IEEE semantics |
| GPU | Sometimes reduced precision |
| TPU | Aggressive fused execution |
| AI accelerators | May flush subnormals |
Fused kernels may change rounding behavior.
Compiler optimizations may reorder arithmetic.
Two mathematically identical programs may therefore exhibit different overflow behavior across devices.
Detecting Overflow and Underflow
Practical systems monitor:
| Signal | Meaning |
|---|---|
| NaN gradients | Invalid arithmetic |
| Inf activations | Overflow |
| Zero gradients | Possible underflow |
| Sudden loss spikes | Numerical instability |
| Divergent optimization | Exploding gradients |
Instrumentation is essential in large systems.
Stable Numerical Design
A stable differentiable program typically follows several principles:
| Principle | Purpose |
|---|---|
| Normalize magnitudes | Prevent extreme scales |
| Use log-domain algebra | Avoid exponential blowup |
| Avoid subtracting large close values | Reduce cancellation |
| Use stable fused operators | Improve local conditioning |
| Use bounded activations judiciously | Control gradient magnitude |
| Use residual pathways | Preserve gradient flow |
| Keep accumulators higher precision | Reduce rounding loss |
Core Idea
Overflow and underflow are fundamental consequences of finite-range arithmetic. Automatic differentiation propagates and often amplifies these numerical effects because gradients inherit the scaling behavior of the primal computation. Stable differentiable systems therefore require explicit numerical engineering, including algebraic reformulation, scaling strategies, stable kernels, and precision management.