Floating point systems represent numbers within a finite range. When a computed value exceeds the largest representable magnitude, overflow occurs. When a value becomes too small to represent accurately, underflow occurs.
Automatic differentiation inherits both phenomena from the primal computation and may amplify them during derivative propagation.
Overflow and underflow are not rare edge cases. They appear routinely in:
- deep neural networks,
- exponential models,
- probabilistic inference,
- scientific simulations,
- recurrent systems,
- optimization algorithms,
- and mixed-precision training.
Understanding these effects is essential for building stable differentiable systems.
Floating Point Range
A floating point format has:
- finite precision,
- finite exponent range,
- and finite density of representable values.
For IEEE 754 float32:
| Quantity | Approximate value |
|---|---|
| Largest finite value | $3.4 \times 10^{38}$ |
| Smallest normal positive value | $1.2 \times 10^{-38}$ |
| Smallest subnormal positive value | $1.4 \times 10^{-45}$ |
For float16:
| Quantity | Approximate value |
|---|---|
| Largest finite value | $65504$ |
| Smallest normal positive value | $6.1 \times 10^{-5}$ |
| Smallest subnormal positive value | $6.0 \times 10^{-8}$ |
The much narrower range of float16 explains why mixed-precision systems are particularly vulnerable.
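These limits can be queried directly; a minimal NumPy check (the `smallest_subnormal` attribute assumes NumPy 1.22 or later):

```python
# Query the IEEE 754 range limits with NumPy's finfo.
import numpy as np

for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    # max: largest finite value; tiny: smallest normal value;
    # smallest_subnormal: smallest positive subnormal (NumPy >= 1.22)
    print(dtype.__name__, info.max, info.tiny, info.smallest_subnormal)
```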
Overflow
Overflow occurs when a result exceeds the largest finite representable value $x_{\max}$:

$$|x| > x_{\max}.$$

The result becomes:
- $+\infty$,
- $-\infty$,
- or sometimes NaN after later operations.
Overflow often spreads rapidly through computational graphs.
Exponential Growth
The exponential function $e^x$ is the canonical example.
In float32:

$$e^{88} \approx 1.65 \times 10^{38},$$

which is already near the largest representable value.
Thus:

$$e^{89} = \infty$$

in float32 arithmetic.
The derivative is identical:

$$\frac{d}{dx} e^x = e^x.$$

So the gradient also overflows.
This is common in softmax layers, partition functions, probabilistic normalization, and energy-based models.
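The threshold is easy to observe; a quick NumPy sketch (the cutoff between 88 and 89 follows from $\ln(3.4 \times 10^{38}) \approx 88.7$):

```python
import numpy as np

# exp stays finite at x = 88 but overflows to inf at x = 89 in float32;
# since d/dx e^x = e^x, the gradient overflows at exactly the same point.
a = np.exp(np.float32(88.0))         # ~1.65e38, still finite
with np.errstate(over="ignore"):
    b = np.exp(np.float32(89.0))     # inf
print(a, b)
```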
Overflow in Reverse Mode
Reverse mode often amplifies overflow.
Suppose:

$$y = e^x.$$

The reverse rule is:

$$\bar{x} = \bar{y} \cdot e^x.$$

If $e^x$ overflowed during the forward pass, then the backward pass propagates infinities.
Even worse, later operations may produce:

$$\infty - \infty \quad \text{or} \quad 0 \cdot \infty,$$

which yields NaN.
Once NaNs appear in a reverse graph, they frequently contaminate large portions of the gradient computation.
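A minimal sketch of this failure mode, simulating one reverse-mode step by hand:

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    y = np.exp(np.float32(100.0))      # forward value overflows to inf
    grad = np.float32(1.0) * y         # backward multiply propagates inf
    masked = grad * np.float32(0.0)    # 0 * inf -> NaN
print(grad, masked)                    # inf nan
```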
Underflow
Underflow occurs when a nonzero result is smaller in magnitude than the smallest normal value $x_{\min}$:

$$0 < |x| < x_{\min}.$$

The value becomes either:
- a subnormal number,
- or exactly zero.
Subnormal numbers extend the representable range below the normal threshold, but with reduced precision.
Eventually values collapse to zero entirely.
Exponential Decay
Consider:

$$y = e^{-x}.$$

For large $x$:

$$e^{-87} \approx 1.6 \times 10^{-38}.$$

In float32, this is near the subnormal region.
For larger $x$, the value underflows to zero.
The derivative behaves identically:

$$\frac{d}{dx} e^{-x} = -e^{-x}.$$

So gradients disappear.
This is one source of vanishing gradients.
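Observed directly in NumPy (using $x = 110$ as an arbitrary point well past the subnormal range):

```python
import numpy as np

with np.errstate(under="ignore"):
    near = np.exp(np.float32(-87.0))     # ~1.6e-38, just above the subnormal range
    gone = np.exp(np.float32(-110.0))    # 0.0: value and gradient both vanish
print(near, gone)
```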
Subnormal Numbers
Subnormal numbers fill the gap between zero and the smallest normal floating point value.
They implement gradual underflow.
Without subnormals, tiny values would abruptly jump to zero.
However, subnormals have costs:
| Issue | Description |
|---|---|
| Reduced precision | Fewer significant bits |
| Slow execution | Some hardware handles them slowly |
| Numerical instability | Relative error becomes large |
Many accelerators therefore flush subnormals to zero for performance reasons.
This improves throughput but worsens gradient preservation.
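Gradual underflow can be walked through explicitly; a sketch that halves the smallest normal float32 until it collapses to zero:

```python
import numpy as np

x = np.finfo(np.float32).tiny            # smallest normal value, 2**-126
steps = 0
with np.errstate(under="ignore"):
    while x > 0:
        x = np.float32(x / 2)            # descends through the subnormals
        steps += 1
print(steps)  # 23 halvings through the subnormal range, then one final step to zero
```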
Overflow in Products
Repeated multiplication easily overflows.
Suppose:

$$y = \prod_{i=1}^{n} x_i.$$

If:

$$|x_i| > 1,$$

then magnitudes grow exponentially with $n$.
For example:

$$2^{128} \approx 3.4 \times 10^{38},$$

already near float32 overflow.
Reverse-mode differentiation through long multiplicative chains is therefore extremely unstable.
Underflow in Products
Likewise:

$$|x_i| < 1$$

causes exponential decay.
Example:

$$0.5^{150} \approx 7 \times 10^{-46}.$$

This underflows in float32.
The corresponding gradients also collapse.
This is one reason why sigmoid and tanh recurrent networks historically suffered severe vanishing-gradient problems.
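The decay is easy to reproduce: 150 sequential multiplications by 0.5 drop below the smallest float32 subnormal ($\approx 1.4 \times 10^{-45}$) and collapse to zero.

```python
import numpy as np

p = np.float32(1.0)
with np.errstate(under="ignore"):
    for _ in range(150):
        p = np.float32(p * np.float32(0.5))   # exact powers of two until 2**-149
print(p)   # 0.0
```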
Overflow from Squaring
Squaring magnifies magnitude rapidly: $n$ repeated squarings map $x$ to $x^{2^n}$.
Even moderate values can overflow after repeated squaring.
Example:

$$10 \;\rightarrow\; 10^2 \;\rightarrow\; 10^4 \;\rightarrow\; \cdots \;\rightarrow\; 10^{64},$$

which exceeds float32 range.
Gradients grow similarly:

$$\frac{d}{dx}\, x^{2^n} = 2^n \, x^{2^n - 1}.$$
Large exponents create extreme sensitivity.
Logarithmic Stabilization
Many unstable computations can be reformulated using logarithms.
Instead of:

$$p = \prod_i p_i,$$

compute:

$$\log p = \sum_i \log p_i.$$

Instead of manipulating exponentials directly, use log-domain representations whenever possible.
Probabilistic systems rely heavily on this principle.
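A sketch of the same product computed both ways; the log-domain form keeps the exponent exactly while the direct product underflows:

```python
import numpy as np

xs = np.full(150, 0.5, dtype=np.float32)
with np.errstate(under="ignore"):
    direct = np.prod(xs)              # underflows to 0.0 in float32
log_p = np.sum(np.log(xs))            # log of the product: ~ -103.97
print(direct, log_p)
```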
Stable Softmax
Naive softmax:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

is numerically dangerous.
Large positive logits overflow.
The stable form subtracts the maximum $m = \max_j x_j$:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}.$$

This transformation preserves the mathematical result because numerator and denominator are both scaled by the same factor $e^{-m}$:

$$\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{-m}\, e^{x_i}}{e^{-m} \sum_j e^{x_j}} = \frac{e^{x_i}}{\sum_j e^{x_j}}.$$

But numerically, all exponentials are now bounded by:

$$e^{x_i - m} \le e^0 = 1.$$
This dramatically improves both forward and backward stability.
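A minimal implementation of the shifted form:

```python
import numpy as np

def stable_softmax(x):
    z = x - np.max(x)        # largest shifted logit is exactly 0
    e = np.exp(z)            # every exponential lies in (0, 1]
    return e / np.sum(e)

logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)
print(stable_softmax(logits))   # finite; the naive form would return NaN
```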
Log-Sum-Exp
The same principle applies to:

$$\mathrm{LSE}(x) = \log \sum_i e^{x_i}.$$

Direct evaluation may overflow.
The stable form is:

$$\mathrm{LSE}(x) = m + \log \sum_i e^{x_i - m}, \qquad m = \max_i x_i.$$
This operation appears constantly in:
- probabilistic models,
- partition functions,
- attention mechanisms,
- variational inference,
- and energy-based learning.
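The shifted form in a few lines:

```python
import numpy as np

def logsumexp(x):
    m = np.max(x)                            # factor out the maximum
    return m + np.log(np.sum(np.exp(x - m))) # no exponential exceeds 1

x = np.array([1000.0, 1000.0])
print(logsumexp(x))   # 1000 + log(2) ~ 1000.693; direct evaluation overflows
```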
Sigmoid Instability
The sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

contains exponentials.
For large negative $x$, the term $e^{-x}$ overflows.
A stable implementation uses branching:

$$\sigma(x) = \begin{cases} \dfrac{1}{1 + e^{-x}} & x \ge 0, \\[1ex] \dfrac{e^{x}}{1 + e^{x}} & x < 0. \end{cases}$$

This avoids overflow in both regions.
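A sketch of the branching form; in each branch the exponential only ever receives a non-positive argument:

```python
import numpy as np

def stable_sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + np.exp(-x))    # exp of a non-positive number
    e = np.exp(x)                          # x < 0, so exp(x) < 1
    return e / (1.0 + e)

print(stable_sigmoid(-1000.0), stable_sigmoid(1000.0))   # 0.0 1.0
```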
NaN Generation
Overflow and underflow often produce NaNs indirectly.
Common invalid operations include:
| Operation | Result |
|---|---|
| $\infty - \infty$ | NaN |
| $0 \cdot \infty$ | NaN |
| $\infty / \infty$ | NaN |
| $0 / 0$ | NaN |
| $\sqrt{-1}$ | NaN |
NaNs are particularly dangerous because they propagate silently through many tensor operations.
Overflow in Hessians
Second-order derivatives amplify instability further.
Consider:

$$f(x) = e^x.$$

Then:

$$f'(x) = e^x, \qquad f''(x) = e^x.$$
Every derivative level inherits the same exponential growth.
For more complex functions, higher derivatives often grow even faster.
This makes higher-order AD significantly more numerically fragile.
Mixed Precision Systems
Mixed precision introduces additional overflow and underflow risks.
Float16 has very limited range:
| Type | Max value |
|---|---|
| float16 | $65504$ |
| float32 | $3.4 \times 10^{38}$ |
Values safe in float32 may overflow immediately in float16.
Gradients safe in float32 may underflow to zero in float16.
This motivates:
- loss scaling,
- higher precision accumulators,
- fused stable kernels,
- and normalization strategies.
Loss Scaling
Suppose gradients are extremely small, on the order of:

$$\|\nabla_\theta L\| \sim 10^{-8}.$$

In float16, they may underflow.
Loss scaling multiplies the loss by a scale factor:

$$S \gg 1, \quad \text{e.g. } S = 2^{14}.$$

Backpropagation computes:

$$\nabla_\theta (S \cdot L) = S \cdot \nabla_\theta L.$$

Gradients remain representable during propagation. They are divided by $S$ afterward.
Dynamic loss scaling automatically adjusts $S$ to balance overflow and underflow risk.
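A minimal sketch of the mechanism, with a hypothetical scale `S = 2**14` and an illustrative gradient of `1e-8` (below float16's smallest subnormal, roughly 6e-8):

```python
import numpy as np

S = np.float32(2.0 ** 14)              # hypothetical loss scale
true_grad = 1e-8                       # below float16's smallest subnormal
naive = np.float16(true_grad)          # underflows to 0.0
scaled = np.float16(true_grad * S)     # ~1.6e-4: representable in float16
recovered = np.float32(scaled) / S     # unscale in float32 afterward
print(naive, scaled, recovered)
```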
Stable Reduction Algorithms
Large reductions can overflow even when individual terms are safe.
Example:

$$S = \sum_{i=1}^{n} x_i.$$

Intermediate sums may exceed the representable range.
Stable reduction strategies include:
| Method | Purpose |
|---|---|
| Pairwise summation | Reduce accumulation error |
| Kahan summation | Compensate rounding |
| Block scaling | Normalize partial sums |
| Log-domain accumulation | Avoid magnitude explosion |
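Of these, Kahan summation is simple to sketch: a compensation term recovers the low-order bits lost at each addition.

```python
import numpy as np

def kahan_sum(xs):
    s = np.float32(0.0)
    c = np.float32(0.0)                 # running compensation term
    for x in xs:
        y = np.float32(x - c)
        t = np.float32(s + y)
        c = np.float32((t - s) - y)     # low-order bits lost in s + y
        s = t
    return s

xs = np.full(10**5, np.float32(0.1))
ref = float(np.sum(xs, dtype=np.float64))   # float64 reference value
print(kahan_sum(xs), ref)
```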
Gradient Clipping
Gradient clipping prevents overflow during optimization.
Given gradient $g$ and threshold $\tau$:

$$g \leftarrow g \cdot \min\!\left(1, \frac{\tau}{\|g\|}\right).$$

This limits the gradient norm to at most $\tau$.
Clipping does not solve conditioning problems, but it prevents catastrophic updates.
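The rule above in code (with $\tau = 1$ as an arbitrary illustrative threshold):

```python
import numpy as np

def clip_by_norm(g, tau):
    norm = np.linalg.norm(g)
    scale = min(1.0, tau / norm) if norm > 0 else 1.0   # shrink only if too large
    return g * scale

g = np.array([3.0, 4.0])               # norm 5
clipped = clip_by_norm(g, 1.0)         # rescaled to norm 1
print(clipped)                         # [0.6 0.8]
```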
Overflow in Optimizers
Optimization algorithms also suffer overflow.
Example: the second-moment accumulator used by adaptive methods such as Adam,

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2.$$

Large gradients may overflow the squared term:

$$g_t^2 = \infty.$$

Adaptive optimizers therefore require careful numerical design.
Even simple expressions like:

$$\theta_{t+1} = \theta_t - \alpha\, \frac{m_t}{\sqrt{v_t} + \epsilon}$$

may produce NaNs if $v_t$ becomes invalid.
Hardware Effects
Different hardware handles overflow differently.
| Hardware | Common behavior |
|---|---|
| CPU | Full IEEE semantics |
| GPU | Sometimes reduced precision |
| TPU | Aggressive fused execution |
| AI accelerators | May flush subnormals |
Fused kernels may change rounding behavior.
Compiler optimizations may reorder arithmetic.
Two mathematically identical programs may therefore exhibit different overflow behavior across devices.
Detecting Overflow and Underflow
Practical systems monitor:
| Signal | Meaning |
|---|---|
| NaN gradients | Invalid arithmetic |
| Inf activations | Overflow |
| Zero gradients | Possible underflow |
| Sudden loss spikes | Numerical instability |
| Divergent optimization | Exploding gradients |
Instrumentation is essential in large systems.
Stable Numerical Design
A stable differentiable program typically follows several principles:
| Principle | Purpose |
|---|---|
| Normalize magnitudes | Prevent extreme scales |
| Use log-domain algebra | Avoid exponential blowup |
| Avoid subtracting large close values | Reduce cancellation |
| Use stable fused operators | Improve local conditioning |
| Use bounded activations judiciously | Control gradient magnitude |
| Use residual pathways | Preserve gradient flow |
| Keep accumulators higher precision | Reduce rounding loss |
Core Idea
Overflow and underflow are fundamental consequences of finite-range arithmetic. Automatic differentiation propagates and often amplifies these numerical effects because gradients inherit the scaling behavior of the primal computation. Stable differentiable systems therefore require explicit numerical engineering, including algebraic reformulation, scaling strategies, stable kernels, and precision management.