Overflow and Underflow

Floating point systems represent numbers within a finite range. When a computed value exceeds the largest representable magnitude, overflow occurs. When a value becomes too small to represent accurately, underflow occurs.

Automatic differentiation inherits both phenomena from the primal computation and may amplify them during derivative propagation.

Overflow and underflow are not rare edge cases. They appear routinely in:

  • deep neural networks,
  • exponential models,
  • probabilistic inference,
  • scientific simulations,
  • recurrent systems,
  • optimization algorithms,
  • and mixed-precision training.

Understanding these effects is essential for building stable differentiable systems.

Floating Point Range

A floating point format has:

  • finite precision,
  • finite exponent range,
  • and finite density of representable values.

For IEEE 754 float32:

Quantity                            Approximate value
Largest finite value                3.4 \times 10^{38}
Smallest normal positive value      1.18 \times 10^{-38}
Smallest subnormal positive value   1.4 \times 10^{-45}

For float16:

Quantity                            Approximate value
Largest finite value                6.55 \times 10^{4}
Smallest normal positive value      6.10 \times 10^{-5}
Smallest subnormal positive value   5.96 \times 10^{-8}

The much narrower range of float16 explains why mixed-precision systems are particularly vulnerable.

Overflow

Overflow occurs when:

|x| > x_{\max}.

The result becomes:

  • +\infty,
  • -\infty,
  • or sometimes NaN after later operations.

Overflow often spreads rapidly through computational graphs.

Exponential Growth

The exponential function is the canonical example:

f(x) = e^x

In float32:

e^{88.7} \approx 3.4 \times 10^{38},

which is already near the largest representable value.

Thus:

e^{100} = \infty

in float32 arithmetic.

The derivative is identical:

\frac{d}{dx} e^x = e^x.

So the gradient also overflows.

This is common in softmax layers, partition functions, probabilistic normalization, and energy-based models.
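
This is easy to reproduce. A minimal NumPy sketch (the thresholds match the float32 table above):

```python
import numpy as np

with np.errstate(over="ignore"):
    y = np.exp(np.float32(100.0))    # e^100 exceeds the float32 maximum of ~3.4e38
    g = y                            # d/dx e^x = e^x, so the gradient overflows too

print(y)                             # inf
print(np.exp(np.float64(100.0)))     # ~2.7e43: float64's wider range still copes
```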

Overflow in Reverse Mode

Reverse mode often amplifies overflow.

Suppose:

y = e^x.

The reverse rule is:

\bar{x} \mathrel{+}= \bar{y} \, e^x.

If e^x overflowed during the forward pass, then the backward pass propagates infinities.

Even worse, later operations may produce:

0 \cdot \infty,

which yields NaN.

Once NaNs appear in a reverse graph, they frequently contaminate large portions of the gradient computation.
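
A minimal sketch of this failure mode, assuming the adjoint \bar{y} arriving from downstream happens to be zero:

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    y = np.exp(np.float32(100.0))   # forward pass overflows to inf
    y_bar = np.float32(0.0)         # incoming adjoint that happens to be zero
    x_bar = y_bar * y               # reverse rule x_bar += y_bar * e^x gives 0 * inf

print(x_bar)                        # nan
```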

Underflow

Underflow occurs when:

0 < |x| < x_{\min}.

The value becomes either:

  • a subnormal number,
  • or exactly zero.

Subnormal numbers extend the representable range below the normal threshold, but with reduced precision.

Eventually values collapse to zero entirely.

Exponential Decay

Consider:

f(x) = e^{-x}

For large x:

e^{-100} \approx 3.7 \times 10^{-44}.

In float32, this is near the subnormal region.

For larger x, the value underflows to zero.

The derivative behaves identically:

\frac{d}{dx} e^{-x} = -e^{-x}.

So gradients disappear.

This is one source of vanishing gradients.
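
The thresholds above can be checked directly; a minimal NumPy sketch:

```python
import numpy as np

with np.errstate(under="ignore"):
    near = np.exp(np.float32(-100.0))   # ~3.7e-44: already subnormal in float32
    gone = np.exp(np.float32(-110.0))   # below the smallest subnormal (~1.4e-45)

print(near)    # a subnormal value
print(gone)    # 0.0: the value, and hence the gradient, has vanished
```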

Subnormal Numbers

Subnormal numbers fill the gap between 0 and the smallest normal floating point value.

They preserve gradual underflow.

Without subnormals, tiny values would abruptly jump to zero.

However, subnormals have costs:

Issue                   Description
Reduced precision       Fewer significant bits
Slow execution          Some hardware handles them slowly
Numerical instability   Relative error becomes large

Many accelerators therefore flush subnormals to zero for performance reasons.

This improves throughput but worsens gradient preservation.

Overflow in Products

Repeated multiplication easily overflows.

Suppose:

x_n = \prod_{k=1}^{n} a_k.

If:

|a_k| > 1,

then magnitudes grow exponentially.

For example:

2^{128} \approx 3.4 \times 10^{38},

already near float32 overflow.

Reverse-mode differentiation through long multiplicative chains is therefore extremely unstable.

Underflow in Products

Likewise:

|a_k| < 1

causes exponential decay.

Example:

0.5^{150} \approx 7 \times 10^{-46}.

This underflows in float32.

The corresponding gradients also collapse.

This is one reason why sigmoid and tanh recurrent networks historically suffered severe vanishing-gradient problems.
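
The collapse is easy to reproduce in float32 (a minimal sketch):

```python
import numpy as np

p = np.float32(1.0)
with np.errstate(under="ignore"):
    for _ in range(150):
        p = p * np.float32(0.5)   # 0.5^150 ~ 7e-46, below float32's subnormal floor

print(p)           # 0.0
print(0.5 ** 150)  # ~7e-46: float64 still represents it
```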

Overflow from Squaring

Squaring magnifies magnitude rapidly:

x^2, x^4, x^8, \dots

Even moderate values can overflow after repeated squaring.

Example:

(10^{10})^4 = 10^{40},

which exceeds float32 range.

Gradients grow similarly:

\frac{d}{dx} x^n = n x^{n-1}.

Large exponents create extreme sensitivity.

Logarithmic Stabilization

Many unstable computations can be reformulated using logarithms.

Instead of:

\prod_i p_i,

compute:

\sum_i \log p_i.

Instead of:

e^x,

use log-domain representations whenever possible.

Probabilistic systems rely heavily on this principle.
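
A minimal illustration of the product-to-sum rewrite:

```python
import numpy as np

p = np.full(2000, 0.5)        # 2000 probabilities of 0.5 each

with np.errstate(under="ignore"):
    direct = np.prod(p)       # 2^-2000 underflows even in float64
log_domain = np.log(p).sum()  # the log-domain value is unremarkable

print(direct)      # 0.0
print(log_domain)  # ~ -1386.29
```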

Stable Softmax

Naive softmax:

\operatorname{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}

is numerically dangerous.

Large positive logits overflow.

The stable form subtracts the maximum:

m = \max_j x_j, \qquad \operatorname{softmax}(x_i) = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}.

This transformation preserves the mathematical result because:

\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i}}{\sum_j e^{x_j}}.

But numerically, all exponentials are now bounded by:

e^0 = 1.

This dramatically improves both forward and backward stability.
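
The two forms can be compared side by side; a minimal NumPy sketch:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - x.max())   # every exponent is now <= 0
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):
    p_naive = softmax_naive(logits)
p_stable = softmax_stable(logits)

print(p_naive)   # [nan nan nan]: e^1000 overflows, then inf/inf
print(p_stable)  # well-behaved probabilities summing to 1
```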

Log-Sum-Exp

The same principle applies to:

\log \sum_i e^{x_i}.

Direct evaluation may overflow.

The stable form is:

m + \log \sum_i e^{x_i - m}, \qquad m = \max_i x_i.

This operation appears constantly in:

  • probabilistic models,
  • partition functions,
  • attention mechanisms,
  • variational inference,
  • and energy-based learning.
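
A sketch of the stable evaluation next to the direct one:

```python
import numpy as np

def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])          # float64

with np.errstate(over="ignore"):
    direct = np.log(np.sum(np.exp(x)))  # exp(1000) overflows even in float64

print(direct)        # inf
print(logsumexp(x))  # 1000.6931..., i.e. 1000 + log 2
```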

Sigmoid Instability

The sigmoid function:

\sigma(x) = \frac{1}{1+e^{-x}}

contains exponentials.

For large negative x:

e^{-x}

overflows.

A stable implementation uses branching:

\sigma(x) = \begin{cases} \frac{1}{1+e^{-x}} & x \geq 0 \\ \frac{e^x}{1+e^x} & x < 0 \end{cases}

This avoids overflow in both regions.
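
A direct transcription of the branched form for scalar inputs (a sketch, not a library implementation):

```python
import numpy as np

def sigmoid_stable(x):
    # Branch so the exponential argument is never positive.
    if x >= 0:
        return 1.0 / (1.0 + np.exp(-x))
    e = np.exp(x)
    return e / (1.0 + e)

def sigmoid_naive(x):
    return 1.0 / (1.0 + np.exp(-x))

with np.errstate(over="ignore"):
    sigmoid_naive(-800.0)       # exp(800) overflows to inf along the way

print(sigmoid_stable(-800.0))   # 0.0 with every intermediate finite
print(sigmoid_stable(800.0))    # 1.0
print(sigmoid_stable(0.0))      # 0.5
```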

NaN Generation

Overflow and underflow often produce NaNs indirectly.

Common invalid operations include:

Operation          Result
\infty - \infty    NaN
0 \cdot \infty     NaN
0 / 0              NaN
\infty / \infty    NaN
\log 0             -\infty
\sqrt{-1}          NaN

NaNs are particularly dangerous because they propagate silently through many tensor operations.

Overflow in Hessians

Second-order derivatives amplify instability further.

Consider:

f(x) = e^x.

Then:

f'(x) = e^x, \qquad f''(x) = e^x.

Every derivative level inherits the same exponential growth.

For more complex functions, higher derivatives often grow even faster.

This makes higher-order AD significantly more numerically fragile.

Mixed Precision Systems

Mixed precision introduces additional overflow and underflow risks.

Float16 has very limited range:

Type      Max value
float16   6.55 \times 10^{4}
float32   3.4 \times 10^{38}

Values safe in float32 may overflow immediately in float16.

Gradients safe in float32 may underflow to zero in float16.

This motivates:

  • loss scaling,
  • higher precision accumulators,
  • fused stable kernels,
  • and normalization strategies.

Loss Scaling

Suppose gradients are extremely small:

\nabla L \approx 10^{-12}.

In float16, they may underflow.

Loss scaling multiplies the loss by:

\alpha.

Backpropagation computes:

\nabla (\alpha L) = \alpha \nabla L.

Gradients remain representable during propagation. They are divided by \alpha afterward.

Dynamic loss scaling automatically adjusts \alpha to balance overflow and underflow risk.
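
A minimal float16 simulation of the idea (the scale 2^{14} is an illustrative choice, not a prescribed value):

```python
import numpy as np

alpha = 2.0 ** 14                  # illustrative static loss scale

g = 1e-9                           # a tiny true gradient value
print(np.float16(g))               # 0.0: underflows in float16

g_scaled = np.float16(g * alpha)   # scaling first keeps it representable
g_unscaled = float(g_scaled) / alpha

print(g_unscaled)                  # ~1e-9: recovered after unscaling
```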

Stable Reduction Algorithms

Large reductions can overflow even when individual terms are safe.

Example:

\sum_{i=1}^{10^9} 10^{30}.

Intermediate sums may exceed representable range.

Stable reduction strategies include:

Method                    Purpose
Pairwise summation        Reduce accumulation error
Kahan summation           Compensate rounding
Block scaling             Normalize partial sums
Log-domain accumulation   Avoid magnitude explosion
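
Of these, Kahan (compensated) summation is the easiest to sketch; a float32 comparison on a long sum of 0.1s:

```python
import numpy as np

def kahan_sum(values):
    # Compensated summation: carry the rounding error of each add forward.
    total = np.float32(0.0)
    c = np.float32(0.0)        # running compensation for lost low-order bits
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y    # what was rounded away when forming t
        total = t
    return total

data = np.full(100_000, 0.1, dtype=np.float32)

naive = np.float32(0.0)
for v in data:
    naive = naive + v

print(naive)            # drifts away from 10000
print(kahan_sum(data))  # ~10000.0
```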

Gradient Clipping

Gradient clipping prevents overflow during optimization.

Given gradient g:

g \leftarrow g \cdot \min\left( 1, \frac{\tau}{\|g\|} \right).

This limits gradient magnitude.

Clipping does not solve conditioning problems, but it prevents catastrophic updates.
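
A sketch of the rule above in NumPy:

```python
import numpy as np

def clip_by_norm(g, tau):
    # Scale g down only when its norm exceeds the threshold tau.
    norm = np.linalg.norm(g)
    if norm > tau:
        g = g * (tau / norm)
    return g

g = np.array([3e4, 4e4])            # norm 5e4
clipped = clip_by_norm(g, tau=1.0)

print(np.linalg.norm(clipped))      # 1.0: magnitude limited
print(clipped)                      # [0.6 0.8]: direction unchanged
```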

Overflow in Optimizers

Optimization algorithms also suffer overflow.

Example:

v_t = \beta v_{t-1} + (1-\beta) g_t^2.

Large gradients may overflow the squared term:

g_t^2.

Adaptive optimizers therefore require careful numerical design.

Even simple expressions like:

\sqrt{v_t}

may produce NaNs if v_t becomes invalid.

Hardware Effects

Different hardware handles overflow differently.

Hardware          Common behavior
CPU               Full IEEE semantics
GPU               Sometimes reduced precision
TPU               Aggressive fused execution
AI accelerators   May flush subnormals

Fused kernels may change rounding behavior.

Compiler optimizations may reorder arithmetic.

Two mathematically identical programs may therefore exhibit different overflow behavior across devices.

Detecting Overflow and Underflow

Practical systems monitor:

Signal                   Meaning
NaN gradients            Invalid arithmetic
Inf activations          Overflow
Zero gradients           Possible underflow
Sudden loss spikes       Numerical instability
Divergent optimization   Exploding gradients

Instrumentation is essential in large systems.
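
A minimal monitoring helper covering the first three signals (the function name is hypothetical):

```python
import numpy as np

def health_check(t):
    # Return the warning signs from the table that apply to array t.
    signs = []
    if np.isnan(t).any():
        signs.append("NaN values (invalid arithmetic)")
    if np.isinf(t).any():
        signs.append("Inf values (overflow)")
    if t.size > 0 and not t.any():
        signs.append("all zeros (possible underflow)")
    return signs

print(health_check(np.array([1.0, np.inf])))   # ['Inf values (overflow)']
print(health_check(np.zeros(4)))               # ['all zeros (possible underflow)']
print(health_check(np.array([0.5, -0.5])))     # []
```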

Stable Numerical Design

A stable differentiable program typically follows several principles:

Principle                               Purpose
Normalize magnitudes                    Prevent extreme scales
Use log-domain algebra                  Avoid exponential blowup
Avoid subtracting large, close values   Reduce cancellation
Use stable fused operators              Improve local conditioning
Use bounded activations judiciously     Control gradient magnitude
Use residual pathways                   Preserve gradient flow
Keep accumulators in higher precision   Reduce rounding loss

Core Idea

Overflow and underflow are fundamental consequences of finite-range arithmetic. Automatic differentiation propagates and often amplifies these numerical effects because gradients inherit the scaling behavior of the primal computation. Stable differentiable systems therefore require explicit numerical engineering, including algebraic reformulation, scaling strategies, stable kernels, and precision management.