Exception Handling and Undefined Regions

Programs do not only branch between valid computations. They also fail, stop early, raise exceptions, return sentinel values, or enter undefined numerical regions. These behaviors matter for automatic differentiation because a derivative only has meaning where the primal computation itself has a well-defined value.

A mathematical function may be partial:

f : D \to Y

where D is the subset of inputs for which the function is defined.

For example,

f(x) = \log(x)

is defined for x > 0. It has no real-valued output for x ≤ 0. A program that computes log(x) therefore has an undefined region unless it guards the input.

Undefined Operations

Many primitive operations have restricted domains:

| Operation | Undefined or problematic region |
| --- | --- |
| log(x) | x ≤ 0 |
| sqrt(x) | x < 0 over the reals |
| 1 / x | x = 0 |
| acos(x) | x < -1 or x > 1 |
| x / y | y = 0 |
| matrix inverse | singular matrices |
| Cholesky factorization | non-positive-definite matrices |

AD does not make these operations safe. It differentiates them where they are valid.

For example:

\frac{d}{dx}\log(x) = \frac{1}{x}

This derivative is valid only for x > 0. Near x = 0, the derivative becomes unbounded. At x = 0, the primal value is undefined.

Exceptions as Control Flow

A program may signal invalid inputs by raising an exception:

from math import log

def f(x):
    if x <= 0:
        raise ValueError("x must be positive")
    return log(x)

This describes a partial function. Its domain is:

D = \{ x \mid x > 0 \}.

AD can differentiate the successful path:

x > 0:
    y = log(x)
    dy/dx = 1/x

But there is no derivative for the exception path, because the function returns no value.

The derivative of a failed computation is not zero. It is undefined.
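A minimal sketch of this point, with a hand-written derivative standing in for what AD would produce: the derivative can only mirror the successful path, because on invalid inputs the primal raises before any value exists.

```python
from math import log

def f(x):
    if x <= 0:
        raise ValueError("x must be positive")
    return log(x)

def df(x):
    # Evaluate the primal first: invalid inputs raise here,
    # so the exception path never yields a derivative.
    f(x)
    return 1.0 / x
```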

Sentinel Values

Some programs avoid exceptions by returning sentinel values:

from math import log

def f(x):
    if x <= 0:
        return 0
    return log(x)

This changes the mathematical function:

f(x) = \begin{cases} 0, & x \le 0, \\ \log(x), & x > 0. \end{cases}

The function is discontinuous at x = 0. AD will compute:

f'(x) = 0

for x < 0, and

f'(x) = 1/x

for x > 0. At x = 0, the derivative does not exist.

Returning a sentinel value therefore creates a piecewise function, not a safe version of the original smooth function.
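The piecewise function can be written out directly; note the jump at x = 0, where the sentinel branch returns 0 while log(x) diverges to -∞ from the right.

```python
from math import log

def f(x):
    # Sentinel version: 0 on the invalid region, log(x) elsewhere.
    return 0.0 if x <= 0 else log(x)

def df(x):
    # Branch-by-branch derivative that AD would compute; at x = 0 the
    # x <= 0 branch is taken, although no true derivative exists there.
    return 0.0 if x <= 0 else 1.0 / x
```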

NaNs and Infinities

Floating-point programs often represent undefined or unstable regions with NaN, +Inf, or -Inf.

y = log(x)

For x < 0, the result may be NaN. For x = 0, it may be -Inf.

Gradients may then become NaN or Inf as well. Once a NaN enters the computation graph, it can contaminate many downstream values.

A common error is to assume that masking removes invalid computations:

y = where(x > 0, log(x), 0)

In some array systems, both branches may be evaluated before selection. If so, log(x) may still be computed for invalid negative values. The output may look masked, while the gradient or saved intermediates contain invalid values.

A safer form is:

safe_x = where(x > 0, x, 1)
y = where(x > 0, log(safe_x), 0)

Now log receives a valid input on every element.
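A minimal sketch of the double-where pattern using NumPy, whose `where` receives already-evaluated arguments, so the naive form really does compute log on invalid elements:

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])

# Naive mask: np.log runs on every element before np.where selects,
# so the intermediate already holds a NaN for x = -1.
with np.errstate(invalid="ignore"):
    intermediate = np.log(x)
naive = np.where(x > 0, intermediate, 0.0)

# Safer form: substitute a harmless input first, so log never sees x <= 0.
safe_x = np.where(x > 0, x, 1.0)
safe = np.where(x > 0, np.log(safe_x), 0.0)
```

The primal outputs match, but only the safe form avoids invalid intermediates, which is what matters for gradients and saved values in AD systems.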

Domain Guards

A domain guard restricts inputs before applying an operation.

For example:

x_safe = max(x, eps)
y = log(x_safe)

This avoids log(0) and log(negative).

But it changes the derivative. The function is:

y = \log(\max(x, \epsilon)).

Its derivative is:

\frac{dy}{dx} = \begin{cases} 0, & x < \epsilon, \\ 1/x, & x > \epsilon. \end{cases}

At x = ε, the function has a kink.

Domain guards are often necessary, but they should be treated as part of the mathematical model.
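A concrete sketch of the guarded function and its literal derivative (the names and the eps value are illustrative):

```python
import math

EPS = 1e-6

def guarded_log(x):
    # Domain guard: clamp the input before the restricted operation.
    return math.log(max(x, EPS))

def guarded_log_grad(x):
    # Literal derivative of log(max(x, eps)): the guard zeroes the
    # gradient everywhere below eps.
    return 0.0 if x < EPS else 1.0 / x
```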

Clamping and Gradient Bias

Clamping stabilizes computation but can bias gradients.

p = clamp(p, eps, 1 - eps)
loss = -log(p)

This prevents infinite loss when p = 0, but when p < ε, the derivative with respect to the original p is zero if the clamp is differentiated literally.

That may block learning exactly where the model is very wrong.

An alternative is to stabilize the formula rather than clamp the differentiable variable. For example, many systems provide numerically stable primitives such as:

log_softmax
softplus
logsumexp
cross_entropy_with_logits

These preserve useful gradients while avoiding overflow and underflow.
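As an illustration of reformulating instead of clamping, here is a minimal logsumexp using the standard max-shift identity; the naive form exp-then-log overflows for large inputs, while this version stays finite and keeps a well-defined gradient:

```python
import math

def logsumexp(xs):
    # Stable log(sum(exp(x))) via the max-shift identity:
    # log Σ exp(x_i) = m + log Σ exp(x_i - m), with m = max(xs).
    # Naive math.exp(1000.0) would overflow before log is ever reached.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))
```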

Try-Catch Blocks

Exception handling can also be used as ordinary control flow:

try:
    y = solve(A, b)
except SingularMatrix:
    y = zeros_like(b)

This defines a piecewise program. One branch solves a linear system. The other returns a fallback value.

The derivative through the successful branch is the derivative of the solver. The derivative through the fallback branch is usually zero with respect to A and b, unless the fallback itself depends on them.

At the boundary where A becomes singular, the derivative of the solve operation is typically unbounded or undefined. The fallback does not make the transition smooth.
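A runnable version of this piecewise program, sketched with NumPy (the function name is illustrative):

```python
import numpy as np

def solve_with_fallback(A, b):
    # Piecewise program: the success branch differentiates like the
    # solver; the fallback has zero derivative with respect to A and b.
    try:
        return np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        return np.zeros_like(b)
```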

Assertions

Assertions document assumptions:

assert x > 0
y = log(x)

An assertion does not add a differentiable operation. It restricts the allowed domain.

If the assertion holds, AD differentiates the following computation. If the assertion fails, the function has no value for that input.

Assertions are useful because they make domain assumptions explicit. They are preferable to silent invalid values.

Undefined Gradients

Some primitives have defined primal values but undefined or intentionally unsupported gradients.

Examples include:

argmax(x)
round(x)
int(x)
hash(x)
lookup(table, key)

For some of these, the classical derivative is zero almost everywhere. For others, the operation maps into a discrete space, so the usual real derivative does not apply.

AD systems may:

raise an error
return zero
stop the gradient
use a custom surrogate

Each choice has different semantics.

An error is often the safest default because it prevents the user from mistaking a convention for a derivative.

Partial Functions and Program Boundaries

A differentiable program should specify its valid input domain.

For example:

def loss(logits, target):
    return cross_entropy(logits, target)

Here target is usually an integer class label. The loss is differentiable with respect to logits, but not with respect to target.

The domain includes:

logits: floating-point tensor
target: integer labels in valid class range

A derivative request outside this contract should fail or return no gradient for the non-differentiable input.

This is a type-level issue as much as a numerical issue.

Custom Error Semantics

A robust AD system should make failure modes explicit.

Common policies include:

| Policy | Meaning |
| --- | --- |
| Raise on invalid primal | Stop when the forward computation is undefined |
| Raise on invalid gradient | Stop when backward produces invalid values |
| Propagate NaNs | Preserve floating-point behavior |
| Mask invalid regions | Treat selected regions as inactive |
| Return zero gradient | Use a convention |
| Custom adjoint | Define explicit backward behavior |

No single policy is correct for all programs. Scientific computing often prefers explicit failure. Deep learning frameworks often prefer throughput and propagate floating-point invalid values unless anomaly detection is enabled.
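Two of these policies can be sketched with a minimal forward-mode dual number (the `Dual` class, `dlog`, and the `policy` flag are all hypothetical names for illustration, not a real library API):

```python
import math

class Dual:
    # Minimal forward-mode AD value: primal `val` plus derivative `dot`.
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

def dlog(x, policy="raise"):
    # Two policies from the table, sketched for log:
    # "raise": stop when the forward computation is undefined;
    # "propagate": preserve floating-point NaN behavior.
    if x.val <= 0.0:
        if policy == "raise":
            raise ValueError("log undefined for x <= 0")
        nan = float("nan")
        return Dual(nan, nan)
    return Dual(math.log(x.val), x.dot / x.val)
```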

Correctness Rule

Exception handling and undefined regions turn total smooth functions into partial or piecewise programs.

Successful valid path:
    AD differentiates the executed computation.

Exception path:
    no derivative exists unless a fallback value is returned.

Fallback value:
    defines a new piecewise function.

Masked invalid operation:
    unsafe if invalid branch is still evaluated.

Domain guard:
    stabilizes the primal computation but changes the derivative.

The practical rule is to make domains explicit, avoid hidden invalid operations, and treat every guard, fallback, clamp, and exception handler as part of the function being differentiated.