
Non-Smooth Programs

A non-smooth program contains operations where the derivative is undefined, discontinuous, set-valued, or unstable under small perturbations. These programs arise naturally in optimization, machine learning, geometry, combinatorics, simulation, and systems code.

Examples include:

  • absolute value
  • maximum and minimum
  • clipping
  • sorting
  • thresholding
  • indexing
  • quantization
  • branching
  • discrete sampling
  • collision detection
  • contact mechanics

Automatic differentiation can still execute on such programs, but the returned gradients require careful interpretation.

Smoothness and Classical Derivatives

A classical derivative assumes local linear approximation.

A function f is differentiable at x if there exists a linear map L such that:

f(x+h) = f(x) + L(h) + o(\|h\|)

as h → 0.

Non-smooth programs violate this assumption at some points.

For example:

f(x) = |x|

has no derivative at x = 0 because:

\lim_{h\to 0^-} \frac{|h| - 0}{h} = -1

while:

\lim_{h\to 0^+} \frac{|h| - 0}{h} = 1.

There is no single linear approximation valid from all directions.
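
A quick numerical check makes the two one-sided limits concrete. The sketch below is plain Python with no AD framework; it simply evaluates the difference quotient of |x| at 0 from each side.

# One-sided difference quotients of f(x) = |x| at x = 0.
def f(x):
    return abs(x)

for h in [1e-1, 1e-4, 1e-8]:
    left = (f(0.0 - h) - f(0.0)) / (-h)    # tends to -1
    right = (f(0.0 + h) - f(0.0)) / (+h)   # tends to +1
    print(f"h={h:.0e}  left={left:+.2f}  right={right:+.2f}")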

AD systems do not generally test differentiability. They apply local propagation rules to the executed operations.

Sources of Non-Smoothness

Non-smoothness enters programs through several mechanisms.

Piecewise Definitions

if x > 0:
    y = x
else:
    y = -x

This produces kinks or discontinuities at branch boundaries.

Discrete Decisions

i = argmax(x)
y = table[i]

The selected index changes discontinuously.

Integer and Boolean Operations

y = floor(x)
z = x > 0

These operations are locally constant almost everywhere.

Geometric Events

if collision(a, b):
    resolve_contact()

Contact onset changes the active constraint structure.

Numerical Safeguards

x = max(x, eps)

Used to avoid division by zero or logarithm singularities.

Quantization

y = round(x)

Produces staircase functions.

What AD Actually Computes

AD computes derivatives of the executed computational trace.

For example:

y = max(x, 0)

If x > 0, the trace is:

identity(x)

and the derivative is 1.

If x < 0, the trace is:

constant(0)

and the derivative is 0.

At x = 0, the trace depends on implementation details. The system may return:

0, \quad 1, \quad \frac{1}{2},

or another convention.

The important distinction is:

AD computes the derivative of the executed trace,
not a proof of global smoothness.
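
To make this concrete, here is a minimal sketch using JAX as one example AD system (assuming jax is installed); the value printed at the kink is whatever convention the installed library applies, not a mathematically forced answer.

import jax
import jax.numpy as jnp

def relu(x):
    return jnp.maximum(x, 0.0)

grad_relu = jax.grad(relu)

print(grad_relu(2.0))    # 1.0: the executed trace behaves like identity(x)
print(grad_relu(-2.0))   # 0.0: the executed trace behaves like constant(0)
print(grad_relu(0.0))    # implementation-defined convention at the kink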

Clarke Generalized Gradients

For locally Lipschitz functions, one can generalize derivatives using Clarke subdifferentials.

For a function f : \mathbb{R}^n \to \mathbb{R}, the Clarke generalized gradient is:

\partial_C f(x) = \operatorname{conv}\left\{ \lim \nabla f(x_k) \right\}

over nearby differentiable points x_k.

For ReLU:

f(x) = \max(0, x)

the Clarke subdifferential at 0 is:

\partial_C f(0) = [0, 1].

Many AD systems implicitly choose one element from such a set.

This is mathematically weaker than differentiability, but often sufficient for optimization.
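
Because ReLU is convex, the Clarke set at 0 coincides with the convex subdifferential, so membership can be checked directly against the subgradient inequality f(y) ≥ f(0) + g·y. A small numerical sketch (plain NumPy, illustrative only):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

ys = np.linspace(-2.0, 2.0, 401)

def is_subgradient_at_zero(g):
    # Subgradient inequality for convex f at x = 0: f(y) >= f(0) + g * y for all y.
    return bool(np.all(relu(ys) >= g * ys - 1e-12))

for g in [-0.2, 0.0, 0.3, 1.0, 1.2]:
    print(g, is_subgradient_at_zero(g))   # True exactly for g in [0, 1]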

Directional Derivatives

Some non-smooth functions possess directional derivatives even when full derivatives do not exist.

The directional derivative of f at x in direction v is:

D_v f(x) = \lim_{t\to 0^+} \frac{f(x+tv) - f(x)}{t}.

For:

f(x) = |x|,

at x = 0:

D_v f(0) = |v|.

This exists for every direction, but it is not linear in v, so it is not a Fréchet derivative.

Forward-mode AD approximates directional behavior along executed traces. This partly explains why AD can remain useful even for non-smooth programs.
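
A direct numerical version of the one-sided limit (plain Python, no AD library) shows this behavior: at x = 0 the directional derivative of |x| is |v| in every direction, which no single linear map can reproduce.

def f(x):
    return abs(x)

def directional_derivative(f, x, v, t=1e-8):
    # One-sided difference quotient approximating D_v f(x).
    return (f(x + t * v) - f(x)) / t

for v in [2.0, -2.0, 0.5, -0.5]:
    print(v, directional_derivative(f, 0.0, v))   # approximately |v| in every case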

Almost-Everywhere Differentiability

Many non-smooth functions are differentiable almost everywhere.

ReLU is differentiable everywhere except at 0.

Max pooling is differentiable except at ties.

Sorting is differentiable except where equal elements swap order.

This matters in optimization because random floating-point inputs almost never land exactly on a measure-zero boundary.

Hence gradient descent often works well despite theoretical non-smoothness.

However, this heuristic fails when:

  • boundaries are intentionally targeted
  • quantization dominates behavior
  • constraints are active
  • combinatorial structure matters
  • gradients repeatedly vanish
  • discrete routing controls execution

Vanishing Gradients from Non-Smoothness

Some non-smooth operations destroy gradient flow.

Example:

y = round(x)

Away from the rounding boundaries (the half-integer points):

\frac{dy}{dx} = 0.

Gradient-based optimization cannot meaningfully adjust x because the local derivative is zero almost everywhere.

The same issue occurs for:

  • argmax
  • top-k selection
  • thresholding
  • binary masks
  • integer indexing

The optimization signal disappears because small perturbations do not change the discrete outcome.
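
The effect is easy to see numerically: away from the rounding boundaries, perturbing x does not change round(x) at all, so every difference quotient, and hence the derivative of the executed trace, is exactly zero. A minimal plain-Python check:

def quantize(x):
    return round(x)

x = 0.3
for h in [1e-2, 1e-4, 1e-6]:
    print(h, (quantize(x + h) - quantize(x)) / h)   # 0.0 for every step size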

Straight-Through Estimators

A common engineering workaround is the straight-through estimator (STE).

Forward pass:

y = round(x)

Backward pass:

dy/dx = 1

This is intentionally inconsistent with the true derivative.

The primal computation is discrete, but the backward pass treats the operation as if it were the identity.

This technique is widely used in:

  • quantized neural networks
  • binary networks
  • discrete latent variables
  • routing systems

Mathematically, STE defines a surrogate gradient rather than a true derivative.
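
A common way to implement STE is the "identity plus stopped residual" trick: compute the discrete value in the forward pass, but let gradients flow through x as if the operation were the identity. Below is a hedged JAX sketch (the same pattern works with detach/stop-gradient in other frameworks):

import jax
import jax.numpy as jnp

def round_ste(x):
    # Forward: returns round(x). Backward: the rounded residual is wrapped in
    # stop_gradient, so the gradient passes through x as if round were the identity.
    return x + jax.lax.stop_gradient(jnp.round(x) - x)

def loss(x):
    return (round_ste(x) - 2.0) ** 2

print(loss(1.3))             # forward pass uses the discrete value round(1.3) = 1.0
print(jax.grad(loss)(1.3))   # surrogate gradient: 2 * (1.0 - 2.0) = -2.0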

Soft Relaxations

Another strategy replaces hard non-smooth operations with smooth approximations.

Instead of:

\max(x_1, x_2),

use:

\operatorname{softmax}_\tau(x_1, x_2) = \tau \log\left(e^{x_1/\tau} + e^{x_2/\tau}\right).

As τ → 0, this approaches the hard maximum.
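
This relaxation is just a temperature-scaled log-sum-exp, so it is easy to sketch; the snippet below (JAX, illustrative values) also shows that its gradient is a softmax weighting over the inputs rather than a hard selection.

import jax
import jax.numpy as jnp

def smooth_max(x, tau):
    # tau * log(sum_i exp(x_i / tau)): a smooth, everywhere-differentiable surrogate for max(x).
    return tau * jax.scipy.special.logsumexp(x / tau)

x = jnp.array([1.0, 1.2])
for tau in [1.0, 0.1, 0.01]:
    print(tau, smooth_max(x, tau))      # tends to max(x) = 1.2 as tau -> 0
print(jax.grad(smooth_max)(x, 0.1))     # soft "argmax" weights over the two inputs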

Similarly:

Hard operation    Smooth relaxation
ReLU              Softplus
Argmax            Softmax
Step function     Sigmoid
Top-k             Differentiable sorting
Binary masks      Gumbel-softmax

Relaxations improve gradient flow but change the model.

Non-Smoothness in Optimization

Gradient methods on non-smooth functions behave differently from smooth optimization.

For smooth optimization:

x_{k+1} = x_k - \eta \nabla f(x_k)

uses the local gradient direction.

For non-smooth optimization, one often uses:

x_{k+1} = x_k - \eta g_k

where:

g_k \in \partial f(x_k).

This is subgradient descent.

Subgradient methods converge more slowly and lack many of the guarantees available in smooth optimization.

Nevertheless, they remain practical for convex non-smooth objectives such as:

  • L1 regularization
  • hinge loss
  • total variation
  • sparse optimization
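
As a minimal illustration of the update rule above, the sketch below runs subgradient descent on the convex non-smooth objective 0.5(x − 2)² + λ|x| (an L1-regularized problem whose minimizer is x = 1 for λ = 1), picking one element of the subdifferential at each step via the convention sign(0) = 0. Plain NumPy, with an illustrative step-size schedule.

import numpy as np

lam = 1.0

def subgradient(x):
    # One valid element of the subdifferential of 0.5*(x-2)**2 + lam*|x|;
    # np.sign(0.0) = 0 picks the element 0 from the interval [-lam, lam] at x = 0.
    return (x - 2.0) + lam * np.sign(x)

x = -3.0
for k in range(2000):
    step = 0.5 / (k + 1)          # diminishing step size, standard for subgradient methods
    x = x - step * subgradient(x)

print(x)   # slowly approaches the minimizer x = 1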

Non-Smoothness in Physics and Geometry

Physical systems often contain contact events:

if penetration:
    apply_impulse()

The dynamics change discontinuously when contact begins or ends.

Rigid-body simulators, collision systems, and constraint solvers are therefore naturally non-smooth.

Differentiating such systems is difficult because:

  • contact topology changes
  • active constraints change
  • impulses create discontinuities
  • solver iterations may bifurcate

Modern differentiable simulators therefore often use the following techniques (a penalty-force sketch appears after the list):

  • softened contacts
  • differentiable penalty forces
  • smooth collision approximations
  • implicit differentiation
  • custom adjoints
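
One concrete version of the first two items is to replace the hard impulse branch with a penalty force that ramps up smoothly as penetration begins. The sketch below uses a softplus-shaped contact force; the stiffness and smoothing width are illustrative choices, not values from any particular simulator.

import numpy as np

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def contact_force(gap, stiffness=1e3, width=1e-3):
    # Hard version:  stiffness * max(0, -gap)   (kink at gap = 0, impulse-like onset)
    # Soft version:  the kink is smoothed over roughly `width`, so the force is differentiable.
    return stiffness * width * softplus(-gap / width)

for gap in [0.01, 0.001, 0.0, -0.001, -0.01]:
    print(gap, contact_force(gap))   # ~0 when separated, grows smoothly with penetration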

Non-Smoothness in Sparse and Discrete Systems

Sparse systems often use discrete structure:

indices = topk(scores)
y = gather(values, indices)

The selected sparsity pattern changes discontinuously.

This creates unstable gradients near selection boundaries.

Graph algorithms, retrieval systems, search systems, routing networks, and sparse mixture-of-experts models all encounter this issue.

Many such systems use hybrid approaches (a short sketch follows this list):

  • differentiate through scores
  • stop gradients through indices
  • use soft routing during training
  • use hard routing during inference
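
Here is a hedged sketch of that hybrid pattern in JAX (shapes and values are made up): the top-k indices are treated as non-differentiable constants, while the gathered rows are re-weighted by a softmax over the selected scores so gradients still reach the scoring function.

import jax
import jax.numpy as jnp

def sparse_read(scores, values, k=2):
    # Hard, discrete selection of which rows are read.
    top_scores, idx = jax.lax.top_k(scores, k)
    idx = jax.lax.stop_gradient(idx)        # indices are discrete; no gradient flows through them
    selected = values[idx]                  # gather the chosen rows
    weights = jax.nn.softmax(top_scores)    # soft weights keep gradients flowing into the scores
    return weights @ selected

scores = jnp.array([0.1, 2.0, -1.0, 1.5])
values = jnp.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]])

out, grad_scores = jax.value_and_grad(lambda s: sparse_read(s, values).sum())(scores)
print(out)
print(grad_scores)   # nonzero only for the selected scores; zero elsewhere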

Detecting Non-Smooth Regions

AD systems generally do not detect non-smoothness automatically.

Possible indicators include:

  • NaNs
  • gradient spikes
  • zero gradients
  • unstable optimization
  • branch oscillation
  • sensitivity to tiny perturbations

Numerical finite differences may also become unstable near boundaries:

\frac{f(x+h) - f(x)}{h}

can vary dramatically with step size and direction.

This is not an AD failure. The underlying mathematical derivative is unstable or undefined.
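
A small plain-Python example of this instability: the forward-difference estimate of the derivative of |x| at a point sitting just beside the kink swings between roughly -1 and +1 depending on the step size and its sign.

def f(x):
    return abs(x)

x0 = 1e-7   # a point extremely close to the kink at 0

for h in [1e-6, -1e-6, -2e-7, -1e-8]:
    print(h, (f(x0 + h) - f(x0)) / h)   # roughly 1.0, -0.8, 0.0, 1.0: no stable estimate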

Generalized Differentiation Frameworks

Several mathematical frameworks extend classical differentiation:

Framework                     Purpose
Subgradients                  Convex non-smooth functions
Clarke gradients              Locally Lipschitz functions
Viscosity solutions           PDEs with shocks
Distributional derivatives    Weak derivatives
Bouligand derivatives         Directional generalized derivatives

Most practical AD systems do not explicitly implement these theories. They use local propagation rules combined with conventions.

Nevertheless, these frameworks explain why AD often remains useful on non-smooth programs.

Correctness Rule

For non-smooth programs:

Away from non-smooth boundaries:
    AD behaves like ordinary differentiation.

At boundaries:
    the classical derivative may not exist.

AD may return:
    a branch derivative,
    a subgradient convention,
    or a surrogate gradient.

Discrete operations:
    usually block gradient flow unless relaxed or overridden.

Non-smoothness therefore does not prevent automatic differentiation from running. It changes the meaning of the gradients being computed.