
Non-Smooth Programs

A non-smooth program contains operations where the derivative is undefined, discontinuous, set-valued, or unstable under small perturbations. These programs arise naturally in optimization, machine learning, geometry, combinatorics, simulation, and systems code.

Examples include:

  • absolute value
  • maximum and minimum
  • clipping
  • sorting
  • thresholding
  • indexing
  • quantization
  • branching
  • discrete sampling
  • collision detection
  • contact mechanics

Automatic differentiation can still execute on such programs, but the returned gradients require careful interpretation.

Smoothness and Classical Derivatives

A classical derivative assumes local linear approximation.

A function f is differentiable at x if there exists a linear map L such that:

f(x+h) = f(x) + L(h) + o(\|h\|)

as h → 0.

Non-smooth programs violate this assumption at some points.

For example:

f(x) = |x|

has no derivative at x = 0 because:

\lim_{h\to 0^-} \frac{|h| - 0}{h} = -1

while:

\lim_{h\to 0^+} \frac{|h| - 0}{h} = 1.

There is no single linear approximation valid from all directions.
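
A quick numerical check makes the two one-sided limits concrete. The sketch below is plain Python with no AD framework; it simply evaluates the difference quotient of |x| at 0 from each side.

# One-sided difference quotients of f(x) = |x| at x = 0.
def f(x):
    return abs(x)

for h in [1e-1, 1e-4, 1e-8]:
    left = (f(0.0 - h) - f(0.0)) / (-h)    # tends to -1
    right = (f(0.0 + h) - f(0.0)) / (+h)   # tends to +1
    print(f"h={h:.0e}  left={left:+.2f}  right={right:+.2f}")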

AD systems do not generally test differentiability. They apply local propagation rules to the executed operations.

Sources of Non-Smoothness

Non-smoothness enters programs through several mechanisms.

Piecewise Definitions

if x > 0:
    y = x
else:
    y = -x

This produces kinks or discontinuities at branch boundaries.

Discrete Decisions

i = argmax(x)
y = table[i]

The selected index changes discontinuously.

Integer and Boolean Operations

y = floor(x)
z = x > 0

These operations are locally constant almost everywhere.

Geometric Events

if collision(a, b):
    resolve_contact()

Contact onset changes the active constraint structure.

Numerical Safeguards

x = max(x, eps)

Used to avoid division by zero or logarithm singularities.

Quantization

y = round(x)

Produces staircase functions.

What AD Actually Computes

AD computes derivatives of the executed computational trace.

For example:

y = max(x, 0)

If x > 0, the trace is:

identity(x)

and the derivative is 1.

If x < 0, the trace is:

constant(0)

and the derivative is 0.

At x = 0, the trace depends on implementation details. The system may return:

0, \quad 1, \quad \frac{1}{2},

or another convention.

The important distinction is:

AD computes the derivative of the executed trace,
not a proof of global smoothness.
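
To make this concrete, here is a minimal sketch using JAX as one example AD system (assuming jax is installed); the value printed at the kink is whatever convention the installed library applies, not a mathematically forced answer.

import jax
import jax.numpy as jnp

def relu(x):
    return jnp.maximum(x, 0.0)

grad_relu = jax.grad(relu)

print(grad_relu(2.0))    # 1.0: the executed trace behaves like identity(x)
print(grad_relu(-2.0))   # 0.0: the executed trace behaves like constant(0)
print(grad_relu(0.0))    # implementation-defined convention at the kink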

Clarke Generalized Gradients

For locally Lipschitz functions, one can generalize derivatives using Clarke subdifferentials.

For a function f : \mathbb{R}^n \to \mathbb{R}, the Clarke generalized gradient is:

\partial_C f(x) = \operatorname{conv}\left\{ \lim \nabla f(x_k) \right\}

over nearby differentiable points x_k.

For ReLU:

f(x) = \max(0, x)

the Clarke subdifferential at 0 is:

\partial_C f(0) = [0, 1].

Many AD systems implicitly choose one element from such a set.

This is mathematically weaker than differentiability, but often sufficient for optimization.
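
Because ReLU is convex, the Clarke set at 0 coincides with the convex subdifferential, so membership can be checked directly against the subgradient inequality f(y) ≥ f(0) + g·y. A small numerical sketch (plain NumPy, illustrative only):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

ys = np.linspace(-2.0, 2.0, 401)

def is_subgradient_at_zero(g):
    # Subgradient inequality for convex f at x = 0: f(y) >= f(0) + g * y for all y.
    return bool(np.all(relu(ys) >= g * ys - 1e-12))

for g in [-0.2, 0.0, 0.3, 1.0, 1.2]:
    print(g, is_subgradient_at_zero(g))   # True exactly for g in [0, 1]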

Directional Derivatives

Some non-smooth functions possess directional derivatives even when full derivatives do not exist.

The directional derivative of f at x in direction v is:

D_v f(x) = \lim_{t\to 0^+} \frac{f(x+tv) - f(x)}{t}.

For:

f(x) = |x|,

at x = 0:

D_v f(0) = |v|.

This exists for every direction, but it is not linear in v, so it is not a Fréchet derivative.

Forward-mode AD approximates directional behavior along executed traces. This partly explains why AD can remain useful even for non-smooth programs.
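
A direct numerical version of the one-sided limit (plain Python, no AD library) shows this behavior: at x = 0 the directional derivative of |x| is |v| in every direction, which no single linear map can reproduce.

def f(x):
    return abs(x)

def directional_derivative(f, x, v, t=1e-8):
    # One-sided difference quotient approximating D_v f(x).
    return (f(x + t * v) - f(x)) / t

for v in [2.0, -2.0, 0.5, -0.5]:
    print(v, directional_derivative(f, 0.0, v))   # approximately |v| in every case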

Almost-Everywhere Differentiability

Many non-smooth functions are differentiable almost everywhere.

ReLU is differentiable everywhere except at 0.

Max pooling is differentiable except at ties.

Sorting is differentiable except where equal elements swap order.

This matters in optimization because random floating-point inputs almost never land exactly on a measure-zero boundary.

Hence gradient descent often works well despite theoretical non-smoothness.

However, this heuristic fails when:

  • boundaries are intentionally targeted
  • quantization dominates behavior
  • constraints are active
  • combinatorial structure matters
  • gradients repeatedly vanish
  • discrete routing controls execution

Vanishing Gradients from Non-Smoothness

Some non-smooth operations destroy gradient flow.

Example:

y = round(x)

Away from the rounding boundaries (the half-integer points):

\frac{dy}{dx} = 0.

Gradient-based optimization cannot meaningfully adjust x because the local derivative is zero almost everywhere.

The same issue occurs for:

  • argmax
  • top-k selection
  • thresholding
  • binary masks
  • integer indexing

The optimization signal disappears because small perturbations do not change the discrete outcome.
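
The effect is easy to see numerically: away from the rounding boundaries, perturbing x does not change round(x) at all, so every difference quotient, and hence the derivative of the executed trace, is exactly zero. A minimal plain-Python check:

def quantize(x):
    return round(x)

x = 0.3
for h in [1e-2, 1e-4, 1e-6]:
    print(h, (quantize(x + h) - quantize(x)) / h)   # 0.0 for every step size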

Straight-Through Estimators

A common engineering workaround is the straight-through estimator (STE).

Forward pass:

y = round(x)

Backward pass:

dy/dx = 1

This is intentionally inconsistent with the true derivative.

The primal computation is discrete, but the backward pass treats the operation as if it were the identity.

This technique is widely used in:

  • quantized neural networks
  • binary networks
  • discrete latent variables
  • routing systems

Mathematically, STE defines a surrogate gradient rather than a true derivative.
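
A common way to implement STE is the "identity plus stopped residual" trick: compute the discrete value in the forward pass, but let gradients flow through x as if the operation were the identity. Below is a hedged JAX sketch (the same pattern works with detach/stop-gradient in other frameworks):

import jax
import jax.numpy as jnp

def round_ste(x):
    # Forward: returns round(x). Backward: the rounded residual is wrapped in
    # stop_gradient, so the gradient passes through x as if round were the identity.
    return x + jax.lax.stop_gradient(jnp.round(x) - x)

def loss(x):
    return (round_ste(x) - 2.0) ** 2

print(loss(1.3))             # forward pass uses the discrete value round(1.3) = 1.0
print(jax.grad(loss)(1.3))   # surrogate gradient: 2 * (1.0 - 2.0) = -2.0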

Soft Relaxations

Another strategy replaces hard non-smooth operations with smooth approximations.

Instead of:

\max(x_1, x_2),

use:

\operatorname{softmax}_\tau(x_1, x_2) = \tau \log\left(e^{x_1/\tau} + e^{x_2/\tau}\right).

As τ → 0, this approaches the hard maximum.
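
This relaxation is just a temperature-scaled log-sum-exp, so it is easy to sketch; the snippet below (JAX, illustrative values) also shows that its gradient is a softmax weighting over the inputs rather than a hard selection.

import jax
import jax.numpy as jnp

def smooth_max(x, tau):
    # tau * log(sum_i exp(x_i / tau)): a smooth, everywhere-differentiable surrogate for max(x).
    return tau * jax.scipy.special.logsumexp(x / tau)

x = jnp.array([1.0, 1.2])
for tau in [1.0, 0.1, 0.01]:
    print(tau, smooth_max(x, tau))      # tends to max(x) = 1.2 as tau -> 0
print(jax.grad(smooth_max)(x, 0.1))     # soft "argmax" weights over the two inputs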

Similarly:

Hard operation    Smooth relaxation
ReLU              Softplus
Argmax            Softmax
Step function     Sigmoid
Top-k             Differentiable sorting
Binary masks      Gumbel-softmax

Relaxations improve gradient flow but change the model.

Non-Smoothness in Optimization

Gradient methods on non-smooth functions behave differently from smooth optimization.

For smooth optimization:

x_{k+1} = x_k - \eta \nabla f(x_k)

uses the local gradient direction.

For non-smooth optimization, one often uses:

x_{k+1} = x_k - \eta g_k

where:

g_k \in \partial f(x_k).

This is subgradient descent.

Subgradient methods converge more slowly and lack many of the guarantees available in smooth optimization.

Nevertheless, they remain practical for convex non-smooth objectives such as:

  • L1 regularization
  • hinge loss
  • total variation
  • sparse optimization
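
As a minimal illustration of the update rule above, the sketch below runs subgradient descent on the convex non-smooth objective 0.5(x − 2)² + λ|x| (an L1-regularized problem whose minimizer is x = 1 for λ = 1), picking one element of the subdifferential at each step via the convention sign(0) = 0. Plain NumPy, with an illustrative step-size schedule.

import numpy as np

lam = 1.0

def subgradient(x):
    # One valid element of the subdifferential of 0.5*(x-2)**2 + lam*|x|;
    # np.sign(0.0) = 0 picks the element 0 from the interval [-lam, lam] at x = 0.
    return (x - 2.0) + lam * np.sign(x)

x = -3.0
for k in range(2000):
    step = 0.5 / (k + 1)          # diminishing step size, standard for subgradient methods
    x = x - step * subgradient(x)

print(x)   # slowly approaches the minimizer x = 1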

Non-Smoothness in Physics and Geometry

Physical systems often contain contact events:

if penetration:
    apply_impulse()

The dynamics change discontinuously when contact begins or ends.

Rigid-body simulators, collision systems, and constraint solvers are therefore naturally non-smooth.

Differentiating such systems is difficult because:

  • contact topology changes
  • active constraints change
  • impulses create discontinuities
  • solver iterations may bifurcate

Modern differentiable simulators therefore often use the following techniques (a penalty-force sketch appears after the list):

  • softened contacts
  • differentiable penalty forces
  • smooth collision approximations
  • implicit differentiation
  • custom adjoints
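
One concrete version of the first two items is to replace the hard impulse branch with a penalty force that ramps up smoothly as penetration begins. The sketch below uses a softplus-shaped contact force; the stiffness and smoothing width are illustrative choices, not values from any particular simulator.

import numpy as np

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def contact_force(gap, stiffness=1e3, width=1e-3):
    # Hard version:  stiffness * max(0, -gap)   (kink at gap = 0, impulse-like onset)
    # Soft version:  the kink is smoothed over roughly `width`, so the force is differentiable.
    return stiffness * width * softplus(-gap / width)

for gap in [0.01, 0.001, 0.0, -0.001, -0.01]:
    print(gap, contact_force(gap))   # ~0 when separated, grows smoothly with penetration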

Non-Smoothness in Sparse and Discrete Systems

Sparse systems often use discrete structure:

indices = topk(scores)
y = gather(values, indices)

The selected sparsity pattern changes discontinuously.

This creates unstable gradients near selection boundaries.

Graph algorithms, retrieval systems, search systems, routing networks, and sparse mixture-of-experts models all encounter this issue.

Many such systems use hybrid approaches (a short sketch follows this list):

  • differentiate through scores
  • stop gradients through indices
  • use soft routing during training
  • use hard routing during inference
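
Here is a hedged sketch of that hybrid pattern in JAX (shapes and values are made up): the top-k indices are treated as non-differentiable constants, while the gathered rows are re-weighted by a softmax over the selected scores so gradients still reach the scoring function.

import jax
import jax.numpy as jnp

def sparse_read(scores, values, k=2):
    # Hard, discrete selection of which rows are read.
    top_scores, idx = jax.lax.top_k(scores, k)
    idx = jax.lax.stop_gradient(idx)        # indices are discrete; no gradient flows through them
    selected = values[idx]                  # gather the chosen rows
    weights = jax.nn.softmax(top_scores)    # soft weights keep gradients flowing into the scores
    return weights @ selected

scores = jnp.array([0.1, 2.0, -1.0, 1.5])
values = jnp.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]])

out, grad_scores = jax.value_and_grad(lambda s: sparse_read(s, values).sum())(scores)
print(out)
print(grad_scores)   # nonzero only for the selected scores; zero elsewhere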

Detecting Non-Smooth Regions

AD systems generally do not detect non-smoothness automatically.

Possible indicators include:

  • NaNs
  • gradient spikes
  • zero gradients
  • unstable optimization
  • branch oscillation
  • sensitivity to tiny perturbations

Numerical finite differences may also become unstable near boundaries:

\frac{f(x+h) - f(x)}{h}

can vary dramatically with step size and direction.

This is not an AD failure. The underlying mathematical derivative is unstable or undefined.
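
A small plain-Python example of this instability: the forward-difference estimate of the derivative of |x| at a point sitting just beside the kink swings between roughly -1 and +1 depending on the step size and its sign.

def f(x):
    return abs(x)

x0 = 1e-7   # a point extremely close to the kink at 0

for h in [1e-6, -1e-6, -2e-7, -1e-8]:
    print(h, (f(x0 + h) - f(x0)) / h)   # roughly 1.0, -0.8, 0.0, 1.0: no stable estimate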

Generalized Differentiation Frameworks

Several mathematical frameworks extend classical differentiation:

Framework                     Purpose
Subgradients                  Convex non-smooth functions
Clarke gradients              Locally Lipschitz functions
Viscosity solutions           PDEs with shocks
Distributional derivatives    Weak derivatives
Bouligand derivatives         Directional generalized derivatives

Most practical AD systems do not explicitly implement these theories. They use local propagation rules combined with conventions.

Nevertheless, these frameworks explain why AD often remains useful on non-smooth programs.

Correctness Rule

For non-smooth programs:

Away from non-smooth boundaries:
    AD behaves like ordinary differentiation.

At boundaries:
    the classical derivative may not exist.

AD may return:
    a branch derivative,
    a subgradient convention,
    or a surrogate gradient.

Discrete operations:
    usually block gradient flow unless relaxed or overridden.

Non-smoothness therefore does not prevent automatic differentiation from running. It changes the meaning of the gradients being computed.