# Non-Smooth Programs

A non-smooth program contains operations where the derivative is undefined, discontinuous, set-valued, or unstable under small perturbations. These programs arise naturally in optimization, machine learning, geometry, combinatorics, simulation, and systems code.

Examples include:

- absolute value
- maximum and minimum
- clipping
- sorting
- thresholding
- indexing
- quantization
- branching
- discrete sampling
- collision detection
- contact mechanics

Automatic differentiation can still execute on such programs, but the returned gradients require careful interpretation.

## Smoothness and Classical Derivatives

A classical derivative assumes local linear approximation.

A function $f$ is differentiable at $x$ if there exists a linear map $L$ such that:

$$
f(x+h)=f(x)+L(h)+o(\|h\|)
$$

as $h\to0$.

Non-smooth programs violate this assumption at some points.

For example:

$$
f(x)=|x|
$$

has no derivative at $x=0$ because:

$$
\lim_{h\to0^-}\frac{|h|-0}{h}=-1
$$

while:

$$
\lim_{h\to0^+}\frac{|h|-0}{h}=1.
$$

There is no single linear approximation valid from all directions.
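
A quick numerical check of these one-sided limits (plain Python, no AD involved):

```python
# One-sided difference quotients of f(x) = |x| at x = 0.
# The left and right limits disagree, so no single derivative exists there.

def f(x):
    return abs(x)

h = 1e-6
left = (f(0.0 - h) - f(0.0)) / (-h)    # tends to -1
right = (f(0.0 + h) - f(0.0)) / (+h)   # tends to +1
print(left, right)                     # -1.0  1.0
```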

AD systems do not generally test differentiability. They apply local propagation rules to the executed operations.

## Sources of Non-Smoothness

Non-smoothness enters programs through several mechanisms.

### Piecewise Definitions

```text
if x > 0:
    y = x
else:
    y = -x
```

This produces kinks or discontinuities at branch boundaries.

### Discrete Decisions

```text
i = argmax(x)
y = table[i]
```

The selected index changes discontinuously.

### Integer and Boolean Operations

```text
y = floor(x)
z = x > 0
```

These operations are locally constant almost everywhere, so their derivative is zero wherever it exists.

### Geometric Events

```text
if collision(a, b):
    resolve_contact()
```

Contact onset changes the active constraint structure.

### Numerical Safeguards

```text
x = max(x, eps)
```

Used to avoid division by zero or logarithm singularities.

### Quantization

```text
y = round(x)
```

Produces staircase functions.

## What AD Actually Computes

AD computes derivatives of the executed computational trace.

For example:

```text
y = max(x, 0)
```

If $x>0$, the trace is:

```text
identity(x)
```

and the derivative is $1$.

If $x<0$, the trace is:

```text
constant(0)
```

and the derivative is $0$.

At $x=0$, the trace depends on implementation details. The system may return:

$$
0,\quad 1,\quad \frac12,
$$

or another convention.
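
A small check of the branch-dependent behavior, assuming PyTorch is available; the value printed at exactly $x=0$ is a framework convention and may differ across libraries and versions:

```python
# Per-branch gradients of y = max(x, 0), assuming PyTorch is available.
# The value printed at exactly x = 0 is a framework convention,
# not a classical derivative.
import torch

for x0 in (-1.0, 0.0, 1.0):
    x = torch.tensor(x0, requires_grad=True)
    y = torch.relu(x)          # same function as max(x, 0)
    y.backward()
    print(x0, x.grad.item())   # 0 for x < 0, 1 for x > 0, convention at 0
```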

The important distinction is:

```text
AD computes the derivative of the executed trace,
not a proof of global smoothness.
```

## Clarke Generalized Gradients

For locally Lipschitz functions, one can generalize derivatives using Clarke subdifferentials.

For a function $f:\mathbb{R}^n\to\mathbb{R}$, the Clarke generalized gradient is:

$$
\partial_C f(x) =
\operatorname{conv}
\left\{
\lim_{k\to\infty} \nabla f(x_k) \;:\; x_k \to x,\ f \text{ differentiable at } x_k
\right\},
$$

the convex hull of all limits of gradients taken at nearby points where $f$ is differentiable.

For ReLU:

$$
f(x)=\max(0,x)
$$

the Clarke subdifferential at $0$ is:

$$
\partial_C f(0)=[0,1].
$$
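
To make the definition concrete, a small sketch that samples ordinary gradients of ReLU on both sides of the kink; the limit points are $\{0,1\}$ and their convex hull is $[0,1]$:

```python
# Ordinary gradients of relu at differentiable points x_k approaching 0.
# The limit points are {0, 1}; their convex hull is the Clarke set [0, 1].

def relu_grad(x):
    return 1.0 if x > 0 else 0.0   # valid for every x != 0

print([relu_grad(-10.0 ** -k) for k in range(1, 5)])   # [0.0, 0.0, 0.0, 0.0]
print([relu_grad(+10.0 ** -k) for k in range(1, 5)])   # [1.0, 1.0, 1.0, 1.0]
```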

Many AD systems implicitly choose one element from such a set.

This is mathematically weaker than differentiability, but often sufficient for optimization.

## Directional Derivatives

Some non-smooth functions possess directional derivatives even when full derivatives do not exist.

The directional derivative of $f$ at $x$ in direction $v$ is:

$$
D_v f(x) =
\lim_{t\to0^+}
\frac{f(x+tv)-f(x)}{t}.
$$

For:

$$
f(x)=|x|,
$$

at $x=0$:

$$
D_v f(0)=|v|.
$$

This exists for every direction, but it is not linear in $v$, so it is not a Fréchet derivative.
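
A numerical check of the one-sided quotient for $f(x)=|x|$ at $x=0$ (plain Python):

```python
# One-sided quotient (f(x + t v) - f(x)) / t for f(x) = |x| at x = 0.
# It converges to |v| for every v, but |v| is not linear in v.

def f(x):
    return abs(x)

def directional(f, x, v, t=1e-6):
    return (f(x + t * v) - f(x)) / t

for v in (-2.0, -1.0, 1.0, 2.0):
    print(v, directional(f, 0.0, v))   # approximately |v|
```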

Forward-mode AD approximates directional behavior along executed traces. This partly explains why AD can remain useful even for non-smooth programs.

## Almost-Everywhere Differentiability

Many non-smooth functions are differentiable almost everywhere.

ReLU is differentiable everywhere except at $0$.

Max pooling is differentiable except at ties.

Sorting is differentiable except at ties, where an arbitrarily small perturbation can change the output order.

This matters in optimization because random floating-point inputs almost never land exactly on a measure-zero boundary.

Hence gradient descent often works well despite theoretical non-smoothness.

However, this heuristic fails when:

- boundaries are intentionally targeted
- quantization dominates behavior
- constraints are active
- combinatorial structure matters
- gradients repeatedly vanish
- discrete routing controls execution

## Vanishing Gradients from Non-Smoothness

Some non-smooth operations destroy gradient flow.

Example:

```text
y = round(x)
```

Away from integer boundaries:

$$
\frac{dy}{dx}=0.
$$

Gradient-based optimization cannot meaningfully adjust `x` because the local derivative is zero almost everywhere.
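
A minimal illustration of the vanishing signal, assuming PyTorch is available:

```python
# Gradient through y = round(x), assuming PyTorch is available.
# Away from integer boundaries the derivative of round is exactly 0,
# so no useful signal reaches x.
import torch

x = torch.tensor(0.3, requires_grad=True)
y = torch.round(x)
y.backward()
print(x.grad.item())   # 0.0
```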

The same issue occurs for:

```text
argmax
top-k selection
thresholding
binary masks
integer indexing
```

The optimization signal disappears because small perturbations do not change the discrete outcome.

## Straight-Through Estimators

A common engineering workaround is the straight-through estimator (STE).

Forward pass:

```text
y = round(x)
```

Backward pass:

```text
dy/dx = 1
```

This is intentionally inconsistent with the true derivative.

The primal computation stays discrete, but the backward pass treats the operation as if it were the identity.

This technique is widely used in:

- quantized neural networks
- binary networks
- discrete latent variables
- routing systems

Mathematically, STE defines a surrogate gradient rather than a true derivative.
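
A minimal sketch of the STE pattern using the common detach trick, assuming PyTorch is available:

```python
# A minimal straight-through estimator using the common detach trick,
# assuming PyTorch is available.
import torch

def round_ste(x):
    # Forward: round(x).  Backward: gradient of the identity.
    return x + (torch.round(x) - x).detach()

x = torch.tensor(0.3, requires_grad=True)
y = round_ste(x)
y.backward()
print(y.item(), x.grad.item())   # forward value 0.0, surrogate gradient 1.0
```

The `detach` call removes the discrete correction from the gradient path, so the backward pass sees only the identity term `x`.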

## Soft Relaxations

Another strategy replaces hard non-smooth operations with smooth approximations.

Instead of:

$$
\max(x_1,x_2),
$$

use:

$$
\operatorname{softmax}_\tau(x_1,x_2) =
\tau\log(e^{x_1/\tau}+e^{x_2/\tau}).
$$

As $\tau\to0^+$, this temperature-scaled log-sum-exp approaches the hard maximum from above.
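
A short numerical sketch of the relaxation (plain Python):

```python
# Smooth maximum via temperature-scaled log-sum-exp (plain Python).
# As tau shrinks, the value approaches the hard max from above.
import math

def smooth_max(x1, x2, tau):
    return tau * math.log(math.exp(x1 / tau) + math.exp(x2 / tau))

for tau in (1.0, 0.1, 0.01):
    print(tau, smooth_max(1.0, 2.0, tau))   # 2.313..., 2.0000045..., ~2.0

print(max(1.0, 2.0))                        # 2.0
```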

Similarly:

| Hard operation | Smooth relaxation |
|---|---|
| ReLU | Softplus |
| Argmax | Softmax |
| Step function | Sigmoid |
| Top-k | Differentiable sorting |
| Binary masks | Gumbel-softmax |

Relaxations improve gradient flow, but they change the function being optimized, so the relaxed optimum can differ from the original one.

## Non-Smoothness in Optimization

Gradient methods on non-smooth functions behave differently from smooth optimization.

For smooth optimization:

$$
x_{k+1}=x_k-\eta\nabla f(x_k)
$$

uses the local gradient direction.

For non-smooth optimization, one often uses:

$$
x_{k+1}=x_k-\eta g_k
$$

where:

$$
g_k\in\partial f(x_k).
$$

This is subgradient descent.
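
A minimal subgradient-descent sketch on $f(x)=|x|$, choosing $g=0$ as the subgradient at the kink:

```python
# Subgradient descent on f(x) = |x|, choosing g = 0 at the kink.

def subgrad_abs(x):
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0   # one valid element of the subdifferential [-1, 1]

x, eta = 5.3, 0.5
for _ in range(20):
    x = x - eta * subgrad_abs(x)
print(x)   # ~0.3: a fixed step ends up oscillating near 0; diminishing steps converge
```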

Subgradient methods typically converge more slowly and lack many of the guarantees that hold for smooth optimization.

Nevertheless, they remain practical for convex non-smooth objectives such as:

- L1 regularization
- hinge loss
- total variation
- sparse optimization

## Non-Smoothness in Physics and Geometry

Physical systems often contain contact events:

```text
if penetration:
    apply_impulse()
```

The dynamics change discontinuously when contact begins or ends.

Rigid-body simulators, collision systems, and constraint solvers are therefore naturally non-smooth.

Differentiating such systems is difficult because:

- contact topology changes
- active constraints change
- impulses create discontinuities
- solver iterations may bifurcate

Modern differentiable simulators often use:

- softened contacts
- differentiable penalty forces
- smooth collision approximations
- implicit differentiation
- custom adjoints
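
As a toy sketch of the softened-contact idea (hypothetical names and constants, not any particular simulator's API), the hard penalty `stiffness * max(penetration, 0)` can be replaced by a smooth softplus ramp:

```python
# A toy softened-contact penalty (hypothetical names and constants,
# not any particular simulator's API).  The hard penalty
# stiffness * max(penetration, 0) is replaced by a smooth softplus ramp.
import math

def softplus(z, beta=50.0):
    return math.log1p(math.exp(beta * z)) / beta

def contact_force(penetration, stiffness=1e3):
    return stiffness * softplus(penetration)

for depth in (-0.01, 0.0, 0.01):
    print(depth, contact_force(depth))   # force turns on smoothly around contact
```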

## Non-Smoothness in Sparse and Discrete Systems

Sparse systems often use discrete structure:

```text
indices = topk(scores)
y = gather(values, indices)
```

The selected sparsity pattern changes discontinuously.

This creates unstable gradients near selection boundaries.

Graph algorithms, retrieval systems, search systems, routing networks, and sparse mixture-of-experts models all encounter this issue.

Many such systems use hybrid approaches:

- differentiate through scores
- stop gradients through indices
- use soft routing during training
- use hard routing during inference
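
A minimal sketch of this hybrid pattern, assuming PyTorch is available; the tensor names are illustrative only:

```python
# A minimal sketch of the hybrid pattern, assuming PyTorch is available.
# Gradients flow through the scores via a soft weighting; the hard index
# selection itself carries no gradient.
import torch

scores = torch.randn(4, requires_grad=True)
values = torch.randn(4)

weights = torch.softmax(scores, dim=0)   # soft, differentiable
hard_index = torch.argmax(scores)        # discrete; no gradient path

y_soft = (weights * values).sum()        # training-time surrogate
y_hard = values[hard_index]              # inference-time hard selection

y_soft.backward()
print(scores.grad)                       # nonzero: signal reaches the scores
```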

## Detecting Non-Smooth Regions

AD systems generally do not detect non-smoothness automatically.

Possible indicators include:

```text
NaNs
gradient spikes
zero gradients
unstable optimization
branch oscillation
sensitivity to tiny perturbations
```

Numerical finite differences may also become unstable near boundaries:

$$
\frac{f(x+h)-f(x)}{h}
$$

can vary dramatically with step size and direction.
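
For example, near the kink of $|x|$ the forward-difference estimate depends strongly on whether the step crosses the kink (plain Python):

```python
# Forward differences of f(x) = |x| just left of the kink at 0.
# The estimate flips sign depending on whether the step crosses the kink.

def f(x):
    return abs(x)

x = -1e-8
for h in (1e-6, 1e-9):
    print(h, (f(x + h) - f(x)) / h)   # roughly +0.98 and -1.0
```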

This is not an AD failure. The underlying mathematical derivative is unstable or undefined.

## Generalized Differentiation Frameworks

Several mathematical frameworks extend classical differentiation:

| Framework | Purpose |
|---|---|
| Subgradients | Convex non-smooth functions |
| Clarke gradients | Locally Lipschitz functions |
| Viscosity solutions | PDEs with shocks |
| Distributional derivatives | Weak derivatives |
| Bouligand derivatives | Directional generalized derivatives |

Most practical AD systems do not explicitly implement these theories. They use local propagation rules combined with conventions.

Nevertheless, these frameworks explain why AD often remains useful on non-smooth programs.

## Correctness Rule

For non-smooth programs:

```text
Away from non-smooth boundaries:
    AD behaves like ordinary differentiation.

At boundaries:
    the classical derivative may not exist.

AD may return:
    a branch derivative,
    a subgradient convention,
    or a surrogate gradient.

Discrete operations:
    usually block gradient flow unless relaxed or overridden.
```

Non-smoothness therefore does not prevent automatic differentiation from running. It changes the meaning of the gradients being computed.

