# Piecewise Differentiability

A piecewise differentiable function is built from several differentiable pieces joined along boundaries. Each piece has an ordinary derivative in the interior of its region. At a boundary, the derivatives of the adjacent pieces may agree, may disagree, or the derivative may fail to exist.

Conditionals, loops with data-dependent exits, clipping, thresholding, sorting, indexing, activation functions, and many numerical guards all produce piecewise differentiable programs.

A typical example is:

```text
def f(x):
    if x < 0:
        return -x
    else:
        return x
```

This computes

$$
f(x)=|x|.
$$

For $x < 0$,

$$
f'(x)=-1.
$$

For $x > 0$,

$$
f'(x)=1.
$$

At $x=0$, the classical derivative does not exist.

Automatic differentiation follows the executed piece. At $x=0$ the condition `x < 0` is false, so the `else` branch runs and the derivative of that branch, $1$, is returned. The value depends on the program branch, not on the full mathematical shape of the function.
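
To make this concrete, here is a minimal forward-mode sketch using dual numbers (a toy, not any particular AD library). The comparison inspects only the primal value, so at $x=0$ the `else` branch is selected and the propagated derivative is $1$, even though $|x|$ has no derivative there.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """Minimal forward-mode value: a primal and its derivative."""
    val: float
    dot: float

    def __neg__(self):
        return Dual(-self.val, -self.dot)

    def __lt__(self, other):
        # Branching reads only the primal value; the derivative plays no role.
        return self.val < other

def f(x):
    if x < 0:
        return -x
    return x

print(f(Dual(0.0, 1.0)))   # Dual(val=0.0, dot=1.0): the else branch's derivative, 1
```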

## Smooth Regions

A piecewise function partitions its domain into regions:

$$
R_1, R_2, \ldots, R_k.
$$

Inside each region, the function is smooth:

$$
f(x)=f_i(x)\quad \text{for }x\in R_i.
$$

AD works normally inside a region. The branch decisions are locally constant, so the program behaves like straight-line differentiable code.

For example:

```text
def relu(x):
    if x > 0:
        return x
    else:
        return 0
```

For $x > 0$, the derivative is $1$. For $x < 0$, the derivative is $0$. These are ordinary derivatives of the active region.

## Boundaries

A boundary is a set of inputs where the active piece may change. For ReLU, the boundary is $x=0$. For a multidimensional condition such as:

```text
if x[0] + x[1] > 1:
    y = f(x)
else:
    y = g(x)
```

the boundary is the hyperplane:

$$
x_0+x_1=1.
$$

At a boundary, the full function may have no derivative. Even when it is continuous, the directional derivative may depend on the direction of approach.

AD usually returns the derivative of whichever branch the program selected. This value can be useful, but it should not be confused with a proof of differentiability.
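
One way to see the direction dependence is to compare one-sided difference quotients at the boundary. A small sketch for $|x|$ at $0$ (the step size `h` is an arbitrary choice):

```python
def f(x):
    return abs(x)

h = 1e-6
right = (f(0.0 + h) - f(0.0)) / h   # forward difference, approximately +1
left  = (f(0.0) - f(0.0 - h)) / h   # backward difference, approximately -1
print(right, left)                  # the one-sided slopes disagree, so no single derivative exists
```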

## Subgradients

For convex non-smooth functions, one may use subgradients. The absolute value function has subdifferential:

$$
\partial |x| =
\begin{cases}
\{-1\}, & x < 0,\\
[-1,1], & x = 0,\\
\{1\}, & x > 0.
\end{cases}
$$

At $x=0$, any value in $[-1,1]$ is a valid subgradient.

Many machine learning systems define a conventional gradient at non-smooth points. For ReLU, the derivative at $0$ is often defined as $0$. This is a convention, not a classical derivative.

```text
relu'(0) = 0   # common implementation convention
```

Such conventions are practical because optimization algorithms need some value to propagate. They do not remove the underlying non-smoothness.
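
Written out explicitly, such a convention is just a fixed choice from the subdifferential. A minimal sketch for ReLU, where the value $0$ is picked from the interval $[0,1]$ at the kink:

```python
def relu(x):
    return x if x > 0.0 else 0.0

def relu_grad(x):
    # Conventional derivative: 1 for x > 0, 0 otherwise.
    # At x == 0 this selects the subgradient 0 from the valid interval [0, 1].
    return 1.0 if x > 0.0 else 0.0
```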

## Piecewise Linear Functions

Piecewise linear functions are common in machine learning. Examples include:

```text
relu(x)      = max(0, x)
leaky_relu(x)= max(alpha*x, x)    # for 0 < alpha < 1
clip(x,a,b)  = min(max(x,a), b)
abs(x)       = max(x, -x)
```

Inside each region, the function is affine, so the derivative is constant.

For `clip`:

$$
\operatorname{clip}(x,a,b)=
\begin{cases}
a, & x<a,\\
x, & a\le x\le b,\\
b, & x>b.
\end{cases}
$$

Away from boundaries:

$$
\frac{d}{dx}\operatorname{clip}(x,a,b)=
\begin{cases}
0, & x<a,\\
1, & a<x<b,\\
0, & x>b.
\end{cases}
$$

At $x=a$ and $x=b$, the implementation must choose a convention.
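
One possible way to write that convention down, assigning the boundary points $x=a$ and $x=b$ to the interior piece (other implementations assign them derivative $0$ instead):

```python
def clip_grad(x, a, b):
    # Conventional derivative of clip(x, a, b) with respect to x.
    # Boundary points x == a and x == b are treated as part of the interior piece.
    return 1.0 if a <= x <= b else 0.0
```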

## Selection Operations

Indexing, sorting, top-k, argmax, and masking are piecewise operations. They often choose a discrete structure and then propagate gradients only through selected values.

Example:

```text
y = max(x1, x2)
```

If $x_1 > x_2$, then:

$$
\frac{\partial y}{\partial x_1}=1,\qquad
\frac{\partial y}{\partial x_2}=0.
$$

If $x_2 > x_1$, then:

$$
\frac{\partial y}{\partial x_1}=0,\qquad
\frac{\partial y}{\partial x_2}=1.
$$

At $x_1=x_2$, the function is non-smooth. An implementation may send the gradient to one argument, split it, or follow a documented tie-breaking rule.
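
For example, one documented tie-breaking rule is to split the gradient evenly; since $\max$ is convex, any convex combination of $(1,0)$ and $(0,1)$ is a valid subgradient at the tie. A minimal sketch:

```python
def max_grad(x1, x2):
    # Gradient of y = max(x1, x2) with respect to (x1, x2).
    if x1 > x2:
        return (1.0, 0.0)
    if x2 > x1:
        return (0.0, 1.0)
    # Tie: split evenly; any convex combination of (1, 0) and (0, 1) is valid.
    return (0.5, 0.5)
```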

The same issue appears in `sort`. Sorting is locally a permutation. Away from ties, its Jacobian is a permutation matrix. At ties, the active permutation is unstable.
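
A small sketch of that Jacobian, using NumPy's `argsort` to recover the active permutation (valid away from ties):

```python
import numpy as np

def sort_jacobian(x):
    # Jacobian of np.sort at x, away from ties.
    # The sorted output y satisfies y[i] = x[perm[i]], so d y[i] / d x[perm[i]] = 1.
    perm = np.argsort(x)
    J = np.zeros((len(x), len(x)))
    J[np.arange(len(x)), perm] = 1.0
    return J

x = np.array([3.0, 1.0, 2.0])
print(sort_jacobian(x))
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```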

## Piecewise Programs and AD Correctness

For a program $P$, AD computes a local linearization of the executed trace. This is the correct derivative when the trace is locally stable.

A trace is locally stable when small perturbations do not change:

```text
branch choices
loop counts
recursion depth
selected indices
sort order
top-k membership
mask structure
```

When these remain fixed, AD gives the ordinary derivative of the program near the input.

When they change under arbitrarily small perturbations, the function is at a boundary. The returned derivative describes the selected trace only.
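
A crude diagnostic, assuming a single scalar predicate and a hand-picked tolerance `eps` (both are illustrative assumptions, not a general test): re-evaluate the branch decision under small perturbations and check that it does not flip.

```python
def branch_is_stable(x, predicate, eps=1e-8):
    # True if the branch decision does not change under +/- eps perturbations.
    # Illustrative only: real programs have many predicates and eps is arbitrary.
    branch = predicate(x)
    return predicate(x - eps) == branch and predicate(x + eps) == branch

print(branch_is_stable(0.5, lambda x: x > 0.0))   # True: well inside a region
print(branch_is_stable(0.0, lambda x: x > 0.0))   # False: at the ReLU boundary
```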

## Practical Rule

For piecewise differentiable programs:

```text
Inside a smooth region:
    AD gives the classical derivative.

At a boundary:
    AD gives a branch-specific or convention-specific value.

For discrete selections:
    gradients usually flow through selected values,
    not through the selection decision itself.
```

This is acceptable in many optimization problems because exact boundary points often have measure zero under continuous data. But this argument is probabilistic, not absolute. Boundaries matter in quantization, clipping, constraints, solvers, routing, and any model that deliberately creates hard decisions.

Piecewise differentiability is therefore a normal part of AD. The main requirement is to know whether the derivative returned by the system is a true local derivative, a subgradient convention, or merely the derivative of one executed path.

