# Numerical Differentiation

Numerical differentiation estimates derivatives by evaluating a function at nearby input values. It treats the function as a black box. The method does not need access to the internal structure of the program, only the ability to call it.

For a scalar function

$$
f : \mathbb{R} \to \mathbb{R},
$$

the simplest finite difference approximation is

$$
f'(x) \approx \frac{f(x+h)-f(x)}{h}.
$$

Here $h$ is a small perturbation. The idea is geometric: replace the tangent slope at $x$ with the slope of a nearby secant line.

This method is called the forward difference formula. It is easy to implement:

```python
def forward_difference(f, x, h):
    # Slope of the secant line from x to x + h.
    return (f(x + h) - f(x)) / h
```

The formula looks harmless, but the choice of $h$ is a serious numerical problem. If $h$ is too large, the approximation ignores curvature. If $h$ is too small, floating point cancellation destroys useful digits.

## Forward Difference

Assume $f$ is smooth near $x$. Taylor expansion gives

$$
f(x+h) =
f(x)
+
h f'(x)
+
\frac{h^2}{2} f''(x)
+
O(h^3).
$$

Subtracting $f(x)$ and dividing by $h$ gives

$$
\frac{f(x+h)-f(x)}{h} =
f'(x)
+
\frac{h}{2} f''(x)
+
O(h^2).
$$

So the forward difference formula has truncation error of order $O(h)$. Reducing $h$ reduces this error linearly, at least in exact arithmetic.

Floating point arithmetic changes the picture. When $h$ is very small, $f(x+h)$ and $f(x)$ are nearly equal. Their subtraction loses significant digits. Dividing by $h$ then amplifies the error.

The total error has two competing parts:

$$
\text{error}
\approx
C_1 h
+
C_2 \frac{\epsilon}{h},
$$

where $\epsilon$ is machine precision. The first term shrinks as $h$ decreases; the second grows like $1/h$. There is no universally safe choice.
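
Treating $C_1$ and $C_2$ as fixed constants, minimizing this error model over $h$ shows where the two terms balance:

$$
\frac{d}{dh}\left(C_1 h + C_2 \frac{\epsilon}{h}\right)
= C_1 - C_2 \frac{\epsilon}{h^2}
= 0
\quad\Longrightarrow\quad
h^* = \sqrt{\frac{C_2}{C_1}\,\epsilon}.
$$

At this optimum the two contributions are equal, and the total error scales like $\sqrt{\epsilon}$ rather than $\epsilon$: roughly half of the available precision is lost even with the best possible step. This is the origin of the $\sqrt{\epsilon}$ heuristic discussed below.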

## Central Difference

A more accurate formula evaluates the function on both sides of $x$:

$$
f'(x)
\approx
\frac{f(x+h)-f(x-h)}{2h}.
$$

Taylor expansion gives

$$
f(x+h) =
f(x)
+
h f'(x)
+
\frac{h^2}{2}f''(x)
+
\frac{h^3}{6}f'''(x)
+
O(h^4),
$$

and

$$
f(x-h) =
f(x) -
h f'(x)
+
\frac{h^2}{2}f''(x) -
\frac{h^3}{6}f'''(x)
+
O(h^4).
$$

Subtracting cancels the even-order terms:

$$
\frac{f(x+h)-f(x-h)}{2h} =
f'(x)
+
\frac{h^2}{6}f'''(x)
+
O(h^4).
$$

Central difference has truncation error $O(h^2)$, better than the $O(h)$ of forward difference. But it requires two function evaluations per derivative, neither of which can reuse an already computed $f(x)$, and it still suffers from roundoff when $h$ is too small.

A typical implementation is:

```python
def central_difference(f, x, h):
    # Symmetric secant slope through x - h and x + h.
    return (f(x + h) - f(x - h)) / (2 * h)
```

Central difference is often a better diagnostic tool than forward difference, but it remains an approximation.
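
A small experiment makes the difference in order visible. The sketch below, assuming $f(x) = \sin x$ at $x = 1$ so the true derivative is $\cos 1$, halves $h$ repeatedly; the forward error should shrink by roughly a factor of two per halving and the central error by roughly a factor of four, matching the $O(h)$ and $O(h^2)$ truncation orders.

```python
import math

f, x = math.sin, 1.0
exact = math.cos(x)                        # true derivative of sin at x = 1
h = 0.1
for _ in range(8):
    fwd = (f(x + h) - f(x)) / h            # O(h) truncation error
    cen = (f(x + h) - f(x - h)) / (2 * h)  # O(h^2) truncation error
    print(f"h={h:8.1e}  forward err={abs(fwd - exact):.2e}  central err={abs(cen - exact):.2e}")
    h /= 2
```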

## Multivariate Finite Differences

For a function

$$
f : \mathbb{R}^n \to \mathbb{R},
$$

the gradient is

$$
\nabla f(x) =
\begin{bmatrix}
\frac{\partial f}{\partial x_1} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{bmatrix}.
$$

A finite difference estimate perturbs one coordinate at a time:

$$
\frac{\partial f}{\partial x_i}
\approx
\frac{f(x + h e_i)-f(x)}{h},
$$

where $e_i$ is the $i$-th coordinate vector.
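
A minimal sketch of this coordinatewise loop, assuming NumPy arrays and a single step size $h$ for every coordinate (per-coordinate scaling is discussed under step sizes below); `fd_gradient` is an illustrative name, not a library function:

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Forward-difference estimate of the gradient of a scalar-valued f at x."""
    x = np.asarray(x, dtype=float)
    base = f(x)                      # one base evaluation
    grad = np.empty_like(x)
    for i in range(x.size):          # one extra evaluation per coordinate
        xp = x.copy()
        xp[i] += h
        grad[i] = (f(xp) - base) / h
    return grad

# Example: f(x) = sum(x**2) has gradient 2x.
print(fd_gradient(lambda v: float(np.sum(v ** 2)), np.array([1.0, -2.0, 3.0])))
```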

This requires one base evaluation plus one evaluation per input coordinate. The cost is

$$
n + 1
$$

function evaluations for a forward-difference gradient.

For central differences, the cost is

$$
2n
$$

function evaluations.

This scaling is acceptable when $n$ is small. It becomes impractical when $n$ is large. A neural network with one million parameters would require roughly one million function evaluations to estimate a single gradient by forward differences. Reverse mode automatic differentiation can compute the same gradient with cost comparable to a small constant multiple of one function evaluation.

## Jacobian Approximation

For a vector-valued function

$$
f : \mathbb{R}^n \to \mathbb{R}^m,
$$

the Jacobian is

$$
J_f(x)
\in
\mathbb{R}^{m \times n}.
$$

Finite differences approximate one column at a time:

$$
J_f(x)_{:,i}
\approx
\frac{f(x+h e_i)-f(x)}{h}.
$$

Each perturbation gives the effect of one input coordinate on all output coordinates. Therefore a full forward-difference Jacobian requires $n+1$ function evaluations, regardless of $m$.
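
A column-at-a-time sketch under the same assumptions (NumPy, a vector-valued `f` returning a 1-D array, one shared step size; `fd_jacobian` is an illustrative name):

```python
import numpy as np

def fd_jacobian(f, x, h=1e-6):
    """Forward-difference estimate of the Jacobian of f: R^n -> R^m, one column per coordinate."""
    x = np.asarray(x, dtype=float)
    base = np.asarray(f(x), dtype=float)   # one base evaluation gives all m outputs
    J = np.empty((base.size, x.size))
    for i in range(x.size):                # n perturbed evaluations
        xp = x.copy()
        xp[i] += h
        J[:, i] = (np.asarray(f(xp)) - base) / h
    return J

# Example: f(x, y) = (x*y, x + y) has Jacobian [[y, x], [1, 1]].
print(fd_jacobian(lambda v: np.array([v[0] * v[1], v[0] + v[1]]), np.array([2.0, 3.0])))
```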

This is useful when $n$ is modest and each function evaluation is cheap. It is a poor choice when the input dimension is large or each evaluation is expensive.

## Choosing the Step Size

The step size $h$ should be small enough to approximate local behavior and large enough to avoid cancellation. A common heuristic for forward differences in double precision is

$$
h \approx \sqrt{\epsilon}(1 + |x|),
$$

where $\epsilon$ is machine precision.

For central differences, a common heuristic is

$$
h \approx \epsilon^{1/3}(1 + |x|).
$$

These are only heuristics. The right scale depends on the function, the input magnitude, the curvature, and the floating point behavior of the computation.

A single global $h$ is often wrong for vector inputs. Different variables may have different units and scales. Perturbing a temperature, a probability, a mass, and a neural network weight by the same absolute amount can produce meaningless results.

A more practical coordinatewise rule is

$$
h_i = \eta(1 + |x_i|),
$$

where $\eta$ is a small dimensionless parameter.
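
In code this rule is a single vectorized expression. The sketch below, assuming NumPy and taking $\eta = \sqrt{\epsilon}$ as a forward-difference default, assigns each coordinate its own perturbation:

```python
import numpy as np

def coordinate_steps(x, eta=None):
    # h_i = eta * (1 + |x_i|): an absolute floor of eta plus relative scaling for large |x_i|.
    if eta is None:
        eta = np.sqrt(np.finfo(float).eps)   # about 1.5e-8 in double precision
    return eta * (1.0 + np.abs(np.asarray(x, dtype=float)))

# Variables with very different scales receive very different absolute perturbations.
x = np.array([300.0, 0.003, 9.81, 1e-4])
print(coordinate_steps(x))
```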

Even this rule can fail when the function has thresholds, discontinuities, saturating nonlinearities, stochastic components, or internal iterative solvers.

## Cancellation

The core numerical failure in finite differences is cancellation.

Suppose $f(x+h)$ and $f(x)$ agree in many leading digits. Their difference then contains only a few reliable digits. In double precision, which carries roughly sixteen significant decimal digits, two values that agree in their first twelve digits leave a difference with only about four trustworthy digits.

The derivative estimate then divides this small difference by $h$, amplifying any roundoff error.

This failure becomes severe when the true derivative is small, the function value is large, or the perturbation is below the scale at which the program responds numerically.
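
A concrete illustration, assuming $f(x) = x^2$ at $x = 10^4$ in double precision: as $h$ shrinks, the difference $f(x+h) - f(x)$ keeps fewer reliable digits, and once $h$ drops below the spacing of representable numbers near $x$, the perturbation vanishes and the estimate collapses to zero.

```python
def f(x):
    return x * x                 # true derivative at x is 2 * x

x = 1.0e4                        # f(x) = 1e8, so absolute roundoff near f(x) is large
for h in [1e-5, 1e-7, 1e-9, 1e-11, 1e-13]:
    estimate = (f(x + h) - f(x)) / h
    print(f"h={h:.0e}  estimate={estimate:.6e}  (true value 2.000000e+04)")
```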

Automatic differentiation avoids this specific subtraction problem because it propagates derivative values through primitive operations directly. It computes local derivative information during evaluation rather than subtracting nearly equal function values after evaluation.

## Discontinuities and Non-Smooth Points

Finite differences can also mislead near non-smooth points.

Consider

$$
f(x) = |x|.
$$

At $x=0$, the derivative does not exist. But the central difference gives

$$
\frac{|h| - |-h|}{2h} = 0.
$$

This number may look like a derivative, but it is an artifact of the symmetric formula.

The forward difference gives

$$
\frac{|h|-0}{h}=1
$$

for positive $h$, while a backward difference gives $-1$. Different formulas produce different answers because the mathematical derivative is undefined.
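
The disagreement is easy to reproduce. A small sketch evaluating all three formulas for $f(x) = |x|$ at $x = 0$:

```python
def f(x):
    return abs(x)

x, h = 0.0, 1e-6
forward  = (f(x + h) - f(x)) / h            # 1.0
backward = (f(x) - f(x - h)) / h            # -1.0
central  = (f(x + h) - f(x - h)) / (2 * h)  # 0.0, an artifact of symmetry
print(forward, backward, central)
```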

Automatic differentiation does not magically fix non-smoothness. It follows the derivative rule of the executed branch or primitive operation. The key difference is that AD makes the computational derivative explicit, while finite differences approximate behavior around the point.

## Stochastic and Stateful Functions

Finite differences assume repeated function evaluations are comparable. This assumption fails when the function is stochastic or stateful.

For example:

```python
import random

def f(x):
    noise = random.gauss(0.0, 1.0)   # a fresh noise sample on every call
    return x * x + noise
```

The finite difference

```python
(f(x + h) - f(x)) / h
```

subtracts two different noise samples unless randomness is controlled. The derivative estimate may be dominated by noise.

State creates a similar problem:

```python
counter = 0

def f(x):
    global counter               # hidden state shared across calls
    counter = counter + 1
    return x * counter
```

Two evaluations do not measure the same mathematical function. They measure different states of a computation.

Automatic differentiation also requires discipline around randomness and state, but it differentiates a single execution trace. This makes the derivative meaning clearer: the derivative belongs to the computation that actually ran.

## Finite Differences as a Testing Tool

Despite its weaknesses, numerical differentiation remains useful. Its best role is testing.

When implementing automatic differentiation, we often compare AD results against finite difference estimates for small examples. This practice is called gradient checking.

A typical test checks whether

$$
\frac{f(x+h v)-f(x)}{h}
$$

is close to

$$
\nabla f(x)^T v,
$$

where $v$ is a chosen direction.

This directional test is cheaper than checking every coordinate. It uses one perturbation direction instead of $n$ coordinate perturbations.

For vector-valued functions, a similar check compares

$$
\frac{f(x+h v)-f(x)}{h}
$$

with

$$
J_f(x)v.
$$

This verifies a Jacobian-vector product without materializing the full Jacobian.
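
A minimal gradient-checking sketch along these lines, assuming NumPy, a scalar-valued `f`, and a candidate gradient function `grad_f` to be tested (for example, the output of an AD system); all names are illustrative:

```python
import numpy as np

def directional_check(f, grad_f, x, h=1e-6, seed=0):
    """Compare a forward-difference directional derivative with grad_f(x) . v."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape)           # one random direction instead of n coordinates
    fd = (f(x + h * v) - f(x)) / h             # finite-difference directional derivative
    ad = float(np.dot(grad_f(x), v))           # candidate gradient applied to the same direction
    return abs(fd - ad) / max(1.0, abs(ad))    # scaled discrepancy; small means agreement

# Example: f(x) = sum(x**2) with exact gradient 2x; the discrepancy should be tiny.
f = lambda v: float(np.sum(v ** 2))
grad_f = lambda v: 2.0 * v
print(directional_check(f, grad_f, np.array([1.0, -2.0, 0.5])))
```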

Finite differences are therefore useful as an independent sanity check. They are usually unsuitable as the main derivative engine for large systems.

## Why Numerical Differentiation Is Not Enough

Numerical differentiation has three structural limits.

It is approximate. The result depends on the perturbation size, floating point precision, and local smoothness.

It scales poorly. Full gradients and Jacobians require many function evaluations when the input dimension is large.

It treats programs as black boxes. It ignores the internal chain of operations, so it cannot exploit the structure that makes derivative computation efficient.

Automatic differentiation addresses these limits by opening the black box. It records or transforms the computation, applies exact local derivative rules, and combines them through the chain rule.

Finite differences ask, “What happens if I rerun the program nearby?”

Automatic differentiation asks, “How does change flow through this execution?”

That difference is the reason automatic differentiation is the standard tool for derivative computation in modern numerical software.

