Numerical differentiation estimates derivatives by evaluating a function at nearby input values. It treats the function as a black box. The method does not need access to the internal structure of the program, only the ability to call it.
For a scalar function

$$ f : \mathbb{R} \to \mathbb{R}, $$

the simplest finite difference approximation is

$$ f'(x) \approx \frac{f(x + h) - f(x)}{h}. $$

Here $h$ is a small perturbation. The idea is geometric: replace the tangent slope at $x$ with the slope of a nearby secant line.
This method is called the forward difference formula. It is easy to implement:
    function derivative(f, x, h):
        return (f(x + h) - f(x)) / h

The formula looks harmless, but the choice of $h$ is a serious numerical problem. If $h$ is too large, the approximation ignores curvature. If $h$ is too small, floating point cancellation destroys useful digits.
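The same routine as a minimal Python sketch; the test function sin is purely illustrative:

```python
import math

def forward_difference(f, x, h):
    # Slope of the secant through (x, f(x)) and (x + h, f(x + h)).
    return (f(x + h) - f(x)) / h

# Illustration: the derivative of sin at 1.0 is cos(1.0) ≈ 0.5403.
est = forward_difference(math.sin, 1.0, 1e-6)
```

With $h = 10^{-6}$ the estimate agrees with the true derivative to roughly six digits; the next sections explain why both larger and smaller steps do worse.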
Forward Difference
Assume $f$ is smooth near $x$. Taylor expansion gives

$$ f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + O(h^3). $$

Subtracting $f(x)$ and dividing by $h$ gives

$$ \frac{f(x + h) - f(x)}{h} = f'(x) + \frac{h}{2} f''(x) + O(h^2). $$

So the forward difference formula has truncation error of order $O(h)$. Reducing $h$ reduces this error linearly, at least in exact arithmetic.

Floating point arithmetic changes the picture. When $h$ is very small, $f(x + h)$ and $f(x)$ are nearly equal. Their subtraction loses significant digits. Dividing by $h$ then amplifies the error.

The total error has two competing parts:

$$ E(h) \approx \frac{h}{2} \lvert f''(x) \rvert + \frac{2 \epsilon \lvert f(x) \rvert}{h}, $$

where $\epsilon$ is machine precision. The first term decreases with $h$. The second term grows as $h$ shrinks. There is no universally safe choice.
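The two competing error terms are easy to observe. A small sketch, using $f(x) = e^x$ at $x = 1$ as an illustrative test case:

```python
import math

def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h

true_value = math.e  # derivative of exp at x = 1 is e
errors = {h: abs(forward_difference(math.exp, 1.0, h) - true_value)
          for h in (1e-2, 1e-8, 1e-14)}
# The moderate step wins: 1e-2 loses to truncation error,
# 1e-14 loses to floating point cancellation.
```

The error is not monotone in $h$: it falls as $h$ shrinks, bottoms out near $\sqrt{\epsilon}$, then rises again as cancellation takes over.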
Central Difference
A more accurate formula evaluates the function on both sides of $x$:

$$ f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}. $$

Taylor expansion gives

$$ f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + O(h^4) $$

and

$$ f(x - h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(x) + O(h^4). $$

Subtracting cancels the even-order terms:

$$ \frac{f(x + h) - f(x - h)}{2h} = f'(x) + \frac{h^2}{6} f'''(x) + O(h^4). $$

Central difference has truncation error $O(h^2)$, better than the $O(h)$ of forward difference. But it requires two function evaluations per estimate rather than one beyond the base value, and it still suffers from roundoff when $h$ is too small.
A typical implementation is:
    function derivative(f, x, h):
        return (f(x + h) - f(x - h)) / (2 * h)

Central difference is often a better diagnostic tool than forward difference, but it remains an approximation.
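The accuracy gap is visible even at a modest step size. A sketch comparing the two formulas on sin, whose derivative is known exactly:

```python
import math

def forward(f, x, h):
    return (f(x + h) - f(x)) / h

def central(f, x, h):
    # Symmetric secant: odd-order Taylor terms cancel.
    return (f(x + h) - f(x - h)) / (2 * h)

x, h = 1.0, 1e-4
err_fwd = abs(forward(math.sin, x, h) - math.cos(x))
err_ctr = abs(central(math.sin, x, h) - math.cos(x))
# The central estimate is several orders of magnitude more accurate
# at the same step size.
```

Because its truncation error scales as $h^2$ rather than $h$, central difference buys extra digits at moderate $h$ without venturing into the cancellation regime.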
Multivariate Finite Differences
For a function

$$ f : \mathbb{R}^n \to \mathbb{R}, $$

the gradient is

$$ \nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right). $$

A finite difference estimate perturbs one coordinate at a time:

$$ \frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h e_i) - f(x)}{h}, $$

where $e_i$ is the $i$-th coordinate vector.

This requires one base evaluation plus one evaluation per input coordinate. The cost is

$$ n + 1 $$

function evaluations for a forward-difference gradient.

For central differences, the cost is

$$ 2n $$

function evaluations.
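A minimal sketch of a forward-difference gradient in Python (list-based for clarity; a production version would use arrays):

```python
def fd_gradient(f, x, h=1e-8):
    # n + 1 calls to f: one base value plus one perturbed value per coordinate.
    base = f(x)
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        grad.append((f(xp) - base) / h)
    return grad

# Illustration: f(x) = x0^2 + 3*x1 has gradient (2*x0, 3).
g = fd_gradient(lambda v: v[0] ** 2 + 3.0 * v[1], [2.0, 5.0])
```

The loop body makes the cost structure explicit: the base evaluation is shared, but every coordinate demands its own call to `f`.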
This scaling is acceptable when $n$ is small. It becomes impractical when $n$ is large. A neural network with one million parameters would require roughly one million function evaluations to estimate a single gradient by forward differences. Reverse mode automatic differentiation can compute the same gradient with cost comparable to a small constant multiple of one function evaluation.
Jacobian Approximation
For a vector-valued function

$$ f : \mathbb{R}^n \to \mathbb{R}^m, $$

the Jacobian is

$$ J_{ij}(x) = \frac{\partial f_i}{\partial x_j}(x). $$

Finite differences approximate one column at a time:

$$ J(x)\, e_j \approx \frac{f(x + h e_j) - f(x)}{h}. $$

Each perturbation gives the effect of one input coordinate on all output coordinates. Therefore a full forward-difference Jacobian requires $n + 1$ function evaluations, regardless of the output dimension $m$.
This is useful when $n$ is modest and each function evaluation is cheap. It is poor when the input dimension is large or the function evaluation is expensive.
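The column-at-a-time construction can be sketched directly (again list-based for clarity):

```python
def fd_jacobian(f, x, h=1e-8):
    # Column j of the m-by-n Jacobian is (f(x + h*e_j) - f(x)) / h.
    base = f(x)
    m, n = len(base), len(x)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp = list(x)
        xp[j] += h
        col = f(xp)
        for i in range(m):
            jac[i][j] = (col[i] - base[i]) / h
    return jac

# Illustration: f(x, y) = (x*y, x + y) has Jacobian [[y, x], [1, 1]].
jac = fd_jacobian(lambda v: [v[0] * v[1], v[0] + v[1]], [2.0, 3.0])
```

Note that one perturbed evaluation fills an entire column: the cost depends on the number of inputs, not the number of outputs.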
Choosing the Step Size
The step size $h$ should be small enough to approximate local behavior and large enough to avoid cancellation. A common heuristic for forward differences in double precision is

$$ h \approx \sqrt{\epsilon} \approx 10^{-8}, $$

where $\epsilon$ is machine precision.

For central differences, a common heuristic is

$$ h \approx \epsilon^{1/3} \approx 6 \times 10^{-6}. $$

These are only heuristics. The right scale depends on the function, the input magnitude, the curvature, and the floating point behavior of the computation.

A single global $h$ is often wrong for vector inputs. Different variables may have different units and scales. Perturbing a temperature, a probability, a mass, and a neural network weight by the same absolute amount can produce meaningless results.

A more practical coordinatewise rule is

$$ h_i = \epsilon_r \, (1 + \lvert x_i \rvert), $$

where $\epsilon_r$ is a small dimensionless parameter.
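A sketch of the coordinatewise rule, assuming $\epsilon_r$ defaults to the square root of machine epsilon (one common choice, not the only one):

```python
import sys

def coordinate_steps(x, eps_r=None):
    # Per-coordinate steps scaled to each coordinate's magnitude; the
    # (1 + |x_i|) factor keeps the step reasonable when x_i is near zero.
    if eps_r is None:
        eps_r = sys.float_info.epsilon ** 0.5  # ~1.5e-8 in double precision
    return [eps_r * (1.0 + abs(xi)) for xi in x]

# Mixed scales: a large coordinate gets a proportionally larger step.
steps = coordinate_steps([300.0, 1e-6, -40.0])
```

Each coordinate is perturbed relative to its own magnitude, so a weight near 300 and a probability near $10^{-6}$ are no longer forced to share one absolute step.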
Even this rule can fail when the function has thresholds, discontinuities, saturating nonlinearities, stochastic components, or internal iterative solvers.
Cancellation
The core numerical failure in finite differences is cancellation.
Suppose $f(x + h)$ and $f(x)$ agree in many leading digits. Their difference may contain only a few reliable digits. For example, if two double precision numbers are close, subtracting them can erase most of the significant information.

The derivative estimate then divides this small difference by $h$, amplifying any roundoff error.
This failure becomes severe when the true derivative is small, the function value is large, or the perturbation is below the scale at which the program responds numerically.
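The digit loss can be observed directly in double precision:

```python
# Two doubles that agree in the first 12 significant digits.
a = 1.0 + 1e-12
b = 1.0
diff = a - b
# The subtraction itself is exact here, but a already lost most digits of
# the 1e-12 increment when it was rounded into 1.0's floating point grid:
# only a few digits of diff are reliable.
rel_err = abs(diff - 1e-12) / 1e-12
```

The relative error of `diff` is on the order of $10^{-4}$ even though each input carries about 16 digits; dividing by a small $h$ passes that damage straight into the derivative estimate.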
Automatic differentiation avoids this specific subtraction problem because it propagates derivative values through primitive operations directly. It computes local derivative information during evaluation rather than subtracting nearly equal function values after evaluation.
Discontinuities and Non-Smooth Points
Finite differences can also mislead near non-smooth points.
Consider

$$ f(x) = \lvert x \rvert. $$

At $x = 0$, the derivative does not exist. But the central difference gives

$$ \frac{f(0 + h) - f(0 - h)}{2h} = \frac{h - h}{2h} = 0. $$

This number may look like a derivative, but it is an artifact of the symmetric formula.

The forward difference gives

$$ \frac{f(0 + h) - f(0)}{h} = 1 $$

for positive $h$, while a backward difference gives $-1$. Different formulas produce different answers because the mathematical derivative is undefined.
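A quick sketch reproduces all three answers at the kink of the built-in abs:

```python
def central(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

def forward(f, x, h):
    return (f(x + h) - f(x)) / h

h = 1e-6
c = central(abs, 0.0, h)     # symmetric samples cancel exactly: 0.0
fwd = forward(abs, 0.0, h)   # samples only the right branch: 1.0
bwd = forward(abs, 0.0, -h)  # samples only the left branch: -1.0
```

Three plausible-looking numbers, none of them a derivative: each formula reports the geometry of the points it happened to sample.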
Automatic differentiation does not magically fix non-smoothness. It follows the derivative rule of the executed branch or primitive operation. The key difference is that AD makes the computational derivative explicit, while finite differences approximate behavior around the point.
Stochastic and Stateful Functions
Finite differences assume repeated function evaluations are comparable. This assumption fails when the function is stochastic or stateful.
For example:
    function f(x):
        noise = random_normal()
        return x * x + noise

The finite difference

    (f(x + h) - f(x)) / h

subtracts two different noise samples unless randomness is controlled. The derivative estimate may be dominated by noise.
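A sketch of the problem and one mitigation, common random numbers; the seeds and the unit noise scale are illustrative:

```python
import random

def f(x, rng):
    # Stochastic function: each call adds a fresh noise sample.
    return x * x + rng.gauss(0.0, 1.0)

h = 1e-4

# Shared generator: the two calls draw different noise samples, so the
# difference quotient carries a noise term of order 1/h.
rng = random.Random(0)
noisy = (f(1.0 + h, rng) - f(1.0, rng)) / h

# Common random numbers: identical seeds make the noise cancel exactly,
# recovering an estimate of the true derivative 2*x.
paired = (f(1.0 + h, random.Random(123)) - f(1.0, random.Random(123))) / h
```

Pinning the randomness turns the two calls back into evaluations of one deterministic function, which is what the finite difference formula assumes.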
State creates a similar problem:
    function f(x):
        counter = counter + 1
        return x * counter

Two evaluations do not measure the same mathematical function. They measure different states of a computation.
Automatic differentiation also requires discipline around randomness and state, but it differentiates a single execution trace. This makes the derivative meaning clearer: the derivative belongs to the computation that actually ran.
Finite Differences as a Testing Tool
Despite its weaknesses, numerical differentiation remains useful. Its best role is testing.
When implementing automatic differentiation, we often compare AD results against finite difference estimates for small examples. This practice is called gradient checking.
A typical test checks whether

$$ \frac{f(x + h v) - f(x - h v)}{2h} $$

is close to

$$ \nabla f(x) \cdot v, $$

where $v$ is a chosen direction.

This directional test is cheaper than checking every coordinate. It uses one perturbation direction instead of $n$ coordinate perturbations.

For vector-valued functions, a similar check compares

$$ \frac{f(x + h v) - f(x - h v)}{2h} $$

with

$$ J(x)\, v. $$

This verifies a Jacobian-vector product without materializing the full Jacobian.
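A directional gradient check might look like the following sketch; `f`, `grad_f`, the test point, and the direction are illustrative, with `grad_f` standing in for the AD result under test:

```python
def f(x):
    # Hypothetical scalar function under test.
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def grad_f(x):
    # Gradient implementation being checked (e.g. the output of an AD tool).
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]

def directional_check(f, grad, x, v, h=1e-6):
    # Compare a central difference along v with the analytic grad(x) . v.
    xp = [xi + h * vi for xi, vi in zip(x, v)]
    xm = [xi - h * vi for xi, vi in zip(x, v)]
    fd = (f(xp) - f(xm)) / (2 * h)
    analytic = sum(g * vi for g, vi in zip(grad(x), v))
    return abs(fd - analytic)

err = directional_check(f, grad_f, [1.0, 2.0], [0.5, -1.0])
# A large err would signal a bug in grad_f.
```

One random direction catches most gradient bugs at the cost of two function evaluations, which is why this is the standard smoke test for hand-written or AD-generated derivatives.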
Finite differences are therefore useful as an independent sanity check. They are usually unsuitable as the main derivative engine for large systems.
Why Numerical Differentiation Is Not Enough
Numerical differentiation has three structural limits.
It is approximate. The result depends on the perturbation size, floating point precision, and local smoothness.
It scales poorly. Full gradients and Jacobians require many function evaluations when the input dimension is large.
It treats programs as black boxes. It ignores the internal chain of operations, so it cannot exploit the structure that makes derivative computation efficient.
Automatic differentiation addresses these limits by opening the black box. It records or transforms the computation, applies exact local derivative rules, and combines them through the chain rule.
Finite differences ask, “What happens if I rerun the program nearby?”
Automatic differentiation asks, “How does change flow through this execution?”
That difference is the reason automatic differentiation is the standard tool for derivative computation in modern numerical software.