Automatic differentiation is built on a simple observation: a complicated derivative can be computed by composing many small local derivatives. Instead of manipulating a full symbolic expression, or approximating derivatives numerically, AD propagates derivative information through each elementary operation in a program.
This section develops the central mechanism underlying all forms of automatic differentiation: local derivative propagation.
## Programs as Compositions

Consider a function:

$$f(x_1, x_2) = x_1 x_2 + \sin(x_1)$$

This function is not evaluated as a single indivisible object. A program computes it as a sequence of intermediate operations:

| Step | Expression | Variable |
|---|---|---|
| 1 | $x_1 x_2$ | $v_1$ |
| 2 | $\sin(x_1)$ | $v_2$ |
| 3 | $v_1 + v_2$ | $v_3$ |
| 4 | $v_3$ | $y$ |

The final result is:

$$y = v_3 = x_1 x_2 + \sin(x_1)$$
Each operation has a local derivative rule:

| Operation | Local Derivative |
|---|---|
| $v = a b$ | $\partial v/\partial a = b$, $\quad \partial v/\partial b = a$ |
| $v = \sin(a)$ | $\partial v/\partial a = \cos(a)$ |
| $v = a + b$ | $\partial v/\partial a = 1$, $\quad \partial v/\partial b = 1$ |
Automatic differentiation applies these rules incrementally across the computation graph.
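As a minimal sketch, the step-by-step evaluation can be written as straight-line code, one primitive operation per line (using $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ as the running example; the variable names `v1`–`v3` mirror the trace):

```python
import math

def f(x1, x2):
    # One primitive operation per line, as in the evaluation trace.
    v1 = x1 * x2       # multiplication
    v2 = math.sin(x1)  # sine
    v3 = v1 + v2       # addition
    y = v3             # final result
    return y
```

Each line is a node an AD system can attach a local derivative rule to.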
## The Chain Rule as Propagation

The chain rule connects local derivatives into global derivatives.

For nested functions:

$$y = f(g(x))$$

the derivative becomes:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$$
Automatic differentiation operationalizes this rule mechanically.
Instead of reasoning symbolically about the full expression, the system computes:
- local values
- local derivatives
- propagated sensitivities
This decomposition is the key idea.
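A small numeric sketch of this decomposition, with $g(x) = x^2$ and $f(u) = \sin(u)$ chosen as an illustrative pair: each quantity below is a plain number, and the global derivative is just a product of local pieces.

```python
import math

x = 1.5

# Local values
g_val = x * x            # inner function g(x) = x^2
y_val = math.sin(g_val)  # outer function f(u) = sin(u)

# Local derivatives
dg_dx = 2.0 * x          # g'(x)
df_dg = math.cos(g_val)  # f'(g(x))

# Propagated sensitivity: the chain rule as a product of local pieces
dy_dx = df_dg * dg_dx
```

No symbolic expression for $\sin(x^2)$ is ever formed; only local numbers flow.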
## Computational Graph Representation
A program can be represented as a directed acyclic graph (DAG).
Nodes represent operations or variables.
Edges represent data dependencies.
For:

$$f(x_1, x_2) = x_1 x_2 + \sin(x_1)$$

the graph contains:
- input nodes
- multiplication node
- sine node
- addition node
Derivative information flows along graph edges.
The graph separates:
- primal computation
- derivative propagation
This separation allows AD systems to transform programs systematically.
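A minimal sketch of this separation: the DAG below encodes $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ as data, and a separate interpreter performs the primal computation. The tuple encoding and names here are illustrative, not a standard format.

```python
import math

# Node = (operation, parent indices); input nodes have no parents.
GRAPH = [
    ("input", ()),     # node 0: x1
    ("input", ()),     # node 1: x2
    ("mul", (0, 1)),   # node 2: x1 * x2
    ("sin", (0,)),     # node 3: sin(x1)
    ("add", (2, 3)),   # node 4: output
]

def evaluate(graph, inputs):
    """Primal pass: walk the DAG in topological order."""
    values = list(inputs)
    for op, parents in graph[len(inputs):]:
        args = [values[p] for p in parents]
        if op == "mul":
            values.append(args[0] * args[1])
        elif op == "add":
            values.append(args[0] + args[1])
        elif op == "sin":
            values.append(math.sin(args[0]))
    return values[-1]
```

A derivative pass would walk the same edges, forward or backward, without touching the primal interpreter.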
## Local Jacobians
Each operation defines a small Jacobian relating outputs to inputs.
For scalar operations, these are ordinary derivatives.
For vector operations, they become matrices.
Example:

$$v = x_1 x_2$$

Local derivatives:

$$\frac{\partial v}{\partial x_1} = x_2, \qquad \frac{\partial v}{\partial x_2} = x_1$$

The local Jacobian is:

$$J = \begin{bmatrix} x_2 & x_1 \end{bmatrix}$$
AD systems never need to materialize the full global Jacobian explicitly. Instead, they propagate products involving local Jacobians.
This distinction is critical for scalability.
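For instance, a Jacobian–vector product for the multiplication node $v = x_1 x_2$ never builds the $1 \times 2$ Jacobian $[x_2 \ \ x_1]$; it only combines the local derivatives with an incoming tangent vector (a sketch, with illustrative names):

```python
def mul_jvp(x1, x2, t1, t2):
    # (dv/dx1)*t1 + (dv/dx2)*t2, without materializing the Jacobian [x2, x1]
    return x2 * t1 + x1 * t2

# Seeding the tangent along x1 recovers dv/dx1 = x2:
d = mul_jvp(2.0, 3.0, 1.0, 0.0)
```

For a node with $m$ outputs and $n$ inputs, the same pattern avoids ever storing an $m \times n$ matrix.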
## Forward Propagation of Derivatives
In forward mode, derivatives move in the same direction as computation.
Suppose:

$$v = \phi(u)$$

Then:

$$\dot{v} = \phi'(u) \, \dot{u}$$

The dot notation denotes a tangent value.
Every variable carries:
- its primal value
- its derivative value
Evaluation proceeds step-by-step.
Example:

$$f(x) = x^2$$

with $x = 3$.

Primal computation:

| Variable | Value |
|---|---|
| $x$ | 3 |
| $v_1 = x \cdot x$ | 9 |

Derivative propagation:

| Variable | Derivative |
|---|---|
| $\dot{x}$ | 1 |
| $\dot{v}_1 = \dot{x} \, x + x \, \dot{x}$ | 6 |

Final derivative:

$$\frac{df}{dx}\bigg|_{x=3} = 6$$
No symbolic algebra is required.
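This tangent-carrying evaluation is exactly what a dual-number implementation does. A minimal sketch (the `Dual` class and `sin` wrapper are illustrative, not a standard API):

```python
import math

class Dual:
    """A primal value paired with its tangent (the dotted value)."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(u):
    # Tangent rule for sine: cos(u) * udot
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

# d/dx x^2 at x = 3: seed the tangent with xdot = 1
x = Dual(3.0, 1.0)
v = x * x
```

After evaluation, `v.val` holds the primal value 9 and `v.dot` the derivative 6, matching the tables above.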
## Reverse Propagation of Derivatives
Reverse mode propagates sensitivities backward.
Instead of computing:

$$\dot{v} = \frac{\partial v}{\partial x}$$

it computes:

$$\bar{v} = \frac{\partial y}{\partial v}$$

These quantities are called adjoints.
Suppose:

$$y = x_1 x_2 + \sin(x_1)$$

Forward pass:

- compute intermediate values
- store them

Reverse pass:

- propagate gradients backward

Intermediate variables:

$$v_1 = x_1 x_2, \qquad v_2 = \sin(x_1), \qquad y = v_1 + v_2$$

Backward propagation:

$$\bar{y} = 1, \qquad \bar{v}_1 = \bar{y}, \qquad \bar{v}_2 = \bar{y}, \qquad \bar{x}_2 = \bar{v}_1 \, x_1, \qquad \bar{x}_1 = \bar{v}_1 \, x_2 + \bar{v}_2 \cos(x_1)$$

Final result:

$$\frac{\partial y}{\partial x_1} = x_2 + \cos(x_1), \qquad \frac{\partial y}{\partial x_2} = x_1$$
Reverse mode is especially efficient when:
- outputs are few
- inputs are many
This property explains its dominance in deep learning.
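A minimal reverse-mode sketch (the `Var` class is illustrative, not a standard API): each node records its parents together with the local derivatives on those edges, the forward pass stores values, and `backward()` pushes adjoints toward the inputs.

```python
import math

class Var:
    """A value plus an adjoint slot and local derivatives to parents."""
    def __init__(self, val, parents=()):
        self.val = val
        self.grad = 0.0
        self.parents = parents  # tuples of (parent Var, local derivative)

    def backward(self, seed=1.0):
        # Accumulate the incoming adjoint, then propagate along each edge.
        # (Simple recursion; a production system would topologically sort.)
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

def mul(a, b):
    return Var(a.val * b.val, ((a, b.val), (b, a.val)))

def add(a, b):
    return Var(a.val + b.val, ((a, 1.0), (b, 1.0)))

def sin(a):
    return Var(math.sin(a.val), ((a, math.cos(a.val)),))

# Gradient of f(x1, x2) = x1*x2 + sin(x1)
x1, x2 = Var(2.0), Var(3.0)
y = add(mul(x1, x2), sin(x1))
y.backward()
```

One backward sweep fills in every input adjoint: `x1.grad` holds $x_2 + \cos(x_1)$ and `x2.grad` holds $x_1$, both from a single traversal.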
## Propagation Rules for Elementary Operations
Each primitive operation defines propagation equations.
Addition: $v = a + b$

Forward:

$$\dot{v} = \dot{a} + \dot{b}$$

Reverse:

$$\bar{a} \mathrel{+}= \bar{v}, \qquad \bar{b} \mathrel{+}= \bar{v}$$

Multiplication: $v = a b$

Forward:

$$\dot{v} = \dot{a} \, b + a \, \dot{b}$$

Reverse:

$$\bar{a} \mathrel{+}= \bar{v} \, b, \qquad \bar{b} \mathrel{+}= \bar{v} \, a$$

Sine: $v = \sin(a)$

Forward:

$$\dot{v} = \cos(a) \, \dot{a}$$

Reverse:

$$\bar{a} \mathrel{+}= \bar{v} \cos(a)$$
These rules form the operational core of AD engines.
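These rules can be collected into a small table-driven registry, which is the shape such a core often takes; the structure and names here are illustrative, not taken from any particular engine.

```python
import math

# For each primitive: the primal computation, the forward (tangent) rule,
# and the reverse (adjoint) rule returning one contribution per input.
RULES = {
    "add": dict(
        primal=lambda a, b: a + b,
        forward=lambda a, b, da, db: da + db,
        reverse=lambda a, b, vbar: (vbar, vbar),
    ),
    "mul": dict(
        primal=lambda a, b: a * b,
        forward=lambda a, b, da, db: da * b + a * db,
        reverse=lambda a, b, vbar: (vbar * b, vbar * a),
    ),
    "sin": dict(
        primal=lambda a: math.sin(a),
        forward=lambda a, da: math.cos(a) * da,
        reverse=lambda a, vbar: (vbar * math.cos(a),),
    ),
}
```

Both modes then reduce to a loop over the trace, looking up the appropriate rule per operation.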
## Locality and Modularity
A major advantage of AD is locality.
Each operation only needs:
- local inputs
- local outputs
- local derivative rules
An operation never needs knowledge of the full program.
This locality enables:
- composability
- compiler transformations
- efficient runtime systems
- extensible operator libraries
New differentiable operations can be added independently.
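For example, registering a new primitive such as `exp` only requires stating its own local rule; no other operation is touched. A sketch with an illustrative registry:

```python
import math

# Local derivative rules for unary primitives.
LOCAL_RULES = {
    "sin": lambda a: math.cos(a),
}

def tangent(op, a, a_dot):
    # Forward propagation for a unary primitive:
    # local derivative times incoming tangent.
    return LOCAL_RULES[op](a) * a_dot

# Adding a new differentiable operation touches only its own entry:
LOCAL_RULES["exp"] = lambda a: math.exp(a)
```

The propagation machinery (`tangent` here) is unchanged by the addition.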
## Exactness and Floating Point Semantics
Automatic differentiation computes derivatives accurate up to machine precision.
This differs fundamentally from finite differences.
Finite difference approximation:

$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$

introduces:
- truncation error
- cancellation error
- sensitivity to step size
AD avoids these issues because it differentiates the computational procedure itself.
The derivative computation is exact relative to the executed floating point operations.
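A quick numeric illustration of the contrast, using $f(x) = \sin(x^2)$ as an example; the "exact" value below is the chain-rule derivative that forward or reverse mode would produce, up to rounding:

```python
import math

def f(x):
    return math.sin(x * x)

def finite_difference(f, x, h):
    # Forward difference: truncation error O(h); tiny h causes cancellation.
    return (f(x + h) - f(x)) / h

x = 1.5
exact = 2.0 * x * math.cos(x * x)  # what AD computes, up to rounding

err_coarse = abs(finite_difference(f, x, 1e-1) - exact)  # truncation dominates
err_good   = abs(finite_difference(f, x, 1e-8) - exact)  # near-optimal h
# Shrinking h further makes the error grow again due to cancellation.
```

The finite-difference user must tune `h` per problem; AD has no such parameter.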
## Complexity Perspective

Suppose a program requires $n$ operations.

Forward mode derivative computation costs approximately:

$$c \cdot n$$

for one directional derivative, where $c$ is a small constant.
Reverse mode computes gradients with cost proportional to a small multiple of the original computation.
This efficiency is one of the most important results in automatic differentiation theory.
For scalar-output functions:

$$f : \mathbb{R}^n \to \mathbb{R}$$

reverse mode computes the full gradient in roughly the same asymptotic complexity as evaluating the function itself.
## Local Propagation as the Core Abstraction
Everything in automatic differentiation reduces to local propagation.
Regardless of:
- programming language
- compiler strategy
- tensor system
- graph representation
- runtime architecture
the core mechanism remains unchanged:
- decompose computation into primitive operations
- associate local derivative rules
- propagate derivative information through the graph
Forward mode propagates tangents.
Reverse mode propagates adjoints.
Higher-order systems propagate richer algebraic structures.
But all of them inherit the same foundational principle: local derivative propagation through compositional programs.