Automatic differentiation reduces differentiation to a finite collection of elementary operations. Every program, regardless of complexity, is decomposed into primitive computational steps with known local derivative rules.
An AD system therefore requires two components:
- a representation of computation as primitive operations
- derivative propagation rules for each primitive
This section formalizes these elementary operations and shows how derivative rules are attached to them.
Primitive Operations
A primitive operation is one whose derivative rule is known directly, rather than derived from other operations.
Typical primitives include:
- arithmetic operations
- transcendental functions
- tensor primitives
- control primitives
- linear algebra kernels
Examples:
| Category | Operations |
|---|---|
| Arithmetic | add, sub, mul, div, neg |
| Power | pow, sqrt |
| Exponential | exp, log |
| Trigonometric | sin, cos, tan |
| Hyperbolic | sinh, cosh, tanh |
| Comparison | min, max, abs |
| Tensor | reshape, transpose, broadcast |
| Linear algebra | matmul, solve, svd |
Complex functions are compositions of these primitives.
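To make this concrete, here is a minimal sketch (plain Python, hypothetical variable names) of how an AD tracer would decompose $f(x) = x^2 \sin(x)$ into primitive steps:

```python
import math

# Hypothetical decomposition of f(x) = x^2 * sin(x) into primitives.
# An AD system records each step together with its local derivative rule.
def f_traced(x):
    v1 = x * x          # primitive: mul
    v2 = math.sin(x)    # primitive: sin
    v3 = v1 * v2        # primitive: mul
    return v3
```

Each intermediate `v1`, `v2`, `v3` is produced by exactly one primitive, so differentiating `f_traced` reduces to chaining three known local rules.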
The Local Differentiation Principle
Each primitive operation defines:
- output values
- local derivative transformations
Suppose a primitive computes $y = f(x_1, \dots, x_n)$.
The operation defines a local Jacobian:

$$J_{ij} = \frac{\partial y_i}{\partial x_j}$$

AD systems propagate derivatives through compositions of these local Jacobians: forward mode pushes a tangent $\dot{x}$ through $\dot{y} = J\dot{x}$, while reverse mode pulls an adjoint $\bar{y}$ back through $\bar{x} = J^\top\bar{y}$.
The global derivative is never derived symbolically.
Throughout this section, $\dot{x}$ denotes a forward-mode tangent and $\bar{x}$ a reverse-mode adjoint; when an input feeds several operations, its adjoint contributions are summed, so the rules below show a single contribution.
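As a concrete instance, take $y = \sin(u)$ with $u = x^2$. The two local rules $\partial u / \partial x = 2x$ and $\partial y / \partial u = \cos(u)$ compose numerically to $dy/dx = \cos(x^2) \cdot 2x$; no global symbolic expression is ever formed.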
Unary Operations
Unary operations map one input to one output.
Negation

$$y = -x$$

Derivative:

$$\frac{dy}{dx} = -1$$

Forward propagation:

$$\dot{y} = -\dot{x}$$

Reverse propagation:

$$\bar{x} = -\bar{y}$$
Reciprocal

$$y = \frac{1}{x}$$

Derivative:

$$\frac{dy}{dx} = -\frac{1}{x^2} = -y^2$$

Forward:

$$\dot{y} = -y^2\,\dot{x}$$

Reverse:

$$\bar{x} = -y^2\,\bar{y}$$
Square Root

$$y = \sqrt{x}$$

Derivative:

$$\frac{dy}{dx} = \frac{1}{2\sqrt{x}} = \frac{1}{2y}$$

Forward:

$$\dot{y} = \frac{\dot{x}}{2y}$$

Reverse:

$$\bar{x} = \frac{\bar{y}}{2y}$$
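A minimal forward-mode sketch, assuming a hypothetical `Dual` type that pairs a value with its tangent; the three unary rules above translate directly:

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    val: float  # primal value
    tan: float  # forward-mode tangent

def neg(a: Dual) -> Dual:
    return Dual(-a.val, -a.tan)

def reciprocal(a: Dual) -> Dual:
    y = 1.0 / a.val
    return Dual(y, -y * y * a.tan)     # d(1/x)/dx = -1/x^2 = -y^2

def sqrt(a: Dual) -> Dual:
    y = math.sqrt(a.val)
    return Dual(y, a.tan / (2.0 * y))  # d(sqrt x)/dx = 1/(2 sqrt x)
```

For example, `sqrt(Dual(4.0, 1.0))` yields `Dual(val=2.0, tan=0.25)`, matching $1/(2\sqrt{4})$.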
Binary Operations
Binary operations combine two inputs.
Addition

$$y = a + b$$

Local derivatives:

$$\frac{\partial y}{\partial a} = 1, \qquad \frac{\partial y}{\partial b} = 1$$

Forward:

$$\dot{y} = \dot{a} + \dot{b}$$

Reverse:

$$\bar{a} = \bar{y}, \qquad \bar{b} = \bar{y}$$
Subtraction

$$y = a - b$$

Forward:

$$\dot{y} = \dot{a} - \dot{b}$$

Reverse:

$$\bar{a} = \bar{y}, \qquad \bar{b} = -\bar{y}$$
Multiplication

$$y = ab$$

Local derivatives:

$$\frac{\partial y}{\partial a} = b, \qquad \frac{\partial y}{\partial b} = a$$

Forward:

$$\dot{y} = b\,\dot{a} + a\,\dot{b}$$

Reverse:

$$\bar{a} = b\,\bar{y}, \qquad \bar{b} = a\,\bar{y}$$
Division

$$y = \frac{a}{b}$$

Local derivatives:

$$\frac{\partial y}{\partial a} = \frac{1}{b}, \qquad \frac{\partial y}{\partial b} = -\frac{a}{b^2}$$

Forward:

$$\dot{y} = \frac{\dot{a}}{b} - \frac{a\,\dot{b}}{b^2}$$

Reverse:

$$\bar{a} = \frac{\bar{y}}{b}, \qquad \bar{b} = -\frac{a\,\bar{y}}{b^2}$$
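The same rules in reverse mode, as a sketch assuming a hypothetical `Var` node that stores its value, an accumulated adjoint, and a backward closure:

```python
# Hypothetical reverse-mode node: each primitive builds a closure that
# turns the output adjoint into adjoint contributions for its inputs.
class Var:
    def __init__(self, val):
        self.val = val
        self.grad = 0.0
        self.backward = lambda: None

def mul(a: Var, b: Var) -> Var:
    y = Var(a.val * b.val)
    def backward():
        a.grad += b.val * y.grad              # ∂y/∂a = b
        b.grad += a.val * y.grad              # ∂y/∂b = a
    y.backward = backward
    return y

def div(a: Var, b: Var) -> Var:
    y = Var(a.val / b.val)
    def backward():
        a.grad += y.grad / b.val              # ∂y/∂a = 1/b
        b.grad += -a.val * y.grad / b.val**2  # ∂y/∂b = -a/b^2
    y.backward = backward
    return y
```

Adjoints accumulate with `+=` because a single input may feed several downstream operations.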
Exponential and Logarithmic Operations
Exponential

$$y = e^x$$

Derivative:

$$\frac{dy}{dx} = e^x = y$$

Forward:

$$\dot{y} = y\,\dot{x}$$

Reverse:

$$\bar{x} = y\,\bar{y}$$
Natural Logarithm

$$y = \ln x$$

Derivative:

$$\frac{dy}{dx} = \frac{1}{x}$$

Forward:

$$\dot{y} = \frac{\dot{x}}{x}$$

Reverse:

$$\bar{x} = \frac{\bar{y}}{x}$$
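The exponential is a useful special case: its derivative is its own output, so an implementation can cache the primal result and reuse it in the backward pass. A sketch using the hypothetical `Var` class from the earlier binary-operation example:

```python
import math

def exp(a: Var) -> Var:
    y = Var(math.exp(a.val))
    def backward():
        a.grad += y.val * y.grad  # d(exp x)/dx = exp x: reuse the output
    y.backward = backward
    return y
```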
Trigonometric Operations
Sine

$$y = \sin x$$

Derivative:

$$\frac{dy}{dx} = \cos x$$

Forward:

$$\dot{y} = \cos(x)\,\dot{x}$$

Reverse:

$$\bar{x} = \cos(x)\,\bar{y}$$
Cosine

$$y = \cos x$$

Derivative:

$$\frac{dy}{dx} = -\sin x$$

Forward:

$$\dot{y} = -\sin(x)\,\dot{x}$$

Reverse:

$$\bar{x} = -\sin(x)\,\bar{y}$$
Tangent

$$y = \tan x$$

Derivative:

$$\frac{dy}{dx} = \frac{1}{\cos^2 x} = 1 + y^2$$

Forward:

$$\dot{y} = (1 + y^2)\,\dot{x}$$

Reverse:

$$\bar{x} = (1 + y^2)\,\bar{y}$$
Power Operations
Constant Exponent

$$y = x^c$$

Derivative:

$$\frac{dy}{dx} = c\,x^{c-1}$$

Forward:

$$\dot{y} = c\,x^{c-1}\,\dot{x}$$

Reverse:

$$\bar{x} = c\,x^{c-1}\,\bar{y}$$
Variable Exponent

$$y = a^b = e^{b \ln a} \quad (a > 0)$$

This operation depends on both variables.
Derivative identities:

$$\frac{\partial y}{\partial a} = b\,a^{b-1}, \qquad \frac{\partial y}{\partial b} = y \ln a$$

Forward:

$$\dot{y} = b\,a^{b-1}\,\dot{a} + y \ln(a)\,\dot{b}$$

Reverse:

$$\bar{a} = b\,a^{b-1}\,\bar{y}, \qquad \bar{b} = y \ln(a)\,\bar{y}$$
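Because both partials are needed, a pow primitive must propagate into both inputs; a sketch in the same hypothetical `Var` style:

```python
import math

def pow_(a: Var, b: Var) -> Var:
    # Assumes a.val > 0 so that log(a.val) is defined.
    y = Var(a.val ** b.val)
    def backward():
        a.grad += b.val * a.val ** (b.val - 1.0) * y.grad  # ∂y/∂a = b a^(b-1)
        b.grad += y.val * math.log(a.val) * y.grad         # ∂y/∂b = y ln a
    y.backward = backward
    return y
```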
Vector-Valued Operations
Primitive operations may produce vectors or tensors.
Example:

$$y = Ax$$

where $A \in \mathbb{R}^{m \times n}$ is a constant matrix and $x \in \mathbb{R}^n$.
Derivative: the local Jacobian is $A$ itself.
Forward:

$$\dot{y} = A\,\dot{x}$$

Reverse:

$$\bar{x} = A^\top\,\bar{y}$$
Reverse mode naturally introduces transposed operators.
This is one reason reverse mode aligns well with linear algebra systems.
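A NumPy sketch (illustrative matrix and helper names) showing how forward mode applies $A$ while reverse mode applies $A^\top$:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])           # linear map R^2 -> R^3

def jvp(x_dot):
    # Forward mode: push a tangent through the Jacobian (A itself).
    return A @ x_dot

def vjp(y_bar):
    # Reverse mode: pull an adjoint back through the transposed Jacobian.
    return A.T @ y_bar

print(jvp(np.array([1.0, 0.0])))     # [1. 3. 5.] (first column of A)
print(vjp(np.ones(3)))               # [ 9. 12.] (column sums of A)
```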
Tensor Operations
Modern AD systems require derivatives for tensor primitives.
Broadcast
Broadcasting conceptually replicates values along expanded dimensions.
Reverse propagation must reduce along broadcast axes.
Example: a vector $b \in \mathbb{R}^n$ broadcast across the $m$ rows of a matrix yields $Y_{ij} = b_j$.
Backward propagation accumulates gradients over the replicated axis:

$$\bar{b}_j = \sum_{i=1}^{m} \bar{Y}_{ij}$$
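The same reduction in NumPy, with an illustrative upstream adjoint:

```python
import numpy as np

m, n = 4, 3
b = np.array([1.0, 2.0, 3.0])
Y = np.broadcast_to(b, (m, n))   # forward: Y[i, j] = b[j]

Y_bar = np.ones((m, n))          # example upstream adjoint
b_bar = Y_bar.sum(axis=0)        # backward: reduce over the broadcast axis
print(b_bar)                     # [4. 4. 4.]: each b[j] fed m outputs
```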
Reshape
Reshape changes layout without changing values.
Derivative propagation only changes metadata.
No arithmetic transformation occurs.
Transpose
Transpose permutes axes without changing values.
Backward rule: apply the inverse permutation to the adjoint. For a matrix transpose $Y = X^\top$:

$$\bar{X} = \bar{Y}^\top$$
Linear Algebra Kernels
Efficient AD systems treat matrix operations as primitives.
Matrix Multiplication
Forward: for $C = AB$,

$$\dot{C} = \dot{A}\,B + A\,\dot{B}$$

Reverse:

$$\bar{A} = \bar{C}\,B^\top, \qquad \bar{B} = A^\top\,\bar{C}$$
These rules are fundamental in neural network training.
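A NumPy sketch of the reverse rule, with a finite-difference spot check of one entry (random illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C_bar = rng.standard_normal((3, 2))      # upstream adjoint for C = A @ B

A_bar = C_bar @ B.T                      # adjoint of A
B_bar = A.T @ C_bar                      # adjoint of B

# Check dL/dA[0,0] by finite differences, where L = sum(C_bar * (A @ B)).
eps = 1e-6
A2 = A.copy()
A2[0, 0] += eps
fd = (np.sum(C_bar * (A2 @ B)) - np.sum(C_bar * (A @ B))) / eps
print(np.isclose(A_bar[0, 0], fd, atol=1e-4))  # True
```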
Local Derivative Tables
AD implementations often store derivative rules in dispatch tables.
Conceptually:
| Operation | Forward Rule | Reverse Rule |
|---|---|---|
| add | sum input tangents | distribute adjoint to both inputs |
| mul | product rule | weighted accumulation |
| exp | multiply by output | multiply by output |
| sin | multiply by cosine | multiply by cosine |
The runtime evaluates:
- primal computation
- local derivative rule
- propagation
This separation allows extensibility.
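A sketch of such a table in Python, with hypothetical rule signatures: each entry maps an operation name to its primal function and a reverse rule that converts the output adjoint into input adjoints:

```python
import math

# Hypothetical dispatch table: op name -> (primal, reverse rule).
# Reverse rules map (inputs, output, output adjoint) to input adjoints.
VJP_RULES = {
    "add": (lambda a, b: a + b,
            lambda inputs, y, g: (g, g)),
    "mul": (lambda a, b: a * b,
            lambda inputs, y, g: (inputs[1] * g, inputs[0] * g)),
    "exp": (lambda a: math.exp(a),
            lambda inputs, y, g: (y * g,)),
    "sin": (lambda a: math.sin(a),
            lambda inputs, y, g: (math.cos(inputs[0]) * g,)),
}

# Extensibility: a new primitive is just another table entry.
VJP_RULES["neg"] = (lambda a: -a,
                    lambda inputs, y, g: (-g,))
```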
Primitive Sets and Closure
An AD system only differentiates operations whose local rules are defined.
If every operation in a program belongs to the primitive set, then the entire program becomes differentiable under composition.
This closure property is central:
- differentiable programs are built compositionally
- local rules induce global derivatives
The entire AD framework depends on this compositional closure.
Numerical Stability of Primitive Rules
Derivative rules may amplify numerical instability.
Example: the square-root rule

$$\frac{d}{dx}\sqrt{x} = \frac{1}{2\sqrt{x}}$$

Near zero:
- gradients explode
- floating point error increases
Similarly:

$$\frac{d}{dx}\ln x = \frac{1}{x}$$

becomes unstable near zero.
AD systems therefore require:
- stable primitive implementations
- numerically safe kernels
- domain checks
- fused operations
Modern deep learning systems often implement stabilized primitives directly:
- logsumexp
- softmax-crossentropy fusion
- stable normalization kernels
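For instance, a numerically stable logsumexp shifts by the maximum before exponentiating, which keeps both the primal value and its gradient (the softmax) finite; a minimal NumPy sketch:

```python
import numpy as np

def logsumexp(x):
    # Shift by the max so exp() cannot overflow; the value is unchanged
    # because log(sum(exp(x))) = m + log(sum(exp(x - m))).
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def logsumexp_grad(x):
    # The gradient of logsumexp is softmax, computed in shifted form.
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([1000.0, 1000.0])   # naive exp(1000.0) would overflow
print(logsumexp(x))              # ~1000.6931
print(logsumexp_grad(x))         # [0.5 0.5]
```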
Primitive Operations as the Foundation of AD
Automatic differentiation does not differentiate arbitrary mathematics directly.
It differentiates programs composed from primitives.
Every AD engine therefore rests on:
- a computational graph
- a primitive operator set
- local derivative propagation rules
All higher abstractions ultimately reduce to these elementary operations.