Automatic differentiation transforms programs. A fundamental semantic question therefore arises:
When do two programs have the same derivative behavior?
Program equivalence studies when two computations should be considered identical. In differentiable systems this becomes more subtle because:
- two programs may compute the same values but different derivatives,
- two programs may produce equivalent derivatives through different computational structures,
- floating-point behavior may preserve primal equivalence while changing gradient behavior,
- mutation and control flow may affect differentiability.
Equivalence matters because AD systems routinely transform programs through:
- optimization,
- graph rewriting,
- fusion,
- common subexpression elimination,
- operator reassociation,
- checkpointing,
- staging,
- partial evaluation.
If these transformations change derivatives incorrectly, the AD system becomes unsound.
This section studies semantic equivalence for differentiable programs.
Ordinary Functional Equivalence
Two functions are extensionally equivalent if they produce the same outputs for all inputs.
Example:
f(x) = x + x
g(x) = 2*x
These programs compute identical functions.
Their derivatives are also identical: f'(x) = g'(x) = 2.
Ordinary mathematics assumes this property automatically:
- equivalent functions have equivalent derivatives.
For real differentiable functions this is true.
For programs, complications appear.
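As a minimal illustration (the functions f and g here are hypothetical stand-ins), two extensionally equivalent programs also agree under a simple finite-difference probe of the derivative:

```python
def f(x):
    return x + x          # one formulation

def g(x):
    return 2.0 * x        # an extensionally equivalent formulation

def num_deriv(h, x, eps=1e-6):
    # central finite difference as a crude derivative probe
    return (h(x + eps) - h(x - eps)) / (2.0 * eps)

x = 1.3
assert f(x) == g(x)                        # identical outputs
assert num_deriv(f, x) == num_deriv(g, x)  # identical derivative estimates
```

Here the agreement is even bit-exact, because x + x and 2.0*x perform the same floating-point operation; the rest of the section examines when such agreement breaks down.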
Computational Structure Matters
AD differentiates computations, not merely denotational functions.
Suppose:
y = x*x
z = y+y
and:
z = 2*x*x
Mathematically they are equivalent.
Operationally they differ:
- different computational graphs,
- different intermediate values,
- different memory behavior,
- different floating-point evaluation order.
An AD system should ideally produce equivalent derivatives.
But floating-point arithmetic complicates this.
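Under exact arithmetic the two formulations do yield matching derivatives. A minimal forward-mode sketch (a hypothetical Dual class, not any particular AD library) makes this concrete at a sample point:

```python
class Dual:
    """Minimal forward-mode AD value: (primal, derivative)."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # product rule
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def shared(x):        # y = x*x; z = y + y
    y = x * x
    return y + y

def rewritten(x):     # z = 2*x*x
    two = Dual(2.0, 0.0)
    return two * x * x

x = Dual(3.0, 1.0)    # seed dx/dx = 1
assert shared(x).val == rewritten(x).val == 18.0
assert shared(x).dot == rewritten(x).dot == 12.0  # d(2x^2)/dx = 4x = 12
```

The computational graphs differ (one shares the intermediate y, the other does not), yet the propagated derivatives coincide here.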
Floating-Point Non-Equivalence
Floating-point arithmetic is not associative:
Suppose:
y = (a+b)+c
versus:
y = a+(b+c)
The primal outputs may differ slightly.
Their derivatives may also differ numerically because:
- rounding occurs differently,
- intermediate magnitudes change,
- cancellation patterns differ.
Thus algebraic rewrites that preserve real semantics may alter floating-point derivatives.
This distinction is important in:
- compiler optimization,
- kernel fusion,
- graph rewriting,
- distributed reduction operations.
AD correctness therefore depends on the semantic level:
- exact real semantics,
- floating-point operational semantics.
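The non-associativity is easy to observe in ordinary double precision:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # rounding happens after a + b
right = a + (b + c)   # rounding happens after b + c

# Equal over the reals, unequal in IEEE 754 doubles:
assert left != right
```

The same phenomenon applies to derivative accumulation: any rewrite that reorders floating-point additions can perturb gradients at the level of rounding error.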
Extensional vs Intensional Equivalence
Two notions of equivalence become important.
Extensional Equivalence
Programs compute the same function: f(x) = g(x) for every input x.
Intensional Equivalence
Programs have equivalent computational structure.
AD systems often operate intensionally because derivatives are extracted from execution structure.
This creates tension:
- mathematically equivalent programs,
- operationally different derivatives.
Compiler optimizations must preserve the relevant notion of equivalence.
Differential Equivalence
Differentiable programs require stronger equivalence.
Two programs are differentially equivalent if:
- their primal outputs agree,
- their derivatives agree.
Formally:
f(x) = g(x) for all x,
and
f'(x) = g'(x) for all x.
For smooth real functions, extensional equivalence implies differential equivalence.
For programs this may fail because:
- undefined regions,
- control flow,
- floating-point behavior,
- mutation,
- nondifferentiable operations.
Example:
f(x) = sqrt(x*x)
g(x) = abs(x)
These agree extensionally.
At x = 0, differentiability semantics become subtle depending on representation and evaluation strategy.
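A small forward-mode sketch (a hypothetical Dual class, with sign(0) = 0 as an assumed subgradient convention) shows how two extensionally equal formulations of |x| diverge at the origin: the abs rule returns a subgradient, while the chain rule through sqrt divides by zero.

```python
import math

class Dual:
    """Minimal forward-mode AD value: (primal, derivative)."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    def __abs__(self):
        sign = (self.val > 0) - (self.val < 0)  # convention: sign(0) = 0
        return Dual(abs(self.val), sign * self.dot)

def dual_sqrt(d):
    root = math.sqrt(d.val)
    # chain rule d/dx sqrt(u) = u' / (2 sqrt(u)); undefined when root == 0
    return Dual(root, d.dot / (2.0 * root))

x0 = Dual(0.0, 1.0)
abs_dot = abs(x0).dot        # abs rule yields subgradient 0 at x = 0
try:
    dual_sqrt(x0 * x0)       # sqrt(x*x) hits 0/0 at x = 0
    sqrt_failed = False
except ZeroDivisionError:
    sqrt_failed = True

assert abs_dot == 0.0
assert sqrt_failed
```

Which behavior an AD system exhibits depends on how the nondifferentiable point is represented, not on the function being computed.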
Control Flow Equivalence
Conditionals complicate equivalence.
Consider:
if x > 0:
y = x*x
else:
y = 0
versus:
y = max(x,0)*x
These may compute the same values.
Their derivatives differ at branch boundaries because:
- one representation exposes explicit discontinuity,
- another uses piecewise algebraic structure.
AD systems operating structurally may produce different subgradient behavior.
This matters heavily in machine learning:
- ReLU activations,
- max pooling,
- clipping,
- sparse routing,
- conditional execution.
Program equivalence becomes tied to nonsmooth semantics.
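A concrete sketch (hand-written derivative rules, with branches resolved on primal values) shows how two extensionally equal encodings of |x| produce different subgradients at the boundary, purely because the tie at x = 0 falls into different branches:

```python
def abs_strict(x, dx):
    # |x| as: x if x > 0 else -x   (the tie x == 0 takes the else branch)
    return (x, dx) if x > 0 else (-x, -dx)

def abs_loose(x, dx):
    # |x| as: x if x >= 0 else -x  (the tie x == 0 takes the then branch)
    return (x, dx) if x >= 0 else (-x, -dx)

# Same function on the reals, different subgradient at x = 0:
assert abs_strict(0.0, 1.0) == (0.0, -1.0)
assert abs_loose(0.0, 1.0) == (0.0, 1.0)
```

Both answers lie in the subdifferential of |x| at 0, so neither is wrong; but a rewrite that swaps one encoding for the other changes observable gradient behavior.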
Equivalence Under AD Transformation
An AD transformation:
D : P -> D[P]
should preserve equivalence.
If:
P ≡ Q
then ideally:
D[P] ≡ D[Q]
This property is called semantic preservation.
Compiler-based AD systems depend critically on it.
Without semantic preservation:
- optimization becomes unsafe,
- graph rewrites corrupt gradients,
- staging changes derivatives,
- differentiation order matters incorrectly.
Formal verification of AD systems largely studies this property.
Beta and Eta Equivalence
In lambda calculus, equivalence includes:
- beta equivalence,
- eta equivalence.
Beta Reduction
(λx. e) v ≡ e[x := v]
Eta Reduction
λx. (f x) ≡ f, when x does not occur free in f.
Differentiation must remain compatible with these transformations.
Suppose:
f = (λx. e) v
Then:
D[(λx. e) v] and D[e[x := v]]
must behave consistently under beta reduction.
Higher-order differentiable languages therefore require:
- substitution-preserving differentiation,
- compositional equivalence laws.
Referential Transparency
Pure functional programs satisfy referential transparency:
An expression can be replaced by its value without changing behavior.
Example:
let y = x*x
y+y
can be rewritten:
(x*x)+(x*x)
without changing mathematical meaning.
In differentiable systems, referential transparency is extremely valuable because:
- graph transformations remain valid,
- sharing can be preserved,
- gradients remain compositional.
Mutation breaks this property.
Mutation and Equivalence
Suppose:
x = x + 1
The meaning depends on program state.
Now consider:
x = x + 1
y = x*x
versus:
y = (x+1)*(x+1)
These may appear equivalent.
Under mutation semantics:
- aliasing,
- evaluation timing,
- memory reuse,
can alter behavior.
Reverse mode becomes particularly fragile because the backward pass depends on earlier program states.
AD systems therefore often transform imperative programs into:
- SSA form,
- functional IR,
- immutable graph structure.
This restores compositional equivalence properties.
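A sketch of the rewrite (illustrative variable names, not any particular IR): the imperative form overwrites x, losing the old binding, while the SSA-style form gives every assignment a fresh name, so each intermediate a backward pass might need remains addressable.

```python
# Imperative form: the binding for x is destructively overwritten.
x = 3.0
x = x + 1.0        # the old x is gone; reverse mode would need a tape
y = x * x

# SSA-style form: each assignment introduces a fresh, immutable name.
x0 = 3.0
x1 = x0 + 1.0      # x1 survives and can be read by the backward pass
y_ssa = x1 * x1

assert y == y_ssa == 16.0
```

The two forms compute the same value, but only the second exposes the dataflow that adjoint propagation depends on.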
Common Subexpression Elimination
Compiler optimizations often merge repeated computations.
Example:
a = expensive(x)
y = a+a
versus:
y = expensive(x)+expensive(x)
Mathematically, the two are identical.
Operationally:
- one evaluates once,
- one evaluates twice.
If:
- floating-point behavior,
- randomness,
- side effects,
- mutation,
exist, derivatives may differ.
AD correctness therefore depends on whether the language semantics permit such rewrites.
Pure functional semantics greatly simplify this problem.
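A small experiment (expensive is a hypothetical stand-in, instrumented with a call counter) makes the operational difference visible even though the values agree:

```python
calls = {"n": 0}

def expensive(x):
    # stand-in for a costly primitive; the counter records
    # how many times it actually runs
    calls["n"] += 1
    return x * x

a = expensive(3.0)
y_shared = a + a                          # one evaluation, result reused
y_dup = expensive(3.0) + expensive(3.0)   # two separate evaluations

assert y_shared == y_dup == 18.0
assert calls["n"] == 3                    # 1 shared call + 2 duplicated calls
```

If expensive were randomized, stateful, or numerically order-sensitive, y_dup could differ from y_shared, and so could their adjoints; in a pure setting the rewrite is safe.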
Equivalence of Computational Graphs
A computational graph represents dependency structure.
Different graphs may represent equivalent functions.
Example:
x -> square -> add
versus:
x -> square -> multiply-by-2
Graph optimizers rewrite graphs aggressively:
- node fusion,
- algebraic simplification,
- constant folding,
- operator reassociation.
An AD system must ensure:
- primal equivalence,
- differential equivalence.
Graph rewriting systems therefore often attach formal rewrite rules with derivative-preservation guarantees.
Equivalence and Reverse Mode
Reverse mode introduces additional semantic constraints.
The backward pass depends on:
- execution order,
- intermediate values,
- tape contents,
- graph structure.
Two equivalent forward programs may produce:
- different tapes,
- different memory usage,
- different adjoint accumulation order.
Floating-point non-associativity may then produce numerically different gradients.
This issue appears prominently in:
- distributed training,
- parallel reductions,
- asynchronous gradient aggregation.
Program equivalence becomes probabilistic or approximate rather than exact.
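The effect is easy to reproduce: summing the same per-example gradient contributions in two different orders (as parallel or asynchronous reduction might) yields numerically different totals.

```python
# Hypothetical per-example gradient contributions from a reverse pass.
grads = [0.1, 0.2, 0.3]

def accumulate(contributions):
    # left-to-right floating-point reduction
    total = 0.0
    for g in contributions:
        total += g
    return total

forward = accumulate(grads)            # ((0 + 0.1) + 0.2) + 0.3
backward = accumulate(reversed(grads)) # ((0 + 0.3) + 0.2) + 0.1

# Equal over the reals, unequal in doubles:
assert forward != backward
```

Deterministic training setups therefore often fix the reduction order, trading parallel efficiency for bitwise-reproducible gradients.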
Observational Equivalence
A practical notion is observational equivalence.
Two differentiable programs are observationally equivalent if no external observer can distinguish them through:
- outputs,
- derivatives,
- higher-order derivative queries.
This depends on the allowed observations.
Examples:
- only primal outputs,
- gradients,
- Hessians,
- runtime behavior,
- memory consumption.
Different AD systems preserve different observational equivalences.
Equivalence and Compiler Correctness
AD compilers apply many transformations:
- fusion,
- staging,
- differentiation,
- simplification,
- dead-code elimination,
- checkpoint insertion.
Compiler correctness requires proving:
- transformed program equivalence,
- derivative equivalence,
- higher-order equivalence.
Formal methods increasingly target this problem because:
- large ML systems rely on compiler transformations,
- incorrect gradients silently corrupt optimization,
- floating-point behavior complicates reasoning.
Several modern differentiable compilers use:
- theorem proving,
- typed IRs,
- verified rewrite systems.
Equivalence Classes of Differentiable Programs
Programs may be grouped by:
- primal equivalence,
- differential equivalence,
- higher-order equivalence.
For smooth programs over exact reals these often coincide.
For real systems they diverge because of:
- finite precision,
- mutation,
- nondifferentiability,
- control flow,
- randomness,
- concurrency.
Understanding these distinctions is essential for building reliable AD infrastructure.
The Core Semantic Problem
Automatic differentiation transforms computations structurally.
Program equivalence therefore becomes a semantic foundation of AD correctness.
The central challenge is:
Which program transformations preserve derivatives?
Pure functional structure simplifies equivalence because:
- substitution behaves compositionally,
- sharing remains explicit,
- mutation disappears.
Imperative structure complicates equivalence because:
- execution order matters,
- memory state evolves,
- floating-point behavior changes.
Differentiable programming systems therefore increasingly rely on:
- immutable IRs,
- SSA representations,
- compositional semantics,
- formally verified rewrites.
The correctness of automatic differentiation ultimately depends on preserving equivalence not only of values, but of derivative structure itself.