Program Equivalence

Automatic differentiation transforms programs. A fundamental semantic question therefore arises:

When do two programs have the same derivative behavior?

Program equivalence studies when two computations should be considered identical. In differentiable systems this becomes more subtle because:

  • two programs may compute the same values but different derivatives,
  • two programs may produce equivalent derivatives through different computational structures,
  • floating-point behavior may preserve primal equivalence while changing gradient behavior,
  • mutation and control flow may affect differentiability.

Equivalence matters because AD systems routinely transform programs through:

  • optimization,
  • graph rewriting,
  • fusion,
  • common subexpression elimination,
  • operator reassociation,
  • checkpointing,
  • staging,
  • partial evaluation.

If these transformations change derivatives incorrectly, the AD system becomes unsound.

This section studies semantic equivalence for differentiable programs.

Ordinary Functional Equivalence

Two functions are extensionally equivalent if they produce the same outputs for all inputs:

f(x) = g(x) \quad \forall x.

Example:

f(x) = x^2 + 2x + 1, \qquad g(x) = (x+1)^2.

These programs compute identical functions.

Their derivatives are also identical:

f'(x) = g'(x) = 2x + 2.

Ordinary mathematics assumes this property automatically:

  • equivalent functions have equivalent derivatives.

For real differentiable functions this is true.

For programs, complications appear.
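Before turning to the complications, the happy case above can be checked mechanically. The sketch below is a minimal forward-mode AD based on dual numbers (all names here are hypothetical, not any particular library's API); it confirms that the two polynomials agree in value and in derivative.

```python
class Dual:
    """Minimal forward-mode AD value: val carries f(x), dot carries f'(x)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def d(f, x):
    """Derivative of f at x: seed the input with dot = 1 and read it back."""
    return f(Dual(x, 1.0)).dot

f = lambda x: x * x + 2 * x + 1   # x^2 + 2x + 1
g = lambda x: (x + 1) * (x + 1)   # (x + 1)^2

for x in [-2.0, 0.0, 3.0]:
    assert d(f, x) == d(g, x) == 2 * x + 2   # both derivatives are 2x + 2
```

For integer-valued inputs all intermediate floats are exact, so the equalities hold exactly, not just approximately.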

Computational Structure Matters

AD differentiates computations, not merely denotational functions.

Suppose:

y = x*x
z = y+y

and:

z = 2*x*x

Mathematically they are equivalent.

Operationally they differ:

  • different computational graphs,
  • different intermediate values,
  • different memory behavior,
  • different floating-point evaluation order.

An AD system should ideally produce equivalent derivatives.
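The structural difference can be made visible with a toy operation tracer (a hypothetical sketch, not a real AD tape): both programs produce the same primal value, but they record different sequences of primitive operations.

```python
TAPE = []  # global record of executed primitive ops (a toy trace)

class Traced:
    def __init__(self, val):
        self.val = val
    def __add__(self, other):
        other = other if isinstance(other, Traced) else Traced(other)
        TAPE.append("add")
        return Traced(self.val + other.val)
    def __mul__(self, other):
        other = other if isinstance(other, Traced) else Traced(other)
        TAPE.append("mul")
        return Traced(self.val * other.val)
    __rmul__ = __mul__

def trace(f, x):
    """Run f on a traced input; return its value and the recorded ops."""
    TAPE.clear()
    out = f(Traced(x))
    return out.val, list(TAPE)

# y = x*x; z = y + y        versus        z = 2*x*x
v1, t1 = trace(lambda x: (lambda y: y + y)(x * x), 3.0)
v2, t2 = trace(lambda x: 2 * x * x, 3.0)

assert v1 == v2 == 18.0        # identical primal values
assert t1 == ["mul", "add"]    # but different operation traces
assert t2 == ["mul", "mul"]
```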

But floating-point arithmetic complicates this.

Floating-Point Non-Equivalence

Floating-point arithmetic is not associative:

(a+b)+c \ne a+(b+c).

Suppose:

y = (a+b)+c

versus:

y = a+(b+c)

The primal outputs may differ slightly.

Their derivatives may also differ numerically because:

  • rounding occurs differently,
  • intermediate magnitudes change,
  • cancellation patterns differ.

Thus algebraic rewrites that preserve real semantics may alter floating-point derivatives.
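A one-line check in any IEEE-754 language exhibits the failure directly:

```python
# Associativity fails in double precision: rewriting (a+b)+c as a+(b+c)
# changes the rounded result, even though the real-number values agree.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
assert left != right
```

Any gradient computation whose adjoints are accumulated in a different order is subject to the same perturbation.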

This distinction is important in:

  • compiler optimization,
  • kernel fusion,
  • graph rewriting,
  • distributed reduction operations.

AD correctness therefore depends on the semantic level:

  • exact real semantics,
  • floating-point operational semantics.

Extensional vs Intensional Equivalence

Two notions of equivalence become important.

Extensional Equivalence

Programs compute the same function:

f(x) = g(x).

Intensional Equivalence

Programs have equivalent computational structure.

AD systems often operate intensionally because derivatives are extracted from execution structure.

This creates tension:

  • mathematically equivalent programs,
  • operationally different derivatives.

Compiler optimizations must preserve the relevant notion of equivalence.

Differential Equivalence

Differentiable programs require stronger equivalence.

Two programs are differentially equivalent if:

  • their primal outputs agree,
  • their derivatives agree.

Formally:

f(x) = g(x)

and

Df(x) = Dg(x).

For smooth real functions, extensional equivalence implies differential equivalence.

For programs this may fail because of:

  • undefined regions,
  • control flow,
  • floating-point behavior,
  • mutation,
  • nondifferentiable operations.

Example:

f(x) = x\,|x|, \qquad g(x) = \begin{cases} x^2 & x \ge 0 \\ -x^2 & x < 0 \end{cases}

These agree extensionally.

At x = 0, differentiability semantics become subtle depending on representation and evaluation strategy.

Control Flow Equivalence

Conditionals complicate equivalence.

Consider:

if x > 0:
    y = x*x
else:
    y = 0

versus:

y = max(x,0)*x

These may compute the same values.

Their derivatives may differ at branch boundaries because:

  • one representation exposes explicit discontinuity,
  • another uses piecewise algebraic structure.

AD systems operating structurally may produce different subgradient behavior.

This matters heavily in machine learning:

  • ReLU activations,
  • max pooling,
  • clipping,
  • sparse routing,
  • conditional execution.

Program equivalence becomes tied to nonsmooth semantics.
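The boundary behavior can be pinned down with the dual-number idea in a minimal sketch (hypothetical names throughout). To isolate the branch point it uses plain ReLU rather than the quadratic example above: a branch-based ReLU and a max-based ReLU agree everywhere in value, yet report different subgradients at x = 0, because the tie-breaking rule inside max decides which derivative propagates.

```python
class Dual:
    """Forward-mode value: val is the primal, dot the derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

def dmax(a, b):
    """max on duals: propagate the derivative of the chosen argument.
    The tie at a.val == b.val is broken toward the first argument."""
    return a if a.val >= b.val else b

def relu_branch(x):
    # if x > 0: y = x  else: y = 0
    return x if x.val > 0 else Dual(0.0)

def relu_max(x):
    return dmax(x, Dual(0.0))

x = Dual(0.0, 1.0)
assert relu_branch(x).dot == 0.0   # the else-branch carries no derivative
assert relu_max(x).dot == 1.0      # the tie-break picks x: derivative 1
```

Both answers are valid subgradients of ReLU at 0; the point is that extensionally equal programs yield different ones.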

Equivalence Under AD Transformation

An AD transformation:

P \mapsto D[P]

should preserve equivalence.

If:

P_1 \equiv P_2,

then ideally:

D[P_1] \equiv D[P_2].

This property is called semantic preservation.

Compiler-based AD systems depend critically on it.

Without semantic preservation:

  • optimization becomes unsafe,
  • graph rewrites corrupt gradients,
  • staging changes derivatives,
  • differentiation order matters incorrectly.

Formal verification of AD systems largely studies this property.

Beta and Eta Equivalence

In lambda calculus, equivalence includes:

  • beta equivalence,
  • eta equivalence.

Beta Reduction

(\lambda x.\,t)\ u \to t[u/x].

Eta Reduction

\lambda x.\,f(x) \equiv f.

Differentiation must remain compatible with these transformations.

Suppose:

f = \lambda x.\,x^2.

Then:

D[\lambda x.\,x^2]

must behave consistently under beta reduction.

Higher-order differentiable languages therefore require:

  • substitution-preserving differentiation,
  • compositional equivalence laws.
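These laws can be spot-checked with the dual-number sketch (hypothetical names, not a real library): differentiating the beta-redex, its reduct, and the eta-expansion of f all give the same answer.

```python
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def d(f, x):
    """Derivative of f at x via a dual number seeded with dot = 1."""
    return f(Dual(x, 1.0)).dot

f = lambda x: x * x                    # f = λx. x²
beta = lambda x: (lambda y: y * y)(x)  # (λy. y²) x, beta-reduces to x²
eta = lambda x: f(x)                   # λx. f x, eta-reduces to f

for x in [-1.0, 2.0]:
    assert d(f, x) == d(beta, x) == d(eta, x) == 2 * x
```

In a language with effects or evaluation-order sensitivity this agreement is exactly what can fail, which is why it must be stated and proved rather than assumed.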

Referential Transparency

Pure functional programs satisfy referential transparency:

An expression can be replaced by its value without changing behavior.

Example:

let y = x*x
y+y

can be rewritten:

(x*x)+(x*x)

without changing mathematical meaning.

In differentiable systems, referential transparency is extremely valuable because:

  • graph transformations remain valid,
  • sharing can be preserved,
  • gradients remain compositional.

Mutation breaks this property.

Mutation and Equivalence

Suppose:

x = x + 1

The meaning depends on program state.

Now consider:

x = x + 1
y = x*x

versus:

y = (x+1)*(x+1)

These may appear equivalent.

Under mutation semantics:

  • aliasing,
  • evaluation timing,
  • memory reuse,

can alter behavior.

Reverse mode becomes particularly fragile because the backward pass depends on earlier program states.

AD systems therefore often transform imperative programs into:

  • SSA form,
  • functional IR,
  • immutable graph structure.

This restores compositional equivalence properties.

Common Subexpression Elimination

Compiler optimizations often merge repeated computations.

Example:

a = expensive(x)
y = a+a

versus:

y = expensive(x)+expensive(x)

Mathematically:

y = 2\,\mathrm{expensive}(x).

Operationally:

  • one evaluates once,
  • one evaluates twice.

If the program involves:

  • floating-point rounding,
  • randomness,
  • side effects,
  • mutation,

then derivatives may differ.

AD correctness therefore depends on whether the language semantics permit such rewrites.

Pure functional semantics greatly simplify this problem.
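A short sketch (hypothetical `expensive` and counter, chosen for illustration) shows the observable difference between the shared and duplicated forms, and how a side effect such as randomness makes the rewrite outright unsound:

```python
import random

calls = 0
def expensive(x):
    global calls
    calls += 1          # side channel: count how often we are evaluated
    return x * x

# Shared form: a = expensive(x); y = a + a
calls = 0
a = expensive(3.0)
shared = a + a
assert calls == 1 and shared == 18.0

# Duplicated form: y = expensive(x) + expensive(x)
calls = 0
dup = expensive(3.0) + expensive(3.0)
assert calls == 2 and dup == 18.0   # same value, different evaluation count

# With randomness the two forms diverge in value, not just in cost:
rng = random.Random(0)
noisy = lambda x: x + rng.random()
assert noisy(1.0) + noisy(1.0) != 2 * noisy(1.0)  # three distinct draws
```

Under pure semantics the first two forms are indistinguishable, which is precisely what licenses CSE; the counter and the random draw are the observations that break the equivalence.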

Equivalence of Computational Graphs

A computational graph represents dependency structure.

Different graphs may represent equivalent functions.

Example:

x -> square -> add-to-itself

versus:

x -> square -> multiply-by-2

Graph optimizers rewrite graphs aggressively:

  • node fusion,
  • algebraic simplification,
  • constant folding,
  • operator reassociation.

An AD system must ensure:

  • primal equivalence,
  • differential equivalence.

Graph rewriting systems therefore often attach formal rewrite rules with derivative-preservation guarantees.

Equivalence and Reverse Mode

Reverse mode introduces additional semantic constraints.

The backward pass depends on:

  • execution order,
  • intermediate values,
  • tape contents,
  • graph structure.

Two equivalent forward programs may produce:

  • different tapes,
  • different memory usage,
  • different adjoint accumulation order.

Floating-point non-associativity may then produce numerically different gradients.

This issue appears prominently in:

  • distributed training,
  • parallel reductions,
  • asynchronous gradient aggregation.

Program equivalence becomes probabilistic or approximate rather than exact.
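A toy reverse-mode sketch (hypothetical names, deliberately minimal) makes the tape divergence concrete: two equivalent forward programs record tapes of different length, yet in this small example accumulate the same gradient.

```python
class Var:
    """Toy reverse-mode node: records parents and local partials on a tape."""
    def __init__(self, val, parents=()):
        self.val, self.parents, self.grad = val, parents, 0.0
        TAPE.append(self)
    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def backward(out):
    out.grad = 1.0
    for node in reversed(TAPE):      # tape order is a valid reverse order
        for parent, local in node.parents:
            parent.grad += node.grad * local

# Program 1: y = x*x; z = y + y
TAPE = []
x1 = Var(3.0)
y = x1 * x1
z1 = y + y
backward(z1)
len1 = len(TAPE)                     # nodes: x, y, z

# Program 2: z = (2*x)*x, with 2 as an explicit constant node
TAPE = []
x2 = Var(3.0)
two = Var(2.0)
z2 = (two * x2) * x2
backward(z2)
len2 = len(TAPE)                     # nodes: x, 2, 2*x, z

assert z1.val == z2.val == 18.0
assert x1.grad == x2.grad == 12.0    # d/dx 2x^2 = 4x = 12 at x = 3
assert len1 == 3 and len2 == 4       # equivalent programs, different tapes
```

With more operands, the adjoint additions in `backward` would execute in different orders for the two tapes, and floating-point non-associativity would then surface as numerically different gradients.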

Observational Equivalence

A practical notion is observational equivalence.

Two differentiable programs are observationally equivalent if no external observer can distinguish them through:

  • outputs,
  • derivatives,
  • higher-order derivative queries.

This depends on the allowed observations.

Examples:

  • only primal outputs,
  • gradients,
  • Hessians,
  • runtime behavior,
  • memory consumption.

Different AD systems preserve different observational equivalences.

Equivalence and Compiler Correctness

AD compilers apply many transformations:

  • fusion,
  • staging,
  • differentiation,
  • simplification,
  • dead-code elimination,
  • checkpoint insertion.

Compiler correctness requires proving:

  • transformed program equivalence,
  • derivative equivalence,
  • higher-order equivalence.

Formal methods increasingly target this problem because:

  • large ML systems rely on compiler transformations,
  • incorrect gradients silently corrupt optimization,
  • floating-point behavior complicates reasoning.

Several modern differentiable compilers use:

  • theorem proving,
  • typed IRs,
  • verified rewrite systems.

Equivalence Classes of Differentiable Programs

Programs may be grouped by:

  • primal equivalence,
  • differential equivalence,
  • higher-order equivalence.

For smooth programs over exact reals these often coincide.

For real systems they diverge because of:

  • finite precision,
  • mutation,
  • nondifferentiability,
  • control flow,
  • randomness,
  • concurrency.

Understanding these distinctions is essential for building reliable AD infrastructure.

The Core Semantic Problem

Automatic differentiation transforms computations structurally.

Program equivalence therefore becomes a semantic foundation of AD correctness.

The central challenge is:

Which program transformations preserve derivatives?

Pure functional structure simplifies equivalence because:

  • substitution behaves compositionally,
  • sharing remains explicit,
  • mutation disappears.

Imperative structure complicates equivalence because:

  • execution order matters,
  • memory state evolves,
  • floating-point behavior changes.

Differentiable programming systems therefore increasingly rely on:

  • immutable IRs,
  • SSA representations,
  • compositional semantics,
  • formally verified rewrites.

The correctness of automatic differentiation ultimately depends on preserving equivalence not only of values, but of derivative structure itself.