# Program Equivalence

Automatic differentiation transforms programs. A fundamental semantic question therefore arises:

> When do two programs have the same derivative behavior?

Program equivalence studies when two computations should be considered identical. In differentiable systems this becomes more subtle because:
- two programs may compute the same values but different derivatives,
- two programs may produce equivalent derivatives through different computational structures,
- floating-point behavior may preserve primal equivalence while changing gradient behavior,
- mutation and control flow may affect differentiability.

Equivalence matters because AD systems routinely transform programs through:
- optimization,
- graph rewriting,
- fusion,
- common subexpression elimination,
- operator reassociation,
- checkpointing,
- staging,
- partial evaluation.

If these transformations change derivatives incorrectly, the AD system becomes unsound.

This section studies semantic equivalence for differentiable programs.

## Ordinary Functional Equivalence

Two functions are extensionally equivalent if they produce the same outputs for all inputs:

$$
f(x)=g(x)
\quad \forall x.
$$

Example:

$$
f(x)=x^2+2x+1
$$

$$
g(x)=(x+1)^2.
$$

These programs compute identical functions.

Their derivatives are also identical:

$$
f'(x)=g'(x)=2x+2.
$$
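A quick check of both facts in JAX (the choice of AD system here is ours, for illustration):

```python
import jax

def f(x):
    return x**2 + 2*x + 1   # expanded form

def g(x):
    return (x + 1)**2       # factored form

x = 3.0
# Primal outputs agree.
print(f(x), g(x))                      # 16.0 16.0
# Derivatives agree: f'(x) = g'(x) = 2x + 2 = 8.
print(jax.grad(f)(x), jax.grad(g)(x))  # 8.0 8.0
```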

Ordinary mathematics grants this property automatically: equivalent functions have equivalent derivatives.

For real differentiable functions this is true.

For programs, complications appear.

## Computational Structure Matters

AD differentiates computations, not merely denotational functions.

Suppose:

```text
y = x*x
z = y+y
```

and:

```text
z = 2*x*x
```

Mathematically they are equivalent.

Operationally they differ:
- different computational graphs,
- different intermediate values,
- different memory behavior,
- different floating-point evaluation order.

An AD system should ideally produce equivalent derivatives.
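For this particular pair it does: both structures differentiate to $4x$. A small JAX check (framework choice ours):

```python
import jax

def p1(x):
    y = x * x          # shared intermediate
    return y + y

def p2(x):
    return 2 * x * x   # single expression

x = 1.5
# This particular rewrite preserves derivatives: both give 4x.
print(jax.grad(p1)(x), jax.grad(p2)(x))  # 6.0 6.0
```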

But floating-point arithmetic complicates this.

## Floating-Point Non-Equivalence

Floating-point arithmetic is not associative:

$$
(a+b)+c
\ne
a+(b+c).
$$

Suppose:

```text
y = (a+b)+c
```

versus:

```text
y = a+(b+c)
```

The primal outputs may differ slightly.
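A minimal NumPy demonstration in float32 (the values are chosen deliberately to force absorption):

```python
import numpy as np

a = np.float32(1.0e8)
b = np.float32(-1.0e8)
c = np.float32(1.0e-3)

print((a + b) + c)  # 0.001: exact cancellation happens first
print(a + (b + c))  # 0.0:   c is absorbed into b before the cancellation
```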

Their derivatives may also differ numerically because:
- rounding occurs differently,
- intermediate magnitudes change,
- cancellation patterns differ.

Thus algebraic rewrites that preserve real semantics may alter floating-point derivatives.

This distinction is important in:
- compiler optimization,
- kernel fusion,
- graph rewriting,
- distributed reduction operations.

AD correctness therefore depends on which semantic level is intended:
- exact real semantics,
- floating-point operational semantics.

## Extensional vs Intensional Equivalence

Two notions of equivalence become important.

### Extensional Equivalence

Programs compute the same function:

$$
f(x)=g(x).
$$

### Intensional Equivalence

Programs have equivalent computational structure.

AD systems often operate intensionally because derivatives are extracted from execution structure.

This creates a tension: mathematically equivalent programs may exhibit operationally different derivatives.

Compiler optimizations must preserve the relevant notion of equivalence.

## Differential Equivalence

Differentiable programs require stronger equivalence.

Two programs are differentially equivalent if:
- their primal outputs agree,
- their derivatives agree.

Formally:

$$
f(x)=g(x)
$$

and

$$
Df(x)=Dg(x).
$$

For smooth real functions, extensional equivalence implies differential equivalence.

For programs this implication may fail because of:
- undefined regions,
- control flow,
- floating-point behavior,
- mutation,
- nondifferentiable operations.

Example:

$$
f(x)=x|x|
$$

$$
g(x)=
\begin{cases}
x^2 & x\ge0 \\
-x^2 & x<0
\end{cases}
$$

These agree extensionally.

At $x=0$, differentiability semantics become subtle depending on representation and evaluation strategy.
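A concrete probe, sketched in JAX (framework choice ours). Here the two representations happen to report the same derivative at the kink, but nothing in the programs themselves forces that agreement; it is a per-system artifact:

```python
import jax
import jax.numpy as jnp

def f(x):
    return x * jnp.abs(x)

def g(x):
    # piecewise representation of the same function
    return jnp.where(x >= 0, x**2, -(x**2))

x = 0.0
# Both happen to print 0.0 under JAX's conventions for abs and where.
print(jax.grad(f)(x), jax.grad(g)(x))  # 0.0 0.0
```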

## Control Flow Equivalence

Conditionals complicate equivalence.

Consider:

```text
if x > 0:
    y = x*x
else:
    y = 0
```

versus:

```text
y = max(x,0)*x
```

These compute the same values for all inputs.

Their derivatives can differ at the branch boundary because:
- one representation exposes the discontinuity through explicit control flow,
- the other routes it through the piecewise algebraic structure of max.

AD systems operating structurally may produce different subgradient behavior.
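The effect is easiest to see on bare ReLU, the first item in the list below. A JAX sketch (framework choice ours; the value at the tie is implementation-defined — JAX's `max` splits the tangent at ties, while `jax.nn.relu` separately defines relu'(0) = 0):

```python
import jax
import jax.numpy as jnp

def relu_branch(x):
    # explicit conditional, mirroring the if/else program
    return jnp.where(x > 0, x, 0.0)

def relu_max(x):
    # algebraic form routed through max
    return jnp.maximum(x, 0.0)

x = 0.0
print(jax.grad(relu_branch)(x))  # 0.0: the untaken branch contributes nothing
print(jax.grad(relu_max)(x))     # 0.5: JAX splits the tangent at ties of max
```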

This matters heavily in machine learning:
- ReLU activations,
- max pooling,
- clipping,
- sparse routing,
- conditional execution.

Program equivalence becomes tied to nonsmooth semantics.

## Equivalence Under AD Transformation

An AD transformation:

$$
P \mapsto D[P]
$$

should preserve equivalence.

If:

$$
P_1 \equiv P_2,
$$

then ideally:

$$
D[P_1]\equiv D[P_2].
$$

This property is called semantic preservation.

Compiler-based AD systems depend critically on it.

Without semantic preservation:
- optimization becomes unsafe,
- graph rewrites corrupt gradients,
- staging changes derivatives,
- the result depends incorrectly on when differentiation is applied.

Formal verification of AD systems largely studies this property.
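Short of a full proof, semantic preservation can at least be stress-tested. A randomized-testing sketch (the helper name, tolerance, and sampling scheme are ours); note that such a test can only refute differential equivalence, never establish it:

```python
import jax
import numpy as np

def refute_equivalence(p1, p2, trials=100, tol=1e-5, seed=0):
    """Randomized search for a point where D[P1] and D[P2] disagree."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = float(rng.normal())
        if abs(float(jax.grad(p1)(x)) - float(jax.grad(p2)(x))) > tol:
            return x          # counterexample found
    return None               # no counterexample observed

# The rewrite y + y -> 2*y from earlier in this section survives the test.
print(refute_equivalence(lambda x: x*x + x*x, lambda x: 2.0 * x * x))  # None
```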

## Beta and Eta Equivalence

In lambda calculus, equivalence includes:
- beta equivalence,
- eta equivalence.

### Beta Reduction

$$
(\lambda x.t)\ u
\to
t[u/x].
$$

### Eta Reduction

$$
\lambda x.f(x)
\equiv
f.
$$

(provided $x$ does not occur free in $f$).

Differentiation must remain compatible with these transformations.

Suppose:

$$
f=\lambda x.x^2.
$$

Then:

$$
D[\lambda x.x^2]
$$

must behave consistently under beta reduction.
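A concrete instance in JAX (framework choice ours): differentiating a redex and its beta-reduced form agree.

```python
import jax

redex   = lambda u: (lambda x: x**2)(u)  # (λx. x²) u
reduced = lambda u: u**2                 # t[u/x]

u = 2.0
# Differentiating before or after beta reduction gives the same result here.
print(jax.grad(redex)(u), jax.grad(reduced)(u))  # 4.0 4.0
```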

Higher-order differentiable languages therefore require:
- substitution-preserving differentiation,
- compositional equivalence laws.

## Referential Transparency

Pure functional programs satisfy referential transparency:

An expression can be replaced by its value without changing behavior.

Example:

```text
let y = x*x
y+y
```

can be rewritten:

```text
(x*x)+(x*x)
```

without changing mathematical meaning.

In differentiable systems, referential transparency is extremely valuable because:
- graph transformations remain valid,
- sharing can be preserved,
- gradients remain compositional.

Mutation breaks this property.

## Mutation and Equivalence

Suppose:

```text
x = x + 1
```

The meaning depends on program state.

Now consider:

```text
x = x + 1
y = x*x
```

versus:

```text
y = (x+1)*(x+1)
```

These may appear equivalent.

Under mutation semantics:
- aliasing,
- evaluation timing,
- memory reuse,

can alter behavior.

Reverse mode becomes particularly fragile because the backward pass depends on earlier program states.

AD systems therefore often transform imperative programs into:
- SSA form,
- functional IR,
- immutable graph structure.

This restores compositional equivalence properties.
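JAX illustrates this discipline concretely: its arrays are immutable, and in-place updates are expressed functionally through the `.at[...].set(...)` API, which returns a new array.

```python
import jax
import jax.numpy as jnp

def updated_sum_of_squares(x):
    # JAX arrays are immutable: `x[0] = x[0] + 1.0` would raise an error.
    # The functional update returns a new array, keeping the program
    # referentially transparent and safe to differentiate.
    x = x.at[0].set(x[0] + 1.0)
    return jnp.sum(x * x)

x = jnp.array([1.0, 2.0])
print(jax.grad(updated_sum_of_squares)(x))  # [4. 4.]
```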

## Common Subexpression Elimination

Compiler optimizations often merge repeated computations.

Example:

```text
a = expensive(x)
y = a+a
```

versus:

```text
y = expensive(x)+expensive(x)
```

Mathematically:

$$
y=2\,\mathrm{expensive}(x).
$$

Operationally:
- one evaluates once,
- one evaluates twice.

If any of the following are present:
- order-sensitive floating-point arithmetic,
- randomness,
- side effects,
- mutation,

then values, and hence derivatives, may differ.

AD correctness therefore depends on whether the language semantics permit such rewrites.
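A deliberately impure Python caricature of why: the call counter and randomness below stand in for arbitrary effects, and the "equivalent" forms diverge observably.

```python
import random

calls = 0

def expensive(x):
    global calls
    calls += 1                   # observable side effect
    return x * random.random()  # randomized result

random.seed(0)
a = expensive(1.0)
shared = a + a                             # CSE'd form: one evaluation

random.seed(0)
inlined = expensive(1.0) + expensive(1.0)  # two evaluations, two draws

print(shared == inlined, calls)  # False 3
```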

Pure functional semantics greatly simplify this problem.

## Equivalence of Computational Graphs

A computational graph represents dependency structure.

Different graphs may represent equivalent functions.

Example:

```text
x -> square -> add                    (y = x*x, z = y + y)
```

versus:

```text
x -> multiply-by-2 -> multiply-by-x   (z = (2*x) * x)
```

Both graphs compute $z=2x^2$ through different dependency structures.

Graph optimizers rewrite graphs aggressively:
- node fusion,
- algebraic simplification,
- constant folding,
- operator reassociation.

An AD system must ensure:
- primal equivalence,
- differential equivalence.

Graph rewriting systems therefore often attach formal rewrite rules with derivative-preservation guarantees.

## Equivalence and Reverse Mode

Reverse mode introduces additional semantic constraints.

The backward pass depends on:
- execution order,
- intermediate values,
- tape contents,
- graph structure.

Two equivalent forward programs may produce:
- different tapes,
- different memory usage,
- different adjoint accumulation order.

Floating-point non-associativity may then produce numerically different gradients.
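Adjoint accumulation is itself a floating-point reduction, so it inherits this order sensitivity. A NumPy sketch (sizes and data are arbitrary, and the exact discrepancy varies):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=100_000).astype(np.float32)  # stand-in for adjoint contributions

left_to_right = np.float32(0.0)
for v in g:                 # strictly sequential accumulation, one rounding per step
    left_to_right += v

pairwise = g.sum()          # numpy's tree-shaped (pairwise) reduction

# The two orders typically disagree in the low-order bits.
print(float(left_to_right), float(pairwise))
```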

This issue appears prominently in:
- distributed training,
- parallel reductions,
- asynchronous gradient aggregation.

Program equivalence becomes probabilistic or approximate rather than exact.

## Observational Equivalence

A practical notion is observational equivalence.

Two differentiable programs are observationally equivalent if no external observer can distinguish them through:
- outputs,
- derivatives,
- higher-order derivative queries.

This depends on the allowed observations.

Examples:
- only primal outputs,
- gradients,
- Hessians,
- runtime behavior,
- memory consumption.

Different AD systems preserve different observational equivalences.

## Equivalence and Compiler Correctness

AD compilers apply many transformations:
- fusion,
- staging,
- differentiation,
- simplification,
- dead-code elimination,
- checkpoint insertion.

Compiler correctness requires proving:
- transformed program equivalence,
- derivative equivalence,
- higher-order equivalence.

Formal methods increasingly target this problem because:
- large ML systems rely on compiler transformations,
- incorrect gradients silently corrupt optimization,
- floating-point behavior complicates reasoning.

Several modern differentiable compilers use:
- theorem proving,
- typed IRs,
- verified rewrite systems.

## Equivalence Classes of Differentiable Programs

Programs may be grouped by:
- primal equivalence,
- differential equivalence,
- higher-order equivalence.

For smooth programs over exact reals these often coincide.

For real systems they diverge because of:
- finite precision,
- mutation,
- nondifferentiability,
- control flow,
- randomness,
- concurrency.

Understanding these distinctions is essential for building reliable AD infrastructure.

## The Core Semantic Problem

Automatic differentiation transforms computations structurally.

Program equivalence therefore becomes a semantic foundation of AD correctness.

The central challenge is:

> Which program transformations preserve derivatives?

Pure functional structure simplifies equivalence because:
- substitution behaves compositionally,
- sharing remains explicit,
- mutation disappears.

Imperative structure complicates equivalence because:
- execution order matters,
- memory state evolves,
- floating-point behavior changes.

Differentiable programming systems therefore increasingly rely on:
- immutable IRs,
- SSA representations,
- compositional semantics,
- formally verified rewrites.

The correctness of automatic differentiation ultimately depends on preserving equivalence not only of values, but of derivative structure itself.

