Algebraic Semantics

Automatic differentiation is often introduced operationally. A program executes elementary operations, and derivative information propagates alongside the computation. This operational view explains implementation, but it does not explain what differentiation fundamentally means at the semantic level.

Algebraic semantics studies differentiation as a structure-preserving transformation between programs, functions, and algebraic objects. Instead of viewing AD as a collection of local derivative rules, it treats differentiation as an algebraic operation acting on computations themselves.

This viewpoint becomes important when building:

  • differentiable programming languages,
  • verified AD systems,
  • compiler transformations,
  • higher-order differentiable languages,
  • symbolic-numeric hybrid systems,
  • differentiable intermediate representations.

The goal is to answer a deeper question:

What algebraic structure makes automatic differentiation possible?

Differentiation as Structure Preservation

Consider a function

f : \mathbb{R}^n \to \mathbb{R}^m.

Its derivative at a point x is the linear map

Df(x) : \mathbb{R}^n \to \mathbb{R}^m.

The derivative transforms nonlinear behavior into local linear behavior.

The key observation is that differentiation preserves composition according to the chain rule:

D(f \circ g)(x) = Df(g(x)) \cdot Dg(x)

This equation is not merely a calculus identity. It says differentiation respects program composition.

Suppose a program computes:

x \xrightarrow{g} y \xrightarrow{f} z.

Differentiation transforms the composed computation into composition of local linear maps.

This preservation law is the semantic foundation of AD.
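As a quick sanity check, the composition law can be verified numerically. A minimal Python sketch, with f = sin and g = x² chosen purely for illustration, comparing the chain-rule product against a central finite difference:

```python
import math

# Chain-rule check for f(g(x)) with f = sin, g = x**2, at x = 1.5.
# D(f∘g)(x) = Df(g(x)) · Dg(x); compare against a finite difference.
# (Illustrative choices of f and g; any smooth pair would do.)

def g(x): return x * x
def dg(x): return 2.0 * x

def f(y): return math.sin(y)
def df(y): return math.cos(y)

x = 1.5
chain = df(g(x)) * dg(x)               # Df(g(x)) · Dg(x)

h = 1e-6                               # central finite difference
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)

assert abs(chain - numeric) < 1e-6
```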

Programs as Algebraic Expressions

A straight-line program can be modeled as an algebra generated from primitive operations.

For example:

z = x y + \sin(x)

can be viewed as an expression tree built from:

  • variables,
  • constants,
  • addition,
  • multiplication,
  • transcendental operators.

The derivative operator acts recursively over this algebra.

For sums:

D(f + g) = Df + Dg

For products:

D(fg) = f\,Dg + g\,Df

For composition:

D(f \circ g) = (Df \circ g) \cdot Dg

These rules define differentiation as an algebra homomorphism enriched by the chain rule.
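These recursive rules are easy to realize directly over expression trees. A minimal Python sketch, using tuples as a hypothetical expression representation (a single-variable variant of the example above):

```python
# A minimal recursive derivative operator over expression trees,
# mirroring the sum, product, and chain rules above. Expressions are
# tuples: ('var',), ('const', c), ('add', a, b), ('mul', a, b), ('sin', a).
import math

def d(e):
    """Return the derivative of expression e with respect to the variable."""
    tag = e[0]
    if tag == 'var':   return ('const', 1.0)
    if tag == 'const': return ('const', 0.0)
    if tag == 'add':   return ('add', d(e[1]), d(e[2]))          # D(f+g) = Df + Dg
    if tag == 'mul':   return ('add', ('mul', e[1], d(e[2])),    # D(fg) = f·Dg + g·Df
                                      ('mul', e[2], d(e[1])))
    if tag == 'sin':   return ('mul', ('cos', e[1]), d(e[1]))    # chain rule
    raise ValueError(tag)

def ev(e, x):
    """Evaluate expression e at the point x."""
    tag = e[0]
    if tag == 'var':   return x
    if tag == 'const': return e[1]
    if tag == 'add':   return ev(e[1], x) + ev(e[2], x)
    if tag == 'mul':   return ev(e[1], x) * ev(e[2], x)
    if tag == 'sin':   return math.sin(ev(e[1], x))
    if tag == 'cos':   return math.cos(ev(e[1], x))  # appears only in outputs of d
    raise ValueError(tag)

# z = x*x + sin(x), so dz/dx = 2x + cos(x)
z = ('add', ('mul', ('var',), ('var',)), ('sin', ('var',)))
assert abs(ev(d(z), 2.0) - (4.0 + math.cos(2.0))) < 1e-12
```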

The structure resembles derivations in abstract algebra.

Derivations

Let A be an algebra over a field K. A derivation is a map

\delta : A \to A

satisfying:

Linearity:

\delta(a + b) = \delta(a) + \delta(b)

Scalar compatibility:

\delta(\lambda a) = \lambda\,\delta(a)

Leibniz rule:

\delta(ab) = a\,\delta(b) + b\,\delta(a)

This abstract definition captures the algebraic essence of differentiation.

In automatic differentiation, tangent propagation behaves exactly like a derivation.

For example, forward-mode AD over dual numbers computes:

(x + \epsilon \dot x)(y + \epsilon \dot y) = xy + \epsilon(x\dot y + y\dot x)

because

\epsilon^2 = 0.

The tangent component obeys the Leibniz rule automatically.

Thus forward-mode AD realizes derivations concretely inside an algebraic extension.

The Tangent Bundle Construction

Differentiation can also be interpreted geometrically.

For each point x, the derivative acts on tangent vectors:

(x, v) \mapsto (f(x), Df(x)v).

This defines the tangent map:

Tf : T\mathbb{R}^n \to T\mathbb{R}^m.

Forward-mode AD computes exactly this transformation.

The tangent construction satisfies:

Identity preservation:

T(\mathrm{id}) = \mathrm{id}

Composition preservation:

T(f \circ g) = Tf \circ Tg.

This means tangent propagation behaves functorially. The tangent operator transforms functions into derivative-aware functions while preserving composition structure.

Operationally, this is forward-mode AD.

Semantically, it is a functor on a category of smooth maps.
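The functor law T(f∘g) = Tf ∘ Tg can be checked concretely. A sketch assuming hand-written derivatives for two illustrative primitives, exp and sin:

```python
import math

# A sketch of the tangent functor T: each smooth map f becomes
# Tf : (x, v) -> (f(x), Df(x)·v). We verify T(f∘g) = Tf ∘ Tg for
# scalar g = sin, f = exp (hypothetical choices for illustration).

def T(f, df):
    """Lift (f, its derivative df) to the tangent map Tf."""
    def Tf(xv):
        x, v = xv
        return (f(x), df(x) * v)
    return Tf

Tg = T(math.sin, math.cos)
Tf = T(math.exp, math.exp)
Tfg = T(lambda x: math.exp(math.sin(x)),
        lambda x: math.exp(math.sin(x)) * math.cos(x))  # chain rule by hand

x, v = 0.7, 2.0
lhs = Tfg((x, v))            # T(f∘g)
rhs = Tf(Tg((x, v)))         # Tf ∘ Tg
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```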

Dual Numbers as Semantic Objects

Forward mode is often explained mechanically using operator overloading. Algebraically, the true object is the dual-number algebra:

\mathbb{D} = \mathbb{R}[\epsilon]/(\epsilon^2).

An element has the form:

a + \epsilon b.

Applying a smooth function gives:

f(a + \epsilon b) = f(a) + \epsilon f'(a)b.

The derivative appears because higher-order terms vanish:

\epsilon^2 = 0.

Thus differentiation becomes ordinary evaluation inside an extended algebra.

This perspective is extremely important:

AD does not approximate derivatives numerically.

It embeds computation into a richer algebra where derivatives emerge from algebraic structure.

Forward-mode AD therefore becomes a semantic lifting operation:

f \mapsto \hat f

where

\hat f(a + \epsilon b) = f(a) + \epsilon Df(a)b.
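This lifting is short to implement. A minimal dual-number sketch in Python (the class and the `sin` lifting are illustrative, not any framework's API), applied to the earlier example z = xy + sin(x):

```python
import math

# Minimal dual-number algebra D = R[ε]/(ε²): evaluating a function on
# a + εb yields f(a) + ε f'(a)b with no symbolic manipulation.
class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b          # primal, tangent

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # ε² = 0 kills the b·b term, leaving exactly the Leibniz rule.
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def sin(x):
    """Lift sin to dual numbers: sin(a + εb) = sin(a) + ε cos(a)b."""
    if isinstance(x, Dual):
        return Dual(math.sin(x.a), math.cos(x.a) * x.b)
    return math.sin(x)

# Evaluate z = x*y + sin(x) at (x, y) = (2, 3); seed ẋ = 1 for ∂/∂x.
x, y = Dual(2.0, 1.0), Dual(3.0, 0.0)
z = x * y + sin(x)
assert abs(z.b - (3.0 + math.cos(2.0))) < 1e-12   # ∂z/∂x = y + cos(x)
```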

Reverse Mode as Linear Transposition

Reverse mode has a different algebraic interpretation.

Forward mode propagates tangent vectors:

v \mapsto J_f v.

Reverse mode propagates covectors:

w \mapsto J_f^T w.

Thus reverse mode computes the transpose action of the derivative map.

This distinction is fundamental.

Forward mode acts in tangent space.

Reverse mode acts in cotangent space.

The adjoint variables in reverse mode are not arbitrary bookkeeping structures. They are covectors propagated backward through transposed linear maps.

For a primitive operation:

z = xy,

the local Jacobian is:

\frac{\partial z}{\partial(x,y)} = (y, x).

Reverse mode distributes incoming adjoints according to the transpose:

\bar x \mathrel{+}= y\,\bar z, \qquad \bar y \mathrel{+}= x\,\bar z.

The reverse sweep therefore computes a pullback operation on covectors.

Semantically, reverse mode corresponds to the dual action of the tangent map.
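A hand-unrolled reverse sweep for the running example z = xy + sin(x) makes the transpose action concrete. This is a sketch, not a general tape implementation:

```python
import math

# Reverse mode as covector propagation for z = x*y + sin(x).
# The forward sweep records intermediates; the backward sweep applies
# each transposed local Jacobian to the incoming adjoint.

def grad_z(x, y):
    # forward sweep (record intermediates)
    u = x * y
    s = math.sin(x)
    z = u + s
    # backward sweep: seed z̄ = 1, distribute through transposes
    z_bar = 1.0
    u_bar = z_bar                 # z = u + s: addition copies adjoints
    s_bar = z_bar
    x_bar = u_bar * y             # u = x*y:    x̄ += y·ū
    y_bar = u_bar * x             #             ȳ += x·ū
    x_bar += s_bar * math.cos(x)  # s = sin(x): x̄ += cos(x)·s̄
    return x_bar, y_bar

gx, gy = grad_z(2.0, 3.0)
assert abs(gx - (3.0 + math.cos(2.0))) < 1e-12
assert abs(gy - 2.0) < 1e-12
```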

AD as Program Transformation

Suppose a program computes:

P:XY. P : X \to Y.

An AD transformation constructs another program:

D[P] : TX \to TY

or, for reverse mode,

D^*[P] : TY^* \to TX^*.

The transformation preserves compositional structure:

D[P_2 \circ P_1] = D[P_2] \circ D[P_1].

This property matters because large programs are built compositionally.

Without compositional semantics, AD systems would require whole-program symbolic manipulation. With compositional semantics, local differentiation rules combine automatically into global derivatives.

This explains why AD scales to millions of operations.

Linear Types and Differential Structure

Differentiation introduces a distinction between:

  • nonlinear values,
  • linearized perturbations.

A tangent vector should behave linearly:

D(f)(x)(a v_1 + b v_2) = a\,D(f)(x)v_1 + b\,D(f)(x)v_2.
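This linearity can be checked directly for the running example f(x, y) = xy + sin(x), whose JVP is written out by hand below (a sketch; no AD library is assumed):

```python
import math

# Linearity of the tangent map: for f(x, y) = x·y + sin(x), the JVP
# at a fixed point is linear in the perturbation (vx, vy).

def jvp(x, y, vx, vy):
    # Df(x,y)·(vx, vy) = (y + cos x)·vx + x·vy
    return (y + math.cos(x)) * vx + x * vy

x, y = 2.0, 3.0
a, b = 1.5, -0.5
v1, v2 = (1.0, 0.0), (0.3, 0.7)

# Df(x)(a·v1 + b·v2) versus a·Df(x)v1 + b·Df(x)v2
lhs = jvp(x, y, a * v1[0] + b * v2[0], a * v1[1] + b * v2[1])
rhs = a * jvp(x, y, *v1) + b * jvp(x, y, *v2)
assert abs(lhs - rhs) < 1e-12
```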

Many modern differentiable languages expose this distinction through type systems.

Examples include:

  • linear types,
  • tangent types,
  • cotangent types,
  • differential lambda calculi.

A semantic system for AD must therefore model:

  1. ordinary computation,
  2. linearized computation,
  3. interaction between them.

This separation becomes essential for:

  • higher-order AD,
  • compiler optimization,
  • memory-safe reverse mode,
  • differentiable functional languages.

Semantics of Primitive Operations

An AD system defines derivative semantics for primitives.

For addition:

(x, y) \mapsto x + y

the derivative map is:

(u, v) \mapsto u + v.

For multiplication:

(x, y) \mapsto xy

the derivative map is:

(u, v) \mapsto uy + xv.

Each primitive operation carries:

  • primal semantics,
  • tangent semantics,
  • adjoint semantics.

Complex programs inherit derivative semantics compositionally.

This observation leads naturally to differentiable intermediate representations where every operation has an associated differential rule.

Modern ML frameworks internally rely on this structure.
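A differentiable intermediate representation can be sketched as a table of primitives, each carrying the three semantics. The dictionary layout below is illustrative, not any particular framework's representation:

```python
# Sketch of a differentiable IR: each primitive carries primal,
# tangent (forward-mode), and adjoint (reverse-mode) semantics.

PRIMITIVES = {
    'add': {
        'primal':  lambda x, y: x + y,
        'tangent': lambda x, y, u, v: u + v,           # (u,v) -> u+v
        'adjoint': lambda x, y, zbar: (zbar, zbar),    # transpose of [1 1]
    },
    'mul': {
        'primal':  lambda x, y: x * y,
        'tangent': lambda x, y, u, v: u * y + x * v,   # (u,v) -> uy+xv
        'adjoint': lambda x, y, zbar: (y * zbar, x * zbar),
    },
}

op = PRIMITIVES['mul']
assert op['primal'](2.0, 3.0) == 6.0
assert op['tangent'](2.0, 3.0, 1.0, 0.0) == 3.0       # d(xy)/dx at (2,3)
assert op['adjoint'](2.0, 3.0, 1.0) == (3.0, 2.0)
```

A compiler pass can then differentiate any straight-line program by looking up these rules per operation and composing them.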

Algebraic Effects of Mutation

Pure functional programs compose cleanly under differentiation.

Mutation complicates semantics because values evolve over time.

Consider:

x = x + 1

The meaning depends on program state.

Reverse mode becomes especially difficult because adjoints must correspond to earlier versions of variables. A mutated value may no longer exist during the backward sweep.

AD systems therefore introduce:

  • tapes,
  • SSA transformation,
  • persistent values,
  • effect tracking,
  • functionalization passes.

Semantically, mutation forces differentiation to interact with stateful algebraic effects.

This is one reason modern AD compilers often lower imperative programs into SSA-like functional representations before differentiation.
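Functionalization can be illustrated on the mutation above. The sketch below (hypothetical helper names) rewrites `x = x + 1` into SSA form so the backward sweep still references the right version of x:

```python
# Functionalization sketch: the mutating program `x = x + 1; y = x * x`
# is rewritten into SSA form where every version of x is a fresh value,
# so the reverse sweep can still see x1.

def program_ssa(x0):
    x1 = x0 + 1.0          # was: x = x + 1
    y  = x1 * x1           # was: y = x * x
    return y, x1           # keep x1 alive for the reverse sweep

def program_grad(x0):
    y, x1 = program_ssa(x0)
    # reverse sweep over the SSA form
    y_bar = 1.0
    x1_bar = 2.0 * x1 * y_bar   # y = x1²
    x0_bar = x1_bar             # x1 = x0 + 1: identity Jacobian
    return x0_bar

assert program_grad(3.0) == 8.0   # d/dx (x+1)² = 2(x+1) = 8 at x = 3
```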

Commutative Diagrams

Many semantic properties of differentiation can be expressed using commuting diagrams.

Suppose:

g : X \to Y, \qquad f : Y \to Z.

Then differentiation defines the tangent maps:

Tg : TX \to TY, \qquad Tf : TY \to TZ.

The following composition law holds:

TX ----Tg----> TY ----Tf----> TZ
 |                             ^
 '----------T(f∘g)-------------'

The diagram commutes because differentiation preserves composition.

This categorical viewpoint becomes central in:

  • differential categories,
  • tangent categories,
  • cartesian differential categories,
  • synthetic differential geometry.

These frameworks generalize differentiation beyond ordinary calculus.

Semantic Difference Between AD and Symbolic Differentiation

Symbolic differentiation manipulates expressions.

AD transforms computations.

This distinction is semantic.

Consider:

f(x) = \prod_{i=1}^{1000} x_i.

Symbolic differentiation may generate exponentially large expressions if sharing is lost.

AD preserves computational sharing because it differentiates the execution graph rather than algebraic syntax.

The derivative semantics are attached to program structure, not merely to formulas.

This difference explains why AD typically achieves linear complexity while symbolic differentiation may exhibit expression swell.
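The product example makes this concrete: one forward and one backward sweep over the shared computation yield all partials in linear time. A sketch (shown for a short vector):

```python
# Gradient of f(x) = x_1 · x_2 · … · x_n via one forward and one
# backward sweep: O(n) work, versus the expression swell a naive
# symbolic derivative of each partial would produce.

def prod_grad(xs):
    n = len(xs)
    prefix = [1.0] * (n + 1)           # prefix[i] = x_1 … x_i
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] * x
    grad = [0.0] * n
    suffix = 1.0                       # running product of the tail
    for i in range(n - 1, -1, -1):
        grad[i] = prefix[i] * suffix   # ∂f/∂x_i = product of the others
        suffix *= xs[i]
    return grad

assert prod_grad([2.0, 3.0, 4.0]) == [12.0, 8.0, 6.0]
```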

Semantic Layers of AD

Automatic differentiation simultaneously exists at several semantic levels:

Layer            Meaning
Numerical        Evaluate derivatives numerically
Algebraic        Derivations and chain-rule structure
Geometric        Tangent and cotangent propagation
Programmatic     Transformation of computations
Categorical      Functorial structure
Type-theoretic   Linear and differential typing

Different AD systems emphasize different layers.

Operator-overloading systems emphasize numerical semantics.

Compiler-based systems emphasize program transformation semantics.

Research languages often emphasize categorical or type-theoretic semantics.

All of them implement the same underlying chain-rule structure.

The Core Semantic Idea

The essential algebraic insight is simple:

Differentiation is not merely a numerical procedure.

It is a compositional transformation preserving computational structure.

Forward mode lifts computations into tangent algebras.

Reverse mode propagates dual linear maps backward through compositions.

Programs become differentiable because differentiation respects the algebra of computation itself.