Differentiation as Functorial Transformation

The preceding sections described automatic differentiation through algebraic, categorical, logical, and denotational models. These viewpoints converge on one central idea:

Differentiation preserves structure.

A program is built by composing smaller computations. Automatic differentiation works because the derivative of the whole computation can be constructed from derivatives of the parts, in the same compositional order for forward mode and in the dual order for reverse mode.

This section states that idea in its most compact semantic form:

Differentiation is a functorial transformation of computation.

Programs as a Category

Let programs form a category \mathcal{C}.

Objects are types or spaces:

X, Y, Z

Morphisms are programs:

f : X \to Y.

Composition is program composition:

g \circ f : X \to Z.

Identity morphisms are identity programs:

\mathrm{id}_X : X \to X.

This category captures the basic fact that programs compose.

If a program first computes f and then computes g, the total computation is g \circ f.

The Tangent Transformation

Forward-mode AD maps each space X to its tangent space TX.

For a vector space, this can be represented as:

TX = X \times X.

The first component is the primal value.

The second component is the tangent value.

A program:

f : X \to Y

is transformed into:

Tf : TX \to TY.

The transformed program is:

Tf(x,v) = (f(x),\ Df(x)v).

This is precisely forward-mode AD.
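
To make this concrete, here is a minimal sketch of forward mode as evaluation on primal-tangent pairs, using dual numbers in Python. The Dual class and the sin wrapper are illustrative names for this sketch, not taken from any particular library.

    import math

    class Dual:
        """An element of TX = X x X: a primal value paired with a tangent."""
        def __init__(self, primal, tangent=0.0):
            self.primal = primal
            self.tangent = tangent

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.primal + other.primal,
                        self.tangent + other.tangent)

        __radd__ = __add__

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # Product rule: tangent of xy is x * dy + dx * y
            return Dual(self.primal * other.primal,
                        self.primal * other.tangent +
                        self.tangent * other.primal)

        __rmul__ = __mul__

    def sin(d):
        # T(sin)(x, v) = (sin x, cos(x) v)
        return Dual(math.sin(d.primal), math.cos(d.primal) * d.tangent)

    # Tf(x, v) = (f(x), Df(x) v): evaluate f on a primal-tangent pair.
    out = sin(Dual(2.0, 1.0) * Dual(2.0, 1.0))   # f(x) = sin(x^2) at x = 2
    print(out.primal, out.tangent)               # sin(4) and 2 * 2 * cos(4)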

Identity Preservation

A functor must preserve identity.

The identity program is:

\mathrm{id}_X(x) = x.

Its derivative is the identity linear map.

Therefore:

T(\mathrm{id}_X)(x,v) = (x,v).

So:

T(\mathrm{id}_X) = \mathrm{id}_{TX}.

Forward-mode differentiation preserves identity computations.

This matters operationally because differentiating a no-op should produce a no-op on both primal and tangent values.
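
Continuing the Dual sketch above, the law is directly checkable: running the identity program on a primal-tangent pair returns the pair unchanged.

    # T(id)(x, v) = (x, v): a no-op stays a no-op on both components.
    def identity(d):
        return d

    d = identity(Dual(5.0, 1.0))
    assert (d.primal, d.tangent) == (5.0, 1.0)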

Composition Preservation

A functor must preserve composition.

Suppose:

X \xrightarrow{f} Y \xrightarrow{g} Z.

Then:

T(g \circ f) = Tg \circ Tf.

Expanding both sides gives:

T(g \circ f)(x,v) = (g(f(x)),\ D(g \circ f)(x)v).

By the chain rule:

D(g \circ f)(x)v = Dg(f(x))\, Df(x)v.

Now compute the right side:

Tf(x,v) = (f(x),\ Df(x)v).

Then:

Tg(Tf(x,v)) = (g(f(x)),\ Dg(f(x))\, Df(x)v).

The two sides agree.

This is the chain rule written as functoriality.
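
The same sketch verifies composition preservation numerically: pushing a dual number through the composed program agrees with the manual chain-rule product. The functions f and g below are arbitrary examples.

    # f(x) = x^2, g(y) = y^2, so (g . f)(x) = x^4 and D(g . f)(x) = 4 x^3.
    def f(d): return d * d
    def g(d): return d * d

    x, v = 3.0, 1.0
    lhs = g(f(Dual(x, v)))                # Tg(Tf(x, v))
    rhs = (2 * (x * x)) * (2 * x) * v     # Dg(f(x)) Df(x) v by hand
    assert lhs.tangent == rhs == 4 * x ** 3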

Forward Mode as a Functor

Forward-mode AD is therefore a functor:

T : \mathcal{C} \to \mathcal{C}.

It maps:

Program structure                  Tangent transformation
Type X                             Tangent type TX
Program f : X \to Y                Tangent program Tf : TX \to TY
Identity program                   Identity tangent program
Composition                        Composition of tangent programs

This gives a precise semantic explanation of why forward mode is simple.

It follows the same direction as ordinary evaluation.

Reverse Mode as Contravariant Structure

Reverse mode has a related but different structure.

For:

f : X \to Y,

reverse mode propagates cotangents:

T^*f : X \times Y^* \to X^*.

Given a primal point x and an output cotangent \bar y, it computes:

\bar x = Df(x)^T \bar y.

The cotangent flow reverses direction.

For composition:

g \circ f,

the pullback satisfies:

D(g \circ f)(x)^T = Df(x)^T\, Dg(f(x))^T.

The order reverses.

This is why reverse mode is contravariant in its cotangent component.
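
A quick numpy check of the order reversal, for linear stages f(x) = Ax and g(y) = By, so that the Jacobians are A and B at every point; the shapes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))    # Df(x) = A
    B = rng.standard_normal((2, 4))    # Dg(y) = B
    y_bar = rng.standard_normal(2)     # output cotangent

    # D(g . f)(x)^T y_bar, computed whole vs. stage by stage in reverse.
    assert np.allclose((B @ A).T @ y_bar, A.T @ (B.T @ y_bar))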

Pullbacks

Reverse mode is naturally described through pullbacks.

For a smooth map:

f : X \to Y,

the derivative sends tangent vectors forward:

Df(x) : T_xX \to T_{f(x)}Y.

The pullback sends cotangents backward:

Df(x)^* : T^*_{f(x)}Y \to T^*_xX.

Reverse-mode AD computes this pullback.

For composition:

X \xrightarrow{f} Y \xrightarrow{g} Z,

pullbacks compose in reverse order:

D(g \circ f)(x)^* = Df(x)^* \circ Dg(f(x))^*.

This is the semantic reason reverse mode executes a backward pass.
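
The backward pass falls out of this composition law. Below is a minimal sketch in which each stage returns its output together with a pullback closure, and composition collects the pullbacks to be applied in reverse order; all names are illustrative, not from any particular library.

    import numpy as np

    def linear(A):
        def stage(x):
            def pullback(y_bar):
                return A.T @ y_bar        # Df(x)^* = A^T for a linear map
            return A @ x, pullback
        return stage

    def square(x):
        def pullback(y_bar):
            return 2.0 * x * y_bar        # pullback of elementwise x^2
        return x * x, pullback

    def compose(*stages):
        def program(x):
            pullbacks = []
            for stage in stages:          # forward pass, left to right
                x, pb = stage(x)
                pullbacks.append(pb)
            def pullback(y_bar):
                for pb in reversed(pullbacks):   # backward pass, reversed
                    y_bar = pb(y_bar)
                return y_bar
            return x, pullback
        return program

    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    h = compose(linear(A), square)        # h(x) = (A x)^2, elementwise
    y, pullback = h(np.array([1.0, -1.0]))
    print(y, pullback(np.ones(2)))        # primal output and x_bar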

Functoriality and Local Rules

AD systems are built from local derivative rules.

For example, a primitive operation:

z = xy

has tangent rule:

\dot z = \dot x\, y + x\, \dot y.

It also has adjoint rules:

\bar x \mathrel{+}= y \bar z, \quad \bar y \mathrel{+}= x \bar z.

Functoriality explains why these local rules are enough.

A large program is a composition of primitive morphisms. If each primitive is differentiated correctly, and differentiation preserves composition, the derivative of the whole program is correct.

This is the semantic justification for modular AD libraries.
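
As a sketch of what such a library actually registers, here is the multiplication primitive with its two local rules; the function names are illustrative.

    def mul_tangent(x, y, x_dot, y_dot):
        # Forward rule: z_dot = x_dot * y + x * y_dot
        return x * y, x_dot * y + x * y_dot

    def mul_adjoint(x, y, z_bar):
        # Reverse rules: contribution y * z_bar to x_bar, x * z_bar to y_bar
        return y * z_bar, x * z_bar

Nothing else about multiplication needs to be known globally; composition supplies the rest.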

Computational Graphs as Free Categories

A computational graph can be interpreted as a free category generated by primitive operations.

Nodes and edges define compositional structure.

A graph path represents composed computation.

AD extends this graph by assigning:

  • tangent rules for forward mode,
  • pullback rules for reverse mode.

Because the graph is compositional, differentiation can be performed node by node.

This gives a categorical explanation for graph-based AD systems.

Intermediate Representations

Compiler IRs make functorial transformation concrete.

A differentiable IR contains:

  • primitive operations,
  • type information,
  • data dependencies,
  • control-flow structure.

An AD transform maps this IR into another IR.

Forward mode maps values to primal-tangent pairs.

Reverse mode maps computations to primal code plus cotangent propagation code.

For correctness, the transform must preserve:

  • identities,
  • composition,
  • products,
  • linear structure.

This is functoriality expressed as compiler design.
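
A toy illustration, with an invented IR (a list of three-address ops, not any real compiler's format): the forward transform rewrites each op into its primal op followed by the ops that compute its tangent.

    # Each op: (output var, primitive, (input vars)).
    ir = [
        ("t1", "mul", ("x", "x")),    # t1 = x * x
        ("t2", "add", ("t1", "x")),   # t2 = t1 + x, i.e. x^2 + x
    ]

    def forward_transform(ir):
        """Rewrite each op into primal code plus tangent code (d-prefixed vars)."""
        out = []
        for dst, op, (a, b) in ir:
            out.append((dst, op, (a, b)))                         # primal op
            if op == "add":
                out.append((f"d{dst}", "add", (f"d{a}", f"d{b}")))
            elif op == "mul":
                # d(ab) = da * b + a * db, spelled as three primitive ops
                out.append((f"u{dst}", "mul", (f"d{a}", b)))
                out.append((f"w{dst}", "mul", (a, f"d{b}")))
                out.append((f"d{dst}", "add", (f"u{dst}", f"w{dst}")))
        return out

    def run(ir, env):
        prims = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
        for dst, op, (a, b) in ir:
            env[dst] = prims[op](env[a], env[b])
        return env

    env = run(forward_transform(ir), {"x": 3.0, "dx": 1.0})
    print(env["t2"], env["dt2"])    # 12.0 and 7.0 = 2x + 1 at x = 3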

Products and Tuples

Programs rarely operate on single values. They operate on tuples, arrays, records, and tensors.

Functorial differentiation must preserve product structure.

For a pair:

(x, y) \in X \times Y,

the tangent object is:

T(X \times Y) \cong TX \times TY.

This means tangent transformation distributes over product types.

For a function pair:

\langle f, g \rangle : X \to Y \times Z,

we have:

T\langle f, g \rangle = \langle Tf, Tg \rangle.

This explains why AD systems can differentiate structured outputs component by component.
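
Continuing the Dual sketch from earlier, the pairing law is again directly checkable, here on example components f(x) = x^2 and g(x) = x^3.

    def f(d): return d * d          # Df(2) = 4
    def g(d): return d * d * d      # Dg(2) = 12

    def pair(d):                    # <f, g> : X -> Y x Z
        return (f(d), g(d))

    y, z = pair(Dual(2.0, 1.0))     # T<f, g>(x, v) = (Tf(x, v), Tg(x, v))
    assert (y.tangent, z.tangent) == (4.0, 12.0)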

Linear Maps as Fixed Points of Differentiation

For a linear map:

L : X \to Y,

the derivative is the map itself:

DL(x)v = L(v).

Therefore:

TL(x,v) = (Lx, Lv).

Linear maps are stable under differentiation.

This fact is heavily used in tensor systems:

  • matrix multiplication,
  • convolution,
  • linear projection,
  • tensor contraction.

A compiler can exploit linearity to avoid unnecessary derivative construction.
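
A short numpy check that differentiation fixes a linear map L(x) = Mx, comparing the tangent Mv against a finite-difference estimate; M is an arbitrary example matrix.

    import numpy as np

    M = np.array([[2.0, 0.0], [1.0, 3.0]])
    x = np.array([1.0, 2.0])
    v = np.array([0.5, -1.0])

    # TL(x, v) = (M x, M v): the same map acts on primal and tangent.
    primal, tangent = M @ x, M @ v

    eps = 1e-6
    fd = (M @ (x + eps * v) - M @ x) / eps    # directional finite difference
    assert np.allclose(tangent, fd)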

Higher-Order Functoriality

Applying AD repeatedly gives higher-order transformations.

Forward-over-forward mode constructs:

T(Tf).

Forward-over-reverse and reverse-over-forward compute mixed derivative products.

Functoriality ensures these transformations remain compositional, provided perturbations are correctly scoped.

Higher-order AD therefore requires:

  • nested tangent structures,
  • perturbation tags,
  • clean type discipline,
  • correct functor composition.

Without these, nested AD can suffer perturbation confusion.
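
As a minimal illustration of nesting by structure alone, the Dual sketch from earlier can be applied to itself: a dual number whose components are themselves dual numbers computes a second derivative. Real systems attach perturbation tags so that distinct nested derivatives cannot be conflated.

    # Forward-over-forward: f(x) = x^3, so f''(3) = 6 * 3 = 18.
    x = Dual(Dual(3.0, 1.0), Dual(1.0, 0.0))  # outer tangent is an inner dual
    y = x * x * x                             # T(Tf) via nested pairs
    print(y.primal.primal,                    # f(3)   = 27
          y.primal.tangent,                   # f'(3)  = 27
          y.tangent.tangent)                  # f''(3) = 18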

Effects and Limits of Functoriality

Pure programs compose cleanly.

Effectful programs complicate the picture.

Effects include:

  • mutation,
  • I/O,
  • randomness,
  • exceptions,
  • concurrency,
  • nondeterminism.

For such programs, the category of computations must include effect structure.

Differentiation then must preserve not only value composition, but effectful composition.

This is difficult.

For example, random sampling may denote a stochastic computation rather than a smooth function. Mutation may require state-passing semantics. Exceptions may create partial functions.

A functorial AD theory must do one of the following:

  • restrict the language,
  • model effects explicitly,
  • define custom derivative semantics for effects.

Why Functoriality Matters

Functoriality gives the high-level correctness criterion for AD.

An AD transform should satisfy:

D[\mathrm{id}] = \mathrm{id}

and

D[g \circ f] = D[g] \circ D[f]

for forward mode, with the appropriate dual ordering for reverse mode.

These laws imply that local derivative rules scale to full programs.

They also guide practical implementation:

  • primitives need correct derivative definitions,
  • composition must be preserved,
  • graph rewrites must respect derivative semantics,
  • compiler passes must not break tangent or cotangent structure.

The Core View

Automatic differentiation is a structure-preserving transformation on programs.

Forward mode maps computations into tangent computations.

Reverse mode maps computations into cotangent pullbacks.

The chain rule is the law that makes this transformation compositional.

This is why AD scales from one primitive operation to entire differentiable systems.

The fundamental semantic equation is:

T(g \circ f) = Tg \circ Tf.

For reverse mode, the corresponding law is:

(g \circ f)^* = f^* \circ g^*.

Together, these express the deepest reason automatic differentiation works: derivatives compose because programs compose.