# Lambda Calculus and AD

Automatic differentiation becomes substantially more difficult once programs contain higher-order functions.

A first-order numerical program computes values:

$$
f : \mathbb{R}^n \to \mathbb{R}^m.
$$

A higher-order program computes functions, consumes functions, or returns functions:

$$
F : (X \to Y) \to Z.
$$

Ordinary AD systems built around tapes and computational graphs work well for first-order programs, but higher-order functions introduce semantic complications:
- closures,
- lexical environments,
- delayed evaluation,
- nested differentiation,
- perturbation confusion,
- mutable captures,
- higher-order reverse mode.

Lambda calculus provides the foundational model for higher-order computation. To build principled differentiable programming languages, differentiation must therefore be integrated into lambda calculus itself.

This section studies how differentiation interacts with:
- abstraction,
- application,
- substitution,
- closures,
- higher-order semantics.

The result is the foundation of differentiable functional programming.

## Lambda Calculus as a Model of Computation

Untyped lambda calculus consists of:
- variables,
- abstraction,
- application.

A lambda term is constructed from:

Variables:

$$
x
$$

Abstraction:

$$
\lambda x. t
$$

Application:

$$
t\ u.
$$

Functions are first-class values.

For example:

$$
\lambda x. x^2+1
$$

represents a function object.

Application:

$$
(\lambda x. x^2+1)\ 3
$$

reduces by substitution:

$$
3^2+1.
$$

Modern functional languages inherit this structure:
- Lisp,
- Haskell,
- OCaml,
- F#,
- Scheme,
- ML,
- functional IRs inside differentiable compilers.

Differentiable programming therefore requires differentiation rules over lambda terms.
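The reduction above can be made concrete with a minimal term representation. The sketch below is purely illustrative (the class and function names are not from any particular system); it implements substitution for closed arguments, which sidesteps variable capture, and evaluates $(\lambda x.\,x^2+1)\ 3$ by beta reduction:

```python
from dataclasses import dataclass

# A minimal lambda-term AST: variables, abstraction, application,
# numeric literals, and binary primitives (illustrative names).
@dataclass
class Var:  name: str
@dataclass
class Lam:  param: str; body: object
@dataclass
class App:  fn: object; arg: object
@dataclass
class Lit:  value: float
@dataclass
class Prim: op: str; left: object; right: object  # '+' or '*'

def subst(t, name, u):
    """t[u/name]; safe here because u is closed, so no capture can occur."""
    if isinstance(t, Var):
        return u if t.name == name else t
    if isinstance(t, Lam):
        return t if t.param == name else Lam(t.param, subst(t.body, name, u))
    if isinstance(t, App):
        return App(subst(t.fn, name, u), subst(t.arg, name, u))
    if isinstance(t, Prim):
        return Prim(t.op, subst(t.left, name, u), subst(t.right, name, u))
    return t  # Lit

def evaluate(t):
    if isinstance(t, Lit):
        return t.value
    if isinstance(t, Prim):
        l, r = evaluate(t.left), evaluate(t.right)
        return l + r if t.op == '+' else l * r
    if isinstance(t, App):
        fn = t.fn  # beta reduction: (λx. body) u  →  body[u/x]
        return evaluate(subst(fn.body, fn.param, t.arg))
    raise TypeError(t)

# (λx. x*x + 1) 3
square_plus_one = Lam('x', Prim('+', Prim('*', Var('x'), Var('x')), Lit(1.0)))
print(evaluate(App(square_plus_one, Lit(3.0))))  # 10.0
```

Functional IRs in differentiable compilers manipulate terms of essentially this shape before any differentiation rule is applied.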

## The Central Problem

Suppose:

$$
f=\lambda x. x^2.
$$

Its derivative should be:

$$
Df=\lambda(x,v). 2xv.
$$

Differentiation transforms functions into linearized functions.

But now consider higher-order structure:

$$
F=\lambda f. f(3)+f(4).
$$

What is:

$$
DF?
$$

The derivative must describe how perturbations of functions propagate through higher-order application.

Ordinary numerical calculus is insufficient because the perturbation object is itself functional.

This is the central challenge of differential lambda calculi.

## Differential Lambda Calculus

Differential lambda calculus extends ordinary lambda calculus with differentiation operators.

A derivative operator:

$$
D[t]
$$

acts syntactically on lambda terms.

The interpretation is:

$$
D[t](x,v)
$$

computes the directional derivative of $t$ at $x$ in direction $v$.

The calculus defines transformation rules compositionally.

For variables:

$$
D[x](x,v)=v.
$$

For constants:

$$
D[c](x,v)=0.
$$

For addition:

$$
D[t+s] =
D[t]+D[s].
$$

For multiplication:

$$
D[ts] =
tD[s]+sD[t].
$$

The difficult cases are abstraction and application.
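The four first-order rules can be realized directly as a source-to-source transformation on expression trees. The toy sketch below (all representation choices are illustrative) differentiates expressions built from one variable, constants, sums, and products, producing an expression for $D[t](x,v)$:

```python
# Expression trees as tuples:
# ('var',), ('const', c), ('add', t, s), ('mul', t, s).
def D(t):
    """Syntactic directional derivative: D[t] as an expression over
    ('var',) for the point x and ('dvar',) for the direction v."""
    kind = t[0]
    if kind == 'var':            # D[x](x, v) = v
        return ('dvar',)
    if kind == 'const':          # D[c](x, v) = 0
        return ('const', 0.0)
    if kind == 'add':            # D[t + s] = D[t] + D[s]
        return ('add', D(t[1]), D(t[2]))
    if kind == 'mul':            # D[t s] = t D[s] + s D[t]
        return ('add', ('mul', t[1], D(t[2])), ('mul', t[2], D(t[1])))
    raise ValueError(kind)

def evaluate(t, x, v=0.0):
    kind = t[0]
    if kind == 'var':   return x
    if kind == 'dvar':  return v
    if kind == 'const': return t[1]
    if kind == 'add':   return evaluate(t[1], x, v) + evaluate(t[2], x, v)
    if kind == 'mul':   return evaluate(t[1], x, v) * evaluate(t[2], x, v)

# t = x*x + 3, so D[t](x, v) = 2xv
t = ('add', ('mul', ('var',), ('var',)), ('const', 3.0))
print(evaluate(D(t), 5.0, 1.0))  # 10.0
```

Note that `D` never evaluates anything: it rewrites syntax, which is exactly why abstraction and application, where syntax and semantics interact, are the hard cases.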

## Differentiating Function Application

Suppose:

$$
t\ u
$$

represents function application.

When the function term $t$ does not itself depend on the differentiated variable, the derivative satisfies the higher-order chain rule:

$$
D[t\ u](x,v)=D[t]\big(u(x),\,D[u](x,v)\big).
$$

This equation hides substantial semantic complexity.

In general, perturbations of both the function and the argument influence the result.

Operationally:
- the function itself may change,
- the input value may change,
- closures may contain perturbed environments.

This interaction makes higher-order AD difficult.
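For function terms independent of the differentiated variable, the chain rule can be checked with a dual-number sketch (pairs of primal and tangent; the helper names are illustrative). With $t=\lambda x.\,x^2$ and $u=\lambda x.\,x+x$, the composite is $(2x)^2=4x^2$, whose derivative is $8x$:

```python
# Dual numbers as (primal, tangent) pairs.
def add(a, b): return (a[0] + b[0], a[1] + b[1])
def mul(a, b): return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

# Transformed t = λx. x*x and u = λx. x + x operate on dual pairs.
t = lambda x: mul(x, x)
u = lambda x: add(x, x)

# D[t u](x, v): the tangent produced by u feeds directly into t.
x = (3.0, 1.0)       # point 3, direction 1
print(t(u(x)))       # (36.0, 24.0): value 4x² = 36, derivative 8x = 24
```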

## Tangent Transformation of Lambda Terms

Forward-mode AD can be viewed as a program transformation.

A term:

$$
t
$$

becomes a transformed term:

$$
\hat t.
$$

Each value carries:
- primal value,
- tangent value.

For example:

$$
x
\mapsto
(x,\dot x).
$$

Addition becomes:

$$
(x,\dot x)+(y,\dot y)
=
(x+y,\dot x+\dot y).
$$

Multiplication becomes:

$$
(x,\dot x)(y,\dot y)
=
(xy,x\dot y+y\dot x).
$$

Lambda abstraction transforms systematically.

A function:

$$
\lambda x.t
$$

becomes:

$$
\lambda(x,\dot x).\hat t.
$$

The transformed function computes both:
- ordinary evaluation,
- tangent propagation.

This gives a compositional forward-mode transformation over higher-order programs.
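Under these rules, higher-order programs differentiate compositionally with no special machinery. The sketch below (plain pairs for dual numbers, illustrative helper names) differentiates $g=\lambda x.\,F(\lambda y.\,x\,y)$ with $F=\lambda f.\,f(3)+f(4)$, so $g(x)=7x$:

```python
# Dual numbers as (primal, tangent) pairs; illustrative helper names.
def const(c):  return (c, 0.0)          # constants carry zero tangent
def add(a, b): return (a[0] + b[0], a[1] + b[1])
def mul(a, b): return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def deriv(f, x):
    """Directional derivative of a transformed function at x, direction 1."""
    return f((x, 1.0))[1]

# F = λf. f(3) + f(4), transformed: it applies transformed functions.
F = lambda f: add(f(const(3.0)), f(const(4.0)))

# g = λx. F(λy. x*y): the inner lambda closes over the perturbed x.
g = lambda x: F(lambda y: mul(x, y))

print(deriv(g, 2.0))  # 7.0, since g(x) = 3x + 4x = 7x
```

The tangent of `x` flows into `F` inside a closure, which is exactly the environment-capture issue taken up next.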

## Closures and Environment Capture

Functional programs rely heavily on closures.

Example:

```text
let a = 3
let f = λx. a*x
```

The function captures environment variable $a$.

Differentiation must determine:
- whether $a$ is differentiable,
- how perturbations propagate into closures,
- how captured environments are represented.

The transformed closure must therefore carry:
- primal environment,
- tangent environment.

A differentiated closure becomes structurally larger than the original closure.

This issue becomes particularly important in:
- JIT compilers,
- staged differentiable languages,
- higher-order reverse mode,
- differentiable interpreters.
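For the `let a = 3` example above, differentiating with respect to the captured variable means the closure's environment itself carries a tangent. A minimal dual-number sketch (illustrative names):

```python
def mul(a, b):
    """Dual-number product on (primal, tangent) pairs."""
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

# let a = 3, perturbed in direction 1: the tangent lives in the environment.
a = (3.0, 1.0)

# f = λx. a*x — the transformed closure captures the dual-valued a.
f = lambda x: mul(a, x)

# Applying f to an unperturbed input propagates the environment's tangent:
print(f((5.0, 0.0)))  # (15.0, 5.0): d/da (a * 5) = 5
```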

## Reverse Mode in Higher-Order Languages

Forward mode composes naturally with lambda calculus because tangent propagation follows evaluation order.

Reverse mode is harder.

A reverse-mode system must:
- record applications,
- preserve closure environments,
- accumulate adjoints,
- reconstruct call structure.

Suppose:

$$
f=\lambda x. x^2
$$

and:

$$
g=\lambda h. h(3).
$$

Evaluating:

$$
g(f)
$$

creates nested functional structure.

The reverse pass must:
- reconstruct the call graph,
- propagate adjoints through higher-order application,
- correctly handle lexical scope.

This requires sophisticated representations:
- closure conversion,
- CPS transformation,
- tapes with environments,
- delimited continuations,
- staged IRs.

Higher-order reverse mode remains one of the hardest areas in AD implementation.
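A minimal tape with environments can record exactly this kind of nested application. The sketch below is illustrative, not a production design: $f$ closes over a parameter $w$, $g$ applies the closure it receives, the forward pass records closures onto a tape, and the backward pass replays the tape in reverse:

```python
class Val:
    """A scalar that records its operations on a shared tape."""
    def __init__(self, v, tape):
        self.v, self.grad, self.tape = v, 0.0, tape

    def __add__(self, other):
        out = Val(self.v + other.v, self.tape)
        def back():
            self.grad += out.grad
            other.grad += out.grad
        self.tape.append(back)
        return out

    def __mul__(self, other):
        out = Val(self.v * other.v, self.tape)
        def back():
            self.grad += other.v * out.grad
            other.grad += self.v * out.grad
        self.tape.append(back)
        return out

tape = []
w = Val(2.0, tape)

# f = λx. w * x * x — a closure whose environment contains w.
f = lambda x: w * x * x

# g = λh. h(3) + h(4) — higher-order: it applies the closure it receives.
g = lambda h: h(Val(3.0, tape)) + h(Val(4.0, tape))

loss = g(f)                   # forward pass records onto the tape
loss.grad = 1.0
for back in reversed(tape):   # backward pass replays the tape in reverse
    back()

print(loss.v, w.grad)  # 50.0 25.0: loss = w*(9 + 16), d(loss)/dw = 25
```

The tape entries are themselves closures over `Val` objects, so the adjoint computation automatically sees the same lexical environment as the forward pass.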

## Perturbation Confusion

Nested differentiation introduces a subtle bug called perturbation confusion.

Suppose a forward-mode system uses infinitesimals:

$$
x+\epsilon v.
$$

Now differentiate a program that itself performs differentiation.

Two nested derivative computations may accidentally share the same perturbation symbol:

$$
\epsilon.
$$

This produces incorrect derivatives because tangent information mixes between differentiation levels.

Example:

$$
D[D[f]].
$$

If perturbations are not isolated correctly, higher-order derivatives become corrupted.

Solutions include:
- tagged perturbations,
- fresh infinitesimals,
- lexical perturbation scopes,
- tower constructions,
- type-indexed perturbations.

This problem appears specifically because lambda calculus permits higher-order differentiation naturally.
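The tagged-perturbation fix can be sketched concretely. In the toy implementation below (illustrative, not any particular system's API), each call to `deriv` draws a fresh tag, and arithmetic treats duals carrying a foreign tag as constants at the current level. The classic confusion example $\frac{d}{dx}\big(x\cdot\frac{d}{dy}(x+y)\big)$ at $x=1$ then correctly yields $1$; a naive untagged implementation famously yields $2$:

```python
import itertools

_fresh = itertools.count()  # supplies a fresh tag per differentiation level

class Dual:
    def __init__(self, primal, tangent, tag):
        self.primal, self.tangent, self.tag = primal, tangent, tag

    def __add__(self, other):
        tag = _level(self, other)
        (p1, t1), (p2, t2) = _parts(self, tag), _parts(other, tag)
        return Dual(p1 + p2, t1 + t2, tag)
    __radd__ = __add__

    def __mul__(self, other):
        tag = _level(self, other)
        (p1, t1), (p2, t2) = _parts(self, tag), _parts(other, tag)
        return Dual(p1 * p2, p1 * t2 + p2 * t1, tag)
    __rmul__ = __mul__

def _level(a, b):
    """Innermost (most recently created) differentiation level involved."""
    return max(x.tag for x in (a, b) if isinstance(x, Dual))

def _parts(x, tag):
    """Primal/tangent of x at this level; foreign tags act as constants."""
    if isinstance(x, Dual) and x.tag == tag:
        return x.primal, x.tangent
    return x, 0.0

def deriv(f, x):
    tag = next(_fresh)          # a fresh infinitesimal for this level
    y = f(Dual(x, 1.0, tag))
    return y.tangent if isinstance(y, Dual) and y.tag == tag else 0.0

# Nested differentiation: correct only because perturbations are tagged.
print(deriv(lambda x: x * deriv(lambda y: x + y, 1.0), 1.0))  # 1.0
```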

## Linear Types and Differential Structure

A derivative is linear in perturbations.

Differential lambda calculi therefore often introduce linear types.

A linear function consumes its argument exactly once.

This distinction matters because:
- tangent values are linear,
- cotangent accumulation is additive,
- reverse mode propagates linear functionals.

Linear type systems help:
- prevent accidental duplication,
- control memory,
- formalize gradient propagation,
- optimize reverse-mode execution.

Modern differentiable functional IRs increasingly incorporate linear typing ideas.

## Differential Substitution

Substitution is central in lambda calculus.

Ordinary beta reduction:

$$
(\lambda x.t)\ u
\to
t[u/x].
$$

Differentiation must interact consistently with substitution.

A differential substitution law expresses how perturbations propagate through substitution.

Intuitively:
- replacing a variable changes the program,
- perturbing the replacement changes the output.

The derivative operator must therefore commute appropriately with substitution structure.

This becomes essential for:
- correctness proofs,
- higher-order differentiation,
- compiler transformations.

## Cartesian Closed Differential Categories

Lambda calculus corresponds categorically to cartesian closed categories.

Differential lambda calculi require:
- cartesian closure,
- differential structure,
- compatibility between them.

A cartesian closed differential category supports:
- higher-order functions,
- abstraction,
- application,
- differentiation.

This allows differentiation of:
- functions,
- closures,
- functionals,
- higher-order programs.

The semantics unify:
- functional programming,
- differential structure,
- compositional AD.

## CPS and Reverse Mode

Continuation-passing style (CPS) is deeply connected to reverse-mode AD.

A continuation represents:
- the future of a computation.

Reverse mode also propagates information backward from future computations.

This correspondence is not accidental.

Many reverse-mode systems can be formulated as CPS transformations.

Operationally:
- the forward pass builds continuations,
- the backward pass executes them in reverse.

This relationship explains why:
- continuations,
- adjoints,
- cotangent propagation,

share similar semantics.

Several differentiable functional languages use CPS internally for reverse-mode transformation.
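The correspondence can be made concrete in a few lines. In the sketch below (illustrative names, in the general style of CPS-based reverse mode), each operation first runs its continuation, the rest of the computation, and accumulates adjoints only after the continuation returns, so the backward pass falls out of the call stack unwinding:

```python
class Num:
    """A number paired with an adjoint accumulator."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

def add(a, b, k):
    out = Num(a.value + b.value)
    k(out)              # run the rest of the computation first...
    a.grad += out.grad  # ...then accumulate adjoints on the way back
    b.grad += out.grad

def mul(a, b, k):
    out = Num(a.value * b.value)
    k(out)
    a.grad += b.value * out.grad
    b.grad += a.value * out.grad

def grad(f, x):
    """Gradient of a CPS-style function f(x, k) at x."""
    v = Num(x)
    f(v, lambda out: setattr(out, 'grad', 1.0))  # seed the final adjoint
    return v.grad

# f(x) = x*x + x, written in continuation-passing style:
f = lambda x, k: mul(x, x, lambda t: add(t, x, k))

print(grad(f, 3.0))  # 7.0, since f'(x) = 2x + 1
```

No explicit tape appears: the continuations are the tape, held implicitly on the call stack.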

## Closures as Differentiable Objects

A closure contains:
- code,
- environment.

Differentiating a closure therefore means differentiating both:
- executable behavior,
- captured state.

A differentiated closure may need:
- primal environment,
- tangent environment,
- adjoint environment.

This complicates memory management substantially.

Large higher-order differentiable programs may allocate many temporary closures during differentiation.

Compiler optimizations therefore become critical:
- closure elimination,
- environment sharing,
- lambda lifting,
- staging,
- partial evaluation.

## Lazy Evaluation and AD

Lazy languages introduce additional complications.

A thunk may:
- execute later,
- execute multiple times,
- never execute.

Reverse mode assumes a concrete execution trace.

Laziness therefore complicates:
- tape construction,
- adjoint scheduling,
- memory lifetime.

Differentiable lazy languages require careful coordination between:
- evaluation strategy,
- differentiation semantics.

Several experimental systems address this using:
- explicit forcing,
- graph reduction semantics,
- demand-driven differentiation.

## Functional Programming and Modern AD Systems

Many modern differentiable systems internally resemble functional languages even when user syntax appears imperative.

Examples:
- JAX traces pure functional graphs,
- MLIR functionalizes tensor programs,
- Enzyme lowers into SSA-style IR,
- differentiable compilers prefer immutable representations.

Functional structure simplifies differentiation because:
- evaluation order becomes explicit,
- mutation disappears,
- closures are analyzable,
- transformations compose cleanly.

This is why many AD systems transform imperative programs into functional IRs before differentiation.

## The Core Difficulty

First-order differentiation transforms numbers.

Higher-order differentiation transforms computations themselves.

Lambda calculus exposes this distinction directly.

Differentiable programming languages therefore require:
- differential abstraction,
- differentiable application,
- differentiable substitution,
- differentiable closures.

Forward mode fits naturally into compositional functional evaluation.

Reverse mode requires explicit representation of future computation and environment structure.

The interaction between:
- higher-order functions,
- linearity,
- closures,
- continuations,
- perturbations,

forms one of the deepest semantic foundations of automatic differentiation.

