Functional Languages


Functional programming languages provide a natural semantic foundation for automatic differentiation. Programs are expressed as compositions of functions, immutable values, and transformations over data. This aligns closely with the mathematical structure of differentiation.

In a pure functional language, a function behaves like a mathematical mapping:

f : X \rightarrow Y

The absence of mutable state means evaluation can be analyzed as a composition graph. Automatic differentiation can therefore operate directly on the structure of computation.

Functional languages also emphasize higher-order functions, algebraic data types, and typed abstractions. These features allow AD systems to represent differentiation as a compositional program transformation rather than as an external runtime mechanism.

Pure Functions and Referential Transparency

A pure function always produces the same output for the same input and has no observable side effects.

square x = x * x

The expression

square 3

can always be replaced by 9.

This property is called referential transparency. It allows the compiler and AD system to reason about computations algebraically.

A derivative transformation can therefore be viewed as another pure function:

D(f)(x, \dot{x}) = (f(x), J_f(x)\,\dot{x})

Forward mode lifts a function into one that propagates tangent information alongside primal values.

Forward Mode as Function Lifting

Suppose a function is defined as

f x = (x + 1) * sin x

Forward mode transforms it into:

fForward (x, dx) =
  let
    v1  = x + 1
    dv1 = dx

    v2  = sin x
    dv2 = cos x * dx

    v3  = v1 * v2
    dv3 = dv1 * v2 + v1 * dv2
  in
    (v3, dv3)

Each operation is lifted from ordinary arithmetic into tangent arithmetic.

This transformation is compositional. If functions f and g are differentiable:

D(f \circ g) = D(f) \circ D(g)

The derivative transformation itself becomes a structure-preserving mapping over programs.
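This compositionality can be made concrete with a small sketch. A "lifted" function maps a (primal, tangent) pair to a (primal, tangent) pair, so lifted functions compose with ordinary function composition. The names `Lifted`, `liftSin`, and `liftSquare` are illustrative, not from any library.

```haskell
-- Forward-mode lifting: each function carries its tangent rule with it.
type Lifted = (Double, Double) -> (Double, Double)

liftSin :: Lifted
liftSin (x, dx) = (sin x, cos x * dx)

liftSquare :: Lifted
liftSquare (x, dx) = (x * x, 2 * x * dx)

-- D(square . sin) = D(square) . D(sin): plain composition of lifted functions.
dSquareSin :: Lifted
dSquareSin = liftSquare . liftSin
```

Seeding the tangent with 1 and applying `dSquareSin` yields both the value of (sin x)^2 and its derivative 2 sin x cos x at that point.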

Dual Numbers in Functional Languages

Functional languages often implement forward mode through dual numbers.

A dual number has the form:

a + b\varepsilon

with

\varepsilon^2 = 0

In Haskell-like notation:

data Dual a = Dual a a

where:

Field         Meaning
First value   Primal value
Second value  Tangent value

Arithmetic operators are overloaded:

instance Num a => Num (Dual a) where
  (Dual x dx) + (Dual y dy) =
    Dual (x + y) (dx + dy)

  (Dual x dx) * (Dual y dy) =
    Dual (x * y) (dx * y + x * dy)

The product rule emerges directly from algebraic structure.

This approach is elegant because differentiation becomes ordinary evaluation over an extended numeric domain.
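A complete, runnable version of this idea is sketched below. The `Num` instance above shows only addition and multiplication; a full instance must also supply `negate`, `abs`, `signum`, and `fromInteger`. The names `Dual`, `sinD`, and `diff` are illustrative, not from any library.

```haskell
-- Forward-mode AD via dual numbers: Dual primal tangent.
data Dual = Dual Double Double

instance Num Dual where
  Dual x dx + Dual y dy = Dual (x + y) (dx + dy)
  Dual x dx * Dual y dy = Dual (x * y) (dx * y + x * dy)
  negate (Dual x dx)    = Dual (negate x) (negate dx)
  abs    (Dual x dx)    = Dual (abs x) (dx * signum x)
  signum (Dual x _)     = Dual (signum x) 0
  fromInteger n         = Dual (fromInteger n) 0

-- One transcendental primitive, enough for the running example.
sinD :: Dual -> Dual
sinD (Dual x dx) = Dual (sin x) (cos x * dx)

-- Differentiate f at a point by seeding the tangent with 1.
diff :: (Dual -> Dual) -> Double -> Double
diff g x = let Dual _ dx = g (Dual x 1) in dx

-- The earlier example: f x = (x + 1) * sin x.
f :: Dual -> Dual
f x = (x + 1) * sinD x
```

Evaluating `diff f 0` propagates tangents through the same expression the primal evaluation uses, giving f'(0) = sin 0 + (0 + 1) cos 0 = 1.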

Higher-Order Functions

Functional languages rely heavily on higher-order functions.

Examples include:

map
foldl
filter
compose

An AD system must differentiate through these abstractions.

For example:

sumSquares xs =
  foldl (+) 0 (map (\x -> x * x) xs)

The differentiation system propagates derivatives through both:

  1. The lambda expression \x -> x * x
  2. The structural operators map and foldl

This creates a recursive view of differentiation:

Construct             Differentiation behavior
Function composition  Chain rule
Mapping               Elementwise derivative propagation
Folding               Sequential accumulation
Recursion             Repeated derivative propagation

Functional structure therefore mirrors derivative structure.
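A self-contained sketch makes this concrete: because `map` and `foldl` are polymorphic in the numeric type, a dual-number type flows through them unchanged, and each partial derivative of `sumSquares` is obtained by seeding one input's tangent with 1. The names `Dual` and `partial` are illustrative, not from any library.

```haskell
-- Minimal dual-number type; only what sumSquares needs.
data Dual = Dual Double Double

instance Num Dual where
  Dual x dx + Dual y dy = Dual (x + y) (dx + dy)
  Dual x dx * Dual y dy = Dual (x * y) (dx * y + x * dy)
  negate (Dual x dx)    = Dual (negate x) (negate dx)
  abs    (Dual x dx)    = Dual (abs x) (dx * signum x)
  signum (Dual x _)     = Dual (signum x) 0
  fromInteger n         = Dual (fromInteger n) 0

-- Unchanged from the text: the structural operators need no AD-specific code.
sumSquares :: Num a => [a] -> a
sumSquares xs = foldl (+) 0 (map (\x -> x * x) xs)

-- Partial derivative w.r.t. the i-th input: seed that tangent with 1.
partial :: ([Dual] -> Dual) -> [Double] -> Int -> Double
partial g xs i =
  let seeded = [ Dual x (if j == i then 1 else 0) | (j, x) <- zip [0 ..] xs ]
      Dual _ d = seeded `seq` g seeded
  in d
```

For sumSquares the i-th partial is 2 * xs !! i, so `partial sumSquares [1,2,3] 1` gives 4.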

Reverse Mode in Functional Languages

Reverse mode is more subtle.

The derivative must flow backward from outputs to inputs. This requires constructing a reverse dependency graph.

Functional languages often represent reverse mode using continuations, closures, or pullbacks.

Instead of returning only a value, a function returns:

  1. The primal result
  2. A backward propagation function

Conceptually:

f x = (y, backprop)

where:

backprop dy = dx

The backward function consumes an output adjoint and produces input adjoints.
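A minimal scalar sketch of this "value and pullback" style is shown below. Each primitive returns its primal result paired with a backward function; composition runs primals forward and chains pullbacks backward. The names `sinVJP`, `addOneVJP`, `composeVJP`, and `gradOf` are illustrative, not from any library.

```haskell
-- Each differentiable function returns (primal, pullback).
sinVJP :: Double -> (Double, Double -> Double)
sinVJP x = (sin x, \dy -> cos x * dy)

addOneVJP :: Double -> (Double, Double -> Double)
addOneVJP x = (x + 1, \dy -> dy)

-- Primals flow forward; pullbacks compose in the reverse order.
composeVJP :: (Double -> (Double, Double -> Double))
           -> (Double -> (Double, Double -> Double))
           -> (Double -> (Double, Double -> Double))
composeVJP f g x =
  let (y, backG) = g x
      (z, backF) = f y
  in (z, backG . backF)

-- Gradient of a scalar function: seed the output adjoint with 1.
gradOf :: (Double -> (Double, Double -> Double)) -> Double -> Double
gradOf f x = snd (f x) 1
```

For example, `gradOf (composeVJP sinVJP addOneVJP) 0` computes the derivative of sin(x + 1) at 0, which is cos 1.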

This style appears in modern differentiable functional systems such as:

System                Technique
Haskell AD libraries  Continuation-passing or tapes
JAX                   Tracing + pullbacks
Zygote                Source transformation
Swift AD              Compiler-generated adjoints

Continuation-Passing Style

Continuation-passing style (CPS) provides a useful interpretation of reverse mode.

A continuation represents “the rest of the computation.”

Instead of computing directly:

f x = y

the program becomes:

f x k = k y

where k is the continuation.

Reverse mode can then accumulate gradients by composing continuations backward.
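The sketch below illustrates this reading for scalar functions: each step passes its result to a continuation that completes the computation, and the step's local derivative scales whatever adjoint the continuation returns. The names `Step`, `sinStep`, `squareStep`, and `done` are illustrative, not from any library.

```haskell
-- A differentiable step in CPS: take x and a continuation mapping the rest of
-- the computation's input to (final value, adjoint of this step's output);
-- return (final value, adjoint of x).
type Step = Double -> (Double -> (Double, Double)) -> (Double, Double)

sinStep :: Step
sinStep x k =
  let (z, dz) = k (sin x)   -- run the rest of the computation
  in (z, cos x * dz)        -- chain rule: scale the returned adjoint

squareStep :: Step
squareStep x k =
  let (z, dz) = k (x * x)
  in (z, 2 * x * dz)

-- Terminal continuation: nothing left to compute, output adjoint is 1.
done :: Double -> (Double, Double)
done y = (y, 1)

-- Derivative of (sin x)^2, accumulated backward through the continuations.
gradSinThenSquare :: Double -> Double
gradSinThenSquare x = snd (sinStep x (\y -> squareStep y done))
```

The adjoint of x is built as the continuations unwind, which is exactly the backward sweep of reverse mode.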

This interpretation reveals a deep connection between:

Concept       Interpretation
Forward mode  Tangent propagation
Reverse mode  Continuation propagation
Chain rule    Composition of continuations

Many theoretical treatments of reverse AD are formulated in CPS terms.

Lazy Evaluation

Some functional languages use lazy evaluation.

Expressions are evaluated only when needed.

x = expensiveComputation

does not execute immediately.

This complicates automatic differentiation because the order of evaluation becomes implicit.

A reverse-mode system needs a stable execution order to build adjoint dependencies. Laziness can therefore introduce:

Problem              Effect
Delayed evaluation   Unclear tape structure
Shared thunks        Repeated gradients
Infinite structures  Non-terminating differentiation
Hidden allocations   Memory unpredictability

Practical AD systems often restrict laziness or force evaluation during differentiation.

Typed Differentiation

Strong type systems are one of the most important advantages of functional languages.

The derivative operator itself can be typed.

For example:

D : (a \rightarrow b) \rightarrow (a \rightarrow T(a, b))

where T(a,b) represents tangent structure.

In practice:

grad :: (Vector -> Scalar) -> Vector -> Vector

The type system ensures:

  • Gradients are only computed for differentiable functions
  • Tangent dimensions match primal dimensions
  • Invalid derivative compositions are rejected statically

Advanced systems encode even richer information:

Type feature                Purpose
Linear types                Control gradient accumulation
Effect systems              Track mutation
Shape types                 Verify tensor dimensions
Differentiable constraints  Restrict AD domains

Typed AD becomes increasingly important in large differentiable systems.

Functional Intermediate Representations

Many modern AD systems lower programs into a functional intermediate representation (IR).

The IR is often:

  • Immutable
  • SSA-like
  • Graph-oriented
  • Purely functional

A functional IR simplifies optimization because dependencies are explicit.

For example:

v1 = add x 1
v2 = sin x
v3 = mul v1 v2

This representation behaves simultaneously as:

View     Interpretation
Program  Sequence of computations
Graph    Dependency structure
Algebra  Composition of maps

Modern machine learning compilers frequently use this model internally even if the surface language is imperative.
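Such an IR can be sketched as an ordinary algebraic data type: an immutable list of named bindings whose arguments refer to earlier names, making every dependency explicit. The constructors and the `eval` interpreter below are illustrative; the constant 1 is modeled as an input named "one" for simplicity.

```haskell
data Op = Add | Mul | Sin

-- One SSA-like binding: result name, operation, argument names.
data Binding = Binding String Op [String]

-- The example program: v1 = add x 1; v2 = sin x; v3 = mul v1 v2
program :: [Binding]
program =
  [ Binding "v1" Add ["x", "one"]
  , Binding "v2" Sin ["x"]
  , Binding "v3" Mul ["v1", "v2"]
  ]

-- Interpreting the bindings in order shows the "program" view;
-- the argument-name lookups are the "graph" view.
eval :: [(String, Double)] -> [Binding] -> [(String, Double)]
eval env [] = env
eval env (Binding n op args : rest) =
  let val a = maybe (error ("unbound: " ++ a)) id (lookup a env)
      v = case (op, map val args) of
            (Add, [a, b]) -> a + b
            (Mul, [a, b]) -> a * b
            (Sin, [a])    -> sin a
            _             -> error "arity mismatch"
  in eval (env ++ [(n, v)]) rest
```

Because every binding names its inputs explicitly, a differentiation pass can walk the same list to emit tangent or adjoint bindings.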

Category-Theoretic Interpretation

Functional languages have encouraged categorical interpretations of automatic differentiation.

A differentiable function can be lifted into a morphism carrying derivative structure:

f : X \rightarrow Y

becomes

D(f) : X \times TX \rightarrow Y \times TY

where TX is the tangent space.

The derivative transformation preserves composition:

D(f \circ g) = D(f) \circ D(g)

This makes AD resemble a functor over computational structure.

Functional languages made these abstractions practical because programs were already represented compositionally.

Advantages of Functional Languages for AD

Property                Benefit
Pure functions          Clear dependency structure
Immutability            Stable reverse-mode semantics
Higher-order functions  Compositional derivative operators
Strong typing           Static correctness guarantees
Algebraic structure     Natural chain-rule interpretation
Functional IRs          Easier optimization and transformation

These properties strongly influenced the design of modern differentiable programming systems.

Limitations

Functional languages also face practical difficulties.

Persistent immutable structures can increase allocation pressure. Reverse mode often requires storing intermediate states, producing large memory overhead. Lazy evaluation complicates execution ordering. Pure semantics may conflict with efficient GPU kernels or in-place tensor updates.

As a result, practical systems often introduce controlled impurity:

Mechanism                Purpose
Mutation under the hood  Efficient tensor updates
Linear types             Safe destructive operations
Runtime tracing          Dynamic graph construction
Compiler lowering        Hardware-efficient execution

Modern differentiable systems therefore combine functional semantics with low-level optimization techniques.

Functional Languages and Modern AD

Many ideas from functional programming now appear across modern AD frameworks:

Modern system  Functional influence
JAX            Pure functional transformations
Zygote         Source-to-source differentiation
Swift AD       Typed differentiable functions
Enzyme         Compiler-level transformation
MLIR           Functional graph IR structure

Even frameworks written in imperative languages increasingly adopt functional internal representations because differentiation fundamentally operates on compositional structure.

Functional programming provided one of the clearest conceptual models for automatic differentiation: differentiation as a structure-preserving transformation over pure computations.