# Functional Languages

Functional programming languages provide a natural semantic foundation for automatic differentiation. Programs are expressed as compositions of functions, immutable values, and transformations over data. This aligns closely with the mathematical structure of differentiation.

In a pure functional language, a function behaves like a mathematical mapping:

$$
f : X \rightarrow Y
$$

The absence of mutable state means evaluation can be analyzed as a composition graph. Automatic differentiation can therefore operate directly on the structure of computation.

Functional languages also emphasize higher-order functions, algebraic data types, and typed abstractions. These features allow AD systems to represent differentiation as a compositional program transformation rather than as an external runtime mechanism.

### Pure Functions and Referential Transparency

A pure function always produces the same output for the same input and has no observable side effects.

```haskell
square x = x * x
```

The expression

```haskell
square 3
```

can always be replaced by `9`.

This property is called referential transparency. It allows the compiler and AD system to reason about computations algebraically.

A derivative transformation can therefore be viewed as another pure function:

$$
D(f)(x, \dot{x}) = (f(x), J_f(x)\dot{x})
$$
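
For instance, instantiating this at the `square` function from above gives

$$
D(\mathrm{square})(x, \dot{x}) = (x^2,\; 2x\,\dot{x})
$$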

Forward mode lifts a function into one that propagates tangent information alongside primal values.

### Forward Mode as Function Lifting

Suppose a function is defined as

```haskell
f x = (x + 1) * sin x
```

Forward mode transforms it into:

```haskell
fForward (x, dx) =
  let
    v1  = x + 1
    dv1 = dx

    v2  = sin x
    dv2 = cos x * dx

    v3  = v1 * v2
    dv3 = dv1 * v2 + v1 * dv2
  in
    (v3, dv3)
```

Each operation is lifted from ordinary arithmetic into tangent arithmetic.

This transformation is compositional. If functions `f` and `g` are differentiable:

$$
D(f \circ g) = D(f) \circ D(g)
$$

The derivative transformation itself becomes a structure-preserving mapping over programs.
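
A short sketch makes this concrete (the `Fwd` type and the lifted names here are illustrative, not from a particular library): lifted functions compose with ordinary function composition, and that composition is exactly the chain rule.

```haskell
-- Forward-mode values: (primal, tangent) pairs.
type Fwd a b = (a, a) -> (b, b)

-- Lifted primitives carry their local derivatives.
liftAddOne :: Fwd Double Double
liftAddOne (x, dx) = (x + 1, dx)

liftSin :: Fwd Double Double
liftSin (x, dx) = (sin x, cos x * dx)

-- D(sin . (+1)) = D(sin) . D(+1): plain (.) implements the chain rule.
dSinAddOne :: Fwd Double Double
dSinAddOne = liftSin . liftAddOne
-- dSinAddOne (2, 1)  ==>  (sin 3, cos 3)
```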

### Dual Numbers in Functional Languages

Functional languages often implement forward mode through dual numbers.

A dual number has the form:

$$
a + b\varepsilon
$$

with

$$
\varepsilon^2 = 0
$$
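
Multiplying two dual numbers and discarding the $\varepsilon^2$ term reproduces the product rule directly:

$$
(a + b\varepsilon)(c + d\varepsilon) = ac + (ad + bc)\varepsilon
$$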

In Haskell-like notation:

```haskell
data Dual a = Dual a a
```

where:

| Field | Meaning |
|---|---|
| First component | Primal value |
| Second component | Tangent (derivative) value |

Arithmetic operators are overloaded:

```haskell
instance Num a => Num (Dual a) where
  (Dual x dx) + (Dual y dy) =
    Dual (x + y) (dx + dy)

  (Dual x dx) * (Dual y dy) =
    Dual (x * y) (dx * y + x * dy)

  negate (Dual x dx) = Dual (negate x) (negate dx)
  abs    (Dual x dx) = Dual (abs x) (dx * signum x)
  signum (Dual x _)  = Dual (signum x) 0
  fromInteger n      = Dual (fromInteger n) 0
```

The product rule emerges directly from algebraic structure.

This approach is elegant because differentiation becomes ordinary evaluation over an extended numeric domain.
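
With the instance above, a forward-mode derivative operator is a small wrapper; the `diff` helper below is illustrative rather than a library function. It seeds the tangent with 1 and reads off the second component.

```haskell
-- Illustrative helper: derivative of f at x via a unit tangent seed.
diff :: Num a => (Dual a -> Dual a) -> a -> a
diff f x = tangent (f (Dual x 1))
  where tangent (Dual _ d) = d

-- diff (\x -> x * x + 1) 3  ==>  6
```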

### Higher-Order Functions

Functional languages rely heavily on higher-order functions.

Examples include:

```haskell
map
foldl
filter
(.)
```

An AD system must differentiate through these abstractions.

For example:

```haskell
sumSquares xs =
  foldl (+) 0 (map (\x -> x * x) xs)
```

The differentiation system propagates derivatives through both (sketched after the list):

1. The lambda expression `\x -> x * x`
2. The structural operators `map` and `foldl`
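
For example, the `Dual` numbers from the previous section already flow through `map` and `foldl` unchanged; the `partial` helper below is an illustrative sketch, not a library API.

```haskell
-- i-th partial derivative: seed the i-th tangent with 1, the rest with 0.
partial :: Num a => ([Dual a] -> Dual a) -> [a] -> Int -> a
partial f xs i =
  let seeded = [ Dual x (if j == i then 1 else 0)
               | (j, x) <- zip [0 ..] xs ]
      Dual _ d = f seeded
  in d

-- partial sumSquares [1, 2, 3] 1  ==>  4
```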

This creates a recursive view of differentiation:

| Construct | Differentiation behavior |
|---|---|
| Function composition | Chain rule |
| Mapping | Elementwise derivative propagation |
| Folding | Sequential accumulation |
| Recursion | Repeated derivative propagation |

Functional structure therefore mirrors derivative structure.

### Reverse Mode in Functional Languages

Reverse mode is more subtle.

The derivative must flow backward from outputs to inputs. This requires constructing a reverse dependency graph.

Functional languages often represent reverse mode using continuations, closures, or pullbacks.

Instead of returning only a value, a function returns:

1. The primal result
2. A backward propagation function

Conceptually:

```haskell
f x = (y, backprop)
```

where:

```haskell
backprop dy = dx
```

The backward function consumes an output adjoint and produces input adjoints.
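
A minimal sketch in this style (the `VJP` naming is illustrative): each primitive pairs its primal result with a pullback, and the pullbacks run in reverse order.

```haskell
-- Each primitive returns (primal, pullback from output to input adjoints).
sinVJP :: Double -> (Double, Double -> Double)
sinVJP x = (sin x, \dy -> cos x * dy)

mulVJP :: Double -> Double -> (Double, Double -> (Double, Double))
mulVJP x y = (x * y, \dz -> (dz * y, x * dz))

-- Reverse mode for f x = (x + 1) * sin x: run forward, keep the
-- pullbacks, then feed the output adjoint backward through them.
fVJP :: Double -> (Double, Double -> Double)
fVJP x =
  let v1          = x + 1          -- pullback of (+1) is the identity
      (v2, back2) = sinVJP x
      (v3, back3) = mulVJP v1 v2
  in ( v3
     , \dv3 -> let (dv1, dv2) = back3 dv3
               in dv1 + back2 dv2 )  -- adjoints reaching x are summed
```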

This style appears in modern differentiable functional systems such as:

| System | Technique |
|---|---|
| Haskell AD libraries | Continuation-passing or tapes |
| JAX | Tracing + pullbacks |
| Zygote | Source transformation |
| Swift AD | Compiler-generated adjoints |

### Continuation-Passing Style

Continuation-passing style (CPS) provides a useful interpretation of reverse mode.

A continuation represents "the rest of the computation."

Instead of computing directly:

```haskell
f x = y
```

the program becomes:

```haskell
f x k = k y
```

where `k` is the continuation.

Reverse mode can then accumulate gradients by composing continuations backward.
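
A hedged sketch of this idea (the naming convention is illustrative): let the continuation return the derivative of the rest of the computation with respect to its argument, so each primitive only multiplies in its local derivative.

```haskell
-- k y returns d(rest)/dy; each primitive scales it by its own derivative.
addOneC :: Double -> (Double -> Double) -> Double
addOneC x k = k (x + 1)             -- local derivative of (+1) is 1

squareC :: Double -> (Double -> Double) -> Double
squareC x k = k (x * x) * (2 * x)   -- chain rule: scale by d(x^2)/dx

-- Gradient of (x + 1)^2: the outermost continuation seeds the adjoint 1.
gradExample :: Double -> Double
gradExample x = addOneC x (\y -> squareC y (\_ -> 1))
-- gradExample 3  ==>  8
```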

This interpretation reveals a deep connection between:

| Concept | Interpretation |
|---|---|
| Forward mode | Tangent propagation |
| Reverse mode | Continuation propagation |
| Chain rule | Composition of continuations |

Many theoretical treatments of reverse AD are formulated in CPS terms.

### Lazy Evaluation

Some functional languages use lazy evaluation.

Expressions are evaluated only when needed.

```haskell
x = expensiveComputation
```

does not execute immediately.

This complicates automatic differentiation because the order of evaluation becomes implicit.

A reverse-mode system needs a stable execution order to build adjoint dependencies. Laziness can therefore introduce:

| Problem | Effect |
|---|---|
| Delayed evaluation | Unclear tape structure |
| Shared thunks | Repeated gradients |
| Infinite structures | Non-terminating differentiation |
| Hidden allocations | Memory unpredictability |

Practical AD systems often restrict laziness or force evaluation during differentiation.
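
One common mitigation can be sketched with strict constructor fields (standard Haskell; the type is illustrative): both components are forced as each value is built, so no thunk chains form and the evaluation order stays explicit.

```haskell
-- Strict fields (!) evaluate primal and tangent eagerly on construction.
data StrictDual = StrictDual !Double !Double

scaleS :: Double -> StrictDual -> StrictDual
scaleS c (StrictDual x dx) = StrictDual (c * x) (c * dx)
```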

### Typed Differentiation

Strong type systems are one of the most important advantages of functional languages.

The derivative operator itself can be typed.

For example:

$$
D : (a \rightarrow b) \rightarrow (a \rightarrow T(a,b))
$$

where `T(a,b)` represents tangent structure.

In practice:

```haskell
grad :: (Vector -> Scalar) -> Vector -> Vector
```

The type system ensures:

- Gradients are only computed for differentiable functions
- Tangent dimensions match primal dimensions
- Invalid derivative compositions are rejected statically

Advanced systems encode even richer information:

| Type feature | Purpose |
|---|---|
| Linear types | Control gradient accumulation |
| Effect systems | Track mutation |
| Shape types | Verify tensor dimensions (sketched below) |
| Differentiable constraints | Restrict AD domains |
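
As a sketch of the shape-typing idea (the `Vec` type and `grad` signature below are hypothetical, not a real library API), a length index shared between input and output makes the dimension guarantee static:

```haskell
{-# LANGUAGE DataKinds, KindSignatures #-}

import GHC.TypeLits (Nat)

-- The index n ties the gradient's dimension to the input's.
newtype Vec (n :: Nat) = Vec [Double]

grad :: (Vec n -> Double) -> Vec n -> Vec n
grad = error "omitted: only the signature is the point here"
```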

Typed AD becomes increasingly important in large differentiable systems.

### Functional Intermediate Representations

Many modern AD systems lower programs into a functional intermediate representation (IR).

The IR is often:

- Immutable
- SSA-like
- Graph-oriented
- Purely functional

A functional IR simplifies optimization because dependencies are explicit.

For example:

```text
v1 = add x 1
v2 = sin x
v3 = mul v1 v2
```
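
Reverse-mode lowering can emit adjoint statements against the same IR. A sketch for the fragment above, seeding the output adjoint `dv3` with 1:

```text
dv3 = 1
dv1 = mul dv3 v2
dv2 = mul v1 dv3
c1  = cos x
t1  = mul dv2 c1
dx  = add dv1 t1
```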

This representation behaves simultaneously as:

| View | Interpretation |
|---|---|
| Program | Sequence of computations |
| Graph | Dependency structure |
| Algebra | Composition of maps |

Modern machine learning compilers frequently use this model internally even if the surface language is imperative.

### Category-Theoretic Interpretation

Functional languages encouraged categorical interpretations of automatic differentiation.

A differentiable function can be lifted into a morphism carrying derivative structure:

$$
f : X \rightarrow Y
$$

becomes

$$
D(f) : X \times TX \rightarrow Y \times TY
$$

where `TX` is the tangent space.

The derivative transformation preserves composition:

$$
D(f \circ g) = D(f) \circ D(g)
$$

This makes AD resemble a functor over computational structure.

Functional languages made these abstractions practical because programs were already represented compositionally.

### Advantages of Functional Languages for AD

| Property | Benefit |
|---|---|
| Pure functions | Clear dependency structure |
| Immutability | Stable reverse-mode semantics |
| Higher-order functions | Compositional derivative operators |
| Strong typing | Static correctness guarantees |
| Algebraic structure | Natural chain-rule interpretation |
| Functional IRs | Easier optimization and transformation |

These properties strongly influenced the design of modern differentiable programming systems.

### Limitations

Functional languages also face practical difficulties.

Persistent immutable structures can increase allocation pressure. Reverse mode often requires storing intermediate states, producing large memory overhead. Lazy evaluation complicates execution ordering. Pure semantics may conflict with efficient GPU kernels or in-place tensor updates.

As a result, practical systems often introduce controlled impurity:

| Mechanism | Purpose |
|---|---|
| Mutation under the hood | Efficient tensor updates |
| Linear types | Safe destructive operations |
| Runtime tracing | Dynamic graph construction |
| Compiler lowering | Hardware-efficient execution |

Modern differentiable systems therefore combine functional semantics with low-level optimization techniques.

### Functional Languages and Modern AD

Many ideas from functional programming now appear across modern AD frameworks:

| Modern system | Functional influence |
|---|---|
| JAX | Pure functional transformations |
| Zygote | Source-to-source differentiation |
| Swift AD | Typed differentiable functions |
| Enzyme | Compiler-level transformation |
| MLIR | Functional graph IR structure |

Even frameworks written in imperative languages increasingly adopt functional internal representations because differentiation fundamentally operates on compositional structure.

Functional programming provided one of the clearest conceptual models for automatic differentiation: differentiation as a structure-preserving transformation over pure computations.

