# Chapter 11. Compiler and Runtime Design

## Source Transformation

Source transformation is an implementation strategy for automatic differentiation in which a program that computes a function is rewritten into another program that computes derivatives of that function.

The input is source code, or an intermediate representation close to it. The output is new code with the derivative computation built in.

For example, a function

```text
y = f(x)
```

may be transformed into a function

```text
(y, dy) = df(x, dx)
```

for forward mode, or into a pair of functions

```text
y = f(x)
dx = f_pullback(dy)
```

for reverse mode, where `f_pullback` closes over intermediate values saved while computing `y`.

The key idea is simple: automatic differentiation can be implemented as compilation.

Instead of interpreting every operation at runtime, the AD system analyzes the program structure, applies differentiation rules, and emits a new program. That generated program can then be optimized by the normal compiler.

### Source Transformation vs Operator Overloading

Operator overloading implements AD by redefining arithmetic operations on special numeric types. Source transformation instead rewrites the program itself.
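For contrast, operator overloading can be sketched with a minimal dual-number class; `Dual` and `dual_sin` are illustrative names here, not a real library API:

```python
import math

class Dual:
    """A dual number: a primal value plus a tangent, as used by operator overloading."""
    def __init__(self, val, tan=0.0):
        self.val = val
        self.tan = tan

    def __mul__(self, other):
        if not isinstance(other, Dual):
            other = Dual(other)
        # product rule, applied at runtime on every multiplication
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

def dual_sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.tan)

def f(x):
    return dual_sin(x * x)

out = f(Dual(2.0, 1.0))   # seeding the tangent with 1.0 yields df/dx at x = 2
```

Every arithmetic operation dispatches through `Dual` at runtime; source transformation removes that dispatch by emitting the derivative statements directly.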

Consider a simple function:

```python
def f(x):
    a = x * x
    b = sin(a)
    return b
```

A forward-mode source transformation may generate code like this:

```python
def df(x, dx):
    a = x * x
    da = dx * x + x * dx

    b = sin(a)
    db = cos(a) * da

    return b, db
```

A reverse-mode source transformation may generate code like this:

```python
def f_with_pullback(x):
    a = x * x
    b = sin(a)

    def pullback(db):
        da = cos(a) * db
        dx = da * x + x * da
        return dx

    return b, pullback
```

The transformed code exposes derivative computation directly. There is no need for runtime tracing if the whole program is available ahead of time.
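Both generated functions above run as ordinary Python once `sin` and `cos` are imported; a quick consistency check against the analytic derivative d/dx sin(x²) = 2x·cos(x²):

```python
from math import sin, cos

def df(x, dx):                 # forward-mode transform of f
    a = x * x
    da = dx * x + x * dx
    b = sin(a)
    db = cos(a) * da
    return b, db

def f_with_pullback(x):        # reverse-mode transform of f
    a = x * x
    b = sin(a)
    def pullback(db):
        da = cos(a) * db
        dx = da * x + x * da
        return dx
    return b, pullback

x = 1.5
analytic = 2 * x * cos(x * x)
_, forward_grad = df(x, 1.0)
_, pb = f_with_pullback(x)
assert abs(forward_grad - analytic) < 1e-12
assert abs(pb(1.0) - analytic) < 1e-12
```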

### Forward Source Transformation

Forward transformation propagates primal values and tangent values together.

For every variable `v`, the transformed program carries two values:

```text
v   primal value
dv  tangent value
```

Each primitive operation receives a local differentiation rule.

| Original operation | Transformed operation |
|---|---|
| `z = x + y` | `z = x + y`, `dz = dx + dy` |
| `z = x * y` | `z = x * y`, `dz = dx * y + x * dy` |
| `z = sin(x)` | `z = sin(x)`, `dz = cos(x) * dx` |
| `z = exp(x)` | `z = exp(x)`, `dz = exp(x) * dx` |

This produces code that computes a Jacobian-vector product:

```text
df(x, dx) = (f(x), J_f(x) dx)
```

Forward source transformation is structurally simple. The transformed code has the same control flow as the original code. Branches, loops, and function calls are copied, with tangent variables threaded through them.
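For example, a branch in the primal program is simply copied, with tangent statements added on each path. A hand-written sketch of what a tool might emit:

```python
from math import sin, cos

# Original: y = sin(x) if x >= 0 else x * x
def df_branch(x, dx):
    if x >= 0:                 # same control flow as the primal program
        y = sin(x)
        dy = cos(x) * dx
    else:
        y = x * x
        dy = dx * x + x * dx
    return y, dy
```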

### Reverse Source Transformation

Reverse transformation is more complex because information flows backward.

The original program computes values from inputs to outputs. Reverse mode first runs the primal computation, then propagates adjoints from outputs back to inputs.

For every variable `v`, reverse mode introduces an adjoint:

```text
bar_v = ∂L / ∂v
```

where `L` is the scalar output or scalar objective being differentiated.

A simple program:

```text
a = x * x
b = sin(a)
return b
```

has a reverse pass:

```text
bar_b = 1
bar_a += cos(a) * bar_b
bar_x += x * bar_a
bar_x += x * bar_a
```

`bar_x` receives two identical contributions because `x` appears twice in `a = x * x`; together they give `bar_x = 2 * x * bar_a`.

In full, the transformation produces two passes: a forward pass that computes and saves the values needed later, and a reverse pass that walks backward through the operations.

```text
forward:
    a = x * x
    b = sin(a)

reverse:
    bar_b = seed
    bar_a = cos(a) * bar_b
    bar_x = 2 * x * bar_a
```

Reverse source transformation computes a vector-Jacobian product:

```text
v^T J_f(x)
```

or, for a scalar output:

```text
∇f(x)
```

This is why reverse mode is central to machine learning. A neural network may have millions or billions of parameters but a single scalar loss. Reverse mode computes all parameter gradients with cost comparable to a small constant multiple of the original forward computation.

### Why Source Transformation Is Attractive

Source transformation gives the AD system access to program structure before execution. This creates several advantages.

The generated derivative code can be optimized by ordinary compiler passes. Dead derivative computations can be removed. Common subexpressions can be reused. Loops can be optimized. Memory allocation can be planned. Derivative code can be fused with primal code.

Source transformation also avoids some runtime overheads of operator overloading. There is no need to allocate a dual-number object for every scalar operation. There is no need to dispatch every arithmetic operation through overloaded methods. The differentiated program can look like ordinary numeric code.

This matters for high-performance scientific computing, machine learning compilers, GPU kernels, and large numerical simulation programs.

### The Basic Transformation Rules

A source transformation system needs rules for each syntactic form in the language.

For assignment:

```text
v = expr
```

the transformed program computes the primal value and derivative value.

For sequencing:

```text
stmt1
stmt2
```

the transformation applies to each statement in order for forward mode, and in reverse order for reverse mode.

For conditionals:

```text
if cond:
    y = f(x)
else:
    y = g(x)
```

the transformed program differentiates the branch that actually executes.

Forward mode:

```text
if cond:
    y, dy = df(x, dx)
else:
    y, dy = dg(x, dx)
```

Reverse mode must record which branch executed so that the reverse pass follows the same path.
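A sketch of what that recording can look like: the transformed function captures which branch ran, and the pullback replays the same path (illustrative code, not the output of a real tool):

```python
from math import sin, cos

# Original: y = sin(x) if x >= 0 else x * x
def f_branch_with_pullback(x):
    took_first = x >= 0        # record the branch taken in the forward pass
    if took_first:
        y = sin(x)
    else:
        y = x * x
    def pullback(dy):
        if took_first:         # the reverse pass follows the recorded path
            return cos(x) * dy
        return 2 * x * dy
    return y, pullback
```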

For loops:

```text
for i in range(n):
    x = body(x)
```

forward mode differentiates the loop body at each iteration. Reverse mode must replay the loop in reverse order, often requiring saved intermediate values from each iteration.

This creates the main systems problem in reverse mode: values needed later by the backward pass must be retained, recomputed, or checkpointed.
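A minimal sketch of loop reversal with a tape, saving each iteration's input for the backward sweep (the `sin(x * x)` body is just an illustrative choice):

```python
from math import sin, cos

def loop_with_pullback(x, n):
    tape = []                       # inputs of each iteration, saved for reverse
    for _ in range(n):
        tape.append(x)
        x = sin(x * x)
    def pullback(dy):
        for xi in reversed(tape):   # walk the iterations backward
            dy = cos(xi * xi) * 2 * xi * dy
        return dy
    return x, pullback
```

The tape grows linearly with the iteration count, which is exactly the memory pressure that checkpointing strategies later trade against recomputation.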

### Source Transformation and SSA

Source transformation is easier when the program has been lowered into static single assignment form.

In SSA, each variable is assigned exactly once:

```text
x1 = input
x2 = x1 * x1
x3 = sin(x2)
return x3
```

This form makes data dependencies explicit. It removes ambiguity caused by reassignment.

Reverse mode over SSA is especially clean. Each primal variable gets an adjoint variable:

```text
bar_x3 = seed
bar_x2 += cos(x2) * bar_x3
bar_x1 += 2 * x1 * bar_x2
```

SSA also handles branches through phi nodes, which merge values from different control-flow paths.

```text
if cond:
    x1 = a
else:
    x2 = b
x3 = phi(x1, x2)
```

The derivative transformation must propagate adjoints through the same control-flow merge, respecting which path produced the value.

### Activity Analysis

Not every variable affects the output. Not every variable depends on differentiable inputs.

Source transformation systems often perform activity analysis. A variable is active if it depends on differentiable inputs and can influence differentiable outputs.

Inactive code can be left unchanged. This avoids unnecessary derivative work.

For example:

```python
def f(x, name):
    y = x * x
    print(name)
    return y
```

The variable `name` is inactive. The print operation also has no derivative contribution. The transformed derivative program does not need a tangent or adjoint for `name`.

Activity analysis becomes important in large programs where only part of the computation participates in differentiation.
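The two dataflow passes involved can be sketched over a toy dependency map (the variable names are hypothetical); a variable is active when it is both *varied* (reachable from differentiable inputs) and *useful* (able to reach differentiable outputs):

```python
# Toy dependency map: variable -> variables it reads.
deps = {"y": ["x", "x"], "msg": ["name"], "out": ["y"]}

def active_vars(deps, diff_inputs, diff_outputs):
    # Forward fixpoint: which variables depend on differentiable inputs.
    varied = set(diff_inputs)
    changed = True
    while changed:
        changed = False
        for v, reads in deps.items():
            if v not in varied and any(r in varied for r in reads):
                varied.add(v)
                changed = True
    # Backward fixpoint: which variables can influence differentiable outputs.
    useful = set(diff_outputs)
    changed = True
    while changed:
        changed = False
        for v, reads in deps.items():
            if v in useful:
                for r in reads:
                    if r not in useful:
                        useful.add(r)
                        changed = True
    return varied & useful
```

For the `f(x, name)` example above, `active_vars(deps, {"x"}, {"out"})` excludes `name` and `msg`, so no tangents or adjoints are generated for them.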

### Handling Mutation

Mutation complicates source transformation.

A simple mathematical expression has clear data dependencies. A program with mutable arrays, references, aliasing, and in-place updates has hidden dependencies.

Example:

```text
a[i] = a[i] + x
```

The derivative transformation must know whether other variables alias `a`, whether `i` changes, and whether old values of `a[i]` are needed during the reverse pass.

Reverse mode is especially sensitive to mutation because the backward pass often needs primal values from before mutation.
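One transformed form saves the overwritten value before the update. A sketch using `a[i] = a[i] * x` so that the saved value actually appears in the derivative:

```python
def update_with_pullback(a, i, x):
    old = a[i]                  # save the value the reverse pass will need
    a[i] = old * x              # in-place update
    def pullback(bar_a):
        bar_x = old * bar_a[i]  # d(a[i]_new)/dx is the old a[i]
        bar_a[i] = x * bar_a[i] # d(a[i]_new)/d(a[i]_old) is x
        return bar_a, bar_x
    return a, pullback
```

Deciding *when* such saves are necessary is precisely what alias and effect analysis must answer.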

A compiler-level AD system therefore needs alias analysis, effect tracking, or a restricted programming model.

Many modern AD systems avoid unrestricted mutation in differentiated code. Others lower programs into functional or SSA-like intermediate representations before differentiation.

### Generated Code Quality

The quality of source-transformed AD depends heavily on the compiler pipeline.

Naive transformation can generate bloated code:

```text
dz = dx * y + x * dy
```

may be repeated many times, intermediate values may be stored unnecessarily, and reverse passes may save more primal values than needed.

A good source transformation pipeline applies optimization after differentiation:

| Optimization | Purpose |
|---|---|
| Dead code elimination | Remove unused derivative computations |
| Common subexpression elimination | Reuse repeated primal or derivative expressions |
| Inlining | Expose derivative rules across function boundaries |
| Loop optimization | Improve derivative loops |
| Memory planning | Reduce tape and adjoint storage |
| Fusion | Combine primal and derivative kernels |
| Checkpointing | Trade recomputation for memory |

The derivative compiler should not merely produce correct code. It should produce code that the backend compiler can optimize effectively.
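Of these, checkpointing is the most AD-specific. A minimal sketch for a loop: store every k-th state during the forward pass and recompute the rest during the reverse pass (the `step` function is a hypothetical loop body):

```python
from math import sin, cos

def step(x):
    return sin(x * x)

def dstep(x, dy):                   # local reverse rule for one step
    return cos(x * x) * 2 * x * dy

def grad_checkpointed(x0, n, k):
    ckpts, x = {0: x0}, x0
    for i in range(n):              # forward pass: keep only every k-th state
        x = step(x)
        if (i + 1) % k == 0:
            ckpts[i + 1] = x
    dy = 1.0
    for i in reversed(range(n)):    # reverse pass
        base = (i // k) * k
        x = ckpts[base]
        for _ in range(i - base):   # recompute this segment's states
            x = step(x)
        dy = dstep(x, dy)           # x is now the input of iteration i
    return dy
```

Storage drops from O(n) saved states to O(n/k), at the cost of roughly one extra forward sweep of recomputation.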

### Advantages

Source transformation has several practical strengths.

It can generate efficient derivative code ahead of time. It works well with optimizing compilers. It can target low-level backends such as LLVM, MLIR, CUDA, or specialized accelerator IRs. It can analyze whole programs rather than isolated operations.

It also fits naturally into compiler infrastructure. AD becomes another compiler pass, like inlining, vectorization, or lowering.

This makes source transformation suitable for systems where performance, static analysis, and deployment matter.

### Limitations

Source transformation also has costs.

It requires access to program representation. It must understand the syntax and semantics of the language. It must handle control flow, mutation, aliasing, function calls, closures, exceptions, dynamic dispatch, and foreign calls.

For dynamic languages, source transformation can be difficult because the meaning of a program may depend on runtime values. Many systems therefore trace execution first, then transform the trace rather than the original source.

Reverse transformation also introduces memory pressure. The transformed program must retain or recover values needed by the backward pass. Without careful checkpointing, reverse mode can consume too much memory.

### Source Transformation as Compiler AD

The compiler view of AD can be summarized as follows:

```text
original program
    -> parsing or tracing
    -> intermediate representation
    -> activity analysis
    -> derivative transformation
    -> optimization
    -> code generation
```

The central design choice is the level at which differentiation occurs.

Differentiating high-level source code preserves semantic structure such as loops, arrays, and function calls. Differentiating low-level IR gives better control over memory and backend optimization. Differentiating graph IR gives a clean tensor-oriented model but may lose general programming features.

A mature AD compiler often uses several levels. It may transform high-level functions, lower them into SSA, optimize derivative code, and finally emit machine code or accelerator kernels.

Source transformation treats differentiation as a static program transformation problem. This view connects AD to compiler theory, type systems, intermediate representations, and program optimization. It is the foundation for many high-performance AD systems.

