Chapter 11. Compiler and Runtime Design

Source Transformation

Source transformation is an implementation strategy for automatic differentiation in which a program that computes a function is rewritten into another program that computes derivatives of that function.

The input is source code, or an intermediate representation close to it. The output is new code with the derivative computation built in.

For example, a function

y = f(x)

may be transformed into a function

(y, dy) = df(x, dx)

for forward mode, or into a pair of functions

y = f(x)
dx = f_pullback(dy)

for reverse mode.

The key idea is simple: automatic differentiation can be implemented as compilation.

Instead of interpreting every operation at runtime, the AD system analyzes the program structure, applies differentiation rules, and emits a new program. That generated program can then be optimized by the normal compiler.

Source Transformation vs Operator Overloading

Operator overloading implements AD by redefining arithmetic operations on special numeric types. Source transformation instead rewrites the program itself.

Consider a simple function:

def f(x):
    a = x * x
    b = sin(a)
    return b

A forward-mode source transformation may generate code like this:

def df(x, dx):
    a = x * x
    da = dx * x + x * dx

    b = sin(a)
    db = cos(a) * da

    return b, db

A reverse-mode source transformation may generate code like this:

def f_with_pullback(x):
    a = x * x
    b = sin(a)

    def pullback(db):
        da = cos(a) * db
        dx = da * x + x * da
        return dx

    return b, pullback

The transformed code exposes derivative computation directly. There is no need for runtime tracing if the whole program is available ahead of time.
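To check the two transformations against each other, here is a runnable version of both generated functions. This is a hand-written sketch using math.sin and math.cos, not output from any actual AD tool:

```python
import math

# Runnable forward-mode transform of f(x) = sin(x * x).
def df(x, dx):
    a = x * x
    da = dx * x + x * dx        # product rule, both operands are x
    b = math.sin(a)
    db = math.cos(a) * da       # chain rule through sin
    return b, db

# Runnable reverse-mode transform of the same function.
def f_with_pullback(x):
    a = x * x
    b = math.sin(a)

    def pullback(db):
        da = math.cos(a) * db
        dx = da * x + x * da    # adjoint accumulates once per use of x
        return dx

    return b, pullback

x = 1.3
_, dy = df(x, 1.0)
_, pb = f_with_pullback(x)

# Both modes agree with the analytic derivative cos(x^2) * 2x.
assert abs(dy - pb(1.0)) < 1e-12
assert abs(dy - math.cos(x * x) * 2 * x) < 1e-12
```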

Forward Source Transformation

Forward transformation propagates primal values and tangent values together.

For every variable v, the transformed program carries two values:

v   primal value
dv  tangent value

Each primitive operation receives a local differentiation rule.

Original operation    Transformed operation
z = x + y             z = x + y, dz = dx + dy
z = x * y             z = x * y, dz = dx * y + x * dy
z = sin(x)            z = sin(x), dz = cos(x) * dx
z = exp(x)            z = exp(x), dz = exp(x) * dx

This produces code that computes a Jacobian-vector product:

df(x, dx) = (f(x), J_f(x) dx)

Forward source transformation is structurally simple. The transformed code has the same control flow as the original code. Branches, loops, and function calls are copied, with tangent variables threaded through them.
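The rule table above can be turned into a toy code emitter. The sketch below represents a straight-line program as (output, op, inputs) tuples — an illustrative format, not any real system's IR — and emits Python source with a tangent statement placed after each primal statement:

```python
import math

def d(v):
    return "d" + v  # tangent variable name for v

# Toy forward-mode source transformer over straight-line code.
def transform_forward(program):
    lines = []
    for out, op, ins in program:
        if op == "add":
            a, b = ins
            lines.append(f"{out} = {a} + {b}")
            lines.append(f"{d(out)} = {d(a)} + {d(b)}")
        elif op == "mul":
            a, b = ins
            lines.append(f"{out} = {a} * {b}")
            lines.append(f"{d(out)} = {d(a)} * {b} + {a} * {d(b)}")
        elif op == "sin":
            (a,) = ins
            lines.append(f"{out} = math.sin({a})")
            lines.append(f"{d(out)} = math.cos({a}) * {d(a)}")
    return "\n".join(lines)

# f(x) = sin(x * x)
prog = [("a", "mul", ("x", "x")), ("b", "sin", ("a",))]
src = transform_forward(prog)

# Run the generated source: supply the input x and its seed tangent dx.
env = {"math": math, "x": 1.3, "dx": 1.0}
exec(src, env)
assert abs(env["db"] - math.cos(1.3 ** 2) * 2 * 1.3) < 1e-12
```

The generated text is ordinary Python with no special numeric types, which is exactly what distinguishes this approach from operator overloading.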

Reverse Source Transformation

Reverse transformation is more complex because information flows backward.

The original program computes values from inputs to outputs. Reverse mode first runs the primal computation, then propagates adjoints from outputs back to inputs.

For every variable v, reverse mode introduces an adjoint:

bar_v = ∂L / ∂v

where L is the scalar output or scalar objective being differentiated.

A simple program:

a = x * x
b = sin(a)
return b

has a reverse pass:

bar_b = 1
bar_a += cos(a) * bar_b
bar_x += x * bar_a
bar_x += x * bar_a

The adjoint of x accumulates once for each use of x in a = x * x; the two contributions sum to 2 * x * bar_a.

The first pass computes and saves values needed by the reverse pass. The second pass walks backward through the operations.

forward:
    a = x * x
    b = sin(a)

reverse:
    bar_b = seed
    bar_a = cos(a) * bar_b
    bar_x = 2 * x * bar_a

Reverse source transformation computes a vector-Jacobian product:

v^T J_f(x)

or, for a scalar output:

∇f(x)

This is why reverse mode is central to machine learning. A neural network may have millions or billions of parameters but a single scalar loss. Reverse mode computes all parameter gradients with cost comparable to a small constant multiple of the original forward computation.
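The reverse transformation can be sketched in the same toy (output, op, inputs) format used for forward mode: emit the primal statements in order, then adjoint statements in reverse order, accumulating with +=. This is an illustration of the mechanics, not a production transformer:

```python
import math

def bar(v):
    return "bar_" + v  # adjoint variable name for v

# Toy reverse-mode source transformer over straight-line code.
def transform_reverse(program, inputs):
    fwd, rev = [], []
    for out, op, ins in program:
        if op == "mul":
            a, b = ins
            fwd.append(f"{out} = {a} * {b}")
            rev.append(f"{bar(a)} += {b} * {bar(out)}")
            rev.append(f"{bar(b)} += {a} * {bar(out)}")
        elif op == "sin":
            (a,) = ins
            fwd.append(f"{out} = math.sin({a})")
            rev.append(f"{bar(a)} += math.cos({a}) * {bar(out)}")
    zeros = [f"{bar(v)} = 0.0" for v in inputs]
    zeros += [f"{bar(out)} = 0.0" for out, _, _ in program[:-1]]
    seed = [f"{bar(program[-1][0])} = 1.0"]
    # primal pass, adjoint initialization, then adjoints in reverse order
    return "\n".join(fwd + zeros + seed + list(reversed(rev)))

prog = [("a", "mul", ("x", "x")), ("b", "sin", ("a",))]
src = transform_reverse(prog, ["x"])

env = {"math": math, "x": 1.3}
exec(src, env)
# bar_x accumulates twice (once per use of x), giving 2 * x * cos(x * x)
assert abs(env["bar_x"] - 2 * 1.3 * math.cos(1.3 ** 2)) < 1e-12
```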

Why Source Transformation Is Attractive

Source transformation gives the AD system access to program structure before execution. This creates several advantages.

The generated derivative code can be optimized by ordinary compiler passes. Dead derivative computations can be removed. Common subexpressions can be reused. Loops can be optimized. Memory allocation can be planned. Derivative code can be fused with primal code.

Source transformation also avoids some runtime overheads of operator overloading. There is no need to allocate a dual-number object for every scalar operation. There is no need to dispatch every arithmetic operation through overloaded methods. The differentiated program can look like ordinary numeric code.

This matters for high-performance scientific computing, machine learning compilers, GPU kernels, and large numerical simulation programs.

The Basic Transformation Rules

A source transformation system needs rules for each syntactic form in the language.

For assignment:

v = expr

the transformed program computes the primal value and derivative value.

For sequencing:

stmt1
stmt2

the transformation applies to each statement in order for forward mode, and in reverse order for reverse mode.

For conditionals:

if cond:
    y = f(x)
else:
    y = g(x)

the transformed program differentiates the branch that actually executes.

Forward mode:

if cond:
    y, dy = df(x, dx)
else:
    y, dy = dg(x, dx)

Reverse mode must record which branch executed so that the reverse pass follows the same path.
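A minimal sketch of this branch recording, with math.sin and x * x standing in for f and g (illustrative placeholders, not from the original example):

```python
import math

# Reverse mode through a conditional: record which branch executed
# so the pullback replays the same path.
def f_with_pullback_branch(x):
    took_first = x > 0.0          # record the branch decision
    if took_first:
        y = math.sin(x)           # stands in for f(x)
    else:
        y = x * x                 # stands in for g(x)

    def pullback(dy):
        if took_first:            # replay the recorded path
            return math.cos(x) * dy
        else:
            return 2.0 * x * dy

    return y, pullback

_, pb = f_with_pullback_branch(1.0)
assert abs(pb(1.0) - math.cos(1.0)) < 1e-12

_, pb = f_with_pullback_branch(-2.0)
assert abs(pb(1.0) - (-4.0)) < 1e-12
```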

For loops:

for i in range(n):
    x = body(x)

forward mode differentiates the loop body at each iteration. Reverse mode must replay the loop in reverse order, often requiring saved intermediate values from each iteration.

This creates the main systems problem in reverse mode: values needed later by the backward pass must be retained, recomputed, or checkpointed.
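The simplest strategy is to save every iteration's input on a stack. The sketch below uses body(x) = x * x as a stand-in loop body:

```python
# Reversing a loop by taping each iteration's input.
def loop_with_pullback(x, n):
    saved = []
    for _ in range(n):
        saved.append(x)      # tape: value needed by the reverse sweep
        x = x * x            # x_{k+1} = body(x_k)

    def pullback(dx_out):
        bar = dx_out
        for xk in reversed(saved):   # walk the iterations backward
            bar = 2.0 * xk * bar     # chain through d(body)/dx = 2 * x_k
        return bar

    return x, pullback

y, pb = loop_with_pullback(1.1, 3)
# After 3 squarings y = x^8, so dy/dx = 8 * x^7.
assert abs(pb(1.0) - 8 * 1.1 ** 7) < 1e-9
```

The saved list is exactly the memory cost that checkpointing strategies try to reduce.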

Source Transformation and SSA

Source transformation is easier when the program has been lowered into static single assignment form.

In SSA, each variable is assigned exactly once:

x1 = input
x2 = x1 * x1
x3 = sin(x2)
return x3

This form makes data dependencies explicit. It removes ambiguity caused by reassignment.

Reverse mode over SSA is especially clean. Each primal variable gets an adjoint variable:

bar_x3 = seed
bar_x2 += cos(x2) * bar_x3
bar_x1 += 2 * x1 * bar_x2

SSA also handles branches through phi nodes, which merge values from different control-flow paths.

if cond:
    x1 = a
else:
    x2 = b
x3 = phi(x1, x2)

The derivative transformation must propagate adjoints through the same control-flow merge, respecting which path produced the value.

Activity Analysis

Not every variable affects the output. Not every variable depends on differentiable inputs.

Source transformation systems often perform activity analysis. A variable is active if it depends on differentiable inputs and can influence differentiable outputs.

Inactive code can be left unchanged. This avoids unnecessary derivative work.

For example:

def f(x, name):
    y = x * x
    print(name)
    return y

The variable name is inactive. The print operation also has no derivative contribution. The transformed derivative program does not need a tangent or adjoint for name.

Activity analysis becomes important in large programs where only part of the computation participates in differentiation.
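Activity analysis is essentially two reachability sweeps over the dependency structure. The sketch below runs them over the toy (output, op, inputs) instruction format (an illustrative representation, not a real compiler IR):

```python
# A variable is active if it is reachable from a differentiable input
# (varied) AND can reach a differentiable output (useful).
def activity(program, diff_inputs, diff_outputs):
    varied = set(diff_inputs)
    for out, _, ins in program:                 # forward sweep
        if any(v in varied for v in ins):
            varied.add(out)

    useful = set(diff_outputs)
    for out, _, ins in reversed(program):       # backward sweep
        if out in useful:
            useful.update(ins)

    return varied & useful

prog = [
    ("y", "mul", ("x", "x")),   # depends on the input x: active
    ("s", "add", ("c", "c")),   # depends only on a constant c: inactive
]
assert activity(prog, {"x"}, {"y"}) == {"x", "y"}
```

Only variables in the resulting set receive tangents or adjoints in the transformed program.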

Handling Mutation

Mutation complicates source transformation.

A simple mathematical expression has clear data dependencies. A program with mutable arrays, references, aliasing, and in-place updates has hidden dependencies.

Example:

a[i] = a[i] + x

The derivative transformation must know whether other variables alias a, whether i changes, and whether old values of a[i] are needed during the reverse pass.

Reverse mode is especially sensitive to mutation because the backward pass often needs primal values from before mutation.

A compiler-level AD system therefore needs alias analysis, effect tracking, or a restricted programming model.

Many modern AD systems avoid unrestricted mutation in differentiated code. Others lower programs into functional or SSA-like intermediate representations before differentiation.
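One common tactic is to tape the overwritten value before an in-place update, so the reverse pass can both read the adjoint and restore the pre-mutation state. A minimal sketch under that assumption:

```python
# Differentiating a[i] = a[i] + x by saving the overwritten element.
def update_with_pullback(a, i, x):
    old = a[i]              # tape entry: value before the mutation
    a[i] = a[i] + x

    def pullback(bar_a):
        # d(a[i]_new)/dx = 1, so bar_x picks up the adjoint of a[i]
        bar_x = bar_a[i]
        # restore primal state for earlier operations' reverse steps
        a[i] = old
        return bar_x

    return a, pullback

a = [1.0, 2.0]
a, pb = update_with_pullback(a, 0, 3.0)
assert a[0] == 4.0          # forward pass applied the update

bar_x = pb([1.0, 0.0])
assert bar_x == 1.0         # gradient with respect to x
assert a[0] == 1.0          # pre-mutation value restored
```

Alias analysis is what guarantees no other name still points at the mutated storage; without it, this restore step can be silently wrong.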

Generated Code Quality

The quality of source-transformed AD depends heavily on the compiler pipeline.

Naive transformation can generate bloated code:

dz = dx * y + x * dy

may be repeated many times, intermediate values may be stored unnecessarily, and reverse passes may save more primal values than needed.

A good source transformation pipeline applies optimization after differentiation:

Optimization                        Purpose
Dead code elimination               Remove unused derivative computations
Common subexpression elimination    Reuse repeated primal or derivative expressions
Inlining                            Expose derivative rules across function boundaries
Loop optimization                   Improve derivative loops
Memory planning                     Reduce tape and adjoint storage
Fusion                              Combine primal and derivative kernels
Checkpointing                       Trade recomputation for memory

The derivative compiler should not merely produce correct code. It should produce code that the backend compiler can optimize effectively.

Advantages

Source transformation has several practical strengths.

It can generate efficient derivative code ahead of time. It works well with optimizing compilers. It can target low-level backends such as LLVM, MLIR, CUDA, or specialized accelerator IRs. It can analyze whole programs rather than isolated operations.

It also fits naturally into compiler infrastructure. AD becomes another compiler pass, like inlining, vectorization, or lowering.

This makes source transformation suitable for systems where performance, static analysis, and deployment matter.

Limitations

Source transformation also has costs.

It requires access to program representation. It must understand the syntax and semantics of the language. It must handle control flow, mutation, aliasing, function calls, closures, exceptions, dynamic dispatch, and foreign calls.

For dynamic languages, source transformation can be difficult because the meaning of a program may depend on runtime values. Many systems therefore trace execution first, then transform the trace rather than the original source.

Reverse transformation also introduces memory pressure. The transformed program must retain or recover values needed by the backward pass. Without careful checkpointing, reverse mode can consume too much memory.
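A minimal checkpointing sketch for the loop case: save only every k-th iteration's state, and recompute the intermediate values of a segment during the reverse sweep. The body function here is an arbitrary stand-in:

```python
import math

def body(x):
    return math.sin(x) + x          # stand-in loop body, d(body)/dx = cos(x) + 1

# Checkpoint every k-th iteration instead of taping every iteration.
# Trades O(n) saved values for O(n/k) checkpoints plus recomputation.
def grad_with_checkpoints(x, n, k):
    checkpoints = {}
    for i in range(n):
        if i % k == 0:
            checkpoints[i] = x      # sparse tape
        x = body(x)

    bar = 1.0
    for i in reversed(range(n)):
        # recompute forward from the nearest checkpoint to iteration i
        xc = checkpoints[i - i % k]
        for _ in range(i % k):
            xc = body(xc)
        bar = (math.cos(xc) + 1.0) * bar
    return bar

# Reference: the dense tape that saves every iteration.
def grad_dense(x, n):
    saved = []
    for _ in range(n):
        saved.append(x)
        x = body(x)
    bar = 1.0
    for xk in reversed(saved):
        bar = (math.cos(xk) + 1.0) * bar
    return bar

assert abs(grad_with_checkpoints(0.3, 7, 3) - grad_dense(0.3, 7)) < 1e-12
```

Real systems use more refined schedules (for example, binomial checkpointing), but the trade is the same: less memory, more recomputation.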

Source Transformation as Compiler AD

The compiler view of AD can be summarized as follows:

original program
    -> parsing or tracing
    -> intermediate representation
    -> activity analysis
    -> derivative transformation
    -> optimization
    -> code generation

The central design choice is the level at which differentiation occurs.

Differentiating high-level source code preserves semantic structure such as loops, arrays, and function calls. Differentiating low-level IR gives better control over memory and backend optimization. Differentiating graph IR gives a clean tensor-oriented model but may lose general programming features.

A mature AD compiler often uses several levels. It may transform high-level functions, lower them into SSA, optimize derivative code, and finally emit machine code or accelerator kernels.

Source transformation treats differentiation as a static program transformation problem. This view connects AD to compiler theory, type systems, intermediate representations, and program optimization. It is the foundation for many high-performance AD systems.