Intermediate Representations

An intermediate representation, or IR, is the internal program form used by a compiler or AD system after parsing and before final code generation.

For automatic differentiation, the IR is where the system decides what the program means. It is also where differentiation becomes mechanical.

A source program may contain syntax such as:

from math import sin

def f(x):
    y = x * x
    z = sin(y)
    return z

An AD compiler usually lowers this into a smaller core language:

x0 = input
x1 = mul(x0, x0)
x2 = sin(x1)
return x2

This representation removes surface syntax and exposes the operations that need derivative rules.
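
In code, the lowered form can be held as a flat list of instruction records. The sketch below is a minimal illustration in Python; the names Instr, op, out, and args are assumptions for this article, not taken from any particular system.

from dataclasses import dataclass

@dataclass
class Instr:
    out: str      # name of the value this instruction defines
    op: str       # primitive operation name
    args: tuple   # names of the input values

# the lowered core program for f(x) = sin(x * x)
program = [
    Instr(out="x1", op="mul", args=("x0", "x0")),
    Instr(out="x2", op="sin", args=("x1",)),
]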

Why AD Systems Use IRs

Differentiating raw source code is difficult. Real languages contain many constructs: classes, closures, exceptions, mutation, modules, operator syntax, dynamic dispatch, loops, comprehensions, and foreign calls.

An IR gives the AD system a smaller target.

Instead of writing derivative logic for every source-level construct, the compiler lowers source code into a compact set of primitives. AD rules are then defined for those primitives.
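
A minimal sketch of such a rule table, assuming forward mode: each entry maps primal inputs and input tangents to an output tangent. The name jvp_rules is hypothetical.

import math

# hypothetical registry: primitive name -> JVP rule
# each rule maps (primal inputs, input tangents) -> output tangent
jvp_rules = {
    "add": lambda args, tans: tans[0] + tans[1],
    "mul": lambda args, tans: tans[0] * args[1] + args[0] * tans[1],
    "sin": lambda args, tans: math.cos(args[0]) * tans[0],
}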

A useful AD IR makes these things explicit:

Property | Why it matters
Operation | Selects the local derivative rule
Inputs and outputs | Defines data dependencies
Control flow | Determines which path is differentiated
Type and shape | Enables correct tensor derivative rules
Effects | Tracks mutation, I/O, randomness, and state
Memory lifetime | Controls saved values for reverse mode

The IR is not merely an implementation detail. It determines what programs the AD system can differentiate reliably.

Primal IR and Derivative IR

The original lowered program is the primal IR. It computes ordinary values.

%0 = input x
%1 = mul %0, %0
%2 = sin %1
return %2

Forward-mode transformation augments the IR with tangent values:

%0, d%0 = input x, dx
%1 = mul %0, %0
d%1 = add (mul d%0, %0), (mul %0, d%0)
%2 = sin %1
d%2 = mul (cos %1), d%1
return %2, d%2
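
This transformation can be phrased as an interpreter that carries a tangent alongside every primal value. The sketch below reuses the Instr records and the jvp_rules table from earlier, plus a matching table of primal evaluation rules; everything here is illustrative.

import math

eval_rules = {"mul": lambda x, y: x * y, "sin": math.sin}

def forward_mode(program, x, dx):
    # evaluate the instruction list, carrying a tangent for every value
    vals, tans = {"x0": x}, {"x0": dx}
    for instr in program:
        a = [vals[n] for n in instr.args]
        t = [tans[n] for n in instr.args]
        vals[instr.out] = eval_rules[instr.op](*a)
        tans[instr.out] = jvp_rules[instr.op](a, t)
    return vals[program[-1].out], tans[program[-1].out]

For f(x) = sin(x * x), forward_mode(program, 2.0, 1.0) returns sin(4) together with cos(4) * 4, the value and the derivative at x = 2.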

Reverse-mode transformation usually produces a forward IR plus a reverse IR.

forward:
    %0 = input x
    %1 = mul %0, %0
    %2 = sin %1
    save %0, %1
    return %2

reverse:
    adj%2 = seed
    adj%1 += mul (cos %1), adj%2
    adj%0 += mul %0, adj%1
    adj%0 += mul %0, adj%1
    return adj%0

The forward pass computes values and saves the subset needed by the reverse pass. The reverse pass consumes adjoints and saved primal values. Note that adj%0 accumulates two contributions: %0 appears twice as an operand of mul, and each use sends its own adjoint back.
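
The same structure can be sketched as two loops over the instruction list, reusing eval_rules from the forward-mode sketch. The vjp_rules table is a hypothetical registry mapping each primitive to its per-input adjoint contributions.

import math

# hypothetical registry: (primal inputs, output adjoint) -> input adjoints
vjp_rules = {
    "mul": lambda a, g: (g * a[1], g * a[0]),
    "sin": lambda a, g: (g * math.cos(a[0]),),
}

def reverse_mode(program, x, seed=1.0):
    vals = {"x0": x}
    for instr in program:                 # forward: compute and save values
        vals[instr.out] = eval_rules[instr.op](*[vals[n] for n in instr.args])
    adj = {name: 0.0 for name in vals}
    adj[program[-1].out] = seed
    for instr in reversed(program):       # reverse: propagate adjoints
        args = [vals[n] for n in instr.args]
        grads = vjp_rules[instr.op](args, adj[instr.out])
        for name, g in zip(instr.args, grads):
            adj[name] += g                # x0 used twice -> two contributions
    return adj["x0"]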

IR Granularity

The granularity of an IR controls the unit of differentiation.

A low-level scalar IR may contain operations such as:

add
mul
sin
load
store
branch

A tensor IR may contain operations such as:

matmul
conv2d
softmax
layer_norm
reduce_sum
broadcast
reshape

A high-level scientific IR may contain operations such as:

solve_linear_system
fft
ode_solve
eigendecomposition

Each level has tradeoffs.

IR level | Strength | Cost
Scalar IR | Precise, general, close to machine code | Huge graphs for tensor programs
Tensor IR | Compact, good for ML workloads | Needs many shape and broadcasting rules
High-level IR | Preserves mathematical structure | Needs specialized derivative rules
Low-level memory IR | Enables allocation and layout optimization | Harder to recover mathematical intent

A mature system often uses several IRs. It may differentiate a high-level tensor graph, lower it to loop IR, then optimize memory and code generation at a lower level.

Operation Semantics

Every differentiable operation in the IR needs a semantic contract.

For example, mul(x, y) means ordinary multiplication. Its forward-mode derivative rule is:

d(x * y) = dx * y + x * dy

Reverse mode uses adjoint rules:

bar_x += y * bar_z
bar_y += x * bar_z

For tensors, the contract must include shape behavior.

Example:

z = add(x, y)

If y is broadcast to match x, the reverse rule for y must reduce over the broadcast dimensions.

bar_x += bar_z
bar_y += reduce_sum_to_shape(bar_z, shape(y))

This is why IR design must specify more than operation names. It must specify rank, shape, layout, broadcasting, aliasing, and effects.
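
As a concrete sketch, reduce_sum_to_shape from the rule above could behave as follows in NumPy; the function name comes from the example, but this implementation is an assumption.

import numpy as np

def reduce_sum_to_shape(grad, shape):
    # sum over leading dimensions added by broadcasting
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # sum over dimensions that were broadcast from size 1
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad.reshape(shape)

y = np.zeros(3)                  # broadcast to (4, 3) inside add(x, y)
bar_z = np.ones((4, 3))
assert reduce_sum_to_shape(bar_z, y.shape).shape == (3,)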

Type and Shape Information

AD transformation becomes safer when type and shape information are present in the IR.

A typed IR can distinguish:

f64
tensor<f32, [1024, 768]>
matrix<f64, [n, n]>
sparse_csr<f32>

This lets the AD system select correct rules.

For instance, matrix multiplication:

C = A @ B

has reverse rules:

bar_A += bar_C @ B^T
bar_B += A^T @ bar_C

These rules require compatible matrix dimensions. Shape information can validate them before execution.
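
A quick NumPy check makes the shape constraints concrete; the sizes here are arbitrary.

import numpy as np

A = np.random.randn(4, 5)
B = np.random.randn(5, 3)
bar_C = np.ones((4, 3))          # adjoint seed for C = A @ B

bar_A = bar_C @ B.T              # (4, 3) @ (3, 5) -> (4, 5), matches A
bar_B = A.T @ bar_C              # (5, 4) @ (4, 3) -> (5, 3), matches B
assert bar_A.shape == A.shape and bar_B.shape == B.shape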

Typed IR also helps decide which values are differentiable. Integers, booleans, strings, handles, file descriptors, and control tokens usually do not receive tangents or adjoints.

Control-Flow IR

Differentiating straight-line code is simple. Differentiating control flow requires explicit representation.

A control-flow IR may contain basic blocks:

block entry(x):
    c = gt(x, 0)
    branch c, positive, negative

block positive:
    y = mul(x, x)
    jump exit(y)

block negative:
    y = neg(x)
    jump exit(y)

block exit(y):
    return y

Forward mode transforms each block while preserving the control-flow structure.

Reverse mode needs more work. It must propagate adjoints backward through the executed path. If a branch was taken in the forward pass, the reverse pass must follow the matching reverse branch.
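
One way to make "follow the matching reverse branch" concrete is to record each branch decision on a tape during the forward pass. A minimal sketch for the function above, with the tape as a plain Python list:

def forward_branch(x, tape):
    if x > 0:
        tape.append("positive")  # record which branch executed
        return x * x
    tape.append("negative")
    return -x

def reverse_branch(x, bar_y, tape):
    # replay the recorded decision to pick the matching adjoint rule
    if tape.pop() == "positive":
        return 2 * x * bar_y     # adjoint of y = x * x
    return -bar_y                # adjoint of y = neg(x)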

For loops, reverse mode often needs a loop tape:

forward loop:
    record iteration count
    save needed values per iteration

reverse loop:
    for iterations in reverse order:
        restore saved values
        propagate adjoints
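
A runnable version of this pattern, assuming a loop body x = x * x repeated n times:

def forward_loop(x, n):
    tape = []
    for _ in range(n):
        tape.append(x)           # save the pre-iteration value
        x = x * x                # loop body
    return x, tape

def reverse_loop(tape, bar_x):
    for x in reversed(tape):     # iterations in reverse order
        bar_x = 2 * x * bar_x    # adjoint of x = x * x
    return bar_x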

This is one reason many systems lower dynamic control flow into explicit graph or region constructs before differentiation.

Effect Representation

A pure mathematical function is easy to differentiate. A real program may read memory, mutate arrays, allocate buffers, draw random numbers, print values, or call external libraries.

An AD IR needs a way to represent effects.

Common effects include:

Effect | AD concern
Mutation | Reverse pass may need old values
Aliasing | Multiple names may refer to the same storage
Randomness | Needs reproducibility or reparameterization
I/O | Usually nondifferentiable
Exceptions | May interrupt derivative control flow
Foreign calls | Need custom derivative rules
Parallel writes | Need deterministic adjoint accumulation

Some systems avoid effects inside differentiable regions. Others track effects explicitly with tokens, memory SSA, or functional updates.

A simple array update:

a[i] = v

may be represented functionally:

a2 = scatter(a1, i, v)

This form is easier to differentiate because the old and new array values have distinct names.
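
A NumPy sketch of the functional form and one plausible reverse rule; scatter_vjp is a hypothetical name.

import numpy as np

def scatter(a, i, v):
    a2 = a.copy()                # old and new arrays keep distinct names
    a2[i] = v
    return a2

def scatter_vjp(bar_a2, i):
    bar_a = bar_a2.copy()
    bar_a[i] = 0.0               # slot i of the old array was overwritten
    bar_v = bar_a2[i]            # v receives the adjoint at slot i
    return bar_a, bar_v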

Custom Primitives

No IR can include every operation as a built-in primitive.

A practical AD system supports custom primitives. A custom primitive is an operation with user-provided derivative rules.

For example:

primitive fft(x)
jvp rule: ...
vjp rule: ...

This lets the system treat fft as a single operation instead of expanding it into lower-level code.
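
A registration interface might look like the sketch below, shown here for a linear solve, whose implicit adjoint rule is standard. The names register_primitive and custom_primitives are hypothetical.

import numpy as np

custom_primitives = {}

def register_primitive(name, fn, vjp):
    # the AD system will trust these rules as-is
    custom_primitives[name] = {"fn": fn, "vjp": vjp}

def solve_vjp(A, b, x, bar_x):
    # implicit rule for x = solve(A, b):
    #   bar_b = solve(A^T, bar_x),  bar_A = -bar_b x^T
    bar_b = np.linalg.solve(A.T, bar_x)
    return -np.outer(bar_b, x), bar_b

register_primitive("solve", np.linalg.solve, solve_vjp)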

Custom primitives are essential for:

linear solvers
ODE solvers
random samplers
sorting-like operations
GPU kernels
database operators
external numerical libraries

The primitive boundary is a contract. The AD system trusts the supplied derivative rule. Incorrect custom rules produce incorrect gradients even when the compiler transformation itself is correct.

Optimization on IR

After differentiation, the generated IR may contain redundant computation.

Example:

x1 = mul x, x
dx1 = add (mul dx, x), (mul x, dx)

An optimizer can simplify:

dx1 = mul 2, (mul x, dx)

Typical IR optimizations include:

Optimization | Effect
Constant folding | Computes static expressions early
Dead code elimination | Removes unused primal or derivative values
Common subexpression elimination | Reuses repeated expressions
Algebraic simplification | Simplifies derivative formulas
Inlining | Exposes derivative opportunities
Fusion | Combines tensor kernels
Buffer reuse | Reduces memory allocation
Checkpoint planning | Trades recomputation for storage

The IR must be designed so these optimizations are legal. For example, algebraic simplification must respect floating-point behavior, NaNs, signed zero, and overflow when exact reproducibility matters.
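
As one example, dead code elimination over the instruction-list sketch from earlier is a short backward sweep; live_outputs names the values the caller still needs.

def dead_code_elimination(program, live_outputs):
    live = set(live_outputs)
    kept = []
    for instr in reversed(program):   # sweep backward from the outputs
        if instr.out in live:
            kept.append(instr)
            live.update(instr.args)   # its inputs become live too
    return list(reversed(kept))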

IR Design Choices

A good AD IR balances generality and analyzability.

It should be small enough that the set of differentiation rules stays manageable, but expressive enough that useful programs are not blown up into enormous low-level code too early.

It should preserve mathematical structure where that structure enables better derivatives. Lowering solve(A, b) into primitive loops too early loses the ability to use the efficient implicit derivative of a linear solve.

At the same time, it should expose enough low-level structure for memory planning and code generation.

The main design tension is this:

high-level IR preserves meaning
low-level IR exposes execution

AD compilers usually need both.

Example: Tensor Expression IR

A minimal tensor IR for AD might contain:

const
parameter
add
mul
matmul
transpose
reshape
broadcast
reduce_sum
exp
log
where
call
return

A function:

def loss(W, x, y):
    p = softmax(W @ x)
    return -sum(y * log(p))

may lower to:

%0 = matmul W x
%1 = softmax %0
%2 = log %1
%3 = mul y %2
%4 = reduce_sum %3
%5 = neg %4
return %5

Reverse transformation then attaches adjoint rules to each operation. If softmax remains high-level, it can use a stable derivative rule. If it is lowered into exp, sum, and div, the derivative may be less stable unless the optimizer recognizes the pattern.
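
The stable high-level rule mentioned here can be sketched in NumPy: for p = softmax(x), the Jacobian is diag(p) - p p^T, and its VJP simplifies to a form that never materializes the Jacobian.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # shift for numerical stability
    return e / e.sum()

def softmax_vjp(p, bar_p):
    # (diag(p) - p p^T)^T bar_p  ==  p * (bar_p - <bar_p, p>)
    return p * (bar_p - np.dot(bar_p, p))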

IR as the Boundary of Correctness

The derivative produced by an AD system is only as correct as the IR semantics.

If the IR says add is real-number addition but the backend implements floating-point addition, there is already a semantic gap. If the IR ignores aliasing, the derivative of mutated arrays may be wrong. If the IR treats a nondeterministic operation as pure, reverse replay may become invalid.

Therefore an AD compiler needs a precise contract for its IR.

The contract should answer:

What values exist?
Which values are differentiable?
Which operations are pure?
Which operations mutate state?
Which values must be saved for reverse mode?
Which transformations preserve semantics?

Without these answers, the AD system may work for examples but fail on large programs.

Summary

Intermediate representations are the working language of compiler-based AD.

They reduce a full programming language into a form where differentiation, analysis, optimization, and code generation can be implemented systematically. A good IR exposes data dependencies, control flow, types, shapes, and effects. It also preserves enough mathematical structure to generate efficient derivative code.

In small AD systems, the IR may be an implicit tape or computation graph. In production systems, the IR is usually explicit, typed, optimized, and layered. The design of this IR largely determines the power and reliability of the AD system.