# Intermediate Representations

An intermediate representation, or IR, is the internal program form used by a compiler or AD system after parsing and before final code generation.

For automatic differentiation, the IR is where the system decides what the program means. It is also where differentiation becomes mechanical.

A source program may contain syntax such as:

```python
def f(x):
    y = x * x
    z = sin(y)
    return z
```

An AD compiler usually lowers this into a smaller core language:

```text
x0 = input
x1 = mul(x0, x0)
x2 = sin(x1)
return x2
```

This representation removes surface syntax and exposes the operations that need derivative rules.
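As a minimal sketch of how such lowering can be implemented, a tracer can box values, overload arithmetic, and record each primitive into an instruction list. All names here (`Tracer`, the `(op, inputs, output)` tuple form) are illustrative, not any particular system's API:

```python
class Tracer:
    """Boxes a value name and records each primitive applied to it."""
    def __init__(self, name, tape):
        self.name = name   # SSA-style name of this value, e.g. "x0"
        self.tape = tape   # shared list of (op, input names, output name)

    def _emit(self, op, *inputs):
        out = Tracer(f"x{len(self.tape) + 1}", self.tape)
        self.tape.append((op, [t.name for t in inputs], out.name))
        return out

    def __mul__(self, other):
        return self._emit("mul", self, other)

def sin(t):
    return t._emit("sin", t)

def trace(f):
    tape = []
    result = f(Tracer("x0", tape))
    return tape, result.name

tape, out = trace(lambda x: sin(x * x))
for op, ins, name in tape:
    print(f"{name} = {op}({', '.join(ins)})")
# x1 = mul(x0, x0)
# x2 = sin(x1)
```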

### Why AD Systems Use IRs

Differentiating raw source code is difficult. Real languages contain many constructs: classes, closures, exceptions, mutation, modules, operator syntax, dynamic dispatch, loops, comprehensions, and foreign calls.

An IR gives the AD system a smaller target.

Instead of writing derivative logic for every source-level construct, the compiler lowers source code into a compact set of primitives. AD rules are then defined for those primitives.
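One common way to organize those rules, sketched below with illustrative names, is a registry mapping each primitive to its local derivative rule, so the AD transform reduces to a table lookup per instruction:

```python
import math

# VJP rules: (inputs..., output adjoint) -> input adjoints.
# The registry and signatures are illustrative, not a real system's API.
VJP_RULES = {
    "add": lambda x, y, g: (g, g),
    "mul": lambda x, y, g: (g * y, g * x),
    "sin": lambda x, g: (g * math.cos(x),),
}

def vjp_rule(op):
    rule = VJP_RULES.get(op)
    if rule is None:
        raise NotImplementedError(f"no derivative rule for primitive {op!r}")
    return rule
```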

A useful AD IR makes these things explicit:

| Property | Why it matters |
|---|---|
| Operation | Selects the local derivative rule |
| Inputs and outputs | Defines data dependencies |
| Control flow | Determines which path is differentiated |
| Type and shape | Enables correct tensor derivative rules |
| Effects | Tracks mutation, I/O, randomness, and state |
| Memory lifetime | Controls saved values for reverse mode |

The IR is not merely an implementation detail. It determines what programs the AD system can differentiate reliably.

### Primal IR and Derivative IR

The original lowered program is the primal IR. It computes ordinary values.

```text
%0 = input x
%1 = mul %0, %0
%2 = sin %1
return %2
```

Forward-mode transformation augments the IR with tangent values:

```text
%0, d%0 = input x, dx
%1 = mul %0, %0
d%1 = add (mul d%0, %0), (mul %0, d%0)
%2 = sin %1
d%2 = mul (cos %1), d%1
return %2, d%2
```
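A runnable sketch of this transformation, interpreting the `(op, inputs, output)` instruction form from the earlier tracer and carrying a (value, tangent) pair per name; the rule tables are illustrative:

```python
import math

# Primal and tangent (JVP) rules per primitive; names are illustrative.
PRIMAL = {
    "mul": lambda xs: xs[0] * xs[1],
    "sin": lambda xs: math.sin(xs[0]),
}
JVP = {
    "mul": lambda xs, dxs: dxs[0] * xs[1] + xs[0] * dxs[1],
    "sin": lambda xs, dxs: math.cos(xs[0]) * dxs[0],
}

def forward_mode(tape, x, dx):
    """Interpret the tape, carrying (value, tangent) pairs through it."""
    env = {"x0": (x, dx)}
    for op, ins, out in tape:
        xs  = [env[n][0] for n in ins]
        dxs = [env[n][1] for n in ins]
        env[out] = (PRIMAL[op](xs), JVP[op](xs, dxs))
    return env[out]

tape = [("mul", ["x0", "x0"], "x1"), ("sin", ["x1"], "x2")]
print(forward_mode(tape, 1.5, 1.0))  # (sin(2.25), cos(2.25) * 3.0)
```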

Reverse-mode transformation usually produces a forward IR plus a reverse IR.

```text
forward:
    %0 = input x
    %1 = mul %0, %0
    %2 = sin %1
    save %0, %1
    return %2

reverse:
    adj%2 = seed
    adj%1 += mul (cos %1), adj%2
    adj%0 += mul %0, adj%1
    adj%0 += mul %0, adj%1
    return adj%0
```

The forward pass computes values and saves the subset needed by the reverse pass. The reverse pass consumes adjoints and saved primal values. Note that `adj%0` accumulates two contributions because `%0` appears twice as an input to `mul`; adjoint updates use `+=` precisely so that multiple uses of a value sum correctly.
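The same structure is easy to express as an interpreter over the instruction list: a forward sweep that saves primal values, then a backward sweep that accumulates adjoints. A minimal sketch, again with illustrative names:

```python
import math

def reverse_mode(tape, x):
    """Forward sweep saves primal values; backward sweep accumulates adjoints."""
    env = {"x0": x}
    for op, ins, out in tape:                # forward: compute and save
        xs = [env[n] for n in ins]
        env[out] = xs[0] * xs[1] if op == "mul" else math.sin(xs[0])
    adj = dict.fromkeys(env, 0.0)
    adj[tape[-1][2]] = 1.0                   # seed the output adjoint
    for op, ins, out in reversed(tape):      # reverse: propagate adjoints
        g, xs = adj[out], [env[n] for n in ins]
        if op == "mul":
            adj[ins[0]] += g * xs[1]
            adj[ins[1]] += g * xs[0]         # two contributions if ins alias
        else:  # sin
            adj[ins[0]] += g * math.cos(xs[0])
    return env[tape[-1][2]], adj["x0"]

tape = [("mul", ["x0", "x0"], "x1"), ("sin", ["x1"], "x2")]
value, grad = reverse_mode(tape, 1.5)
print(grad)  # 2 * 1.5 * cos(1.5 ** 2)
```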

### IR Granularity

The granularity of an IR controls the unit of differentiation.

A low-level scalar IR may contain operations such as:

```text
add
mul
sin
load
store
branch
```

A tensor IR may contain operations such as:

```text
matmul
conv2d
softmax
layer_norm
reduce_sum
broadcast
reshape
```

A high-level scientific IR may contain operations such as:

```text
solve_linear_system
fft
ode_solve
eigendecomposition
```

Each level has tradeoffs.

| IR level | Strength | Cost |
|---|---|---|
| Scalar IR | Precise, general, close to machine code | Huge graphs for tensor programs |
| Tensor IR | Compact, good for ML workloads | Needs many shape and broadcasting rules |
| High-level IR | Preserves mathematical structure | Needs specialized derivative rules |
| Low-level memory IR | Enables allocation and layout optimization | Harder to recover mathematical intent |

A mature system often uses several IRs. It may differentiate a high-level tensor graph, lower it to loop IR, then optimize memory and code generation at a lower level.

### Operation Semantics

Every differentiable operation in the IR needs a semantic contract.

For example, `mul(x, y)` means ordinary multiplication, with local derivative rules:

```text
d(x * y) = dx * y + x * dy
```

Reverse mode uses adjoint rules:

```text
bar_x += y * bar_z
bar_y += x * bar_z
```
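Such a contract can be checked numerically. A quick finite-difference sanity check of the `mul` adjoint rule at one point (the tolerance is chosen arbitrarily):

```python
def mul_vjp(x, y, bar_z):
    """Adjoint contract for z = x * y."""
    return bar_z * y, bar_z * x

x, y, bar_z, eps = 1.3, -0.7, 1.0, 1e-6
bar_x, _ = mul_vjp(x, y, bar_z)
fd_x = (((x + eps) * y) - (x * y)) / eps   # finite-difference d z / d x
assert abs(bar_x - fd_x) < 1e-4
```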

For tensors, the contract must include shape behavior.

Example:

```text
z = add(x, y)
```

If `y` is broadcast to match `x`, the reverse rule for `y` must reduce over the broadcast dimensions.

```text
bar_x += bar_z
bar_y += reduce_sum_to_shape(bar_z, shape(y))
```

This is why IR design must specify more than operation names. It must specify rank, shape, layout, broadcasting, aliasing, and effects.
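A minimal NumPy sketch of such an unbroadcasting helper, matching the `reduce_sum_to_shape` name used above (the implementation assumes NumPy-style broadcasting rules):

```python
import numpy as np

def reduce_sum_to_shape(bar_z, shape):
    """Sum bar_z down to `shape`, undoing NumPy-style broadcasting."""
    # sum away leading axes that broadcasting prepended
    while bar_z.ndim > len(shape):
        bar_z = bar_z.sum(axis=0)
    # sum (keeping dims) over axes that were broadcast from size 1
    for axis, size in enumerate(shape):
        if size == 1 and bar_z.shape[axis] != 1:
            bar_z = bar_z.sum(axis=axis, keepdims=True)
    return bar_z

bar_z = np.ones((4, 3))
print(reduce_sum_to_shape(bar_z, (1, 3)))  # [[4. 4. 4.]]
print(reduce_sum_to_shape(bar_z, (3,)))    # [4. 4. 4.]
```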

### Type and Shape Information

AD transformation becomes safer when type and shape information are present in the IR.

A typed IR can distinguish:

```text
f64
tensor<f32, [1024, 768]>
matrix<f64, [n, n]>
sparse_csr<f32>
```

This lets the AD system select correct rules.

For instance, matrix multiplication:

```text
C = A @ B
```

has reverse rules:

```text
bar_A += bar_C @ B^T
bar_B += A^T @ bar_C
```

These rules require compatible matrix dimensions. Shape information can validate them before execution.
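A quick NumPy check of these rules against finite differences, at one randomly chosen point:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
bar_C = rng.normal(size=(3, 2))      # incoming adjoint for C = A @ B

bar_A = bar_C @ B.T                  # shape (3, 4), matches A
bar_B = A.T @ bar_C                  # shape (4, 2), matches B

# finite-difference check of one entry of bar_A, for the scalar
# objective L = sum(C * bar_C)
eps = 1e-6
A2 = A.copy()
A2[0, 0] += eps
fd = (((A2 @ B) * bar_C).sum() - ((A @ B) * bar_C).sum()) / eps
assert abs(fd - bar_A[0, 0]) < 1e-4
```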

Typed IR also helps decide which values are differentiable. Integers, booleans, strings, handles, file descriptors, and control tokens usually do not receive tangents or adjoints.

### Control-Flow IR

Differentiating straight-line code is simple. Differentiating control flow requires explicit representation.

A control-flow IR may contain basic blocks:

```text
block entry(x):
    c = gt(x, 0)
    branch c, positive, negative

block positive:
    y = mul(x, x)
    jump exit(y)

block negative:
    y = neg(x)
    jump exit(y)

block exit(y):
    return y
```

Forward mode transforms each block while preserving the control-flow structure.

Reverse mode needs more work. It must propagate adjoints backward through the executed path. If a branch was taken in the forward pass, the reverse pass must follow the matching reverse branch.

For loops, reverse mode often needs a loop tape:

```text
forward loop:
    record iteration count
    save needed values per iteration

reverse loop:
    for iterations in reverse order:
        restore saved values
        propagate adjoints
```

This is one reason many systems lower dynamic control flow into explicit graph or region constructs before differentiation.
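A concrete sketch of the loop-tape pattern, for the scalar loop that applies `x = sin(x)` a fixed number of times (the tape format is illustrative):

```python
import math

def loop_forward(x, n):
    tape = []
    for _ in range(n):
        tape.append(x)                # save the value each iteration needs
        x = math.sin(x)
    return x, tape

def loop_reverse(tape, adj_out):
    adj = adj_out
    for x in reversed(tape):          # replay iterations in reverse order
        adj *= math.cos(x)            # adjoint of x_new = sin(x_old)
    return adj

y, tape = loop_forward(0.5, 3)
print(loop_reverse(tape, 1.0))        # d/dx sin(sin(sin(x))) at x = 0.5
```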

### Effect Representation

A pure mathematical function is easy to differentiate. A real program may read memory, mutate arrays, allocate buffers, draw random numbers, print values, or call external libraries.

An AD IR needs a way to represent effects.

Common effects include:

| Effect | AD concern |
|---|---|
| Mutation | Reverse pass may need old values |
| Aliasing | Multiple names may refer to same storage |
| Randomness | Need reproducibility or reparameterization |
| I/O | Usually nondifferentiable |
| Exceptions | May interrupt derivative control flow |
| Foreign calls | Need custom derivative rules |
| Parallel writes | Need deterministic adjoint accumulation |

Some systems avoid effects inside differentiable regions. Others track effects explicitly with tokens, memory SSA, or functional updates.

A simple array update:

```text
a[i] = v
```

may be represented functionally:

```text
a2 = scatter(a1, i, v)
```

This form is easier to differentiate because the old and new array values have distinct names.
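Both the primal and the adjoint of the functional form are then ordinary pure operations. A NumPy sketch with illustrative names:

```python
import numpy as np

def scatter(a, i, v):
    """Functional update: a new array with index i replaced by v."""
    out = a.copy()
    out[i] = v
    return out

def scatter_vjp(i, bar_a2):
    """Adjoints for a2 = scatter(a1, i, v)."""
    bar_v = bar_a2[i]          # gradient flows to the written value
    bar_a1 = bar_a2.copy()
    bar_a1[i] = 0.0            # the old a1[i] was overwritten: no flow
    return bar_a1, bar_v

a1 = np.arange(4.0)
a2 = scatter(a1, 2, 10.0)
bar_a1, bar_v = scatter_vjp(2, np.ones(4))
print(a2, bar_a1, bar_v)       # [0. 1. 10. 3.] [1. 1. 0. 1.] 1.0
```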

### Custom Primitives

No IR can include every operation as a built-in primitive.

A practical AD system supports custom primitives. A custom primitive is an operation with user-provided derivative rules.

For example:

```text
primitive fft(x)
jvp rule: ...
vjp rule: ...
```

This lets the system treat `fft` as a single operation instead of expanding it into lower-level code.
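A sketch of what such registration might look like, using an illustrative registry rather than any real system's API. The DFT is linear, so its JVP is the FFT of the tangent; the VJP shown applies the adjoint of NumPy's unnormalized DFT (`n * ifft`), glossing over complex-conjugation conventions that differ between systems:

```python
import numpy as np

CUSTOM_PRIMITIVES = {}

def register_primitive(name, impl, jvp, vjp):
    """Register an opaque operation with user-supplied derivative rules."""
    CUSTOM_PRIMITIVES[name] = {"impl": impl, "jvp": jvp, "vjp": vjp}

register_primitive(
    "fft",
    impl=np.fft.fft,
    jvp=lambda x, dx: np.fft.fft(dx),                  # linearity of the DFT
    vjp=lambda x, bar_y: x.size * np.fft.ifft(bar_y),  # adjoint of the DFT
)
```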

Custom primitives are essential for:

```text
linear solvers
ODE solvers
random samplers
sorting-like operations
GPU kernels
database operators
external numerical libraries
```

The primitive boundary is a contract. The AD system trusts the supplied derivative rule. Incorrect custom rules produce incorrect gradients even when the compiler transformation itself is correct.

### Optimization on IR

After differentiation, the generated IR may contain redundant computation.

Example:

```text
x1 = mul x, x
dx1 = add (mul dx, x), (mul x, dx)
```

An optimizer can simplify:

```text
dx1 = mul 2, (mul x, dx)
```

Typical IR optimizations include:

| Optimization | Effect |
|---|---|
| Constant folding | Computes static expressions early |
| Dead code elimination | Removes unused primal or derivative values |
| Common subexpression elimination | Reuses repeated expressions |
| Algebraic simplification | Simplifies derivative formulas |
| Inlining | Exposes derivative opportunities |
| Fusion | Combines tensor kernels |
| Buffer reuse | Reduces memory allocation |
| Checkpoint planning | Trades recomputation for storage |

The IR must be designed so these optimizations are legal. For example, algebraic simplification must respect floating-point behavior, NaNs, signed zero, and overflow when exact reproducibility matters.
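As one example, a sketch of dead code elimination over the `(op, inputs, output)` instruction form used earlier, valid only when all ops are pure:

```python
def dead_code_elimination(tape, live_outputs):
    """Drop instructions whose outputs are never used (all ops assumed pure)."""
    live = set(live_outputs)
    kept = []
    for op, ins, out in reversed(tape):  # walk backward from the live outputs
        if out in live:
            kept.append((op, ins, out))
            live.update(ins)             # the inputs become live in turn
    return list(reversed(kept))

tape = [
    ("mul", ["x0", "x0"], "x1"),
    ("sin", ["x1"], "x2"),               # unused below: eliminated
    ("cos", ["x1"], "x3"),
]
print(dead_code_elimination(tape, ["x3"]))
# [('mul', ['x0', 'x0'], 'x1'), ('cos', ['x1'], 'x3')]
```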

### IR Design Choices

A good AD IR balances generality and analyzability.

It should be small enough that differentiation rules are manageable, but expressive enough that useful programs do not explode into enormous graphs too early in the pipeline.

It should preserve mathematical structure where that structure enables better derivatives. Lowering `solve(A, b)` into primitive loops too early loses the ability to use the efficient implicit derivative of a linear solve.

At the same time, it should expose enough low-level structure for memory planning and code generation.

The main design tension is this:

```text
high-level IR preserves meaning
low-level IR exposes execution
```

AD compilers usually need both.

### Example: Tensor Expression IR

A minimal tensor IR for AD might contain:

```text
const
parameter
add
mul
matmul
transpose
reshape
broadcast
reduce_sum
exp
log
where
call
return
```

A function:

```python
def loss(W, x, y):
    p = softmax(W @ x)
    return -sum(y * log(p))
```

may lower to:

```text
%0 = matmul W x
%1 = softmax %0
%2 = log %1
%3 = mul y %2
%4 = reduce_sum %3
%5 = neg %4
return %5
```

Reverse transformation then attaches adjoint rules to each operation. If `softmax` remains high-level, it can use a stable derivative rule. If it is lowered into `exp`, `sum`, and `div`, the derivative may be less stable unless the optimizer recognizes the pattern.
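For reference, a NumPy sketch of the stable high-level rule: for p = softmax(x), the Jacobian is diag(p) - p p^T, so the VJP of an adjoint g is p * (g - <g, p>):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())              # shift for numerical stability
    return e / e.sum()

def softmax_vjp(p, g):
    """VJP for p = softmax(x): bar_x = p * (g - <g, p>)."""
    return p * (g - np.dot(g, p))

x = np.array([1.0, 2.0, 0.5])
p = softmax(x)
g = np.array([0.0, 1.0, 0.0])            # adjoint arriving at p
print(softmax_vjp(p, g))
```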

### IR as the Boundary of Correctness

The derivative produced by an AD system is only as correct as the IR semantics.

If the IR says `add` is real-number addition but the backend implements floating-point addition, there is already a semantic gap. If the IR ignores aliasing, the derivative of mutated arrays may be wrong. If the IR treats a nondeterministic operation as pure, reverse replay may become invalid.

Therefore an AD compiler needs a precise contract for its IR.

The contract should answer:

```text
What values exist?
Which values are differentiable?
Which operations are pure?
Which operations mutate state?
Which values must be saved for reverse mode?
Which transformations preserve semantics?
```

Without these answers, the AD system may work for examples but fail on large programs.

### Summary

Intermediate representations are the working language of compiler-based AD.

They reduce a full programming language into a form where differentiation, analysis, optimization, and code generation can be implemented systematically. A good IR exposes data dependencies, control flow, types, shapes, and effects. It also preserves enough mathematical structure to generate efficient derivative code.

In small AD systems, the IR may be an implicit tape or computation graph. In production systems, the IR is usually explicit, typed, optimized, and layered. The design of this IR largely determines the power and reliability of the AD system.

