Intermediate variables are the named values created between program inputs and program outputs. They make automatic differentiation mechanical.
Consider:
```
y = sin(x1 * x2) + exp(x2)
```

A straight-line version is:

```
v1 = x1
v2 = x2
v3 = v1 * v2
v4 = sin(v3)
v5 = exp(v2)
v6 = v4 + v5
y = v6
```

The expression has been decomposed into elementary assignments. Each assignment has one local derivative rule. AD does not need to reason about the whole expression at once.
Variables as Program State
At runtime, each intermediate variable stores a primal value. The primal value is the ordinary value computed by the original program.
For example, with x1 = 2 and x2 = 3:

```
v1 = 2
v2 = 3
v3 = 6
v4 = sin(6)
v5 = exp(3)
v6 = sin(6) + exp(3)
```

Automatic differentiation augments this state.
In forward mode, each variable also stores a tangent, written dvi here: the derivative of vi with respect to a chosen input, dvi = ∂vi/∂x.

In reverse mode, each variable eventually receives an adjoint, written adj(vi): the derivative of the output with respect to vi, adj(vi) = ∂y/∂vi.
The intermediate variable gives AD a place to attach derivative information.
Naming Subexpressions
Intermediate variables name subexpressions. This prevents repeated work and gives the computation a graph structure.
Without intermediate variables:

```
y = sin(x1 * x2) + exp(x2)
```

With intermediate variables:

```
v3 = v1 * v2
v4 = sin(v3)
v5 = exp(v2)
v6 = v4 + v5
```

The variable v3 is used as the input to v4 = sin(v3). The variable v5 is computed independently of v3 and v4. These dependencies determine how derivatives flow.
Local Derivative Rules
Each intermediate assignment defines a local map.
For

```
v = a * b
```

the local differential is:

```
dv = b * da + a * db
```

For

```
v = sin(a)
```

the local differential is:

```
dv = cos(a) * da
```

For

```
v = exp(a)
```

the local differential is:

```
dv = exp(a) * da
```
The AD engine applies these rules line by line. It does not need symbolic simplification.
Forward Mode View
Forward mode propagates tangents with the primal computation.
For

```
v3 = v1 * v2
```

the tangent rule is:

```
dv3 = dv1 * v2 + v1 * dv2
```

For

```
v4 = sin(v3)
```

the tangent rule is:

```
dv4 = cos(v3) * dv3
```

For

```
v6 = v4 + v5
```

the tangent rule is:

```
dv6 = dv4 + dv5
```
The intermediate variables carry both values and tangent values through the same evaluation order.
Reverse Mode View
Reverse mode first computes all intermediate primal values. Then it walks backward and accumulates adjoints.
For

```
v6 = v4 + v5
```

the reverse rule is:

```
adj(v4) += adj(v6)
adj(v5) += adj(v6)
```

For

```
v5 = exp(v2)
```

the reverse rule is:

```
adj(v2) += adj(v5) * exp(v2)
```

For

```
v4 = sin(v3)
```

the reverse rule is:

```
adj(v3) += adj(v4) * cos(v3)
```

For

```
v3 = v1 * v2
```

the reverse rule is:

```
adj(v1) += adj(v3) * v2
adj(v2) += adj(v3) * v1
```

Intermediate variables are necessary because reverse mode needs the primal values v1, v2, and v3 during the backward pass.
Storage Requirements
Forward mode can often discard intermediate derivative state once it has been consumed. Reverse mode usually cannot.
Reverse mode needs enough information to replay local derivative rules backward. For each instruction, it may need:
| Stored item | Purpose |
|---|---|
| Operation code | Select the derivative rule |
| Input variable IDs | Know where adjoints flow |
| Output variable ID | Read the output adjoint |
| Input primal values | Evaluate local derivatives |
| Shape and dtype metadata | Apply tensor derivative rules |
| Alias and mutation metadata | Preserve program semantics |
This stored execution record is commonly called a tape.
Common Subexpressions
Intermediate variables also expose sharing.
Compare:
```
y = sin(x * x) + cos(x * x)
```

A naive expression tree may compute x * x twice. A straight-line program can compute it once:

```
v1 = x
v2 = v1 * v1
v3 = sin(v2)
v4 = cos(v2)
v5 = v3 + v4
y = v5
```

The derivative of v2 receives contributions from both uses:

```
adj(v2) = adj(v3) * cos(v2) + adj(v4) * (-sin(v2))
```
This accumulation is central to reverse mode. When one variable is used by many later operations, its adjoint is the sum of all downstream contributions.
Single Assignment Form
AD is easiest when every intermediate variable is assigned exactly once.
Good:
```
v1 = x
v2 = v1 * v1
v3 = v2 + 1
```

Harder:

```
v = x
v = v * v
v = v + 1
```

The second program mutates v. To differentiate it cleanly, an AD system often converts it into single assignment form:

```
v1 = x
v2 = v1 * v1
v3 = v2 + 1
```

Single assignment form makes data dependencies explicit. It also prevents ambiguity in reverse mode, where the old value of a variable may be needed after the variable has been overwritten.
Intermediates in Tensor Programs
In tensor programs, intermediate variables may be large arrays.
```
v1 = matmul(x, w)
v2 = add(v1, b)
v3 = relu(v2)
v4 = matmul(v3, u)
y = loss(v4, target)
```

Here, each vi may contain millions of numbers. Reverse mode often stores these tensors because the backward pass needs them.

For example, ReLU requires knowing which entries were positive:

```
v3 = relu(v2)
```

The backward rule is:

```
adj(v2) = adj(v3) where v2 > 0, else 0
```

The mask depends on the primal value v2. The system can either store v2, store a compressed bit mask, or recompute the mask during backward execution.
Lifetime of Intermediate Variables
An intermediate variable has a lifetime.
In the forward computation, its lifetime begins when it is computed. It ends when no later operation needs it.
In reverse mode, the lifetime may extend much longer because the backward pass may need the primal value.
This creates a memory problem. Large AD systems must decide which intermediates to store, which to recompute, and which to discard. This is the basis of checkpointing.
Minimal Implementation Model
A small AD engine can represent intermediate variables as integer IDs.
```go
type VarID int

type Value struct {
	Primal  float64
	Tangent float64
}
```

A forward-mode multiplication rule can be written as:
```go
func mul(a, b Value) Value {
	return Value{
		Primal:  a.Primal * b.Primal,
		Tangent: a.Tangent*b.Primal + a.Primal*b.Tangent,
	}
}
```

For reverse mode, the variable needs an adjoint slot:
```go
type Node struct {
	Primal   float64
	Adj      float64
	Prev     []VarID
	Backward func(outAdj float64)
}
```

A multiplication node records enough information to propagate gradients backward:
```go
func mul(tape *[]Node, a, b VarID) VarID {
	av := (*tape)[a].Primal
	bv := (*tape)[b].Primal
	out := VarID(len(*tape))
	*tape = append(*tape, Node{
		Primal: av * bv,
		Prev:   []VarID{a, b},
		Backward: func(outAdj float64) {
			// Index through the tape pointer, not a captured slice
			// header, so the closure stays valid when a later append
			// reallocates the backing array.
			(*tape)[a].Adj += outAdj * bv
			(*tape)[b].Adj += outAdj * av
		},
	})
	return out
}
```

This simplified code shows the idea but omits important engineering details, especially closure capture, mutation safety, tensor storage, and concurrency.
Design Rule
Intermediate variables should make dependencies explicit.
A good AD representation answers four questions for each value:
| Question | Example |
|---|---|
| How was this value computed? | v3 = mul(v1, v2) |
| Which values does it depend on? | v1, v2 |
| Which later values use it? | v4, v7 |
| What derivative rule applies? | product rule |
Once these questions are explicit, automatic differentiation becomes an execution discipline rather than a symbolic manipulation problem.
Core Idea
Intermediate variables are the handles by which AD controls a computation. They store primal values, expose dependencies, carry tangents in forward mode, receive adjoints in reverse mode, and define the storage requirements of the derivative computation.
A program without explicit intermediates may look compact to a human. A program with explicit intermediates is easier for an AD system to evaluate, transform, store, and differentiate.