Tracing is an implementation strategy where an AD system observes a program while it runs and records the operations that occur.
The result of tracing is a trace: a simplified program, tape, graph, or intermediate representation that represents one execution of the original program.
A normal function may look like this:

```python
def f(x):
    y = x * x
    z = sin(y)
    return z
```

When executed with trace-aware values, the system records:

```
x0 = input
x1 = mul x0 x0
x2 = sin x1
return x2
```

The trace can then be differentiated, optimized, compiled, cached, or replayed.
Why Tracing Exists
Source transformation needs access to source code or compiler IR. Operator overloading can compute derivatives directly but often lacks whole-program structure.
Tracing sits between them.
It lets the user write ordinary host-language code, then captures the numerical operations that actually execute. This is especially useful in dynamic languages, where static analysis of arbitrary source code is hard.
A tracing system can turn flexible user code into a more regular representation:
```
host-language program
  -> traced execution
  -> graph or IR
  -> AD transform
  -> optimization
  -> execution
```

The host language remains convenient, while the traced graph gives the AD system something compiler-like to work with.
Tracer Values
Tracing usually works by replacing ordinary values with tracer values.
A tracer value carries an abstract identity rather than only a concrete number.
```
Tracer(id=x0, shape=(), dtype=f64)
```

When the program applies an operation to tracer values, the operation does not merely compute a result. It records a node in the trace.
```python
x = Tracer("x0")
y = x * x
```

may record:

```
x1 = mul x0 x0
```

and return:

```
Tracer(id=x1)
```

Thus the program keeps running, but every primitive operation extends the trace.
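A minimal sketch of this mechanism, with all class and function names invented for illustration:

```python
class Trace:
    """Accumulates instructions as (output_id, op, input_ids) tuples."""
    def __init__(self):
        self.instructions = []
        self.counter = 0

    def fresh(self):
        name = f"x{self.counter}"
        self.counter += 1
        return name

class Tracer:
    """Carries an abstract identity; arithmetic extends the trace."""
    def __init__(self, trace, id):
        self.trace = trace
        self.id = id

    def __mul__(self, other):
        out = self.trace.fresh()
        self.trace.instructions.append((out, "mul", (self.id, other.id)))
        return Tracer(self.trace, out)

def sin(t):
    out = t.trace.fresh()
    t.trace.instructions.append((out, "sin", (t.id,)))
    return Tracer(t.trace, out)

# Tracing f(x) = sin(x * x): the function body runs normally, but every
# primitive appends an instruction instead of computing a number.
trace = Trace()
x = Tracer(trace, trace.fresh())   # x0 = input
z = sin(x * x)                     # records x1 = mul x0 x0, then x2 = sin x1
```

Running the traced function leaves the full instruction list behind in `trace.instructions`, ready for later transformation.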
Concrete and Abstract Tracing
Tracing systems differ in what they know during tracing.
Concrete tracing records operations while using actual runtime values. The trace may depend on those values.
```python
def f(x):
    if x > 0:
        return x * x
    else:
        return -x
```

If traced with x = 3, the trace contains only the positive branch:

```
x1 = mul x0 x0
return x1
```

If traced with x = -3, the trace contains the negative branch:

```
x1 = neg x0
return x1
```

Abstract tracing records operations using abstract values, such as shape and dtype, rather than exact numeric values. It can build a more general trace, but only when control flow is expressed in traceable form.
For example, tensor operations can be traced abstractly because their shapes and dtypes may be enough to determine graph structure.
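The value dependence of concrete tracing can be sketched with a recorder whose host-language `if` consumes the actual input (names illustrative):

```python
def trace_concrete(x_value):
    # Concrete tracing: the host-language `if` sees the real number, so the
    # recorded trace contains only the branch that actually ran.
    ops = []
    if x_value > 0:
        ops.append("x1 = mul x0 x0")
    else:
        ops.append("x1 = neg x0")
    ops.append("return x1")
    return ops
```

`trace_concrete(3)` and `trace_concrete(-3)` yield different traces, and neither is valid for inputs of the other sign.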
Trace-Time vs Run-Time
A traced program has two phases:
- trace time
- run time

Trace time executes the host-language function to build an IR. Run time executes the captured IR.
This distinction is central.
Code that runs at trace time may not run again when the compiled trace runs. Side effects at trace time can therefore be surprising.
Example:
```python
def f(x):
    print("running")
    return x * x
```

The print may occur while tracing, but not on later executions of the cached compiled trace.
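A toy model of the two phases, using a hard-coded stand-in for real tracing:

```python
log = []

def f(x):
    log.append("running")   # host-language side effect at trace time
    return x * x

def trace(fn, example):
    # Stand-in for real tracing: run the function once on an example input
    # and record the one primitive we know it performs.
    fn(example)
    return [("mul",)]

def run(ops, x):
    # Replaying the captured ops never re-enters the Python function body.
    for (op,) in ops:
        if op == "mul":
            x = x * x
    return x

ops = trace(f, 3.0)                        # side effect happens here, once
outs = [run(ops, v) for v in (2.0, 4.0)]   # replays: no side effect
```

After the two replays, `log` still holds a single entry: the effect ran at trace time only.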
This means traced code should separate numerical computation from host-language side effects.
Trace Specialization
Traces are often specialized to input properties.
Common specialization keys include:
| Specialized property | Reason |
|---|---|
| dtype | Determines primitive kernels and derivative rules |
| shape | Determines tensor layouts and reductions |
| rank | Determines broadcasting and contraction structure |
| static arguments | May control branches or loop counts |
| device | Affects lowering and execution |
| layout | Affects memory planning |
If a new call has different specialized properties, the system may retrace.
For example, a function traced for tensor<f32, [32, 128]> may require a new trace for tensor<f32, [64, 128]> unless the system supports shape polymorphism.
Control Flow Under Tracing
Ordinary host-language control flow can be traced through only when its conditions are known at trace time.
```python
def f(x, flag):
    if flag:
        return x * x
    return -x
```

If flag is a static argument, tracing selects one branch and records it.
But if a branch depends on a traced tensor value, the host language usually cannot decide which branch to take.
```python
def f(x):
    if sum(x) > 0:
        return x * x
    return -x
```

Here sum(x) > 0 may be a traced value. The host-language if expects a concrete boolean, not a symbolic condition.
A tracing system handles this by requiring graph-level control flow:
```python
y = cond(pred, then_fn, else_fn, x)
```

or:

```python
y = while_loop(cond_fn, body_fn, init)
```

These constructs record control flow inside the trace instead of using the host language to choose during tracing.
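One way such a primitive can work, sketched with a tiny trace interpreter (all names and the instruction encoding are illustrative):

```python
def build_trace():
    # Branch bodies are themselves small traces over an input named "a".
    then_ops = [("r", "mul", "a", "a")]   # r = a * a
    else_ops = [("r", "neg", "a")]        # r = -a
    return [
        ("p", "gt_zero", "x0"),                        # p = (x0 > 0)
        ("y", "cond", "p", then_ops, else_ops, "x0"),  # both branches recorded
    ]

def run(trace, env):
    # The branch is chosen here, at run time, not while the trace was built.
    for instr in trace:
        if instr[1] == "mul":
            env[instr[0]] = env[instr[2]] * env[instr[3]]
        elif instr[1] == "neg":
            env[instr[0]] = -env[instr[2]]
        elif instr[1] == "gt_zero":
            env[instr[0]] = env[instr[2]] > 0
        elif instr[1] == "cond":
            out, _, pred, then_ops, else_ops, operand = instr
            branch = then_ops if env[pred] else else_ops
            env[out] = run(branch, {"a": env[operand]})["r"]
    return env
```

Because the trace holds both branches, the same trace is valid for positive and negative inputs, which a concretely traced host-language `if` cannot offer.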
Loops
Loops follow the same rule.
A loop with a static count can be unrolled during tracing:
```python
def f(x):
    for i in range(3):
        x = x * x
    return x
```

The trace becomes:

```
x1 = mul x0 x0
x2 = mul x1 x1
x3 = mul x2 x2
return x3
```

A loop with a dynamic count needs a trace-level loop primitive:

```python
while_loop(cond, body, state)
```

Otherwise, the trace is tied to the particular number of iterations observed during tracing.
Unrolling can be efficient for small loops, but it can create huge traces for large loops. Loop primitives keep the trace compact and preserve structure for reverse-mode differentiation and checkpointing.
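The unrolling above can be sketched as a recorder whose host-language loop runs entirely at trace time (names illustrative):

```python
def trace_unrolled(n):
    # The host-language `for` executes during tracing; only its recorded
    # body survives into the trace, repeated n times.
    ops, cur = [], "x0"
    for i in range(n):
        new = f"x{i + 1}"
        ops.append(f"{new} = mul {cur} {cur}")
        cur = new
    ops.append(f"return {cur}")
    return ops
```

The trace length grows linearly with `n`, which is exactly why large loops call for a loop primitive instead.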
Tracing and Reverse Mode
Tracing is often used to build the graph needed by reverse mode.
During the forward execution, the system records primitive operations and their dependencies:
```
x0 = input
x1 = mul x0 x0
x2 = sin x1
return x2
```

The AD transform then builds a backward trace:

```
bar_x2 = seed
bar_x1 += cos(x1) * bar_x2
bar_x0 += x0 * bar_x1
bar_x0 += x0 * bar_x1
return bar_x0
```

The mul rule contributes one adjoint term per operand; because both operands of x1 = mul x0 x0 are x0, the accumulation into bar_x0 appears twice, which is how the trace recovers the factor of 2 in d(x^2)/dx.

For eager dynamic systems, the recorded trace may be consumed immediately by backward. For staged systems, the trace may be transformed into a reusable compiled gradient function.
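The forward-record, backward-walk pattern can be sketched for this exact trace (a minimal tape, not any library's implementation):

```python
import math

def forward(x0):
    # Record primal values alongside a fixed tape for f(x) = sin(x * x).
    vals = {"x0": x0}
    tape = [("x1", "mul", "x0", "x0"), ("x2", "sin", "x1")]
    for out, op, *ins in tape:
        if op == "mul":
            vals[out] = vals[ins[0]] * vals[ins[1]]
        elif op == "sin":
            vals[out] = math.sin(vals[ins[0]])
    return vals, tape

def backward(vals, tape, seed=1.0):
    # Walk the tape in reverse, accumulating adjoints (the bar values).
    bar = {name: 0.0 for name in vals}
    bar[tape[-1][0]] = seed
    for out, op, *ins in reversed(tape):
        if op == "mul":
            a, b = ins
            bar[a] += vals[b] * bar[out]   # accumulates twice when a == b
            bar[b] += vals[a] * bar[out]
        elif op == "sin":
            bar[ins[0]] += math.cos(vals[ins[0]]) * bar[out]
    return bar["x0"]
```

At x = 1, this yields d/dx sin(x^2) = 2x cos(x^2) = 2 cos(1), matching the analytic derivative.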
Tracing and Forward Mode
Forward mode can also use tracing.
The trace may first capture the primal computation. Then a JVP transformation produces a new trace that computes primal and tangent values together.
```
primal trace
  -> JVP transform
  -> primal-plus-tangent trace
```

This is useful when the system wants one uniform transformation pipeline for both forward and reverse mode.
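A sketch of what the transformed trace computes for f(x) = sin(x * x), with each primitive replaced by its primal-plus-tangent rule:

```python
import math

def jvp_f(x, dx):
    # Each primitive maps a (primal, tangent) pair to a (primal, tangent) pair.
    y, dy = x * x, 2.0 * x * dx            # product rule for x * x
    z, dz = math.sin(y), math.cos(y) * dy  # chain rule through sin
    return z, dz
```

A real JVP transform derives this pairing mechanically from the primal trace rather than by hand, but the resulting computation is the same.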
Purity Requirements
Tracing works best for pure numerical functions.
A pure function has output determined only by its input values and has no observable side effects.
Tracing becomes fragile when code performs:
- printing
- file I/O
- network calls
- mutation of global variables
- random number generation
- object identity checks
- data-dependent Python list mutation
- host-language exceptions

Some systems allow these behaviors, but their trace-time behavior must be specified. Otherwise, a function may behave differently when traced, compiled, cached, and replayed.
A common design is to restrict traced regions to array computation and require effects to be explicit primitives.
Randomness
Randomness requires special care.
If random numbers are drawn at trace time, the trace may capture a fixed random value. That is usually wrong.
A traceable random system represents randomness explicitly:
```python
key1, key2 = split(key)
x = normal(key1, shape)
```

The random key becomes an input and output of the computation. This makes randomness reproducible, differentiable where appropriate, and visible to the compiler.
For reverse mode, random values used in the forward pass must either be saved or regenerated from the same key.
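A hash-based sketch of a splittable key (illustrative only; production systems typically use counter-based PRNGs, but the interface is similar):

```python
import hashlib

def split(key, n=2):
    # Derive n child keys deterministically; the parent is never reused.
    return tuple(hashlib.sha256(key + bytes([i])).digest() for i in range(n))

def uniform(key):
    # Map the first 8 bytes of a key to a float in [0, 1).
    return int.from_bytes(key[:8], "big") / 2**64

key = b"seed-0"
key1, key2 = split(key)
draws = (uniform(key1), uniform(key2))
```

Because keys are plain values, the same key always reproduces the same draw, so a backward pass can regenerate forward-pass randomness instead of saving it.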
Shape Polymorphism
Basic tracing specializes to exact shapes. This can cause retracing:
```
trace for [32, 128]
trace for [64, 128]
trace for [128, 128]
```

Shape polymorphism allows dimensions to be symbolic:

```
tensor<f32, [B, 128]>
```

The same trace can then apply to multiple batch sizes.
Shape polymorphism improves reuse but complicates compilation. The compiler must reason about symbolic dimensions, dynamic allocation, and shape-dependent control flow.
Caching Traces
Tracing has overhead, so systems usually cache traces.
A cache key may include:
- function identity
- input dtypes
- input shapes
- static argument values
- device
- compiler options

On a cache hit, the system skips tracing and reuses the compiled representation.
This is why changing a static argument may trigger recompilation, while changing only array contents usually does not.
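A sketch of such a cache, where array contents deliberately never enter the key (all names illustrative):

```python
cache = {}
trace_count = 0

def traced_call(fn, shape, dtype, static_arg, contents):
    # The key holds specialization properties only; `contents` is absent,
    # so changing array values can never trigger a retrace.
    global trace_count
    key = (fn.__name__, shape, dtype, static_arg)
    if key not in cache:
        trace_count += 1   # cache miss: trace and compile
        cache[key] = f"compiled {fn.__name__} for {dtype}{list(shape)}"
    return cache[key]

def f(x):
    return x

traced_call(f, (32, 128), "f32", True, [1.0])
traced_call(f, (32, 128), "f32", True, [9.0])    # new contents: cache hit
traced_call(f, (32, 128), "f32", False, [1.0])   # new static arg: retrace
```

The three calls produce only two traces: the second call hits the cache because nothing in the key changed.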
Nested Tracing
AD systems often support nested transformations.
For example:
```python
grad(jit(f))
jit(grad(f))
vmap(grad(f))
grad(grad(f))
```

Each transformation may introduce its own tracing level.
Nested tracing must keep these levels distinct. A tracer created for one transformation should not be confused with a tracer from another transformation. Otherwise, the system can produce incorrect derivatives or leak symbolic values into the host language.
This is related to perturbation confusion in forward mode. Correct systems tag derivative levels and trace levels explicitly.
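One way to keep levels distinct, sketched with explicit level tags on tracers (illustrative, and far simpler than a real system):

```python
class LeveledTracer:
    def __init__(self, level, value):
        self.level = level   # which transformation created this tracer
        self.value = value

def level_of(x):
    return x.level if isinstance(x, LeveledTracer) else 0

def mul(a, b):
    # Dispatch at the highest level present; the other operand is treated
    # as a constant with respect to that level.
    lvl = max(level_of(a), level_of(b))
    if lvl == 0:
        return a * b
    unwrap = lambda t: t.value if level_of(t) == lvl else t
    return LeveledTracer(lvl, mul(unwrap(a), unwrap(b)))

outer = LeveledTracer(2, 3.0)   # tracer from an outer transformation
inner = LeveledTracer(1, 2.0)   # tracer from an inner transformation
r = mul(outer, inner)
```

The result stays wrapped in level order, outer tag outside inner tag, so each transformation later processes only its own tracers.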
Tracing vs Taping
Tracing and taping are closely related but have different emphasis.
| Term | Emphasis |
|---|---|
| Tape | Runtime record used mainly for reverse-mode replay |
| Trace | Captured program representation used for transformation or compilation |
| Graph | Data dependency structure |
| IR | Compiler representation with formal semantics |
A tape may be ephemeral and consumed by one backward pass. A trace may be persistent, optimized, cached, and compiled.
In practice, the boundary is fluid. Many systems use a tape as a dynamic graph IR.
Failure Modes
Tracing systems have characteristic failure modes.
A value may be needed by the host language as a concrete value, but during tracing it is symbolic.
Example:
```python
def f(x):
    return range(int(x))
```

If x is traced, int(x) cannot be computed at trace time unless x is static.
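A tracer can make this failure explicit by refusing the conversion (sketch; the error message is invented):

```python
class Tracer:
    def __init__(self, id):
        self.id = id

    def __int__(self):
        # A symbolic value has no concrete integer at trace time.
        raise TypeError(
            f"cannot convert traced value {self.id} to a concrete int; "
            "mark the argument static or restructure the code"
        )

x = Tracer("x0")
try:
    list(range(int(x)))   # fails during tracing, not silently later
    escaped = True
except TypeError:
    escaped = False
```

Failing loudly at trace time is preferable to letting a symbolic value leak into host-language code with undefined behavior.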
Another failure mode is accidental specialization. A branch, loop bound, or data structure size may be fixed by the tracing example, producing a trace that works only for that case.
A third failure mode is hidden side effects. A mutation performed during tracing may not occur during later executions of the compiled trace.
These failures are not incidental. They follow from the two-stage nature of tracing.
Design Criteria
A tracing system for AD should specify:
- Which operations are traceable?
- Which values are static?
- Which values are symbolic?
- Which host-language effects are allowed?
- How are control flow and loops represented?
- What causes retracing?
- How are traces cached?
- How are nested traces separated?
- How are custom primitives registered?
- How are errors reported when symbolic values escape?

Clear answers are essential. Without them, users face behavior that appears arbitrary.
Summary
Tracing records an execution of a program into a graph, tape, or IR.
It gives AD systems a practical path between dynamic host-language programming and compiler-based optimization. The user writes ordinary code, while the system captures numerical operations for differentiation and compilation.
Tracing works best when numerical computation is pure, effects are explicit, shapes are understood, and control flow is represented at the trace level when it depends on symbolic values. Its main tradeoff is specialization: a trace represents a class of executions, not necessarily every possible execution of the original program.