Tracing Systems

Tracing is an implementation strategy where an AD system observes a program while it runs and records the operations that occur.

The result of tracing is a trace: a simplified program, tape, graph, or intermediate representation that represents one execution of the original program.

A normal function may look like this:

def f(x):
    y = x * x
    z = sin(y)
    return z

When executed with trace-aware values, the system records:

x0 = input
x1 = mul x0 x0
x2 = sin x1
return x2

The trace can then be differentiated, optimized, compiled, cached, or replayed.

Why Tracing Exists

Source transformation needs access to source code or compiler IR. Operator overloading can compute derivatives directly but often lacks whole-program structure.

Tracing sits between them.

It lets the user write ordinary host-language code, then captures the numerical operations that actually execute. This is especially useful in dynamic languages, where static analysis of arbitrary source code is hard.

A tracing system can turn flexible user code into a more regular representation:

host-language program
    -> traced execution
    -> graph or IR
    -> AD transform
    -> optimization
    -> execution

The host language remains convenient, while the traced graph gives the AD system something compiler-like to work with.

Tracer Values

Tracing usually works by replacing ordinary values with tracer values.

A tracer value carries an abstract identity rather than only a concrete number.

Tracer(id=x0, shape=(), dtype=f64)

When the program applies an operation to tracer values, the operation does not merely compute a result. It records a node in the trace.

x = Tracer("x0")
y = x * x

may record:

x1 = mul x0 x0

and return:

Tracer(id=x1)

Thus the program keeps running, but every primitive operation extends the trace.
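The mechanism can be sketched in a few lines of Python. Everything here, the Tracer class, the global TRACE list, and fresh_id, is illustrative structure, not any particular library's API:

```python
# A minimal tracer sketch: tracer values record each primitive into a
# shared trace while the host program runs.

TRACE = []        # recorded operations, in execution order
_counter = 0

def fresh_id():
    global _counter
    name = f"x{_counter}"
    _counter += 1
    return name

class Tracer:
    def __init__(self, id):
        self.id = id

    def __mul__(self, other):
        out = Tracer(fresh_id())
        TRACE.append((out.id, "mul", self.id, other.id))
        return out

def sin(t):
    out = Tracer(fresh_id())
    TRACE.append((out.id, "sin", t.id))
    return out

def f(x):
    y = x * x
    z = sin(y)
    return z

x = Tracer(fresh_id())   # x0 = input
result = f(x)
for op in TRACE:
    print(op)
```

Running f on a Tracer leaves `('x1', 'mul', 'x0', 'x0')` and `('x2', 'sin', 'x1')` in the trace, mirroring the listing above.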

Concrete and Abstract Tracing

Tracing systems differ in what they know during tracing.

Concrete tracing records operations while using actual runtime values. The trace may depend on those values.

def f(x):
    if x > 0:
        return x * x
    else:
        return -x

If traced with x = 3, the trace contains only the positive branch:

x1 = mul x0 x0
return x1

If traced with x = -3, the trace contains the negative branch:

x1 = neg x0
return x1
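This branch-dependent behavior is easy to reproduce with a small sketch. The host `if` runs on the actual value during tracing, so only the branch taken is recorded (names here are illustrative):

```python
# Sketch of concrete tracing: comparisons use the concrete value,
# so tracing records only the branch that actually executes.

def trace_concrete(fn, x_val):
    ops = []
    class T:
        def __init__(self, id, val): self.id, self.val = id, val
        def __gt__(self, other): return self.val > other   # concrete compare
        def __mul__(self, other):
            ops.append(("mul", self.id, other.id))
            return T(f"x{len(ops)}", self.val * other.val)
        def __neg__(self):
            ops.append(("neg", self.id))
            return T(f"x{len(ops)}", -self.val)
    fn(T("x0", x_val))
    return ops

def f(x):
    if x > 0:
        return x * x
    return -x

print(trace_concrete(f, 3))    # [('mul', 'x0', 'x0')]
print(trace_concrete(f, -3))   # [('neg', 'x0')]
```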

Abstract tracing records operations using abstract values, such as shape and dtype, rather than exact numeric values. It can build a more general trace, but only when control flow is expressed in traceable form.

For example, tensor operations can be traced abstractly because their shapes and dtypes may be enough to determine graph structure.

Trace-Time vs Run-Time

A traced program has two phases:

trace time
run time

Trace time executes the host-language function to build an IR. Run time executes the captured IR.

This distinction is central.

Code that runs at trace time may not run again when the compiled trace runs. Side effects at trace time can therefore be surprising.

Example:

def f(x):
    print("running")
    return x * x

The print may occur while tracing, but not on later executions of the cached compiled trace.

This means traced code should separate numerical computation from host-language side effects.
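The two phases can be made concrete with a small sketch. Here a hypothetical `trace` helper runs the function once on symbolic inputs, records its multiplications, and returns a replay function; the print fires only during the single tracing execution:

```python
# Sketch of trace time vs run time: the host function runs once to
# build a trace, and later calls replay the recorded operations only.

def trace(fn, n_inputs=1):
    """Run fn once on symbolic inputs, recording primitive ops."""
    ops = []
    class T:
        def __init__(self, idx): self.idx = idx
        def __mul__(self, other):
            ops.append(("mul", self.idx, other.idx))
            return T(len(ops) + n_inputs - 1)
    args = [T(i) for i in range(n_inputs)]
    out = fn(*args)
    def replay(*vals):
        env = list(vals)
        for (op, a, b) in ops:
            env.append(env[a] * env[b])
        return env[out.idx]
    return replay

def f(x):
    print("running")          # trace-time side effect
    return x * x

g = trace(f)                  # prints "running" once, while tracing
print(g(3.0))                 # replays the trace; no "running"
print(g(4.0))                 # same: the side effect is gone
```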

Trace Specialization

Traces are often specialized to input properties.

Common specialization keys include:

Specialized property   Reason
dtype                  Determines primitive kernels and derivative rules
shape                  Determines tensor layouts and reductions
rank                   Determines broadcasting and contraction structure
static arguments       May control branches or loop counts
device                 Affects lowering and execution
layout                 Affects memory planning

If a new call has different specialized properties, the system may retrace.

For example, a function traced for tensor<f32, [32, 128]> may require a new trace for tensor<f32, [64, 128]> unless the system supports shape polymorphism.

Control Flow Under Tracing

Ordinary host-language control flow executes during tracing, so it can be used only when its conditions are values known at trace time.

def f(x, flag):
    if flag:
        return x * x
    return -x

If flag is a static argument, tracing selects one branch and records it.

But if a branch depends on a traced tensor value, the host language usually cannot decide which branch to take.

def f(x):
    if sum(x) > 0:
        return x * x
    return -x

Here sum(x) > 0 may be a traced value. The host-language if expects a concrete boolean, not a symbolic condition.

A tracing system handles this by requiring graph-level control flow:

y = cond(pred, then_fn, else_fn, x)

or:

y = while_loop(cond_fn, body_fn, init)

These constructs record control flow inside the trace instead of using the host language to choose during tracing.
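As a semantic sketch, a `cond` construct takes the predicate and both branch functions as data. The version below simply evaluates eagerly; in a real tracing system it would instead trace both branches into subgraphs and record a single cond node (the name and signature are illustrative):

```python
# Sketch of a graph-level cond: both branches are passed as functions,
# so a tracing system can capture each as a subgraph and defer the
# choice to run time. Here it is executed eagerly for clarity.

def cond(pred, then_fn, else_fn, x):
    return then_fn(x) if pred else else_fn(x)

def f(x):
    return cond(sum(x) > 0,
                lambda x: [v * v for v in x],
                lambda x: [-v for v in x],
                x)

print(f([1.0, 2.0]))     # predicate true: squares each element
print(f([-3.0, 1.0]))    # predicate false: negates each element
```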

Loops

Loops follow the same rule.

A loop with a static count can be unrolled during tracing:

def f(x):
    for i in range(3):
        x = x * x
    return x

The trace becomes:

x1 = mul x0 x0
x2 = mul x1 x1
x3 = mul x2 x2
return x3

A loop with a dynamic count needs a trace-level loop primitive:

while_loop(cond, body, state)

Otherwise, the trace is tied to the particular number of iterations observed during tracing.

Unrolling can be efficient for small loops, but it can create huge traces for large loops. Loop primitives keep the trace compact and preserve structure for reverse-mode differentiation and checkpointing.
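A functional while_loop primitive can be sketched as follows. The loop state flows through `body_fn` while `cond_fn` holds; because the iteration count depends on the data, unrolling at trace time would bake in whatever count the tracing example happened to produce (names are illustrative, and the primitive is executed eagerly here):

```python
# Sketch of a trace-level loop primitive: the loop is one node holding
# (cond_fn, body_fn, state), not an unrolled sequence of operations.

def while_loop(cond_fn, body_fn, state):
    while cond_fn(state):
        state = body_fn(state)
    return state

# Square x until it exceeds a threshold; the number of iterations is
# data-dependent, so a fixed unrolled trace could not represent it.
def f(x):
    return while_loop(lambda s: s < 100.0,
                      lambda s: s * s,
                      x)

print(f(2.0))    # 2 -> 4 -> 16 -> 256
```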

Tracing and Reverse Mode

Tracing is often used to build the graph needed by reverse mode.

During the forward execution, the system records primitive operations and their dependencies:

x0 = input
x1 = mul x0 x0
x2 = sin x1
return x2

The AD transform then builds a backward trace:

bar_x2 = seed
bar_x1 += cos(x1) * bar_x2
bar_x0 += x0 * bar_x1
bar_x0 += x0 * bar_x1
return bar_x0

Note that bar_x0 accumulates two contributions, one per operand, because x0 appears as both operands of mul.

For eager dynamic systems, the recorded trace may be consumed immediately by backward. For staged systems, the trace may be transformed into a reusable compiled gradient function.
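A minimal version of this backward walk can be written directly against the recorded tape. The tape format and names below are illustrative; adjoints are accumulated in reverse operation order, with mul contributing to both operands:

```python
# Minimal reverse-mode sketch: record a forward tape, then walk it
# backward accumulating adjoints (bar values).
import math

def f_trace(x0):
    # forward pass, recording operations and intermediate values
    x1 = x0 * x0
    x2 = math.sin(x1)
    tape = [("mul", "x1", "x0", "x0"), ("sin", "x2", "x1")]
    vals = {"x0": x0, "x1": x1, "x2": x2}
    return x2, tape, vals

def grad_f(x0):
    out, tape, vals = f_trace(x0)
    bar = {k: 0.0 for k in vals}
    bar["x2"] = 1.0                           # seed
    for op, out_id, *ins in reversed(tape):
        if op == "sin":
            (a,) = ins
            bar[a] += math.cos(vals[a]) * bar[out_id]
        elif op == "mul":
            a, b = ins
            bar[a] += vals[b] * bar[out_id]
            bar[b] += vals[a] * bar[out_id]   # two contributions when a == b
    return bar["x0"]

# d/dx sin(x^2) = 2x cos(x^2)
x = 1.5
print(grad_f(x), 2 * x * math.cos(x * x))
```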

Tracing and Forward Mode

Forward mode can also use tracing.

The trace may first capture the primal computation. Then a JVP transformation produces a new trace that computes primal and tangent values together.

primal trace
    -> JVP transform
    -> primal-plus-tangent trace

This is useful when the system wants one uniform transformation pipeline for both forward and reverse mode.
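The effect of the JVP transform can be sketched with dual numbers: the transformed computation carries a (primal, tangent) pair through every primitive. The class and function names are illustrative:

```python
# Sketch of a JVP: each primitive propagates a primal value and a
# tangent value together.
import math

class Dual:
    def __init__(self, primal, tangent):
        self.primal, self.tangent = primal, tangent
    def __mul__(self, other):
        # product rule for the tangent component
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

def sin(d):
    return Dual(math.sin(d.primal), math.cos(d.primal) * d.tangent)

def jvp(f, x, v):
    out = f(Dual(x, v))
    return out.primal, out.tangent

def f(x):
    return sin(x * x)

# JVP of sin(x^2) at x = 1.5 with tangent v = 1.0 gives 2x cos(x^2)
p, t = jvp(f, 1.5, 1.0)
print(p, t)
```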

Purity Requirements

Tracing works best for pure numerical functions.

A pure function has output determined only by its input values and has no observable side effects.

Tracing becomes fragile when code performs:

printing
file I/O
network calls
mutation of global variables
random number generation
object identity checks
data-dependent Python list mutation
host-language exceptions

Some systems allow these behaviors, but their trace-time behavior must be specified. Otherwise, a function may behave differently when traced, compiled, cached, and replayed.

A common design is to restrict traced regions to array computation and require effects to be explicit primitives.

Randomness

Randomness requires special care.

If random numbers are drawn at trace time, the trace may capture a fixed random value. That is usually wrong.

A traceable random system represents randomness explicitly:

key1, key2 = split(key)
x = normal(key1, shape)

The random key becomes an input and output of the computation. This makes randomness reproducible, differentiable where appropriate, and visible to the compiler.

For reverse mode, random values used in the forward pass must either be saved or regenerated from the same key.
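The key-splitting style can be sketched in pure Python. The hash-based scheme below is only an illustration of the interface, not a production PRNG:

```python
# Sketch of explicit random keys: split produces independent subkeys,
# and every draw is a deterministic function of its key.
import hashlib
import struct

def split(key):
    h = lambda tag: hashlib.sha256(key + tag).digest()[:8]
    return h(b"L"), h(b"R")

def uniform(key):
    # derive a float in [0, 1) deterministically from the key
    (n,) = struct.unpack(">Q", hashlib.sha256(key).digest()[:8])
    return n / 2**64

key = b"seed-0"
key1, key2 = split(key)
x = uniform(key1)
y = uniform(key2)
print(x, y)

# Reproducible: the same key always yields the same draws.
assert (x, y) == (uniform(split(b"seed-0")[0]), uniform(split(b"seed-0")[1]))
```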

Shape Polymorphism

Basic tracing specializes to exact shapes. This can cause retracing:

trace for [32, 128]
trace for [64, 128]
trace for [128, 128]

Shape polymorphism allows dimensions to be symbolic:

tensor<f32, [B, 128]>

The same trace can then apply to multiple batch sizes.

Shape polymorphism improves reuse but complicates compilation. The compiler must reason about symbolic dimensions, dynamic allocation, and shape-dependent control flow.

Caching Traces

Tracing has overhead, so systems usually cache traces.

A cache key may include:

function identity
input dtypes
input shapes
static argument values
device
compiler options

On a cache hit, the system skips tracing and reuses the compiled representation.

This is why changing a static argument may trigger recompilation, while changing only array contents usually does not.
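The cache behavior can be sketched by keying a dictionary on the specialized properties. The key structure here is illustrative, and the "trace" is stood in for by a counter:

```python
# Sketch of trace caching: a cache keyed on function identity, shape,
# dtype, and static arguments. Array contents are deliberately absent
# from the key, so changing data alone never retraces.

TRACE_CACHE = {}
TRACE_COUNT = 0

def traced_call(fn, x, *, static_flag):
    global TRACE_COUNT
    key = (id(fn), tuple(x["shape"]), x["dtype"], static_flag)
    if key not in TRACE_CACHE:
        TRACE_COUNT += 1                   # cache miss: trace and compile
        TRACE_CACHE[key] = f"trace#{TRACE_COUNT}"
    return TRACE_CACHE[key]

def f(x): ...

a = {"shape": (32, 128), "dtype": "f32"}
b = {"shape": (64, 128), "dtype": "f32"}

traced_call(f, a, static_flag=True)    # miss: traces
traced_call(f, a, static_flag=True)    # hit: same shape, dtype, statics
traced_call(f, b, static_flag=True)    # miss: new shape retraces
traced_call(f, a, static_flag=False)   # miss: new static argument retraces
print(TRACE_COUNT)                     # 3
```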

Nested Tracing

AD systems often support nested transformations.

For example:

grad(jit(f))
jit(grad(f))
vmap(grad(f))
grad(grad(f))

Each transformation may introduce its own tracing level.

Nested tracing must keep these levels distinct. A tracer created for one transformation should not be confused with a tracer from another transformation. Otherwise, the system can produce incorrect derivatives or leak symbolic values into the host language.

This is related to perturbation confusion in forward mode. Correct systems tag derivative levels and trace levels explicitly.

Tracing vs Taping

Tracing and taping are closely related but have different emphasis.

Term    Emphasis
Tape    Runtime record used mainly for reverse-mode replay
Trace   Captured program representation used for transformation or compilation
Graph   Data dependency structure
IR      Compiler representation with formal semantics

A tape may be ephemeral and consumed by one backward pass. A trace may be persistent, optimized, cached, and compiled.

In practice, the boundary is fluid. Many systems use a tape as a dynamic graph IR.

Failure Modes

Tracing systems have characteristic failure modes.

A value may be needed by the host language as a concrete value, but during tracing it is symbolic.

Example:

def f(x):
    return range(int(x))

If x is traced, int(x) cannot be computed at trace time unless x is static.

Another failure mode is accidental specialization. A branch, loop bound, or data structure size may be fixed by the tracing example, producing a trace that works only for that case.

A third failure mode is hidden side effects. A mutation performed during tracing may not occur during later executions of the compiled trace.

These failures are not incidental. They follow from the two-stage nature of tracing.

Design Criteria

A tracing system for AD should specify:

Which operations are traceable?
Which values are static?
Which values are symbolic?
Which host-language effects are allowed?
How are control flow and loops represented?
What causes retracing?
How are traces cached?
How are nested traces separated?
How are custom primitives registered?
How are errors reported when symbolic values escape?

Clear answers are essential. Without them, users face behavior that appears arbitrary.

Summary

Tracing records an execution of a program into a graph, tape, or IR.

It gives AD systems a practical path between dynamic host-language programming and compiler-based optimization. The user writes ordinary code, while the system captures numerical operations for differentiation and compilation.

Tracing works best when numerical computation is pure, effects are explicit, shapes are understood, and control flow is represented at the trace level when it depends on symbolic values. Its main tradeoff is specialization: a trace represents a class of executions, not necessarily every possible execution of the original program.