# Tracing Systems

Tracing is an implementation strategy where an AD system observes a program while it runs and records the operations that occur.

The result of tracing is a trace: a simplified program, tape, graph, or intermediate representation that represents one execution of the original program.

A normal function may look like this:

```python
from math import sin

def f(x):
    y = x * x
    z = sin(y)
    return z
```

When executed with trace-aware values, the system records:

```text
x0 = input
x1 = mul x0 x0
x2 = sin x1
return x2
```

The trace can then be differentiated, optimized, compiled, cached, or replayed.

### Why Tracing Exists

Source transformation needs access to source code or compiler IR. Operator overloading can compute derivatives directly but often lacks whole-program structure.

Tracing sits between them.

It lets the user write ordinary host-language code, then captures the numerical operations that actually execute. This is especially useful in dynamic languages, where static analysis of arbitrary source code is hard.

A tracing system can turn flexible user code into a more regular representation:

```text
host-language program
    -> traced execution
    -> graph or IR
    -> AD transform
    -> optimization
    -> execution
```

The host language remains convenient, while the traced graph gives the AD system something compiler-like to work with.

### Tracer Values

Tracing usually works by replacing ordinary values with tracer values.

A tracer value carries an abstract identity rather than only a concrete number.

```text
Tracer(id=x0, shape=(), dtype=f64)
```

When the program applies an operation to tracer values, the operation does not merely compute a result. It records a node in the trace.

```python
x = Tracer("x0")
y = x * x
```

may record:

```text
x1 = mul x0 x0
```

and return:

```text
Tracer(id=x1)
```

Thus the program keeps running, but every primitive operation extends the trace.
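This mechanism can be sketched with a toy tracer. The instruction format and the `fresh`/`TRACE` names below are hypothetical, and a real system would also carry shapes, dtypes, and trace levels:

```python
# Toy tracer sketch: operations record nodes instead of computing numbers.
TRACE = []        # recorded instructions, in execution order
_counter = [0]    # state for generating fresh variable names

def fresh():
    _counter[0] += 1
    return f"x{_counter[0]}"

class Tracer:
    def __init__(self, name):
        self.name = name

    def __mul__(self, other):
        # record a node in the trace and return a new tracer for its output
        out = Tracer(fresh())
        TRACE.append((out.name, "mul", self.name, other.name))
        return out

x = Tracer("x0")
y = x * x
print(TRACE)   # [('x1', 'mul', 'x0', 'x0')]
print(y.name)  # x1
```

Operator overloading does the recording: Python dispatches `x * x` to `Tracer.__mul__`, which appends a node and hands back a fresh tracer so that subsequent operations chain.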

### Concrete and Abstract Tracing

Tracing systems differ in what they know during tracing.

Concrete tracing records operations while using actual runtime values. The trace may depend on those values.

```python
def f(x):
    if x > 0:
        return x * x
    else:
        return -x
```

If traced with `x = 3`, the trace contains only the positive branch:

```text
x1 = mul x0 x0
return x1
```

If traced with `x = -3`, the trace contains the negative branch:

```text
x1 = neg x0
return x1
```
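Concrete tracing can be sketched by letting the tracer carry a real value alongside its name (all names below are hypothetical). The host `if` works because comparisons return concrete booleans, but only the branch actually taken is recorded:

```python
# Concrete-tracing sketch: the tracer holds a runtime value, so host
# control flow can decide, but the trace captures only the taken branch.
def trace_concrete(fn, x0):
    trace = []

    class CTracer:
        def __init__(self, name, value):
            self.name, self.value = name, value
        def __mul__(self, other):
            trace.append(("mul", self.name, other.name))
            return CTracer("t", self.value * other.value)
        def __neg__(self):
            trace.append(("neg", self.name))
            return CTracer("t", -self.value)
        def __gt__(self, other):
            return self.value > other   # concrete bool decides the branch

    fn(CTracer("x0", x0))
    return trace

def f(x):
    if x > 0:
        return x * x
    return -x

print(trace_concrete(f, 3.0))    # [('mul', 'x0', 'x0')]
print(trace_concrete(f, -3.0))   # [('neg', 'x0')]
```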

Abstract tracing records operations using abstract values, such as shape and dtype, rather than exact numeric values. It can build a more general trace, but only when control flow is expressed in traceable form.

For example, tensor operations can be traced abstractly because their shapes and dtypes may be enough to determine graph structure.

### Trace-Time vs Run-Time

A traced program has two phases:

```text
trace time
run time
```

Trace time executes the host-language function to build an IR. Run time executes the captured IR.

This distinction is central.

Code that runs at trace time may not run again when the compiled trace runs. Side effects at trace time can therefore be surprising.

Example:

```python
def f(x):
    print("running")
    return x * x
```

The print may occur while tracing, but not on later executions of the cached compiled trace.

This means traced code should separate numerical computation from host-language side effects.

### Trace Specialization

Traces are often specialized to input properties.

Common specialization keys include:

| Specialized property | Reason |
|---|---|
| dtype | Determines primitive kernels and derivative rules |
| shape | Determines tensor layouts and reductions |
| rank | Determines broadcasting and contraction structure |
| static arguments | May control branches or loop counts |
| device | Affects lowering and execution |
| layout | Affects memory planning |

If a new call has different specialized properties, the system may retrace.

For example, a function traced for `tensor<f32, [32, 128]>` may require a new trace for `tensor<f32, [64, 128]>` unless the system supports shape polymorphism.

### Control Flow Under Tracing

Ordinary host-language control flow can be resolved during tracing only when its conditions are concrete at trace time.

```python
def f(x, flag):
    if flag:
        return x * x
    return -x
```

If `flag` is a static argument, tracing selects one branch and records it.

But if a branch depends on a traced tensor value, the host language usually cannot decide which branch to take.

```python
def f(x):
    if sum(x) > 0:
        return x * x
    return -x
```

Here `sum(x) > 0` may be a traced value. The host-language `if` expects a concrete boolean, not a symbolic condition.

A tracing system handles this by requiring graph-level control flow:

```text
y = cond(pred, then_fn, else_fn, x)
```

or:

```text
y = while_loop(cond_fn, body_fn, init)
```

These constructs record control flow inside the trace instead of using the host language to choose during tracing.
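One way to sketch such a `cond` (the API and trace format here are entirely hypothetical): trace each branch function into its own subtrace, then record a single node that carries both, deferring the choice to run time:

```python
# Hypothetical sketch of a graph-level cond: both branches are traced,
# and one node referencing the two subtraces is recorded.
def trace_branch(fn, arg_name):
    sub = []
    counter = [0]

    class T:
        def __init__(self, name):
            self.name = name
        def _emit(self, op, *operands):
            counter[0] += 1
            out = T(f"t{counter[0]}")
            sub.append((out.name, op, *operands))
            return out
        def __mul__(self, other):
            return self._emit("mul", self.name, other.name)
        def __neg__(self):
            return self._emit("neg", self.name)

    fn(T(arg_name))
    return sub

def cond(pred_name, then_fn, else_fn, arg_name, trace):
    trace.append(("cond", pred_name,
                  trace_branch(then_fn, arg_name),
                  trace_branch(else_fn, arg_name)))

TRACE = []
cond("p0", lambda x: x * x, lambda x: -x, "x0", TRACE)
print(TRACE)
# [('cond', 'p0', [('t1', 'mul', 'x0', 'x0')], [('t1', 'neg', 'x0')])]
```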

### Loops

Loops follow the same rule.

A loop with a static count can be unrolled during tracing:

```python
def f(x):
    for i in range(3):
        x = x * x
    return x
```

The trace becomes:

```text
x1 = mul x0 x0
x2 = mul x1 x1
x3 = mul x2 x2
return x3
```

A loop with a dynamic count needs a trace-level loop primitive:

```text
while_loop(cond, body, state)
```

Otherwise, the trace is tied to the particular number of iterations observed during tracing.

Unrolling can be efficient for small loops, but it can create huge traces for large loops. Loop primitives keep the trace compact and preserve structure for reverse-mode differentiation and checkpointing.
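The size difference can be sketched directly (the trace format is hypothetical): an unrolled loop emits one node per iteration, while a loop primitive emits one node containing the traced body, regardless of the iteration count:

```python
# Sketch contrasting unrolling with a trace-level loop primitive.
def unrolled_trace(n):
    # one 'mul' node per iteration: trace size grows linearly with n
    return [(f"x{i+1}", "mul", f"x{i}", f"x{i}") for i in range(n)]

def loop_primitive_trace():
    # a single node carrying the traced body, independent of n
    body = [("s1", "mul", "s0", "s0")]
    return [("while_loop", "cond_fn", body, "x0")]

print(len(unrolled_trace(3)))       # 3
print(len(unrolled_trace(10000)))   # 10000
print(len(loop_primitive_trace()))  # 1
```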

### Tracing and Reverse Mode

Tracing is often used to build the graph needed by reverse mode.

During the forward execution, the system records primitive operations and their dependencies:

```text
x0 = input
x1 = mul x0 x0
x2 = sin x1
return x2
```

The AD transform then builds a backward trace:

```text
bar_x2 = seed
bar_x1 += cos(x1) * bar_x2
bar_x0 += x0 * bar_x1
bar_x0 += x0 * bar_x1
return bar_x0
```

The adjoint of `x0` accumulates twice because `x1 = mul x0 x0` uses `x0` as both operands; the repeated line is intentional.

For eager dynamic systems, the recorded trace may be consumed immediately by `backward`. For staged systems, the trace may be transformed into a reusable compiled gradient function.
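A minimal backward interpreter over such a trace might look like this. The instruction format is the hypothetical one used above, and the primal values are assumed to have been saved during the forward pass:

```python
# Sketch: replay a recorded trace in reverse, accumulating adjoints.
import math

def backward(trace, values, seed_var, input_var):
    # values: primal values saved during the forward pass
    bar = {seed_var: 1.0}
    for out, op, *args in reversed(trace):
        g = bar.get(out, 0.0)
        if op == "mul":      # out = a * b: bar_a += b*g, bar_b += a*g
            a, b = args
            bar[a] = bar.get(a, 0.0) + values[b] * g
            bar[b] = bar.get(b, 0.0) + values[a] * g
        elif op == "sin":    # out = sin(a): bar_a += cos(a)*g
            (a,) = args
            bar[a] = bar.get(a, 0.0) + math.cos(values[a]) * g
    return bar[input_var]

# Forward trace of f(x) = sin(x*x) evaluated at x0 = 2.0:
trace = [("x1", "mul", "x0", "x0"), ("x2", "sin", "x1")]
values = {"x0": 2.0, "x1": 4.0, "x2": math.sin(4.0)}
grad = backward(trace, values, "x2", "x0")
print(grad)   # 2*x*cos(x^2) at x=2, i.e. 4*cos(4)
```

Note how the `mul` rule naturally accumulates into `bar["x0"]` twice when both operands are `x0`, reproducing the repeated line in the backward trace above.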

### Tracing and Forward Mode

Forward mode can also use tracing.

The trace may first capture the primal computation. Then a JVP transformation produces a new trace that computes primal and tangent values together.

```text
primal trace
    -> JVP transform
    -> primal-plus-tangent trace
```

This is useful when the system wants one uniform transformation pipeline for both forward and reverse mode.
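The effect of the JVP transform can be sketched without a trace at all, using dual numbers as a stand-in: each value carries a (primal, tangent) pair, and every primitive applies its tangent rule alongside the primal computation:

```python
# Dual-number sketch of the primal-plus-tangent pairing a JVP produces.
import math

class Dual:
    def __init__(self, primal, tangent):
        self.primal, self.tangent = primal, tangent

    def __mul__(self, other):
        # product rule for tangents
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

def sin(d):
    # sin primitive with its tangent rule
    return Dual(math.sin(d.primal), math.cos(d.primal) * d.tangent)

def f(x):
    return sin(x * x)

out = f(Dual(2.0, 1.0))   # tangent 1.0 seeds d/dx
print(out.tangent)        # 2*x*cos(x^2) at x=2, i.e. 4*cos(4)
```

A trace-based JVP performs the same pairing, but as a transformation on the captured IR rather than eagerly on values.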

### Purity Requirements

Tracing works best for pure numerical functions.

A pure function has output determined only by its input values and has no observable side effects.

Tracing becomes fragile when code performs:

```text
printing
file I/O
network calls
mutation of global variables
random number generation
object identity checks
data-dependent Python list mutation
host-language exceptions
```

Some systems allow these behaviors, but their trace-time behavior must be specified. Otherwise, a function may behave differently when traced, compiled, cached, and replayed.

A common design is to restrict traced regions to array computation and require effects to be explicit primitives.

### Randomness

Randomness requires special care.

If random numbers are drawn at trace time, the trace may capture a fixed random value. That is usually wrong.

A traceable random system represents randomness explicitly:

```text
key1, key2 = split(key)
x = normal(key1, shape)
```

The random key becomes an input and output of the computation. This makes randomness reproducible, differentiable where appropriate, and visible to the compiler.

For reverse mode, random values used in the forward pass must either be saved or regenerated from the same key.
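The key discipline can be sketched as follows. Real systems typically use counter-based PRNGs such as Threefry; the hash-based scheme below is only illustrative, and `split`/`uniform` are hypothetical names:

```python
# Sketch of explicit, splittable random keys (illustrative hash-based scheme).
import hashlib
import struct

def split(key, n=2):
    # derive n independent subkeys deterministically from a parent key
    return tuple(hashlib.sha256(key + bytes([i])).digest() for i in range(n))

def uniform(key):
    # derive a float in [0, 1) deterministically from a key
    (bits,) = struct.unpack(">Q", hashlib.sha256(key).digest()[:8])
    return bits / 2**64

key = b"seed"
key1, key2 = split(key)
print(uniform(key1) == uniform(key1))   # True: same key, same number
print(uniform(key1) == uniform(key2))   # False: split keys are independent
```

Because the key is an ordinary input, the trace never captures a fixed sample, and replaying the compiled trace with the same key reproduces the same randomness.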

### Shape Polymorphism

Basic tracing specializes to exact shapes. This can cause retracing:

```text
trace for [32, 128]
trace for [64, 128]
trace for [128, 128]
```

Shape polymorphism allows dimensions to be symbolic:

```text
tensor<f32, [B, 128]>
```

The same trace can then apply to multiple batch sizes.

Shape polymorphism improves reuse but complicates compilation. The compiler must reason about symbolic dimensions, dynamic allocation, and shape-dependent control flow.

### Caching Traces

Tracing has overhead, so systems usually cache traces.

A cache key may include:

```text
function identity
input dtypes
input shapes
static argument values
device
compiler options
```

On a cache hit, the system skips tracing and reuses the compiled representation.

This is why changing a static argument may trigger recompilation, while changing only array contents usually does not.
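A trace cache can be sketched as a dictionary keyed on exactly these properties. The `jit_call` name and key layout below are hypothetical, and the counter stands in for the cost of tracing and compilation:

```python
# Sketch of a trace cache keyed on function, shapes, and static arguments.
TRACE_COUNT = [0]

def trace(fn, shapes):
    TRACE_COUNT[0] += 1   # stand-in for expensive tracing + compilation
    return ("compiled", fn.__name__, shapes)

_cache = {}

def jit_call(fn, shapes, static_args=()):
    key = (fn, shapes, static_args)
    if key not in _cache:
        _cache[key] = trace(fn, shapes)
    return _cache[key]

def f(x):
    return x

jit_call(f, ((32, 128),))
jit_call(f, ((32, 128),))   # cache hit: no retrace
jit_call(f, ((64, 128),))   # new shape: retrace
print(TRACE_COUNT[0])       # 2
```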

### Nested Tracing

AD systems often support nested transformations.

For example:

```text
grad(jit(f))
jit(grad(f))
vmap(grad(f))
grad(grad(f))
```

Each transformation may introduce its own tracing level.

Nested tracing must keep these levels distinct. A tracer created for one transformation should not be confused with a tracer from another transformation. Otherwise, the system can produce incorrect derivatives or leak symbolic values into the host language.

This is related to perturbation confusion in forward mode. Correct systems tag derivative levels and trace levels explicitly.
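The tagging idea can be sketched minimally: each tracer carries the level of the transformation that created it, and an operation involving tracers from different levels is dispatched to the innermost level. The names here are hypothetical:

```python
# Sketch of level-tagged tracers: mixed-level ops dispatch to the
# innermost (highest-numbered) transformation level.
class Tracer:
    def __init__(self, name, level):
        self.name, self.level = name, level

def bind(op, a, b):
    # the innermost level handles the op; the outer tracer is treated
    # as a constant at that level
    level = max(a.level, b.level)
    return (op, level)

outer = Tracer("g", level=1)   # e.g. created by an outer grad
inner = Tracer("v", level=2)   # e.g. created by an inner vmap
print(bind("mul", outer, inner))   # ('mul', 2)
```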

### Tracing vs Taping

Tracing and taping are closely related but have different emphasis.

| Term | Emphasis |
|---|---|
| Tape | Runtime record used mainly for reverse-mode replay |
| Trace | Captured program representation used for transformation or compilation |
| Graph | Data dependency structure |
| IR | Compiler representation with formal semantics |

A tape may be ephemeral and consumed by one backward pass. A trace may be persistent, optimized, cached, and compiled.

In practice, the boundary is fluid. Many systems use a tape as a dynamic graph IR.

### Failure Modes

Tracing systems have characteristic failure modes.

A value may be needed by the host language as a concrete value, but during tracing it is symbolic.

Example:

```python
def f(x):
    return range(int(x))
```

If `x` is traced, `int(x)` cannot be computed at trace time unless `x` is static.
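A well-designed tracer makes this failure explicit rather than silent. A sketch, assuming the error message and remedy wording are the system's own choice:

```python
# Sketch: the tracer refuses concretization with an actionable error.
class Tracer:
    def __int__(self):
        raise TypeError("cannot convert a traced value to a concrete int; "
                        "mark the argument as static instead")

def f(x):
    return range(int(x))   # range needs a concrete integer at trace time

try:
    f(Tracer())
except TypeError as e:
    print(e)
```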

Another failure mode is accidental specialization. A branch, loop bound, or data structure size may be fixed by the tracing example, producing a trace that works only for that case.

A third failure mode is hidden side effects. A mutation performed during tracing may not occur during later executions of the compiled trace.

These failures are not incidental. They follow from the two-stage nature of tracing.

### Design Criteria

A tracing system for AD should specify:

```text
Which operations are traceable?
Which values are static?
Which values are symbolic?
Which host-language effects are allowed?
How are control flow and loops represented?
What causes retracing?
How are traces cached?
How are nested traces separated?
How are custom primitives registered?
How are errors reported when symbolic values escape?
```

Clear answers are essential. Without them, users face behavior that appears arbitrary.

### Summary

Tracing records an execution of a program into a graph, tape, or IR.

It gives AD systems a practical path between dynamic host-language programming and compiler-based optimization. The user writes ordinary code, while the system captures numerical operations for differentiation and compilation.

Tracing works best when numerical computation is pure, effects are explicit, shapes are understood, and control flow is represented at the trace level when it depends on symbolic values. Its main tradeoff is specialization: a trace represents a class of executions, not necessarily every possible execution of the original program.

