Purity and Side Effects

A pure computation is easier to differentiate because every output is determined only by its explicit inputs. There is no hidden state, no external mutation, and no dependence on execution history outside the function call.

For example:

y = f(x)

is pure when the same x always produces the same y, and the call changes nothing outside itself.

An impure computation may read or write external state:

counter = counter + 1
y = f(x, counter)

Now the output depends not only on x, but also on the previous value of counter. The function call also changes the state seen by later calls.

Automatic differentiation can handle many stateful computations, but purity gives the clean base model.

Pure Functions

A pure function has two main properties:

| Property | Meaning |
| --- | --- |
| Determinism | Same inputs produce the same outputs |
| No side effects | Evaluation changes no external state |

A pure function can be treated as a mathematical map:

f : X \to Y.

AD can then construct another map:

Df : X \to L(X, Y),

where Df(x) is the derivative of f at x.

In forward mode, the transformed function maps primal and tangent inputs to primal and tangent outputs.

In reverse mode, the transformed function maps primal inputs to outputs plus a backward function.
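The forward-mode transformation can be sketched with a toy dual-number class (the `Dual` type and `mul` helper below are illustrative, not from the text): each value carries a primal component and a tangent component, and every operation propagates both.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    primal: float   # value of the variable
    tangent: float  # derivative with respect to the seeded input

def mul(a: Dual, b: Dual) -> Dual:
    # Product rule: (ab)' = a'b + ab'
    return Dual(a.primal * b.primal,
                a.tangent * b.primal + a.primal * b.tangent)

def f(x: Dual) -> Dual:
    return mul(x, x)  # f(x) = x^2

# Seeding the tangent with 1.0 differentiates with respect to x.
y = f(Dual(3.0, 1.0))
print(y.primal, y.tangent)  # 9.0 6.0
```

Because `f` is pure, calling it on dual inputs is all that is needed: no hidden state has to be captured or replayed.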

Side Effects

A side effect is any observable change outside the returned value.

Common examples:

| Side effect | Example |
| --- | --- |
| Mutation | writing into an array |
| I/O | printing, reading files |
| Network access | calling an API |
| Database access | reading or writing rows |
| Randomness | sampling from RNG state |
| Global state | updating a counter |
| Exceptions | changing control flow abruptly |
| Concurrency | racing with another thread |

Some side effects are harmless for differentiation. Others make the derivative ambiguous or impossible.

Mutation as Explicit State

Mutation can often be made differentiable by turning state into an explicit input and output.

Impure form:

state.value = state.value + x
y = state.value * state.value

Functional form:

state1 = update(state0, x)
y = state1.value * state1.value
return y, state1

Now the computation is a pure function:

(x, s_0) \mapsto (y, s_1).

AD can differentiate this larger function. The derivative describes how the output and final state change with respect to the input and initial state.

This transformation is called state threading.
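A minimal sketch of the functional form, with the state held in a plain dictionary (an illustrative choice, not from the text), and a finite-difference check that the threaded version has a well-defined derivative:

```python
def update(state, x):
    # Pure update: returns a new state instead of mutating in place.
    return {"value": state["value"] + x}

def step(x, state0):
    state1 = update(state0, x)
    y = state1["value"] * state1["value"]
    return y, state1

y, s1 = step(2.0, {"value": 1.0})   # y = (1 + 2)^2 = 9.0

# Finite-difference check of dy/dx = 2 * (s0 + x) = 6 at this input.
h = 1e-6
y_h, _ = step(2.0 + h, {"value": 1.0})
dy_dx = (y_h - y) / h
```

Because `step` maps (x, s0) to (y, s1) with no hidden dependencies, the derivative is an ordinary function of its explicit inputs.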

State Threading

State threading passes state explicitly through the computation.

s1 = op1(x, s0)
s2 = op2(s1)
s3 = op3(s2)
y  = read(s3)

The dependency chain is explicit:

s0 -> s1 -> s2 -> s3 -> y

Reverse mode can then propagate adjoints through state transitions.

If the state is differentiable, it receives adjoints. If part of the state is discrete or external, it is marked non-differentiable.

Non-Differentiable State

Not all state has a meaningful derivative.

Examples include:

| State | Why the derivative is problematic |
| --- | --- |
| File handle | not numeric |
| Socket | external resource |
| Hash table keys | discrete |
| Database row identity | symbolic |
| Random seed | usually discrete |
| Thread lock | synchronization object |

An AD system must separate differentiable state from non-differentiable state.

For example:

record = db.lookup(user_id)
score = model(x, record.weight)

The derivative can flow through record.weight if it is treated as numeric data. It cannot usually flow through user_id or through the database lookup itself.

I/O

Input and output operations are usually outside the differentiable core.

x = read_file("input.txt")
y = f(x)
write_file("output.txt", y)

The differentiable part is:

y = f(x)

The file operations supply and consume values, but they are not ordinarily differentiated.

A practical AD system often treats I/O as a boundary. Values crossing the boundary may be differentiable. The boundary operation itself is not.
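A sketch of the boundary pattern, using a temporary directory and hand-written derivative in place of a real AD system (the filenames and the function `f` here are illustrative assumptions):

```python
import math
import os
import tempfile

def f(x):
    # Differentiable core: a pure numeric function.
    return x * math.exp(x)

def df(x):
    # Hand-written derivative of the core. The file operations
    # below sit outside the core and are never differentiated.
    return (x + 1.0) * math.exp(x)

# Effectful shell: values cross the boundary, the boundary ops do not.
tmp = tempfile.mkdtemp()
inp = os.path.join(tmp, "input.txt")
out = os.path.join(tmp, "output.txt")
with open(inp, "w") as fh:
    fh.write("2.0")

with open(inp) as fh:        # boundary: value enters the program
    x = float(fh.read())
y = f(x)                     # differentiable region
with open(out, "w") as fh:   # boundary: value leaves the program
    fh.write(repr(y))
```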

Printing and Logging

Printing is a side effect, but usually does not affect numerical results.

print(x)
y = x * x

The derivative of y with respect to x remains:

\frac{dy}{dx} = 2x.

The print operation can be ignored by the derivative transform if it does not influence control flow or state used later.

However, if logging changes execution timing, updates counters, or writes values consumed later, it becomes semantically relevant.

Randomness

Randomness is stateful when it uses a mutable random number generator.

r = random()
y = x * r

For one execution, r is a sampled constant. The derivative with respect to x is:

\frac{dy}{dx} = r.

But the program is not deterministic unless the random stream is fixed.

To differentiate stochastic programs reliably, systems often use one of these strategies:

| Strategy | Description |
| --- | --- |
| Save samples | store random values used in the forward pass |
| Save RNG state | restore the exact random stream |
| Stateless RNG | make randomness an explicit input |
| Reparameterization | express samples as differentiable transforms |

A stateless form is:

r, key2 = random(key1)
y = x * r

Now the random key is explicit state.
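A toy stateless `random` in this style (the mixing constants come from a standard LCG; this is a sketch, not a cryptographically or statistically serious generator): the key is an explicit input, and a new key is an explicit output, so no hidden generator state is mutated.

```python
def random(key):
    # Hash the integer key into a uniform sample in [0, 1)
    # and return a new key; nothing outside this call changes.
    mixed = (key * 6364136223846793005 + 1442695040888963407) % 2**64
    r = (mixed >> 11) / float(2**53)
    return r, mixed  # (sample, next key)

key1 = 42
r, key2 = random(key1)
y = 3.0 * r   # for this fixed key, dy/dx at x = 3.0 is just r

# Same key, same sample: the program is fully deterministic.
r_again, _ = random(key1)
```

This is the design used by splittable-key RNGs: replaying the forward pass only requires replaying the keys.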

Reparameterized Randomness

Some random computations can be expressed as deterministic functions of parameters and noise.

For a normal random variable:

z = \mu + \sigma \epsilon, \qquad \epsilon \sim N(0, 1).

For a fixed noise sample ε, derivatives with respect to μ and σ are ordinary derivatives:

\frac{\partial z}{\partial \mu} = 1, \qquad \frac{\partial z}{\partial \sigma} = \epsilon.

This is widely used in variational inference and differentiable probabilistic models.
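A small numeric check of the reparameterization trick (the function name `sample_normal` is illustrative): once the noise sample is fixed, finite differences recover exactly the partial derivatives above.

```python
import random

def sample_normal(mu, sigma, eps):
    # Reparameterization: z is a deterministic function of (mu, sigma)
    # once the noise sample eps is drawn outside the function.
    return mu + sigma * eps

eps = random.Random(0).gauss(0.0, 1.0)   # fixed noise sample
mu, sigma = 1.0, 2.0
z = sample_normal(mu, sigma, eps)

# Finite differences: dz/dmu = 1 and dz/dsigma = eps.
h = 1e-6
dz_dmu = (sample_normal(mu + h, sigma, eps) - z) / h
dz_dsigma = (sample_normal(mu, sigma + h, eps) - z) / h
```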

Exceptions

Exceptions are control flow with abrupt exit.

if x <= 0:
    raise Error
y = log(x)

The differentiable function is defined only on the successful execution region:

x > 0.

Inside that region:

\frac{dy}{dx} = \frac{1}{x}.

Outside the region, the program has no numeric output, so the derivative is undefined.

An AD system may propagate errors, mask invalid paths, or require the user to define behavior for invalid regions.
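The domain restriction can be made concrete by giving the derivative the same guard as the primal function (a sketch of the "propagate errors" option; the error messages are illustrative):

```python
import math

def f(x):
    # Raising restricts the function to the region x > 0.
    if x <= 0:
        raise ValueError("f is undefined for x <= 0")
    return math.log(x)

def df(x):
    # The derivative inherits the same domain restriction.
    if x <= 0:
        raise ValueError("df is undefined for x <= 0")
    return 1.0 / x

y = f(2.0)        # defined: x is in the valid region
slope = df(2.0)   # 1/2
try:
    f(-1.0)
    raised = False
except ValueError:
    raised = True  # outside the region there is no numeric output
```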

Global State

Global variables create hidden inputs and outputs.

scale = 2

def f(x):
    return scale * x

The function appears to take one input, but it also depends on scale.

If scale changes, the function changes.

A pure representation would pass scale explicitly:

def f(x, scale):
    return scale * x

This makes the derivative relation clear:

\frac{\partial f}{\partial x} = \text{scale}, \qquad \frac{\partial f}{\partial \text{scale}} = x.
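With `scale` passed explicitly, both partials can be checked numerically (a quick finite-difference sketch at the illustrative point x = 3, scale = 2):

```python
def f(x, scale):
    # scale is an explicit argument, so both partials are visible to AD.
    return scale * x

x, scale = 3.0, 2.0
h = 1e-6
df_dx = (f(x + h, scale) - f(x, scale)) / h       # approx scale = 2
df_dscale = (f(x, scale + h) - f(x, scale)) / h   # approx x = 3
```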

Concurrency

Concurrent programs may have nondeterministic execution order.

thread 1: x = x + a
thread 2: x = x + b

If operations race, the final result may depend on scheduling. The derivative may also depend on scheduling.

AD systems generally require deterministic computation for reproducible gradients. Parallel numerical kernels are acceptable when they preserve well-defined semantics, but uncontrolled races make differentiation unreliable.

Even deterministic parallel reductions may have small floating point differences because addition order changes rounding.
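The rounding effect is easy to demonstrate: reassociating a sum, as a parallel reduction tree might, changes the result in the last bit.

```python
a = (0.1 + 0.2) + 0.3   # one reduction order
b = 0.1 + (0.2 + 0.3)   # another order, as a parallel tree might produce
# Same mathematical sum, different rounding:
# a == 0.6000000000000001, b == 0.6
```

Gradients computed from such sums inherit the same order-dependent rounding, which is why reproducible AD pipelines pin the reduction order.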

Side Effects in Reverse Mode

Reverse mode is especially sensitive to side effects because it runs a second computation after the forward pass.

During backward execution, a system may need to:

| Need | Example |
| --- | --- |
| Read saved primal values | activation tensors |
| Replay branch decisions | same path as the forward pass |
| Reconstruct random masks | dropout |
| Undo or model mutation | in-place updates |
| Preserve external state | avoid duplicate writes |

A naive backward pass that repeats side effects may be wrong.

For example, differentiating:

write_log(x)
y = x * x

should not necessarily write the log again during the backward pass.
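One way to get this behavior is a tape that records backward rules only for the pure operations, so the side effect runs exactly once, in the forward pass (a minimal sketch; the `forward`/`backward`/tape structure is illustrative, not a real AD system):

```python
log = []

def write_log(x):
    log.append(x)   # observable side effect

def forward(x):
    write_log(x)    # runs once; nothing about it goes on the tape
    y = x * x
    # Only the pure operation gets a backward rule recorded.
    tape = [lambda y_bar: 2.0 * x * y_bar]   # d(x*x)/dx = 2x
    return y, tape

def backward(tape, y_bar):
    # Replays recorded rules in reverse; write_log is never re-run.
    for rule in reversed(tape):
        y_bar = rule(y_bar)
    return y_bar

y, tape = forward(3.0)
x_bar = backward(tape, 1.0)   # 6.0
```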

Effect Tracking

A compiler-based AD system may track effects explicitly.

An operation can be annotated as:

| Effect class | Example |
| --- | --- |
| Pure | sin, add, matmul |
| Read-only | reading a constant table |
| Mutable local | writing to a local buffer |
| Random | sampling |
| I/O | file write |
| External | database call |
| Non-differentiable | comparison, hash lookup |

Effect tracking tells the AD transform which operations can be differentiated, reordered, duplicated, removed, or replayed.

Without effect information, compiler transformations can silently change program behavior.

Differentiable Core and Effectful Shell

A robust design separates the differentiable core from the effectful shell.

load data
prepare inputs

run differentiable model

save result
log metrics
update external system

The middle region is the differentiable computation. The outer region performs I/O, logging, scheduling, and orchestration.

This separation is common in numerical software because it keeps derivative semantics clear.

Custom Rules for Effectful Operations

Some effectful or opaque operations can participate in AD through custom derivative rules.

Example:

y = external_solver(x)

The solver may mutate memory internally, call libraries, or use iterative algorithms. The AD system does not need to inspect all internal steps if the user supplies a valid derivative rule:

backward(y_bar) -> x_bar

This is common for:

| Operation | Custom derivative source |
| --- | --- |
| Linear solver | implicit differentiation |
| ODE solver | adjoint method |
| Rendering engine | analytic derivative |
| Database operator | relaxed or approximate derivative |
| Quantization | straight-through estimator |

Custom rules shift responsibility to the rule author. The rule must match the operation’s mathematical behavior.
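The quantization row makes a compact example: the forward operation is non-differentiable, and a user-supplied straight-through rule stands in for its derivative (a sketch; `quantize_vjp` is an illustrative name for the backward rule, not a library API).

```python
def quantize(x):
    # Opaque forward operation: non-differentiable round-to-nearest.
    return float(round(x))

def quantize_vjp(x, y_bar):
    # Straight-through estimator: the rule author declares the
    # derivative to be 1, so the adjoint passes through unchanged.
    # AD never needs to look inside quantize itself.
    return y_bar

y = quantize(2.7)               # 3.0
x_bar = quantize_vjp(2.7, 1.0)  # 1.0
```

The rule is only as good as its mathematical justification; here the justification is the usual straight-through argument, not the true (zero almost everywhere) derivative of rounding.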

Core Idea

Purity makes automatic differentiation simple because the program behaves like a mathematical function. Side effects add hidden inputs, hidden outputs, ordering constraints, and state transitions.

AD can still work with effectful programs when the effects are explicit, isolated, recorded, or assigned custom rules. The central discipline is to make every value that influences the numeric result visible to the derivative transform.