Purity and Side Effects

A pure computation is easier to differentiate because every output is determined only by its explicit inputs. There is no hidden state, no external mutation, and no dependence on execution history outside the function call.

For example:

y = f(x)

is pure when the same x always produces the same y, and the call changes nothing outside itself.

An impure computation may read or write external state:

counter = counter + 1
y = f(x, counter)

Now the output depends not only on x, but also on the previous value of counter. The function call also changes the state seen by later calls.

Automatic differentiation can handle many stateful computations, but purity gives the clean base model.

Pure Functions

A pure function has two main properties:

| Property | Meaning |
| --- | --- |
| Determinism | Same inputs produce the same outputs |
| No side effects | Evaluation changes no external state |

A pure function can be treated as a mathematical map:

f : X \to Y.

AD can then construct another map:

Df : X \to L(X, Y),

where Df(x) is the derivative of f at x.

In forward mode, the transformed function maps primal and tangent inputs to primal and tangent outputs.

In reverse mode, the transformed function maps primal inputs to outputs plus a backward function.
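The forward-mode transformation can be sketched with a toy dual-number class (the `Dual` type and `mul` helper below are illustrative, not from the text): each value carries a primal component and a tangent component, and every operation propagates both.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    primal: float   # value of the variable
    tangent: float  # derivative with respect to the seeded input

def mul(a: Dual, b: Dual) -> Dual:
    # Product rule: (ab)' = a'b + ab'
    return Dual(a.primal * b.primal,
                a.tangent * b.primal + a.primal * b.tangent)

def f(x: Dual) -> Dual:
    return mul(x, x)  # f(x) = x^2

# Seeding the tangent with 1.0 differentiates with respect to x.
y = f(Dual(3.0, 1.0))
print(y.primal, y.tangent)  # 9.0 6.0
```

Because `f` is pure, calling it on dual inputs is all that is needed: no hidden state has to be captured or replayed.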

Side Effects

A side effect is any observable change outside the returned value.

Common examples:

| Side effect | Example |
| --- | --- |
| Mutation | writing into an array |
| I/O | printing, reading files |
| Network access | calling an API |
| Database access | reading or writing rows |
| Randomness | sampling from RNG state |
| Global state | updating a counter |
| Exceptions | changing control flow abruptly |
| Concurrency | racing with another thread |

Some side effects are harmless for differentiation. Others make the derivative ambiguous or impossible.

Mutation as Explicit State

Mutation can often be made differentiable by turning state into an explicit input and output.

Impure form:

state.value = state.value + x
y = state.value * state.value

Functional form:

state1 = update(state0, x)
y = state1.value * state1.value
return y, state1

Now the computation is a pure function:

(x, s_0) \mapsto (y, s_1).

AD can differentiate this larger function. The derivative describes how the output and final state change with respect to the input and initial state.

This transformation is called state threading.
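A minimal sketch of the functional form, with the state held in a plain dictionary (an illustrative choice, not from the text), and a finite-difference check that the threaded version has a well-defined derivative:

```python
def update(state, x):
    # Pure update: returns a new state instead of mutating in place.
    return {"value": state["value"] + x}

def step(x, state0):
    state1 = update(state0, x)
    y = state1["value"] * state1["value"]
    return y, state1

y, s1 = step(2.0, {"value": 1.0})   # y = (1 + 2)^2 = 9.0

# Finite-difference check of dy/dx = 2 * (s0 + x) = 6 at this input.
h = 1e-6
y_h, _ = step(2.0 + h, {"value": 1.0})
dy_dx = (y_h - y) / h
```

Because `step` maps (x, s0) to (y, s1) with no hidden dependencies, the derivative is an ordinary function of its explicit inputs.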

State Threading

State threading passes state explicitly through the computation.

s1 = op1(x, s0)
s2 = op2(s1)
s3 = op3(s2)
y  = read(s3)

The dependency chain is explicit:

s0 -> s1 -> s2 -> s3 -> y

Reverse mode can then propagate adjoints through state transitions.

If the state is differentiable, it receives adjoints. If part of the state is discrete or external, it is marked non-differentiable.

Non-Differentiable State

Not all state has a meaningful derivative.

Examples include:

| State | Why the derivative is problematic |
| --- | --- |
| File handle | not numeric |
| Socket | external resource |
| Hash table keys | discrete |
| Database row identity | symbolic |
| Random seed | usually discrete |
| Thread lock | synchronization object |

An AD system must separate differentiable state from non-differentiable state.

For example:

record = db.lookup(user_id)
score = model(x, record.weight)

The derivative can flow through record.weight if it is treated as numeric data. It cannot usually flow through user_id or through the database lookup itself.

I/O

Input and output operations are usually outside the differentiable core.

x = read_file("input.txt")
y = f(x)
write_file("output.txt", y)

The differentiable part is:

y = f(x)

The file operations supply and consume values, but they are not ordinarily differentiated.

A practical AD system often treats I/O as a boundary. Values crossing the boundary may be differentiable. The boundary operation itself is not.
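A sketch of the boundary pattern, using a temporary directory and hand-written derivative in place of a real AD system (the filenames and the function `f` here are illustrative assumptions):

```python
import math
import os
import tempfile

def f(x):
    # Differentiable core: a pure numeric function.
    return x * math.exp(x)

def df(x):
    # Hand-written derivative of the core. The file operations
    # below sit outside the core and are never differentiated.
    return (x + 1.0) * math.exp(x)

# Effectful shell: values cross the boundary, the boundary ops do not.
tmp = tempfile.mkdtemp()
inp = os.path.join(tmp, "input.txt")
out = os.path.join(tmp, "output.txt")
with open(inp, "w") as fh:
    fh.write("2.0")

with open(inp) as fh:        # boundary: value enters the program
    x = float(fh.read())
y = f(x)                     # differentiable region
with open(out, "w") as fh:   # boundary: value leaves the program
    fh.write(repr(y))
```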

Printing and Logging

Printing is a side effect, but usually does not affect numerical results.

print(x)
y = x * x

The derivative of y with respect to x remains:

\frac{dy}{dx} = 2x.

The print operation can be ignored by the derivative transform if it does not influence control flow or state used later.

However, if logging changes execution timing, updates counters, or writes values consumed later, it becomes semantically relevant.

Randomness

Randomness is stateful when it uses a mutable random number generator.

r = random()
y = x * r

For one execution, r is a sampled constant. The derivative with respect to x is:

\frac{dy}{dx} = r.

But the program is not deterministic unless the random stream is fixed.

To differentiate stochastic programs reliably, systems often use one of these strategies:

| Strategy | Description |
| --- | --- |
| Save samples | store random values used in the forward pass |
| Save RNG state | restore the exact random stream |
| Stateless RNG | make randomness an explicit input |
| Reparameterization | express samples as differentiable transforms |

A stateless form is:

r, key2 = random(key1)
y = x * r

Now the random key is explicit state.
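A toy stateless `random` in this style (the mixing constants come from a standard LCG; this is a sketch, not a cryptographically or statistically serious generator): the key is an explicit input, and a new key is an explicit output, so no hidden generator state is mutated.

```python
def random(key):
    # Hash the integer key into a uniform sample in [0, 1)
    # and return a new key; nothing outside this call changes.
    mixed = (key * 6364136223846793005 + 1442695040888963407) % 2**64
    r = (mixed >> 11) / float(2**53)
    return r, mixed  # (sample, next key)

key1 = 42
r, key2 = random(key1)
y = 3.0 * r   # for this fixed key, dy/dx at x = 3.0 is just r

# Same key, same sample: the program is fully deterministic.
r_again, _ = random(key1)
```

This is the design used by splittable-key RNGs: replaying the forward pass only requires replaying the keys.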

Reparameterized Randomness

Some random computations can be expressed as deterministic functions of parameters and noise.

For a normal random variable:

z = \mu + \sigma \epsilon, \qquad \epsilon \sim N(0, 1).

For a fixed noise sample ε, derivatives with respect to μ and σ are ordinary derivatives:

\frac{\partial z}{\partial \mu} = 1, \qquad \frac{\partial z}{\partial \sigma} = \epsilon.

This is widely used in variational inference and differentiable probabilistic models.
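A small numeric check of the reparameterization trick (the function name `sample_normal` is illustrative): once the noise sample is fixed, finite differences recover exactly the partial derivatives above.

```python
import random

def sample_normal(mu, sigma, eps):
    # Reparameterization: z is a deterministic function of (mu, sigma)
    # once the noise sample eps is drawn outside the function.
    return mu + sigma * eps

eps = random.Random(0).gauss(0.0, 1.0)   # fixed noise sample
mu, sigma = 1.0, 2.0
z = sample_normal(mu, sigma, eps)

# Finite differences: dz/dmu = 1 and dz/dsigma = eps.
h = 1e-6
dz_dmu = (sample_normal(mu + h, sigma, eps) - z) / h
dz_dsigma = (sample_normal(mu, sigma + h, eps) - z) / h
```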

Exceptions

Exceptions are control flow with abrupt exit.

if x <= 0:
    raise Error
y = log(x)

The differentiable function is defined only on the successful execution region:

x > 0.

Inside that region:

\frac{dy}{dx} = \frac{1}{x}.

Outside the region, the program has no numeric output, so the derivative is undefined.

An AD system may propagate errors, mask invalid paths, or require the user to define behavior for invalid regions.
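The domain restriction can be made concrete by giving the derivative the same guard as the primal function (a sketch of the "propagate errors" option; the error messages are illustrative):

```python
import math

def f(x):
    # Raising restricts the function to the region x > 0.
    if x <= 0:
        raise ValueError("f is undefined for x <= 0")
    return math.log(x)

def df(x):
    # The derivative inherits the same domain restriction.
    if x <= 0:
        raise ValueError("df is undefined for x <= 0")
    return 1.0 / x

y = f(2.0)        # defined: x is in the valid region
slope = df(2.0)   # 1/2
try:
    f(-1.0)
    raised = False
except ValueError:
    raised = True  # outside the region there is no numeric output
```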

Global State

Global variables create hidden inputs and outputs.

scale = 2

def f(x):
    return scale * x

The function appears to take one input, but it also depends on scale.

If scale changes, the function changes.

A pure representation would pass scale explicitly:

def f(x, scale):
    return scale * x

This makes the derivative relation clear:

\frac{\partial f}{\partial x} = \text{scale}, \qquad \frac{\partial f}{\partial \text{scale}} = x.
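With `scale` passed explicitly, both partials can be checked numerically (a quick finite-difference sketch at the illustrative point x = 3, scale = 2):

```python
def f(x, scale):
    # scale is an explicit argument, so both partials are visible to AD.
    return scale * x

x, scale = 3.0, 2.0
h = 1e-6
df_dx = (f(x + h, scale) - f(x, scale)) / h       # approx scale = 2
df_dscale = (f(x, scale + h) - f(x, scale)) / h   # approx x = 3
```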

Concurrency

Concurrent programs may have nondeterministic execution order.

thread 1: x = x + a
thread 2: x = x + b

If operations race, the final result may depend on scheduling. The derivative may also depend on scheduling.

AD systems generally require deterministic computation for reproducible gradients. Parallel numerical kernels are acceptable when they preserve well-defined semantics, but uncontrolled races make differentiation unreliable.

Even deterministic parallel reductions may have small floating point differences because addition order changes rounding.
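The rounding effect is easy to demonstrate: reassociating a sum, as a parallel reduction tree might, changes the result in the last bit.

```python
a = (0.1 + 0.2) + 0.3   # one reduction order
b = 0.1 + (0.2 + 0.3)   # another order, as a parallel tree might produce
# Same mathematical sum, different rounding:
# a == 0.6000000000000001, b == 0.6
```

Gradients computed from such sums inherit the same order-dependent rounding, which is why reproducible AD pipelines pin the reduction order.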

Side Effects in Reverse Mode

Reverse mode is especially sensitive to side effects because it runs a second computation after the forward pass.

During backward execution, a system may need to:

| Need | Example |
| --- | --- |
| Read saved primal values | activation tensors |
| Replay branch decisions | same path as the forward pass |
| Reconstruct random masks | dropout |
| Undo or model mutation | in-place updates |
| Preserve external state | avoid duplicate writes |

A naive backward pass that repeats side effects may be wrong.

For example, differentiating:

write_log(x)
y = x * x

should not necessarily write the log again during the backward pass.
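One way to get this behavior is a tape that records backward rules only for the pure operations, so the side effect runs exactly once, in the forward pass (a minimal sketch; the `forward`/`backward`/tape structure is illustrative, not a real AD system):

```python
log = []

def write_log(x):
    log.append(x)   # observable side effect

def forward(x):
    write_log(x)    # runs once; nothing about it goes on the tape
    y = x * x
    # Only the pure operation gets a backward rule recorded.
    tape = [lambda y_bar: 2.0 * x * y_bar]   # d(x*x)/dx = 2x
    return y, tape

def backward(tape, y_bar):
    # Replays recorded rules in reverse; write_log is never re-run.
    for rule in reversed(tape):
        y_bar = rule(y_bar)
    return y_bar

y, tape = forward(3.0)
x_bar = backward(tape, 1.0)   # 6.0
```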

Effect Tracking

A compiler-based AD system may track effects explicitly.

An operation can be annotated as:

| Effect class | Example |
| --- | --- |
| Pure | sin, add, matmul |
| Read-only | reading a constant table |
| Mutable local | writing to a local buffer |
| Random | sampling |
| I/O | file write |
| External | database call |
| Non-differentiable | comparison, hash lookup |

Effect tracking tells the AD transform which operations can be differentiated, reordered, duplicated, removed, or replayed.

Without effect information, compiler transformations can silently change program behavior.

Differentiable Core and Effectful Shell

A robust design separates the differentiable core from the effectful shell.

load data
prepare inputs

run differentiable model

save result
log metrics
update external system

The middle region is the differentiable computation. The outer region performs I/O, logging, scheduling, and orchestration.

This separation is common in numerical software because it keeps derivative semantics clear.

Custom Rules for Effectful Operations

Some effectful or opaque operations can participate in AD through custom derivative rules.

Example:

y = external_solver(x)

The solver may mutate memory internally, call libraries, or use iterative algorithms. The AD system does not need to inspect all internal steps if the user supplies a valid derivative rule:

backward(y_bar) -> x_bar

This is common for:

| Operation | Custom derivative source |
| --- | --- |
| Linear solver | implicit differentiation |
| ODE solver | adjoint method |
| Rendering engine | analytic derivative |
| Database operator | relaxed or approximate derivative |
| Quantization | straight-through estimator |

Custom rules shift responsibility to the rule author. The rule must match the operation’s mathematical behavior.
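The quantization row makes a compact example: the forward operation is non-differentiable, and a user-supplied straight-through rule stands in for its derivative (a sketch; `quantize_vjp` is an illustrative name for the backward rule, not a library API).

```python
def quantize(x):
    # Opaque forward operation: non-differentiable round-to-nearest.
    return float(round(x))

def quantize_vjp(x, y_bar):
    # Straight-through estimator: the rule author declares the
    # derivative to be 1, so the adjoint passes through unchanged.
    # AD never needs to look inside quantize itself.
    return y_bar

y = quantize(2.7)               # 3.0
x_bar = quantize_vjp(2.7, 1.0)  # 1.0
```

The rule is only as good as its mathematical justification; here the justification is the usual straight-through argument, not the true (zero almost everywhere) derivative of rounding.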

Core Idea

Purity makes automatic differentiation simple because the program behaves like a mathematical function. Side effects add hidden inputs, hidden outputs, ordering constraints, and state transitions.

AD can still work with effectful programs when the effects are explicit, isolated, recorded, or assigned custom rules. The central discipline is to make every value that influences the numeric result visible to the derivative transform.