# Purity and Side Effects

A pure computation is easier to differentiate because every output is determined only by its explicit inputs. There is no hidden state, no external mutation, and no dependence on execution history outside the function call.

For example:

```text
y = f(x)
```

is pure when the same `x` always produces the same `y`, and the call changes nothing outside itself.

An impure computation may read or write external state:

```text
counter = counter + 1
y = f(x, counter)
```

Now the output depends not only on `x`, but also on the previous value of `counter`. The function call also changes the state seen by later calls.

Automatic differentiation can handle many stateful computations, but purity gives the clean base model.
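The contrast between the two snippets above can be sketched in Python. The names `pure_f`, `impure_f`, and `counter` are hypothetical and chosen only for illustration; the arithmetic inside each function is an assumption, not part of the original example.

```python
# Hypothetical example: the function bodies are illustrative assumptions.

def pure_f(x: float) -> float:
    """Pure: the result depends only on the explicit argument."""
    return 3.0 * x + 1.0

counter = 0  # hidden state shared by every call to impure_f

def impure_f(x: float) -> float:
    """Impure: reads and writes the module-level counter."""
    global counter
    counter += 1
    return x * counter

# Same input, same output, on every call.
same = pure_f(2.0) == pure_f(2.0)

# Same input, different outputs: the hidden counter changed between calls.
first = impure_f(2.0)
second = impure_f(2.0)
```

Here `first` and `second` differ even though both calls receive `2.0`, which is exactly the dependence on execution history that a pure function rules out.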

## Pure Functions

A pure function has two main properties:

| Property | Meaning |
|---|---|
| Determinism | Same inputs produce same outputs |
| No side effects | Evaluation changes no external state |

A pure function can be treated as a mathematical map:

$$
f : X \to Y.
$$

AD can then construct another map:

$$
Df : X \to L(X, Y),
$$

where $Df(x)$ is the derivative of $f$ at $x$.

In forward mode, the transformed function maps primal and tangent inputs to primal and tangent outputs.

In reverse mode, the transformed function maps primal inputs to primal outputs plus a backward function, which maps output adjoints back to input adjoints.
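As a hand-written sketch of these two signatures, consider the hypothetical function $f(x) = x \sin x$. The names `f_forward` and `f_reverse` are assumptions for illustration; an AD system would generate the equivalent code automatically.

```python
import math

def f(x):
    """Hypothetical pure function used for illustration."""
    return x * math.sin(x)

def f_forward(x, x_dot):
    """Forward mode: (primal, tangent) in, (primal, tangent) out."""
    y = x * math.sin(x)
    y_dot = (math.sin(x) + x * math.cos(x)) * x_dot
    return y, y_dot

def f_reverse(x):
    """Reverse mode: primal in, (primal, backward function) out."""
    s, c = math.sin(x), math.cos(x)  # values saved for the backward pass
    y = x * s

    def backward(y_bar):
        # Maps the output adjoint to the input adjoint.
        return y_bar * (s + x * c)

    return y, backward
```

Because `f` is pure, both transformed functions are pure as well: calling `backward` twice with the same adjoint gives the same result.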

## Side Effects

A side effect is any observable change outside the returned value.

Common examples:

| Side effect | Example |
|---|---|
| Mutation | writing into an array |
| I/O | printing, reading files |
| Network access | calling an API |
| Database access | reading or writing rows |
| Randomness | sampling from RNG state |
| Global state | updating a counter |
| Exceptions | changing control flow abruptly |
| Concurrency | racing with another thread |

Some side effects, such as printing a value, are harmless for differentiation. Others, such as racing writes to shared state, make the derivative ambiguous or impossible to define.

## Mutation as Explicit State

Mutation can often be made differentiable by turning state into an explicit input and output.

Impure form:

```text
state.value = state.value + x
y = state.value * state.value
```

Functional form:

```text
state1 = update(state0, x)
y = state1.value * state1.value
return y, state1
```

Now the computation is a pure function:

$$
(x, s_0) \mapsto (y, s_1).
$$

AD can differentiate this larger function. The derivative describes how the output and final state change with respect to the input and initial state.

This transformation is called state threading.
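A minimal sketch of the functional form, assuming that `update` simply adds `x` to the stored value (as in the impure version). The names `step` and `step_jvp` are hypothetical; `step_jvp` is a hand-written forward-mode derivative of the larger function $(x, s_0) \mapsto (y, s_1)$.

```python
def step(x, s0):
    """Pure version: state comes in as s0 and goes out as s1."""
    s1 = s0 + x          # assumed behavior of update(state0, x)
    y = s1 * s1
    return y, s1

def step_jvp(x, s0, x_dot, s0_dot):
    """Forward-mode derivative of step with respect to both x and s0."""
    s1 = s0 + x
    s1_dot = s0_dot + x_dot
    y = s1 * s1
    y_dot = 2.0 * s1 * s1_dot
    return (y, s1), (y_dot, s1_dot)

# Perturb x only: tangent (1, 0) on the inputs (x, s0).
(y, s1), (y_dot, s1_dot) = step_jvp(1.0, 2.0, 1.0, 0.0)
```

The tangent outputs report how both `y` and the final state `s1` respond to the perturbation, which is the information that the mutating version hides.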

## State Threading

State threading passes state explicitly through the computation.

```text
s1 = op1(x, s0)
s2 = op2(s1)
s3 = op3(s2)
y  = read(s3)
```

The dependency chain is explicit:

```text
s0 -> s1 -> s2 -> s3 -> y
```

Reverse mode can then propagate adjoints through state transitions.

If the state is differentiable, it receives adjoints. If part of the state is discrete or external, it is marked non-differentiable.
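The adjoint propagation through the chain can be sketched by hand. The concrete operations below (`op1` as addition, `op2` as doubling, `op3` as squaring, `read` as the identity) are assumptions chosen only to make the example runnable.

```python
def chain_with_backward(x, s0):
    """Threaded forward pass that also returns a backward function."""
    s1 = s0 + x        # op1 (assumed: addition)
    s2 = 2.0 * s1      # op2 (assumed: doubling)
    s3 = s2 * s2       # op3 (assumed: squaring)
    y = s3             # read (assumed: identity)

    def backward(y_bar):
        # Walk the chain in reverse: y -> s3 -> s2 -> s1 -> (x, s0).
        s3_bar = y_bar
        s2_bar = s3_bar * 2.0 * s2
        s1_bar = s2_bar * 2.0
        # op1 is addition, so x and s0 receive the same adjoint.
        return s1_bar, s1_bar

    return y, backward

y, backward = chain_with_backward(1.0, 2.0)
x_bar, s0_bar = backward(1.0)
```

Each intermediate state gets its own adjoint, and the initial state `s0` receives one just like the ordinary input `x`.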

## Non-Differentiable State

Not all state has a meaningful derivative.

Examples include:

| State | Why derivative is problematic |
|---|---|
| File handle | not numeric |
| Socket | external resource |
| Hash table keys | discrete |
| Database row identity | symbolic |
| Random seed | usually discrete |
| Thread lock | synchronization object |

An AD system must separate differentiable state from non-differentiable state.

For example:

```text
record = db.lookup(user_id)
score = model(x, record.weight)
```

The derivative can flow through `record.weight` if it is treated as numeric data. It cannot usually flow through `user_id` or through the database lookup itself.
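One way to picture the separation is a lookup that returns a record with both symbolic and numeric fields. This is only a sketch: the in-memory dictionary `DB` stands in for a database, `model` is assumed to be a simple product, and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    name: str      # non-differentiable: symbolic data
    weight: float  # differentiable: numeric data

# Hypothetical in-memory stand-in for the database.
DB = {"u1": Record("alice", 0.5), "u2": Record("bob", 2.0)}

def model(x, weight):
    """Assumed model for illustration: a plain product."""
    return x * weight

def score_with_grads(x, user_id):
    record = DB[user_id]                 # lookup itself is not differentiated
    score = model(x, record.weight)
    # Gradients exist only for the numeric values; user_id gets no entry.
    grads = {"x": record.weight, "weight": x}
    return score, grads

score, grads = score_with_grads(3.0, "u1")
```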

## I/O

Input and output operations are usually outside the differentiable core.

```text
x = read_file("input.txt")
y = f(x)
write_file("output.txt", y)
```

The differentiable part is:

```text
y = f(x)
```

The file operations supply and consume values, but they are not ordinarily differentiated.

A practical AD system often treats I/O as a boundary. Values crossing the boundary may be differentiable. The boundary operation itself is not.

## Printing and Logging

Printing is a side effect, but usually does not affect numerical results.

```text
print(x)
y = x * x
```

The derivative of `y` with respect to `x` remains:

$$
\frac{dy}{dx} = 2x.
$$

The print operation can be ignored by the derivative transform if it does not influence control flow or state used later.

However, if logging changes execution timing, updates counters, or writes values consumed later, it becomes semantically relevant.

## Randomness

Randomness is stateful when it uses a mutable random number generator.

```text
r = random()
y = x * r
```

For one execution, $r$ is a sampled constant. The derivative with respect to $x$ is:

$$
\frac{dy}{dx} = r.
$$

But the program is not deterministic unless the random stream is fixed.

To differentiate stochastic programs reliably, systems often use one of these strategies:

| Strategy | Description |
|---|---|
| Save samples | Store random values used in forward pass |
| Save RNG state | Restore exact random stream |
| Stateless RNG | Make randomness an explicit input |
| Reparameterization | Express samples as differentiable transforms |

A stateless form is:

```text
r, key2 = random(key1)
y = x * r
```

Now the random key is explicit state.
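A toy version of the stateless form is sketched below. The generator is a simple linear congruential step used only for illustration; it is an assumption, not a recommendation for real sampling, and the names `random_step` and `scaled_sample` are hypothetical.

```python
def random_step(key):
    """Toy stateless RNG: returns (sample in [0, 1), next key)."""
    key2 = (1103515245 * key + 12345) % 2**31
    return key2 / 2**31, key2

def scaled_sample(x, key1):
    """Pure function of (x, key1): value, derivative in x, and next key."""
    r, key2 = random_step(key1)
    y = x * r
    dy_dx = r  # r is a constant with respect to x
    return y, dy_dx, key2

# The same key always reproduces the same sample and the same derivative.
run_a = scaled_sample(2.0, 42)
run_b = scaled_sample(2.0, 42)
```

Because the key is an ordinary argument, a backward pass can recompute the exact sample without relying on hidden generator state.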

## Reparameterized Randomness

Some random computations can be expressed as deterministic functions of parameters and noise.

For a normal random variable:

$$
z = \mu + \sigma \epsilon,
\qquad
\epsilon \sim N(0, 1).
$$

For a fixed noise sample $\epsilon$, derivatives with respect to $\mu$ and $\sigma$ are ordinary derivatives:

$$
\frac{\partial z}{\partial \mu} = 1,
\qquad
\frac{\partial z}{\partial \sigma} = \epsilon.
$$

This is widely used in variational inference and differentiable probabilistic models.
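As a small illustration, the sketch below draws one noise value with a fixed seed and evaluates the two partial derivatives for that sample. The function name `sample_z` and the chosen parameter values are assumptions.

```python
import random

def sample_z(mu, sigma, eps):
    """Reparameterized sample and its partial derivatives for fixed noise."""
    z = mu + sigma * eps
    dz_dmu = 1.0
    dz_dsigma = eps
    return z, dz_dmu, dz_dsigma

rng = random.Random(0)      # fixed seed so the run is reproducible
eps = rng.gauss(0.0, 1.0)   # noise drawn independently of mu and sigma
z, dz_dmu, dz_dsigma = sample_z(1.0, 0.5, eps)
```

The sampling step sits outside the differentiated function; only the deterministic map from `(mu, sigma, eps)` to `z` is differentiated.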

## Exceptions

Exceptions are control flow with abrupt exit.

```text
if x <= 0:
    raise Error
y = log(x)
```

The differentiable function is defined only on the successful execution region:

$$
x > 0.
$$

Inside that region:

$$
\frac{dy}{dx} = \frac{1}{x}.
$$

Outside the region, the program has no numeric output, so the derivative is undefined.

An AD system may propagate errors, mask invalid paths, or require the user to define behavior for invalid regions.
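Two of these strategies, propagating the error and masking the invalid path, can be sketched as follows. The function names are hypothetical, and returning NaN is only one possible masking convention.

```python
import math

def log_with_grad(x):
    """Propagate the error: no value and no derivative outside x > 0."""
    if x <= 0:
        raise ValueError("log is undefined for x <= 0")
    return math.log(x), 1.0 / x

def masked_log_with_grad(x):
    """Mask the invalid path: return NaN for both value and derivative."""
    if x <= 0:
        return math.nan, math.nan
    return math.log(x), 1.0 / x
```

Inside the region $x > 0$ the two versions agree; they differ only in how the invalid region is reported.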

## Global State

Global variables create hidden inputs and outputs.

```text
scale = 2

def f(x):
    return scale * x
```

The function appears to take one input, but it also depends on `scale`.

If `scale` changes, the function changes.

A pure representation would pass `scale` explicitly:

```text
def f(x, scale):
    return scale * x
```

This makes the derivative relation clear:

$$
\frac{\partial f}{\partial x} = \text{scale},
\qquad
\frac{\partial f}{\partial \text{scale}} = x.
$$

## Concurrency

Concurrent programs may have nondeterministic execution order.

```text
thread 1: x = x + a
thread 2: x = x + b
```

If operations race, the final result may depend on scheduling. The derivative may also depend on scheduling.

AD systems generally require deterministic computation for reproducible gradients. Parallel numerical kernels are acceptable when they preserve well-defined semantics, but uncontrolled races make differentiation unreliable.

Even race-free parallel reductions may differ slightly from a sequential sum, because floating-point addition is not associative and a different addition order changes the rounding.
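The effect of addition order can be seen with three double-precision values, without any threads involved. This is a small illustration, not a model of any particular parallel runtime.

```python
# Same three numbers, two groupings.
left = (0.1 + 0.2) + 0.3   # sequential, left-to-right order
right = 0.1 + (0.2 + 0.3)  # order a tree-style reduction might use

differs = left != right
gap = abs(left - right)
```

The two results differ in the last bit, so gradients accumulated by reductions with different orderings can differ by similar rounding-level amounts.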

## Side Effects in Reverse Mode

Reverse mode is especially sensitive to side effects because it runs a second computation after the forward pass.

During backward execution, a system may need to:

| Need | Example |
|---|---|
| Read saved primal values | activation tensors |
| Replay branch decisions | same path as forward pass |
| Reconstruct random masks | dropout |
| Undo or model mutation | in-place updates |
| Preserve external state | avoid duplicate writes |

A naive backward pass that repeats side effects may be wrong.

For example, differentiating:

```text
write_log(x)
y = x * x
```

should not, by default, write the log a second time during the backward pass.
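A hand-written sketch of this behavior: the forward pass performs the side effect once and saves the primal value, and the backward function uses only the saved value. The list `log` and the helper `write_log` are hypothetical stand-ins for a real logging facility.

```python
log = []  # hypothetical stand-in for an external log

def write_log(value):
    log.append(value)

def square_with_log(x):
    write_log(x)  # side effect: runs once, in the forward pass
    y = x * x

    def backward(y_bar):
        # Uses the saved primal x and deliberately does not log again.
        return y_bar * 2.0 * x

    return y, backward

y, backward = square_with_log(3.0)
x_bar = backward(1.0)
```

After both passes the log holds a single entry, which matches what the original program would have produced without differentiation.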

## Effect Tracking

A compiler-based AD system may track effects explicitly.

An operation can be annotated as:

| Effect class | Example |
|---|---|
| Pure | `sin`, `add`, `matmul` |
| Read-only | reading a constant table |
| Mutable local | writing to a local buffer |
| Random | sampling |
| I/O | file write |
| External | database call |
| Non-differentiable | comparison, hash lookup |

Effect tracking tells the AD transform which operations can be differentiated, reordered, duplicated, removed, or replayed.

Without effect information, compiler transformations can silently change program behavior.
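A minimal sketch of how such annotations might be consulted. The operation names, the `Effect` enum, and the policy that only pure and read-only operations may be recomputed are all assumptions for illustration; real compilers use richer effect systems.

```python
from enum import Enum, auto

class Effect(Enum):
    PURE = auto()
    READ_ONLY = auto()
    MUTABLE_LOCAL = auto()
    RANDOM = auto()
    IO = auto()
    EXTERNAL = auto()

# Hypothetical annotation table mapping operation names to effect classes.
OP_EFFECTS = {
    "sin": Effect.PURE,
    "matmul": Effect.PURE,
    "table_read": Effect.READ_ONLY,
    "buffer_write": Effect.MUTABLE_LOCAL,
    "sample": Effect.RANDOM,
    "file_write": Effect.IO,
    "db_call": Effect.EXTERNAL,
}

# Assumed policy: only these classes are safe to duplicate or recompute.
SAFE_TO_RECOMPUTE = {Effect.PURE, Effect.READ_ONLY}

def can_recompute(op_name):
    """True if the backward pass may re-run the operation freely."""
    return OP_EFFECTS[op_name] in SAFE_TO_RECOMPUTE
```

With this table, a transform could recompute `sin` during the backward pass to save memory, but would have to store the result of `sample` or `file_write` instead of running it again.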

## Differentiable Core and Effectful Shell

A robust design separates the differentiable core from the effectful shell.

```text
load data
prepare inputs

run differentiable model

save result
log metrics
update external system
```

The middle region is the differentiable computation. The outer region performs I/O, logging, scheduling, and orchestration.

This separation is common in numerical software because it keeps derivative semantics clear.
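The layering can be sketched with in-memory streams standing in for files. The names `model` and `run`, and the choice of $x^3$ as the model, are assumptions for illustration.

```python
import io

def model(x):
    """Differentiable core: pure, returns the value and derivative of x**3."""
    return x ** 3, 3.0 * x ** 2

def run(source, sink):
    """Effectful shell: reads input, calls the core, writes the results."""
    x = float(source.read())       # I/O boundary: a value comes in
    y, dy_dx = model(x)            # pure region, the only part differentiated
    sink.write(f"{y} {dy_dx}\n")   # I/O boundary: values go out
    return y, dy_dx

source = io.StringIO("2.0")  # stands in for an input file
sink = io.StringIO()         # stands in for an output file
result = run(source, sink)
```

Only `model` needs a derivative; `run` can change how data is loaded or saved without affecting derivative semantics.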

## Custom Rules for Effectful Operations

Some effectful or opaque operations can participate in AD through custom derivative rules.

Example:

```text
y = external_solver(x)
```

The solver may mutate memory internally, call libraries, or use iterative algorithms. The AD system does not need to inspect all internal steps if the user supplies a valid derivative rule:

```text
backward(y_bar) -> x_bar
```

This is common for:

| Operation | Custom derivative source |
|---|---|
| Linear solver | implicit differentiation |
| ODE solver | adjoint method |
| Rendering engine | analytic derivative |
| Database operator | relaxed or approximate derivative |
| Quantization | straight-through estimator |

Custom rules shift responsibility to the rule author. The rule must match the operation's mathematical behavior.
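As a scalar sketch of a rule based on implicit differentiation, suppose the solver finds $y$ with $y^3 + y = x$ by Newton's method. The equation, the iteration count, and the names `external_solver` and `solver_with_rule` are assumptions. Differentiating the defining equation gives $dy/dx = 1 / (3y^2 + 1)$, so the backward rule needs only the converged `y`, not the iteration history.

```python
def external_solver(x, iterations=50):
    """Opaque solver: Newton's method for y**3 + y = x (assumed example)."""
    y = 0.0
    for _ in range(iterations):
        y -= (y ** 3 + y - x) / (3.0 * y ** 2 + 1.0)
    return y

def solver_with_rule(x):
    """Pairs the solver output with a user-supplied backward rule."""
    y = external_solver(x)

    def backward(y_bar):
        # Implicit differentiation of y**3 + y = x at the solution y.
        return y_bar / (3.0 * y ** 2 + 1.0)

    return y, backward

y, backward = solver_with_rule(2.0)
x_bar = backward(1.0)
```

The rule is correct only if the solver has actually converged; if it stops early, the supplied derivative no longer matches the value that was returned.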

## Core Idea

Purity makes automatic differentiation simple because the program behaves like a mathematical function. Side effects add hidden inputs, hidden outputs, ordering constraints, and state transitions.

AD can still work with effectful programs when the effects are explicit, isolated, recorded, or assigned custom rules. The central discipline is to make every value that influences the numeric result visible to the derivative transform.

