# Effect Systems and Mutation

Automatic differentiation is easiest to define for pure functions. A pure function behaves like a mathematical mapping: it consumes inputs, produces outputs, and has no observable side effects. Real programs are rarely this simple. They mutate arrays, allocate memory, read state, launch kernels, perform I/O, sample random values, and synchronize across devices.

An effect system describes these behaviors explicitly. It tracks what a computation does in addition to what value it returns. This becomes important for automatic differentiation because effects change the meaning of the derivative transformation.

### Pure Computation and the Chain Rule

Consider a pure function:

$$
f : X \rightarrow Y
$$

The derivative transformation:

$$
D(f)
$$

can be defined compositionally.

If:

$$
f = h \circ g
$$

then:

$$
D(f) = D(h) \circ D(g)
$$

$$
D(f)=D(h) \circ D(g)
$$

This works because the computation has no hidden state. Every dependency is visible in the expression graph.

Mutation breaks this simplicity because values can change over time.

### Mutation Changes Semantics

Suppose:

```python
x = [1.0, 2.0]
y = x
x[0] = 5.0
```

Now `y` also changes because `x` and `y` alias the same storage.

A derivative system must answer:

- Which version of `x` does the gradient refer to?
- Was the original value needed later?
- Should the reverse pass reconstruct old state?
- Is the mutation observable outside the differentiated region?

These are semantic questions, not merely implementation details.
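The aliasing hazard can be demonstrated directly in plain Python; the sketch below also shows that only an explicit copy severs the alias:

```python
x = [1.0, 2.0]
y = x               # no copy is made: y aliases x's storage
x[0] = 5.0

assert y[0] == 5.0  # the write through x is visible through y
assert y is x       # both names refer to the same list object

z = list(x)         # an explicit copy severs the alias
x[1] = 7.0
assert z[1] == 2.0  # z retains the pre-mutation value
```

A derivative system faces the same question at the storage level: gradients must be associated with buffers, not with variable names.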

### Effects as Typed Behavior

An effect system augments function types with behavioral information.

Instead of:

```text
f : A -> B
```

the language may track:

```text
f : A -> B [Pure]
```

or:

```text
g : A -> B [State]
```

Possible effects include:

| Effect | Meaning |
|---|---|
| Pure | No observable side effects |
| State | Mutation of memory |
| IO | External interaction |
| Random | Random number generation |
| Exception | Non-local control flow |
| Async | Concurrent execution |
| Device | GPU or accelerator execution |

Differentiation semantics depend strongly on these effects.

### Mutation in Forward Mode

Forward mode is relatively tolerant of mutation because derivatives propagate alongside primal execution.

Example:

```python
x = Dual(2.0, 1.0)
x = x * x
```

The primal and tangent update together.

Problems appear when:

- Aliasing exists
- Mutation order matters
- Shared buffers are updated
- State escapes the local computation

Even then, forward mode usually follows ordinary execution naturally.
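A minimal `Dual` class makes the example above concrete (an illustrative sketch, not any framework's API; only multiplication is implemented):

```python
from dataclasses import dataclass

@dataclass
class Dual:
    primal: float   # the ordinary value
    tangent: float  # derivative with respect to the seeded input

    def __mul__(self, other):
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.primal * other.primal,
                    self.tangent * other.primal + self.primal * other.tangent)

x = Dual(2.0, 1.0)  # seed dx/dx = 1
x = x * x           # overwriting the binding is harmless here:
                    # primal and tangent update together

assert x.primal == 4.0
assert x.tangent == 4.0  # d(x^2)/dx = 2x = 4 at x = 2
```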

### Mutation in Reverse Mode

Reverse mode is harder because the backward pass occurs after forward execution.

Suppose:

```python
x = 3
x = x * 2
x = x + 1
```

The reverse pass must propagate gradients through earlier states of `x`.

Conceptually:

| Step | Primal value |
|---|---|
| Initial | 3 |
| After multiply | 6 |
| After add | 7 |

For these particular updates the local derivatives are constants, so the backward pass happens not to need the primal values; a nonlinear update such as `x = x * x`, however, consumes the overwritten intermediate directly.

If an old value was overwritten without being saved, reverse mode may fail unless the system recorded enough information during the forward pass.
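A hand-written reverse pass shows the saving requirement concretely. A squaring step is used because its rule `d(x*x)/dx = 2x` consumes the overwritten primal value, while a constant-derivative step like `x + 1` does not:

```python
# Forward pass: save each value of x that a derivative rule will need.
saved = []

x = 3.0
saved.append(x)   # old x = 3.0, needed later by the squaring rule
x = x * x         # x = 9.0
x = x + 1.0       # x = 10.0

# Reverse pass: seed the output gradient and walk the steps backward.
grad = 1.0
grad = grad * 1.0            # through x = x + 1.0 (derivative is 1)
old_x = saved.pop()
grad = grad * 2.0 * old_x    # through x = x * x, using the saved 3.0

assert grad == 6.0           # d(x^2 + 1)/dx = 2x = 6 at x = 3
```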

### The State Reconstruction Problem

Reverse mode often needs to reconstruct earlier program state.

This creates the central problem of mutation-aware AD:

| Requirement | Reason |
|---|---|
| Preserve overwritten values | Needed for derivative rules |
| Preserve execution order | Reverse pass depends on it |
| Preserve aliasing semantics | Shared state affects gradients |
| Preserve control decisions | Branches determine path |

Several implementation strategies exist.

### Tape-Based State Saving

A tape records operations and saved values during the forward pass.

Example:

```text
store old x = 3
x = x * 2

store old x = 6
x = x + 1
```

The backward pass replays derivative rules using saved intermediates.

Advantages:

| Benefit | Explanation |
|---|---|
| Correct reverse semantics | Exact replay |
| Dynamic control flow | Works naturally |
| Flexible mutation | Handles arbitrary updates |

Costs:

| Cost | Explanation |
|---|---|
| Memory overhead | Stores many intermediates |
| Allocation pressure | Tape growth |
| Runtime overhead | Recording operations |

Large tensor systems can spend more memory on saved activations than on parameters themselves.
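A toy scalar tape illustrates the structure (simplified for a single input: each record stores its local derivative directly, whereas real tapes save operands and apply derivative rules during the backward pass):

```python
class Tape:
    def __init__(self):
        self.local_derivs = []   # one local derivative per recorded step

    def square(self, x):
        self.local_derivs.append(2.0 * x)  # d(x*x)/dx = 2x: save now
        return x * x

    def mul_const(self, x, c):
        self.local_derivs.append(c)        # d(c*x)/dx = c
        return c * x

    def backward(self):
        # Chain rule: multiply local derivatives in reverse order.
        g = 1.0
        for d in reversed(self.local_derivs):
            g *= d
        return g

tape = Tape()
x = 3.0
x = tape.square(x)            # 9.0
x = tape.mul_const(x, 2.0)    # 18.0

assert tape.backward() == 12.0  # d(2 * x^2)/dx = 4x = 12 at x = 3
```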

### Functionalization

Functionalization rewrites mutation into immutable updates.

Instead of:

```python
x[i] += y
```

the system internally rewrites:

```python
x2 = update(x, i, x[i] + y)
```

This restores functional semantics.
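A sketch of such an `update` primitive over immutable tuples (illustrative only; real systems functionalize tensor buffers, not Python tuples):

```python
def update(xs, i, value):
    # Pure update: build a new tuple with index i replaced.
    # The old version is never mutated, so both versions remain valid.
    return xs[:i] + (value,) + xs[i + 1:]

x = (1.0, 2.0, 3.0)
x2 = update(x, 1, x[1] + 10.0)   # functionalized form of x[1] += 10.0

assert x == (1.0, 2.0, 3.0)      # original version preserved
assert x2 == (1.0, 12.0, 3.0)    # new version reflects the write
```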

Advantages:

| Benefit | Explanation |
|---|---|
| Simpler AD rules | Pure transformations |
| Easier optimization | Explicit dataflow |
| Safer parallelism | No hidden state |

Disadvantages:

| Cost | Explanation |
|---|---|
| Extra allocations | Immutable copies |
| Loss of locality | More memory movement |
| Potential slowdown | Large tensor duplication |

Modern systems often functionalize internally while exposing mutable syntax externally.

### Versioned Tensors

Some frameworks track tensor versions.

Example:

```python
x += y
```

increments an internal version counter.

If reverse mode later needs the old tensor value, the system checks whether the tensor was modified unsafely.

This detects invalid gradient computations.

Version tracking is common in eager tensor frameworks because it preserves user-facing imperative syntax while protecting reverse-mode correctness.
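A toy version counter sketches the mechanism (the names `VersionedTensor`, `add_`, and `SavedForBackward` are illustrative, loosely modeled on eager-framework behavior):

```python
class VersionedTensor:
    def __init__(self, data):
        self.data = list(data)
        self.version = 0

    def add_(self, other):
        # In-place update: mutate storage and bump the version counter.
        for i, v in enumerate(other):
            self.data[i] += v
        self.version += 1

class SavedForBackward:
    # The forward pass records the version it observed; the backward
    # pass refuses to run against storage that changed afterwards.
    def __init__(self, tensor):
        self.tensor = tensor
        self.expected = tensor.version

    def check(self):
        if self.tensor.version != self.expected:
            raise RuntimeError("tensor modified in-place after being saved")

x = VersionedTensor([1.0, 2.0])
saved = SavedForBackward(x)
x.add_([1.0, 1.0])        # unsafe mutation between forward and backward
try:
    saved.check()
    detected = False
except RuntimeError:
    detected = True
assert detected            # the invalid gradient computation is caught
```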

### Linear and Ownership Types

Ownership systems help manage mutation safely.

A linearly owned value has exactly one owner:

```text
x : Linear<Tensor>
```

If ownership is unique, mutation becomes safer because no aliases exist.

Advantages for AD:

| Property | Benefit |
|---|---|
| No hidden aliases | Easier adjoint reasoning |
| Safe destructive updates | Efficient kernels |
| Better memory reuse | Reduced allocations |
| Predictable lifetimes | Easier checkpointing |

Languages with ownership systems can optimize reverse mode aggressively because they know when values can be safely reused or overwritten.

### Mutation in Tensor Systems

Tensor frameworks rely heavily on mutation internally even when the user API appears functional.

Examples include:

- In-place GPU kernels
- Gradient accumulation buffers
- Optimizer updates
- Memory pooling
- Buffer reuse

An optimizer step:

```python
w -= lr * grad
```

mutates parameter storage.

The AD system must ensure this occurs outside the differentiated region or treat it explicitly as stateful computation.

### Effects and Control Flow

Effects interact with branching and loops.

Example:

```python
if x > 0:
    state += 1
else:
    state -= 1
```

The backward pass depends on:

- Which branch executed
- What state existed before mutation
- Whether state affects future computation

Differentiating such programs requires preserving both control-flow structure and state transitions.
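A hand-written forward/backward pair sketches what must be recorded (hypothetical helpers; a multiplicative branch is used so that the backward pass genuinely needs the saved pre-mutation state):

```python
def forward(x, s):
    # Record which branch ran and the pre-update value of s.
    if x > 0:
        trace = ("then", s)
        s = s * x            # consumes both s and x
    else:
        trace = ("else", s)
        s = s + 1.0
    return s, trace

def backward(x, trace, g_out=1.0):
    # Replay only the branch that actually executed.
    branch, old_s = trace
    if branch == "then":
        return g_out * x, g_out * old_s   # (d out/d s, d out/d x)
    else:
        return g_out, 0.0                 # the else branch ignores x

s_out, trace = forward(2.0, 3.0)
assert s_out == 6.0
g_s, g_x = backward(2.0, trace)
assert (g_s, g_x) == (2.0, 3.0)
```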

### Randomness and Probabilistic Effects

Random sampling is another important effect.

Example:

```python
z = sample_normal(mu, sigma)
```

Sampling is not ordinarily differentiable: the draw is not expressed as a smooth function of its parameters, and its output changes discontinuously with the random seed.

Several approaches exist:

| Method | Idea |
|---|---|
| Reparameterization | Move randomness outside differentiable path |
| Score-function estimators | Differentiate expectation instead |
| Straight-through estimators | Approximate backward behavior |
| Deterministic replay | Reuse random seeds |
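The reparameterization row can be sketched directly: the draw is rewritten as a deterministic function of its parameters plus externally supplied noise (`sample_normal_reparam` is a hypothetical helper):

```python
import random

def sample_normal_reparam(mu, sigma, eps=None):
    # z = mu + sigma * eps, with eps ~ N(0, 1) drawn outside the
    # differentiable path; z is now smooth in mu and sigma.
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    return mu + sigma * eps, eps

z, eps = sample_normal_reparam(mu=0.0, sigma=2.0, eps=0.5)
assert z == 1.0

# Holding eps fixed, the gradients are ordinary partial derivatives:
dz_dmu, dz_dsigma = 1.0, eps
assert (dz_dmu, dz_dsigma) == (1.0, 0.5)
```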

An effect system can track random behavior explicitly:

```text
f : A -> B [Random]
```

This prevents the compiler from incorrectly treating stochastic computations as pure.

### I/O and External State

I/O effects usually lie outside derivative semantics.

Example:

```python
print(loss)
save_model(weights)
```

These operations do not contribute mathematical derivatives.

However, external state can still influence computation:

| Effect | Problem |
|---|---|
| Reading files | Input changes across executions |
| Global variables | Hidden dependencies |
| Device synchronization | Timing-sensitive behavior |
| Distributed state | Non-local mutation |

Differentiable regions are therefore often restricted to pure numerical subgraphs.

### Effect Systems for AD Compilers

An AD compiler benefits from explicit effect tracking.

The compiler can answer:

| Question | Why it matters |
|---|---|
| Is this function pure? | Safe for source transformation |
| Does it mutate memory? | Requires state preservation |
| Does randomness occur? | Requires stochastic semantics |
| Can operations reorder? | Important for reverse correctness |
| Can buffers be reused? | Memory optimization |

Effect-aware optimization is increasingly important in large differentiable systems.

### Checkpointing and Effects

Checkpointing trades memory for recomputation.

Instead of storing all intermediates:

1. Save selected checkpoints.
2. Recompute missing values during reverse mode.

This works best for pure computations.

If mutation or randomness occurs, recomputation may not reproduce the original execution unless:

- State is restored
- Random seeds are replayed
- External effects are isolated

Effects therefore constrain checkpointing strategies.
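Seed replay can be sketched with a stochastic segment that is recomputed from its checkpoint (a toy example using Python's `random.Random`):

```python
import random

def segment(x, seed):
    # A stochastic piece of the forward pass; seeding the generator
    # makes recomputation reproduce the original draw exactly.
    rng = random.Random(seed)
    return x * rng.uniform(0.5, 1.5)

# Forward pass: keep only the checkpoint (input and seed), drop the output.
x0, seed = 2.0, 1234
y = segment(x0, seed)

# Reverse pass: recompute the discarded value from the checkpoint.
# With the seed replayed, recomputation matches the original exactly.
y_recomputed = segment(x0, seed)
assert y_recomputed == y
```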

### Parallelism and Concurrency

Concurrent mutation complicates AD further.

Example:

```python
parallel_for(...)
```

Multiple threads may update shared adjoints.

The AD system must define:

| Issue | Requirement |
|---|---|
| Adjoint accumulation | Associative reduction |
| Ordering | Deterministic or nondeterministic |
| Synchronization | Correct race-free updates |
| Distributed state | Cross-device aggregation |

Effect systems can describe concurrent regions explicitly.
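Adjoint accumulation under concurrency can be sketched with a lock-protected shared buffer; because addition is associative, the accumulation order does not affect the result (up to floating-point rounding):

```python
import threading

adjoint = [0.0]              # shared adjoint buffer
lock = threading.Lock()

def accumulate(contribution):
    # Each thread adds its local adjoint contribution race-free.
    with lock:
        adjoint[0] += contribution

threads = [threading.Thread(target=accumulate, args=(1.0,))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert adjoint[0] == 8.0     # all eight contributions were accumulated
```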

### Referential Transparency Recovery

Many AD systems attempt to recover functional semantics internally even if the surface language is imperative.

This often involves:

| Transformation | Purpose |
|---|---|
| Functionalization | Remove mutation |
| SSA conversion | Make dataflow explicit |
| Alias analysis | Track shared state |
| Effect isolation | Separate differentiable regions |
| Purity inference | Enable optimization |

The resulting IR behaves more like a pure functional program even when the original source was imperative.
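The earlier imperative example (`x = 3; x = x * 2; x = x + 1`) illustrates SSA conversion: every assignment introduces a fresh name, so each intermediate stays addressable for the reverse pass:

```python
# SSA form of: x = 3; x = x * 2; x = x + 1
x0 = 3
x1 = x0 * 2   # the value 6 now has its own name and cannot be lost
x2 = x1 + 1

assert (x0, x1, x2) == (3, 6, 7)
```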

### Why Effect Systems Matter

Without explicit effect handling, AD systems become difficult to reason about.

Incorrect handling of mutation can produce:

- Wrong gradients
- Silent corruption
- Invalid checkpoint recomputation
- Nondeterministic derivatives
- Excessive memory retention

As differentiable programming expands into systems software, databases, simulators, robotics, and distributed infrastructure, these issues become central.

### Broader Perspective

Effect systems and mutation handling ultimately determine whether differentiation behaves as a mathematically meaningful transformation or merely as a runtime heuristic.

Pure functional semantics make differentiation elegant:

$$
D(f \circ g)
=
D(f) \circ D(g)
$$

Imperative programs introduce time, state, aliasing, and side effects. A differentiable programming language must therefore explain not only how values compute, but also how state evolves across execution.

Effect systems provide the framework for expressing those semantics explicitly.

