Effect Systems and Mutation

Automatic differentiation is easiest to define for pure functions. A pure function behaves like a mathematical mapping: it consumes inputs, produces outputs, and has no observable side effects. Real programs are rarely this simple. They mutate arrays, allocate memory, read state, launch kernels, perform I/O, sample random values, and synchronize across devices.

An effect system describes these behaviors explicitly. It tracks what a computation does in addition to what value it returns. This becomes important for automatic differentiation because effects change the meaning of the derivative transformation.

Pure Computation and the Chain Rule

Consider a pure function:

$$ f : X \rightarrow Y $$

The derivative transformation:

$$ D(f) $$

can be defined compositionally.

If:

$$ f = h \circ g $$

then:

$$ D(f) = D(h) \circ D(g) $$

This works because the computation has no hidden state. Every dependency is visible in the expression graph.
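As a concrete, framework-agnostic sketch, the composition rule can be written as forward-mode Python, with hand-written derivative functions dg and dh standing in for D(g) and D(h):

```python
# Minimal sketch (not any particular framework's API): forward-mode
# derivatives compose exactly like the functions themselves.

def g(x):
    return x * x

def dg(x, dx):            # JVP of g: tangent of x*x is 2*x*dx
    return 2 * x * dx

def h(y):
    return 3 * y

def dh(y, dy):            # JVP of h: tangent of 3*y is 3*dy
    return 3 * dy

def f(x):                 # f = h . g
    return h(g(x))

def df(x, dx):            # D(f) = D(h) . D(g): compose the JVPs
    return dh(g(x), dg(x, dx))

print(f(2.0), df(2.0, 1.0))   # 12.0 12.0  (d/dx of 3*x^2 at x=2 is 12)
```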

Mutation breaks this simplicity because values can change over time.

Mutation Changes Semantics

Suppose:

x = [1.0, 2.0]
y = x
x[0] = 5.0

Now y also changes because x and y alias the same storage.
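The same aliasing is easy to reproduce with ordinary Python lists:

```python
x = [1.0, 2.0]
y = x          # y is another name for the same storage, not a copy
x[0] = 5.0
print(y[0])    # 5.0 -- the old value 1.0 is gone unless it was saved
```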

A derivative system must answer:

  • Which version of x does the gradient refer to?
  • Was the original value needed later?
  • Should the reverse pass reconstruct old state?
  • Is the mutation observable outside the differentiated region?

These are semantic questions, not merely implementation details.

Effects as Typed Behavior

An effect system augments function types with behavioral information.

Instead of:

f : A -> B

the language may track:

f : A -> B [Pure]

or:

g : A -> B [State]

Possible effects include:

| Effect | Meaning |
| --- | --- |
| Pure | No observable side effects |
| State | Mutation of memory |
| IO | External interaction |
| Random | Random number generation |
| Exception | Non-local control flow |
| Async | Concurrent execution |
| Device | GPU or accelerator execution |

Differentiation semantics depend strongly on these effects.
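As an illustration only (this is not a feature of any particular language), a toy effect annotation can be approximated in Python with a decorator, and an AD front end could refuse to transform functions whose effect is not Pure:

```python
# Hypothetical sketch: tag Python functions with an effect label so a
# toy AD pipeline can reject non-pure functions before transforming them.
def effect(label):
    def tag(fn):
        fn.effect = label
        return fn
    return tag

@effect("Pure")
def square(x):
    return x * x

@effect("State")
def accumulate(buf, x):
    buf.append(x)          # mutation: an observable State effect
    return sum(buf)

def differentiate(fn):
    if getattr(fn, "effect", "Pure") != "Pure":
        raise TypeError(f"cannot source-transform {fn.__name__}: effect {fn.effect}")
    return fn              # placeholder: a real system would return the transformed function
```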

Mutation in Forward Mode

Forward mode is relatively tolerant of mutation because derivatives propagate alongside primal execution.

Example:

x = Dual(2.0, 1.0)
x = x * x

The primal and tangent update together.
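A minimal dual-number sketch (illustrative, not any framework's API) shows why: rebinding x replaces both components at once, so the tangent always matches the current primal.

```python
# Minimal dual-number sketch for forward mode.
class Dual:
    def __init__(self, primal, tangent):
        self.primal, self.tangent = primal, tangent

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.primal * other.primal,
                    self.tangent * other.primal + self.primal * other.tangent)

x = Dual(2.0, 1.0)
x = x * x                     # rebinding x: primal and tangent move together
print(x.primal, x.tangent)    # 4.0 4.0  (d/dx of x^2 at x=2 is 4)
```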

Problems appear when:

  • Aliasing exists
  • Mutation order matters
  • Shared buffers are updated
  • State escapes the local computation

Even then, forward mode usually follows ordinary execution naturally.

Mutation in Reverse Mode

Reverse mode is harder because the backward pass occurs after forward execution.

Suppose:

x = 3
x = x * 2
x = x + 1

The reverse pass must propagate gradients through earlier states of x.

Conceptually:

| Step | Primal value |
| --- | --- |
| Initial | 3 |
| After multiply | 6 |
| After add | 7 |

Backward propagation must be able to refer to these earlier versions of x, such as the intermediate value 6, rather than only to the final value 7.

If a derivative rule depends on an overwritten value (as it would for a nonlinear update such as x = x * x), reverse mode fails unless the system saved enough information to recover it.

The State Reconstruction Problem

Reverse mode often needs to reconstruct earlier program state.

This creates the central problem of mutation-aware AD:

| Requirement | Reason |
| --- | --- |
| Preserve overwritten values | Needed for derivative rules |
| Preserve execution order | Reverse pass depends on it |
| Preserve aliasing semantics | Shared state affects gradients |
| Preserve control decisions | Branches determine path |

Several implementation strategies exist.

Tape-Based State Saving

A tape records operations and saved values during the forward pass.

Example:

store old x = 3
x = x * 2

store old x = 6
x = x + 1

The backward pass replays derivative rules using saved intermediates.
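A minimal sketch of this idea in Python (hypothetical helper names, not a real framework) records each operation and its operands during the forward pass, then walks the tape in reverse:

```python
# Minimal tape sketch: the forward pass records each operation together
# with the values its derivative rule will need.
tape = []

def mul(a, b):
    tape.append(("mul", a, b))     # save operands for the backward rule
    return a * b

def add(a, b):
    tape.append(("add", a, b))
    return a + b

x = 3.0
x = mul(x, 2.0)
x = add(x, 1.0)

# Backward pass: walk the tape in reverse, applying derivative rules
# with respect to the first operand.
grad = 1.0
for op, a, b in reversed(tape):
    if op == "mul":
        grad = grad * b            # d(a*b)/da = b, using the saved b
    elif op == "add":
        grad = grad * 1.0          # d(a+b)/da = 1

print(grad)                        # 2.0 = d/dx of (2*x + 1)
```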

Advantages:

| Benefit | Explanation |
| --- | --- |
| Correct reverse semantics | Exact replay |
| Dynamic control flow | Works naturally |
| Flexible mutation | Handles arbitrary updates |

Costs:

| Cost | Explanation |
| --- | --- |
| Memory overhead | Stores many intermediates |
| Allocation pressure | Tape growth |
| Runtime overhead | Recording operations |

Large tensor systems can spend more memory on saved activations than on parameters themselves.

Functionalization

Functionalization rewrites mutation into immutable updates.

Instead of:

x[i] += y

the system internally rewrites:

x2 = update(x, i, x[i] + y)

This restores functional semantics.
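A small Python sketch, with update as a hypothetical helper, makes the rewrite concrete:

```python
# Sketch of functionalization: the hypothetical update() returns a fresh
# value instead of mutating its argument.
def update(xs, i, value):
    out = list(xs)        # copy, leaving the original untouched
    out[i] = value
    return out

x = [1.0, 2.0]
y = 3.0
x2 = update(x, 0, x[0] + y)   # functional form of: x[0] += y
print(x, x2)                  # [1.0, 2.0] [4.0, 2.0]
```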

Advantages:

| Benefit | Explanation |
| --- | --- |
| Simpler AD rules | Pure transformations |
| Easier optimization | Explicit dataflow |
| Safer parallelism | No hidden state |

Disadvantages:

| Cost | Explanation |
| --- | --- |
| Extra allocations | Immutable copies |
| Loss of locality | More memory movement |
| Potential slowdown | Large tensor duplication |

Modern systems often functionalize internally while exposing mutable syntax externally.

Versioned Tensors

Some frameworks track tensor versions.

Example:

x += y

increments an internal version counter.

If reverse mode later needs the old tensor value, the system checks whether the tensor was modified unsafely.

This detects invalid gradient computations.

Version tracking is common in eager tensor frameworks because it preserves user-facing imperative syntax while protecting reverse-mode correctness.
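A simplified sketch of the mechanism (class and function names here are hypothetical, loosely modeled on how eager frameworks guard saved values):

```python
# Sketch of version tracking: saving a value for backward remembers the
# version it had; a later mutation invalidates that saved reference.
class VersionedBuffer:
    def __init__(self, data):
        self.data = data
        self.version = 0

    def add_(self, other):          # an update bumps the version counter
        self.data = self.data + other
        self.version += 1
        return self

def save_for_backward(buf):
    return (buf, buf.version)       # remember which version was saved

def check_saved(saved):
    buf, version = saved
    if buf.version != version:
        raise RuntimeError("buffer was modified after being saved "
                           "for the backward pass")
    return buf.data

x = VersionedBuffer(2.0)
saved = save_for_backward(x)
x.add_(1.0)                         # unsafe mutation
check_saved(saved)                  # raises: the gradient would be wrong
```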

Linear and Ownership Types

Ownership systems help manage mutation safely.

A linearly owned value has exactly one owner:

x : Linear<Tensor>

If ownership is unique, mutation becomes safer because no aliases exist.

Advantages for AD:

| Property | Benefit |
| --- | --- |
| No hidden aliases | Easier adjoint reasoning |
| Safe destructive updates | Efficient kernels |
| Better memory reuse | Reduced allocations |
| Predictable lifetimes | Easier checkpointing |

Languages with ownership systems can optimize reverse mode aggressively because they know when values can be safely reused or overwritten.
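Python has no linear types, but a rough sketch of the discipline, using a hypothetical Linear wrapper that can be consumed exactly once, conveys the idea:

```python
# Hypothetical sketch of linear ownership: a value may be taken exactly
# once, so reusing its storage in place cannot break any alias.
class Linear:
    def __init__(self, value):
        self._value = value
        self._consumed = False

    def take(self):
        if self._consumed:
            raise RuntimeError("linear value used more than once")
        self._consumed = True
        return self._value

buf = Linear([1.0, 2.0, 3.0])
data = buf.take()      # unique ownership transferred; no aliases remain
data[0] = 5.0          # destructive update is safe: nothing else can observe it
# buf.take()           # would raise: the value was already consumed
```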

Mutation in Tensor Systems

Tensor frameworks rely heavily on mutation internally even when the user API appears functional.

Examples include:

  • In-place GPU kernels
  • Gradient accumulation buffers
  • Optimizer updates
  • Memory pooling
  • Buffer reuse

An optimizer step:

w -= lr * grad

mutates parameter storage.

The AD system must ensure this occurs outside the differentiated region or treat it explicitly as stateful computation.
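For example, eager frameworks such as PyTorch expose a context manager that keeps such updates out of gradient recording; a minimal sketch of one training step:

```python
# Sketch using PyTorch-style eager AD: the in-place optimizer update is
# performed under torch.no_grad() so it stays outside the recorded graph.
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()                 # populates w.grad

lr = 0.1
with torch.no_grad():           # stateful update, excluded from AD
    w -= lr * w.grad
w.grad = None                   # reset the accumulation buffer for the next step
```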

Effects and Control Flow

Effects interact with branching and loops.

Example:

if x > 0:
    state += 1
else:
    state -= 1

The backward pass depends on:

  • Which branch executed
  • What state existed before mutation
  • Whether state affects future computation

Differentiating such programs requires preserving both control-flow structure and state transitions.
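One way to make this concrete is to record the branch taken and the overwritten state on a tape during the forward pass; a small sketch (hypothetical structure, not a real framework):

```python
# Sketch: the forward pass records which branch ran and the state it
# overwrote, so the backward pass can follow the same path.
def step(x, state, tape):
    if x > 0:
        tape.append(("then", state))   # branch taken + old state
        state = state + 1
    else:
        tape.append(("else", state))
        state = state - 1
    return state, tape

state, tape = step(2.0, 0.0, [])
branch, old_state = tape[-1]
# The reverse pass dispatches on `branch` and can restore `old_state`.
```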

Randomness and Probabilistic Effects

Random sampling is another important effect.

Example:

z = sample_normal(mu, sigma)

Sampling is not ordinarily differentiable: the drawn value is not a deterministic function of mu and sigma alone, and it changes with the random seed.

Several approaches exist:

| Method | Idea |
| --- | --- |
| Reparameterization | Move randomness outside differentiable path |
| Score-function estimators | Differentiate expectation instead |
| Straight-through estimators | Approximate backward behavior |
| Deterministic replay | Reuse random seeds |
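The first of these, reparameterization, can be sketched in a few lines: the random draw eps carries no parameters, so the sample becomes a deterministic, differentiable function of mu and sigma.

```python
# Sketch of reparameterization: randomness is drawn once, outside the
# differentiable path.
import random

def sample_normal_reparam(mu, sigma, eps=None):
    if eps is None:
        eps = random.gauss(0.0, 1.0)   # parameter-free, non-differentiable draw
    return mu + sigma * eps            # differentiable in mu and sigma

z = sample_normal_reparam(0.0, 1.0)
```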

An effect system can track random behavior explicitly:

f : A -> B [Random]

This prevents the compiler from incorrectly treating stochastic computations as pure.

I/O and External State

I/O effects usually lie outside derivative semantics.

Example:

print(loss)
save_model(weights)

These operations do not contribute mathematical derivatives.

However, external state can still influence computation:

| Effect | Problem |
| --- | --- |
| Reading files | Input changes across executions |
| Global variables | Hidden dependencies |
| Device synchronization | Timing-sensitive behavior |
| Distributed state | Non-local mutation |

Differentiable regions are therefore often restricted to pure numerical subgraphs.

Effect Systems for AD Compilers

An AD compiler benefits from explicit effect tracking.

The compiler can answer:

| Question | Why it matters |
| --- | --- |
| Is this function pure? | Safe for source transformation |
| Does it mutate memory? | Requires state preservation |
| Does randomness occur? | Requires stochastic semantics |
| Can operations reorder? | Important for reverse correctness |
| Can buffers be reused? | Memory optimization |

Effect-aware optimization is increasingly important in large differentiable systems.

Checkpointing and Effects

Checkpointing trades memory for recomputation.

Instead of storing all intermediates:

  1. Save selected checkpoints.
  2. Recompute missing values during reverse mode.

This works best for pure computations.

If mutation or randomness occurs, recomputation may not reproduce the original execution unless:

  • State is restored
  • Random seeds are replayed
  • External effects are isolated

Effects therefore constrain checkpointing strategies.
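A toy sketch of a checkpointed segment that replays its random seed during recomputation (hypothetical structure, for illustration only):

```python
# Sketch of checkpointing with deterministic replay: only the segment
# input and the RNG seed are saved; intermediates are recomputed in the
# backward pass instead of being stored.
import random

def segment(x, seed):
    rng = random.Random(seed)          # replayable randomness
    noise = rng.gauss(0.0, 1.0)
    return (x + noise) ** 2

def forward_checkpointed(x):
    seed = 1234
    y = segment(x, seed)
    return y, (x, seed)                # checkpoint: input + seed only

def backward_checkpointed(ckpt, grad_y):
    x, seed = ckpt
    rng = random.Random(seed)
    noise = rng.gauss(0.0, 1.0)        # recomputation reproduces the same draw
    return grad_y * 2 * (x + noise)    # d/dx of (x + noise)^2

y, ckpt = forward_checkpointed(3.0)
grad_x = backward_checkpointed(ckpt, 1.0)
```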

Parallelism and Concurrency

Concurrent mutation complicates AD further.

Example:

parallel_for(...)

Multiple threads may update shared adjoints.

The AD system must define:

| Issue | Requirement |
| --- | --- |
| Adjoint accumulation | Associative reduction |
| Ordering | Deterministic or nondeterministic |
| Synchronization | Correct race-free updates |
| Distributed state | Cross-device aggregation |

Effect systems can describe concurrent regions explicitly.
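A minimal Python sketch of race-free adjoint accumulation using a lock (illustrative only):

```python
# Sketch: concurrent contributions to a shared adjoint are combined with
# an associative reduction under synchronization.
import threading

adjoint = 0.0
lock = threading.Lock()

def accumulate(local_grad):
    global adjoint
    with lock:                  # race-free accumulation of one contribution
        adjoint += local_grad

threads = [threading.Thread(target=accumulate, args=(g,)) for g in (0.1, 0.2, 0.3)]
for t in threads: t.start()
for t in threads: t.join()
print(adjoint)                  # sum of all contributions; floating-point
                                # rounding can still depend on the order
```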

Referential Transparency Recovery

Many AD systems attempt to recover functional semantics internally even if the surface language is imperative.

This often involves:

| Transformation | Purpose |
| --- | --- |
| Functionalization | Remove mutation |
| SSA conversion | Make dataflow explicit |
| Alias analysis | Track shared state |
| Effect isolation | Separate differentiable regions |
| Purity inference | Enable optimization |

The resulting IR behaves more like a pure functional program even when the original source was imperative.

Why Effect Systems Matter

Without explicit effect handling, AD systems become difficult to reason about.

Incorrect handling of mutation can produce:

  • Wrong gradients
  • Silent corruption
  • Invalid checkpoint recomputation
  • Nondeterministic derivatives
  • Excessive memory retention

As differentiable programming expands into systems software, databases, simulators, robotics, and distributed infrastructure, these issues become central.

Broader Perspective

Effect systems and mutation handling ultimately determine whether differentiation behaves as a mathematically meaningful transformation or merely as a runtime heuristic.

Pure functional semantics make differentiation elegant:

$$ D(f \circ g) = D(f) \circ D(g) $$

Imperative programs introduce time, state, aliasing, and side effects. A differentiable programming language must therefore explain not only how values are computed, but also how state evolves across execution.

Effect systems provide the framework for expressing those semantics explicitly.