Automatic differentiation works naturally on pure mathematical functions:

$$y = f(x)$$

The function receives inputs, computes outputs, and produces no persistent side effects. Most theoretical treatments of AD assume this model.
Large real systems rarely behave this way.
Operating systems, databases, distributed services, simulations, robotics stacks, game engines, browsers, network services, and long-running machine learning pipelines all contain substantial mutable state. Their behavior depends not only on explicit inputs, but also on internal memory, caches, queues, synchronization, randomness, external events, and temporal ordering.
Differentiating such systems remains one of the major open problems in automatic differentiation.
What Is a Stateful System?
A stateful system maintains internal state across time.
Instead of:

$$y = f(x)$$

the system behaves more like:

$$(s_{t+1},\, y_t) = f(s_t, x_t)$$

where:

| Symbol | Meaning |
|---|---|
| $s_t$ | internal state at time $t$ |
| $x_t$ | external input |
| $y_t$ | output |
| $f$ | transition function |
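To make the contrast concrete, here is a minimal Python sketch; `Counter` and `step` are illustrative names, not taken from any particular system:

```python
# Minimal contrast between a pure function and a stateful system.

def pure(x):
    return 2 * x                  # y = f(x): no memory between calls

class Counter:
    def __init__(self):
        self.s = 0                # internal state s_t

    def step(self, x):
        self.s = self.s + x       # s_{t+1} = f(s_t, x_t)
        return self.s             # y_t depends on the whole history

c = Counter()
print(pure(3), pure(3))           # 6 6: same input, same output
print(c.step(3), c.step(3))       # 3 6: same input, different outputs
```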
Examples include:
| System | Internal state |
|---|---|
| database | pages, indexes, transactions |
| web server | connection pools, caches |
| robot controller | sensor history, control state |
| physics engine | world state |
| distributed system | replicated logs, consensus state |
| neural network training loop | optimizer moments, parameter buffers |
| browser engine | DOM tree, rendering state |
| operating system | scheduler queues, memory tables |
The derivative now depends on the evolution of state over time, not only on one isolated function evaluation.
Reverse Mode Across Time
Suppose a system evolves for $T$ steps:

$$s_{t+1} = f(s_t, x_t), \qquad t = 0, 1, \dots, T-1$$

A scalar loss depends on the final state:

$$L = \ell(s_T)$$

Reverse mode computes gradients backward through time:

$$\bar{s}_t = \left(\frac{\partial f}{\partial s}(s_t, x_t)\right)^{\!\top} \bar{s}_{t+1}, \qquad \bar{s}_T = \frac{\partial L}{\partial s_T}$$
This resembles recurrent neural network backpropagation.
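A minimal scalar sketch of this backward recursion, assuming the transition `f` and its state derivative `df_ds` are supplied by hand (a real AD system would record them on a tape):

```python
# Minimal scalar sketch of reverse mode through T state transitions.
# For vector states, df_ds would return a transposed Jacobian.

def forward(f, s0, xs):
    """Forward pass, storing every intermediate state for the tape."""
    states = [s0]
    for x in xs:
        states.append(f(states[-1], x))
    return states

def backward(df_ds, states, xs, dL_dsT):
    """Propagate the adjoint of the final state back to s_0."""
    s_bar = dL_dsT
    for t in reversed(range(len(xs))):
        # chain rule: s_bar_t = (df/ds)(s_t, x_t) * s_bar_{t+1}
        s_bar = df_ds(states[t], xs[t]) * s_bar
    return s_bar

# usage: s_{t+1} = a * s_t + x_t, so dL/ds_0 = a**3 for three steps
a = 0.9
states = forward(lambda s, x: a * s + x, 1.0, [0.1, 0.2, 0.3])
print(backward(lambda s, x: a, states, [0.1, 0.2, 0.3], 1.0))  # 0.729
```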
However, large stateful systems introduce much harder conditions:
| Neural network | Large stateful system |
|---|---|
| mostly tensor operations | heterogeneous operations |
| explicit training graph | implicit evolving state |
| bounded semantics | external side effects |
| controlled primitives | arbitrary program behavior |
| accelerator-oriented | distributed heterogeneous runtime |
The problem becomes less like differentiating a tensor graph and more like differentiating a live computational process.
Mutation
Mutation is the first major obstacle.
Consider:

```python
buffer[i] += x
```

The reverse pass may require:

- the previous value of `buffer[i]`
- the index `i`
- the ordering of writes
- alias relationships
If the same memory is modified repeatedly:

```python
for i in range(n):
    state = update(state, data[i])
```

reverse mode may need all intermediate states:

$$s_0, s_1, \dots, s_n$$

Storing every state can become impossible.
This creates the central memory problem of stateful reverse mode.
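One way to see the problem: a hand-rolled tape must snapshot every overwritten value before mutation. The sketch below is a toy illustration; `tracked_add` is a hypothetical helper, not a real library call:

```python
# Toy sketch: before each in-place write, record enough information
# (index and old value) for the reverse pass to restore the buffer.

tape = []

def tracked_add(buffer, i, x):
    tape.append((i, buffer[i]))   # snapshot the overwritten value
    buffer[i] += x

def unwind(buffer):
    """Restore the buffer to its pre-mutation state, newest write first."""
    while tape:
        i, old = tape.pop()
        buffer[i] = old

buf = [1.0, 2.0]
tracked_add(buf, 0, 5.0)
tracked_add(buf, 0, 7.0)          # same cell mutated twice
unwind(buf)
print(buf)                        # [1.0, 2.0]: history fully restored
```

The tape grows with every write, which is exactly the memory cost described above.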
Temporal Depth
Long-running systems can evolve for millions or billions of state transitions.
Examples:
| System | Temporal scale |
|---|---|
| database server | months |
| robot controller | continuous |
| online recommender | indefinite |
| distributed cache | persistent |
| simulation | millions of timesteps |
Reverse mode over an indefinitely evolving process is fundamentally difficult because backward propagation requires access to earlier computational history.
A naive tape model becomes infeasible.
Checkpointing Limits
Checkpointing reduces storage by saving selected states and recomputing others.
For a short numerical simulation this works well.
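A minimal sketch of checkpoint-and-recompute, assuming the transition `f` replays deterministically, which is exactly the assumption that fails below:

```python
# Sketch of checkpoint-and-recompute. Correctness hinges on `f`
# replaying deterministically from a saved state.

def forward_with_checkpoints(f, s0, xs, every=100):
    checkpoints = {0: s0}
    s = s0
    for t, x in enumerate(xs):
        s = f(s, x)
        if (t + 1) % every == 0:
            checkpoints[t + 1] = s    # keep only every k-th state
    return s, checkpoints

def restore(f, checkpoints, xs, t, every=100):
    """Recompute s_t from the nearest earlier checkpoint."""
    base = (t // every) * every
    s = checkpoints[base]
    for k in range(base, t):
        s = f(s, xs[k])               # diverges if f is nondeterministic
    return s
```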
Large stateful systems introduce new problems:
| Issue | Consequence |
|---|---|
| nondeterminism | recomputation may diverge |
| external events | replay impossible |
| concurrency | ordering may differ |
| random scheduling | execution path unstable |
| network timing | state evolution changes |
Recomputation assumes the system can replay identically. Many real systems cannot guarantee this.
External Side Effects
Stateful systems interact with the outside world.
Examples:
```python
send_packet(data)
write_to_disk(page)
emit_event(msg)
```

These operations are not ordinary differentiable transformations.
Questions immediately appear:
| Question | Difficulty |
|---|---|
| what is the derivative of a disk write? | unclear semantics |
| what is the derivative of network latency? | stochastic external process |
| what is the derivative of cache eviction? | discontinuous policy |
| what is the derivative of scheduler behavior? | nondeterministic execution |
Most AD systems avoid these questions by treating such operations as nondifferentiable boundaries.
This is acceptable for neural networks. It is not sufficient for general differentiable systems.
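For instance, in JAX one can mark such an operation as a boundary with a custom VJP whose backward rule returns zero; `send_packet` here is a hypothetical stand-in for a real I/O call:

```python
import jax
import jax.numpy as jnp

# Sketch: a side-effecting op treated as a nondifferentiable boundary.

@jax.custom_vjp
def send_packet(x):
    jax.debug.print("sending payload {}", x)  # the side effect
    return x

def _send_fwd(x):
    return send_packet(x), None

def _send_bwd(_res, g):
    return (jnp.zeros_like(g),)  # gradient is cut at the boundary

send_packet.defvjp(_send_fwd, _send_bwd)
```

Cutting the gradient keeps the program differentiable, but any influence the side effect has on the loss is invisible to the optimizer.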
Hidden State
Large systems often contain hidden state not visible in the user program.
Examples include:
- allocator metadata
- kernel page cache
- GPU runtime queues
- filesystem buffers
- thread scheduling state
- hardware prefetch behavior
A system’s behavior may depend on these hidden states even if the program source does not expose them.
This creates a semantic problem:
The observed output may depend on implementation details outside the formal computational graph.
Differentiable Databases
Databases illustrate the challenge clearly.
A query system contains:
| Component | Stateful behavior |
|---|---|
| buffer manager | page cache |
| optimizer | adaptive plans |
| transaction manager | locks and MVCC |
| storage engine | mutable disk layout |
| replication | distributed logs |
Suppose one wishes to optimize latency with respect to tunable system parameters $\theta$:

$$\min_\theta \ \mathbb{E}\big[\mathrm{latency}(\theta)\big]$$
Latency depends on:
- cache state
- branch prediction
- disk layout
- concurrent traffic
- scheduler behavior
- network contention
The resulting function is noisy, discontinuous, and history-dependent.
Differentiating such systems requires combining AD with stochastic modeling, systems instrumentation, and probabilistic reasoning.
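Absent usable local derivatives, one pragmatic stand-in is the finite-perturbation interpretation discussed later: perturb a parameter, measure repeatedly, and average. A toy sketch, where `measure_latency` and its internal model are hypothetical stand-ins for real instrumentation:

```python
import random
import statistics

# Finite-perturbation sketch: estimate d(latency)/d(cache_size) by
# measuring a noisy system at two settings and averaging.

def measure_latency(cache_size, rng):
    base = 10.0 / (1.0 + cache_size)       # toy caching benefit
    return base + rng.gauss(0.0, 0.05)     # measurement noise

def sensitivity(cache_size, h=1.0, trials=200, seed=0):
    rng = random.Random(seed)
    lo = statistics.mean(measure_latency(cache_size, rng) for _ in range(trials))
    hi = statistics.mean(measure_latency(cache_size + h, rng) for _ in range(trials))
    return (hi - lo) / h

print(sensitivity(4.0))   # noisy estimate of the local slope
```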
Distributed Systems
Distributed systems create even harder problems.
A distributed system evolves under:

$$s_{t+1} = f(s_t, x_t, m_t, \sigma_t)$$

where:

| Symbol | Meaning |
|---|---|
| $m_t$ | message arrivals |
| $\sigma_t$ | scheduling decisions |
The system state depends on message timing and asynchronous interleavings.
Small perturbations may change:
- ordering of events
- consensus leader election
- retry behavior
- timeout triggers
- partition recovery
The execution path may bifurcate discontinuously.
Classical reverse mode assumes stable local derivatives. Distributed systems often violate that assumption.
Differentiable Simulators
Physics simulators are among the most successful large stateful differentiable systems so far.
A simulator evolves the world state through discrete timesteps:

$$s_{t+1} = f(s_t, x_t)$$
Reverse mode propagates adjoints through time integration.
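A minimal hand-written example of such adjoint propagation, using explicit Euler steps for a damped particle (the dynamics and constants are illustrative):

```python
# Hand-written adjoints through explicit Euler integration of a
# damped particle: x' = v, v' = -drag * v.

def simulate(v0, dt=0.01, steps=100, drag=0.5):
    x, v = 0.0, v0
    for _ in range(steps):
        x, v = x + v * dt, v * (1.0 - drag * dt)
    return x

def grad_x_wrt_v0(dt=0.01, steps=100, drag=0.5):
    """Adjoint recursion run backward through the same update rule."""
    x_bar, v_bar = 1.0, 0.0           # seed: dL/dx_T = 1
    for _ in range(steps):
        # reverse of (x, v) <- (x + v*dt, v*(1 - drag*dt))
        v_bar = v_bar * (1.0 - drag * dt) + x_bar * dt
    return v_bar

eps = 1e-6                            # finite-difference check
print(grad_x_wrt_v0(), (simulate(1.0 + eps) - simulate(1.0)) / eps)
```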
Differentiable simulators work because:
| Property | Benefit |
|---|---|
| deterministic dynamics | replay possible |
| explicit state | visible transitions |
| mathematical structure | known local derivatives |
| controlled operations | stable semantics |
Even here, problems appear:
- collisions introduce discontinuities
- contact solvers are iterative
- friction models may be nonsmooth
- adaptive timesteps alter trajectories
Large engineering simulators remain difficult to differentiate robustly.
Concurrency
Concurrency fundamentally complicates AD.
Suppose two threads update shared state:

```
thread A: x += 1
thread B: x *= 2
```

The final state depends on execution ordering: if A runs first the result is $2(x + 1)$; if B runs first it is $2x + 1$.
Reverse mode would need to know:
- exact interleaving
- memory visibility rules
- synchronization ordering
- atomic semantics
The derivative is therefore partly a property of the scheduler.
Most AD systems implicitly assume sequential execution. Extending reverse mode to concurrent systems remains largely unresolved.
Sparse Credit Assignment
Large stateful systems often contain delayed effects.
A decision made at time $t$ may affect outcomes at a much later time $t + k$.

This creates a long-horizon credit assignment problem:

$$\frac{\partial L_{t+k}}{\partial a_t} \;=\; \frac{\partial L_{t+k}}{\partial s_{t+k}} \left( \prod_{j=t+2}^{t+k} \frac{\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_{t+1}}{\partial a_t}$$
Examples:
| Domain | Long-range effect |
|---|---|
| database caching | later query latency |
| network routing | future congestion |
| compiler optimization | downstream runtime |
| operating systems | future scheduling behavior |
| autonomous systems | delayed task success |
Backpropagating through extremely long horizons is computationally and numerically unstable.
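A quick illustration of why: the gradient across a horizon of $T$ steps scales with a product of $T$ local Jacobian factors, so even mild contraction or expansion compounds exponentially. A one-dimensional sketch:

```python
# One-dimensional sketch: the long-horizon gradient is a product of
# T local Jacobian factors, so it vanishes or explodes exponentially.

def horizon_gradient(local_jacobian, T):
    g = 1.0
    for _ in range(T):
        g *= local_jacobian
    return g

print(horizon_gradient(0.999, 10_000))   # ~4.5e-05 (vanishes)
print(horizon_gradient(1.001, 10_000))   # ~2.2e+04 (explodes)
```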
Hybrid Symbolic and Learned Systems
Modern systems increasingly combine:
- neural components
- rule engines
- databases
- search systems
- planners
- simulators
Example:
```
retrieval
  -> ranking
  -> planning
  -> execution
  -> environment interaction
```

Some components are differentiable. Others are discrete or symbolic.
Differentiating across such hybrid boundaries remains an open research area.
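One common partial workaround is a surrogate-gradient (straight-through-style) scheme: keep the hard discrete choice in the forward pass and substitute a smoothed Jacobian in the backward pass. A sketch, with a softmax standing in for `argmax`:

```python
import numpy as np

# Surrogate-gradient sketch: a hard argmax in the forward pass, the
# Jacobian of a softmax as a smoothed stand-in in the backward pass.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hard_choice(scores):
    onehot = np.zeros_like(scores)
    onehot[np.argmax(scores)] = 1.0
    return onehot                      # discrete, nondifferentiable

def backward_surrogate(grad_out, scores):
    p = softmax(scores)
    jac = np.diag(p) - np.outer(p, p)  # softmax Jacobian (symmetric)
    return jac @ grad_out              # pretend argmax were softmax
```

The surrogate gradient is biased by construction; it trades correctness for a usable optimization signal across the boundary.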
Semantic Boundaries
A deeper issue is semantic meaning.
For a tensor operation:

$$y = Wx$$

the derivative is mathematically clear.
For a database transaction scheduler:
what exactly should the derivative mean?
Possible interpretations include:
| Interpretation | Meaning |
|---|---|
| local sensitivity | infinitesimal perturbation |
| expected effect | stochastic average |
| policy gradient | optimization signal |
| finite perturbation | empirical response |
| relaxed continuous model | smoothed approximation |
Different interpretations lead to different “gradients.”
The problem becomes partly philosophical: what does differentiability mean for complex computational systems?
Possible Directions
Several research directions appear promising.
Differentiable State Abstractions
Instead of differentiating raw mutable systems, one may differentiate structured state models with an explicit transition:

$$s_{t+1} = f_\theta(s_t, x_t)$$
This reduces hidden behavior and makes adjoint propagation explicit.
Event-Sourced Systems
Event sourcing naturally preserves history:
```python
state = replay(events)
```

Because prior states can be reconstructed from event logs, reverse replay may become easier.
This is one reason differentiable event-driven architectures are interesting research targets.
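A toy illustration of why event sourcing helps: state is a pure fold over an append-only log, so any historical state can be rebuilt exactly (the event names here are invented for the example):

```python
# Toy event-sourced store: replaying a prefix of the log reconstructs
# any earlier state without having stored it.

def apply_event(state, event):
    kind, amount = event
    if kind == "deposit":
        return state + amount
    if kind == "withdraw":
        return state - amount
    return state

def replay(events, upto=None):
    state = 0
    for event in (events if upto is None else events[:upto]):
        state = apply_event(state, event)
    return state

log = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
print(replay(log))           # 75: current state
print(replay(log, upto=1))   # 100: state after the first event
```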
Probabilistic Differentiation
Instead of exact trajectories, differentiate expectations:

$$\nabla_\theta\, \mathbb{E}_{\tau \sim p_\theta}\big[L(\tau)\big] \;=\; \mathbb{E}_{\tau \sim p_\theta}\big[L(\tau)\, \nabla_\theta \log p_\theta(\tau)\big]$$
This connects AD with stochastic control and reinforcement learning.
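For example, the score-function (REINFORCE) identity above can be estimated by sampling alone, with no derivative of $L$ required. A sketch with a Gaussian trajectory distribution and a stand-in loss:

```python
import numpy as np

# Score-function (REINFORCE) gradient estimate. The "trajectory" is a
# Gaussian sample with mean theta; `loss` stands in for a loss measured
# by actually running the system.

def loss(tau):
    return float(np.sum(tau ** 2))

def grad_estimate(theta, sigma=0.1, n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    total = np.zeros_like(theta)
    for _ in range(n_samples):
        tau = rng.normal(theta, sigma)
        score = (tau - theta) / sigma**2      # grad of log p_theta(tau)
        total += loss(tau) * score
    return total / n_samples

theta = np.array([0.5, -0.2])
print(grad_estimate(theta))   # noisy estimate of 2*theta = [1.0, -0.4]
```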
Compiler-Level AD
Systems like Enzyme suggest that low-level compiler analysis may help differentiate larger stateful programs by leveraging alias analysis, memory SSA, and optimization infrastructure.
Differentiable Runtime Systems
Future runtimes may expose:
- reversible execution
- replayable scheduling
- deterministic concurrency
- differentiable memory tracking
This would make state evolution more compatible with reverse mode.
Why This Problem Matters
Differentiating large stateful systems would enable:
| Capability | Example |
|---|---|
| self-optimizing databases | learned storage layouts |
| differentiable operating systems | adaptive scheduling |
| differentiable networks | optimized routing policies |
| adaptive distributed systems | learned replication strategies |
| autonomous infrastructure | self-tuning clusters |
| differentiable software stacks | end-to-end optimization |
Today, optimization boundaries are usually local. Models optimize tensor parameters. The surrounding systems remain hand-engineered.
Large-scale stateful differentiation would extend optimization into the systems layer itself.
Summary
Automatic differentiation works best for pure, structured computations with explicit dependency graphs and limited mutable state.
Large stateful systems violate these assumptions.
They contain:
- mutation
- hidden state
- concurrency
- external effects
- nondeterminism
- distributed execution
- long temporal horizons
Differentiating such systems requires more than extending reverse mode mechanically. It requires new semantic models for time, state, memory, concurrency, and external interaction.
This remains one of the central unsolved problems in differentiable programming and systems research.