
Chapter 22. Open Problems

Automatic differentiation works naturally on pure mathematical functions:

y = f(x).

The function receives inputs, computes outputs, and produces no persistent side effects. Most theoretical treatments of AD assume this model.

Large real systems rarely behave this way.

Operating systems, databases, distributed services, simulations, robotics stacks, game engines, browsers, network services, and long-running machine learning pipelines all contain substantial mutable state. Their behavior depends not only on explicit inputs, but also on internal memory, caches, queues, synchronization, randomness, external events, and temporal ordering.

Differentiating such systems remains one of the major open problems in automatic differentiation.

What Is a Stateful System?

A stateful system maintains internal state across time.

Instead of:

y = f(x),

the system behaves more like:

(s_{t+1}, y_t) = F(s_t, x_t).

Where:

Symbol    Meaning
s_t       internal state at time t
x_t       external input
y_t       output
F         transition function
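
A minimal sketch of this recurrence in Python may make the shape concrete; the State fields and the step logic below are invented for illustration, not taken from any particular system:

from dataclasses import dataclass, field

@dataclass
class State:
    # Invented internal state: a cache plus a running counter.
    cache: dict = field(default_factory=dict)
    counter: int = 0

def step(s: State, x: float) -> tuple[State, float]:
    # (s_{t+1}, y_t) = F(s_t, x_t): the output depends on the input
    # and on everything the system has accumulated so far.
    y = x * (1 + s.counter)
    s_next = State(cache=dict(s.cache), counter=s.counter + 1)
    return s_next, y

s = State()
for x in [1.0, 2.0, 3.0]:
    s, y = step(s, x)   # the same x would give a different y at a different time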

Examples include:

System                        Internal state
database                      pages, indexes, transactions
web server                    connection pools, caches
robot controller              sensor history, control state
physics engine                world state
distributed system            replicated logs, consensus state
neural network training loop  optimizer moments, parameter buffers
browser engine                DOM tree, rendering state
operating system              scheduler queues, memory tables

The derivative now depends on the evolution of state over time, not only on one isolated function evaluation.

Reverse Mode Across Time

Suppose a system evolves for T steps:

s_{t+1} = F(s_t, x_t).

A scalar loss depends on the final state:

L = \ell(s_T).

Reverse mode computes gradients backward through time:

\bar{s}_t = \bar{s}_{t+1} \frac{\partial s_{t+1}}{\partial s_t}.

This resembles recurrent neural network backpropagation.
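
A minimal sketch of this recursion, written with per-step vector-Jacobian products in JAX on a toy differentiable transition (F and \ell here are invented stand-ins, not how a production system would be instrumented):

import jax
import jax.numpy as jnp

def F(s, x):
    # Toy differentiable transition; a real system's F is far messier.
    return jnp.tanh(s + x)

def ell(s):
    return jnp.sum(s ** 2)

def reverse_over_time(s0, xs):
    # Forward pass: record every intermediate state (the "tape").
    states = [s0]
    for x in xs:
        states.append(F(states[-1], x))

    # Backward pass: s_bar_t = s_bar_{t+1} dF/ds, via per-step VJPs.
    s_bar = jax.grad(ell)(states[-1])
    x_bars = []
    for t in reversed(range(len(xs))):
        _, vjp = jax.vjp(F, states[t], xs[t])
        s_bar, x_bar = vjp(s_bar)
        x_bars.append(x_bar)
    return s_bar, list(reversed(x_bars))

s0 = jnp.zeros(3)
xs = [jnp.ones(3) * 0.1] * 5
dL_ds0, dL_dxs = reverse_over_time(s0, xs)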

However, large stateful systems introduce much harder conditions:

Neural network            Large stateful system
mostly tensor operations  heterogeneous operations
explicit training graph   implicit evolving state
bounded semantics         external side effects
controlled primitives     arbitrary program behavior
accelerator-oriented      distributed heterogeneous runtime

The problem becomes less like differentiating a tensor graph and more like differentiating a live computational process.

Mutation

Mutation is the first major obstacle.

Consider:

buffer[i] += x

The reverse pass may require:

  • previous value of buffer[i]
  • index i
  • ordering of writes
  • alias relationships
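
Concretely, a tape that supports reversing buffer[i] += x has to record at least the index and, to keep the rest of the reverse sweep consistent, the value being overwritten. A minimal sketch, with invented helper names:

def fwd_add_at(buffer, i, x, tape):
    # Record what is needed to reverse this write: the index and the
    # value being overwritten, so the primal state can be restored.
    tape.append((i, buffer[i]))
    buffer[i] += x

def rev_add_at(entry, buffer, buffer_bar):
    # Adjoint of buffer[i] += x:
    #   x_bar accumulates buffer_bar[i]   (d buffer[i] / d x = 1),
    #   buffer_bar[i] itself is unchanged (d buffer[i] / d old buffer[i] = 1),
    # and the primal write is undone so adjoints of earlier operations
    # see the values they were actually computed from.
    i, old_value = entry
    x_bar = buffer_bar[i]
    buffer[i] = old_value
    return x_bar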

If the same memory is modified repeatedly:

for i in range(n):
    state = update(state, data[i])

reverse mode may need all intermediate states:

s_0, s_1, s_2, \ldots, s_T.

Storing every state can become impossible.

This creates the central memory problem of stateful reverse mode.

Temporal Depth

Long-running systems can evolve for millions or billions of state transitions.

Examples:

System              Temporal scale
database server     months
robot controller    continuous
online recommender  indefinite
distributed cache   persistent
simulation          millions of timesteps

Reverse mode over an indefinitely evolving process is fundamentally difficult because backward propagation requires access to earlier computational history.

A naive tape model becomes infeasible.

Checkpointing Limits

Checkpointing reduces storage by saving selected states and recomputing others.

For a short numerical simulation this works well.
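
A minimal sketch of the scheme, assuming a deterministic update(state, x) that replays exactly (all names invented):

def reverse_with_checkpoints(s0, data, update, k):
    T = len(data)
    # Forward sweep: keep only every k-th state instead of all T of them.
    checkpoints = {0: s0}
    s = s0
    for t, x in enumerate(data, start=1):
        s = update(s, x)
        if t % k == 0:
            checkpoints[t] = s

    # Backward sweep: for each segment, restart from its checkpoint and
    # recompute the intermediate states before propagating adjoints.
    # This assumes update(s, x) replays *identically*; nondeterministic
    # or externally driven systems cannot guarantee that.
    for start in range((T - 1) // k * k, -1, -k):
        end = min(start + k, T)
        states = [checkpoints[start]]
        for x in data[start:end]:
            states.append(update(states[-1], x))
        # ... propagate adjoints backward through `states` here ...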

Large stateful systems introduce new problems:

Issue              Consequence
nondeterminism     recomputation may diverge
external events    replay impossible
concurrency        ordering may differ
random scheduling  execution path unstable
network timing     state evolution changes

Recomputation assumes the system can replay identically. Many real systems cannot guarantee this.

External Side Effects

Stateful systems interact with the outside world.

Examples:

send_packet(data)
write_to_disk(page)
emit_event(msg)

These operations are not ordinary differentiable transformations.

Questions immediately appear:

Question                                       Difficulty
what is the derivative of a disk write?        unclear semantics
what is the derivative of network latency?     stochastic external process
what is the derivative of cache eviction?      discontinuous policy
what is the derivative of scheduler behavior?  nondeterministic execution

Most AD systems avoid these questions by treating such operations as nondifferentiable boundaries.

This is acceptable for neural networks. It is not sufficient for general differentiable systems.

Hidden State

Large systems often contain hidden state not visible in the user program.

Examples include:

  • allocator metadata
  • kernel page cache
  • GPU runtime queues
  • filesystem buffers
  • thread scheduling state
  • hardware prefetch behavior

A system’s behavior may depend on these hidden states even if the program source does not expose them.

This creates a semantic problem:

What exactly is the function being differentiated?

The observed output may depend on implementation details outside the formal computational graph.

Differentiable Databases

Databases illustrate the challenge clearly.

A query system contains:

Component            Stateful behavior
buffer manager       page cache
optimizer            adaptive plans
transaction manager  locks and MVCC
storage engine       mutable disk layout
replication          distributed logs

Suppose one wishes to optimize latency:

L = \text{query latency}.

Latency depends on:

  • cache state
  • branch prediction
  • disk layout
  • concurrent traffic
  • scheduler behavior
  • network contention

The resulting function is noisy, discontinuous, and history-dependent.

Differentiating such systems requires combining AD with stochastic modeling, systems instrumentation, and probabilistic reasoning.

Distributed Systems

Distributed systems create even harder problems.

A distributed system evolves under:

s_{t+1} = F(s_t, x_t, m_t, \sigma_t),

where:

Symbol  Meaning
m_t     message arrivals
σ_t     scheduling decisions

The system state depends on message timing and asynchronous interleavings.

Small perturbations may change:

  • ordering of events
  • consensus leader election
  • retry behavior
  • timeout triggers
  • partition recovery

The execution path may bifurcate discontinuously.

Classical reverse mode assumes stable local derivatives. Distributed systems often violate that assumption.
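
A tiny illustration of the bifurcation problem: if a request is retried whenever it exceeds a timeout, the cost is a step function of the timeout parameter, so the pointwise derivative is zero on both sides of the jump and says nothing about where the jump is (names and numbers are invented):

def request_cost(timeout, observed_latency=0.8, retry_penalty=5.0):
    # The execution path bifurcates on an external timing comparison;
    # the cost jumps discontinuously as `timeout` crosses the latency.
    if observed_latency < timeout:
        return observed_latency
    return observed_latency + retry_penalty

request_cost(0.81)  # 0.8  -- request completed in time
request_cost(0.79)  # 5.8  -- retry triggered; no infinitesimal warning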

Differentiable Simulators

Physics simulators are among the most successful large stateful differentiable systems so far.

A simulator evolves:

s_{t+1} = \Phi(s_t, u_t).

Reverse mode propagates adjoints through time integration.
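
A minimal JAX sketch of this pattern, with a toy damped spring standing in for \Phi (the dynamics and constants are invented):

import jax
import jax.numpy as jnp

def phi(state, u, dt=0.01, k=4.0, c=0.3):
    # One explicit-Euler step of a damped spring driven by control u.
    pos, vel = state[0], state[1]
    acc = -k * pos - c * vel + u
    return jnp.array([pos + dt * vel, vel + dt * acc])

def loss(controls, s0):
    s = s0
    for t in range(controls.shape[0]):
        s = phi(s, controls[t])
    return s[0] ** 2  # drive the final position toward zero

s0 = jnp.array([1.0, 0.0])
controls = jnp.zeros(100)
grad_controls = jax.grad(loss)(controls, s0)  # adjoint sweep through time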

Differentiable simulators work because:

Property                Benefit
deterministic dynamics  replay possible
explicit state          visible transitions
mathematical structure  known local derivatives
controlled operations   stable semantics

Even here, problems appear:

  • collisions introduce discontinuities
  • contact solvers are iterative
  • friction models may be nonsmooth
  • adaptive timesteps alter trajectories

Large engineering simulators remain difficult to differentiate robustly.

Concurrency

Concurrency fundamentally complicates AD.

Suppose two threads update shared state:

thread A: x += 1
thread B: x *= 2

The final state depends on execution ordering.
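
A small worked version of this: the two updates applied in the two possible orders give different final values, so any adjoint computed for one schedule is the derivative of a different function than the one the other schedule actually ran (the helper below is only illustrative):

def run(x, order):
    # The two thread bodies from above, applied in a chosen interleaving.
    ops = {"A": lambda v: v + 1, "B": lambda v: v * 2}
    for name in order:
        x = ops[name](x)
    return x

run(3.0, "AB")  # 8.0
run(3.0, "BA")  # 7.0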

Reverse mode would need to know:

  • exact interleaving
  • memory visibility rules
  • synchronization ordering
  • atomic semantics

The derivative is therefore partly a property of the scheduler.

Most AD systems implicitly assume sequential execution. Extending reverse mode to concurrent systems remains largely unresolved.

Sparse Credit Assignment

Large stateful systems often contain delayed effects.

A decision made at time t may affect outcomes at time t + 10^6.

This creates a long-horizon credit assignment problem:

\frac{\partial L}{\partial s_t}.

Examples:

Domain                 Long-range effect
database caching       later query latency
network routing        future congestion
compiler optimization  downstream runtime
operating systems      future scheduling behavior
autonomous systems     delayed task success

Backpropagating through extremely long horizons is computationally and numerically unstable.
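
The difficulty is visible even in the scalar case s_{t+1} = a s_t: the gradient \partial s_T / \partial s_0 is a product of T per-step factors, and for horizons like those in the table it underflows or overflows long before anything useful is recovered (illustrative numbers only):

# d s_T / d s_0 = a^T for the scalar recurrence s_{t+1} = a * s_t.
for a in (0.999, 1.001):
    grad = 1.0
    for _ in range(1_000_000):
        grad *= a          # each step multiplies in one local Jacobian
    print(a, grad)         # 0.999 underflows to 0.0, 1.001 overflows to inf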

Hybrid Symbolic and Learned Systems

Modern systems increasingly combine:

  • neural components
  • rule engines
  • databases
  • search systems
  • planners
  • simulators

Example:

retrieval
  -> ranking
  -> planning
  -> execution
  -> environment interaction

Some components are differentiable. Others are discrete or symbolic.

Differentiating across such hybrid boundaries remains an open research area.

Semantic Boundaries

A deeper issue is semantic meaning.

For a tensor operation:

y = x^2,

the derivative is mathematically clear.

For a database transaction scheduler:

y = \text{latency}(x),

what exactly should the derivative mean?

Possible interpretations include:

Interpretation            Meaning
local sensitivity         infinitesimal perturbation
expected effect           stochastic average
policy gradient           optimization signal
finite perturbation       empirical response
relaxed continuous model  smoothed approximation

Different interpretations lead to different “gradients.”

The problem becomes partly philosophical: what does differentiability mean for complex computational systems?

Possible Directions

Several research directions appear promising.

Differentiable State Abstractions

Instead of differentiating raw mutable systems, one may differentiate structured state models:

s_{t+1} = F_\theta(s_t, x_t).

This reduces hidden behavior and makes adjoint propagation explicit.

Event-Sourced Systems

Event sourcing naturally preserves history:

state = replay(events)

Because prior states can be reconstructed from event logs, reverse replay may become easier.
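
A minimal sketch of why this helps (the event shape and apply function are invented): any prefix of the log reproduces the state at that point in history, which is exactly the kind of access a backward sweep needs.

def apply(state, event):
    # Invented reducer: fold one event into a fresh state snapshot.
    kind, payload = event
    new_state = dict(state)
    if kind == "put":
        key, value = payload
        new_state[key] = value
    elif kind == "delete":
        new_state.pop(payload, None)
    return new_state

def replay(events, upto=None):
    # Reconstruct the state at any point in history from the log alone.
    state = {}
    for event in events[:upto]:
        state = apply(state, event)
    return state

log = [("put", ("a", 1)), ("put", ("b", 2)), ("delete", "a")]
replay(log)          # {'b': 2}
replay(log, upto=2)  # {'a': 1, 'b': 2}  -- the state one step earlier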

This is one reason differentiable event-driven architectures are interesting research targets.

Probabilistic Differentiation

Instead of exact trajectories, differentiate expectations:

\nabla_\theta \mathbb{E}[L].

This connects AD with stochastic control and reinforcement learning.
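
One standard way to make this concrete is the score-function (REINFORCE) identity \nabla_\theta \mathbb{E}[L(z)] = \mathbb{E}[L(z) \nabla_\theta \log p_\theta(z)], which requires only sampling the system and differentiating the log-probability of the decisions, not the system itself. A minimal sketch with a Gaussian decision variable (everything here is illustrative):

import numpy as np

def score_function_gradient(theta, loss, num_samples=10_000, sigma=1.0):
    # grad_theta E[L(z)] = E[L(z) * grad_theta log p_theta(z)]
    # for z ~ Normal(theta, sigma^2), where grad log p = (z - theta) / sigma^2.
    rng = np.random.default_rng(0)
    z = rng.normal(theta, sigma, size=num_samples)
    score = (z - theta) / sigma ** 2
    return float(np.mean(loss(z) * score))

# Example: L(z) = (z - 3)^2 has true gradient 2 * (theta - 3) = -6 at theta = 0.
score_function_gradient(0.0, lambda z: (z - 3.0) ** 2)  # roughly -6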

Compiler-Level AD

Systems like Enzyme suggest that low-level compiler analysis may help differentiate larger stateful programs by leveraging alias analysis, memory SSA, and optimization infrastructure.

Differentiable Runtime Systems

Future runtimes may expose:

  • reversible execution
  • replayable scheduling
  • deterministic concurrency
  • differentiable memory tracking

This would make state evolution more compatible with reverse mode.

Why This Problem Matters

Differentiating large stateful systems would enable:

Capability                        Example
self-optimizing databases         learned storage layouts
differentiable operating systems  adaptive scheduling
differentiable networks           optimized routing policies
adaptive distributed systems      learned replication strategies
autonomous infrastructure         self-tuning clusters
differentiable software stacks    end-to-end optimization

Today, optimization boundaries are usually local. Models optimize tensor parameters. The surrounding systems remain hand-engineered.

Large-scale stateful differentiation would extend optimization into the systems layer itself.

Summary

Automatic differentiation works best for pure, structured computations with explicit dependency graphs and limited mutable state.

Large stateful systems violate these assumptions.

They contain:

  • mutation
  • hidden state
  • concurrency
  • external effects
  • nondeterminism
  • distributed execution
  • long temporal horizons

Differentiating such systems requires more than extending reverse mode mechanically. It requires new semantic models for time, state, memory, concurrency, and external interaction.

This remains one of the central unsolved problems in differentiable programming and systems research.