# Chapter 22. Open Problems

Automatic differentiation works naturally on pure mathematical functions:

$$
y = f(x).
$$

The function receives inputs, computes outputs, and produces no persistent side effects. Most theoretical treatments of AD assume this model.

Large real systems rarely behave this way.

Operating systems, databases, distributed services, simulations, robotics stacks, game engines, browsers, network services, and long-running machine learning pipelines all contain substantial mutable state. Their behavior depends not only on explicit inputs, but also on internal memory, caches, queues, synchronization, randomness, external events, and temporal ordering.

Differentiating such systems remains one of the major open problems in automatic differentiation.

## What Is a Stateful System?

A stateful system maintains internal state across time.

Instead of:

$$
y = f(x),
$$

the system behaves more like:

$$
(s_{t+1}, y_t) = F(s_t, x_t),
$$

where:

| Symbol | Meaning |
|---|---|
| $s_t$ | internal state at time $t$ |
| $x_t$ | external input |
| $y_t$ | output |
| $F$ | transition function |
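
As a toy illustration (a hypothetical counter service, all names invented), the transition-function view looks like this in Python:

```python
# Hypothetical service modeled as (s_{t+1}, y_t) = F(s_t, x_t).
# The state is explicit, so every dependency of the output is visible.

def F(state, x):
    """Transition function: consumes state and input, returns new state and output."""
    new_state = {"total": state["total"] + x, "steps": state["steps"] + 1}
    output = new_state["total"] / new_state["steps"]  # running mean
    return new_state, output

state = {"total": 0.0, "steps": 0}
outputs = []
for x in [1.0, 3.0, 5.0]:
    state, y = F(state, x)
    outputs.append(y)
# outputs now holds the running means 1.0, 2.0, 3.0
```

Because `F` is pure and the state is threaded explicitly, this toy system is already in the form AD theory expects; the open problems below arise precisely when systems do not look like this.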

Examples include:

| System | Internal state |
|---|---|
| database | pages, indexes, transactions |
| web server | connection pools, caches |
| robot controller | sensor history, control state |
| physics engine | world state |
| distributed system | replicated logs, consensus state |
| neural network training loop | optimizer moments, parameter buffers |
| browser engine | DOM tree, rendering state |
| operating system | scheduler queues, memory tables |

The derivative now depends on the evolution of state over time, not only on one isolated function evaluation.

## Reverse Mode Across Time

Suppose a system evolves for $T$ steps:

$$
s_{t+1} = F(s_t, x_t).
$$

A scalar loss depends on the final state:

$$
L = \ell(s_T).
$$

Reverse mode computes gradients backward through time:

$$
\bar{s}_t =
\bar{s}_{t+1}
\frac{\partial s_{t+1}}{\partial s_t}.
$$

This resembles recurrent neural network backpropagation.
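
For a scalar linear transition the recursion can be written out directly. The following sketch assumes $s_{t+1} = a s_t + b x_t$ with loss $L = s_T^2$, an invented toy model rather than a general implementation:

```python
# Backpropagation through time for s_{t+1} = a*s_t + b*x_t, L = s_T**2.
# Forward pass stores every state; reverse pass applies
# s_bar_t = s_bar_{t+1} * d s_{t+1} / d s_t.

a, b = 0.9, 0.5
xs = [1.0, -2.0, 0.5]

# Forward pass: record the trajectory s_0 .. s_T.
states = [0.0]
for x in xs:
    states.append(a * states[-1] + b * x)

# Reverse pass: propagate the adjoint of the loss backward.
s_bar = 2.0 * states[-1]          # dL/ds_T for L = s_T**2
x_bars = [0.0] * len(xs)
for t in reversed(range(len(xs))):
    x_bars[t] = s_bar * b         # dL/dx_t
    s_bar = s_bar * a             # dL/ds_t

# Check one entry against the closed form dL/dx_t = 2*s_T * a**(T-1-t) * b.
T = len(xs)
assert abs(x_bars[0] - 2.0 * states[-1] * a ** (T - 1) * b) < 1e-12
```

Note that the reverse loop reads `states` in reverse order; keeping that trajectory available is exactly what becomes hard at scale.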

However, large stateful systems operate under much harder conditions:

| Neural network | Large stateful system |
|---|---|
| mostly tensor operations | heterogeneous operations |
| explicit training graph | implicit evolving state |
| bounded semantics | external side effects |
| controlled primitives | arbitrary program behavior |
| accelerator-oriented | distributed heterogeneous runtime |

The problem becomes less like differentiating a tensor graph and more like differentiating a live computational process.

## Mutation

Mutation is the first major obstacle.

Consider:

```python
buffer[i] += x
```

The reverse pass may require:

- previous value of `buffer[i]`
- index `i`
- ordering of writes
- alias relationships
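
One common remedy is a tape that records what each write destroys. A minimal sketch (invented helper names) for the update above:

```python
# Hypothetical tape that records the information the reverse pass needs
# for an in-place update buffer[i] += x: the index and the overwritten value.

tape = []

def add_inplace(buffer, i, x):
    tape.append(("add", i, buffer[i]))  # save index and old value before writing
    buffer[i] += x

def undo_last(buffer):
    op, i, old = tape.pop()
    buffer[i] = old                      # restore the pre-write value

buf = [1.0, 2.0, 3.0]
add_inplace(buf, 1, 10.0)   # buf is now [1.0, 12.0, 3.0]
undo_last(buf)              # buf restored to [1.0, 2.0, 3.0]
```

This works for a single write, but the tape grows with every mutation, which is where the memory problem below comes from.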

If the same memory is modified repeatedly:

```python
for i in range(n):
    state = update(state, data[i])
```

reverse mode may need all intermediate states:

$$
s_0, s_1, s_2, \ldots, s_T.
$$

Storing every intermediate state quickly becomes infeasible.

This creates the central memory problem of stateful reverse mode.

## Temporal Depth

Long-running systems can evolve for millions or billions of state transitions.

Examples:

| System | Temporal scale |
|---|---|
| database server | months |
| robot controller | continuous |
| online recommender | indefinite |
| distributed cache | persistent |
| simulation | millions of timesteps |

Reverse mode over an indefinitely evolving process is fundamentally difficult because backward propagation requires access to earlier computational history.

A naive tape model becomes infeasible.

## Checkpointing Limits

Checkpointing reduces storage by saving selected states and recomputing others.

For a short numerical simulation this works well.
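
A minimal checkpoint-and-recompute sketch, assuming a deterministic `step` function (all names hypothetical):

```python
# Checkpoint every k steps; recompute intermediate states on demand.
# This trades O(T) storage for O(T/k) storage plus recomputation,
# and it is only sound when `step` replays deterministically.

def step(s):
    return 0.5 * s + 1.0   # deterministic transition

def run_with_checkpoints(s0, T, k):
    checkpoints = {0: s0}
    s = s0
    for t in range(1, T + 1):
        s = step(s)
        if t % k == 0:
            checkpoints[t] = s
    return checkpoints

def state_at(checkpoints, t, k):
    base = (t // k) * k            # nearest stored checkpoint at or before t
    s = checkpoints[base]
    for _ in range(t - base):
        s = step(s)                # recompute forward from the checkpoint
    return s

cps = run_with_checkpoints(0.0, 100, 10)
# state_at recovers any intermediate state without storing all 100.
```

The table below lists the ways real systems break the single assumption this sketch relies on: that `step` replays identically.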

Large stateful systems introduce new problems:

| Issue | Consequence |
|---|---|
| nondeterminism | recomputation may diverge |
| external events | replay impossible |
| concurrency | ordering may differ |
| random scheduling | execution path unstable |
| network timing | state evolution changes |

Recomputation assumes the system can replay identically. Many real systems cannot guarantee this.

## External Side Effects

Stateful systems interact with the outside world.

Examples:

```python
send_packet(data)
write_to_disk(page)
emit_event(msg)
```

These operations are not ordinary differentiable transformations.

Questions immediately appear:

| Question | Difficulty |
|---|---|
| what is the derivative of a disk write? | unclear semantics |
| what is the derivative of network latency? | stochastic external process |
| what is the derivative of cache eviction? | discontinuous policy |
| what is the derivative of scheduler behavior? | nondeterministic execution |

Most AD systems avoid these questions by treating such operations as nondifferentiable boundaries.

This is acceptable for neural networks. It is not sufficient for general differentiable systems.
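
As a toy illustration of such a boundary, the following sketch uses forward-mode dual numbers (an invented minimal class, not a real AD library): the effectful operation acts on the primal value and zeroes the tangent.

```python
# Toy forward-mode dual numbers. An effectful operation is treated as a
# nondifferentiable boundary: the side effect happens, the derivative stops.

class Dual:
    def __init__(self, val, tan):
        self.val, self.tan = val, tan
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

log = []

def emit_event(d):
    """Side effect on the primal only; the tangent is cut to zero here."""
    log.append(d.val)
    return Dual(d.val, 0.0)   # boundary: no derivative flows through

x = Dual(3.0, 1.0)            # seed dx/dx = 1
y = x * x                     # y.tan == 6.0
z = emit_event(y)             # side effect recorded, z.tan == 0.0
```

The zeroed tangent is a design choice, not a theorem: it declares the effect irrelevant to the derivative, which is exactly the assumption general differentiable systems cannot always make.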

## Hidden State

Large systems often contain hidden state not visible in the user program.

Examples include:

- allocator metadata
- kernel page cache
- GPU runtime queues
- filesystem buffers
- thread scheduling state
- hardware prefetch behavior

A system’s behavior may depend on these hidden states even if the program source does not expose them.

This creates a semantic problem: *what exactly is the function being differentiated?*

The observed output may depend on implementation details outside the formal computational graph.

## Differentiable Databases

Databases illustrate the challenge clearly.

A query system contains:

| Component | Stateful behavior |
|---|---|
| buffer manager | page cache |
| optimizer | adaptive plans |
| transaction manager | locks and MVCC |
| storage engine | mutable disk layout |
| replication | distributed logs |

Suppose one wishes to optimize latency:

$$
L = \text{query latency}.
$$

Latency depends on:

- cache state
- branch prediction
- disk layout
- concurrent traffic
- scheduler behavior
- network contention

The resulting function is noisy, discontinuous, and history-dependent.

Differentiating such systems requires combining AD with stochastic modeling, systems instrumentation, and probabilistic reasoning.

## Distributed Systems

Distributed systems create even harder problems.

A distributed system evolves under:

$$
s_{t+1} = F(s_t, x_t, m_t, \sigma_t),
$$

where:

| Symbol | Meaning |
|---|---|
| $m_t$ | message arrivals |
| $\sigma_t$ | scheduling decisions |

The system state depends on message timing and asynchronous interleavings.

Small perturbations may change:

- ordering of events
- consensus leader election
- retry behavior
- timeout triggers
- partition recovery

The execution path may bifurcate discontinuously.

Classical reverse mode assumes stable local derivatives. Distributed systems often violate that assumption.

## Differentiable Simulators

Physics simulators are among the most successful large stateful differentiable systems so far.

A simulator evolves:

$$
s_{t+1} = \Phi(s_t, u_t).
$$

Reverse mode propagates adjoints through time integration.
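
A minimal adjoint sweep through such a simulator, assuming an explicit-Euler spring model $q' = q + \Delta t\, v$, $v' = v - \Delta t\, k q + \Delta t\, u_t$ with loss $L = q_T^2$ (a toy model chosen for illustration):

```python
# Adjoint sweep through an explicit-Euler spring simulator.
# State s = (q, v); gradient taken with respect to the controls u_t.

dt, k, T = 0.1, 2.0, 20
us = [0.1] * T

# Forward: record the full trajectory.
traj = [(1.0, 0.0)]
for u in us:
    q, v = traj[-1]
    traj.append((q + dt * v, v - dt * k * q + dt * u))

# Reverse: propagate (q_bar, v_bar) backward through each step,
# multiplying by the transpose of the local Jacobian [[1, dt], [-dt*k, 1]].
qT, vT = traj[-1]
q_bar, v_bar = 2.0 * qT, 0.0      # dL/dq_T, dL/dv_T for L = q_T**2
u_bars = [0.0] * T
for t in reversed(range(T)):
    u_bars[t] = v_bar * dt        # dL/du_t
    q_bar, v_bar = q_bar - dt * k * v_bar, dt * q_bar + v_bar
```

This is the deterministic, explicit-state, smooth regime in which differentiable simulation succeeds; the table below names those properties.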

Differentiable simulators work because:

| Property | Benefit |
|---|---|
| deterministic dynamics | replay possible |
| explicit state | visible transitions |
| mathematical structure | known local derivatives |
| controlled operations | stable semantics |

Even here, problems appear:

- collisions introduce discontinuities
- contact solvers are iterative
- friction models may be nonsmooth
- adaptive timesteps alter trajectories

Large engineering simulators remain difficult to differentiate robustly.

## Concurrency

Concurrency fundamentally complicates AD.

Suppose two threads update shared state:

```text
thread A: x += 1
thread B: x *= 2
```

The final state depends on execution ordering.

Reverse mode would need to know:

- exact interleaving
- memory visibility rules
- synchronization ordering
- atomic semantics

The derivative is therefore partly a property of the scheduler.
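
To make this concrete, treat thread A's increment as a parameter $a$. A finite-difference sketch (toy functions, with the two interleavings written out sequentially) shows the derivative with respect to $a$ depends on the schedule:

```python
# Two interleavings of: thread A: x += a, thread B: x *= 2.
# The derivative dy/da differs, so the "gradient" is a property
# of the schedule, not of the program text alone.

def run_AB(x, a):
    x = x + a      # A first
    x = x * 2      # then B
    return x       # y = 2*(x + a), so dy/da = 2

def run_BA(x, a):
    x = x * 2      # B first
    x = x + a      # then A
    return x       # y = 2*x + a, so dy/da = 1

eps = 1e-6
g_AB = (run_AB(3.0, 1.0 + eps) - run_AB(3.0, 1.0)) / eps   # approx. 2
g_BA = (run_BA(3.0, 1.0 + eps) - run_BA(3.0, 1.0)) / eps   # approx. 1
```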

Most AD systems implicitly assume sequential execution. Extending reverse mode to concurrent systems remains largely unresolved.

## Sparse Credit Assignment

Large stateful systems often contain delayed effects.

A decision made at time $t$ may affect outcomes at time $t + 10^6$.

This creates a long-horizon credit assignment problem:

$$
\frac{\partial L}{\partial s_t}.
$$

Examples:

| Domain | Long-range effect |
|---|---|
| database caching | later query latency |
| network routing | future congestion |
| compiler optimization | downstream runtime |
| operating systems | future scheduling behavior |
| autonomous systems | delayed task success |

Backpropagating through extremely long horizons is both computationally expensive and numerically unstable.
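
A one-line model of the numerical problem: if each local Jacobian has magnitude $a$, the adjoint through $T$ steps scales as $a^T$, which vanishes or explodes.

```python
# The adjoint through T steps of s' = a*s is a product of T local
# Jacobians, so dL/ds_0 scales like a**T.

def grad_through_horizon(a, T):
    g = 1.0                # seed dL/ds_T = 1
    for _ in range(T):
        g *= a             # multiply by local Jacobian ds_{t+1}/ds_t = a
    return g

vanish = grad_through_horizon(0.99, 2000)   # roughly 2e-9: signal gone
explode = grad_through_horizon(1.01, 2000)  # roughly 4e8: signal blown up
```

Only the knife-edge case $a = 1$ preserves the gradient, and real systems do not sit on that edge.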

## Hybrid Symbolic and Learned Systems

Modern systems increasingly combine:

- neural components
- rule engines
- databases
- search systems
- planners
- simulators

Example:

```text
retrieval
  -> ranking
  -> planning
  -> execution
  -> environment interaction
```

Some components are differentiable. Others are discrete or symbolic.

Differentiating across such hybrid boundaries remains an open research area.

## Semantic Boundaries

A deeper issue is semantic meaning.

For a tensor operation:

$$
y = x^2,
$$

the derivative is mathematically clear.

For a database transaction scheduler:

$$
y = \text{latency}(x),
$$

what exactly should the derivative mean?

Possible interpretations include:

| Interpretation | Meaning |
|---|---|
| local sensitivity | infinitesimal perturbation |
| expected effect | stochastic average |
| policy gradient | optimization signal |
| finite perturbation | empirical response |
| relaxed continuous model | smoothed approximation |

Different interpretations lead to different "gradients."
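
A toy comparison of three of these interpretations on a step-like cost (the jump location and smoothing temperature are invented for illustration):

```python
import math

# A discontinuous cost, like a cache hit/miss cliff at x = 0.5.
def cost(x):
    return 1.0 if x > 0.5 else 0.0

# A sigmoid relaxation of the same step (smoothed approximation).
def smoothed_cost(x, temp=0.05):
    return 1.0 / (1.0 + math.exp(-(x - 0.5) / temp))

x = 0.3
local = (cost(x + 1e-9) - cost(x)) / 1e-9        # 0.0: flat almost everywhere
finite = (cost(x + 0.3) - cost(x)) / 0.3         # nonzero: sees the jump
h = 1e-6
smooth = (smoothed_cost(x + h) - smoothed_cost(x)) / h  # small but nonzero
```

All three are defensible answers to "what is the gradient of this cost at $x = 0.3$", and they disagree.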

The problem becomes partly philosophical: what does differentiability mean for complex computational systems?

## Possible Directions

Several research directions appear promising.

### Differentiable State Abstractions

Instead of differentiating raw mutable systems, one may differentiate structured state models:

$$
s_{t+1} = F_\theta(s_t, x_t).
$$

This reduces hidden behavior and makes adjoint propagation explicit.

### Event-Sourced Systems

Event sourcing naturally preserves history:

```text
state = replay(events)
```

Because prior states can be reconstructed from event logs, reverse replay may become easier.

This is one reason differentiable event-driven architectures are interesting research targets.
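
A minimal event-sourcing sketch (invented event vocabulary): state is a fold over an append-only log, so any earlier state can be reconstructed on demand rather than checkpointed.

```python
# State as a fold over an append-only event log.

def apply(state, event):
    kind, value = event
    return state + value if kind == "add" else state * value

def replay(events, upto=None):
    state = 0.0
    for event in events[:upto]:
        state = apply(state, event)
    return state

log = [("add", 2.0), ("mul", 3.0), ("add", 1.0)]
final = replay(log)        # 7.0
midpoint = replay(log, 2)  # 6.0: the state just before the last event
```

A reverse sweep can call `replay(log, t)` to recover $s_t$, at the cost of recomputation rather than storage.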

### Probabilistic Differentiation

Instead of exact trajectories, differentiate expectations:

$$
\nabla_\theta \mathbb{E}[L].
$$

This connects AD with stochastic control and reinforcement learning.
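
A minimal sketch of this idea is the score-function (REINFORCE) estimator. The Bernoulli model below is chosen so the true gradient $p(1-p)$ is known and the estimate can be checked:

```python
import math
import random

# Score-function estimator for grad_theta E[L(z)], z ~ Bernoulli(sigmoid(theta)),
# with L(z) = z so the true gradient is p*(1-p).

random.seed(0)
theta = 0.4
p = 1.0 / (1.0 + math.exp(-theta))

N = 200_000
total = 0.0
for _ in range(N):
    z = 1.0 if random.random() < p else 0.0
    L = z                       # loss of this sampled trajectory
    score = z - p               # d/dtheta log p(z | theta) for this model
    total += L * score
estimate = total / N            # approx. p * (1 - p)
```

The estimator never differentiates through the sampled trajectory itself, which is why it survives discontinuities and nondeterminism that defeat pathwise reverse mode, at the price of variance.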

### Compiler-Level AD

Systems like Enzyme suggest that low-level compiler analysis may help differentiate larger stateful programs by leveraging alias analysis, memory SSA, and optimization infrastructure.

### Differentiable Runtime Systems

Future runtimes may expose:

- reversible execution
- replayable scheduling
- deterministic concurrency
- differentiable memory tracking

This would make state evolution more compatible with reverse mode.

## Why This Problem Matters

Differentiating large stateful systems would enable:

| Capability | Example |
|---|---|
| self-optimizing databases | learned storage layouts |
| differentiable operating systems | adaptive scheduling |
| differentiable networks | optimized routing policies |
| adaptive distributed systems | learned replication strategies |
| autonomous infrastructure | self-tuning clusters |
| differentiable software stacks | end-to-end optimization |

Today, optimization boundaries are usually local. Models optimize tensor parameters. The surrounding systems remain hand-engineered.

Large-scale stateful differentiation would extend optimization into the systems layer itself.

## Summary

Automatic differentiation works best for pure, structured computations with explicit dependency graphs and limited mutable state.

Large stateful systems violate these assumptions.

They contain:

- mutation
- hidden state
- concurrency
- external effects
- nondeterminism
- distributed execution
- long temporal horizons

Differentiating such systems requires more than extending reverse mode mechanically. It requires new semantic models for time, state, memory, concurrency, and external interaction.

This remains one of the central unsolved problems in differentiable programming and systems research.

