# Minimal Reverse Mode Engine

Reverse mode automatic differentiation computes derivatives by traversing the program backward after evaluation. Unlike forward mode, which propagates tangents alongside values, reverse mode separates computation into two phases:

1. Forward evaluation
2. Backward gradient propagation

The forward pass computes values and records dependency information. The backward pass propagates sensitivities from outputs toward inputs.

For a scalar-valued function

$$
f : \mathbb{R}^n \to \mathbb{R}
$$

reverse mode computes the full gradient at a cost proportional to a single evaluation of $f$, regardless of $n$:

$$
\nabla f(x)
$$

This efficiency is the reason reverse mode dominates machine learning and deep neural network training.

## The Core Idea

Suppose:

$$
z = f(x, y)
$$

During evaluation we compute intermediate values:

```text
x, y -> intermediates -> z
```

Reverse mode introduces an adjoint for each intermediate variable:

$$
\bar{v} = \frac{\partial z}{\partial v}
$$

The adjoint measures how sensitive the final output is to that variable.

The backward pass starts from:

$$
\bar{z} = 1
$$

and propagates derivatives backward through every operation.

This is simply repeated application of the chain rule.

## A Minimal Computational Graph

A reverse mode engine typically represents computations as nodes.

A minimal node structure:

```go
type Node struct {
    Value float64
    Grad  float64

    Prev     []*Node
    Backward func()
}
```

`Value` stores the primal value.

`Grad` stores the accumulated adjoint:

$$
\frac{\partial\, \text{output}}{\partial\, \text{node}}
$$

`Prev` stores dependencies.

`Backward` defines how gradients propagate to parents.

This is enough to build a minimal reverse mode engine.

## Variables and Constants

Variables are nodes with user-provided values.

```go
func Var(x float64) *Node {
    return &Node{
        Value: x,
    }
}
```

Constants are structurally identical. Their gradients are still accumulated during the backward pass; they are simply never read.

```go
func Const(x float64) *Node {
    return &Node{
        Value: x,
    }
}
```

In minimal systems there is often no distinction between variables and constants at the node level.

## Addition

Consider:

$$
z = x + y
$$

Then:

$$
\frac{\partial z}{\partial x} = 1
$$

$$
\frac{\partial z}{\partial y} = 1
$$

The implementation:

```go
func Add(a, b *Node) *Node {
    out := &Node{
        Value: a.Value + b.Value,
        Prev:  []*Node{a, b},
    }

    out.Backward = func() {
        a.Grad += out.Grad
        b.Grad += out.Grad
    }

    return out
}
```

The key idea:

```text
Parents accumulate contributions from child gradients.
```

Reverse mode is fundamentally gradient accumulation.

## Multiplication

For:

$$
z = xy
$$

we have:

$$
\frac{\partial z}{\partial x} = y
$$

$$
\frac{\partial z}{\partial y} = x
$$

Implementation:

```go
func Mul(a, b *Node) *Node {
    out := &Node{
        Value: a.Value * b.Value,
        Prev:  []*Node{a, b},
    }

    out.Backward = func() {
        a.Grad += b.Value * out.Grad
        b.Grad += a.Value * out.Grad
    }

    return out
}
```

Each backward rule applies the local derivative multiplied by the incoming adjoint.

This is the reverse chain rule.

## Elementary Functions

For:

$$
y = \sin(x)
$$

we know:

$$
\frac{dy}{dx} = \cos(x)
$$

Implementation:

```go
func Sin(x *Node) *Node {
    out := &Node{
        Value: math.Sin(x.Value),
        Prev:  []*Node{x},
    }

    out.Backward = func() {
        x.Grad += math.Cos(x.Value) * out.Grad
    }

    return out
}
```

For exponentials:

```go
func Exp(x *Node) *Node {
    v := math.Exp(x.Value)

    out := &Node{
        Value: v,
        Prev:  []*Node{x},
    }

    out.Backward = func() {
        x.Grad += v * out.Grad
    }

    return out
}
```

Every primitive operation defines:
- forward evaluation
- local backward propagation

Nothing more is required.

## Example Computation

Consider:

$$
f(x) = x^2 + 3x + 2
$$

Implementation:

```go
func F(x *Node) *Node {
    return Add(
        Add(
            Mul(x, x),
            Mul(Const(3), x),
        ),
        Const(2),
    )
}
```

Evaluation:

```go
func main() {
    x := Var(5)

    y := F(x)

    Backprop(y)

    fmt.Println("value:", y.Value)
    fmt.Println("gradient:", x.Grad)
}
```

Expected result:

```text
value: 42
gradient: 13
```

The forward pass computes values. The backward pass computes gradients.

## Topological Ordering

Backward propagation must occur in reverse dependency order.

Example:

```text
x -> a -> b -> output
```

We cannot propagate gradients through `a` before processing `b`.

A minimal engine therefore performs a topological traversal.

```go
func TopoSort(
    n *Node,
    visited map[*Node]bool,
    order *[]*Node,
) {
    if visited[n] {
        return
    }

    visited[n] = true

    for _, p := range n.Prev {
        TopoSort(p, visited, order)
    }

    *order = append(*order, n)
}
```

This produces nodes in forward dependency order.

The backward pass iterates in reverse:

```go
func Backprop(root *Node) {
    var order []*Node

    TopoSort(
        root,
        map[*Node]bool{},
        &order,
    )

    root.Grad = 1

    for i := len(order) - 1; i >= 0; i-- {
        if order[i].Backward != nil {
            order[i].Backward()
        }
    }
}
```

This is the core reverse mode algorithm.

## Gradient Accumulation

A subtle but important detail:

```go
a.Grad += ...
```

not:

```go
a.Grad = ...
```

Gradients must accumulate because variables may influence outputs through multiple paths.

Example:

$$
f(x) = x^2 + x
$$

The variable $x$ contributes through two independent branches.

Reverse mode sums all contributions:

$$
\frac{df}{dx} =
\frac{\partial(x^2)}{\partial x}
+
\frac{\partial x}{\partial x}
$$

This accumulation property is essential.

## Computational Complexity

Suppose:

$$
f : \mathbb{R}^n \to \mathbb{R}
$$

Forward mode:
- computes one directional derivative per pass
- full gradient costs roughly $O(n)$ passes

Reverse mode:
- computes the full gradient in one backward pass

This makes reverse mode asymptotically favorable for:
- scalar losses
- large parameter spaces
- neural networks

However, reverse mode must store intermediate values and graph structure from the forward pass.

Memory becomes the central systems problem.

## Minimal Engine Architecture

A tiny reverse mode engine needs only:

| Component | Purpose |
|---|---|
| Node | Stores value, gradient, dependencies |
| Primitive ops | Define local derivative rules |
| Topological traversal | Orders backward propagation |
| Gradient accumulation | Combines contributions |
| Backward pass | Executes reverse chain rule |

This minimal structure already supports:
- scalar gradients
- arbitrary expression graphs
- nested compositions
- reusable primitives

Modern deep learning frameworks are vastly more complex, but their core mechanism is still this architecture.

## Dynamic Graph Construction

One important property of this design:

```text
The graph is constructed during execution.
```

This is called a dynamic computation graph.

Advantages:
- ordinary language control flow
- easy debugging
- natural recursion
- flexible runtime behavior

Disadvantages:
- runtime overhead
- graph allocation costs
- reduced compiler optimization opportunities

Frameworks such as entity["software","PyTorch","deep learning framework"] popularized this style.

Static graph systems instead separate graph construction from execution.

## Correctness Intuition

Suppose node $v$ contributes to output $y$.

The adjoint:

$$
\bar{v} =
\frac{\partial y}{\partial v}
$$

For every edge:

$$
v \to u
$$

the chain rule gives:

$$
\frac{\partial y}{\partial v} =
\sum_u
\frac{\partial y}{\partial u}
\frac{\partial u}{\partial v}
$$

Reverse mode computes exactly this recurrence.

Each node receives contributions from downstream nodes and accumulates them.

The algorithm is therefore a graph-based implementation of the multivariable chain rule.
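
As a concrete instance of this recurrence, take $z = xy + x$ with the intermediate $u = xy$:

$$
\bar{z} = 1, \qquad
\bar{u} = \bar{z} \cdot \frac{\partial z}{\partial u} = 1
$$

$$
\bar{x} =
\bar{u} \cdot \frac{\partial u}{\partial x}
+
\bar{z} \cdot \frac{\partial z}{\partial x}
= y + 1, \qquad
\bar{y} = \bar{u} \cdot \frac{\partial u}{\partial y} = x
$$

The node $x$ sums one contribution per outgoing edge, which is exactly what the `+=` in each backward rule implements.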

## Memory Limitations

A minimal reverse mode engine stores every intermediate node until the backward pass finishes.

This causes memory growth proportional to computation size.

Large models introduce several problems:

| Problem | Cause |
|---|---|
| Activation memory explosion | Stored intermediates |
| Long backward latency | Sequential traversal |
| Graph allocation overhead | Dynamic node creation |
| GPU synchronization costs | Dependency ordering |
| Recomputation tradeoffs | Checkpointing decisions |

Most modern AD systems spend more engineering effort on memory management than derivative rules.

## Comparison with Forward Mode

| Property | Forward Mode | Reverse Mode |
|---|---|---|
| Derivative direction | Inputs → outputs | Outputs → inputs |
| Best for | Few inputs | Few outputs |
| Memory usage | Low | Higher |
| Gradient cost for scalar output | O(n) passes | O(1) backward pass |
| Graph storage | Usually unnecessary | Required |
| Jacobian-vector product | Natural | Indirect |
| Vector-Jacobian product | Indirect | Natural |

Both modes are implementations of the chain rule. They differ only in traversal direction and accumulation strategy.

## Minimal Reverse Mode Engine Summary

A minimal reverse mode engine consists of:
- nodes storing values and gradients
- primitive operations with local backward rules
- topological traversal
- gradient accumulation

The engine evaluates programs normally during the forward pass, then replays local derivative rules backward across the graph.

This architecture scales from tiny educational systems to industrial deep learning runtimes. The surrounding infrastructure changes dramatically at scale, but the mathematical core remains almost identical.

