Minimal Reverse Mode Engine

Reverse mode automatic differentiation computes derivatives by traversing the program backward after evaluation. Unlike forward mode, which propagates tangents alongside values, reverse mode separates computation into two phases:

  1. Forward evaluation
  2. Backward gradient propagation

The forward pass computes values and records dependency information. The backward pass propagates sensitivities from outputs toward inputs.

For a scalar-valued function

f : \mathbb{R}^n \to \mathbb{R}

reverse mode computes the full gradient efficiently:

\nabla f(x)

This efficiency is the reason reverse mode dominates machine learning and deep neural network training.

The Core Idea

Suppose:

z = f(x, y)

During evaluation we compute intermediate values:

x, y -> intermediates -> z

Reverse mode introduces an adjoint for each intermediate variable:

\bar{v} = \frac{\partial z}{\partial v}

The adjoint measures how sensitive the final output is to that variable.

The backward pass starts from:

\bar{z} = 1

and propagates derivatives backward through every operation.

This is simply repeated application of the chain rule.
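As a concrete illustration (an example chosen here, not one from the engine below), take z = (x + y) \cdot x with intermediate a = x + y. Starting from \bar{z} = 1, each backward step multiplies the incoming adjoint by a local partial derivative:

\bar{a} = \bar{z} \cdot \frac{\partial z}{\partial a} = x

\bar{x} = \bar{z} \cdot a + \bar{a} \cdot 1 = (x + y) + x

\bar{y} = \bar{a} \cdot 1 = x

The two terms in \bar{x} correspond to the two paths by which x reaches z.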

A Minimal Computational Graph

A reverse mode engine typically represents computations as nodes.

A minimal node structure:

type Node struct {
    Value float64
    Grad  float64

    Prev []*Node
    Backward func()
}

Value stores the primal value.

Grad stores the accumulated adjoint:

\frac{\partial\,\text{output}}{\partial\,\text{node}}

Prev stores dependencies.

Backward defines how gradients propagate to parents.

This is enough to build a minimal reverse mode engine.

Variables and Constants

Variables are nodes with user-provided values.

func Var(x float64) *Node {
    return &Node{
        Value: x,
    }
}

Constants are structurally identical. The backward pass may still write a gradient into a constant, but that gradient is simply never read.

func Const(x float64) *Node {
    return &Node{
        Value: x,
    }
}

In minimal systems there is often no distinction between variables and constants at the node level.
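In use, the distinction is purely conventional:

x := Var(2)   // an input whose Grad we read after the backward pass
c := Const(3) // structurally the same node; its Grad is simply never read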

Addition

Consider:

z = x + y

Then:

\frac{\partial z}{\partial x} = 1, \qquad \frac{\partial z}{\partial y} = 1

The implementation:

func Add(a, b *Node) *Node {
    out := &Node{
        Value: a.Value + b.Value,
        Prev:  []*Node{a, b},
    }

    out.Backward = func() {
        a.Grad += out.Grad
        b.Grad += out.Grad
    }

    return out
}

The key idea:

Parents accumulate contributions from child gradients.

Reverse mode is fundamentally gradient accumulation.
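Subtraction follows the same template with a sign flip. A sketch, consistent with the Node API above though not part of the original engine:

func Sub(a, b *Node) *Node {
    out := &Node{
        Value: a.Value - b.Value,
        Prev:  []*Node{a, b},
    }

    out.Backward = func() {
        // d(a-b)/da = 1, d(a-b)/db = -1
        a.Grad += out.Grad
        b.Grad -= out.Grad
    }

    return out
}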

Multiplication

For:

z = xy

we have:

\frac{\partial z}{\partial x} = y, \qquad \frac{\partial z}{\partial y} = x

Implementation:

func Mul(a, b *Node) *Node {
    out := &Node{
        Value: a.Value * b.Value,
        Prev:  []*Node{a, b},
    }

    out.Backward = func() {
        a.Grad += b.Value * out.Grad
        b.Grad += a.Value * out.Grad
    }

    return out
}

Each backward rule applies the local derivative multiplied by the incoming adjoint.

This is the reverse chain rule.
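Division shows the same shape with non-trivial local derivatives. Another sketch under the same assumptions:

func Div(a, b *Node) *Node {
    out := &Node{
        Value: a.Value / b.Value,
        Prev:  []*Node{a, b},
    }

    out.Backward = func() {
        // dz/da = 1/b, dz/db = -a/b^2
        a.Grad += out.Grad / b.Value
        b.Grad -= a.Value / (b.Value * b.Value) * out.Grad
    }

    return out
}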

Elementary Functions

For:

y = \sin(x)

we know:

\frac{dy}{dx} = \cos(x)

Implementation:

func Sin(x *Node) *Node {
    out := &Node{
        Value: math.Sin(x.Value),
        Prev:  []*Node{x},
    }

    out.Backward = func() {
        x.Grad += math.Cos(x.Value) * out.Grad
    }

    return out
}

For exponentials:

func Exp(x *Node) *Node {
    v := math.Exp(x.Value)

    out := &Node{
        Value: v,
        Prev:  []*Node{x},
    }

    out.Backward = func() {
        x.Grad += v * out.Grad
    }

    return out
}

Every primitive operation defines:

  • forward evaluation
  • local backward propagation

Nothing more is required.
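Extending the engine means adding more primitives of exactly this shape. For example, a natural logarithm sketch (a hypothetical addition, assuming the same Node API):

func Log(x *Node) *Node {
    out := &Node{
        Value: math.Log(x.Value),
        Prev:  []*Node{x},
    }

    out.Backward = func() {
        // d(ln x)/dx = 1/x
        x.Grad += out.Grad / x.Value
    }

    return out
}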

Example Computation

Consider:

f(x) = x^2 + 3x + 2

Implementation:

func F(x *Node) *Node {
    return Add(
        Add(
            Mul(x, x),
            Mul(Const(3), x),
        ),
        Const(2),
    )
}

Evaluation:

package main

import "fmt"

// Assumes Node, Var, Const, Add, Mul, F, and Backprop
// (defined in this article) live in the same package.
func main() {
    x := Var(5)

    y := F(x)

    Backprop(y)

    fmt.Println("value:", y.Value)
    fmt.Println("gradient:", x.Grad)
}

Expected result:

value: 42
gradient: 13

Both match the analytic answers: f(5) = 25 + 15 + 2 = 42, and since the derivative is 2x + 3, the gradient at x = 5 is 13.

The forward pass computes values. The backward pass computes gradients.

Topological Ordering

Backward propagation must occur in reverse dependency order.

Example:

x -> a -> b -> output

We cannot propagate gradients through a before processing b.

A minimal engine therefore performs a topological traversal.

func TopoSort(
    n *Node,
    visited map[*Node]bool,
    order *[]*Node,
) {
    if visited[n] {
        return
    }

    visited[n] = true

    for _, p := range n.Prev {
        TopoSort(p, visited, order)
    }

    *order = append(*order, n)
}

This produces nodes in forward dependency order.

The backward pass iterates in reverse:

func Backprop(root *Node) {
    var order []*Node

    TopoSort(
        root,
        map[*Node]bool{},
        &order,
    )

    root.Grad = 1

    for i := len(order)-1; i >= 0; i-- {
        if order[i].Backward != nil {
            order[i].Backward()
        }
    }
}

This is the core reverse mode algorithm.
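As an end-to-end check (a hypothetical usage of the primitives above), the analytic gradient of sin(x) · exp(x) is exp(x)(sin(x) + cos(x)):

x := Var(0.5)
y := Mul(Sin(x), Exp(x))

Backprop(y)

// Should agree with math.Exp(0.5) * (math.Sin(0.5) + math.Cos(0.5)).
fmt.Println(x.Grad)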

Gradient Accumulation

A subtle but important detail:

a.Grad += ...

not:

a.Grad = ...

Gradients must accumulate because variables may influence outputs through multiple paths.

Example:

f(x) = x^2 + x

The variable x contributes through two independent branches.

Reverse mode sums all contributions:

\frac{df}{dx} = \frac{\partial (x^2)}{\partial x} + \frac{\partial x}{\partial x}

This accumulation property is essential.
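With the engine above, the two-branch example can be checked directly:

x := Var(3)
y := Add(Mul(x, x), x) // x reaches the output along two paths

Backprop(y)

// 2*3 + 1 = 7: the contributions of both paths are summed.
fmt.Println(x.Grad)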

Computational Complexity

Suppose:

f : \mathbb{R}^n \to \mathbb{R}

Forward mode:

  • computes one directional derivative per pass
  • full gradient costs roughly O(n) passes

Reverse mode:

  • computes the full gradient in one backward pass

This makes reverse mode asymptotically favorable for:

  • scalar losses
  • large parameter spaces
  • neural networks

However, reverse mode must store intermediate values and graph structure from the forward pass.

Memory becomes the central systems problem.

Minimal Engine Architecture

A tiny reverse mode engine needs only:

Component               Purpose
Node                    Stores value, gradient, dependencies
Primitive ops           Define local derivative rules
Topological traversal   Orders backward propagation
Gradient accumulation   Combines contributions
Backward pass           Executes reverse chain rule

This minimal structure already supports:

  • scalar gradients
  • arbitrary expression graphs
  • nested compositions
  • reusable primitives

Modern deep learning frameworks are vastly more complex, but their core mechanism is still this architecture.

Dynamic Graph Construction

One important property of this design:

The graph is constructed during execution.

This is called a dynamic computation graph.

Advantages:

  • ordinary language control flow
  • easy debugging
  • natural recursion
  • flexible runtime behavior

Disadvantages:

  • runtime overhead
  • graph allocation costs
  • reduced compiler optimization opportunities

Frameworks such as PyTorch popularized this style.

Static graph systems instead separate graph construction from execution.
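Because graph construction is just execution, ordinary Go control flow builds graphs. A sketch of integer powers via a loop (a hypothetical helper, not part of the engine above):

// Pow builds x^n by allocating one fresh Mul node per loop iteration.
func Pow(x *Node, n int) *Node {
    out := Const(1)
    for i := 0; i < n; i++ {
        out = Mul(out, x)
    }
    return out
}

The gradient falls out with no extra machinery:

x := Var(2)
y := Pow(x, 4)

Backprop(y)

fmt.Println(x.Grad) // 4 * 2^3 = 32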

Correctness Intuition

Suppose node v contributes to output y.

The adjoint:

\bar{v} = \frac{\partial y}{\partial v}

For every edge:

v \to u

the chain rule gives:

\frac{\partial y}{\partial v} = \sum_{u} \frac{\partial y}{\partial u} \frac{\partial u}{\partial v}

Reverse mode computes exactly this recurrence.

Each node receives contributions from downstream nodes and accumulates them.

The algorithm is therefore a graph-based implementation of the multivariable chain rule.

Memory Limitations

A minimal reverse mode engine stores every intermediate node until the backward pass finishes.

This causes memory growth proportional to computation size.

Large models introduce several problems:

Problem                       Cause
Activation memory explosion   Stored intermediates
Long backward latency         Sequential traversal
Graph allocation overhead     Dynamic node creation
GPU synchronization costs     Dependency ordering
Recomputation tradeoffs       Checkpointing decisions

Most modern AD systems spend more engineering effort on memory management than derivative rules.

Comparison with Forward Mode

Property                          Forward Mode          Reverse Mode
Derivative direction              Inputs → outputs      Outputs → inputs
Best for                          Few inputs            Few outputs
Memory usage                      Low                   Higher
Gradient cost for scalar output   O(n) passes           O(1) backward pass
Graph storage                     Usually unnecessary   Required
Jacobian-vector product           Natural               Indirect
Vector-Jacobian product           Indirect              Natural

Both modes are implementations of the chain rule. They differ only in traversal direction and accumulation strategy.
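In Jacobian terms: for f : \mathbb{R}^n \to \mathbb{R}^m with Jacobian J, forward mode evaluates

J v, \quad v \in \mathbb{R}^n

one tangent at a time, while reverse mode evaluates

w^{\top} J, \quad w \in \mathbb{R}^m

one adjoint at a time. With a scalar output (m = 1) and w = 1, the vector-Jacobian product is the entire gradient, which is exactly the case this engine targets.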

Minimal Reverse Mode Engine Summary

A minimal reverse mode engine consists of:

  • nodes storing values and gradients
  • primitive operations with local backward rules
  • topological traversal
  • gradient accumulation

The engine evaluates programs normally during the forward pass, then replays local derivative rules backward across the graph.

This architecture scales from tiny educational systems to industrial deep learning runtimes. The surrounding infrastructure changes dramatically at scale, but the mathematical core remains almost identical.