# Historical Development

Automatic differentiation (AD) developed from a simple observation: a numerical computation already contains the structure needed to compute its derivative. The program evaluates elementary operations in a definite order. If each elementary operation has a derivative rule, the chain rule can be applied along that same sequence, either forward or backward.

This idea appeared before modern machine learning, before tensor libraries, and before contemporary compiler infrastructure. Its roots are in numerical analysis, optimization, scientific computing, and early programming systems.

## Early Numerical Context

Before automatic differentiation became a named field, derivative computation was usually handled in three ways.

First, programmers derived formulas by hand. This was common in mechanics, control, numerical optimization, and physical simulation. Hand derivatives could be fast and accurate, but they were expensive to write and maintain.

Second, programmers used finite differences. This was easy because it treated the program as a black box. But finite differences introduced step-size error, cancellation, and poor scaling with the number of inputs.

Third, symbolic algebra systems differentiated formulas. This worked well for compact expressions, but numerical software was increasingly written as programs with loops, arrays, branches, and library calls.

Automatic differentiation emerged as a fourth method. It kept the accuracy of analytic derivative rules while following the structure of executable programs.

## Wengert Lists

A key early idea, described by R. E. Wengert in 1964, was the representation of a computation as a list of intermediate assignments. Such a list is often called a Wengert list.

A function such as

$$
y = \sin(x^2 + 1)
$$

can be written as:

```text
v1 = x * x
v2 = v1 + 1
v3 = sin(v2)
y  = v3
```

This list makes the chain rule mechanical. Forward mode propagates derivatives from `x` to `y`. Reverse mode propagates adjoints from `y` back to `x`.

The Wengert list is important because it converts a formula or program fragment into an ordered computational object. Once computation has this form, differentiation becomes a systematic transformation rather than a manual derivation.
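
As a minimal sketch of how mechanical this becomes, the list above can be evaluated with a tangent carried alongside each assignment, which is forward mode in miniature. The function name `wengert_forward` and the seed `dx` are illustrative, not taken from any particular system.

```python
import math

def wengert_forward(x, dx=1.0):
    # Evaluate y = sin(x**2 + 1) and its tangent, one assignment at a time.
    v1 = x * x
    dv1 = 2.0 * x * dx            # derivative rule for the square
    v2 = v1 + 1.0
    dv2 = dv1                     # adding a constant leaves the tangent unchanged
    v3 = math.sin(v2)
    dv3 = math.cos(v2) * dv2      # chain rule through sin
    return v3, dv3                # y and dy/dx, scaled by the seed dx

y, dydx = wengert_forward(2.0)    # dydx == math.cos(5.0) * 4.0
```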

## Forward Accumulation

Forward accumulation was the more direct form of AD.

Each variable carries both a value and a derivative component. For a scalar input, we may write:

$$
(v, \dot{v}).
$$

If

$$
z = xy,
$$

then forward mode computes:

$$
z = xy,
\qquad
\dot{z} = \dot{x}y + x\dot{y}.
$$

This method closely resembles ordinary program evaluation. It can be implemented using operator overloading: redefine arithmetic operations so that they operate on pairs instead of plain numbers.
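
A minimal sketch of this overloading idea, assuming a small scalar `Dual` type (the class and its methods are illustrative, not any specific library):

```python
class Dual:
    """Pair (value, derivative) with arithmetic overloaded for forward mode."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (xy)' = x'y + xy'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

x = Dual(3.0, 1.0)   # seed x as the active input
y = Dual(2.0, 0.0)
z = x * y            # z.value == 6.0, z.deriv == 2.0 == dz/dx
```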

Forward mode became attractive because it was simple, local, and compatible with existing programming languages. It remains useful for directional derivatives, Jacobian-vector products, and functions with few inputs.

## Reverse Accumulation

Reverse accumulation was the deeper computational insight.

For scalar-output functions with many inputs, reverse mode computes a full gradient efficiently. It does this by first running the program forward, then propagating adjoints backward.

If

$$
y = f(x_1,\ldots,x_n),
$$

reverse mode computes:

$$
\frac{\partial y}{\partial x_1},
\ldots,
\frac{\partial y}{\partial x_n}
$$

at a cost that is typically a small constant multiple of the cost of evaluating $f$ itself.

This is the principle behind backpropagation in neural networks. But the method is more general than neural networks. It applies to any differentiable program whose output is scalar or low-dimensional.

Reverse accumulation required more machinery than forward mode because intermediate values must be stored or recomputed. This introduced the tape, the computational graph, and later checkpointing methods.
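
A minimal sketch of such a tape, assuming a scalar `Var` node that records its parents and the local derivative for each (names are illustrative, not from any specific library):

```python
import math

class Var:
    """Scalar value that records how it was computed, for reverse mode."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # pairs of (parent node, local derivative)
        self.adjoint = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def sin(v):
    return Var(math.sin(v.value), [(v, math.cos(v.value))])

def backward(output):
    # Order the recorded nodes, then push adjoints from the output backward.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.adjoint = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.adjoint += node.adjoint * local

x = Var(2.0)
y = sin(x * x + Var(1.0))
backward(y)                         # x.adjoint == math.cos(5.0) * 4.0
```

The recorded parents are exactly the intermediate values that must be kept alive, which is where the memory cost of reverse mode comes from.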

## AD and Backpropagation

Backpropagation is reverse mode automatic differentiation specialized to layered models.

In a neural network, a forward pass computes activations layer by layer. A backward pass propagates sensitivities from the loss back through the layers to the parameters.

For a simple composition

$$
L = L(f_k(f_{k-1}(\cdots f_1(x)))),
$$

the derivative with respect to parameters in each layer follows from repeated chain-rule application.
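
Writing $\theta_i$ for the parameters of layer $f_i$ (notation introduced here only for illustration), the repeated application reads:

$$
\frac{\partial L}{\partial \theta_i}
=
\frac{\partial L}{\partial f_k}
\frac{\partial f_k}{\partial f_{k-1}}
\cdots
\frac{\partial f_{i+1}}{\partial f_i}
\frac{\partial f_i}{\partial \theta_i}.
$$

Reverse mode accumulates this product from the loss end, so the partial products are shared across layers and one backward pass yields every $\partial L / \partial \theta_i$.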

The machine learning community often describes this as backpropagation. The automatic differentiation community describes the same core mechanism as reverse accumulation.

The distinction is historical and cultural. Backpropagation grew through neural network training. Reverse mode AD grew through numerical analysis and scientific computing. Mathematically, they share the same chain-rule engine.

## Source Transformation and Operator Overloading

Two implementation traditions shaped AD systems.

Operator overloading changes the meaning of primitive operations for special numeric types. A number becomes a pair, a dual number, or a tape-tracked object. Existing code can often run on these new types with only limited modification.

Source transformation rewrites a program into another program that computes derivatives. Instead of changing runtime behavior through overloaded operators, the system analyzes code and emits derivative code.
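
As an illustrative sketch only, not the output of any actual tool, a reverse-mode source transformer might take a primal routine and emit an adjoint routine next to it:

```python
import math

# Primal routine, as a programmer might write it.
def f(x):
    v1 = x * x
    return math.sin(v1)

# Adjoint routine a reverse-mode transformer might emit:
# given the output sensitivity ybar, return the input sensitivity xbar.
def f_b(x, ybar):
    v1 = x * x                     # forward sweep, recomputing the intermediate
    v1bar = math.cos(v1) * ybar    # reverse through sin
    xbar = 2.0 * x * v1bar         # reverse through the square
    return xbar
```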

Each approach has tradeoffs. Modern frameworks also lean on two related strategies: tracing, which records the operations a program actually executes, and graph compilation, which builds and optimizes an explicit operation graph ahead of execution.

| Approach | Strength | Weakness |
|---|---|---|
| Operator overloading | Simple integration with host languages | Runtime overhead and limited global optimization |
| Source transformation | Better optimization potential | Harder compiler and language integration |
| Tracing | Captures executed operations dynamically | Misses control flow not exercised during the trace |
| Graph compilation | Enables global optimization | Requires restricted operation semantics |

Modern systems often combine these approaches.

## Scientific Computing Systems

Before deep learning made AD widely visible, AD was already used in scientific computing. Applications included optimization, sensitivity analysis, computational fluid dynamics, chemistry, weather models, and inverse problems.

Large Fortran and C programs needed derivatives for solvers and design optimization. Manual derivative code was difficult to maintain. Finite differences were too slow or inaccurate. AD tools provided a practical middle path.

Systems such as ADIFOR and Tapenade were built for this setting. They focused on differentiating existing scientific programs, often through source transformation.

This history matters because it shows that AD was designed for general numerical programs, not only neural networks.

## Modern Machine Learning Systems

The rise of deep learning moved automatic differentiation into everyday software practice.

Frameworks such as Theano, TensorFlow, PyTorch, JAX, and many others made reverse mode AD central to model training. Users described computations in Python or through graph-building APIs, and the system constructed derivative computations automatically.

This changed the role of AD. It became part of the programming model.

A machine learning user now expects to write:

```text
loss = model(params, batch)
grad = gradient(loss, params)
```

and have the system compute the required derivatives.
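
In JAX, for instance, the same expectation is expressed by differentiating the loss function with `jax.grad`; the toy model and data below are purely illustrative:

```python
import jax
import jax.numpy as jnp

def loss(params, batch):
    # A toy linear model: params = (weights, bias), batch = (inputs, targets).
    w, b = params
    x, y = batch
    pred = x @ w + b
    return jnp.mean((pred - y) ** 2)

params = (jnp.ones(3), 0.0)
batch = (jnp.ones((4, 3)), jnp.zeros(4))

grads = jax.grad(loss)(params, batch)   # reverse mode gradient w.r.t. params
```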

This expectation has influenced modern programming language design, compiler infrastructure, array libraries, and hardware accelerators.

## Differentiable Programming

The phrase *differentiable programming* refers to the broader idea that programs can be differentiated as first-class objects.

In this view, AD is not merely a tool for training neural networks. It is a language feature or compiler capability.

A differentiable programming system aims to support derivatives through:

| Feature | Requirement |
|---|---|
| Functions | Differentiable call semantics |
| Control flow | Branch and loop differentiation |
| Data structures | Tangent and adjoint representations |
| Mutation | Correct derivative semantics |
| Libraries | Custom derivative rules |
| Compilation | Optimization of primal and derivative code |
| Hardware | Efficient execution of value and derivative programs |

This expands AD from a numerical technique into a systems problem.
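
The custom derivative rules row above, for example, is usually exposed as a user-level hook. In JAX this hook is `jax.custom_vjp`; the particular function below is only an illustration:

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def log1pexp(x):
    # Numerically naive primal definition.
    return jnp.log(1.0 + jnp.exp(x))

def log1pexp_fwd(x):
    return log1pexp(x), x               # primal output plus residual for the backward pass

def log1pexp_bwd(x, g):
    return (g * jax.nn.sigmoid(x),)     # hand-written, numerically stable derivative rule

log1pexp.defvjp(log1pexp_fwd, log1pexp_bwd)

grad_fn = jax.grad(log1pexp)
grad_fn(3.0)                            # uses the custom rule, not the naive one
```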

## Why the History Matters

The historical development of automatic differentiation explains its hybrid character.

It is calculus, because it applies derivative rules.

It is numerical analysis, because floating point behavior and stability matter.

It is programming language theory, because programs have semantics.

It is compiler engineering, because derivative code must be transformed and optimized.

It is systems design, because memory, parallelism, kernels, and hardware execution dominate performance.

Modern AD systems inherit all of these concerns. A good AD system must compute correct derivatives, represent program structure, handle realistic language features, and run efficiently on modern hardware.

