# Comparative Architecture Analysis

The systems in this chapter show that automatic differentiation is not one implementation technique. It is a family of program transformations. Each system chooses a different representation of the program, a different execution model, and a different boundary between user code, compiler code, and runtime code.

ADIFOR and Tapenade treat AD as source transformation. TensorFlow and PyTorch treat AD as graph or tape execution over tensor primitives. JAX treats AD as a composable transformation over functional array programs. Zygote treats AD as transformation over Julia IR. Enzyme lowers the problem further and transforms LLVM or MLIR. Tinygrad strips the design down to a minimal dynamic tensor graph.

The same mathematics appears in every system:

$$
\text{local derivative rules} + \text{chain rule} + \text{program dependency order}.
$$

The architectural differences come from where the system captures that dependency order.
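A minimal Python sketch (not any particular system's implementation) makes the three ingredients concrete: each operation contributes a local derivative rule, the chain rule combines them, and the backward sweep visits nodes in reverse dependency order.

```python
# Minimal reverse-mode sketch: local rules + chain rule + dependency order.
import math

class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # (parent_node, local_derivative) pairs
        self.adjoint = 0.0

def mul(a, b):
    # local rule: d(a*b)/da = b, d(a*b)/db = a
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    # local rule: d(sin a)/da = cos a
    return Node(math.sin(a.value), [(a, math.cos(a.value))])

def backward(output):
    # chain rule applied in reverse dependency order
    output.adjoint = 1.0
    order, seen = [], set()
    def topo(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for p, _ in n.parents:
            topo(p)
        order.append(n)
    topo(output)
    for node in reversed(order):
        for parent, local in node.parents:
            parent.adjoint += node.adjoint * local

x = Node(2.0)
y = sin(mul(x, x))   # y = sin(x^2)
backward(y)
print(x.adjoint)     # 2*x*cos(x^2) = 4*cos(4) ≈ -2.614
```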

## Program Representation

The most important distinction is the representation being differentiated.

| System | Main representation | Typical user code |
|---|---|---|
| ADIFOR | Fortran source | legacy scientific programs |
| Tapenade | Fortran and C source | simulation and HPC codes |
| TensorFlow | tensor operation graph or tape | neural networks and tensor programs |
| PyTorch | dynamic tensor graph | interactive ML research |
| JAX | traced functional program and JAXPR | functional array programs |
| Zygote | Julia SSA IR | generic Julia numerical code |
| Enzyme | LLVM IR and MLIR | compiled multi-language programs |
| Tinygrad | small dynamic tensor graph | minimal deep learning programs |

A source transformer sees the program near the form written by the user. A compiler IR transformer sees a lower-level but more regular form. A tensor graph system sees only operations from its tensor library. A dynamic tape sees the actual executed path, but only for operations it records.

This choice determines what kinds of programs feel natural to differentiate.

## Mode Support

Most major systems support reverse mode because modern machine learning and optimization usually need gradients of scalar losses with respect to many parameters. Forward mode remains important for directional derivatives, Jacobian-vector products, sensitivity analysis, and higher-order constructions.

| System | Forward mode | Reverse mode | Primary emphasis |
|---|---:|---:|---|
| ADIFOR | yes | limited or secondary | forward source transformation |
| Tapenade | yes | yes | scientific tangent and adjoint code |
| TensorFlow | limited through APIs | yes | tensor reverse mode |
| PyTorch | increasing support | yes | dynamic reverse mode |
| JAX | yes | yes | composable JVP and VJP transformations |
| Zygote | limited | yes | Julia pullbacks |
| Enzyme | yes in some paths | yes | compiler-level reverse mode |
| Tinygrad | minimal | yes | small reverse-mode engine |

Forward mode propagates tangents with the computation. Reverse mode records or reconstructs enough of the computation to propagate adjoints backward. Systems differ mainly in how they store, reconstruct, or transform that information.
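JAX makes the two modes visible as separate transformations; the snippet below is an illustrative use of `jax.jvp` and `jax.vjp` on a scalar-valued function.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) ** 2)

x = jnp.arange(3.0)
v = jnp.ones_like(x)

# Forward mode: propagate a tangent v alongside the computation.
y, tangent_out = jax.jvp(f, (x,), (v,))

# Reverse mode: record the computation, then propagate an adjoint backward.
y, pullback = jax.vjp(f, x)
gradient = pullback(jnp.ones_like(y))[0]

print(tangent_out)   # directional derivative of f at x along v
print(gradient)      # full gradient of the scalar output
```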

## Source Transformation vs Runtime Tracing

Source transformation produces new code before execution. Runtime tracing or tape systems record what happens during execution.

Source transformation has advantages:

| Advantage | Reason |
|---|---|
| compiler visibility | derivative code can be optimized ahead of time |
| auditability | generated code can be inspected |
| whole-program analysis | call graphs and data flow can be transformed |
| HPC integration | works with existing compilers and build systems |

Runtime tracing has different advantages:

| Advantage | Reason |
|---|---|
| flexibility | follows actual execution path |
| easier interaction | works naturally in notebooks and REPLs |
| dynamic models | handles data-dependent structure naturally |
| lower upfront compilation burden | graph built while running |

The cost of source transformation is compiler complexity. The cost of runtime tracing is runtime overhead and weaker whole-program optimization.
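The following Python sketch mimics what a forward-mode source transformer produces. Real tools such as Tapenade emit Fortran or C and run activity analysis first, so this is only an illustration of the idea: a new function is written, before execution, that carries a tangent variable next to each active variable.

```python
import math

def f(x):
    y = x * x
    z = math.sin(y)
    return z

# Hand-written tangent code of the kind a forward-mode transformer would generate.
def f_d(x, x_d):
    y = x * x
    y_d = 2.0 * x * x_d        # derivative statement inserted next to the original
    z = math.sin(y)
    z_d = math.cos(y) * y_d
    return z, z_d

print(f_d(2.0, 1.0))           # (sin(4), 4*cos(4))
```

A runtime tracer, by contrast, never sees this program text; it records the multiply and the sine only when they execute.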

## Tensor Graphs vs General Programs

TensorFlow, PyTorch, JAX, and Tinygrad mostly differentiate tensor programs. ADIFOR, Tapenade, Zygote, and Enzyme aim closer to general program differentiation.

Tensor graphs are easier to differentiate because the primitive set is controlled. Operations such as matrix multiplication, convolution, reduction, broadcasting, and normalization have known derivative rules.

General programs are harder because they contain:

| Feature | Difficulty for AD |
|---|---|
| mutation | old values may be needed in reverse mode |
| aliasing | multiple names may refer to the same memory |
| external calls | derivative semantics may be unknown |
| I/O | usually nondifferentiable |
| dynamic allocation | adjoint storage becomes complex |
| recursion | reverse execution needs stack structure |
| low-level pointers | activity analysis becomes harder |

A tensor graph system avoids many of these problems by restricting the differentiable world. A general AD system accepts more programs but must solve more compiler and runtime problems.

## Dynamic vs Staged Execution

PyTorch and Tinygrad are dynamic by default. The graph is built as operations execute. JAX and TensorFlow can stage computations for compilation. Enzyme, Tapenade, and ADIFOR work ahead of execution.

| Execution style | Examples | Main benefit | Main cost |
|---|---|---|---|
| dynamic | PyTorch, Tinygrad | flexibility and debugging | overhead and fewer global optimizations |
| staged graph | TensorFlow, JAX | compilation and deployment | tracing semantics |
| source transformation | ADIFOR, Tapenade | explicit generated code | complex tooling |
| compiler IR transformation | Enzyme, Zygote | deep compiler integration | harder debugging |

Dynamic execution feels close to ordinary programming. Staged execution gives the system a larger region to optimize. Compiler transformation gives the system even more structural information, but users may need to understand compilation artifacts when something fails.
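As a small illustration of the staged style, JAX can display the traced program it will compile, while an ordinary call runs op by op in the dynamic style.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.tanh(x) * 2.0

print(f(jnp.array(1.0)))                   # dynamic-style call: executes immediately
print(jax.make_jaxpr(f)(jnp.array(1.0)))   # staged view: the traced JAXPR
fast_f = jax.jit(f)                        # compiled as one region, like a staged graph
print(fast_f(jnp.array(1.0)))
```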

## Treatment of State

State is one of the sharpest dividing lines.

JAX pushes users toward explicit state passing. PyTorch permits mutation but guards many unsafe cases with version counters. Zygote prefers functional code and has historically struggled with mutation. Enzyme must reason about memory at the IR level. Tapenade and ADIFOR transform imperative programs directly, so they must handle state through data-flow and activity analysis.

| System | State model |
|---|---|
| JAX | explicit, functional state |
| PyTorch | mutable tensors with autograd checks |
| TensorFlow | variables plus graph semantics |
| Zygote | functional style preferred, mutation difficult |
| Enzyme | compiler memory analysis |
| Tapenade | procedural source analysis |
| ADIFOR | procedural source analysis |
| Tinygrad | simple tensor object state |

Reverse mode over mutation requires either saving old values, recomputing them, or proving they are not needed. This is a core systems problem, not a syntactic inconvenience.
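A short NumPy sketch illustrates the problem; the PUSH comment only suggests the kind of value-saving code an adjoint generator would insert, not any system's actual tape format.

```python
import numpy as np

def step(x):
    x[0] = x[0] * x[1]    # overwrites x[0]; the old x[0] is the local derivative wrt x[1]
    return x[0] + x[1]

# To compute d(output)/d(x[1]) we need the *original* x[0], which no longer
# exists after the overwrite. A reverse-mode tool must either push the old
# value onto a tape before the assignment, recompute it from earlier state,
# or prove the adjoint code never needs it.
x = np.array([2.0, 3.0])
old_x0 = x[0]             # what PUSH-style adjoint code would save
out = step(x)
# backward pass: d out/d x0 = x[1] = 3.0, d out/d x1 = old_x0 + 1 = 3.0
```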

## Memory Strategy

Reverse mode needs forward-pass information during the backward pass. Every system must choose a memory strategy.

| Strategy | Used by | Tradeoff |
|---|---|---|
| tape storage | PyTorch, TensorFlow eager, Tinygrad | simple but memory-heavy |
| checkpointing | Tapenade, TensorFlow, PyTorch, JAX | saves memory, adds recomputation |
| compiler recomputation | Enzyme, JAX, Zygote | can reduce storage, needs analysis |
| explicit derivative arrays | ADIFOR forward mode | predictable but can be large |
| generated adjoint storage | Tapenade, Enzyme | efficient when analysis succeeds |

Memory is often the limiting factor in reverse-mode AD. A derivative program can be mathematically correct but unusable if it stores too much intermediate state.
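Checkpointing is the most user-visible of these strategies. As one illustration, `jax.checkpoint` marks a block whose intermediates are recomputed during the backward pass rather than stored.

```python
import jax
import jax.numpy as jnp

def block(x):
    # several intermediate tensors that reverse mode would normally keep alive
    for _ in range(4):
        x = jnp.sin(x) * jnp.cos(x)
    return x

def loss(x):
    # checkpointing: store only the block's input, recompute the rest backward
    return jnp.sum(jax.checkpoint(block)(x))

g = jax.grad(loss)(jnp.arange(3.0))
print(g)
```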

## Custom Derivatives

All practical AD systems need custom derivative rules.

They are needed when:

| Situation | Example |
|---|---|
| primitive is opaque | external C or Fortran function |
| default derivative is unstable | log-sum-exp, softmax, normalization |
| default derivative is inefficient | linear solvers, eigendecompositions |
| operation is approximate | iterative solver, projection, quantization |
| desired gradient differs from mathematical derivative | straight-through estimator |

The mechanism differs by system.

| System | Custom rule mechanism |
|---|---|
| TensorFlow | `tf.custom_gradient` |
| PyTorch | `torch.autograd.Function` |
| JAX | `custom_jvp`, `custom_vjp` |
| Zygote | ChainRules |
| Enzyme | rules and annotations for external functions |
| Tapenade | user-supplied derivative routines |
| ADIFOR | derivative specifications and transformed subroutines |
| Tinygrad | operation-level backward definitions |

Custom gradients are powerful but dangerous. They create a trusted boundary. Once a user supplies a rule, the AD system usually assumes it is correct.
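As one concrete example of such a mechanism, the sketch below uses `jax.custom_vjp` to implement a straight-through estimator, where the supplied backward rule deliberately differs from the true derivative.

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def ste_round(x):
    return jnp.round(x)

def ste_round_fwd(x):
    return jnp.round(x), None      # no residuals needed for the backward rule

def ste_round_bwd(_, g):
    return (g,)                    # pass the cotangent straight through

ste_round.defvjp(ste_round_fwd, ste_round_bwd)

# The true derivative of round() is zero almost everywhere; the custom rule
# behaves as if the op were the identity, which is the point of the estimator.
grad = jax.grad(lambda x: jnp.sum(ste_round(x) ** 2))(jnp.array([0.4, 1.6]))
print(grad)    # 2 * round(x) = [0., 4.]
```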

## Performance Model

Performance depends on representation, mode, compiler access, and memory behavior.

| System family | Performance strength | Performance risk |
|---|---|---|
| source transformation | compiler-optimized generated code | code size and build complexity |
| tensor tape | simple reverse execution | memory and Python overhead |
| staged tensor compiler | fusion and accelerator optimization | tracing and recompilation cost |
| compiler IR AD | low-level optimization and multi-language support | aliasing and IR complexity |
| minimal dynamic engine | clarity and low conceptual overhead | limited kernel and distributed optimization |

There is no universally best architecture. A small neural network experiment, a production training and inference pipeline, a Fortran climate model, and a differentiable C++ simulator impose different constraints.

## Architectural Lessons

The comparison suggests several durable design lessons.

First, AD systems are compiler systems even when they present themselves as libraries. They must analyze dependencies, transform programs, manage memory, and generate backward computations.

Second, reverse mode is a storage problem as much as a calculus problem. The derivative formulas are local and simple. The hard part is preserving exactly the values needed in the backward pass.

Third, purity helps. Functional programs are easier to differentiate, batch, compile, and parallelize. Mutation can be supported, but it raises the cost of analysis.

Fourth, restricting the primitive set improves reliability. Tensor frameworks succeed partly because they differentiate a controlled operation vocabulary rather than the whole host language.

Fifth, interoperability matters. Enzyme’s IR-level approach and Tapenade’s source-level approach address the same practical need: differentiating code that already exists outside machine learning frameworks.

## Choosing an AD System

A practical selection can be framed by the program being differentiated.

| Program type | Suitable systems |
|---|---|
| legacy Fortran simulation | ADIFOR, Tapenade, Enzyme |
| C or C++ numerical kernel | Tapenade, Enzyme |
| Python deep learning model | PyTorch, TensorFlow, JAX |
| functional array program | JAX |
| Julia scientific code | Zygote, Enzyme, other Julia AD systems |
| educational autograd engine | Tinygrad |
| accelerator-heavy tensor workload | TensorFlow, JAX, PyTorch |
| compiler research | Enzyme, Zygote, JAX internals |

The choice is architectural. It depends less on the derivative rule for multiplication and more on program representation, state model, compiler access, and deployment environment.

## Summary

The major AD systems form a spectrum.

At one end, Tinygrad and PyTorch expose dynamic graph reverse mode in a direct user-facing style. TensorFlow and JAX stage tensor programs for optimization and accelerator execution. Zygote moves AD into a high-level language IR. Tapenade and ADIFOR represent the classical source-transformation tradition for scientific codes. Enzyme lowers AD into the compiler backend.

All of them implement the chain rule. Their differences show where each system chooses to locate the chain rule: in source code, in a runtime tape, in a tensor graph, in a functional IR, or in compiler IR.

