Comparative Architecture Analysis

The systems in this chapter show that automatic differentiation is not one implementation technique. It is a family of program transformations. Each system chooses a different representation of the program, a different execution model, and a different boundary between user code, compiler code, and runtime code.

ADIFOR and Tapenade treat AD as source transformation. TensorFlow and PyTorch treat AD as graph or tape execution over tensor primitives. JAX treats AD as a composable transformation over functional array programs. Zygote treats AD as transformation over Julia IR. Enzyme lowers the problem further and transforms LLVM or MLIR. Tinygrad strips the design down to a minimal dynamic tensor graph.

The same mathematics appears in every system:

$$\text{local derivative rules} + \text{chain rule} + \text{program dependency order}.$$

The architectural differences come from where the system captures that dependency order.
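
A minimal hand-worked sketch in plain Python makes that shared mathematics concrete; the function and values are illustrative, not drawn from any of the systems above:

```python
import math

# Two-operation program: y = sin(x), z = y * y
x = 0.5
y = math.sin(x)              # local rule: dy/dx = cos(x)
z = y * y                    # local rule: dz/dy = 2 * y

# Forward (tangent) order: follow the program's dependency order.
dx = 1.0
dy = math.cos(x) * dx
dz = 2.0 * y * dy            # chain rule composes the local rules

# Reverse (adjoint) order: walk the same dependencies backward.
z_bar = 1.0
y_bar = 2.0 * y * z_bar
x_bar = math.cos(x) * y_bar

assert abs(dz - x_bar) < 1e-12   # both orders yield dz/dx
```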

Program Representation

The most important distinction is the representation being differentiated.

| System | Main representation | Typical user code |
| --- | --- | --- |
| ADIFOR | Fortran source | legacy scientific programs |
| Tapenade | Fortran and C source | simulation and HPC codes |
| TensorFlow | tensor operation graph or tape | neural networks and tensor programs |
| PyTorch | dynamic tensor graph | interactive ML research |
| JAX | traced functional program and jaxpr | functional array programs |
| Zygote | Julia SSA IR | generic Julia numerical code |
| Enzyme | LLVM IR and MLIR | compiled multi-language programs |
| Tinygrad | small dynamic tensor graph | minimal deep learning programs |

A source transformer sees the program near the form written by the user. A compiler IR transformer sees a lower-level but more regular form. A tensor graph system sees only operations from its tensor library. A dynamic tape sees the actual executed path, but only for operations it records.

This choice determines what kinds of programs feel natural.

Mode Support

Most major systems support reverse mode because modern machine learning and optimization usually need gradients of scalar losses with respect to many parameters. Forward mode remains important for directional derivatives, Jacobian-vector products, sensitivity analysis, and higher-order constructions.

| System | Forward mode | Reverse mode | Primary emphasis |
| --- | --- | --- | --- |
| ADIFOR | yes | limited or secondary | forward source transformation |
| Tapenade | yes | yes | scientific tangent and adjoint code |
| TensorFlow | limited through APIs | yes | tensor reverse mode |
| PyTorch | increasing support | yes | dynamic reverse mode |
| JAX | yes | yes | composable JVP and VJP transformations |
| Zygote | mostly reverse | yes | Julia pullbacks |
| Enzyme | yes in some paths | yes | compiler-level reverse mode |
| Tinygrad | minimal | yes | small reverse-mode engine |

Forward mode propagates tangents with the computation. Reverse mode records or reconstructs enough of the computation to propagate adjoints backward. Systems differ mainly in how they store, reconstruct, or transform that information.
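
In a system that exposes both modes, the distinction is visible directly in the API. A small sketch using JAX's `jvp` and `vjp` on an illustrative two-input, two-output function:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([jnp.sin(x[0]) * x[1], x[0] ** 2 + x[1]])

x = jnp.array([1.0, 2.0])

# Forward mode: push a tangent vector through alongside the computation.
y, tangent_out = jax.jvp(f, (x,), (jnp.array([1.0, 0.0]),))

# Reverse mode: run the computation once, then pull a cotangent vector back.
y, pullback = jax.vjp(f, x)
(cotangent_in,) = pullback(jnp.array([1.0, 0.0]))

print(tangent_out)   # one column of the Jacobian
print(cotangent_in)  # one row of the Jacobian
```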

Source Transformation vs Runtime Tracing

Source transformation produces new code before execution. Runtime tracing or tape systems record what happens during execution.

Source transformation has advantages:

| Advantage | Reason |
| --- | --- |
| compiler visibility | derivative code can be optimized ahead of time |
| auditability | generated code can be inspected |
| whole-program analysis | call graphs and data flow can be transformed |
| HPC integration | works with existing compilers and build systems |

Runtime tracing has different advantages:

| Advantage | Reason |
| --- | --- |
| flexibility | follows actual execution path |
| easier interaction | works naturally in notebooks and REPLs |
| dynamic models | handles data-dependent structure naturally |
| lower upfront compilation burden | graph built while running |

The cost of source transformation is compiler complexity. The cost of runtime tracing is runtime overhead and weaker whole-program optimization.
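
A minimal sketch of the two styles for a single scalar function; the "generated" tangent routine is an illustrative hand-written stand-in, not actual Tapenade or ADIFOR output:

```python
import math

def f(x):
    return x * x + math.sin(x)

# Source transformation: a derivative routine exists before anything runs.
def f_tangent(x, dx):
    return (2.0 * x + math.cos(x)) * dx

# Runtime tracing: record the operations that actually execute,
# then walk the record backward afterwards.
def f_traced(x):
    tape = []
    a = x * x;        tape.append(("square", x))
    b = math.sin(x);  tape.append(("sin", x))
    return a + b, tape

def backward(tape):
    x_bar = 0.0
    for op, saved in reversed(tape):          # reverse dependency order
        x_bar += 2.0 * saved if op == "square" else math.cos(saved)
    return x_bar

y, tape = f_traced(1.3)
assert abs(backward(tape) - f_tangent(1.3, 1.0)) < 1e-12
```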

Tensor Graphs vs General Programs

TensorFlow, PyTorch, JAX, and Tinygrad mostly differentiate tensor programs. ADIFOR, Tapenade, Zygote, and Enzyme aim closer to general program differentiation.

Tensor graphs are easier to differentiate because the primitive set is controlled. Operations such as matrix multiplication, convolution, reduction, broadcasting, and normalization have known derivative rules.
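
For example, the reverse rule for matrix multiplication is a fixed formula that a tensor framework registers once. A sketch of the rule itself in NumPy, outside any framework:

```python
import numpy as np

def matmul_vjp(A, B, dC):
    """Given C = A @ B and the upstream cotangent dC, return (dA, dB)."""
    return dC @ B.T, A.T @ dC

A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
dC = np.ones((3, 2))            # cotangent of C, e.g. from summing C
dA, dB = matmul_vjp(A, B, dC)
```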

General programs are harder because they contain:

| Feature | Difficulty for AD |
| --- | --- |
| mutation | old values may be needed in reverse mode |
| aliasing | multiple names may refer to the same memory |
| external calls | derivative semantics may be unknown |
| I/O | usually nondifferentiable |
| dynamic allocation | adjoint storage becomes complex |
| recursion | reverse execution needs stack structure |
| low-level pointers | activity analysis becomes harder |

A tensor graph system avoids many of these problems by restricting the differentiable world. A general AD system accepts more programs but must solve more compiler and runtime problems.

Dynamic vs Staged Execution

PyTorch and Tinygrad are dynamic by default. The graph is built as operations execute. JAX and TensorFlow can stage computations for compilation. Enzyme, Tapenade, and ADIFOR work ahead of execution.

| Execution style | Examples | Main benefit | Main cost |
| --- | --- | --- | --- |
| dynamic | PyTorch, Tinygrad | flexibility and debugging | overhead and fewer global optimizations |
| staged graph | TensorFlow, JAX | compilation and deployment | tracing semantics |
| source transformation | ADIFOR, Tapenade | explicit generated code | complex tooling |
| compiler IR transformation | Enzyme, Zygote | deep compiler integration | harder debugging |

Dynamic execution feels close to ordinary programming. Staged execution gives the system a larger region to optimize. Compiler transformation gives the system even more structural information, but users may need to understand compilation artifacts when something fails.
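
A small contrast, assuming both libraries are available: tracing in JAX produces a whole-program IR (a jaxpr) before compiled execution, while the PyTorch lines build their autograd graph only as each operation runs.

```python
import jax
import jax.numpy as jnp
import torch

def f(x):
    return jnp.sum(jnp.tanh(x) * x)

# Staged: tracing yields a whole-program IR that a compiler can optimize.
print(jax.make_jaxpr(f)(jnp.ones(4)))

# Dynamic: autograd graph nodes are created as each of these lines executes.
t = torch.ones(4, requires_grad=True)
loss = (torch.tanh(t) * t).sum()
loss.backward()
print(t.grad)
```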

Treatment of State

State is one of the sharpest dividing lines.

JAX pushes users toward explicit state passing. PyTorch permits mutation but guards many unsafe cases with version counters. Zygote prefers functional code and has historically struggled with mutation. Enzyme must reason about memory at the IR level. Tapenade and ADIFOR transform imperative programs directly, so they must handle state through data-flow and activity analysis.

| System | State model |
| --- | --- |
| JAX | explicit, functional state |
| PyTorch | mutable tensors with autograd checks |
| TensorFlow | variables plus graph semantics |
| Zygote | functional style preferred, mutation difficult |
| Enzyme | compiler memory analysis |
| Tapenade | procedural source analysis |
| ADIFOR | procedural source analysis |
| Tinygrad | simple tensor object state |

Reverse mode over mutation requires either saving old values, recomputing them, or proving they are not needed. This is a core systems problem, not a syntactic inconvenience.
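
A small PyTorch illustration of that problem: the backward rule for `sigmoid` reads its own saved output, so mutating that output in place trips the version-counter check mentioned above.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)   # autograd saves y, because sigmoid's backward reads its output
y.add_(1.0)            # in-place mutation bumps y's version counter

try:
    y.sum().backward()
except RuntimeError as err:
    print(err)         # reports that a tensor needed for backward was modified in place
```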

Memory Strategy

Reverse mode needs forward-pass information during the backward pass. Every system must choose a memory strategy.

| Strategy | Used by | Tradeoff |
| --- | --- | --- |
| tape storage | PyTorch, TensorFlow eager, Tinygrad | simple but memory-heavy |
| checkpointing | Tapenade, TensorFlow, PyTorch, JAX | saves memory, adds recomputation |
| compiler recomputation | Enzyme, JAX, Zygote | can reduce storage, needs analysis |
| explicit derivative arrays | ADIFOR forward mode | predictable but can be large |
| generated adjoint storage | Tapenade, Enzyme | efficient when analysis succeeds |

Memory is often the limiting factor in reverse-mode AD. A derivative program can be mathematically correct but unusable if it stores too much intermediate state.
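
A sketch of the checkpointing trade in JAX, using `jax.checkpoint` (also known as remat) on an illustrative stack of layers: the gradient is unchanged, but activations inside each layer are recomputed during the backward pass instead of being stored.

```python
import jax
import jax.numpy as jnp

def layer(x):
    return jnp.tanh(x) * x       # the intermediate tanh(x) is normally stored for backward

def net(x):
    for _ in range(8):
        x = layer(x)
    return jnp.sum(x)

def net_remat(x):
    ckpt_layer = jax.checkpoint(layer)   # recompute layer internals in the backward pass
    for _ in range(8):
        x = ckpt_layer(x)
    return jnp.sum(x)

x = jnp.ones(1024)
g_stored = jax.grad(net)(x)
g_remat = jax.grad(net_remat)(x)         # same gradient, lower peak activation memory
```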

Custom Derivatives

All practical AD systems need custom derivative rules.

They are needed when:

| Situation | Example |
| --- | --- |
| primitive is opaque | external C or Fortran function |
| default derivative is unstable | log-sum-exp, softmax, normalization |
| default derivative is inefficient | linear solvers, eigendecompositions |
| operation is approximate | iterative solver, projection, quantization |
| desired gradient differs from mathematical derivative | straight-through estimator |

The mechanism differs by system.

| System | Custom rule mechanism |
| --- | --- |
| TensorFlow | tf.custom_gradient |
| PyTorch | torch.autograd.Function |
| JAX | custom_jvp, custom_vjp |
| Zygote | ChainRules |
| Enzyme | rules and annotations for external functions |
| Tapenade | user-supplied derivative routines |
| ADIFOR | derivative specifications and transformed subroutines |
| Tinygrad | operation-level backward definitions |

Custom gradients are powerful but dangerous. They create a trusted boundary. Once a user supplies a rule, the AD system usually assumes it is correct.
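
A sketch of that trusted boundary using JAX's `custom_vjp`: a straight-through estimator for rounding, where the supplied rule deliberately differs from the mathematical derivative (zero almost everywhere) and the system simply trusts it.

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def round_ste(x):
    return jnp.round(x)

def round_ste_fwd(x):
    return jnp.round(x), None         # no residuals needed for the backward rule

def round_ste_bwd(_residuals, g):
    return (g,)                        # straight-through: pass the cotangent unchanged

round_ste.defvjp(round_ste_fwd, round_ste_bwd)

# The true derivative of round is 0 almost everywhere; the custom rule instead
# reports d/dx (round(x)**2) as 2 * round(x).
print(jax.grad(lambda x: round_ste(x) ** 2)(0.7))   # 2.0
```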

Performance Model

Performance depends on representation, mode, compiler access, and memory behavior.

| System family | Performance strength | Performance risk |
| --- | --- | --- |
| source transformation | compiler-optimized generated code | code size and build complexity |
| tensor tape | simple reverse execution | memory and Python overhead |
| staged tensor compiler | fusion and accelerator optimization | tracing and recompilation cost |
| compiler IR AD | low-level optimization and multi-language support | aliasing and IR complexity |
| minimal dynamic engine | clarity and low conceptual overhead | limited kernel and distributed optimization |

There is no universally best architecture. A small neural network experiment, a production inference-training pipeline, a Fortran climate model, and a differentiable C++ simulator impose different constraints.

Architectural Lessons

The comparison suggests several durable design lessons.

First, AD systems are compiler systems even when they present themselves as libraries. They must analyze dependencies, transform programs, manage memory, and generate backward computations.

Second, reverse mode is a storage problem as much as a calculus problem. The derivative formulas are local and simple. The hard part is preserving exactly the values needed in the backward pass.

Third, purity helps. Functional programs are easier to differentiate, batch, compile, and parallelize. Mutation can be supported, but it raises the cost of analysis.

Fourth, restricting the primitive set improves reliability. Tensor frameworks succeed partly because they differentiate a controlled operation vocabulary rather than the whole host language.

Fifth, interoperability matters. Enzyme’s IR-level approach and Tapenade’s source-level approach address the same practical need: differentiating code that already exists outside machine learning frameworks.

Choosing an AD System

A practical selection can be framed by the program being differentiated.

| Program type | Suitable systems |
| --- | --- |
| legacy Fortran simulation | ADIFOR, Tapenade, Enzyme |
| C or C++ numerical kernel | Tapenade, Enzyme |
| Python deep learning model | PyTorch, TensorFlow, JAX |
| functional array program | JAX |
| Julia scientific code | Zygote, Enzyme, other Julia AD systems |
| educational autograd engine | Tinygrad |
| accelerator-heavy tensor workload | TensorFlow, JAX, PyTorch |
| compiler research | Enzyme, Zygote, JAX internals |

The choice is architectural. It depends less on the derivative rule for multiplication and more on program representation, state model, compiler access, and deployment environment.

Summary

The major AD systems form a spectrum.

At one end, Tinygrad and PyTorch expose dynamic-graph reverse mode in a direct user-facing style. TensorFlow and JAX stage tensor programs for optimization and accelerator execution. Zygote moves AD into a high-level language IR. Tapenade and ADIFOR represent the classical source-transformation tradition for scientific codes. At the other end, Enzyme lowers AD into the compiler backend.

All of them implement the chain rule. Their differences show where each system chooses to locate the chain rule: in source code, in a runtime tape, in a tensor graph, in a functional IR, or in compiler IR.