Tangent Propagation
Forward mode automatic differentiation computes derivatives by propagating tangent values alongside ordinary values. The ordinary value is called the primal. The derivative...
219 notes
This section studies reverse mode automatic differentiation through concrete examples. Each case has the same structure:
Automatic differentiation is easiest to define for pure functions. A pure function behaves like a mathematical mapping: it consumes inputs, produces outputs, and has no...
Physics-informed models combine data fitting with equations from physics or applied mathematics. The model is trained not only to match observed samples, but also to satisfy...
Automatic differentiation began as a numerical technique for computing gradients of scalar functions.
A minimal automatic differentiation engine can compute correct gradients on small programs. A production system must survive long-running workloads, large tensors, distributed...
Automatic differentiation works naturally on pure mathematical functions:
Automatic differentiation is a method for computing derivatives by transforming programs into derivative-propagating computations. It does not approximate derivatives...
Forward mode automatic differentiation appears in many numerical systems where directional derivatives, local sensitivities, or small parameter sets are important. This...
Reverse mode automatic differentiation is the mathematical and systems basis of backpropagation. In deep learning, the objective is usually a scalar loss depending on many...
Automatic differentiation is deeply connected to functional programming and lambda calculus. Programs can be viewed as mathematical functions, and differentiation can be...
Higher-order automatic differentiation faces a fundamental problem: derivative structure grows combinatorially with order.
Modern automatic differentiation systems are fundamentally tensor compiler systems. Their performance depends less on mathematical differentiation rules than on how...
Automatic differentiation interacts deeply with type systems because differentiation changes the structure of computation. A derivative operator maps one function into another...
Reinforcement learning studies learning systems that act in an environment. Unlike supervised learning, the training signal is not a target label for each input. The model...
Probabilistic programming represents uncertainty using executable probabilistic models. A probabilistic program defines a distribution rather than only a deterministic computation.
Differentiable systems architecture extends automatic differentiation beyond isolated functions and neural network layers. The central idea is to treat larger systems as...
Distributed gradient computation appears when a differentiable program no longer fits comfortably on one device or one machine. The reason may be model size, data volume,...
Automatic differentiation systems are usually trusted because they implement mathematically established rules such as the chain rule, product rule, and linearization of...
The preceding sections described automatic differentiation through algebraic, categorical, logical, and denotational models. These viewpoints converge on one central idea:
An automatic differentiation engine is only useful if its derivatives are correct. A small mistake in a backward rule can silently corrupt optimization, training, or...
The systems in this chapter show that automatic differentiation is not one implementation technique. It is a family of program transformations. Each system chooses a different...
A differentiable subprogram is a program fragment that can participate in derivative propagation as a coherent unit. Instead of differentiating an entire application...
Automatic differentiation can be understood as a transformation from one program into another program.
Many real-world Jacobians are sparse. Most derivative entries are zero because outputs depend only on small subsets of inputs.
Checkpointing is a technique for reducing the memory cost of reverse mode automatic differentiation by selectively storing intermediate states and recomputing missing values...
Perturbation confusion is a correctness bug that appears in nested automatic differentiation, especially nested forward mode. It happens when two derivative computations...
Programs do not only branch between valid computations. They also fail, stop early, raise exceptions, return sentinel values, or enter undefined numerical regions. These...
Most real computational problems are sparse. Large matrices and tensors often contain mostly zeros, structured blocks, or local interactions. Sparse representations reduce...
Swift became an important experiment in language-integrated automatic differentiation because it attempted to make differentiation a core compiler feature rather than a...
Meta-learning studies systems that improve how they learn. Instead of only optimizing model parameters for one task, a meta-learning method optimizes some part of the learning...
Robotics and control systems interact with the physical world through sensing, estimation, planning, and actuation. Automatic differentiation is important because modern...
A hybrid symbolic-numeric system combines discrete symbolic reasoning with continuous numerical computation. In the context of automatic differentiation, it means a pipeline...
Modern automatic differentiation systems are built around accelerator hardware. GPUs and TPUs provide enormous throughput for tensor operations, making large-scale...
Automatic differentiation began as a transformation applied to numerical programs. A differentiable programming language instead treats differentiation as a native semantic...
Operational semantics explains how automatic differentiation executes. Denotational semantics explains what differentiable programs mean.
Performance benchmarking measures whether an automatic differentiation engine is fast, memory-efficient, and scalable under realistic workloads. It also protects the engine...
Tinygrad is a small deep learning framework centered around a minimal reverse-mode automatic differentiation engine. It was created by George Hotz...
Differentiation describes how a function changes locally. A Taylor expansion extends this idea by approximating a function with a polynomial around a point.
Automatic differentiation became important because derivatives are required everywhere numerical models are optimized, controlled, calibrated, or analyzed. Once a system can...
Forward mode and reverse mode propagate different kinds of objects.
A pure computation is easier to differentiate because every output is determined only by its explicit inputs. There is no hidden state, no external mutation, and no dependence...
Automatic differentiation computes derivatives exactly with respect to the executed floating point program. This distinguishes AD from numerical differentiation, which...
Forward mode automatic differentiation computes Jacobian-vector products:
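A minimal sketch of this idea, assuming a made-up two-input, two-output toy function: each line carries a value together with the tangent seeded by the direction v, and the returned tangents form the Jacobian-vector product J(x)·v.

    import math

    def f_and_jvp(x, v):
        # Toy function f(x) = (sin(x0*x1), x0 + x1**2), evaluated while
        # propagating the tangent direction v alongside each intermediate.
        a, da = x[0] * x[1], v[0] * x[1] + x[0] * v[1]
        y1, dy1 = math.sin(a), math.cos(a) * da
        y2, dy2 = x[0] + x[1] ** 2, v[0] + 2 * x[1] * v[1]
        return (y1, y2), (dy1, dy2)

    value, jvp = f_and_jvp((1.0, 2.0), (1.0, 0.0))  # jvp is column 0 of the Jacobian at (1, 2)

One pass yields one directional derivative (one Jacobian column when v is a basis vector) at roughly the cost of the original function.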
Reverse mode automatic differentiation is computationally efficient for scalar-output functions, but it has a major systems cost: it needs information from the forward pass...
Automatic differentiation can be described operationally through dual numbers and computational graphs. It can also be described abstractly using category theory.
Higher-order derivatives contain rich geometric information, but naïve computation quickly becomes impractical.
A stateful system is a program whose output depends not only on its explicit inputs, but also on stored state. The state may live in variables, objects, arrays, files, random...
The singular value decomposition (SVD) is one of the most important matrix factorizations in numerical linear algebra. It appears in dimensionality reduction, least squares,...
Julia was designed for high-performance technical computing. It combines interactive syntax with a compiler capable of specializing code aggressively based on types. This...
An implicit layer defines its output as the solution of an equation, not as a fixed sequence of explicit operations. Instead of computing
Signal processing studies how information is represented, transformed, filtered, compressed, reconstructed, and estimated from signals. A signal may be a time series, an...
A differentiable operating system is an execution environment whose resource-management decisions can be optimized using gradients or gradient-like feedback. Instead of...
Automatic differentiation is usually described as a transformation of programs or computational graphs. In real systems, it is also a parallel execution problem. Large...
Quantum computation introduces a computational model fundamentally different from classical programs.
Automatic differentiation systems are trusted infrastructure. Scientific computing, machine learning, optimization, simulation, and control systems depend on gradients being...
A custom gradient gives the user direct control over the backward rule of an operation. The forward computation still produces an ordinary value, but the derivative no longer...
Enzyme is a compiler-based automatic differentiation system for LLVM and MLIR. Instead of differentiating source code directly, or recording tensor operations at runtime,...
Automatic differentiation developed from a simple observation: a numerical computation already contains the structure needed to compute its derivative. The program evaluates...
Linearization is the operation of replacing a nonlinear function by its best local linear approximation at a chosen point. Automatic differentiation can be understood as a...
Automatic differentiation operates on computations, but computations execute inside a memory model. Variables occupy storage locations, arrays are mutated, buffers are reused,...
Automatic differentiation is fundamentally a computational technique. Its practical importance comes from the fact that derivatives can often be computed with asymptotic cost...
So far, forward mode has propagated a single tangent direction:
A Wengert list is a linear representation of a computation in which every intermediate result is assigned to a unique variable. It is one of the earliest and most influential...
Dual numbers and hyper-dual numbers are special cases of a broader algebraic structure called a differential algebra. This framework abstracts differentiation away from...
Taylor mode automatic differentiation computes derivatives by propagating truncated Taylor series through a program.
A non-smooth program contains operations where the derivative is undefined, discontinuous, set-valued, or unstable under small perturbations. These programs arise naturally in...
Eigenvalue problems are fundamental in numerical analysis, optimization, physics, graph methods, control theory, and machine learning. They are also among the most subtle...
Attention is a sequence operation that lets each position read information from other positions. Instead of compressing the whole past into one recurrent hidden state,...
Computational finance uses numerical models to price contracts, measure risk, and optimize portfolios. Automatic differentiation is useful because most financial computations...
A differentiable compiler is a compilation system that supports gradient propagation through compilation decisions, generated programs, or execution behavior. Instead of...
Automatic differentiation systems are often assumed to be deterministic. Given identical inputs, identical parameters, and identical code, many users expect identical...
Classical automatic differentiation computes derivatives of deterministic programs.
Automatic differentiation transforms programs. A fundamental semantic question therefore arises:
An automatic differentiation engine becomes useful only after it supports a sufficiently rich set of primitive operations. The collection of these primitives is the operator...
Zygote is a source-to-source reverse-mode automatic differentiation system for the Julia programming language. It was designed to differentiate high-level Julia code directly,...
Derivative computation is not only a mathematical problem. It is also a numerical and systems problem. A derivative method must answer three questions simultaneously:
A computational graph represents a calculation as nodes and edges. Nodes represent operations or values. Edges represent data dependencies. Automatic differentiation uses this...
Loops express repeated computation. Recurrence relations express the same idea mathematically: each state is computed from one or more earlier states.
Mixed-mode differentiation combines forward accumulation and reverse accumulation in the same derivative computation. It is used when neither pure forward mode nor pure...
Forward mode automatic differentiation has a simple cost model. It evaluates the original program and, at the same time, evaluates the tangent program. Each primitive...
Most reverse mode automatic differentiation systems require a mechanism for recording the forward computation so that the reverse pass can later traverse it backward. This...
Dual numbers compute first derivatives exactly. Truncated polynomial algebras extend this to higher-order derivatives, but practical higher-order differentiation introduces an...
Nested automatic differentiation means applying automatic differentiation inside another automatic differentiation computation.
A piecewise differentiable function is built from several differentiable pieces joined by boundaries. Each piece has an ordinary derivative inside its region. At the...
Matrix factorizations rewrite a matrix into structured factors. They are used because the factors make later computations cheaper, more stable, or easier to interpret. In...
Python became the dominant language for modern machine learning and differentiable computing because it combines a simple programming model with access to high-performance...
Sequence models process ordered data. The input is not one independent vector, but a series:
Molecular simulation models the behavior of atoms and molecules using physical interaction laws. Automatic differentiation is important because many molecular methods require...
Differentiable search and retrieval systems integrate information access into gradient-based learning. Instead of treating retrieval as an external symbolic operation, the...
Gradient-based optimization relies on propagating derivative information through many layers, time steps, or computational transformations. In deep systems, these gradients...
Classical neural networks apply a finite sequence of transformations:
Automatic differentiation becomes substantially more difficult once programs contain higher-order functions.
Memory management is the main systems problem in reverse mode automatic differentiation. The derivative rules are usually small. The hard part is deciding which primal values,...
JAX is an automatic differentiation and array programming system for Python. It combines NumPy-like syntax with composable program transformations. Its core transformations...
Automatic differentiation computes derivatives by applying the chain rule to the operations of a program. The input is ordinary code that computes a value. The output is code,...
The chain rule is the central theorem behind automatic differentiation. Every useful AD algorithm is a disciplined way of applying the chain rule to a program.
Control flow determines which operations a program executes. Straight-line programs have a fixed sequence of operations, but ordinary programs contain branches, loops,...
Reverse accumulation is the reverse-mode form of automatic differentiation. It propagates derivative information backward from outputs to inputs.
The natural output of forward mode automatic differentiation is a Jacobian-vector product. Instead of constructing the full Jacobian matrix explicitly, forward mode computes...
Reverse accumulation is the operational core of reverse mode automatic differentiation. The forward pass evaluates a program and records dependency information. The reverse...
Dual numbers capture first-order derivatives because the infinitesimal element satisfies
Reverse mode is efficient for scalar-output functions because it propagates one adjoint backward through the computation and produces a full gradient. For
A dynamic graph is a computation graph built while the program runs. Its structure depends on ordinary runtime values: branches, loop counts, recursive calls, tensor shapes,...
Linear algebra primitives are tensor operations with algebraic structure: matrix multiplication, triangular solves, factorizations, inverses, determinants, norms, and spectral...
Neural network training is the repeated application of three operations: evaluate a model, differentiate a scalar loss, and update parameters. Automatic differentiation...
Computational fluid dynamics studies fluid motion by solving discretized forms of the governing equations. Automatic differentiation enters CFD when we want gradients of...
A differentiable physics engine computes gradients of physical simulation outputs with respect to inputs, parameters, or control signals. Instead of treating simulation as a...
Reverse-mode automatic differentiation trades computation for memory. To compute gradients efficiently, the backward pass requires access to intermediate values produced...
Many systems evolve continuously over time rather than through discrete layers. A state variable changes according to a differential equation:
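A minimal sketch of this setting, assuming a generic first-order equation dy/dt = f(t, y) and a simple explicit Euler discretization; every update is ordinary arithmetic, so derivative propagation can pass through the time loop.

    def euler_solve(f, y0, t0, t1, steps):
        # Explicit Euler sketch for dy/dt = f(t, y).
        y, t = y0, t0
        h = (t1 - t0) / steps
        for _ in range(steps):
            y = y + h * f(t, y)
            t = t + h
        return y

    y_final = euler_solve(lambda t, y: -y, 1.0, 0.0, 1.0, 100)  # approximates exp(-1)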
Cartesian differential categories model differentiation in categories with products. Differential categories generalize this idea further by shifting attention from cartesian...
A tape is an append-only record of the operations executed during the forward pass. Reverse mode uses the tape to replay derivative rules backward.
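A minimal sketch of a tape, assuming a toy scalar Var type that records only add, mul, and sin; the names and structure are illustrative, not any particular library's API.

    import math

    tape = []  # append-only: (output id, [(input id, local derivative)])

    class Var:
        _count = 0
        def __init__(self, value):
            self.value = value
            self.id = Var._count
            Var._count += 1
        def __add__(self, other):
            out = Var(self.value + other.value)
            tape.append((out.id, [(self.id, 1.0), (other.id, 1.0)]))
            return out
        def __mul__(self, other):
            out = Var(self.value * other.value)
            tape.append((out.id, [(self.id, other.value), (other.id, self.value)]))
            return out

    def sin(x):
        out = Var(math.sin(x.value))
        tape.append((out.id, [(x.id, math.cos(x.value))]))
        return out

    def grad(output, inputs):
        # Replay the tape backward, accumulating adjoints toward the inputs.
        adjoint = {output.id: 1.0}
        for out_id, parents in reversed(tape):
            for in_id, local in parents:
                adjoint[in_id] = adjoint.get(in_id, 0.0) + adjoint.get(out_id, 0.0) * local
        return [adjoint.get(x.id, 0.0) for x in inputs]

    x, y = Var(1.5), Var(0.5)
    z = sin(x * y) + x              # forward pass appends every primitive to the tape
    print(grad(z, [x, y]))          # [cos(0.75)*0.5 + 1.0, cos(0.75)*1.5]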
PyTorch Autograd is a dynamic reverse-mode automatic differentiation system. It records tensor operations as they execute, builds a computation graph at runtime, and then...
Symbolic differentiation computes derivatives by manipulating expressions. The input is a formula. The output is another formula.
The gradient is enough when a function has many inputs and one scalar output. More general programs need more general derivative objects. Two of the most important are the...
A dependency graph describes how values in a computation depend on earlier values. Automatic differentiation operates on these dependencies.
Forward accumulation is the forward-mode form of automatic differentiation. It propagates derivative information in the same order as ordinary program evaluation. Each...
Forward mode automatic differentiation works by replacing each primitive operation with an extended operation on pairs:
Reverse mode automatic differentiation fundamentally computes vector-Jacobian products. The gradient of a scalar function is a special case of this more general operation.
Dual numbers provide an algebraic mechanism for differentiation, but they also have a precise geometric meaning. A dual number represents a point together with an...
A Hessian-vector product computes
Recursion is control flow where a function calls itself. In automatic differentiation, recursion behaves like a loop with a call stack. Each recursive call contributes one...
Broadcasting is the rule system that allows tensor operations between arrays of different shapes without explicitly materializing expanded copies. It is one of the most...
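A minimal sketch of the rule in NumPy terms (NumPy used here only for illustration):

    import numpy as np

    a = np.arange(3.0).reshape(3, 1)   # shape (3, 1)
    b = np.arange(4.0)                 # shape (4,)
    c = a + b                          # result has shape (3, 4); no expanded copies are materialized

In an AD system, the backward rule of a broadcasted operation must sum adjoints over the broadcast dimensions so that each input receives a gradient of its original shape.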
Differentiable programming treats differentiation as a general programming-language feature. A program can contain numerical kernels, control flow, data structures, solvers,...
Backpropagation is reverse mode automatic differentiation applied to neural networks. In most machine learning writing, the term refers to the whole training procedure: run a...
An inverse problem asks for causes from effects. A forward model predicts observations from parameters. An inverse model tries to recover parameters from observations.
Differentiable rendering is the process of computing derivatives of rendered images with respect to scene parameters. A renderer becomes part of the computational graph rather...
Floating point systems represent numbers within a finite range. When a computed value exceeds the largest representable magnitude, overflow occurs. When a value becomes too...
An optimization layer is a program component whose output is the solution of an optimization problem. Instead of computing
Algebraic semantics describes differentiation through derivations, tangent maps, and linear structure. Categorical semantics goes further. It studies differentiation as a...
A graph representation makes the structure of a differentiated computation explicit. In reverse mode, this structure is required because the backward pass must know which...
TensorFlow Autograd refers to TensorFlow’s automatic differentiation system, mainly exposed through tf.GradientTape. It is a reverse-mode AD system designed for tensor...
Numerical differentiation estimates derivatives by evaluating a function at nearby input values. It treats the function as a black box. The method does not need access to the...
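A minimal sketch of the black-box approach, using a central difference with a hand-picked step size h:

    import math

    def central_difference(f, x, h=1e-5):
        # Only needs evaluations of f near x; no access to the program's structure.
        return (f(x + h) - f(x - h)) / (2 * h)

    approx = central_difference(math.sin, 1.0)   # close to cos(1.0), limited by truncation and rounding error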
Automatic differentiation is usually applied to functions with many inputs and many outputs. The calculus needed for this setting is multivariate calculus: the study of how a...
Intermediate variables are the named values created between program inputs and program outputs. They make automatic differentiation mechanical.
Automatic differentiation reduces differentiation to a finite collection of elementary operations. Every program, regardless of complexity, is decomposed into primitive...
Dual numbers give forward mode automatic differentiation a compact algebraic form. Instead of storing a value and a tangent as two unrelated fields, we package them into one...
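A minimal sketch of that packaging, assuming a toy Dual type with only add, mul, and exp defined:

    import math

    class Dual:
        # value and tangent packaged into one object
        def __init__(self, value, tangent=0.0):
            self.value, self.tangent = value, tangent
        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.tangent + other.tangent)
        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value * other.value,
                        self.value * other.tangent + self.tangent * other.value)

    def exp(x):
        # Each primitive applies its local derivative rule to the tangent part.
        return Dual(math.exp(x.value), math.exp(x.value) * x.tangent)

    x = Dual(0.5, 1.0)     # seed dx/dx = 1
    y = exp(x * x)         # y.tangent equals d/dx exp(x**2) at 0.5, i.e. 2*0.5*exp(0.25)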
Reverse mode automatic differentiation operates on a computational graph. The forward pass evaluates the graph from inputs to outputs. The reverse pass traverses the same...
The defining feature of dual numbers is the existence of a nonzero element whose square vanishes:
For a scalar function
A loop repeats a computation until a condition fails or a fixed iteration count is reached. In automatic differentiation, loops are important because many numerical algorithms...
Tensor operations generalize scalar, vector, and matrix operations to arrays with arbitrary rank. In automatic differentiation, a tensor is usually treated as a typed array...
Functional programming languages provide a natural semantic foundation for automatic differentiation. Programs are expressed as compositions of functions, immutable values,...
Stochastic optimization studies optimization when the objective is accessed through samples, noisy estimates, or partial observations. In machine learning, this is the normal...
Sensitivity analysis studies how changes in inputs affect the outputs of a system. In differential equations, optimization, simulation, and machine learning, the main object...
A differentiable database is a data system whose operations participate in gradient-based optimization. Instead of treating storage and querying as external infrastructure,...
Reverse mode automatic differentiation computes gradients by propagating adjoint values backward through a computational graph. In exact arithmetic, the reverse accumulation...
A solver is a program that computes a value by search, iteration, or factorization. Instead of evaluating a closed-form expression, it finds a value that satisfies a condition.
Automatic differentiation is often introduced operationally. A program executes elementary operations, and derivative information propagates alongside the computation. This...
Reverse mode automatic differentiation computes derivatives by traversing the program backward after evaluation. Unlike forward mode, which propagates tangents alongside...
Tapenade is a source-transformation automatic differentiation system developed at INRIA. Like ADIFOR, it takes an existing program and produces a new differentiated program....
A derivative measures how an output changes when an input changes. That sentence is simple, but it is one of the main ideas behind numerical computing, optimization, machine...
Automatic differentiation begins with a simple object: a function.
A straight-line program is the simplest model of computation used in automatic differentiation. It is a program with a fixed sequence of assignments, no branches, no loops,...
Automatic differentiation is built on a simple observation: a complicated derivative can be computed by composing many small local derivatives. Instead of manipulating a full...
Forward mode automatic differentiation computes derivatives by carrying two values through a program at the same time: the ordinary value and its tangent. The ordinary value...
Reverse mode automatic differentiation computes derivatives by propagating sensitivities backward through a computation. In forward mode, each intermediate value carries a...
Dual numbers give the cleanest algebraic model of forward mode automatic differentiation. They extend ordinary real numbers with a formal infinitesimal part. Instead of...
First derivatives describe local rate of change. Second derivatives describe how that rate of change itself changes. In optimization, this is curvature. In dynamics, it is...
A conditional is a program construct that chooses one computation among several possible computations. In ordinary code, this is written as if, else, switch, case, pattern...
Matrix calculus is the notation and rule system used to differentiate functions whose inputs, outputs, or intermediate values are vectors, matrices, or tensors. Automatic...
Gradient descent is the basic optimization procedure behind much of modern machine learning. It is simple enough to state in one line, but rich enough to expose many of the...
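The one-line statement, as a minimal sketch with a hypothetical gradient function grad_f:

    def gradient_descent(grad_f, x0, learning_rate=0.1, steps=100):
        x = x0
        for _ in range(steps):
            x = x - learning_rate * grad_f(x)   # the whole method: step against the gradient
        return x

    minimum = gradient_descent(lambda x: 2 * (x - 3.0), x0=0.0)   # minimizes (x - 3)**2, converges toward 3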
Differential equations are one of the main reasons automatic differentiation matters in scientific computing. Many scientific models are not written as closed-form functions....
An end-to-end differentiable pipeline is a system whose final objective can send derivative information backward through every trainable or tunable stage of computation....
Automatic differentiation computes derivatives by executing arithmetic. On a real machine, arithmetic uses finite precision. This means AD gives the derivative of the...
Many programs do not compute their output by applying a fixed sequence of explicit operations. Instead, they define the output as the solution of another problem.
Automatic differentiation is often described by a simple rule:
A minimal forward mode automatic differentiation engine has one job: evaluate a program while carrying both a value and its derivative. The engine does not build a graph. It...
ADIFOR, short for Automatic Differentiation of Fortran, is one of the classical source-transformation systems for automatic differentiation. It was designed for numerical...
Sparse and structured differentiation studies how to compute derivatives without materializing dense derivative objects. Many real systems have enormous Jacobians and...
Auto Diff book notes exported from ChatGPT, organized into 22 chapters.
Lisp is one of the natural homes of automatic differentiation. It treats programs as data, has a simple expression syntax, and supports macro systems that can transform code...
Source transformation is an implementation strategy for automatic differentiation in which a program that computes a function is rewritten into another program that computes...
Automatic differentiation can be performed before a program runs, while it runs, or in a staged phase between the two.
Kernel fusion combines several small operations into one larger executable unit.
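A minimal sketch of the idea in plain Python, where the "kernels" are just loops over a list:

    xs = [0.1 * i for i in range(1000)]

    # Unfused: two passes over the data and one temporary buffer.
    tmp = [x * x for x in xs]
    out = [t + 1.0 for t in tmp]

    # Fused: one pass and no temporary; this is the shape a fusing compiler aims to produce.
    out_fused = [x * x + 1.0 for x in xs]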
Memory planning determines where values are stored, how long they remain alive, and when storage can be reused.
Staging is the separation of a program into phases.
Tracing is an implementation strategy where an AD system observes a program while it runs and records the operations that occur.
Rust is an attractive language for automatic differentiation because it combines low-level performance with strong static guarantees. It gives the programmer control over...
A graph intermediate representation models a program as nodes and edges.
Static single assignment form, or SSA, is an intermediate representation where each variable is assigned exactly once.
C and C++ are important targets for automatic differentiation because much scientific, engineering, graphics, finance, and machine learning infrastructure is written in these...
An intermediate representation, or IR, is the internal program form used by a compiler or AD system after parsing and before final code generation.
Operator overloading implements automatic differentiation by changing the meaning of ordinary arithmetic operations for special numeric objects.