PyTorch Autograd is a dynamic reverse-mode automatic differentiation system. It records tensor operations as they execute, builds a computation graph at runtime, and then traverses that graph backward to compute gradients.
Its defining feature is dynamic graph construction. The graph follows the actual Python execution path taken by the program. This makes PyTorch convenient for models with conditionals, loops, recursion, and variable-length inputs, and for debugging workflows.
## Dynamic Computation Graphs
In PyTorch, tensors can carry gradient-tracking metadata. When an operation is applied to tensors with requires_grad=True, PyTorch records the operation in a graph.
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x * x + torch.sin(x)
y.backward()
print(x.grad)  # dy/dx = 2*x + cos(x), evaluated at x = 2
```

Here y.backward() starts reverse accumulation from y. Since y is a scalar, the initial adjoint is implicitly 1. PyTorch then applies local backward rules in reverse topological order.
The graph is rebuilt on every forward pass. This is often called define-by-run. The Python program defines the graph by running.
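A minimal sketch of what define-by-run means in practice: the recorded graph depends on which branch the Python program actually takes, so data-dependent control flow differentiates without any special graph operators.

```python
import torch

def f(x):
    # the graph records only the branch that actually executes
    if x.sum() > 0:
        return (x * x).sum()
    return x.sum()

x = torch.randn(3, requires_grad=True)
f(x).backward()  # differentiates whichever branch ran this time
print(x.grad)
```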
## Tensors, Leaves, and Gradients
A tensor is a leaf tensor when it is created directly by the user and not as the result of a differentiable operation. Model parameters are usually leaf tensors.
```python
w = torch.randn(10, requires_grad=True)
b = torch.zeros(10, requires_grad=True)
y = w * 2 + b
loss = y.sum()
loss.backward()
```

After backward, gradients are stored in w.grad and b.grad. Intermediate tensors normally do not store .grad unless explicitly requested with retain_grad().
This distinction matters because optimizers update leaf parameters, not arbitrary intermediate values.
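A small sketch of the distinction, using is_leaf and retain_grad():

```python
import torch

w = torch.randn(3, requires_grad=True)  # leaf: created directly by the user
y = w * 2                               # non-leaf: produced by an operation
y.retain_grad()                         # opt in to storing y.grad
y.sum().backward()
print(w.is_leaf, y.is_leaf)  # True False
print(w.grad, y.grad)        # both populated, y.grad only via retain_grad()
```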
## Reverse Mode and Vector-Jacobian Products
PyTorch Autograd computes vector-Jacobian products. For a computation

$$y = f(x)$$

the backward pass propagates an upstream cotangent

$$\bar{y} = \frac{\partial L}{\partial y}$$

and computes

$$\bar{x} = J_f(x)^\top \bar{y} = \frac{\partial L}{\partial x},$$

where $J_f(x) = \partial y / \partial x$ is the Jacobian of $f$.
For scalar losses, this gives gradients with respect to many parameters efficiently. This is the standard shape of deep learning training: many inputs and parameters, one scalar objective.
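When the output is not a scalar, backward requires an explicit cotangent vector. A minimal sketch:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x ** 2                          # non-scalar output
v = torch.tensor([1.0, 0.5, 0.25])  # upstream cotangent, dL/dy
y.backward(v)                       # computes J^T v; here J = diag(2x)
print(torch.allclose(x.grad, 2 * x.detach() * v))  # True
```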
For a matrix multiplication

$$C = A B$$

the backward rules are:

$$\bar{A} = \bar{C}\, B^\top, \qquad \bar{B} = A^\top \bar{C}.$$
PyTorch applies these rules through registered backward functions attached to operations.
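These rules can be checked numerically against autograd. A quick verification sketch:

```python
import torch

A = torch.randn(3, 4, requires_grad=True)
B = torch.randn(4, 5, requires_grad=True)
C = A @ B
G = torch.randn_like(C)  # upstream gradient dL/dC
C.backward(G)
print(torch.allclose(A.grad, G @ B.detach().T))  # dL/dA = G B^T
print(torch.allclose(B.grad, A.detach().T @ G))  # dL/dB = A^T G
```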
## The Autograd Graph
Each differentiable tensor produced by an operation has a grad_fn. This object points to the backward function for the operation.
```python
x = torch.tensor(2.0, requires_grad=True)
y = x * x
print(y.grad_fn)  # <MulBackward0 object at ...>
```

The graph contains backward nodes and saved tensors needed for gradient computation. During backward execution, PyTorch uses these saved values to evaluate derivative rules.
For example, multiplication needs the original operands. Sine needs the original input because the derivative uses cosine.
```python
z = torch.sin(x)
```

Backward uses:

$$\frac{\partial z}{\partial x} = \cos(x)$$

The value of x from the forward pass must therefore be available during the backward pass.
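The backward graph can be inspected directly. grad_fn.next_functions is an internal but commonly examined attribute, so treat this as a debugging sketch rather than a stable API:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.sin(x) * x
print(y.grad_fn)                 # MulBackward0
print(y.grad_fn.next_functions)  # ((SinBackward0, 0), (AccumulateGrad, 0))
```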
## Gradient Accumulation
PyTorch accumulates gradients into .grad fields. It does not overwrite them automatically.
```python
loss.backward(retain_graph=True)  # keep saved tensors so a second pass is possible
loss.backward()
```

Calling backward twice without clearing gradients adds the second gradient to the first. (The first call needs retain_graph=True because backward frees the graph's saved tensors by default.) Training loops normally reset gradients before each step:
```python
optimizer.zero_grad()
loss = model(x).sum()
loss.backward()
optimizer.step()
```

This accumulation behavior is deliberate. It supports minibatch accumulation and cases where a loss is assembled from several backward calls.
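One common use is minibatch gradient accumulation, where several small batches emulate one large batch. A sketch, with model, optimizer, and batches as placeholder names:

```python
# model, optimizer, and batches are assumed to be defined elsewhere
accum_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(batches):
    loss = model(batch).sum() / accum_steps  # scale so the accumulated sum
    loss.backward()                          # matches one large-batch gradient
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```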
## No-Grad and Inference Mode
PyTorch provides mechanisms to turn gradient tracking off.
```python
with torch.no_grad():
    y = model(x)
```

torch.no_grad() prevents graph construction. This reduces memory use and improves inference speed.
torch.inference_mode() is stricter and can provide additional performance benefits when tensors are used only for inference.
These modes are important because dynamic graph construction has overhead. Training needs the graph. Evaluation usually does not.
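A sketch of torch.inference_mode() used as a decorator:

```python
import torch

@torch.inference_mode()
def predict(model, x):
    # no graph is recorded; the outputs are inference tensors and
    # cannot later participate in autograd
    return model(x)
```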
## In-Place Operations
In-place mutation is a central source of complexity in PyTorch Autograd.
```python
x.add_(1.0)
```

An in-place operation modifies storage directly. This can conflict with backward computation if the previous value is needed by a gradient rule.
PyTorch uses version counters to detect many unsafe mutations. If a tensor saved for backward is modified in place before the backward pass, PyTorch usually raises an error.
The rule is practical: use in-place operations only when their effect on autograd is clear. Memory savings can be real, but incorrect mutation breaks the assumptions of reverse mode.
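A minimal sketch of the version-counter check firing. exp saves its output for backward (since d/dx exp(x) = exp(x)), so mutating that output in place invalidates the saved tensor:

```python
import torch

a = torch.ones(3, requires_grad=True)
b = a.exp()   # the backward rule reuses the saved output b
b.add_(1.0)   # in-place mutation bumps b's version counter
try:
    b.sum().backward()
except RuntimeError as err:
    print(err)  # "...modified by an inplace operation..."
```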
## Higher-Order Gradients
PyTorch can compute higher-order derivatives by constructing a graph for the backward computation.
```python
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]  # 3 * x**2
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]               # 6 * x
```

The key option is create_graph=True. It tells PyTorch to make the gradient computation itself differentiable.
Higher-order differentiation increases memory use and exposes more unsupported or numerically unstable operations. It is useful for meta-learning, implicit differentiation, physics-informed models, and optimization layers.
## backward vs autograd.grad
PyTorch has two common gradient APIs.
| API | Behavior |
|---|---|
| loss.backward() | Accumulates gradients into .grad fields |
| torch.autograd.grad() | Returns gradients directly, without necessarily filling .grad |
backward() is convenient for standard neural network training.
autograd.grad() is better when gradients are values inside a larger computation, such as Hessian-vector products, gradient penalties, implicit layers, and nested differentiation.
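For example, a Hessian-vector product can be written with two calls to autograd.grad, without either call touching any .grad field. A sketch:

```python
import torch

def hvp(f, x, v):
    # first grad builds a differentiable graph; second contracts it with v
    (g,) = torch.autograd.grad(f(x), x, create_graph=True)
    (hv,) = torch.autograd.grad(g, x, grad_outputs=v)
    return hv

x = torch.randn(3, requires_grad=True)
v = torch.randn(3)
print(hvp(lambda t: (t ** 3).sum(), x, v))  # diag(6x) @ v = 6*x*v here
```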
## Custom Autograd Functions
Users can define custom forward and backward rules with torch.autograd.Function.
```python
class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * 2 * x
```

Usage:

```python
y = Square.apply(x)
```

The ctx object stores values needed by the backward rule. Custom functions are useful for external kernels, fused operations, numerical stabilization, and nonstandard gradient definitions.
They also create a correctness boundary. PyTorch trusts the custom backward rule. A wrong rule gives wrong gradients without necessarily producing an error.
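One standard defense is torch.autograd.gradcheck, which compares a backward rule against finite differences in double precision:

```python
import torch

x = torch.randn(4, dtype=torch.double, requires_grad=True)
# returns True if Square's analytic gradient matches numerical differences
print(torch.autograd.gradcheck(Square.apply, (x,)))
```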
## Strengths
PyTorch Autograd is strong because it matches ordinary Python programming. Users can use native conditionals, loops, debugging tools, print statements, and object-oriented model definitions.
The dynamic graph model is also effective for research. Model structures can change between examples. Sequence lengths can vary. Control flow can depend on data.
PyTorch integrates autograd with GPU kernels, neural network modules, optimizers, distributed training, mixed precision, and compilation tools. Its AD system is not isolated; it sits inside a full machine learning runtime.
## Limitations
PyTorch Autograd differentiates PyTorch tensor operations, not arbitrary Python effects. Python list mutation, file I/O, control decisions, and calls into non-PyTorch libraries do not become differentiable merely because they occur during the forward pass.
Reverse mode stores intermediate values, so memory use can be high. Long unrolled computations, large activations, and higher-order gradients can require checkpointing or recomputation.
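torch.utils.checkpoint is the standard recomputation tool: it discards a segment's intermediates during forward and recomputes them during backward. A minimal sketch:

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(t):
    return torch.sin(t).exp()  # intermediates here are not kept

x = torch.randn(1000, requires_grad=True)
# trades extra compute for lower peak memory
y = checkpoint(block, x, use_reentrant=False).sum()
y.backward()
```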
Dynamic graph construction has runtime overhead. Compilation tools such as TorchScript, FX, TorchDynamo, and torch.compile reduce this overhead in many cases, but they introduce their own tracing and graph-capture semantics.
In-place operations, aliasing, views, and mutation require careful handling. These features make PyTorch ergonomic, but they complicate the gradient system.
## Historical Role
PyTorch Autograd made dynamic reverse-mode AD the dominant interactive model for deep learning research. TensorFlow originally emphasized static graphs. PyTorch showed that a define-by-run system could be flexible, debuggable, and fast enough for large-scale training.
Its historical importance lies in the combination of three ideas: tensor reverse mode, dynamic graph construction, and Python-native usability. This combination changed how many researchers wrote differentiable programs.