An automatic differentiation engine is the system that records numerical operations and computes derivatives from them. In PyTorch, this system is called autograd. It is responsible for building the backward graph during the forward pass and executing gradient computation when backward() is called.
Automatic differentiation is different from symbolic differentiation and numerical differentiation. Symbolic differentiation manipulates formulas. Numerical differentiation estimates derivatives using small perturbations. Automatic differentiation evaluates exact derivative rules for the operations actually performed by the program.
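The distinction can be made concrete. As a small sketch (using an arbitrary toy function f(x) = x ** 3), compare autograd's exact derivative with a finite-difference estimate:

```python
import torch

def f(x):
    return x ** 3

x = torch.tensor(2.0, requires_grad=True)

# Automatic differentiation applies the exact rule d/dx x**3 = 3 * x**2
y = f(x)
y.backward()

# Numerical differentiation estimates the same derivative by perturbation
eps = 1e-3
with torch.no_grad():
    approx = (f(x + eps) - f(x - eps)) / (2 * eps)

print(x.grad)   # tensor(12.)
print(approx)   # approximately 12, up to truncation and rounding error
```

The finite-difference estimate carries truncation and floating-point error; autograd evaluates the exact rule at machine precision.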
Why Automatic Differentiation Matters
Deep learning models are large compositions of simple operations. A model may contain matrix multiplications, convolutions, attention operations, nonlinearities, normalization layers, indexing operations, losses, and reductions.
Writing derivatives for the full model by hand would be slow and error-prone. Automatic differentiation solves this by requiring derivative rules only for primitive operations. The engine combines those local rules using the chain rule.
For example, a training step may look simple:
pred = model(X)
loss = loss_fn(pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

Behind this code, the engine records the operations used to produce loss, traverses them backward, computes vector-Jacobian products, and fills parameter gradients.
Autograd in PyTorch
PyTorch autograd works with tensors. A tensor participates in gradient tracking when
requires_grad=True

For example:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x
print(y)
print(y.requires_grad)
print(y.grad_fn)

The tensor y has a grad_fn because it was produced by operations involving x. That grad_fn is a node in the backward graph.
Calling
y.backward()
computes the derivative dy/dx = 2x + 3. At x = 2, this is 2 * 2 + 3 = 7:

y.backward()
print(x.grad) # tensor(7.)

Leaf Tensors and Non-Leaf Tensors
A leaf tensor is created directly by the user and is not the result of an operation tracked by autograd.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
y = w * x

Here x and w are leaf tensors. The tensor y is not a leaf tensor because it was produced by multiplication.
print(x.is_leaf) # True
print(w.is_leaf) # True
print(y.is_leaf) # False

Gradients are stored by default only on leaf tensors:
y.backward()
print(x.grad) # tensor(3.)
print(w.grad) # tensor(2.)
print(y.grad) # usually None

Intermediate tensors are still used in the backward pass. Their gradients are usually discarded unless explicitly retained.
x = torch.tensor(2.0, requires_grad=True)
a = x ** 2
a.retain_grad()
y = 3 * a
y.backward()
print(a.grad) # tensor(3.)

The Dynamic Graph
PyTorch builds the computation graph dynamically. The graph is created as Python code executes.
x = torch.tensor(2.0, requires_grad=True)
if x.item() > 0:
y = x ** 2
else:
y = -x
y.backward()
print(x.grad) # tensor(4.)

Only the branch that actually runs is recorded. This makes PyTorch natural for models with loops, conditionals, variable-length inputs, and complex control flow.
Each forward pass builds a fresh graph. This is why ordinary training loops can reuse the same model code repeatedly:
for X, y in dataloader:
pred = model(X)
loss = loss_fn(pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

Each iteration constructs a new graph for that batch, then frees it after backward.
Backward Graph Nodes
Every differentiable operation has a backward rule. In PyTorch, the output tensor stores a reference to the function that created it.
x = torch.tensor(2.0, requires_grad=True)
a = x + 1
b = a ** 2
c = b.mean()
print(a.grad_fn)
print(b.grad_fn)
print(c.grad_fn)

You may see names such as:
<AddBackward0>
<PowBackward0>
<MeanBackward0>

These names indicate the backward operations that will be used during gradient computation.
The graph points backward from outputs to inputs. Starting at loss, autograd follows grad_fn links and applies the corresponding backward rules.
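These links can be inspected directly. The sketch below walks the chain for the example above by following each node's first predecessor in next_functions; this introspection attribute is intended for debugging, not for ordinary training code:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
a = x + 1
b = a ** 2
c = b.mean()

# Follow the first predecessor of each backward node, from output to leaf
names = []
node = c.grad_fn
while node is not None:
    names.append(type(node).__name__)
    # next_functions holds (node, input_index) pairs for the predecessors
    node = node.next_functions[0][0] if node.next_functions else None

print(names)  # ends in AccumulateGrad, the node that writes x.grad
```

The final AccumulateGrad node is where the computed gradient is deposited into the leaf tensor's .grad field.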
Saved Tensors
Backward rules often need values from the forward pass.
For example, for y = x ** 2, the backward rule needs the original input x, because dy/dx = 2x.
For ReLU, the backward rule needs to know which inputs were positive. For softmax and cross-entropy, the backward rule needs intermediate probabilities or logits. For attention, backward needs several intermediate tensors related to queries, keys, values, and attention probabilities.
The autograd engine stores such values as saved tensors. This is why training consumes more memory than inference.
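The saved values can even be inspected through grad_fn. Attributes such as _saved_self are internal and may change between PyTorch versions, so this sketch is illustration only:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

# PowBackward0 kept a reference to the input it needs for its backward rule
# (_saved_self is an internal, version-dependent attribute)
print(y.grad_fn._saved_self)  # tensor(3., requires_grad=True)
```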
A simplified view:
forward pass:
compute outputs
save needed tensors
backward pass:
load saved tensors
compute gradients
release graph

When memory is limited, techniques such as gradient checkpointing store fewer tensors and recompute them during backward.
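Checkpointing can be sketched with torch.utils.checkpoint on a toy block (the layer sizes here are arbitrary; use_reentrant=False is the mode recommended in recent PyTorch releases):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
x = torch.randn(4, 16)

# checkpoint saves only the block's input; the intermediate activations
# are recomputed during backward instead of being stored
out = checkpoint(block, x, use_reentrant=False)
loss = out.sum()
loss.backward()

print(block[0].weight.grad.shape)  # torch.Size([16, 16])
```

Gradients arrive exactly as in normal training; only the memory-versus-compute trade-off changes.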
Gradient Accumulation Semantics
PyTorch accumulates gradients into the .grad field.
x = torch.tensor(2.0, requires_grad=True)
y1 = x ** 2
y1.backward()
print(x.grad) # tensor(4.)
y2 = 3 * x
y2.backward()
print(x.grad) # tensor(7.)

The second backward call adds to the existing gradient.
This behavior supports cases where one wants to accumulate gradients across multiple losses or microbatches. In ordinary training, clear gradients before the backward pass:
optimizer.zero_grad()
loss.backward()
optimizer.step()

A common memory-efficient pattern is:
optimizer.zero_grad()
for microbatch in microbatches:
loss = compute_loss(microbatch)
loss = loss / len(microbatches)
loss.backward()
optimizer.step()

Here gradients are intentionally accumulated before one optimizer step.
Disabling Gradient Tracking
Not every tensor operation needs gradients. Evaluation, metric computation, logging, and target generation often should not build a graph.
Use torch.no_grad() when gradients are unnecessary:
model.eval()
with torch.no_grad():
pred = model(X)

Use torch.inference_mode() for inference-only code:
model.eval()
with torch.inference_mode():
pred = model(X)

Both reduce memory use by preventing graph construction. inference_mode() can be faster because it disables additional autograd bookkeeping, but it is less flexible if tensors later need to re-enter gradient-tracked computation.
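The difference shows up when a tensor produced under each mode is later fed back into gradient-tracked computation. A minimal sketch:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

with torch.no_grad():
    a = w * 2            # no graph is built, but a is an ordinary tensor

with torch.inference_mode():
    b = w * 2            # b is an inference tensor

loss = a * w             # fine: a simply acts as a constant
loss.backward()
print(w.grad)            # tensor(6.)

try:
    (b * w).backward()   # inference tensors cannot re-enter autograd
except RuntimeError as err:
    print("rejected:", err)
```

The no_grad result behaves like any constant tensor; the inference_mode result is permanently marked and rejected by autograd.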
Detaching Tensors
The method .detach() returns a tensor that shares data with the original tensor but has no gradient history.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
z = y.detach()
print(y.requires_grad) # True
print(z.requires_grad) # False

This stops gradient flow through z.
Example:
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
z = y.detach()
loss = 3 * z
print(loss.requires_grad) # False

Detaching is common in reinforcement learning, teacher-student training, contrastive learning, and target-network methods.
A frequent mistake is detaching accidentally:
h = encoder(x)
h = h.detach()
out = decoder(h)
loss = loss_fn(out, target)
loss.backward()

In this code, gradients cannot reach encoder.
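One way to catch such a break, sketched here with a hypothetical toy encoder and decoder, is to check which parameters received gradients after backward:

```python
import torch
import torch.nn as nn

encoder = nn.Linear(4, 4)
decoder = nn.Linear(4, 2)

x = torch.randn(3, 4)
h = encoder(x).detach()           # accidental detach severs the graph here
out = decoder(h)
out.sum().backward()

print(encoder.weight.grad)        # None: no gradient ever reaches the encoder
print(decoder.weight.grad.shape)  # torch.Size([2, 4])
```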
backward() and autograd.grad
PyTorch provides two common ways to compute gradients.
The first is backward(), which accumulates gradients into .grad fields:
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
y.backward()
print(x.grad) # tensor(12.)

The second is torch.autograd.grad, which returns gradients directly:
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
grad_x = torch.autograd.grad(y, x)[0]
print(grad_x) # tensor(12.)
print(x.grad) # None

Use backward() in ordinary training loops. Use autograd.grad when implementing algorithms that need explicit gradient values without accumulating into parameters, such as meta-learning, gradient penalties, implicit differentiation, or Hessian-vector products.
Higher-Order Derivatives
Autograd can compute derivatives of derivatives if asked to build a graph for the gradient computation.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print(dy_dx) # tensor(12., grad_fn=<MulBackward0>)
print(d2y_dx2) # tensor(12.)

The argument
create_graph=True
tells PyTorch to record the operations used to compute the first derivative.
Higher-order derivatives are more expensive than ordinary gradients. They also retain more graph structure in memory. Use them only when the algorithm requires them.
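One algorithm that genuinely needs them is a Hessian-vector product, which combines autograd.grad with create_graph. A sketch for f(x) = sum(x ** 3), whose Hessian is diag(6 * x):

```python
import torch

# f(x) = sum(x ** 3); gradient is 3 * x ** 2, Hessian is diag(6 * x)
x = torch.tensor([1.0, 2.0], requires_grad=True)
v = torch.tensor([1.0, 1.0])

y = (x ** 3).sum()
(g,) = torch.autograd.grad(y, x, create_graph=True)

# Differentiating g . v with respect to x gives H @ v without ever forming H
(hv,) = torch.autograd.grad(g @ v, x)
print(hv)  # tensor([ 6., 12.])
```

This costs roughly one extra backward pass, far cheaper than materializing the full Hessian.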
Custom Autograd Functions
Most models can be built from existing PyTorch operations. Sometimes one needs a custom operation with a custom backward rule.
PyTorch supports this with torch.autograd.Function.
import torch
class Square(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
ctx.save_for_backward(x)
return x ** 2
@staticmethod
def backward(ctx, grad_output):
(x,) = ctx.saved_tensors
grad_x = grad_output * 2 * x
return grad_x
x = torch.tensor(3.0, requires_grad=True)
y = Square.apply(x)
y.backward()
print(x.grad) # tensor(6.)

The forward method computes the output. The backward method receives the upstream gradient and returns gradients for each input.
Custom functions are useful for special kernels, memory-efficient operations, numerical tricks, or research prototypes. They should be tested carefully with gradient checking.
Gradient Checking for Custom Functions
When writing a custom backward rule, compare it with numerical finite differences.
PyTorch provides gradcheck, which works best with double precision:
import torch
from torch.autograd import gradcheck
class Square(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
ctx.save_for_backward(x)
return x ** 2
@staticmethod
def backward(ctx, grad_output):
(x,) = ctx.saved_tensors
return grad_output * 2 * x
x = torch.randn(3, dtype=torch.double, requires_grad=True)
test = gradcheck(Square.apply, (x,))
print(test)

A result of True means the custom backward rule agrees with finite-difference estimates within tolerance.
In-Place Operations
In-place operations modify a tensor directly. Examples include:
x.add_(1)
x.relu_()

The trailing underscore conventionally marks an in-place PyTorch operation.
In-place operations can save memory, but they may break autograd if they overwrite values needed for backward.
x = torch.randn(4, requires_grad=True)
y = x ** 2
x.add_(1.0)
loss = y.sum()
loss.backward()

This may fail because the backward rule for x ** 2 needs the original x.
Use out-of-place operations unless memory pressure requires in-place modification and the operation is known to be safe:
x2 = x + 1.0

Views, Copies, and Contiguity
Some tensor operations return views. A view shares storage with the original tensor but has different shape or strides.
x = torch.arange(6.0, requires_grad=True)
y = x.view(2, 3)
print(y)

Operations such as view, reshape, transpose, permute, squeeze, and unsqueeze may affect layout. Autograd tracks these operations, but in-place modifications on views can be subtle.
A tensor is contiguous when its elements are stored in memory in the order expected by its shape. After permute, a tensor may be non-contiguous:
x = torch.randn(2, 3, 4)
y = x.permute(0, 2, 1)
print(y.is_contiguous()) # False

Some operations require contiguous memory. Calling .contiguous() creates a contiguous copy:
z = y.contiguous()

This matters for performance and for some low-level operations.
Autograd and Modules
nn.Module parameters are tensors wrapped as nn.Parameter. By default, they require gradients.
import torch.nn as nn
layer = nn.Linear(3, 4)
print(layer.weight.requires_grad) # True
print(layer.bias.requires_grad) # True

After a backward pass:
X = torch.randn(8, 3)
Y = layer(X)
loss = Y.sum()
loss.backward()
print(layer.weight.grad.shape)
print(layer.bias.grad.shape)

Optimizers read these .grad fields and update the parameter values.
To freeze a layer:
for param in layer.parameters():
param.requires_grad = False

Frozen parameters do not receive gradients and are not updated if excluded from the optimizer.
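A common companion pattern, sketched here with SGD and an arbitrary two-layer model, is to hand the optimizer only the parameters that still require gradients:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first linear layer
for param in model[0].parameters():
    param.requires_grad = False

# Build the optimizer from the trainable parameters only
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

out = model(torch.randn(3, 4))
out.sum().backward()

print(model[0].weight.grad)        # None: frozen, receives no gradient
print(model[2].weight.grad.shape)  # torch.Size([2, 8])
```

Filtering also keeps optimizer state (momentum buffers, etc.) from being allocated for parameters that will never change.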
Common Autograd Errors
A common error is calling backward() on a tensor with no graph:
x = torch.tensor(2.0)
y = x ** 2
y.backward()

The fix:
x = torch.tensor(2.0, requires_grad=True)

Another error is calling backward twice through the same graph:
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
y.backward()
y.backward()

The graph is freed after the first backward call. To reuse it, specify:
y.backward(retain_graph=True)

This is rarely needed in ordinary training.
Another common issue is that .grad is None. This usually means the tensor did not contribute to the loss, does not require gradients, or is not a leaf tensor whose gradient is retained.
for name, param in model.named_parameters():
if param.grad is None:
print(name, "has no gradient")

Mental Model
A useful mental model of autograd is:
forward:
run tensor operations
build graph
save needed values
backward:
start from scalar loss
traverse graph backward
apply local derivative rules
accumulate gradients into leaves

The user writes ordinary tensor code. The engine records how tensors depend on one another. The chain rule supplies the mathematics. The optimizer uses the computed gradients.
Summary
An automatic differentiation engine computes derivatives by recording operations and applying local derivative rules through the chain rule. PyTorch autograd builds dynamic computational graphs during execution. Tensors with requires_grad=True participate in this graph.
Calling backward() starts reverse-mode differentiation from a scalar loss and accumulates gradients into leaf tensors. Tools such as no_grad, inference_mode, detach, autograd.grad, custom Function, and gradcheck give finer control over gradient computation.
Understanding autograd makes PyTorch training behavior predictable. It explains why gradients accumulate, why graphs consume memory, why in-place operations can fail, and how model parameters receive updates.