Automatic Differentiation Engines

An automatic differentiation engine is the system that records numerical operations and computes derivatives from them. In PyTorch, this system is called autograd. It is responsible for building the backward graph during the forward pass and executing gradient computation when backward() is called.

Automatic differentiation is different from symbolic differentiation and numerical differentiation. Symbolic differentiation manipulates formulas. Numerical differentiation estimates derivatives using small perturbations. Automatic differentiation evaluates exact derivative rules for the operations actually performed by the program.
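
To make the contrast concrete, here is a minimal sketch (the function and step size are illustrative choices) comparing a finite-difference estimate with the exact value autograd computes:

import torch

def f(t):
    return t ** 2 + 3 * t

# Numerical differentiation: central finite difference, an approximation
x0, h = 2.0, 1e-5
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)

# Automatic differentiation: exact derivative of the recorded operations
x = torch.tensor(x0, requires_grad=True)
f(x).backward()

print(numeric)  # approximately 7.0
print(x.grad)   # tensor(7.)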

Why Automatic Differentiation Matters

Deep learning models are large compositions of simple operations. A model may contain matrix multiplications, convolutions, attention operations, nonlinearities, normalization layers, indexing operations, losses, and reductions.

Writing derivatives for the full model by hand would be slow and error-prone. Automatic differentiation solves this by requiring derivative rules only for primitive operations. The engine combines those local rules using the chain rule.

For example, a training step may look simple:

pred = model(X)
loss = loss_fn(pred, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()

Behind this code, the engine records the operations used to produce loss, traverses them backward, computes vector-Jacobian products, and accumulates the results into the parameters' .grad fields.

Autograd in PyTorch

PyTorch autograd works with tensors. A tensor participates in gradient tracking when requires_grad=True.

For example:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x

print(y)
print(y.requires_grad)
print(y.grad_fn)

The tensor y has a grad_fn because it was produced by operations involving x. That grad_fn is a node in the backward graph.

Calling y.backward() computes

\frac{dy}{dx} = 2x + 3.

At x = 2, this is 7:

y.backward()
print(x.grad)  # tensor(7.)

Leaf Tensors and Non-Leaf Tensors

A leaf tensor is created directly by the user and is not the result of an operation tracked by autograd.

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

y = w * x

Here x and w are leaf tensors. The tensor y is not a leaf tensor because it was produced by multiplication.

print(x.is_leaf)  # True
print(w.is_leaf)  # True
print(y.is_leaf)  # False

Gradients are stored by default only on leaf tensors:

y.backward()

print(x.grad)  # tensor(3.)
print(w.grad)  # tensor(2.)
print(y.grad)  # None (non-leaf gradients are not retained by default)

Intermediate tensors are still used in the backward pass, but their gradients are discarded unless explicitly retained with retain_grad():

x = torch.tensor(2.0, requires_grad=True)
a = x ** 2
a.retain_grad()

y = 3 * a
y.backward()

print(a.grad)  # tensor(3.)

The Dynamic Graph

PyTorch builds the computation graph dynamically. The graph is created as Python code executes.

x = torch.tensor(2.0, requires_grad=True)

if x.item() > 0:
    y = x ** 2
else:
    y = -x

y.backward()
print(x.grad)  # tensor(4.)

Only the branch that actually runs is recorded. This makes PyTorch natural for models with loops, conditionals, variable-length inputs, and complex control flow.
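
The same applies to loops: each iteration adds nodes to the graph as it executes. A minimal sketch:

import torch

x = torch.tensor(1.5, requires_grad=True)

y = x
for _ in range(3):
    y = y * x  # the loop is unrolled into the graph; y ends up as x ** 4

y.backward()
print(x.grad)  # tensor(13.5), i.e. 4 * x**3 at x = 1.5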

Each forward pass builds a fresh graph. This is why ordinary training loops can reuse the same model code repeatedly:

for X, y in dataloader:
    pred = model(X)
    loss = loss_fn(pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Each iteration constructs a new graph for that batch, then frees it after backward.

Backward Graph Nodes

Every differentiable operation has a backward rule. In PyTorch, the output tensor stores a reference to the function that created it.

x = torch.tensor(2.0, requires_grad=True)

a = x + 1
b = a ** 2
c = b.mean()

print(a.grad_fn)
print(b.grad_fn)
print(c.grad_fn)

You may see names such as:

<AddBackward0>
<PowBackward0>
<MeanBackward0>

These names indicate the backward operations that will be used during gradient computation.

The graph points backward from outputs to inputs. Starting at loss, autograd follows grad_fn links and applies the corresponding backward rules.
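
You can inspect this structure directly. Each backward node exposes a next_functions attribute pointing at its predecessors; a minimal sketch that walks a simple chain:

import torch

x = torch.tensor(2.0, requires_grad=True)
c = ((x + 1) ** 2).mean()

node = c.grad_fn
while node is not None:
    print(type(node).__name__)  # MeanBackward0, PowBackward0, AddBackward0, AccumulateGrad
    node = node.next_functions[0][0] if node.next_functions else None

The final AccumulateGrad node is where the gradient is written into x.grad.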

Saved Tensors

Backward rules often need values from the forward pass.

For example, for

y = x^2,

the backward rule needs x, because

\frac{dy}{dx} = 2x.

For ReLU, the backward rule needs to know which inputs were positive. For softmax and cross-entropy, the backward rule needs intermediate probabilities or logits. For attention, backward needs several intermediate tensors related to queries, keys, values, and attention probabilities.

The autograd engine stores such values as saved tensors. This is why training consumes more memory than inference.
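
Recent PyTorch versions expose saved tensors through attributes such as _saved_self on the backward node. These are internal names and may change between releases, but they illustrate the mechanism:

import torch

x = torch.randn(5, requires_grad=True)
y = x ** 2

# The pow backward node keeps a reference to x in order to compute 2 * x.
print(x.equal(y.grad_fn._saved_self))  # True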

A simplified view:

forward pass:
    compute outputs
    save needed tensors

backward pass:
    load saved tensors
    compute gradients
    release graph

When memory is limited, techniques such as gradient checkpointing store fewer tensors and recompute them during backward.
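
A minimal sketch using torch.utils.checkpoint (the block and sizes are arbitrary choices). Activations inside the checkpointed block are recomputed during backward instead of being stored:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
x = torch.randn(32, 128, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # saves x, not the intermediates
loss = y.sum()
loss.backward()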

Gradient Accumulation Semantics

PyTorch accumulates gradients into the .grad field.

x = torch.tensor(2.0, requires_grad=True)

y1 = x ** 2
y1.backward()

print(x.grad)  # tensor(4.)

y2 = 3 * x
y2.backward()

print(x.grad)  # tensor(7.)

The second backward call adds 3 to the existing gradient.

This behavior supports cases where one wants to accumulate gradients across multiple losses or microbatches. In ordinary training, clear gradients before the backward pass:

optimizer.zero_grad()
loss.backward()
optimizer.step()

A common memory-efficient pattern is:

optimizer.zero_grad()

for microbatch in microbatches:
    loss = compute_loss(microbatch)
    loss = loss / len(microbatches)
    loss.backward()

optimizer.step()

Here gradients are intentionally accumulated before one optimizer step.

Disabling Gradient Tracking

Not every tensor operation needs gradients. Evaluation, metric computation, logging, and target generation often should not build a graph.

Use torch.no_grad() when gradients are unnecessary:

model.eval()

with torch.no_grad():
    pred = model(X)

Use torch.inference_mode() for inference-only code:

model.eval()

with torch.inference_mode():
    pred = model(X)

Both reduce memory use by preventing graph construction. inference_mode() can be faster because it disables additional autograd bookkeeping, but it is less flexible if tensors later need to re-enter gradient-tracked computation.
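
For example, tensors created under inference_mode() cannot later be saved for backward; cloning them first is the documented workaround. A small sketch:

import torch

with torch.inference_mode():
    t = torch.ones(3)

w = torch.ones(3, requires_grad=True)

# (w * t).sum()  # raises a RuntimeError: inference tensors cannot be saved for backward
y = (w * t.clone()).sum()  # a clone is a normal tensor again
y.backward()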

Detaching Tensors

The method .detach() returns a tensor that shares data with the original tensor but has no gradient history.

x = torch.tensor(2.0, requires_grad=True)

y = x ** 2
z = y.detach()

print(y.requires_grad)  # True
print(z.requires_grad)  # False

This stops gradient flow through z.

Example:

x = torch.tensor(2.0, requires_grad=True)

y = x ** 2
z = y.detach()
loss = 3 * z

print(loss.requires_grad)  # False

Detaching is common in reinforcement learning, teacher-student training, contrastive learning, and target-network methods.

A frequent mistake is detaching accidentally:

h = encoder(x)
h = h.detach()
out = decoder(h)
loss = loss_fn(out, target)
loss.backward()

In this code, gradients cannot reach encoder.

backward() and autograd.grad

PyTorch provides two common ways to compute gradients.

The first is backward(), which accumulates gradients into .grad fields:

x = torch.tensor(2.0, requires_grad=True)

y = x ** 3
y.backward()

print(x.grad)  # tensor(12.)

The second is torch.autograd.grad, which returns gradients directly:

x = torch.tensor(2.0, requires_grad=True)

y = x ** 3
grad_x = torch.autograd.grad(y, x)[0]

print(grad_x)  # tensor(12.)
print(x.grad)  # None

Use backward() in ordinary training loops. Use autograd.grad when implementing algorithms that need explicit gradient values without accumulating into parameters, such as meta-learning, gradient penalties, implicit differentiation, or Hessian-vector products.

Higher-Order Derivatives

Autograd can compute derivatives of derivatives if asked to build a graph for the gradient computation.

x = torch.tensor(2.0, requires_grad=True)

y = x ** 3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]

d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]

print(dy_dx)    # tensor(12., grad_fn=<MulBackward0>)
print(d2y_dx2)  # tensor(12.)

The argument create_graph=True tells PyTorch to record the operations used to compute the first derivative.

Higher-order derivatives are more expensive than ordinary gradients. They also retain more graph structure in memory. Use them only when the algorithm requires them.
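
As an example of a use that justifies the cost, a Hessian-vector product can be computed with two grad calls without ever forming the Hessian. A minimal sketch:

import torch

x = torch.randn(3, requires_grad=True)
v = torch.randn(3)

y = (x ** 3).sum()
g = torch.autograd.grad(y, x, create_graph=True)[0]  # gradient: 3 * x**2

# Differentiating g with grad_outputs=v yields H @ v
hvp = torch.autograd.grad(g, x, grad_outputs=v)[0]
print(hvp)  # equals 6 * x * v elementwise, since the Hessian is diag(6 * x)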

Custom Autograd Functions

Most models can be built from existing PyTorch operations. Sometimes one needs a custom operation with a custom backward rule.

PyTorch supports this with torch.autograd.Function.

import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_x = grad_output * 2 * x
        return grad_x

x = torch.tensor(3.0, requires_grad=True)

y = Square.apply(x)
y.backward()

print(x.grad)  # tensor(6.)

The forward method computes the output. The backward method receives the upstream gradient and returns gradients for each input.

Custom functions are useful for special kernels, memory-efficient operations, numerical tricks, or research prototypes. They should be tested carefully with gradient checking.

Gradient Checking for Custom Functions

When writing a custom backward rule, compare it with numerical finite differences.

PyTorch provides gradcheck, which works best with double precision:

import torch
from torch.autograd import gradcheck

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x

x = torch.randn(3, dtype=torch.double, requires_grad=True)

test = gradcheck(Square.apply, (x,))
print(test)

A result of True means the custom backward rule agrees with finite-difference estimates within tolerance.

In-Place Operations

In-place operations modify a tensor directly. Examples include:

x.add_(1)
x.relu_()

A trailing underscore in the method name indicates an in-place operation in PyTorch.

In-place operations can save memory, but they may break autograd if they overwrite values needed for backward.

x = torch.randn(4, requires_grad=True)

y = x * 2
z = y ** 2
y.add_(1.0)  # modifies y in place after it was saved for backward

loss = z.sum()
loss.backward()

This fails because the backward rule for y ** 2 needs the original y, and autograd detects at backward time that it was modified in place. (In-place operations directly on leaf tensors that require gradients are rejected even earlier, at the point of the operation.)

Use out-of-place operations unless memory pressure requires in-place modification and the operation is known to be safe:

y2 = y + 1.0

Views, Copies, and Contiguity

Some tensor operations return views. A view shares storage with the original tensor but has different shape or strides.

x = torch.arange(6.0, requires_grad=True)
y = x.view(2, 3)

print(y)

Operations such as view, reshape, transpose, permute, squeeze, and unsqueeze may affect layout. Autograd tracks these operations, but in-place modifications on views can be subtle.

A tensor is contiguous when its elements are stored in memory in the order expected by its shape. After permute, a tensor may be non-contiguous:

x = torch.randn(2, 3, 4)
y = x.permute(0, 2, 1)

print(y.is_contiguous())  # False

Some operations require contiguous memory. Calling .contiguous() creates a contiguous copy:

z = y.contiguous()

This matters for performance and for some low-level operations.
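
For example, view requires contiguous memory, while reshape copies when necessary. A small sketch:

import torch

x = torch.randn(2, 3, 4)
y = x.permute(0, 2, 1)  # non-contiguous view, shape (2, 4, 3)

# y.view(6, 4)  # raises a RuntimeError: incompatible with these strides
z = y.contiguous().view(6, 4)  # copy first, then view
r = y.reshape(6, 4)            # reshape copies automatically when needed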

Autograd and Modules

nn.Module parameters are tensors wrapped as nn.Parameter. By default, they require gradients.

import torch.nn as nn

layer = nn.Linear(3, 4)

print(layer.weight.requires_grad)  # True
print(layer.bias.requires_grad)    # True

After a backward pass:

X = torch.randn(8, 3)
Y = layer(X)
loss = Y.sum()

loss.backward()

print(layer.weight.grad.shape)
print(layer.bias.grad.shape)

Optimizers read these .grad fields and update the parameter values.

To freeze a layer:

for param in layer.parameters():
    param.requires_grad = False

Frozen parameters receive no gradients during backward; optimizers skip parameters whose .grad is None, and it is common to exclude frozen parameters from the optimizer entirely.
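
A common fine-tuning pattern, sketched here with an arbitrary model, freezes early layers and passes only the trainable parameters to the optimizer:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 2))

for param in model[0].parameters():  # freeze the first layer
    param.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1)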

Common Autograd Errors

A common error is calling backward() on a tensor with no graph:

x = torch.tensor(2.0)
y = x ** 2
y.backward()

The fix:

x = torch.tensor(2.0, requires_grad=True)

Another error is calling backward twice through the same graph:

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

y.backward()
y.backward()

The graph is freed after the first backward call. To reuse it, specify:

y.backward(retain_graph=True)

This is rarely needed in ordinary training.
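
One case where it is legitimate is two backward passes through a shared subgraph. A minimal sketch:

import torch

x = torch.tensor(2.0, requires_grad=True)
h = x ** 2  # shared intermediate

loss1 = 3 * h
loss2 = 5 * h

loss1.backward(retain_graph=True)  # keep the shared graph alive
loss2.backward()

print(x.grad)  # tensor(32.): 6 * x + 10 * x at x = 2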

Another common issue is a .grad that is None. This usually means the tensor did not contribute to the loss, does not require gradients, or is a non-leaf tensor whose gradient was not retained.

for name, param in model.named_parameters():
    if param.grad is None:
        print(name, "has no gradient")

Mental Model

A useful mental model of autograd is:

forward:
    run tensor operations
    build graph
    save needed values

backward:
    start from scalar loss
    traverse graph backward
    apply local derivative rules
    accumulate gradients into leaves

The user writes ordinary tensor code. The engine records how tensors depend on one another. The chain rule supplies the mathematics. The optimizer uses the computed gradients.

Summary

An automatic differentiation engine computes derivatives by recording operations and applying local derivative rules through the chain rule. PyTorch autograd builds dynamic computational graphs during execution. Tensors with requires_grad=True participate in this graph.

Calling backward() starts reverse-mode differentiation from a scalar loss and accumulates gradients into leaf tensors. Tools such as no_grad, inference_mode, detach, autograd.grad, custom Function, and gradcheck give finer control over gradient computation.

Understanding autograd makes PyTorch training behavior predictable. It explains why gradients accumulate, why graphs consume memory, why in-place operations can fail, and how model parameters receive updates.