# Automatic Differentiation Engines

An automatic differentiation engine is the system that records numerical operations and computes derivatives from them. In PyTorch, this system is called autograd. It is responsible for building the backward graph during the forward pass and executing gradient computation when `backward()` is called.

Automatic differentiation is different from symbolic differentiation and numerical differentiation. Symbolic differentiation manipulates formulas. Numerical differentiation estimates derivatives using small perturbations. Automatic differentiation evaluates exact derivative rules for the operations actually performed by the program.
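The difference can be seen directly in code. The sketch below (the function `f` is just an illustration) compares autograd, which applies exact derivative rules for the operations performed, with a central finite-difference estimate, which carries truncation and rounding error:

```python id="n1fd7q"
import torch

def f(t):
    return t ** 2 + 3 * t

# Automatic differentiation: exact rules for the ops actually used.
x = torch.tensor(2.0, requires_grad=True)
f(x).backward()

# Numerical differentiation: central finite-difference estimate.
eps = 1e-4
numeric = (f(torch.tensor(2.0 + eps)) - f(torch.tensor(2.0 - eps))) / (2 * eps)

print(x.grad)   # tensor(7.)
print(numeric)  # approximately 7
```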

### Why Automatic Differentiation Matters

Deep learning models are large compositions of simple operations. A model may contain matrix multiplications, convolutions, attention operations, nonlinearities, normalization layers, indexing operations, losses, and reductions.

Writing derivatives for the full model by hand would be slow and error-prone. Automatic differentiation solves this by requiring derivative rules only for primitive operations. The engine combines those local rules using the chain rule.

For example, a training step may look simple:

```python id="ql4don"
pred = model(X)
loss = loss_fn(pred, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Behind this code, the engine records the operations used to produce `loss`, traverses them backward, computes vector-Jacobian products, and fills parameter gradients.
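The phrase "vector-Jacobian product" can be made concrete. When `backward()` is called on a non-scalar tensor, it requires a vector `v` and computes \(v^\top J\) rather than materializing the full Jacobian. A small sketch:

```python id="vjp3k2"
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # the Jacobian of y with respect to x is diag(2x)

# backward() on a non-scalar output needs a vector v and
# computes the vector-Jacobian product v^T J.
v = torch.tensor([1.0, 1.0, 1.0])
y.backward(v)

print(x.grad)  # tensor([2., 4., 6.])
```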

### Autograd in PyTorch

PyTorch autograd works with tensors. A tensor participates in gradient tracking when it is created with

```python id="3tfj28"
requires_grad=True
```

For example:

```python id="5f2nkf"
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x

print(y)
print(y.requires_grad)
print(y.grad_fn)
```

The tensor `y` has a `grad_fn` because it was produced by operations involving `x`. That `grad_fn` is a node in the backward graph.

Calling

```python id="3tsjuu"
y.backward()
```

computes

$$
\frac{dy}{dx}=2x+3.
$$

At \(x=2\), this is \(7\):

```python id="qe7lom"
y.backward()
print(x.grad)  # tensor(7.)
```

### Leaf Tensors and Non-Leaf Tensors

A leaf tensor is created directly by the user and is not the result of an operation tracked by autograd.

```python id="ttlt5h"
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

y = w * x
```

Here `x` and `w` are leaf tensors. The tensor `y` is not a leaf tensor because it was produced by multiplication.

```python id="53mhvu"
print(x.is_leaf)  # True
print(w.is_leaf)  # True
print(y.is_leaf)  # False
```

Gradients are stored by default only on leaf tensors:

```python id="0m6li6"
y.backward()

print(x.grad)  # tensor(3.)
print(w.grad)  # tensor(2.)
print(y.grad)  # usually None
```

Intermediate tensors are still used in the backward pass. Their gradients are usually discarded unless explicitly retained.

```python id="1aapgv"
x = torch.tensor(2.0, requires_grad=True)
a = x ** 2
a.retain_grad()

y = 3 * a
y.backward()

print(a.grad)  # tensor(3.)
```

### The Dynamic Graph

PyTorch builds the computation graph dynamically. The graph is created as Python code executes.

```python id="ug0kec"
x = torch.tensor(2.0, requires_grad=True)

if x.item() > 0:
    y = x ** 2
else:
    y = -x

y.backward()
print(x.grad)  # tensor(4.)
```

Only the branch that actually runs is recorded. This makes PyTorch natural for models with loops, conditionals, variable-length inputs, and complex control flow.
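Loops behave the same way: the graph records exactly the operations that ran, so its depth can depend on runtime data. A sketch:

```python id="lq8dyn"
import torch

x = torch.tensor(2.0, requires_grad=True)

# The graph depth depends on how many iterations actually run.
n = 3
y = x
for _ in range(n):
    y = y * x  # after the loop, y = x ** (n + 1)

y.backward()
print(x.grad)  # dy/dx = (n + 1) * x ** n = 4 * 8 = tensor(32.)
```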

Each forward pass builds a fresh graph. This is why ordinary training loops can reuse the same model code repeatedly:

```python id="1eg7q3"
for X, y in dataloader:
    pred = model(X)
    loss = loss_fn(pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Each iteration constructs a new graph for that batch, then frees it after backward.

### Backward Graph Nodes

Every differentiable operation has a backward rule. In PyTorch, the output tensor stores a reference to the function that created it.

```python id="0506ao"
x = torch.tensor(2.0, requires_grad=True)

a = x + 1
b = a ** 2
c = b.mean()

print(a.grad_fn)
print(b.grad_fn)
print(c.grad_fn)
```

You may see names such as:

```text id="0w0ht6"
<AddBackward0>
<PowBackward0>
<MeanBackward0>
```

These names indicate the backward operations that will be used during gradient computation.

The graph points backward from outputs to inputs. Starting at `loss`, autograd follows `grad_fn` links and applies the corresponding backward rules.
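These links can be inspected directly. Each `grad_fn` exposes a `next_functions` attribute pointing to the nodes that feed it; following the chain from an output eventually reaches an `AccumulateGrad` node for each leaf. A minimal traversal sketch:

```python id="gf9nx2"
import torch

x = torch.tensor(2.0, requires_grad=True)
y = (x + 1) ** 2

# Walk the backward graph from the output toward the leaf.
names = []
node = y.grad_fn
while node is not None:
    names.append(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None

print(names)  # e.g. ['PowBackward0', 'AddBackward0', 'AccumulateGrad']
```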

### Saved Tensors

Backward rules often need values from the forward pass.

For example, for

$$
y=x^2,
$$

the backward rule needs \(x\), because

$$
\frac{dy}{dx}=2x.
$$

For ReLU, the backward rule needs to know which inputs were positive. For softmax and cross-entropy, the backward rule needs intermediate probabilities or logits. For attention, backward needs several intermediate tensors related to queries, keys, values, and attention probabilities.

The autograd engine stores such values as saved tensors. This is why training consumes more memory than inference.
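What gets saved can be observed with `torch.autograd.graph.saved_tensors_hooks` (available in recent PyTorch versions), which intercepts every tensor packed for the backward pass. A sketch:

```python id="sv7hk3"
import torch

saved_shapes = []

def pack(t):
    # Called once for every tensor autograd saves for backward.
    saved_shapes.append(tuple(t.shape))
    return t

def unpack(t):
    return t

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    x = torch.randn(4, requires_grad=True)
    y = (x ** 2).sum()

print(saved_shapes)  # the backward rule for x ** 2 saved its input
y.backward()
```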

A simplified view:

```text id="ak2uiu"
forward pass:
    compute outputs
    save needed tensors

backward pass:
    load saved tensors
    compute gradients
    release graph
```

When memory is limited, techniques such as gradient checkpointing store fewer tensors and recompute them during backward.
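This trade-off can be sketched with `torch.utils.checkpoint` (the `block` function here is illustrative): activations inside the checkpointed block are not saved during the forward pass and are recomputed during backward.

```python id="ck5pt9"
import torch
from torch.utils.checkpoint import checkpoint

def block(t):
    # Intermediates inside this block are recomputed in backward
    # instead of being saved during forward.
    return torch.relu(t) ** 2

x = torch.randn(4, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)

y.sum().backward()
print(x.grad.shape)  # torch.Size([4])
```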

### Gradient Accumulation Semantics

PyTorch accumulates gradients into the `.grad` field.

```python id="rlk2z4"
x = torch.tensor(2.0, requires_grad=True)

y1 = x ** 2
y1.backward()

print(x.grad)  # tensor(4.)

y2 = 3 * x
y2.backward()

print(x.grad)  # tensor(7.)
```

The second backward call adds \(3\) to the existing gradient.

This behavior supports cases where one wants to accumulate gradients across multiple losses or microbatches. In ordinary training, clear gradients before the backward pass:

```python id="xidins"
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

A common memory-efficient pattern is:

```python id="coz7mt"
optimizer.zero_grad()

for microbatch in microbatches:
    loss = compute_loss(microbatch)
    loss = loss / len(microbatches)
    loss.backward()

optimizer.step()
```

Here gradients are intentionally accumulated before one optimizer step.

### Disabling Gradient Tracking

Not every tensor operation needs gradients. Evaluation, metric computation, logging, and target generation often should not build a graph.

Use `torch.no_grad()` when gradients are unnecessary:

```python id="fwnm46"
model.eval()

with torch.no_grad():
    pred = model(X)
```

Use `torch.inference_mode()` for inference-only code:

```python id="ufo6af"
model.eval()

with torch.inference_mode():
    pred = model(X)
```

Both reduce memory use by preventing graph construction. `inference_mode()` can be faster because it disables additional autograd bookkeeping, but it is less flexible if tensors later need to re-enter gradient-tracked computation.
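One consequence worth seeing in code: a tensor produced under `no_grad` has no gradient history, but it can still participate in later gradient-tracked computation as a constant. A sketch:

```python id="ng2rd8"
import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    a = x * 2  # no graph is built; a has no history

print(a.requires_grad)  # False

# a re-enters tracked computation as a constant,
# so d/dx of sum(a * x) is simply a.
c = (a * x).sum()
c.backward()
print(torch.allclose(x.grad, a))  # True
```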

### Detaching Tensors

The method `.detach()` returns a tensor that shares data with the original tensor but has no gradient history.

```python id="pjkhjz"
x = torch.tensor(2.0, requires_grad=True)

y = x ** 2
z = y.detach()

print(y.requires_grad)  # True
print(z.requires_grad)  # False
```

This stops gradient flow through `z`.

Example:

```python id="tqkfkk"
x = torch.tensor(2.0, requires_grad=True)

y = x ** 2
z = y.detach()
loss = 3 * z

print(loss.requires_grad)  # False
```

Detaching is common in reinforcement learning, teacher-student training, contrastive learning, and target-network methods.

A frequent mistake is detaching accidentally:

```python id="fhzw6j"
h = encoder(x)
h = h.detach()
out = decoder(h)
loss = loss_fn(out, target)
loss.backward()
```

In this code, gradients cannot reach `encoder`.

### `backward()` and `autograd.grad`

PyTorch provides two common ways to compute gradients.

The first is `backward()`, which accumulates gradients into `.grad` fields:

```python id="ygc03h"
x = torch.tensor(2.0, requires_grad=True)

y = x ** 3
y.backward()

print(x.grad)  # tensor(12.)
```

The second is `torch.autograd.grad`, which returns gradients directly:

```python id="wulv1w"
x = torch.tensor(2.0, requires_grad=True)

y = x ** 3
grad_x = torch.autograd.grad(y, x)[0]

print(grad_x)  # tensor(12.)
print(x.grad)  # None
```

Use `backward()` in ordinary training loops. Use `autograd.grad` when implementing algorithms that need explicit gradient values without accumulating into parameters, such as meta-learning, gradient penalties, implicit differentiation, or Hessian-vector products.
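As one example, a Hessian-vector product can be sketched with two calls to `autograd.grad`, using `create_graph=True` so that the first gradient is itself differentiable:

```python id="hv6pd4"
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
v = torch.tensor([1.0, 0.0])

y = (x ** 3).sum()  # the Hessian of y is diag(6x)

# First gradient, kept differentiable with create_graph=True.
g = torch.autograd.grad(y, x, create_graph=True)[0]

# Hessian-vector product: differentiate g in the direction v.
hvp = torch.autograd.grad(g, x, grad_outputs=v)[0]

print(hvp)  # tensor([6., 0.])
```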

### Higher-Order Derivatives

Autograd can compute derivatives of derivatives if asked to build a graph for the gradient computation.

```python id="8ts17p"
x = torch.tensor(2.0, requires_grad=True)

y = x ** 3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]

d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]

print(dy_dx)    # tensor(12., grad_fn=<MulBackward0>)
print(d2y_dx2)  # tensor(12.)
```

The argument

```python id="jhlk6z"
create_graph=True
```

tells PyTorch to record the operations used to compute the first derivative.

Higher-order derivatives are more expensive than ordinary gradients. They also retain more graph structure in memory. Use them only when the algorithm requires them.

### Custom Autograd Functions

Most models can be built from existing PyTorch operations. Sometimes one needs a custom operation with a custom backward rule.

PyTorch supports this with `torch.autograd.Function`.

```python id="ldqbgp"
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_x = grad_output * 2 * x
        return grad_x

x = torch.tensor(3.0, requires_grad=True)

y = Square.apply(x)
y.backward()

print(x.grad)  # tensor(6.)
```

The `forward` method computes the output. The `backward` method receives the upstream gradient and returns gradients for each input.

Custom functions are useful for special kernels, memory-efficient operations, numerical tricks, or research prototypes. They should be tested carefully with gradient checking.

### Gradient Checking for Custom Functions

When writing a custom backward rule, compare it with numerical finite differences.

PyTorch provides `gradcheck`, which works best with double precision:

```python id="2u4vef"
import torch
from torch.autograd import gradcheck

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x

x = torch.randn(3, dtype=torch.double, requires_grad=True)

test = gradcheck(Square.apply, (x,))
print(test)
```

A result of `True` means the custom backward rule agrees with finite-difference estimates within tolerance.

### In-Place Operations

In-place operations modify a tensor directly. Examples include:

```python id="n4ba44"
x.add_(1)
x.relu_()
```

By convention, a trailing underscore marks an in-place PyTorch operation.

In-place operations can save memory, but they may break autograd if they overwrite values needed for backward.

```python id="6cxlp1"
x = torch.randn(4, requires_grad=True)

h = x * 1.0      # non-leaf tensor
y = h ** 2
h.add_(1.0)      # overwrites a value saved for backward

loss = y.sum()
loss.backward()  # RuntimeError
```

This fails because the backward rule for `h ** 2` needs the original `h`, which the in-place `add_` has overwritten. (In-place operations directly on a leaf tensor that requires gradients, such as `x.add_(1.0)`, raise an error immediately.)

Use out-of-place operations unless memory pressure requires in-place modification and the operation is known to be safe:

```python id="cw9y67"
x2 = x + 1.0
```

### Views, Copies, and Contiguity

Some tensor operations return views. A view shares storage with the original tensor but has different shape or strides.

```python id="928u3l"
x = torch.arange(6.0, requires_grad=True)
y = x.view(2, 3)

print(y)
```

Operations such as `view`, `reshape`, `transpose`, `permute`, `squeeze`, and `unsqueeze` may affect layout. Autograd tracks these operations, but in-place modifications on views can be subtle.
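The storage sharing is easy to demonstrate (gradient tracking is off here to isolate the effect): writing through a view changes the base tensor.

```python id="vw3sh5"
import torch

x = torch.zeros(6)
v = x.view(2, 3)  # v shares storage with x

v[0, 0] = 1.0     # writing through the view changes x
print(x[0])       # tensor(1.)
```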

A tensor is contiguous when its elements are stored in memory in the order expected by its shape. After `permute`, a tensor may be non-contiguous:

```python id="d3ffw2"
x = torch.randn(2, 3, 4)
y = x.permute(0, 2, 1)

print(y.is_contiguous())  # False
```

Some operations require contiguous memory. Calling `.contiguous()` creates a contiguous copy:

```python id="5sb5we"
z = y.contiguous()
```

This matters for performance and for some low-level operations.

### Autograd and Modules

`nn.Module` parameters are tensors wrapped as `nn.Parameter`. By default, they require gradients.

```python id="b2roeg"
import torch.nn as nn

layer = nn.Linear(3, 4)

print(layer.weight.requires_grad)  # True
print(layer.bias.requires_grad)    # True
```

After a backward pass:

```python id="z9vmrh"
X = torch.randn(8, 3)
Y = layer(X)
loss = Y.sum()

loss.backward()

print(layer.weight.grad.shape)
print(layer.bias.grad.shape)
```

Optimizers read these `.grad` fields and update the parameter values.
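What an optimizer does with these fields can be sketched as plain SGD (the learning rate is arbitrary): read each parameter's `.grad` and update the parameter in place under `no_grad`.

```python id="sg1dp6"
import torch
import torch.nn as nn

layer = nn.Linear(3, 4)
loss = layer(torch.randn(8, 3)).sum()
loss.backward()

# A sketch of what optimizer.step() does for plain SGD.
lr = 0.1
with torch.no_grad():
    for p in layer.parameters():
        p -= lr * p.grad
```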

To freeze a layer:

```python id="uhjxi5"
for param in layer.parameters():
    param.requires_grad = False
```

Frozen parameters do not receive gradients, so their `.grad` stays `None` and optimizers leave them unchanged.

### Common Autograd Errors

A common error is calling `backward()` on a tensor with no graph:

```python id="47hvl9"
x = torch.tensor(2.0)
y = x ** 2
y.backward()
```

The fix:

```python id="v31xdx"
x = torch.tensor(2.0, requires_grad=True)
```

Another error is calling backward twice through the same graph:

```python id="x1f6os"
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

y.backward()
y.backward()  # RuntimeError: the graph has already been freed
```

The graph is freed after the first backward call. To reuse it, pass `retain_graph=True` to the first call:

```python id="j7wxew"
y.backward(retain_graph=True)
```

This is rarely needed in ordinary training.

Another common issue is `grad is None`. This usually means the tensor did not contribute to the loss, does not require gradients, or is not a leaf tensor whose gradient is retained.

```python id="o91ryj"
for name, param in model.named_parameters():
    if param.grad is None:
        print(name, "has no gradient")
```

### Mental Model

A useful mental model of autograd is:

```text id="4tms1q"
forward:
    run tensor operations
    build graph
    save needed values

backward:
    start from scalar loss
    traverse graph backward
    apply local derivative rules
    accumulate gradients into leaves
```

The user writes ordinary tensor code. The engine records how tensors depend on one another. The chain rule supplies the mathematics. The optimizer uses the computed gradients.

### Summary

An automatic differentiation engine computes derivatives by recording operations and applying local derivative rules through the chain rule. PyTorch autograd builds dynamic computational graphs during execution. Tensors with `requires_grad=True` participate in this graph.

Calling `backward()` starts reverse-mode differentiation from a scalar loss and accumulates gradients into leaf tensors. Tools such as `no_grad`, `inference_mode`, `detach`, `autograd.grad`, custom `Function`, and `gradcheck` give finer control over gradient computation.

Understanding autograd makes PyTorch training behavior predictable. It explains why gradients accumulate, why graphs consume memory, why in-place operations can fail, and how model parameters receive updates.

