# Gradient Flow in Deep Networks

Gradient flow describes how derivative information moves backward through a neural network during training. A model may have the correct architecture and loss function, yet train poorly because its gradients shrink, explode, become noisy, or fail to reach important layers.

Backpropagation computes gradients. Gradient flow describes the quality of those gradients.

### The Basic Idea

Consider a deep network written as a sequence of transformations:

$$
h_1 = f_1(x), \quad
h_2 = f_2(h_1), \quad
\ldots, \quad
h_L = f_L(h_{L-1}), \quad
\ell = \mathcal{L}(h_L, y).
$$

The gradient with respect to an early hidden state is computed by repeated application of the chain rule:

$$
\frac{\partial \ell}{\partial h_k} =
\frac{\partial \ell}{\partial h_L}
\frac{\partial h_L}{\partial h_{L-1}}
\cdots
\frac{\partial h_{k+1}}{\partial h_k}.
$$

This expression contains a product of many derivative terms. If those terms tend to shrink vectors, early layers receive very small gradients. If they tend to amplify vectors, early layers receive very large gradients.
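
To see the effect of this product numerically, here is a toy sketch (the random matrices standing in for per-layer Jacobians, and the specific depth and width, are illustrative assumptions, not from the text):

```python
import torch

torch.manual_seed(0)
depth, dim = 50, 64

# Toy illustration: repeatedly multiply a vector by random matrices that
# stand in for per-layer Jacobians. A per-layer scale below 1 shrinks the
# vector toward zero; a scale above 1 blows it up.
for scale in (0.5, 1.0, 1.5):
    v = torch.randn(dim)
    for _ in range(depth):
        J = scale * torch.randn(dim, dim) / dim ** 0.5
        v = J @ v
    print(f"scale {scale}: norm after {depth} layers = {v.norm().item():.3e}")
```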

### Vanishing Gradients

A vanishing gradient occurs when gradients become extremely small as they move backward through the network.

Suppose each layer roughly multiplies gradient magnitude by \(0.5\). After 20 layers, the scale becomes

$$
0.5^{20}
\approx
9.54\times10^{-7}.
$$

A gradient that small produces almost no parameter update.

This problem was especially severe in older deep networks using sigmoid or tanh activations. These functions saturate. When the input is very positive or very negative, the derivative becomes small.

For sigmoid,

$$
\sigma(x)=\frac{1}{1+e^{-x}},
$$

$$
\sigma'(x)=\sigma(x)(1-\sigma(x)).
$$

The maximum derivative is \(0.25\). Across many layers, repeated multiplication by values below \(1\) can make gradients vanish.

In PyTorch:

```python
import torch

x = torch.tensor([0.0, 2.0, 10.0])
s = torch.sigmoid(x)

print(s)
print(s * (1 - s))
```

For large positive input, sigmoid is close to \(1\), and its derivative is close to \(0\).

### Exploding Gradients

An exploding gradient occurs when gradients become very large as they move backward.

Suppose each layer roughly multiplies gradient magnitude by \(1.5\). After 40 layers:

$$
1.5^{40}
\approx
1.1\times10^{7}.
$$

Large gradients can make parameter updates unstable. The loss may become `nan`, parameters may overflow, or training may oscillate without convergence.

Exploding gradients often occur in recurrent networks, very deep networks, and models with poor initialization. They also appear when the learning rate is too large.
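
A small sketch of the initialization case (the pure linear stack and the specific scales are illustrative assumptions, not from the text). Using `.sum()` as the loss makes the gradient at the output all ones, so the gradient arriving at the input reflects only the product of layer Jacobians:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 30, 64

# Compare the gradient reaching the input under a well-scaled initialization
# and one that is only 1.5x too large per layer.
for std_scale in (1.0, 1.5):
    layers = [nn.Linear(dim, dim, bias=False) for _ in range(depth)]
    model = nn.Sequential(*layers)
    for layer in layers:
        nn.init.normal_(layer.weight, std=std_scale / dim ** 0.5)

    x = torch.randn(8, dim, requires_grad=True)
    model(x).sum().backward()
    print(f"std scale {std_scale}: input grad norm = {x.grad.norm().item():.3e}")
```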

A common symptom is a sudden jump in loss:

```text
step 100: loss = 2.31
step 101: loss = 2.28
step 102: loss = 15793.4
step 103: loss = nan
```

### Gradient Norms

A practical way to inspect gradient flow is to measure gradient norms.

For a parameter tensor \(W\), a common choice is the Euclidean (L2) norm of its gradient:

$$
\|\nabla_W \ell\|_2.
$$

In PyTorch:

```python
def grad_norms(model):
    rows = []
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        rows.append((name, param.grad.norm().item()))
    return rows
```

After `loss.backward()`, call:

```python
for name, norm in grad_norms(model):
    print(name, norm)
```

Very small norms in early layers may indicate vanishing gradients. Extremely large norms may indicate instability.

A more complete diagnostic also tracks activation statistics:

```python
def tensor_stats(x):
    return {
        "mean": x.mean().item(),
        "std": x.std().item(),
        "min": x.min().item(),
        "max": x.max().item(),
    }
```

Training stability depends on both activations and gradients.
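
One way to collect activation statistics during a forward pass is with forward hooks. The sketch below is an assumption about how this could be wired up (the helper name `register_stat_hooks` is made up for illustration) and reuses `tensor_stats` from above:

```python
import torch

def register_stat_hooks(model, store):
    # Attach a forward hook to every submodule so that each forward pass
    # records the statistics of that module's output tensor.
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            if torch.is_tensor(output):
                store[name] = tensor_stats(output)
        handles.append(module.register_forward_hook(hook))
    return handles

# Usage: stats = {}; handles = register_stat_hooks(model, stats); run a
# forward pass, inspect `stats`, then call h.remove() on each handle.
```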

### Initialization and Gradient Flow

Weight initialization affects how signal and gradient magnitudes change across layers.

If weights are too small, activations and gradients shrink. If weights are too large, activations and gradients grow.

For a linear layer,

$$
h = Wx,
$$

we want the variance of \(h\) to remain controlled when the layer is applied. This leads to initialization schemes that scale weights according to fan-in and fan-out.

For activations such as tanh, Xavier initialization is often used. For ReLU networks, Kaiming initialization is often used.

In PyTorch:

```python
import torch.nn as nn

layer = nn.Linear(128, 64)

nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)
```

Good initialization does not guarantee successful training, but poor initialization can prevent training before it starts.
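
A small sketch of why scaling matters (the depth, width, and the deliberately undersized standard deviation are illustrative assumptions): with weights that are too small, ReLU activations collapse toward zero within a few layers, while Kaiming-initialized weights keep their scale roughly constant.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def final_activation_std(depth=20, dim=256, std=None):
    # Push random inputs through a plain ReLU stack and report the standard
    # deviation of the activations after the last layer.
    x = torch.randn(1024, dim)
    for _ in range(depth):
        layer = nn.Linear(dim, dim, bias=False)
        if std is None:
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        else:
            nn.init.normal_(layer.weight, std=std)
        x = torch.relu(layer(x))
    return x.std().item()

print("std=0.01 init:", final_activation_std(std=0.01))  # collapses toward 0
print("Kaiming init :", final_activation_std())          # scale roughly preserved
```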

### Activation Functions and Gradient Flow

Activation choice strongly affects gradient flow.

Sigmoid and tanh can saturate. ReLU avoids saturation for positive values:

$$
\text{ReLU}(x)=\max(0,x).
$$

Its derivative is

$$
\text{ReLU}'(x) =
\begin{cases}
1, & x>0, \\
0, & x<0.
\end{cases}
$$

For positive inputs, the gradient passes through unchanged; for negative inputs, it is blocked entirely.

This behavior makes ReLU networks easier to train than sigmoid networks in many deep architectures, but it can also create dead units. A dead ReLU is a unit that always receives negative inputs and therefore always outputs zero; since its gradient is also zero, it cannot recover during training.

Variants such as Leaky ReLU, GELU, and SiLU reduce some of these problems.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, steps=7)

print(F.relu(x))
print(F.gelu(x))
print(F.silu(x))
```
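
A quick dead-unit check (the `zero_fraction` helper and the shifted input below are illustrative, not from the text) is to measure the fraction of activations that come out exactly zero:

```python
import torch
import torch.nn.functional as F

def zero_fraction(activations):
    # Fraction of entries that are exactly zero; a unit whose outputs are
    # always zero across many batches is likely dead.
    return (activations == 0).float().mean().item()

h = F.relu(torch.randn(1024, 128) - 2.0)  # inputs shifted to be mostly negative
print(zero_fraction(h))                   # close to 1.0
```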

### Normalization Layers

Normalization layers help stabilize activations and gradients.

Batch normalization normalizes activations using statistics computed over a minibatch. Layer normalization normalizes using statistics computed over features within each example.

Batch normalization is common in convolutional networks. Layer normalization is common in transformers.

A normalization layer changes the geometry of the optimization problem. It can reduce sensitivity to initialization and allow larger learning rates. It also changes gradient flow by keeping intermediate activations in a controlled numerical range.

In PyTorch:

```python
bn = nn.BatchNorm1d(128)  # normalizes each feature over the batch dimension
ln = nn.LayerNorm(128)    # normalizes each example over its 128 features

x = torch.randn(32, 128)  # (batch, features)

y_bn = bn(x)
y_ln = ln(x)
```

Batch normalization depends on batch statistics and behaves differently during training and evaluation. Layer normalization uses per-example statistics and behaves more consistently across batch sizes.
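
A small sketch of the train/eval difference (the fresh `BatchNorm1d` layer and single forward pass are illustrative assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(128)
x = torch.randn(32, 128)

bn.train()
y_train = bn(x)  # uses this batch's mean and variance, updates running stats

bn.eval()
y_eval = bn(x)   # uses the running statistics accumulated so far

print((y_train - y_eval).abs().max().item())  # typically not zero
```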

### Residual Connections

Residual connections are one of the most important tools for gradient flow.

A residual block computes

$$
y = x + F(x).
$$

The gradient is

$$
\frac{\partial \ell}{\partial x} =
\frac{\partial \ell}{\partial y}
\left(
I + \frac{\partial F}{\partial x}
\right).
$$

The identity term gives gradients a direct path backward. Even if the derivative through \(F\) is small, the skip connection can still carry gradient information.

In PyTorch:

```python
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)
```
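
As a quick usage check (the depth and width here are arbitrary illustrative choices), even a fairly deep stack of these blocks still delivers a healthy gradient to its input:

```python
import torch
import torch.nn as nn

model = nn.Sequential(*[ResidualBlock(128) for _ in range(50)])

x = torch.randn(16, 128, requires_grad=True)
model(x).sum().backward()

print(x.grad.norm().item())  # stays far from zero even at depth 50
```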

Residual connections made very deep convolutional networks and transformers much easier to train.

### Gradient Clipping

Gradient clipping limits gradient size. It is commonly used when training recurrent networks, transformers, and reinforcement learning systems.

The most common form clips the global norm:

$$
g \leftarrow g \cdot \min\left(1, \frac{c}{\|g\|}\right),
$$

where \(c\) is the clipping threshold.

In PyTorch:

```python
loss.backward()

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```
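
`clip_grad_norm_` also returns the total gradient norm computed before clipping, which is convenient for logging:

```python
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print("grad norm before clipping:", float(total_norm))
```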

Clipping does not fix the underlying cause of unstable gradients, but it prevents a single bad step from destroying training.

### Learning Rate and Gradient Flow

The gradient gives direction and scale. The learning rate determines how strongly parameters move in that direction.

For gradient descent,

$$
\theta_{t+1} =
\theta_t -
\eta \nabla_\theta \ell.
$$

Even if gradients are mathematically correct, a poor learning rate can make training fail.

If \(\eta\) is too small, training is slow. If \(\eta\) is too large, training can diverge.

Schedulers help adjust the learning rate during training:

```python
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=1000,
)

for step in range(1000):
    optimizer.zero_grad()
    loss = compute_loss()  # placeholder for the forward pass and loss
    loss.backward()
    optimizer.step()
    scheduler.step()       # advance the schedule once per optimizer step
```

Warmup is often used for transformers. The learning rate starts small, increases during early steps, then decays.
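
A minimal warmup-then-decay sketch using `LambdaLR`, reusing the optimizer from the snippet above (the 100-step warmup and the linear decay shape are illustrative choices, not prescriptions):

```python
import torch

warmup_steps = 100
total_steps = 1000

def lr_lambda(step):
    # Linear warmup for the first `warmup_steps`, then linear decay to zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
```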

### Gradient Flow in Transformers

Transformers rely heavily on residual connections and normalization.

A simplified transformer block is:

$$
x' = x + \text{Attention}(\text{Norm}(x)),
$$

$$
y = x' + \text{MLP}(\text{Norm}(x')).
$$

This is the pre-norm transformer pattern. It often trains more stably than post-norm designs for deep models because normalization happens before each sublayer and residual paths remain direct.

In PyTorch-like pseudocode:

```python
class TransformerBlock(nn.Module):
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attn
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = mlp

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```
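
As one possible instantiation (the `SelfAttention` wrapper and the dimensions below are assumptions for illustration; `nn.MultiheadAttention` returns a tuple, so it needs a small adapter to fit the single-input interface above):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

block = TransformerBlock(
    dim=64,
    attn=SelfAttention(64, num_heads=4),
    mlp=nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)),
)

x = torch.randn(8, 16, 64)  # (batch, sequence, features)
y = block(x)
```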

This block structure helps preserve gradient flow through many layers.

### Diagnosing Gradient Flow Problems

When a model fails to train, inspect the following quantities:

| Signal | Possible issue |
|---|---|
| Loss becomes `nan` | Exploding gradients, bad data, unstable loss |
| Early-layer gradient norms near zero | Vanishing gradients |
| Very large gradient norms | Exploding gradients or an overly large learning rate |
| Activations have huge magnitude | Poor initialization or unstable normalization |
| Activations mostly zero | Dead ReLU units or sparse gates |
| Gradients are `None` | Detached graph or parameter unused in loss |

A simple debug loop:

```python
optimizer.zero_grad()
pred = model(X)
loss = loss_fn(pred, y)
loss.backward()

for name, param in model.named_parameters():
    if param.grad is None:
        print(name, "grad is None")
    else:
        print(name, param.grad.norm().item())
```

This often catches graph disconnections and unstable layers quickly.

### Practical Rules

- Use initialization suited to the activation function.
- Use normalization layers in deep networks.
- Prefer residual connections when stacking many layers.
- Clip gradients when training unstable sequence models.
- Track gradient norms during debugging.
- Use learning-rate warmup for large transformer models.

For small models, gradient flow problems may be minor. For deep networks, recurrent models, and foundation models, gradient flow is a central design constraint.

### Summary

Gradient flow describes how useful gradient information travels backward through a network. Poor gradient flow causes vanishing gradients, exploding gradients, dead units, unstable updates, and disconnected parameters.

Backpropagation gives exact gradients for the computational graph. Gradient flow determines whether those gradients are numerically useful for optimization. Initialization, activation functions, normalization, residual connections, learning-rate schedules, and clipping are all tools for controlling this flow.

