# Recurrent Computation

A feedforward neural network processes inputs through a fixed sequence of layers. Once the output is produced, the computation ends. There is no memory of previous inputs.

Sequential problems require a different structure. A model processing text, audio, or time series must preserve information across positions. The output at time $t$ should depend not only on the current input $x_t$, but also on earlier observations.

Recurrent neural networks solve this problem by introducing a hidden state that evolves over time.

### The Core Idea of Recurrence

A recurrent neural network processes a sequence one step at a time. At each step, it combines:

1. The current input $x_t$
2. The previous hidden state $h_{t-1}$

to produce a new hidden state:

$$
h_t = f(h_{t-1}, x_t).
$$

The hidden state acts as memory. It stores information extracted from earlier sequence elements.

A simple recurrent update is

$$
h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b_h).
$$

The output may then be computed as

$$
y_t = W_{hy}h_t + b_y.
$$

Here:

| Symbol | Meaning |
|---|---|
| $x_t$ | Input at time step $t$ |
| $h_t$ | Hidden state at time step $t$ |
| $y_t$ | Output at time step $t$ |
| $W_{xh}$ | Input-to-hidden weights |
| $W_{hh}$ | Hidden-to-hidden recurrent weights |
| $W_{hy}$ | Hidden-to-output weights |

The same parameter matrices are reused at every time step. This parameter sharing is one of the defining properties of recurrent networks.
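
To make the update concrete, here is a minimal sketch of a single recurrent step with explicit weight matrices (the sizes are illustrative, not taken from any particular model):

```python id="t4r8wz"
import torch

d, n, o = 4, 3, 2                  # illustrative input, hidden, output sizes

W_xh = torch.randn(n, d)           # input-to-hidden weights
W_hh = torch.randn(n, n)           # hidden-to-hidden weights
b_h = torch.zeros(n)
W_hy = torch.randn(o, n)           # hidden-to-output weights
b_y = torch.zeros(o)

x_t = torch.randn(d)
h_prev = torch.zeros(n)

h_t = torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # recurrent update
y_t = W_hy @ h_t + b_y                              # output projection

print(h_t.shape, y_t.shape)  # torch.Size([3]) torch.Size([2])
```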

### Reusing the Same Computation

Unlike feedforward networks, recurrent networks apply the same transformation repeatedly across sequence positions.

Suppose we process a sequence:

$$
x_1, x_2, x_3, \ldots, x_T.
$$

The recurrence unfolds as

$$
h_1 = f(h_0, x_1),
$$

$$
h_2 = f(h_1, x_2),
$$

$$
h_3 = f(h_2, x_3),
$$

and so on.

The hidden state carries information forward through time.

This repeated reuse has several important consequences:

| Property | Effect |
|---|---|
| Shared parameters | Fewer learnable weights |
| Variable-length processing | Same model handles different sequence lengths |
| Temporal memory | Earlier inputs affect later outputs |
| Recursive structure | Natural modeling of ordered data |

Because the same function is reused, the model learns a general transition rule rather than separate computations for each position.
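
The following sketch makes this concrete: one transition rule, defined once, processes sequences of different lengths with the same weights (the dimensions are illustrative):

```python id="p7u2kd"
import torch

W_xh = torch.randn(32, 16)
W_hh = torch.randn(32, 32)
b_h = torch.zeros(32)

def step(x_t, h_prev):
    # the single transition rule shared by every position
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

for T in (5, 9):                      # two different sequence lengths
    h = torch.zeros(32)
    for t in range(T):
        h = step(torch.randn(16), h)  # same weights at every step
    print(T, h.shape)                 # 5 torch.Size([32]) / 9 torch.Size([32])
```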

### Unrolling Through Time

Although an RNN is defined recursively, it is easier to analyze by expanding the recurrence into a computational graph.

This process is called unrolling.

For a sequence of length 3:

$$
h_1 = f(h_0, x_1),
$$

$$
h_2 = f(h_1, x_2),
$$

$$
h_3 = f(h_2, x_3).
$$

The network can be visualized as a chain of repeated copies of the same cell:

```text
x1 -> [RNN Cell] -> h1
           | h1
           v
x2 -> [RNN Cell] -> h2
           | h2
           v
x3 -> [RNN Cell] -> h3
```

Each cell has identical parameters. Only the hidden state changes over time.

Unrolling converts recurrent computation into a deep computational graph whose depth equals the sequence length.

This interpretation is essential for understanding backpropagation through time.

### Dimensions in Recurrent Networks

Suppose:

$$
x_t \in \mathbb{R}^d,
$$

$$
h_t \in \mathbb{R}^h.
$$

Then the parameter shapes are:

| Parameter | Shape |
|---|---|
| $W_{xh}$ | $h \times d$ |
| $W_{hh}$ | $h \times h$ |
| $b_h$ | $h$ |
| $W_{hy}$ | $o \times h$ |

where $o$ is the output dimension.

For example:

- Input embedding size: 256
- Hidden state size: 512
- Vocabulary output size: 10,000

Then:

| Tensor | Shape |
|---|---|
| $x_t$ | `[256]` |
| $h_t$ | `[512]` |
| $W_{xh}$ | `[512, 256]` |
| $W_{hh}$ | `[512, 512]` |
| $W_{hy}$ | `[10000, 512]` |

The recurrent weight matrix $W_{hh}$ transforms the previous hidden state into the next hidden representation.
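
These shapes can be verified directly; a quick sketch with the dimensions above:

```python id="b9s3nf"
import torch

x_t = torch.randn(256)            # input embedding
h_prev = torch.randn(512)         # previous hidden state

W_xh = torch.randn(512, 256)
W_hh = torch.randn(512, 512)
b_h = torch.zeros(512)
W_hy = torch.randn(10000, 512)

h_t = torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
y_t = W_hy @ h_t

print(h_t.shape)  # torch.Size([512])
print(y_t.shape)  # torch.Size([10000])
```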

### Hidden State as Memory

The hidden state is intended to summarize previous sequence information.

At time step $t$,

$$
h_t
$$

contains information derived from

$$
x_1, x_2, \ldots, x_t.
$$

Thus the hidden state behaves as compressed memory.

For example, when reading the sentence:

> "The cat that chased the mouse was hungry."

the model should remember that the grammatical subject is “cat,” even after several intermediate words.

In practice, the hidden state cannot perfectly preserve arbitrary long-term information. As sequences grow longer, earlier information may become difficult to retain. This limitation leads to vanishing gradient problems and motivates gated architectures such as LSTMs and GRUs.

### Initial Hidden State

The recurrence requires an initial state:

$$
h_0.
$$

The simplest choice is a zero vector:

$$
h_0 = 0.
$$

In PyTorch, recurrent layers use zero initialization by default unless a custom initial state is provided.

Example:

```python id="a4kh2x"
import torch
import torch.nn as nn

rnn = nn.RNN(
    input_size=16,
    hidden_size=32,
    batch_first=True,
)

x = torch.randn(8, 20, 16)   # [batch, time, features]

output, h_n = rnn(x)

print(output.shape)  # [8, 20, 32]
print(h_n.shape)     # [1, 8, 32]
```

Here:

| Tensor | Meaning |
|---|---|
| `output` | Hidden state at every time step |
| `h_n` | Final hidden state |

The first dimension of `h_n` corresponds to the number of recurrent layers (multiplied by two for bidirectional RNNs); here it is 1.
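
A custom initial state can be passed as a second argument; it must have shape `[num_layers, B, hidden_size]`. Continuing the example above:

```python id="c2v7tm"
h_0 = torch.zeros(1, 8, 32)   # [num_layers, batch_size, hidden_size]

output, h_n = rnn(x, h_0)     # explicit zero state, same as the default

print(torch.equal(h_n, rnn(x)[1]))  # True: identical weights and start state
```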

### Processing Entire Sequences

In practice, we rarely compute recurrent steps manually. PyTorch recurrent modules process the whole sequence tensor at once.

Suppose the input shape is

```text
[B, T, D]
```

where:

| Symbol | Meaning |
|---|---|
| $B$ | Batch size |
| $T$ | Sequence length |
| $D$ | Feature dimension |

The RNN internally loops across time steps:

```python id="axoj8o"
# conceptual loop (pseudocode): each step consumes the previous state
h = h_0
for t in range(T):
    h = recurrence(x[t], h)
```

but the user only provides the full tensor.

Example:

```python id="kmopsl"
x = torch.randn(32, 100, 64)   # [batch, time, features]

rnn = nn.RNN(
    input_size=64,
    hidden_size=128,
    batch_first=True,
)

output, h_n = rnn(x)

print(output.shape)  # [32, 100, 128]
```

The output tensor contains the hidden representation for every position in the sequence.

### Many-to-One and Many-to-Many Models

Recurrent computation supports multiple sequence architectures.

#### Many-to-One

A sequence produces one output.

Example: sentiment classification.

$$
(x_1, \ldots, x_T) \mapsto y
$$

Typically, the final hidden state is used:

$$
y = g(h_T).
$$

Example:

```python id="2w0g1i"
final_hidden = output[:, -1, :]    # hidden state at the last time step
logits = classifier(final_hidden)  # classifier: e.g. nn.Linear(hidden, classes)
```

#### Many-to-Many

A sequence produces outputs at every step.

Example: part-of-speech tagging.

$$
(x_1, \ldots, x_T)
\mapsto
(y_1, \ldots, y_T)
$$

Each hidden state generates a prediction.
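
A sketch of this pattern: a shared linear head (here an assumed `nn.Linear` called `tagger`) is applied to the hidden state at every position:

```python id="f5y1qa"
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=64, hidden_size=128, batch_first=True)
tagger = nn.Linear(128, 17)   # e.g. 17 part-of-speech tags

x = torch.randn(32, 100, 64)
output, _ = rnn(x)            # [32, 100, 128]

logits = tagger(output)       # the head is applied at every time step
print(logits.shape)           # [32, 100, 17]
```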

#### Sequence Generation

The model predicts the next element repeatedly:

$$
p(x_{t+1} \mid x_1, \ldots, x_t).
$$

This autoregressive setup is fundamental in language modeling.
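
A minimal greedy-decoding sketch, assuming an already trained embedding, recurrent layer, and output head (all names here are illustrative):

```python id="k8d4re"
import torch
import torch.nn as nn

vocab_size = 10000
embedding = nn.Embedding(vocab_size, 64)
rnn = nn.RNN(input_size=64, hidden_size=128, batch_first=True)
head = nn.Linear(128, vocab_size)

token = torch.zeros(1, 1, dtype=torch.long)  # assumed start-token id 0
h = None                                     # defaults to the zero state

for _ in range(20):
    emb = embedding(token)                   # [1, 1, 64]
    out, h = rnn(emb, h)                     # carry the state forward
    logits = head(out[:, -1, :])             # next-token scores
    token = logits.argmax(dim=-1, keepdim=True)  # feed prediction back in
```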

### Recurrent Computation as Dynamical Systems

An RNN can also be viewed as a discrete dynamical system.

The hidden state evolves recursively:

$$
h_t = f(h_{t-1}, x_t).
$$

Without input, the recurrence becomes:

$$
h_t = f(h_{t-1}).
$$

The system repeatedly transforms its own state.

The stability of this transformation matters. If the recurrent transformation expands too strongly, hidden states may explode. If it contracts too strongly, information disappears.

This behavior is closely related to the eigenvalues of the recurrent weight matrix $W_{hh}$.

If repeated multiplication grows exponentially, activations explode:

$$
\|W_{hh}^t\| \to \infty.
$$

If repeated multiplication shrinks exponentially, information vanishes:

$$
\|W_{hh}^t\| \to 0.
$$

These dynamics are one reason why training recurrent networks can be difficult.
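
A quick numerical illustration of both regimes, using scaled identity matrices so the eigenvalues are explicit:

```python id="w1n6hs"
import torch

h_0 = torch.randn(32)

for scale in (0.5, 1.5):
    W_hh = scale * torch.eye(32)   # every eigenvalue equals `scale`
    h = h_0.clone()
    for _ in range(50):
        h = W_hh @ h               # input-free recurrence
    print(scale, h.norm().item())  # vanishes for 0.5, explodes for 1.5
```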

### Recurrent Networks as Deep Networks

An unrolled recurrent network forms a deep computational graph.

For a sequence of length $T$, the recurrence creates a graph of depth $T$.

For example:

- 10 tokens → depth 10
- 1000 tokens → depth 1000

Thus RNNs are effectively extremely deep networks.

This explains why:

- gradients may vanish,
- gradients may explode,
- long-range dependencies are difficult,
- optimization becomes unstable for long sequences.

These issues motivated the development of gated recurrent networks and attention mechanisms.
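
The effect of this depth on gradients can be seen by unrolling a tanh recurrence and backpropagating to the initial state; a sketch with a mildly contractive transition matrix:

```python id="z3q9vb"
import torch

T, n = 100, 16
W_hh = 0.9 * torch.eye(n)        # contractive: eigenvalues below 1

h_0 = torch.randn(n, requires_grad=True)
h = h_0
for _ in range(T):               # unrolled graph of depth T
    h = torch.tanh(W_hh @ h)

h.sum().backward()               # backpropagation through time
print(h_0.grad.norm().item())    # tiny: the gradient has vanished
```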

### Computational Complexity

Suppose:

- sequence length: $T$
- hidden dimension: $h$

At each step, recurrent computation performs approximately:

$$
O(h^2)
$$

operations for the recurrent matrix multiplication, plus $O(hd)$ for the input projection.

Processing the full sequence costs approximately:

$$
O(Th^2).
$$
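
For concreteness, counting a multiply-add as two operations, a hidden size of 512 over 100 steps costs roughly

$$
T \cdot 2h^2 = 100 \cdot 2 \cdot 512^2 \approx 5.2 \times 10^7
$$

operations for the recurrent multiplications alone.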

A key limitation of recurrent models is sequential dependence:

$$
h_t \text{ requires } h_{t-1}.
$$

This prevents full parallelization across time steps.

Transformers later addressed this limitation by replacing recurrence with attention, allowing parallel sequence processing.

### A Minimal Recurrent Cell

A recurrent cell can be implemented directly.

Example:

```python id="hq9mx9"
class SimpleRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()

        # input-to-hidden and hidden-to-hidden transforms;
        # each nn.Linear carries its own bias term
        self.W_xh = nn.Linear(input_size, hidden_size)
        self.W_hh = nn.Linear(hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
        h_t = torch.tanh(
            self.W_xh(x_t) +
            self.W_hh(h_prev)
        )
        return h_t
```

Sequence processing:

```python id="eqt5x3"
cell = SimpleRNNCell(16, 32)

x = torch.randn(8, 20, 16)   # [batch, time, features]

h = torch.zeros(8, 32)       # zero initial hidden state

outputs = []

for t in range(20):
    h = cell(x[:, t, :], h)  # one recurrent step per position
    outputs.append(h)

outputs = torch.stack(outputs, dim=1)  # [8, 20, 32], like nn.RNN's output
```

This explicit implementation reveals the underlying mechanics hidden inside higher-level recurrent modules.

### Summary

Recurrent computation introduces memory into neural networks through a hidden state updated across sequence positions.

At each time step, the model combines the current input with the previous hidden state to produce a new representation. The same parameters are reused at every step, allowing recurrent networks to process sequences of arbitrary length.

Unrolling recurrence through time reveals that recurrent networks are deep computational graphs. This gives them expressive power for sequential data, but also creates optimization difficulties such as vanishing and exploding gradients.

The next section studies how gradients propagate through recurrent computation and why long-range dependency learning becomes difficult in standard RNNs.

