Recurrent Computation

A feedforward neural network processes inputs through a fixed sequence of layers. Once the output is produced, the computation ends. There is no memory of previous inputs.

Sequential problems require a different structure. A model processing text, audio, or time series must preserve information across positions. The output at time t should depend not only on the current input x_t, but also on earlier observations.

Recurrent neural networks solve this problem by introducing a hidden state that evolves over time.

The Core Idea of Recurrence

A recurrent neural network processes a sequence one step at a time. At each step, it combines:

  1. The current input x_t
  2. The previous hidden state h_{t-1}

to produce a new hidden state:

h_t = f(h_{t-1}, x_t).

The hidden state acts as memory. It stores information extracted from earlier sequence elements.

A simple recurrent update is

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h).

genui{“math_block_widget_always_prefetch_v2”:{“content”:“h_t=\tanh(W_{xh}x_t+W_{hh}h_{t-1}+b_h)”}}

The output may then be computed as

y_t = W_{hy} h_t + b_y.

Here:

Symbol    Meaning
x_t       Input at time step t
h_t       Hidden state at time step t
y_t       Output at time step t
W_{xh}    Input-to-hidden weights
W_{hh}    Hidden-to-hidden recurrent weights
W_{hy}    Hidden-to-output weights

The same parameter matrices are reused at every time step. This parameter sharing is one of the defining properties of recurrent networks.
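The update equations above can be checked numerically with a few matrix operations. A minimal sketch, assuming small hypothetical sizes (d = 3 inputs, h = 4 hidden units):

```python
import torch

# Hypothetical tiny dimensions, chosen only for illustration.
d, h = 3, 4  # input size, hidden size

torch.manual_seed(0)
W_xh = torch.randn(h, d)   # input-to-hidden weights
W_hh = torch.randn(h, h)   # hidden-to-hidden recurrent weights
b_h = torch.zeros(h)

x_t = torch.randn(d)       # current input
h_prev = torch.zeros(h)    # previous hidden state

# One recurrent update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
h_t = torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

print(h_t.shape)  # torch.Size([4])
```

Because of the tanh, every entry of h_t lies strictly between -1 and 1.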

Reusing the Same Computation

Unlike feedforward networks, recurrent networks apply the same transformation repeatedly across sequence positions.

Suppose we process a sequence:

x_1, x_2, x_3, \ldots, x_T.

The recurrence unfolds as

h_1 = f(h_0, x_1),
h_2 = f(h_1, x_2),
h_3 = f(h_2, x_3),

and so on.

The hidden state carries information forward through time.

This repeated reuse has several important consequences:

Property                    Effect
Shared parameters           Fewer learnable weights
Variable-length processing  Same model handles different sequence lengths
Temporal memory             Earlier inputs affect later outputs
Recursive structure         Natural modeling of ordered data

Because the same function is reused, the model learns a general transition rule rather than separate computations for each position.

Unrolling Through Time

Although an RNN is defined recursively, it is easier to analyze by expanding the recurrence into a computational graph.

This process is called unrolling.

For a sequence of length 3:

h_1 = f(h_0, x_1),
h_2 = f(h_1, x_2),
h_3 = f(h_2, x_3).

The network can be visualized as a chain of repeated copies of the same cell:

x1 -> [RNN Cell] -> h1
           |
           v
x2 -> [RNN Cell] -> h2
           |
           v
x3 -> [RNN Cell] -> h3

Each cell has identical parameters. Only the hidden state changes over time.

Unrolling converts recurrent computation into a deep computational graph whose depth equals the sequence length.

This interpretation is essential for understanding backpropagation through time.

Dimensions in Recurrent Networks

Suppose:

x_t \in \mathbb{R}^d, \quad h_t \in \mathbb{R}^h.

Then the parameter shapes are:

Parameter  Shape
W_{xh}     h × d
W_{hh}     h × h
b_h        h
W_{hy}     o × h

where o is the output dimension.

For example:

  • Input embedding size: 256
  • Hidden state size: 512
  • Vocabulary output size: 10,000

Then:

Tensor   Shape
x_t      [256]
h_t      [512]
W_{xh}   [512, 256]
W_{hh}   [512, 512]
W_{hy}   [10000, 512]

The recurrent weight matrix W_{hh} transforms the previous hidden state into the next hidden representation.
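These shapes can be verified directly. A sketch using the example sizes above; note that nn.Linear(in_features, out_features) stores its weight as [out_features, in_features], matching the table:

```python
import torch
import torch.nn as nn

# Example sizes from the text: embedding 256, hidden 512, vocabulary 10,000.
d, h, o = 256, 512, 10_000

W_xh = nn.Linear(d, h)   # weight shape [512, 256]
W_hh = nn.Linear(h, h)   # weight shape [512, 512]
W_hy = nn.Linear(h, o)   # weight shape [10000, 512]

x_t = torch.randn(d)
h_prev = torch.zeros(h)

h_t = torch.tanh(W_xh(x_t) + W_hh(h_prev))  # shape [512]
y_t = W_hy(h_t)                             # shape [10000]

print(W_xh.weight.shape, h_t.shape, y_t.shape)
```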

Hidden State as Memory

The hidden state is intended to summarize previous sequence information.

At time step t, the hidden state

h_t

contains information derived from

x_1, x_2, \ldots, x_t.

Thus the hidden state behaves as compressed memory.

For example, when reading the sentence:

“The cat that chased the mouse was hungry.”

the model should remember that the grammatical subject is “cat,” even after several intermediate words.

In practice, the hidden state cannot perfectly preserve arbitrary long-term information. As sequences grow longer, earlier information may become difficult to retain. This limitation leads to vanishing gradient problems and motivates gated architectures such as LSTMs and GRUs.

Initial Hidden State

The recurrence requires an initial state:

h_0.

The simplest choice is a zero vector:

h_0 = 0.

In PyTorch, recurrent layers use zero initialization by default unless a custom initial state is provided.

Example:

import torch
import torch.nn as nn

rnn = nn.RNN(
    input_size=16,
    hidden_size=32,
    batch_first=True,
)

x = torch.randn(8, 20, 16)

output, h_n = rnn(x)

print(output.shape)  # [8, 20, 32]
print(h_n.shape)     # [1, 8, 32]

Here:

Tensor   Meaning
output   Hidden state at every time step
h_n      Final hidden state

The first dimension of h_n corresponds to the number of recurrent layers (multiplied by two for bidirectional RNNs).
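A custom initial state can also be passed as the second argument to the module. As a sanity check, supplying an explicit zero state matches the default behavior:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(8, 20, 16)

# Explicit initial state: shape [num_layers, batch, hidden_size].
h0 = torch.zeros(1, 8, 32)

out_default, _ = rnn(x)       # implicit zero initial state
out_explicit, _ = rnn(x, h0)  # explicit zero initial state

print(torch.allclose(out_default, out_explicit))  # True
```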

Processing Entire Sequences

In practice, we rarely compute recurrent steps manually. PyTorch recurrent modules process the whole sequence tensor at once.

Suppose the input shape is

[B, T, D]

where:

Symbol  Meaning
B       Batch size
T       Sequence length
D       Feature dimension

The RNN internally loops across time steps:

for t in range(T):
    h_t = recurrence(x_t, h_{t-1})

but the user only provides the full tensor.

Example:

x = torch.randn(32, 100, 64)

rnn = nn.RNN(
    input_size=64,
    hidden_size=128,
    batch_first=True,
)

output, h_n = rnn(x)

print(output.shape)  # [32, 100, 128]

The output tensor contains the hidden representation for every position in the sequence.

Many-to-One and Many-to-Many Models

Recurrent computation supports multiple sequence architectures.

Many-to-One

A sequence produces one output.

Example: sentiment classification.

(x_1, \ldots, x_T) \mapsto y

Typically, the final hidden state is used:

y = g(h_T).

Example:

final_hidden = output[:, -1, :]
logits = classifier(final_hidden)

Many-to-Many

A sequence produces outputs at every step.

Example: part-of-speech tagging.

(x_1, \ldots, x_T) \mapsto (y_1, \ldots, y_T)

Each hidden state generates a prediction.
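This can be sketched by applying a linear classifier to every hidden state at once; num_tags here is a hypothetical tag-set size:

```python
import torch
import torch.nn as nn

num_tags = 10  # hypothetical tag-set size

rnn = nn.RNN(input_size=64, hidden_size=128, batch_first=True)
classifier = nn.Linear(128, num_tags)

x = torch.randn(32, 100, 64)
output, _ = rnn(x)           # [32, 100, 128]: one hidden state per step

# Applying the linear layer to the full tensor yields one prediction per step.
logits = classifier(output)  # [32, 100, 10]

print(logits.shape)
```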

Sequence Generation

The model predicts the next element repeatedly:

p(x_{t+1} \mid x_1, \ldots, x_t).

This autoregressive setup is fundamental in language modeling.
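A toy greedy-decoding loop illustrates the idea. The vocabulary size, dimensions, and start token below are arbitrary assumptions, and the untrained model produces meaningless tokens:

```python
import torch
import torch.nn as nn

# Hypothetical toy language model: embedding -> RNN -> vocabulary logits.
vocab_size, embed_dim, hidden_dim = 50, 16, 32

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)

token = torch.zeros(1, 1, dtype=torch.long)  # arbitrary start token
h = torch.zeros(1, 1, hidden_dim)            # initial hidden state
generated = [token.item()]

# Greedy decoding: feed each prediction back in as the next input.
for _ in range(5):
    out, h = rnn(embed(token), h)
    token = head(out[:, -1, :]).argmax(dim=-1, keepdim=True)
    generated.append(token.item())

print(generated)  # six token ids (meaningless: the model is untrained)
```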

Recurrent Computation as Dynamical Systems

An RNN can also be viewed as a discrete dynamical system.

The hidden state evolves recursively:

h_t = f(h_{t-1}, x_t).

Without input, the recurrence becomes:

h_t = f(h_{t-1}).

The system repeatedly transforms its own state.

The stability of this transformation matters. If the recurrent transformation expands too strongly, hidden states may explode. If it contracts too strongly, information disappears.

This behavior is closely related to the eigenvalues of the recurrent weight matrix W_{hh}.

If repeated multiplication grows exponentially, activations explode:

\|W_{hh}^t\| \to \infty.

If repeated multiplication shrinks exponentially, information vanishes:

\|W_{hh}^t\| \to 0.

These dynamics are one reason why training recurrent networks can be difficult.
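A small sketch makes the dynamics concrete, using a matrix whose spectral radius is controlled exactly: scaling an orthogonal matrix Q by r makes every singular value of (rQ)^t equal to r^t, so repeated application contracts or expands deterministically:

```python
import torch

h = 32

# An orthogonal matrix Q preserves norms, so every singular value of
# (r * Q)^t equals r^t: the scale factor r fully controls the dynamics.
Q, _ = torch.linalg.qr(torch.randn(h, h))

def power_norm(W, t):
    """Frobenius norm of W^t, modeling t steps of the linear recurrence."""
    return torch.linalg.matrix_power(W, t).norm().item()

shrinking = 0.9 * Q  # spectral radius < 1: state contracts toward zero
growing = 1.1 * Q    # spectral radius > 1: state expands without bound

for t in (1, 10, 50):
    print(t, power_norm(shrinking, t), power_norm(growing, t))
```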

Recurrent Networks as Deep Networks

An unrolled recurrent network forms a deep computational graph.

For a sequence of length T, the recurrence creates a graph of depth T.

For example:

  • 10 tokens → depth 10
  • 1000 tokens → depth 1000

Thus RNNs are effectively extremely deep networks.

This explains why:

  • gradients may vanish,
  • gradients may explode,
  • long-range dependencies are difficult,
  • optimization becomes unstable for long sequences.

These issues motivated the development of gated recurrent networks and attention mechanisms.

Computational Complexity

Suppose:

  • sequence length: T
  • hidden dimension: h

At each step, recurrent computation performs approximately:

O(h^2)

operations for the recurrent matrix multiplication.

Processing the full sequence costs approximately:

O(T h^2).

A key limitation of recurrent models is sequential dependence:

h_t requires h_{t-1}.

This prevents full parallelization across time steps.

Transformers later addressed this limitation by replacing recurrence with attention, allowing parallel sequence processing.

A Minimal Recurrent Cell

A recurrent cell can be implemented directly.

Example:

class SimpleRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()

        self.W_xh = nn.Linear(input_size, hidden_size)
        self.W_hh = nn.Linear(hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        h_t = torch.tanh(
            self.W_xh(x_t) +
            self.W_hh(h_prev)
        )
        return h_t

Sequence processing:

cell = SimpleRNNCell(16, 32)

x = torch.randn(8, 20, 16)   # [batch, time, features]

h = torch.zeros(8, 32)       # initial hidden state

outputs = []

for t in range(20):
    h = cell(x[:, t, :], h)
    outputs.append(h)

outputs = torch.stack(outputs, dim=1)  # [8, 20, 32]

This explicit implementation reveals the underlying mechanics hidden inside higher-level recurrent modules.
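As a sanity check, the manual loop can be compared against nn.RNN by copying the module's parameters into the cell; the class is repeated here so the snippet stands alone:

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):  # same cell as defined above
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_xh = nn.Linear(input_size, hidden_size)
        self.W_hh = nn.Linear(hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        return torch.tanh(self.W_xh(x_t) + self.W_hh(h_prev))

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
cell = SimpleRNNCell(16, 32)

# Copy nn.RNN's parameters into the cell (both compute the same tanh recurrence).
with torch.no_grad():
    cell.W_xh.weight.copy_(rnn.weight_ih_l0)
    cell.W_xh.bias.copy_(rnn.bias_ih_l0)
    cell.W_hh.weight.copy_(rnn.weight_hh_l0)
    cell.W_hh.bias.copy_(rnn.bias_hh_l0)

x = torch.randn(8, 20, 16)
out_builtin, _ = rnn(x)

# Unrolled manual loop over time steps.
h = torch.zeros(8, 32)
manual = []
for t in range(20):
    h = cell(x[:, t, :], h)
    manual.append(h)
out_manual = torch.stack(manual, dim=1)

print(torch.allclose(out_builtin, out_manual, atol=1e-5))  # True
```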

Summary

Recurrent computation introduces memory into neural networks through a hidden state updated across sequence positions.

At each time step, the model combines the current input with the previous hidden state to produce a new representation. The same parameters are reused at every step, allowing recurrent networks to process sequences of arbitrary length.

Unrolling recurrence through time reveals that recurrent networks are deep computational graphs. This gives them expressive power for sequential data, but also creates optimization difficulties such as vanishing and exploding gradients.

The next section studies how gradients propagate through recurrent computation and why long-range dependency learning becomes difficult in standard RNNs.