Recurrent Computation

A feedforward neural network processes inputs through a fixed sequence of layers. Once the output is produced, the computation ends. There is no memory of previous inputs.

Sequential problems require a different structure. A model processing text, audio, or time series must preserve information across positions. The output at time t should depend not only on the current input x_t, but also on earlier observations.

Recurrent neural networks solve this problem by introducing a hidden state that evolves over time.

The Core Idea of Recurrence

A recurrent neural network processes a sequence one step at a time. At each step, it combines:

  1. The current input x_t
  2. The previous hidden state h_{t-1}

to produce a new hidden state:

h_t = f(h_{t-1}, x_t).

The hidden state acts as memory. It stores information extracted from earlier sequence elements.

A simple recurrent update is

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h).

genui{“math_block_widget_always_prefetch_v2”:{“content”:“h_t=\tanh(W_{xh}x_t+W_{hh}h_{t-1}+b_h)”}}

The output may then be computed as

y_t = W_{hy} h_t + b_y.

Here:

Symbol    Meaning
x_t       Input at time step t
h_t       Hidden state at time step t
y_t       Output at time step t
W_{xh}    Input-to-hidden weights
W_{hh}    Hidden-to-hidden recurrent weights
W_{hy}    Hidden-to-output weights

The same parameter matrices are reused at every time step. This parameter sharing is one of the defining properties of recurrent networks.
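The update equations above can be checked numerically with a few matrix operations. A minimal sketch, assuming small hypothetical sizes (d = 3 inputs, h = 4 hidden units):

```python
import torch

# Hypothetical tiny dimensions, chosen only for illustration.
d, h = 3, 4  # input size, hidden size

torch.manual_seed(0)
W_xh = torch.randn(h, d)   # input-to-hidden weights
W_hh = torch.randn(h, h)   # hidden-to-hidden recurrent weights
b_h = torch.zeros(h)

x_t = torch.randn(d)       # current input
h_prev = torch.zeros(h)    # previous hidden state

# One recurrent update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
h_t = torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

print(h_t.shape)  # torch.Size([4])
```

Because of the tanh, every entry of h_t lies strictly between -1 and 1.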

Reusing the Same Computation

Unlike feedforward networks, recurrent networks apply the same transformation repeatedly across sequence positions.

Suppose we process a sequence:

x_1, x_2, x_3, \ldots, x_T.

The recurrence unfolds as

h_1 = f(h_0, x_1),
h_2 = f(h_1, x_2),
h_3 = f(h_2, x_3),

and so on.

The hidden state carries information forward through time.

This repeated reuse has several important consequences:

Property                    Effect
Shared parameters           Fewer learnable weights
Variable-length processing  Same model handles different sequence lengths
Temporal memory             Earlier inputs affect later outputs
Recursive structure         Natural modeling of ordered data

Because the same function is reused, the model learns a general transition rule rather than separate computations for each position.

Unrolling Through Time

Although an RNN is defined recursively, it is easier to analyze by expanding the recurrence into a computational graph.

This process is called unrolling.

For a sequence of length 3:

h_1 = f(h_0, x_1),
h_2 = f(h_1, x_2),
h_3 = f(h_2, x_3).

The network can be visualized as a chain of repeated copies of the same cell:

x1 -> [RNN Cell] -> h1
           |
           v
x2 -> [RNN Cell] -> h2
           |
           v
x3 -> [RNN Cell] -> h3

Each cell has identical parameters. Only the hidden state changes over time.

Unrolling converts recurrent computation into a deep computational graph whose depth equals the sequence length.

This interpretation is essential for understanding backpropagation through time.

Dimensions in Recurrent Networks

Suppose:

x_t \in \mathbb{R}^d, \quad h_t \in \mathbb{R}^h.

Then the parameter shapes are:

Parameter  Shape
W_{xh}     h × d
W_{hh}     h × h
b_h        h
W_{hy}     o × h

where o is the output dimension.

For example:

  • Input embedding size: 256
  • Hidden state size: 512
  • Vocabulary output size: 10,000

Then:

Tensor   Shape
x_t      [256]
h_t      [512]
W_{xh}   [512, 256]
W_{hh}   [512, 512]
W_{hy}   [10000, 512]

The recurrent weight matrix W_{hh} transforms the previous hidden state into the next hidden representation.
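These shapes can be verified directly. A sketch using the example sizes above; note that nn.Linear(in_features, out_features) stores its weight as [out_features, in_features], matching the table:

```python
import torch
import torch.nn as nn

# Example sizes from the text: embedding 256, hidden 512, vocabulary 10,000.
d, h, o = 256, 512, 10_000

W_xh = nn.Linear(d, h)   # weight shape [512, 256]
W_hh = nn.Linear(h, h)   # weight shape [512, 512]
W_hy = nn.Linear(h, o)   # weight shape [10000, 512]

x_t = torch.randn(d)
h_prev = torch.zeros(h)

h_t = torch.tanh(W_xh(x_t) + W_hh(h_prev))  # shape [512]
y_t = W_hy(h_t)                             # shape [10000]

print(W_xh.weight.shape, h_t.shape, y_t.shape)
```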

Hidden State as Memory

The hidden state is intended to summarize previous sequence information.

At time step t, the hidden state

h_t

contains information derived from

x_1, x_2, \ldots, x_t.

Thus the hidden state behaves as compressed memory.

For example, when reading the sentence:

“The cat that chased the mouse was hungry.”

the model should remember that the grammatical subject is “cat,” even after several intermediate words.

In practice, the hidden state cannot perfectly preserve arbitrary long-term information. As sequences grow longer, earlier information may become difficult to retain. This limitation leads to vanishing gradient problems and motivates gated architectures such as LSTMs and GRUs.

Initial Hidden State

The recurrence requires an initial state:

h_0.

The simplest choice is a zero vector:

h_0 = 0.

In PyTorch, recurrent layers use zero initialization by default unless a custom initial state is provided.

Example:

import torch
import torch.nn as nn

rnn = nn.RNN(
    input_size=16,
    hidden_size=32,
    batch_first=True,
)

x = torch.randn(8, 20, 16)

output, h_n = rnn(x)

print(output.shape)  # [8, 20, 32]
print(h_n.shape)     # [1, 8, 32]

Here:

Tensor   Meaning
output   Hidden state at every time step
h_n      Final hidden state

The first dimension of h_n corresponds to the number of recurrent layers (multiplied by two for bidirectional RNNs).
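A custom initial state can also be passed as the second argument to the module. As a sanity check, supplying an explicit zero state matches the default behavior:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(8, 20, 16)

# Explicit initial state: shape [num_layers, batch, hidden_size].
h0 = torch.zeros(1, 8, 32)

out_default, _ = rnn(x)       # implicit zero initial state
out_explicit, _ = rnn(x, h0)  # explicit zero initial state

print(torch.allclose(out_default, out_explicit))  # True
```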

Processing Entire Sequences

In practice, we rarely compute recurrent steps manually. PyTorch recurrent modules process the whole sequence tensor at once.

Suppose the input shape is

[B, T, D]

where:

Symbol  Meaning
B       Batch size
T       Sequence length
D       Feature dimension

The RNN internally loops across time steps:

for t in range(T):
    h_t = recurrence(x_t, h_{t-1})

but the user only provides the full tensor.

Example:

x = torch.randn(32, 100, 64)

rnn = nn.RNN(
    input_size=64,
    hidden_size=128,
    batch_first=True,
)

output, h_n = rnn(x)

print(output.shape)  # [32, 100, 128]

The output tensor contains the hidden representation for every position in the sequence.

Many-to-One and Many-to-Many Models

Recurrent computation supports multiple sequence architectures.

Many-to-One

A sequence produces one output.

Example: sentiment classification.

(x_1, \ldots, x_T) \mapsto y

Typically, the final hidden state is used:

y = g(h_T).

Example:

final_hidden = output[:, -1, :]
logits = classifier(final_hidden)

Many-to-Many

A sequence produces outputs at every step.

Example: part-of-speech tagging.

(x_1, \ldots, x_T) \mapsto (y_1, \ldots, y_T)

Each hidden state generates a prediction.
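This can be sketched by applying a linear classifier to every hidden state at once; num_tags here is a hypothetical tag-set size:

```python
import torch
import torch.nn as nn

num_tags = 10  # hypothetical tag-set size

rnn = nn.RNN(input_size=64, hidden_size=128, batch_first=True)
classifier = nn.Linear(128, num_tags)

x = torch.randn(32, 100, 64)
output, _ = rnn(x)           # [32, 100, 128]: one hidden state per step

# Applying the linear layer to the full tensor yields one prediction per step.
logits = classifier(output)  # [32, 100, 10]

print(logits.shape)
```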

Sequence Generation

The model predicts the next element repeatedly:

p(x_{t+1} \mid x_1, \ldots, x_t).

This autoregressive setup is fundamental in language modeling.
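A toy greedy-decoding loop illustrates the idea. The vocabulary size, dimensions, and start token below are arbitrary assumptions, and the untrained model produces meaningless tokens:

```python
import torch
import torch.nn as nn

# Hypothetical toy language model: embedding -> RNN -> vocabulary logits.
vocab_size, embed_dim, hidden_dim = 50, 16, 32

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)

token = torch.zeros(1, 1, dtype=torch.long)  # arbitrary start token
h = torch.zeros(1, 1, hidden_dim)            # initial hidden state
generated = [token.item()]

# Greedy decoding: feed each prediction back in as the next input.
for _ in range(5):
    out, h = rnn(embed(token), h)
    token = head(out[:, -1, :]).argmax(dim=-1, keepdim=True)
    generated.append(token.item())

print(generated)  # six token ids (meaningless: the model is untrained)
```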

Recurrent Computation as Dynamical Systems

An RNN can also be viewed as a discrete dynamical system.

The hidden state evolves recursively:

h_t = f(h_{t-1}, x_t).

Without input, the recurrence becomes:

h_t = f(h_{t-1}).

The system repeatedly transforms its own state.

The stability of this transformation matters. If the recurrent transformation expands too strongly, hidden states may explode. If it contracts too strongly, information disappears.

This behavior is closely related to the eigenvalues of the recurrent weight matrix W_{hh}.

If repeated multiplication grows exponentially, activations explode:

\|W_{hh}^t\| \to \infty.

If repeated multiplication shrinks exponentially, information vanishes:

\|W_{hh}^t\| \to 0.

These dynamics are one reason why training recurrent networks can be difficult.
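A small sketch makes the dynamics concrete, using a matrix whose spectral radius is controlled exactly: scaling an orthogonal matrix Q by r makes every singular value of (rQ)^t equal to r^t, so repeated application contracts or expands deterministically:

```python
import torch

h = 32

# An orthogonal matrix Q preserves norms, so every singular value of
# (r * Q)^t equals r^t: the scale factor r fully controls the dynamics.
Q, _ = torch.linalg.qr(torch.randn(h, h))

def power_norm(W, t):
    """Frobenius norm of W^t, modeling t steps of the linear recurrence."""
    return torch.linalg.matrix_power(W, t).norm().item()

shrinking = 0.9 * Q  # spectral radius < 1: state contracts toward zero
growing = 1.1 * Q    # spectral radius > 1: state expands without bound

for t in (1, 10, 50):
    print(t, power_norm(shrinking, t), power_norm(growing, t))
```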

Recurrent Networks as Deep Networks

An unrolled recurrent network forms a deep computational graph.

For a sequence of length T, the recurrence creates a graph of depth T.

For example:

  • 10 tokens → depth 10
  • 1000 tokens → depth 1000

Thus RNNs are effectively extremely deep networks.

This explains why:

  • gradients may vanish,
  • gradients may explode,
  • long-range dependencies are difficult,
  • optimization becomes unstable for long sequences.

These issues motivated the development of gated recurrent networks and attention mechanisms.

Computational Complexity

Suppose:

  • sequence length: T
  • hidden dimension: h

At each step, recurrent computation performs approximately:

O(h^2)

operations for the recurrent matrix multiplication.

Processing the full sequence costs approximately:

O(T h^2).

A key limitation of recurrent models is sequential dependence:

h_t requires h_{t-1}.

This prevents full parallelization across time steps.

Transformers later addressed this limitation by replacing recurrence with attention, allowing parallel sequence processing.

A Minimal Recurrent Cell

A recurrent cell can be implemented directly.

Example:

class SimpleRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()

        self.W_xh = nn.Linear(input_size, hidden_size)
        self.W_hh = nn.Linear(hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        h_t = torch.tanh(
            self.W_xh(x_t) +
            self.W_hh(h_prev)
        )
        return h_t

Sequence processing:

cell = SimpleRNNCell(16, 32)

x = torch.randn(8, 20, 16)   # [batch, time, features]

h = torch.zeros(8, 32)       # initial hidden state

outputs = []

for t in range(20):
    h = cell(x[:, t, :], h)
    outputs.append(h)

outputs = torch.stack(outputs, dim=1)  # [8, 20, 32]

This explicit implementation reveals the underlying mechanics hidden inside higher-level recurrent modules.
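As a sanity check, the manual loop can be compared against nn.RNN by copying the module's parameters into the cell; the class is repeated here so the snippet stands alone:

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):  # same cell as defined above
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_xh = nn.Linear(input_size, hidden_size)
        self.W_hh = nn.Linear(hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        return torch.tanh(self.W_xh(x_t) + self.W_hh(h_prev))

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
cell = SimpleRNNCell(16, 32)

# Copy nn.RNN's parameters into the cell (both compute the same tanh recurrence).
with torch.no_grad():
    cell.W_xh.weight.copy_(rnn.weight_ih_l0)
    cell.W_xh.bias.copy_(rnn.bias_ih_l0)
    cell.W_hh.weight.copy_(rnn.weight_hh_l0)
    cell.W_hh.bias.copy_(rnn.bias_hh_l0)

x = torch.randn(8, 20, 16)
out_builtin, _ = rnn(x)

# Unrolled manual loop over time steps.
h = torch.zeros(8, 32)
manual = []
for t in range(20):
    h = cell(x[:, t, :], h)
    manual.append(h)
out_manual = torch.stack(manual, dim=1)

print(torch.allclose(out_builtin, out_manual, atol=1e-5))  # True
```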

Summary

Recurrent computation introduces memory into neural networks through a hidden state updated across sequence positions.

At each time step, the model combines the current input with the previous hidden state to produce a new representation. The same parameters are reused at every step, allowing recurrent networks to process sequences of arbitrary length.

Unrolling recurrence through time reveals that recurrent networks are deep computational graphs. This gives them expressive power for sequential data, but also creates optimization difficulties such as vanishing and exploding gradients.

The next section studies how gradients propagate through recurrent computation and why long-range dependency learning becomes difficult in standard RNNs.