Bidirectional Networks

Standard recurrent neural networks process sequences in one direction, usually from left to right. At time step $t$, the hidden state summarizes only the past:

$$h_t = f(h_{t-1}, x_t).$$

This is appropriate for causal prediction tasks such as language generation, where future tokens are unavailable.

However, many sequence tasks are not causal. When labeling or analyzing a sequence, the entire input is already known. In such cases, future context can improve prediction quality.

Bidirectional recurrent networks address this limitation by processing the sequence in both directions.

Motivation

Consider the sentence:

The river bank was flooded after the storm.

To determine the meaning of the word “bank,” the model benefits from later words such as “flooded.”

Now consider:

She deposited cash at the bank yesterday.

Again, future context changes interpretation.

A left-to-right recurrent network only sees preceding tokens. It cannot use future information when predicting the hidden representation at position $t$.

For many tasks, this restriction is unnecessary.

Examples include:

Task                               Future context useful?
Named entity recognition           Yes
Part-of-speech tagging             Yes
Sentiment analysis                 Yes
Speech recognition (offline)       Yes
Language generation                No
Autoregressive decoding            No

Bidirectional networks allow the model to use both past and future context simultaneously.

Forward and Backward Passes

A bidirectional recurrent network contains two recurrent systems.

The forward network processes the sequence left to right:

$$\overrightarrow{h_t} = f(\overrightarrow{h_{t-1}}, x_t).$$

The backward network processes the sequence right to left:

$$\overleftarrow{h_t} = f(\overleftarrow{h_{t+1}}, x_t).$$

The final representation combines both hidden states:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}],$$

where $[a; b]$ denotes concatenation.

This representation contains:

  • information from earlier positions,
  • information from later positions.
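
As a rough illustration, the two recurrences can be unrolled as explicit loops. The sketch below is a minimal implementation built on nn.RNNCell; the names fwd_cell and bwd_cell and all sizes are illustrative assumptions, not part of the original formulation.

import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration.
input_size, hidden_size, seq_len = 16, 32, 4

fwd_cell = nn.RNNCell(input_size, hidden_size)   # reads left to right
bwd_cell = nn.RNNCell(input_size, hidden_size)   # reads right to left

x = torch.randn(1, seq_len, input_size)          # one sequence x_1, ..., x_4

# Forward recurrence.
h_fwd = torch.zeros(1, hidden_size)
forward_states = []
for t in range(seq_len):
    h_fwd = fwd_cell(x[:, t], h_fwd)
    forward_states.append(h_fwd)

# Backward recurrence.
h_bwd = torch.zeros(1, hidden_size)
backward_states = [None] * seq_len
for t in reversed(range(seq_len)):
    h_bwd = bwd_cell(x[:, t], h_bwd)
    backward_states[t] = h_bwd

# Combined representation: concatenate both directions at every position.
combined = [torch.cat([f, b], dim=-1) for f, b in zip(forward_states, backward_states)]
print(combined[0].shape)   # torch.Size([1, 64])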

Sequence Processing

Suppose the sequence is

$$x_1, x_2, x_3, x_4.$$

The forward recurrence computes:

$$\overrightarrow{h_1}, \overrightarrow{h_2}, \overrightarrow{h_3}, \overrightarrow{h_4}.$$

The backward recurrence computes:

$$\overleftarrow{h_4}, \overleftarrow{h_3}, \overleftarrow{h_2}, \overleftarrow{h_1}.$$

At each position, the final representation merges both directions.

For example:

forward:   x1 -> x2 -> x3 -> x4
             |     |     |     |
backward:  x1 <- x2 <- x3 <- x4

The representation at position $x_2$ includes information from both:

  • tokens before $x_2$,
  • tokens after $x_2$.

Hidden State Dimensions

Suppose:

  • input dimension: $d$
  • hidden size per direction: $h$

Then:

Tensor                     Shape
Forward hidden state       [h]
Backward hidden state      [h]
Combined hidden state      [2h]

The hidden dimension doubles because the representations are concatenated.

If the sequence tensor shape is

[B, T, D]

then the output shape becomes

[B, T, 2H]

where $H$ is the hidden size of one direction.

Bidirectional RNNs in PyTorch

PyTorch supports bidirectional recurrent layers directly.

Example:

import torch
import torch.nn as nn

rnn = nn.RNN(
    input_size=16,
    hidden_size=32,
    batch_first=True,
    bidirectional=True,
)

x = torch.randn(8, 20, 16)   # [batch, time, features]

output, h_n = rnn(x)

print(output.shape)   # per-step representations, both directions concatenated
print(h_n.shape)      # final hidden state of each direction

Output:

torch.Size([8, 20, 64])
torch.Size([2, 8, 32])

The output dimension is 64 because $2 \times 32 = 64$.

The hidden tensor h_n has shape:

[num_layers * num_directions, batch_size, hidden_size]

With a single layer, this reduces to [num_directions, batch_size, hidden_size].

Since the network is bidirectional:

num_directions = 2
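
When the final state of each direction is needed separately, h_n can be reshaped by direction. A minimal sketch, assuming the single-layer rnn, x, and h_n from the example above:

# Reshape h_n to expose layers and directions explicitly.
num_layers, num_directions = 1, 2
h = h_n.view(num_layers, num_directions, x.size(0), 32)

h_forward = h[-1, 0]    # final forward state of the last layer: [8, 32]
h_backward = h[-1, 1]   # final backward state of the last layer: [8, 32]
print(h_forward.shape, h_backward.shape)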

Bidirectional LSTMs

Bidirectional processing is especially common with LSTMs.

Example:

lstm = nn.LSTM(
    input_size=128,
    hidden_size=256,
    batch_first=True,
    bidirectional=True,
)

x = torch.randn(32, 100, 128)   # [batch, time, features]

output, (h_n, c_n) = lstm(x)    # h_n, c_n: final hidden and cell states per direction

print(output.shape)

Result:

torch.Size([32, 100, 512])

Again, the hidden dimension doubles.
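
A common downstream pattern, sketched here under the same assumptions as the example above, is to build a fixed-size sequence representation by concatenating the final state of each direction (the backward direction's final state corresponds to having read back to the first time step):

# h_n: [num_layers * num_directions, batch, hidden] = [2, 32, 256]
h = h_n.view(1, 2, 32, 256)                    # [layers, directions, batch, hidden]
sequence_repr = torch.cat([h[-1, 0], h[-1, 1]], dim=-1)
print(sequence_repr.shape)                     # torch.Size([32, 512])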

Bidirectional LSTMs became standard in many NLP systems before transformers became dominant.

Combining Directions

The most common combination method is concatenation:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}].$$

However, other strategies exist.

Summation

$$h_t = \overrightarrow{h_t} + \overleftarrow{h_t}.$$

This keeps the hidden dimension unchanged.

Averaging

$$h_t = \frac{\overrightarrow{h_t} + \overleftarrow{h_t}}{2}.$$

Learned Projection

The concatenated representation is projected back to a smaller dimension:

$$h_t = W\,[\overrightarrow{h_t}; \overleftarrow{h_t}].$$

Concatenation is most common because it preserves information from both directions explicitly.
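
Each strategy takes only a few lines when starting from a bidirectional output of shape [B, T, 2H]. The sketch below assumes the output tensor and H = 256 from the LSTM example above; PyTorch places the forward half before the backward half along the last dimension.

H = 256
forward_out, backward_out = output[..., :H], output[..., H:]   # undo the concatenation

summed   = forward_out + backward_out          # summation: [B, T, H]
averaged = (forward_out + backward_out) / 2    # averaging: [B, T, H]

projection = nn.Linear(2 * H, H)               # learned projection back to H
projected  = projection(output)                # [B, T, H]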

Bidirectional Context

A bidirectional hidden state summarizes both preceding and following information.

This often improves:

Property                          Effect
Semantic understanding            Better contextual representations
Sequence labeling                 More accurate token predictions
Ambiguity resolution              Stronger contextual disambiguation
Long-range dependency modeling    Access to future evidence

Before transformers, bidirectional recurrent networks were widely used in:

  • speech recognition,
  • named entity recognition,
  • machine translation encoders,
  • handwriting recognition,
  • biomedical sequence analysis.

Causal Versus Noncausal Models

Bidirectional networks are noncausal.

At time step $t$, the backward recurrence depends on future inputs:

$$x_{t+1}, x_{t+2}, \ldots$$

This means bidirectional networks cannot be used directly for autoregressive generation.

For example, when generating text token by token, future tokens are unknown.

Thus:

Task                                   Bidirectional allowed?
Text classification                    Yes
Machine translation encoder            Yes
Speech transcription (offline)         Yes
Autoregressive language generation     No
Real-time streaming prediction         Usually no

This distinction later influenced transformer design:

  • encoder models such as BERT are bidirectional,
  • decoder models such as GPT are causal.

Bidirectional Encoders

Sequence-to-sequence systems often use bidirectional encoders.

The encoder reads the full input sequence:

$$x_1, \ldots, x_T$$

and constructs contextual representations:

$$h_1, \ldots, h_T.$$

The decoder then generates outputs autoregressively.

Bidirectional encoding improves representation quality because the encoder has access to the full source sequence.
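
As a sketch of such an encoder (the GRU choice, vocabulary size, and dimensions are illustrative assumptions, and the decoder is omitted):

import torch
import torch.nn as nn

class BiEncoder(nn.Module):
    """Reads the full source sequence and returns one contextual vector per position."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, tokens):                 # tokens: [B, T] of token ids
        h, _ = self.rnn(self.embed(tokens))    # h: [B, T, 2 * hidden_size]
        return h

encoder = BiEncoder()
src = torch.randint(0, 1000, (4, 12))
print(encoder(src).shape)                      # torch.Size([4, 12, 256])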

This idea appeared in early neural machine translation systems before attention-based transformers became dominant.

Computational Cost

A bidirectional network roughly doubles recurrent computation.

If a unidirectional RNN requires:

$$O(TH^2)$$

operations, then a bidirectional version requires approximately:

$$O(2TH^2).$$

Memory usage also increases because two hidden-state streams must be stored.

However, the increase is often worthwhile because contextual representations become substantially stronger.
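
One rough way to see the factor of two is to compare parameter counts, since each direction carries its own weights. A small sketch with arbitrary sizes:

import torch.nn as nn

uni = nn.RNN(input_size=16, hidden_size=32, bidirectional=False)
bi = nn.RNN(input_size=16, hidden_size=32, bidirectional=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(uni), count(bi))   # the bidirectional layer has twice the parameters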

Deep Bidirectional Networks

Multiple bidirectional recurrent layers can be stacked.

Example:

lstm = nn.LSTM(
    input_size=128,
    hidden_size=256,
    num_layers=3,
    batch_first=True,
    bidirectional=True,
)

The output of one bidirectional layer becomes the input of the next.
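
A quick shape check under the assumptions above: the output still has shape [B, T, 2H] and comes from the last layer only, while h_n now stacks every layer and direction.

x = torch.randn(32, 100, 128)
output, (h_n, c_n) = lstm(x)

print(output.shape)   # torch.Size([32, 100, 512]) -- last layer, both directions
print(h_n.shape)      # torch.Size([6, 32, 256])   -- 3 layers x 2 directions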

Deep bidirectional LSTMs were historically very successful in:

  • speech recognition,
  • OCR,
  • machine translation,
  • biomedical NLP.

However, deep recurrent stacks are computationally expensive and difficult to parallelize compared with transformers.

Relationship to Attention Models

Bidirectional recurrent networks partially solve the context problem by using both directions.

However, they still compress information into hidden states propagated sequentially through time.

Attention models later replaced this mechanism with direct pairwise interaction between positions.

Instead of storing all history inside one hidden vector, attention computes interactions between all positions directly.

This greatly improves long-range dependency modeling.

Still, bidirectional recurrence introduced an important idea:

contextual representations should depend on both left and right context.

Modern transformer encoders preserve this principle.

Practical Example: Sequence Tagging

Suppose we perform named entity recognition.

Input:

Barack Obama was born in Hawaii.

The word “Obama” benefits from both:

  • previous context: “Barack”
  • future context: “was born”

A bidirectional model produces richer contextual embeddings for each token.

Typical pipeline:

embeddings = embedding(tokens)

output, _ = bi_lstm(embeddings)

logits = classifier(output)

The classifier predicts one label per token.
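
Filled in as a runnable sketch, with hypothetical vocabulary size, dimensions, and label count that are not part of the original pipeline:

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bidirectional LSTM followed by a per-token classifier."""
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_size=128, num_labels=9):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bi_lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, tokens):                    # tokens: [B, T]
        embeddings = self.embedding(tokens)       # [B, T, embed_dim]
        output, _ = self.bi_lstm(embeddings)      # [B, T, 2 * hidden_size]
        logits = self.classifier(output)          # [B, T, num_labels]
        return logits

tagger = BiLSTMTagger()
tokens = torch.randint(0, 5000, (2, 7))           # e.g. "Barack Obama was born in Hawaii ."
print(tagger(tokens).shape)                        # torch.Size([2, 7, 9])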

Summary

Bidirectional recurrent networks process sequences in both forward and backward directions. At each position, the representation combines information from past and future context.

This improves contextual understanding for many sequence tasks, especially sequence labeling and encoding problems where the full input sequence is available.

Bidirectional models are noncausal, so they cannot be used directly for autoregressive generation. However, they became foundational in NLP and speech systems before transformers emerged.

The next section studies practical sequence modeling applications built on recurrent computation.