Many learning problems involve data whose meaning depends on order. A sentence is not just a bag of words. A speech signal is not just a collection of sound amplitudes. A stock price record is not just a set of numbers. In each case, the position of each observation matters.
Sequential data is data indexed by an ordered variable, usually time or position. We write a sequence as

$$x_1, x_2, \dots, x_T,$$

where $T$ is the sequence length. The element $x_t$ is the observation at step $t$. In language, $x_t$ may be a token. In audio, it may be a waveform sample or a spectrogram frame. In time-series forecasting, it may be a vector of measured variables at time $t$.
A sequence can contain scalar observations,

$$x_t \in \mathbb{R},$$

or vector observations,

$$x_t \in \mathbb{R}^d.$$
For example, a weather sequence may have one vector per hour:

$$x_t = (\text{temperature}_t,\ \text{humidity}_t,\ \text{pressure}_t).$$
A batch of such sequences is usually stored as a tensor. In PyTorch, a common shape convention is
`[batch_size, sequence_length, feature_dim]`

For example:

```python
import torch

B = 32   # batch size
T = 100  # sequence length
D = 16   # features per time step

x = torch.randn(B, T, D)
print(x.shape)  # torch.Size([32, 100, 16])
```

Here `x[:, 0, :]` contains the first time step for every sequence in the batch, while `x[:, t, :]` contains the observations at step `t`.
Why Order Matters
For independent data, we often assume that each example can be processed without reference to other examples. Image classification is often treated this way: one image enters the model, one label comes out.
Sequential data violates this simple view. The meaning of an element depends on context.
Consider the sentence:

“She deposited the check at the bank.”

and compare it with:

“They rested on the bank of the river.”

The word “bank” has different meanings because of the surrounding tokens. A model that ignores order loses this information.
The same problem appears in time series. A temperature reading of 15°C has a different meaning depending on whether the previous readings were rising, falling, stable, seasonal, or anomalous.
A sequence model must therefore represent dependency across positions.
Types of Sequence Problems
Sequence modeling tasks can be grouped by the shape of their inputs and outputs.
| Task type | Input | Output | Example |
|---|---|---|---|
| Sequence-to-one | sequence | single value | sentiment classification |
| Sequence-to-sequence | sequence | sequence | machine translation |
| One-to-sequence | single value | sequence | image captioning |
| Next-step prediction | prefix sequence | next element | language modeling |
| Sequence labeling | sequence | label per step | named entity recognition |
In sequence-to-one learning, the model reads a full sequence and returns one prediction. For example, a sentiment classifier reads a sentence and predicts whether the review is positive or negative.
In sequence-to-sequence learning, the model maps one sequence to another sequence. Machine translation is the standard example.
In next-step prediction, the model predicts the next element from previous elements:

$$p(x_{t+1} \mid x_1, \dots, x_t).$$

This objective is central to language modeling. Given a prefix of tokens, the model predicts the next token.
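In practice, next-step prediction is often set up by shifting the sequence by one position: the model reads a prefix and the target is the element that follows it. A minimal sketch, assuming a hypothetical batch of token IDs:

```python
import torch

# Hypothetical batch of token IDs, shape [batch, time]
tokens = torch.tensor([
    [12, 45, 9, 3, 7],
    [81, 7, 5, 2, 6],
])

# Inputs are the prefix; targets are the same sequence
# shifted one position to the left.
inputs = tokens[:, :-1]   # steps 0 .. T-2
targets = tokens[:, 1:]   # steps 1 .. T-1

print(inputs.shape)   # torch.Size([2, 4])
print(targets.shape)  # torch.Size([2, 4])
```

At every position, the model sees `inputs[:, :t+1]` and is trained to predict `targets[:, t]`.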
Temporal Dependence
Sequential data often contains temporal dependence. This means that nearby observations tend to be statistically related.
A simple way to express this is

$$p(x_t \mid x_1, \dots, x_{t-1}) \neq p(x_t),$$

meaning that earlier observations carry information about the current one.
In many systems, recent observations are more informative than distant observations. In other systems, long-range structure matters. A paragraph may contain a pronoun whose meaning depends on a noun many words earlier. A financial time series may contain weekly, monthly, or yearly cycles. A genome sequence may contain dependencies over very long distances.
This creates a central problem: a sequence model must decide what information to remember, what to ignore, and how to combine local and long-range evidence.
Variable-Length Sequences
Unlike fixed-size tabular examples, sequences often have different lengths.
One sentence may contain 5 tokens. Another may contain 200 tokens. One audio clip may last 1 second. Another may last 30 seconds.
Neural networks usually process tensors with regular shapes, so variable-length data must be handled carefully. The common methods are padding, masking, truncation, and packing.
Padding extends shorter sequences to a common length:

```python
# Three sequences of different lengths:
# [4 tokens], [2 tokens], [5 tokens]
padded = torch.tensor([
    [12, 45, 9, 3, 0],
    [81, 7, 0, 0, 0],
    [22, 6, 5, 91, 13],
])
```

Here 0 may be a padding token. Padding values should not affect the model’s prediction, so we often use a mask:

```python
mask = padded != 0
print(mask)
```

The mask identifies real tokens and padded positions.
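One common use of such a mask is pooling over real positions only, so that padding does not dilute the result. A sketch, assuming a hypothetical batch of embedded features `x`:

```python
import torch

padded = torch.tensor([
    [12, 45, 9, 3, 0],
    [81, 7, 0, 0, 0],
    [22, 6, 5, 91, 13],
])
mask = padded != 0                 # [3, 5], True at real tokens

# Hypothetical embedded features, shape [batch, time, dim]
x = torch.randn(3, 5, 8)

# Zero out padded positions, then average over real ones only.
m = mask.unsqueeze(-1).float()     # [3, 5, 1]
pooled = (x * m).sum(dim=1) / m.sum(dim=1)

print(pooled.shape)  # torch.Size([3, 8])
```

Dividing by `m.sum(dim=1)` rather than the full length `5` ensures each sequence is averaged over its true number of tokens.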
For long sequences, truncation cuts the sequence to a maximum length. This reduces memory and computation, but it may remove important context.
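Truncation is a simple slice; packing, mentioned above, additionally tells recurrent layers to skip padded steps. A sketch with assumed shapes and lengths:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

x = torch.randn(3, 5, 4)            # padded batch, [batch, time, features]
lengths = torch.tensor([4, 2, 5])   # true length of each sequence

# Truncation: cut every sequence to a maximum length.
max_len = 3
truncated = x[:, :max_len, :]
print(truncated.shape)              # torch.Size([3, 3, 4])

# Packing: recurrent layers then process only real time steps.
packed = pack_padded_sequence(x, lengths, batch_first=True,
                              enforce_sorted=False)
```

The packed object stores only the 4 + 2 + 5 = 11 real time steps, not the full 3 × 5 grid.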
Sequential Data in PyTorch
PyTorch sequence models usually operate on three-dimensional tensors.
For many modern models, the preferred layout is:
`[batch_size, sequence_length, feature_dim]`

For some recurrent modules, PyTorch historically used:

`[sequence_length, batch_size, feature_dim]`

However, most recurrent layers support `batch_first=True`, which makes the input shape easier to read:
```python
rnn = torch.nn.RNN(
    input_size=16,
    hidden_size=32,
    batch_first=True,
)

x = torch.randn(32, 100, 16)
output, h_n = rnn(x)

print(output.shape)  # torch.Size([32, 100, 32])
print(h_n.shape)     # torch.Size([1, 32, 32])
```

The `output` tensor contains a hidden representation for every time step. The final hidden state `h_n` summarizes the processed sequence.
Features, Tokens, and Embeddings
Raw sequence elements are often not directly suitable for neural networks.
In language modeling, tokens are integer IDs:
```python
tokens = torch.tensor([
    [12, 45, 9, 3],
    [81, 7, 5, 0],
])
```

These IDs must be converted into dense vectors using an embedding layer:
```python
embedding = torch.nn.Embedding(
    num_embeddings=10000,
    embedding_dim=256,
)

x = embedding(tokens)
print(x.shape)  # torch.Size([2, 4, 256])
```

Now each token has a vector representation. The sequence has shape:

`[batch_size, sequence_length, embedding_dim] = [2, 4, 256]`
For audio or sensor data, the input may already be numerical. Still, it is common to transform the raw signal into a more useful representation, such as spectrogram frames, normalized feature vectors, or learned embeddings.
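For numerical inputs, one simple and widely used transformation is per-feature standardization. A sketch, assuming a hypothetical raw sensor batch:

```python
import torch

# Hypothetical raw sensor batch: [batch, time, features]
x = torch.randn(4, 50, 3) * 10 + 5

# Standardize each feature using statistics over batch and time.
mean = x.mean(dim=(0, 1), keepdim=True)
std = x.std(dim=(0, 1), keepdim=True)
x_norm = (x - mean) / (std + 1e-8)

print(x_norm.shape)  # torch.Size([4, 50, 3])
```

After this step, each feature channel has approximately zero mean and unit variance, which typically stabilizes training.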
Causality and Direction
Some sequence tasks allow the model to use both past and future context. Other tasks only allow past context.
In language generation, the model predicts the next token from previous tokens. It must not look at future tokens during training or inference:

$$p(x_{t+1} \mid x_1, \dots, x_t).$$

This is called causal modeling.
In sequence labeling, the model may use the full sequence. For example, when tagging named entities in a complete sentence, both left and right context can be useful. The label for a word may depend on words that appear after it.
This distinction affects the architecture. Causal models process information from left to right. Bidirectional models use context from both directions. Transformer decoders are usually causal. Encoder models, such as BERT-style models, use bidirectional context.
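In attention-based models, causality is typically enforced with a mask that blocks future positions. A minimal sketch of such a mask:

```python
import torch

T = 5
# True above the diagonal marks "future" positions to be blocked:
# row t may attend only to positions <= t.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
print(causal_mask)
```

Recurrent models need no such mask: processing the sequence left to right makes them causal by construction.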
Local and Long-Range Structure
Sequential data often contains both local and long-range patterns.
Local patterns occur over nearby positions. In text, phrases such as “New York” or “machine learning” are local patterns. In audio, phonemes and short acoustic events are local patterns.
Long-range patterns span many positions. In text, a subject may determine the verb much later in the sentence. In code, an opening bracket may match a closing bracket many lines later. In time series, yearly seasonality may connect points far apart.
A good sequence model needs mechanisms for both.
Recurrent neural networks process sequences one step at a time and store information in a hidden state. This gives them a natural way to model temporal dependence. However, long-range dependencies can be difficult for basic RNNs. Later sections introduce LSTMs, GRUs, and attention mechanisms as ways to address this limitation.
Sequence Modeling as State Updating
A useful abstraction is to view sequence modeling as repeated state updating.
At each step $t$, the model receives input $x_t$ and updates an internal state $h_t$:

$$h_t = f(h_{t-1}, x_t).$$

The state $h_t$ is intended to summarize the information seen so far. The model can then produce an output:

$$y_t = g(h_t).$$
This is the basic idea behind recurrent neural networks. The same function is reused at every time step, so the model can process sequences of different lengths.
In PyTorch-like pseudocode:
```python
h = initial_state
for t in range(T):
    h = update(h, x[:, t, :])
y = output(h)
```

This loop captures the central structure of recurrent computation.
Summary
Sequential data consists of ordered observations. Each element has meaning because of its position and its relationship to other elements. Text, speech, audio, video, sensor streams, code, and financial records are all common examples.
A sequence model must handle order, variable length, temporal dependence, and context. Some tasks require one output for the whole sequence. Others require one output per time step or a new generated sequence.
Recurrent neural networks model sequential data by maintaining a hidden state that is updated step by step. This state-based view gives a direct way to process ordered data and prepares the ground for backpropagation through time, gated recurrent networks, and attention-based sequence models.