
Motivation for Attention

Sequence models often need to decide which parts of an input are relevant to a particular output. Attention is the mechanism that makes this decision explicit. Instead of compressing an entire input into one fixed-size vector, an attention layer allows the model to look back at many input positions and form a weighted combination of them.

The central problem is selective access. A model may receive a sequence of tokens, image patches, audio frames, or graph nodes. For each position, it must decide which other positions matter. Attention provides a differentiable way to compute that decision.

The Fixed-Vector Bottleneck

Before attention became standard, many sequence-to-sequence models used an encoder-decoder architecture based on recurrent neural networks.

The encoder reads an input sequence

$$x_1, x_2, \ldots, x_T$$

and compresses it into a final hidden state

$$h_T.$$

The decoder then produces an output sequence from this single vector.

This creates a bottleneck. The vector $h_T$ must contain all information needed by the decoder. For short sequences, this may work. For long sequences, it becomes difficult. Early input tokens may be forgotten. Fine-grained alignment between input and output may be lost.

For example, in machine translation, a word produced near the end of the output sentence may depend on a word near the beginning of the input sentence. A single final hidden state gives the decoder no direct path back to that word.

Attention removes this bottleneck. Instead of using only the final encoder state, the decoder can access all encoder states:

$$h_1, h_2, \ldots, h_T.$$

At each decoding step, the model computes which of these states are relevant.

Attention as Weighted Retrieval

Attention can be understood as a soft retrieval operation.

Suppose we have a collection of vectors:

$$v_1, v_2, \ldots, v_T.$$

These vectors may represent words, image patches, memory slots, or hidden states. Given a query vector $q$, attention assigns a score to each vector, converts the scores into probabilities, and returns a weighted average.

The output is

$$z = \sum_{i=1}^{T} \alpha_i v_i,$$

where

$$\alpha_i \geq 0, \qquad \sum_{i=1}^{T} \alpha_i = 1.$$

The coefficients $\alpha_i$ are attention weights. A large $\alpha_i$ means that the model places more attention on the $i$-th vector.

This is why attention is often described as differentiable memory access. The model does not choose one item by a hard lookup. It forms a smooth weighted mixture, so gradients can flow through the attention weights during training.
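
The weighted-mixture view above can be sketched in a few lines of NumPy. The vectors and scores below are made-up toy values, chosen only to show the mechanics:

```python
import numpy as np

# A toy "memory" of T = 3 value vectors (made-up numbers).
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Unnormalized relevance scores for each vector (also made up).
s = np.array([2.0, 0.0, -1.0])

# Softmax turns the scores into non-negative weights that sum to 1.
alpha = np.exp(s) / np.exp(s).sum()

# The attention output is the weighted average of the values.
z = alpha @ V

assert np.isclose(alpha.sum(), 1.0)
```

Because the weights come from a softmax rather than a hard argmax, `z` varies smoothly with the scores, which is exactly what lets gradients flow through this step.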

Queries, Keys, and Values

Modern attention is usually described using three kinds of vectors:

| Object | Role |
| --- | --- |
| Query | Represents what the model is looking for |
| Key | Represents what each memory item contains |
| Value | Represents the information returned if an item is selected |

For each item $i$, we have a key $k_i$ and a value $v_i$. Given a query $q$, the model compares $q$ with each key $k_i$. The comparison produces a score:

$$s_i = \operatorname{score}(q, k_i).$$

The scores are normalized with softmax:

$$\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{T} \exp(s_j)}.$$

The attention output is then

$$z = \sum_{i=1}^{T} \alpha_i v_i.$$

Keys decide where to attend. Values decide what information is retrieved.

This separation is important. A model may use one representation to decide relevance and another representation to carry content. In transformers, queries, keys, and values are learned linear projections of hidden states.
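
The score–softmax–combine pipeline for a single query can be sketched as follows. The keys, values, and query are made-up numbers, and the scoring function here is a plain dot product:

```python
import numpy as np

def attend(q, K, V):
    """Dot-product attention for one query over T key/value pairs."""
    s = K @ q                     # scores: compare q with every key
    alpha = np.exp(s - s.max())   # softmax, shifted for numerical stability
    alpha = alpha / alpha.sum()
    return alpha @ V              # weighted combination of the values

# T = 3 items: keys decide where to attend, values decide what is retrieved.
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
q = np.array([1.0, 0.0])  # a query most similar to the first key

z = attend(q, K, V)
```

The output `z` leans toward the first and third values, because their keys score highest against `q`.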

Why Attention Helps

Attention helps neural networks for several reasons.

First, it gives the model direct access to all positions in a sequence. A token at position $t$ can use information from position $1$ without passing through $t-1$ recurrent transitions.

Second, it supports variable-length dependencies. Some outputs depend on nearby inputs. Others depend on distant inputs. Attention lets the model adapt the dependency pattern for each example and each layer.

Third, attention improves alignment. In translation, summarization, speech recognition, and image captioning, each output element often corresponds to specific input elements. Attention weights provide a soft alignment between them.

Fourth, attention is parallelizable. Unlike recurrent computation, self-attention can compute interactions among all positions in a sequence at the same time. This property made transformers efficient on modern accelerators.

Attention in Different Domains

Attention first became widely known in neural machine translation, but the same idea applies across many data types.

In natural language processing, attention lets each token look at other tokens. For example, in the sentence

“The animal did not cross the street because it was tired.”

the word “it” must be linked to the correct earlier noun. Attention gives the model a mechanism for this kind of contextual dependency.

In computer vision, attention lets an image patch interact with other patches. A model can connect distant regions of an image without relying only on local convolution filters.

In speech, attention can align audio frames with output tokens. This is useful because speech input and text output often have different lengths.

In retrieval-augmented models, attention can combine a user query with retrieved documents. The model uses attention to decide which retrieved passages matter for the next token.

Attention Versus Convolution and Recurrence

Attention differs from convolution and recurrence in its connectivity pattern.

A convolutional layer usually connects each position to a local neighborhood. This gives strong locality and efficient computation, but long-range dependencies require many layers or large kernels.

A recurrent layer processes positions sequentially. It can, in principle, carry information across long distances, but the path between distant tokens is long. Gradients must pass through many recurrent steps.

An attention layer connects every position to every other position directly. The path length between any two tokens is one attention operation.

| Mechanism | Main pattern | Strength | Limitation |
| --- | --- | --- | --- |
| Convolution | Local neighborhood | Efficient local feature extraction | Long-range context needs depth |
| Recurrence | Sequential state update | Natural sequence processing | Limited parallelism |
| Attention | Pairwise interaction | Direct long-range dependency | Quadratic cost in sequence length |

Attention therefore trades computation for connectivity. Standard self-attention over a sequence of length $T$ compares all pairs of positions, giving cost roughly proportional to $T^2$. This cost motivates sparse attention, linear attention, sliding-window attention, and other efficient variants.
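
The quadratic growth is visible directly in the size of the score matrix. A rough count, in plain Python:

```python
# A standard self-attention layer scores every pair of positions,
# so a sequence of length T produces a T-by-T score matrix.
def num_scores(T: int) -> int:
    return T * T

# Doubling the sequence length quadruples the number of scores.
assert num_scores(1024) == 4 * num_scores(512)

for T in (512, 2048, 8192):
    print(T, num_scores(T))
```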

Attention as Adaptive Computation

A fixed linear layer applies the same pattern of interaction to every input. A convolution applies the same local filter across positions. Attention is more adaptive. The weights depend on the input itself.

For one sentence, a token may attend mostly to the previous word. For another sentence, the same token position may attend to a word far away. The computation changes with the content.

This input-dependent routing is one reason attention works well in language models. Meaning depends heavily on context. The same word can require different surrounding words to be interpreted correctly.

For example, the word “bank” may refer to a financial institution or the side of a river. Attention allows the representation of “bank” to use nearby context to resolve the meaning.

The Basic Attention Pattern

Most attention mechanisms follow the same abstract pattern:

  1. Compute queries, keys, and values.
  2. Compare each query with each key.
  3. Normalize the scores into attention weights.
  4. Use the weights to combine values.

In matrix form, let

$$Q \in \mathbb{R}^{T_q \times d_k}, \qquad K \in \mathbb{R}^{T_k \times d_k}, \qquad V \in \mathbb{R}^{T_k \times d_v}.$$

The attention output has shape

$$Z \in \mathbb{R}^{T_q \times d_v}.$$

Each query produces one output vector. Each output vector is a weighted combination of the value vectors.

The exact scoring function can vary. Additive attention uses a small neural network to compute scores. Dot-product attention uses inner products. Scaled dot-product attention divides the inner products by $\sqrt{d_k}$ to keep the scores in a range where the softmax remains well conditioned.

These variants differ in detail, but they share the same purpose: compute content-dependent weights over a set of vectors.
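
These variants can be sketched side by side. The inputs below are random, and the additive parameterization is a generic illustration rather than a specific published form:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 8
q = rng.standard_normal(d_k)
k = rng.standard_normal(d_k)

# Dot-product attention: the score is an inner product.
s_dot = q @ k

# Scaled dot-product attention: divide by sqrt(d_k) for stability.
s_scaled = (q @ k) / np.sqrt(d_k)

# Additive attention: a small one-hidden-layer network scores the pair.
W_q = rng.standard_normal((d_k, d_k))
W_k = rng.standard_normal((d_k, d_k))
w = rng.standard_normal(d_k)
s_add = w @ np.tanh(W_q @ q + W_k @ k)
```

All three produce a single scalar per (query, key) pair; only the comparison function differs.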

Self-Attention and Cross-Attention

There are two important cases.

In self-attention, the queries, keys, and values come from the same sequence. Each token attends to other tokens in the same sequence.

For an input sequence

$$X = [x_1, x_2, \ldots, x_T],$$

self-attention computes contextualized representations

$$z_1, z_2, \ldots, z_T.$$

Each $z_i$ is influenced by other positions in $X$.

In cross-attention, the queries come from one sequence, while the keys and values come from another sequence. This is common in encoder-decoder transformers. The decoder queries attend to encoder outputs.

Self-attention builds context within a sequence. Cross-attention transfers information between sequences.
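
The difference shows up directly in the shape of the weight matrix. A minimal NumPy sketch with random data (batch dimension omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 4

# Self-attention: queries, keys, and values from the same length-5 sequence.
X = rng.standard_normal((5, d))
_, w_self = attention(X, X, X)        # weights have shape (5, 5)

# Cross-attention: queries from a length-3 sequence,
# keys and values from a separate length-7 sequence.
Xq = rng.standard_normal((3, d))
Xkv = rng.standard_normal((7, d))
_, w_cross = attention(Xq, Xkv, Xkv)  # weights have shape (3, 7)
```

In both cases each row of the weight matrix sums to 1: every query distributes its attention over whatever sequence supplies the keys.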

Attention and Interpretability

Attention weights can sometimes be inspected to understand which inputs influenced an output. For example, in translation, a decoder token may place high weight on the source word it translates.

However, attention weights should be interpreted carefully. A high attention weight means a value vector contributed strongly to the attention output at that layer. It does not always mean that the original input token was causally decisive for the final prediction. Later layers, residual paths, normalization, and feedforward networks also affect the result.

Thus attention provides a useful diagnostic signal, but it is not a complete explanation of model behavior.

PyTorch View

In PyTorch, attention is usually implemented by tensor operations over batches, sequence positions, and feature dimensions.

A typical input to self-attention has shape

`[B, T, D]`

where:

| Symbol | Meaning |
| --- | --- |
| `B` | Batch size |
| `T` | Sequence length |
| `D` | Embedding dimension |

The model projects this input into queries, keys, and values:

```python
Q = X @ W_q
K = X @ W_k
V = X @ W_v
```

The attention scores are computed by matrix multiplication:

```python
scores = Q @ K.transpose(-2, -1)
```

If `Q` and `K` have shape `[B, T, d_k]`, then `scores` has shape:

`[B, T, T]`

Each entry `scores[b, i, j]` measures how much token `i` attends to token `j` in batch item `b`.

After applying softmax, the attention weights are multiplied by the values:

```python
weights = torch.softmax(scores, dim=-1)
Z = weights @ V
```

The result `Z` has shape `[B, T, d_v]`.

This is the core computation behind transformer attention.
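
Putting these steps together, a minimal single-head self-attention module might look like the sketch below. This is an illustrative implementation, not the optimized attention shipped with PyTorch:

```python
import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention over inputs of shape [B, T, D]."""

    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)
        self.w_k = nn.Linear(d_model, d_k, bias=False)
        self.w_v = nn.Linear(d_model, d_k, bias=False)
        self.scale = math.sqrt(d_k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.w_q(x), self.w_k(x), self.w_v(x)  # each [B, T, d_k]
        scores = Q @ K.transpose(-2, -1) / self.scale    # [B, T, T]
        weights = torch.softmax(scores, dim=-1)          # rows sum to 1
        return weights @ V                               # [B, T, d_k]

x = torch.randn(2, 5, 16)  # batch of 2 sequences, length 5, dimension 16
attn = SingleHeadSelfAttention(d_model=16, d_k=8)
z = attn(x)
```

Here the value projection reuses `d_k` as its width, a common simplification; in general the value dimension `d_v` may differ.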

Summary

Attention was introduced to solve a basic modeling problem: neural networks need flexible access to relevant information. Fixed-size summaries are often too restrictive, especially for long sequences and structured inputs.

Attention replaces fixed compression with adaptive retrieval. A query compares itself with keys, produces weights, and uses those weights to combine values. This gives the model direct access to relevant positions, supports long-range dependencies, improves alignment, and enables highly parallel computation.

The next sections define the main forms of attention: additive attention, dot-product attention, self-attention, cross-attention, and multi-head attention.