Self-attention is attention applied within a single sequence. The queries, keys, and values all come from the same input. Each position in the sequence computes a new representation by looking at other positions in that same sequence.
This is the main operation inside transformer encoders and decoders. It lets every token exchange information with every other token in one layer.
From Attention to Self-Attention
In general attention, the queries, keys, and values may come from separate sources. In encoder-decoder cross-attention, for example, the queries are computed from the decoder states while the keys and values are computed from the encoder outputs.
In self-attention, all three are derived from the same input sequence.
Each input vector $x_i$ is projected into three vectors:

$$q_i = x_i W^Q, \qquad k_i = x_i W^K, \qquad v_i = x_i W^V$$

The token at position $i$ then uses its query $q_i$ to compare itself with all keys $k_j$. The result is a weighted combination of all values $v_j$.
Thus every output vector can depend on every input vector.
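As a minimal per-position sketch in PyTorch (unbatched and purely for illustration; the function name and sizes are arbitrary, and a batched module appears later):

```python
import math
import torch

def self_attention_per_token(x, Wq, Wk, Wv):
    """x: [T, D]. Returns one output vector per position, shape [T, d_v]."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv            # per-token queries, keys, values
    d_k = Q.shape[-1]
    outputs = []
    for i in range(x.shape[0]):
        scores = K @ Q[i] / math.sqrt(d_k)       # compare q_i against every key
        weights = torch.softmax(scores, dim=0)   # attention weights over positions
        outputs.append(weights @ V)              # weighted combination of values
    return torch.stack(outputs)

T, D = 4, 8
out = self_attention_per_token(torch.randn(T, D),
                               torch.randn(D, 8), torch.randn(D, 8), torch.randn(D, 8))
print(out.shape)   # torch.Size([4, 8])
```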
Matrix Form
Let the input sequence be stored as a matrix $X \in \mathbb{R}^{T \times D}$, with one row per token.

The learned projection matrices are $W^Q \in \mathbb{R}^{D \times d_k}$, $W^K \in \mathbb{R}^{D \times d_k}$, and $W^V \in \mathbb{R}^{D \times d_v}$.

Then:

$$Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V$$
The self-attention output is:
$$Z = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
This is the same scaled dot-product attention formula. The defining feature of self-attention is that $Q$, $K$, and $V$ all come from the same $X$.
Token Mixing
Self-attention mixes information across tokens.
Suppose $X$ contains $T$ token representations:

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_T \end{bmatrix}$$
The output representation at position 2 is:

$$z_2 = \sum_{j=1}^{T} \alpha_{2j}\, v_j, \qquad \alpha_{2j} = \operatorname{softmax}_j\!\left(\frac{q_2 k_j^\top}{\sqrt{d_k}}\right)$$

The weights $\alpha_{2j}$ depend on the query at position 2 and the keys at all positions.
This means the representation of token 2 becomes contextual. It can change depending on the surrounding tokens.
For example, the representation of the word “apple” should differ in:
| Sentence | Likely meaning |
|---|---|
| I ate an apple. | Fruit |
| Apple released a new chip. | Company |
Self-attention allows nearby and distant context to influence the token representation.
Pairwise Interaction Matrix
The matrix

$$A = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)$$

is the attention matrix.

For a sequence of length $T$, its shape is $T \times T$.
Each row corresponds to one query position. Each column corresponds to one key position.
The entry $A_{ij}$ is the attention weight from position $i$ to position $j$. It measures how much position $i$ uses information from position $j$.
Each row sums to 1:

$$\sum_{j=1}^{T} A_{ij} = 1$$

The output is:

$$Z = A V$$
So self-attention can be viewed as multiplying the value matrix by an input-dependent mixing matrix.
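A small numeric check of these properties, using random tensors of arbitrary size:

```python
import torch

T, d_k, d_v = 4, 8, 8
Q, K, V = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_v)

A = torch.softmax(Q @ K.T / d_k**0.5, dim=-1)   # [T, T] attention matrix
print(A.sum(dim=-1))                            # each row sums to 1
Z = A @ V                                       # [T, d_v]: values mixed by A
print(Z.shape)
```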
Batched Self-Attention in PyTorch
For a batch of sequences, the input has shape [B, T, D], where B is the batch size, T is the sequence length, and D is the embedding dimension.
A minimal self-attention module is:
```python
import math
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, embed_dim, key_dim=None, value_dim=None):
        super().__init__()
        key_dim = key_dim or embed_dim
        value_dim = value_dim or embed_dim
        self.key_dim = key_dim
        self.q_proj = nn.Linear(embed_dim, key_dim)
        self.k_proj = nn.Linear(embed_dim, key_dim)
        self.v_proj = nn.Linear(embed_dim, value_dim)

    def forward(self, x, mask=None):
        """
        x: [B, T, D]
        mask: broadcastable to [B, T, T]
        """
        Q = self.q_proj(x)  # [B, T, d_k]
        K = self.k_proj(x)  # [B, T, d_k]
        V = self.v_proj(x)  # [B, T, d_v]

        scores = Q @ K.transpose(-2, -1)
        scores = scores / math.sqrt(self.key_dim)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        weights = torch.softmax(scores, dim=-1)
        out = weights @ V
        return out, weights
```

The shape flow is:
| Tensor | Shape |
|---|---|
| Input | [B, T, D] |
| Queries | [B, T, d_k] |
| Keys | [B, T, d_k] |
| Values | [B, T, d_v] |
| Scores | [B, T, T] |
| Weights | [B, T, T] |
| Output | [B, T, d_v] |
The score tensor contains all pairwise token comparisons inside each sequence.
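A quick usage sketch for the module above (the tensor sizes are arbitrary):

```python
import torch

B, T, D = 2, 5, 16
attn = SelfAttention(embed_dim=D)

x = torch.randn(B, T, D)
out, weights = attn(x)

print(out.shape)             # torch.Size([2, 5, 16])
print(weights.shape)         # torch.Size([2, 5, 5])
print(weights.sum(dim=-1))   # every row of attention weights sums to 1
```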
Causal Self-Attention
Autoregressive language models predict the next token from previous tokens.
At position $t$, the model may use positions $1, 2, \ldots, t$.
It must not use positions $t+1, \ldots, T$.
Causal self-attention enforces this rule with a triangular mask.
For sequence length $T = 4$, the allowed attention pattern is the lower-triangular mask:

$$M = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}$$
A PyTorch causal mask can be created as:
```python
T = 4
mask = torch.tril(torch.ones(T, T)).bool()
print(mask)
```

Then it can be applied before softmax:

```python
scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```

Causal self-attention is the core mechanism used in decoder-only language models.
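Putting the pieces together, a sketch of the causal mask applied through the SelfAttention module defined above (sizes are arbitrary):

```python
B, T, D = 2, 4, 16
x = torch.randn(B, T, D)

causal_mask = torch.tril(torch.ones(T, T)).bool()   # [T, T], broadcasts over the batch
attn = SelfAttention(embed_dim=D)

# The module masks positions where mask == 0, so the boolean
# lower-triangular mask can be passed in directly.
out, weights = attn(x, mask=causal_mask)
print(weights[0])   # upper triangle is zero: no attention to future positions
```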
Bidirectional Self-Attention
Encoder models usually use bidirectional self-attention. In bidirectional attention, each position may attend to all positions.
This is useful when the whole input is available at once.
Examples include:
| Task | Why bidirectional context helps |
|---|---|
| Text classification | The whole sentence determines the label |
| Named entity recognition | Left and right context identify entities |
| Image classification | All patches belong to the same image |
| Retrieval encoding | The full document can be encoded together |
Bidirectional attention is used in transformer encoders. Causal attention is used in autoregressive decoders.
Padding Masks
Sequences in a batch often have different lengths. They are padded to a common length so they can be stored in one tensor.
For example:
| Original sequence | Padded sequence |
|---|---|
| [12, 8, 4] | [12, 8, 4, 0, 0] |
| [9, 2, 5, 7, 3] | [9, 2, 5, 7, 3] |
The padding tokens should not affect attention.
A padding mask marks real tokens and padding tokens:
```python
tokens = torch.tensor([
    [12, 8, 4, 0, 0],
    [9, 2, 5, 7, 3],
])
pad_id = 0
valid = tokens != pad_id
```

The mask can be reshaped so it broadcasts over query positions:

```python
key_mask = valid[:, None, :]   # [B, 1, T]
```

Then forbidden key positions are masked before softmax.
Padding masks are separate from causal masks. In decoder-only language models, both are often needed.
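One way to combine the two, as a sketch using the tensors defined above (combining by logical AND reflects the rule that a position must be allowed by both masks):

```python
T = tokens.shape[1]
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))   # [T, T]
key_mask = (tokens != pad_id)[:, None, :]                  # [B, 1, T]

# A query may attend to a key only if both masks allow it.
combined = causal[None, :, :] & key_mask                   # [B, T, T]
```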
Self-Attention in Images
Self-attention can also be applied to images.
A vision transformer divides an image into patches. Each patch is flattened and projected into an embedding vector. The image becomes a sequence of patch embeddings: an $H \times W$ image with patch size $P$ yields $N = HW / P^2$ tokens.
Self-attention then lets each patch interact with every other patch.
This gives the model global receptive fields from the first layer. A patch in the top-left corner can directly attend to a patch in the bottom-right corner.
This differs from convolution, where information usually spreads through local neighborhoods over many layers.
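A minimal patch-embedding sketch (the module name, patch size, and dimensions here are illustrative assumptions, not a specific library API):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size=16, in_channels=3, embed_dim=256):
        super().__init__()
        # A strided convolution extracts and projects non-overlapping patches.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                      # [B, C, H, W]
        patches = self.proj(images)                 # [B, embed_dim, H/P, W/P]
        return patches.flatten(2).transpose(1, 2)   # [B, N, embed_dim]

patch_tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(patch_tokens.shape)   # torch.Size([2, 196, 256]): 196 patch tokens per image
```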
Self-Attention Versus Convolution
Self-attention and convolution both mix information across positions, but they use different assumptions.
| Property | Convolution | Self-attention |
|---|---|---|
| Interaction pattern | Local by default | Global by default |
| Weights | Shared fixed kernels | Input-dependent weights |
| Position bias | Strong locality | Requires position information |
| Cost with length | Often linear | Usually quadratic |
| Strength | Efficient local structure | Flexible long-range structure |
Convolution has a strong inductive bias for images. It assumes nearby pixels are strongly related. Self-attention has a weaker locality bias but stronger ability to model global relationships.
Modern vision systems often combine both ideas.
Positional Information
Self-attention by itself does not know token order.
If we permute the input sequence and apply self-attention without positional information, the output is permuted in the same way. The operation sees a set of vectors, not an ordered sequence.
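A quick check of this permutation behavior, using the SelfAttention module defined earlier (sizes and the permutation are arbitrary):

```python
B, T, D = 1, 5, 16
x = torch.randn(B, T, D)
attn = SelfAttention(embed_dim=D)

perm = torch.randperm(T)
out_x, _ = attn(x)
out_perm, _ = attn(x[:, perm, :])

# Without positional information, permuting the inputs simply permutes the outputs.
print(torch.allclose(out_perm, out_x[:, perm, :], atol=1e-6))   # True
```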
For language, order matters:
| Tokens | Meaning |
|---|---|
| dog bites man | One meaning |
| man bites dog | Different meaning |
Therefore transformers add positional information.
Common approaches include:
| Method | Description |
|---|---|
| Absolute positional embeddings | Add a learned or fixed vector for each position |
| Sinusoidal encoding | Use fixed sine and cosine functions |
| Relative position bias | Add bias terms based on token distance |
| Rotary position embeddings | Rotate query and key vectors based on position |
Positional encoding gives self-attention access to order and distance.
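As one concrete example from the table, a sinusoidal encoding can be sketched as follows (the frequency base of 10000 follows the original transformer formulation; the function assumes an even embedding dimension):

```python
import torch

def sinusoidal_positions(T, D):
    """Return a [T, D] matrix of fixed sine/cosine position encodings."""
    pos = torch.arange(T, dtype=torch.float32)[:, None]        # [T, 1]
    two_i = torch.arange(0, D, 2, dtype=torch.float32)[None]   # [1, D/2]
    angles = pos / (10000.0 ** (two_i / D))
    pe = torch.zeros(T, D)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Added to the token embeddings before the first self-attention layer.
x = torch.randn(2, 5, 16) + sinusoidal_positions(5, 16)
```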
Computational Cost
For a sequence of length $T$, self-attention forms a $T \times T$ score matrix.
The memory cost for the attention weights is therefore proportional to $T^2$.
With batch size $B$ and number of heads $H$, the attention matrix has shape $[B, H, T, T]$.
This quadratic scaling becomes expensive for long sequences.
For example, doubling the context length increases the attention matrix size by a factor of 4.
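A rough back-of-the-envelope sketch of this growth (the single-example batch, 16 heads, and float32 storage are arbitrary assumptions):

```python
def attention_weight_bytes(B, H, T, bytes_per_element=4):
    """Memory for the [B, H, T, T] attention weights alone, in bytes."""
    return B * H * T * T * bytes_per_element

for T in (1024, 2048, 4096, 8192):
    gib = attention_weight_bytes(B=1, H=16, T=T) / 2**30
    print(f"T={T}: {gib:.2f} GiB")   # quadruples each time T doubles
```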
This motivates efficient self-attention variants, including:
| Method | Idea |
|---|---|
| Local attention | Attend only to nearby positions |
| Sparse attention | Use selected long-range links |
| Linear attention | Avoid forming the explicit attention matrix |
| FlashAttention | Compute exact attention with better memory behavior |
| Sliding-window attention | Use fixed-size local windows |
| State-space hybrids | Combine recurrence-like memory with attention |
The best method depends on the task, hardware, and sequence length.
Self-Attention as a Learned Routing Mechanism
Self-attention can be viewed as routing information between positions.
Each token decides where to read from. The attention weights determine which value vectors influence the output.
This routing is content-dependent. If the input changes, the attention matrix changes.
This is one of the main reasons transformers are powerful. The computation graph between tokens adapts to the example.
For language, this supports dependencies such as:
- pronoun resolution,
- subject-verb agreement,
- modifier attachment,
- long-range topic tracking,
- retrieval of earlier definitions.
For images, it supports relationships such as:
- object-part connections,
- symmetry,
- global shape,
- spatial grouping,
- scene-level context.
Summary
Self-attention applies attention within a single sequence. Queries, keys, and values are all computed from the same input.
Each position forms a contextual representation by comparing itself with all other positions and combining their value vectors. This gives direct access to long-range information and enables highly parallel computation.
Self-attention can be bidirectional, as in transformer encoders, or causal, as in autoregressive language models. Its main limitation is quadratic cost in sequence length, which motivates efficient attention variants.