
Self-Attention


Self-attention is attention applied within a single sequence. The queries, keys, and values all come from the same input. Each position in the sequence computes a new representation by looking at other positions in that same sequence.

This is the main operation inside transformer encoders and decoders. It lets every token exchange information with every other token in one layer.

From Attention to Self-Attention

In general attention, we may have separate sources for queries, keys, and values:

Q \text{ from one sequence}, \qquad K, V \text{ from another sequence}.

In self-attention, all three are derived from the same input sequence:

X = [x_1, x_2, \ldots, x_T].

Each input vector x_i \in \mathbb{R}^D is projected into three vectors:

q_i = x_i W_Q, \qquad k_i = x_i W_K, \qquad v_i = x_i W_V.

The token at position i then uses q_i to compare itself with all keys k_1, \ldots, k_T. The result is a weighted combination of all values v_1, \ldots, v_T.

Thus every output vector can depend on every input vector.

Matrix Form

Let the input sequence be stored as a matrix:

X \in \mathbb{R}^{T \times D}.

The learned projection matrices are:

W_Q \in \mathbb{R}^{D \times d_k}, \qquad W_K \in \mathbb{R}^{D \times d_k}, \qquad W_V \in \mathbb{R}^{D \times d_v}.

Then:

Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V.

The self-attention output is:

Z = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“Z=\operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V”}}

This is the same scaled dot-product attention formula. The defining feature of self-attention is that Q, K, and V come from the same X.
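
As a quick sanity check, the matrix form can be written directly in a few lines of PyTorch for a single sequence. This is a minimal sketch with hypothetical sizes; the batched module below is the fuller version.

import torch

T, D, d_k = 4, 8, 8                  # hypothetical sizes
X = torch.randn(T, D)                # one sequence of T token vectors

W_Q = torch.randn(D, d_k)
W_K = torch.randn(D, d_k)
W_V = torch.randn(D, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
Z = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1) @ V
print(Z.shape)                       # torch.Size([4, 8])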

Token Mixing

Self-attention mixes information across tokens.

Suppose X contains token representations:

x_1, x_2, x_3, x_4.

The output representation at position 2 is:

z_2 = \alpha_{2,1} v_1 + \alpha_{2,2} v_2 + \alpha_{2,3} v_3 + \alpha_{2,4} v_4.

The weights \alpha_{2,j} depend on the query at position 2 and the keys at all positions.

This means the representation of token 2 becomes contextual. It can change depending on the surrounding tokens.

For example, the representation of the word “apple” should differ in:

| Sentence | Likely meaning |
| --- | --- |
| I ate an apple. | Fruit |
| Apple released a new chip. | Company |

Self-attention allows nearby and distant context to influence the token representation.

Pairwise Interaction Matrix

The matrix

A = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)

is the attention matrix.

For a sequence of length T, its shape is:

A \in \mathbb{R}^{T \times T}.

Each row corresponds to one query position. Each column corresponds to one key position.

The entry A_{ij} is the attention weight from position i to position j. It measures how much position i uses information from position j.

Each row sums to 1:

\sum_{j=1}^{T} A_{ij} = 1.

The output is:

Z = AV.

So self-attention can be viewed as multiplying the value matrix by an input-dependent mixing matrix.
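
Both properties are easy to verify numerically. A small sketch with hypothetical sizes, which also re-derives the weighted combination from the Token Mixing section:

import torch

T, d_k, d_v = 4, 8, 8                            # hypothetical sizes
Q = torch.randn(T, d_k)
K = torch.randn(T, d_k)
V = torch.randn(T, d_v)

A = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # attention matrix, [T, T]
print(A.sum(dim=-1))                             # every row sums to 1

Z = A @ V
z2 = sum(A[1, j] * V[j] for j in range(T))       # position 2 (index 1) as a weighted sum
print(torch.allclose(Z[1], z2))                  # True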

Batched Self-Attention in PyTorch

For a batch of sequences, the input has shape:

[B, T, D]

where B is batch size, T is sequence length, and D is embedding dimension.

A minimal self-attention module is:

import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, key_dim=None, value_dim=None):
        super().__init__()

        key_dim = key_dim or embed_dim
        value_dim = value_dim or embed_dim

        self.key_dim = key_dim

        self.q_proj = nn.Linear(embed_dim, key_dim)
        self.k_proj = nn.Linear(embed_dim, key_dim)
        self.v_proj = nn.Linear(embed_dim, value_dim)

    def forward(self, x, mask=None):
        """
        x:    [B, T, D]
        mask: broadcastable to [B, T, T]
        """

        Q = self.q_proj(x)  # [B, T, d_k]
        K = self.k_proj(x)  # [B, T, d_k]
        V = self.v_proj(x)  # [B, T, d_v]

        # Pairwise similarity between every query and every key: [B, T, T]
        scores = Q @ K.transpose(-2, -1)
        scores = scores / math.sqrt(self.key_dim)

        # Positions where mask == 0 get -inf, so softmax assigns them zero weight
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        weights = torch.softmax(scores, dim=-1)  # [B, T, T], each row sums to 1
        out = weights @ V                        # [B, T, d_v]

        return out, weights

The shape flow is:

| Tensor | Shape |
| --- | --- |
| Input X | [B, T, D] |
| Queries Q | [B, T, d_k] |
| Keys K | [B, T, d_k] |
| Values V | [B, T, d_v] |
| Scores | [B, T, T] |
| Weights | [B, T, T] |
| Output | [B, T, d_v] |

The score tensor contains all pairwise token comparisons inside each sequence.
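
A short usage sketch with hypothetical sizes confirms this shape flow:

B, T, D = 2, 5, 16                   # hypothetical batch size, length, embedding dim
attn = SelfAttention(embed_dim=D)

x = torch.randn(B, T, D)
out, weights = attn(x)

print(out.shape)                     # torch.Size([2, 5, 16])
print(weights.shape)                 # torch.Size([2, 5, 5])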

Causal Self-Attention

Autoregressive language models predict the next token from previous tokens.

At position t, the model may use positions:

1, 2, \ldots, t.

It must not use positions:

t+1, t+2, \ldots, T.

Causal self-attention enforces this rule with a triangular mask.

For sequence length T = 4, the allowed attention pattern is:

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}.

A PyTorch causal mask can be created as:

T = 4

mask = torch.tril(torch.ones(T, T)).bool()  # True on and below the diagonal: allowed positions
print(mask)

Then it can be applied before softmax:

scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)

Causal self-attention is the core mechanism used in decoder-only language models.
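
The same triangular mask can be passed to the SelfAttention module from earlier. A short sketch with hypothetical sizes; the printed attention matrix should be lower triangular:

B, T, D = 1, 4, 16                              # hypothetical sizes
attn = SelfAttention(embed_dim=D)

x = torch.randn(B, T, D)
causal = torch.tril(torch.ones(T, T)).bool()    # [T, T], broadcasts over the batch

out, weights = attn(x, mask=causal)
print(weights[0])                               # zeros above the diagonal: no attention to future positions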

Bidirectional Self-Attention

Encoder models usually use bidirectional self-attention. In bidirectional attention, each position may attend to all positions.

This is useful when the whole input is available at once.

Examples include:

| Task | Why bidirectional context helps |
| --- | --- |
| Text classification | The whole sentence determines the label |
| Named entity recognition | Left and right context identify entities |
| Image classification | All patches belong to the same image |
| Retrieval encoding | The full document can be encoded together |

Bidirectional attention is used in transformer encoders. Causal attention is used in autoregressive decoders.

Padding Masks

Sequences in a batch often have different lengths. They are padded to a common length so they can be stored in one tensor.

For example:

| Original sequence | Padded sequence |
| --- | --- |
| [12, 8, 4] | [12, 8, 4, 0, 0] |
| [9, 2, 5, 7, 3] | [9, 2, 5, 7, 3] |

The padding tokens should not affect attention.

A padding mask marks real tokens and padding tokens:

tokens = torch.tensor([
    [12, 8, 4, 0, 0],
    [9, 2, 5, 7, 3],
])

pad_id = 0
valid = tokens != pad_id  # True for real tokens, False for padding

The mask can be reshaped so it broadcasts over query positions:

key_mask = valid[:, None, :]  # [B, 1, T]

Then forbidden key positions are masked before softmax.

Padding masks are separate from causal masks. In decoder-only language models, both are often needed.
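
A sketch of combining the two, reusing the tokens and key_mask tensors from above:

B, T = tokens.shape

causal = torch.tril(torch.ones(T, T, dtype=torch.bool))  # [T, T]
combined = key_mask & causal                              # broadcasts to [B, T, T]

print(combined.shape)   # torch.Size([2, 5, 5])
print(combined[0])      # lower triangular, with padded key columns all False

The combined tensor can be passed as the mask argument of the SelfAttention module above.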

Self-Attention in Images

Self-attention can also be applied to images.

A vision transformer divides an image into patches. Each patch is flattened and projected into an embedding vector. The image becomes a sequence:

x_1, x_2, \ldots, x_T.

Self-attention then lets each patch interact with every other patch.

This gives the model global receptive fields from the first layer. A patch in the top-left corner can directly attend to a patch in the bottom-right corner.

This differs from convolution, where information usually spreads through local neighborhoods over many layers.
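
A minimal patchify sketch, assuming hypothetical image and patch sizes and reusing the imports and SelfAttention module from earlier. Real vision transformers also add positional information and usually a class token.

B, C, H, W = 2, 3, 32, 32            # hypothetical image batch
patch_size = 8                       # hypothetical patch size
embed_dim = 64

# A convolution with stride equal to its kernel size cuts the image into
# non-overlapping patches and projects each patch to an embedding vector.
to_patches = nn.Conv2d(C, embed_dim, kernel_size=patch_size, stride=patch_size)

images = torch.randn(B, C, H, W)
x = to_patches(images)               # [B, embed_dim, 4, 4]
x = x.flatten(2).transpose(1, 2)     # [B, T, embed_dim] with T = 16 patches

out, weights = SelfAttention(embed_dim)(x)
print(weights.shape)                 # torch.Size([2, 16, 16]): every patch attends to every patch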

Self-Attention Versus Convolution

Self-attention and convolution both mix information across positions, but they use different assumptions.

| Property | Convolution | Self-attention |
| --- | --- | --- |
| Interaction pattern | Local by default | Global by default |
| Weights | Shared fixed kernels | Input-dependent weights |
| Position bias | Strong locality | Requires position information |
| Cost with length | Often linear | Usually quadratic |
| Strength | Efficient local structure | Flexible long-range structure |

Convolution has a strong inductive bias for images. It assumes nearby pixels are strongly related. Self-attention has a weaker locality bias but stronger ability to model global relationships.

Modern vision systems often combine both ideas.

Positional Information

Self-attention by itself does not know token order.

If we permute the input sequence and apply self-attention without positional information, the output is permuted in the same way. The operation sees a set of vectors, not an ordered sequence.
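
This can be checked directly with the SelfAttention module from earlier (a sketch with hypothetical sizes; no positional information is added):

B, T, D = 1, 5, 16                   # hypothetical sizes
attn = SelfAttention(embed_dim=D)
x = torch.randn(B, T, D)

perm = torch.randperm(T)
out_original, _ = attn(x)
out_permuted, _ = attn(x[:, perm])

# Permuting the input permutes the output rows in exactly the same way.
print(torch.allclose(out_original[:, perm], out_permuted, atol=1e-6))   # True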

For language, order matters:

| Tokens | Meaning |
| --- | --- |
| dog bites man | One meaning |
| man bites dog | Different meaning |

Therefore transformers add positional information.

Common approaches include:

| Method | Description |
| --- | --- |
| Absolute positional embeddings | Add a learned or fixed vector for each position |
| Sinusoidal encoding | Use fixed sine and cosine functions |
| Relative position bias | Add bias terms based on token distance |
| Rotary position embeddings | Rotate query and key vectors based on position |

Positional encoding gives self-attention access to order and distance.
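
As one concrete instance from the table, a sketch of the fixed sinusoidal encoding (standard formulation, assuming an even embedding dimension and reusing the earlier imports). The resulting table of vectors is simply added to the token embeddings.

def sinusoidal_encoding(T, D):
    """Fixed positional encoding of shape [T, D] built from sines and cosines."""
    position = torch.arange(T, dtype=torch.float32).unsqueeze(1)        # [T, 1]
    freqs = torch.exp(torch.arange(0, D, 2, dtype=torch.float32)
                      * (-math.log(10000.0) / D))                       # [D/2]
    pe = torch.zeros(T, D)
    pe[:, 0::2] = torch.sin(position * freqs)   # even dimensions
    pe[:, 1::2] = torch.cos(position * freqs)   # odd dimensions
    return pe

# x: [B, T, D] token embeddings
# x = x + sinusoidal_encoding(T, D)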

Computational Cost

For a sequence of length T, self-attention forms a T \times T score matrix.

The memory cost for attention weights is proportional to:

T^2.

With batch size B and number of heads H, the attention matrix has shape:

[B, H, T, T].

This quadratic scaling becomes expensive for long sequences.

For example, doubling the context length from T to 2T increases the attention matrix size by a factor of 4.
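
A rough back-of-the-envelope sketch with hypothetical settings makes the scaling concrete:

B, H = 4, 16                 # hypothetical batch size and number of heads
bytes_per_element = 2        # fp16

for T in (1024, 2048, 4096):
    elements = B * H * T * T
    print(T, f"{elements * bytes_per_element / 2**20:.0f} MiB")

# 1024 128 MiB
# 2048 512 MiB
# 4096 2048 MiB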

This motivates efficient self-attention variants, including:

| Method | Idea |
| --- | --- |
| Local attention | Attend only to nearby positions |
| Sparse attention | Use selected long-range links |
| Linear attention | Avoid the explicit T \times T attention matrix |
| FlashAttention | Compute exact attention with better memory behavior |
| Sliding-window attention | Use fixed-size local windows |
| State-space hybrids | Combine recurrence-like memory with attention |

The best method depends on the task, hardware, and sequence length.
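
As an illustration of the local and sliding-window entries in the table above, a banded mask restricts each query to nearby keys (a sketch with a hypothetical window size). The full attention matrix is still formed here, so this shows only the pattern, not the memory savings.

T, window = 8, 2                                              # hypothetical length and window
idx = torch.arange(T)

# True where |i - j| <= window: attend only to nearby positions.
local_mask = (idx[:, None] - idx[None, :]).abs() <= window    # [T, T]

# Adding the causal constraint gives a sliding-window mask for decoders.
sliding_mask = local_mask & torch.tril(torch.ones(T, T, dtype=torch.bool))

print(sliding_mask.int())

Either mask can be passed to the SelfAttention module above in place of the full causal mask.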

Self-Attention as a Learned Routing Mechanism

Self-attention can be viewed as routing information between positions.

Each token decides where to read from. The attention weights determine which value vectors influence the output.

This routing is content-dependent. If the input changes, the attention matrix changes.

This is one of the main reasons transformers are powerful. The computation graph between tokens adapts to the example.

For language, this supports dependencies such as:

  • pronoun resolution,
  • subject-verb agreement,
  • modifier attachment,
  • long-range topic tracking,
  • retrieval of earlier definitions.

For images, it supports relationships such as:

  • object-part connections,
  • symmetry,
  • global shape,
  • spatial grouping,
  • scene-level context.

Summary

Self-attention applies attention within a single sequence. Queries, keys, and values are all computed from the same input.

Each position forms a contextual representation by comparing itself with all other positions and combining their value vectors. This gives direct access to long-range information and enables highly parallel computation.

Self-attention can be bidirectional, as in transformer encoders, or causal, as in autoregressive language models. Its main limitation is quadratic cost in sequence length, which motivates efficient attention variants.