
Self-Attention


Self-attention is attention applied within a single sequence. The queries, keys, and values all come from the same input. Each position in the sequence computes a new representation by looking at other positions in that same sequence.

This is the main operation inside transformer encoders and decoders. It lets every token exchange information with every other token in one layer.

From Attention to Self-Attention

In general attention, we may have separate sources for queries, keys, and values:

Q \text{ from one sequence}, \qquad K, V \text{ from another sequence}.

In self-attention, all three are derived from the same input sequence:

X = [x_1, x_2, \ldots, x_T].

Each input vector x_i \in \mathbb{R}^D is projected into three vectors:

q_i = x_i W_Q, \qquad k_i = x_i W_K, \qquad v_i = x_i W_V.

The token at position i then uses q_i to compare itself with all keys k_1, \ldots, k_T. The result is a weighted combination of all values v_1, \ldots, v_T.

Thus every output vector can depend on every input vector.

Matrix Form

Let the input sequence be stored as a matrix:

X \in \mathbb{R}^{T \times D}.

The learned projection matrices are:

W_Q \in \mathbb{R}^{D \times d_k}, \qquad W_K \in \mathbb{R}^{D \times d_k}, \qquad W_V \in \mathbb{R}^{D \times d_v}.

Then:

Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V.

The self-attention output is:

Z = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“Z=\operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V”}}

This is the same scaled dot-product attention formula. The defining feature of self-attention is that Q, K, and V come from the same X.
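
As a quick sanity check, the matrix form can be written directly in a few lines of PyTorch for a single sequence. This is a minimal sketch with hypothetical sizes; the batched module below is the fuller version.

import torch

T, D, d_k = 4, 8, 8                  # hypothetical sizes
X = torch.randn(T, D)                # one sequence of T token vectors

W_Q = torch.randn(D, d_k)
W_K = torch.randn(D, d_k)
W_V = torch.randn(D, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
Z = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1) @ V
print(Z.shape)                       # torch.Size([4, 8])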

Token Mixing

Self-attention mixes information across tokens.

Suppose X contains token representations:

x_1, x_2, x_3, x_4.

The output representation at position 2 is:

z_2 = \alpha_{2,1} v_1 + \alpha_{2,2} v_2 + \alpha_{2,3} v_3 + \alpha_{2,4} v_4.

The weights \alpha_{2,j} depend on the query at position 2 and the keys at all positions.

This means the representation of token 2 becomes contextual. It can change depending on the surrounding tokens.

For example, the representation of the word “apple” should differ in:

| Sentence | Likely meaning |
| --- | --- |
| I ate an apple. | Fruit |
| Apple released a new chip. | Company |

Self-attention allows nearby and distant context to influence the token representation.

Pairwise Interaction Matrix

The matrix

A = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)

is the attention matrix.

For a sequence of length T, its shape is:

A \in \mathbb{R}^{T \times T}.

Each row corresponds to one query position. Each column corresponds to one key position.

The entry A_{ij} is the attention weight from position i to position j. It measures how much position i uses information from position j.

Each row sums to 1:

\sum_{j=1}^{T} A_{ij} = 1.

The output is:

Z = AV.

So self-attention can be viewed as multiplying the value matrix by an input-dependent mixing matrix.
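
Both properties are easy to verify numerically. A small sketch with hypothetical sizes, which also re-derives the weighted combination from the Token Mixing section:

import torch

T, d_k, d_v = 4, 8, 8                            # hypothetical sizes
Q = torch.randn(T, d_k)
K = torch.randn(T, d_k)
V = torch.randn(T, d_v)

A = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # attention matrix, [T, T]
print(A.sum(dim=-1))                             # every row sums to 1

Z = A @ V
z2 = sum(A[1, j] * V[j] for j in range(T))       # position 2 (index 1) as a weighted sum
print(torch.allclose(Z[1], z2))                  # True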

Batched Self-Attention in PyTorch

For a batch of sequences, the input has shape:

[B, T, D]

where B is batch size, T is sequence length, and D is embedding dimension.

A minimal self-attention module is:

import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, key_dim=None, value_dim=None):
        super().__init__()

        key_dim = key_dim or embed_dim
        value_dim = value_dim or embed_dim

        self.key_dim = key_dim

        self.q_proj = nn.Linear(embed_dim, key_dim)
        self.k_proj = nn.Linear(embed_dim, key_dim)
        self.v_proj = nn.Linear(embed_dim, value_dim)

    def forward(self, x, mask=None):
        """
        x:    [B, T, D]
        mask: broadcastable to [B, T, T]
        """

        Q = self.q_proj(x)  # [B, T, d_k]
        K = self.k_proj(x)  # [B, T, d_k]
        V = self.v_proj(x)  # [B, T, d_v]

        # Pairwise similarity between every query and every key: [B, T, T]
        scores = Q @ K.transpose(-2, -1)
        scores = scores / math.sqrt(self.key_dim)

        # Positions where mask == 0 get -inf, so softmax assigns them zero weight
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        weights = torch.softmax(scores, dim=-1)  # [B, T, T], each row sums to 1
        out = weights @ V                        # [B, T, d_v]

        return out, weights

The shape flow is:

| Tensor | Shape |
| --- | --- |
| Input X | [B, T, D] |
| Queries Q | [B, T, d_k] |
| Keys K | [B, T, d_k] |
| Values V | [B, T, d_v] |
| Scores | [B, T, T] |
| Weights | [B, T, T] |
| Output | [B, T, d_v] |

The score tensor contains all pairwise token comparisons inside each sequence.
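
A short usage sketch with hypothetical sizes confirms this shape flow:

B, T, D = 2, 5, 16                   # hypothetical batch size, length, embedding dim
attn = SelfAttention(embed_dim=D)

x = torch.randn(B, T, D)
out, weights = attn(x)

print(out.shape)                     # torch.Size([2, 5, 16])
print(weights.shape)                 # torch.Size([2, 5, 5])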

Causal Self-Attention

Autoregressive language models predict the next token from previous tokens.

At position t, the model may use positions:

1, 2, \ldots, t.

It must not use positions:

t+1, t+2, \ldots, T.

Causal self-attention enforces this rule with a triangular mask.

For sequence length T = 4, the allowed attention pattern is:

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}.

A PyTorch causal mask can be created as:

T = 4

mask = torch.tril(torch.ones(T, T)).bool()  # True on and below the diagonal: allowed positions
print(mask)

Then it can be applied before softmax:

scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)

Causal self-attention is the core mechanism used in decoder-only language models.
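
The same triangular mask can be passed to the SelfAttention module from earlier. A short sketch with hypothetical sizes; the printed attention matrix should be lower triangular:

B, T, D = 1, 4, 16                              # hypothetical sizes
attn = SelfAttention(embed_dim=D)

x = torch.randn(B, T, D)
causal = torch.tril(torch.ones(T, T)).bool()    # [T, T], broadcasts over the batch

out, weights = attn(x, mask=causal)
print(weights[0])                               # zeros above the diagonal: no attention to future positions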

Bidirectional Self-Attention

Encoder models usually use bidirectional self-attention. In bidirectional attention, each position may attend to all positions.

This is useful when the whole input is available at once.

Examples include:

| Task | Why bidirectional context helps |
| --- | --- |
| Text classification | The whole sentence determines the label |
| Named entity recognition | Left and right context identify entities |
| Image classification | All patches belong to the same image |
| Retrieval encoding | The full document can be encoded together |

Bidirectional attention is used in transformer encoders. Causal attention is used in autoregressive decoders.

Padding Masks

Sequences in a batch often have different lengths. They are padded to a common length so they can be stored in one tensor.

For example:

| Original sequence | Padded sequence |
| --- | --- |
| [12, 8, 4] | [12, 8, 4, 0, 0] |
| [9, 2, 5, 7, 3] | [9, 2, 5, 7, 3] |

The padding tokens should not affect attention.

A padding mask marks real tokens and padding tokens:

tokens = torch.tensor([
    [12, 8, 4, 0, 0],
    [9, 2, 5, 7, 3],
])

pad_id = 0
valid = tokens != pad_id  # True for real tokens, False for padding

The mask can be reshaped so it broadcasts over query positions:

key_mask = valid[:, None, :]  # [B, 1, T]

Then forbidden key positions are masked before softmax.

Padding masks are separate from causal masks. In decoder-only language models, both are often needed.
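
A sketch of combining the two, reusing the tokens and key_mask tensors from above:

B, T = tokens.shape

causal = torch.tril(torch.ones(T, T, dtype=torch.bool))  # [T, T]
combined = key_mask & causal                              # broadcasts to [B, T, T]

print(combined.shape)   # torch.Size([2, 5, 5])
print(combined[0])      # lower triangular, with padded key columns all False

The combined tensor can be passed as the mask argument of the SelfAttention module above.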

Self-Attention in Images

Self-attention can also be applied to images.

A vision transformer divides an image into patches. Each patch is flattened and projected into an embedding vector. The image becomes a sequence:

x_1, x_2, \ldots, x_T.

Self-attention then lets each patch interact with every other patch.

This gives the model global receptive fields from the first layer. A patch in the top-left corner can directly attend to a patch in the bottom-right corner.

This differs from convolution, where information usually spreads through local neighborhoods over many layers.
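
A minimal patchify sketch, assuming hypothetical image and patch sizes and reusing the imports and SelfAttention module from earlier. Real vision transformers also add positional information and usually a class token.

B, C, H, W = 2, 3, 32, 32            # hypothetical image batch
patch_size = 8                       # hypothetical patch size
embed_dim = 64

# A convolution with stride equal to its kernel size cuts the image into
# non-overlapping patches and projects each patch to an embedding vector.
to_patches = nn.Conv2d(C, embed_dim, kernel_size=patch_size, stride=patch_size)

images = torch.randn(B, C, H, W)
x = to_patches(images)               # [B, embed_dim, 4, 4]
x = x.flatten(2).transpose(1, 2)     # [B, T, embed_dim] with T = 16 patches

out, weights = SelfAttention(embed_dim)(x)
print(weights.shape)                 # torch.Size([2, 16, 16]): every patch attends to every patch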

Self-Attention Versus Convolution

Self-attention and convolution both mix information across positions, but they use different assumptions.

| Property | Convolution | Self-attention |
| --- | --- | --- |
| Interaction pattern | Local by default | Global by default |
| Weights | Shared fixed kernels | Input-dependent weights |
| Position bias | Strong locality | Requires position information |
| Cost with length | Often linear | Usually quadratic |
| Strength | Efficient local structure | Flexible long-range structure |

Convolution has a strong inductive bias for images. It assumes nearby pixels are strongly related. Self-attention has a weaker locality bias but stronger ability to model global relationships.

Modern vision systems often combine both ideas.

Positional Information

Self-attention by itself does not know token order.

If we permute the input sequence and apply self-attention without positional information, the output is permuted in the same way. The operation sees a set of vectors, not an ordered sequence.
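
This can be checked directly with the SelfAttention module from earlier (a sketch with hypothetical sizes; no positional information is added):

B, T, D = 1, 5, 16                   # hypothetical sizes
attn = SelfAttention(embed_dim=D)
x = torch.randn(B, T, D)

perm = torch.randperm(T)
out_original, _ = attn(x)
out_permuted, _ = attn(x[:, perm])

# Permuting the input permutes the output rows in exactly the same way.
print(torch.allclose(out_original[:, perm], out_permuted, atol=1e-6))   # True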

For language, order matters:

| Tokens | Meaning |
| --- | --- |
| dog bites man | One meaning |
| man bites dog | Different meaning |

Therefore transformers add positional information.

Common approaches include:

| Method | Description |
| --- | --- |
| Absolute positional embeddings | Add a learned or fixed vector for each position |
| Sinusoidal encoding | Use fixed sine and cosine functions |
| Relative position bias | Add bias terms based on token distance |
| Rotary position embeddings | Rotate query and key vectors based on position |

Positional encoding gives self-attention access to order and distance.
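
As one concrete instance from the table, a sketch of the fixed sinusoidal encoding (standard formulation, assuming an even embedding dimension and reusing the earlier imports). The resulting table of vectors is simply added to the token embeddings.

def sinusoidal_encoding(T, D):
    """Fixed positional encoding of shape [T, D] built from sines and cosines."""
    position = torch.arange(T, dtype=torch.float32).unsqueeze(1)        # [T, 1]
    freqs = torch.exp(torch.arange(0, D, 2, dtype=torch.float32)
                      * (-math.log(10000.0) / D))                       # [D/2]
    pe = torch.zeros(T, D)
    pe[:, 0::2] = torch.sin(position * freqs)   # even dimensions
    pe[:, 1::2] = torch.cos(position * freqs)   # odd dimensions
    return pe

# x: [B, T, D] token embeddings
# x = x + sinusoidal_encoding(T, D)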

Computational Cost

For a sequence of length T, self-attention forms a T \times T score matrix.

The memory cost for attention weights is proportional to:

T^2.

With batch size B and number of heads H, the attention matrix has shape:

[B, H, T, T].

This quadratic scaling becomes expensive for long sequences.

For example, doubling the context length from T to 2T increases the attention matrix size by a factor of 4.
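
A rough back-of-the-envelope sketch with hypothetical settings makes the scaling concrete:

B, H = 4, 16                 # hypothetical batch size and number of heads
bytes_per_element = 2        # fp16

for T in (1024, 2048, 4096):
    elements = B * H * T * T
    print(T, f"{elements * bytes_per_element / 2**20:.0f} MiB")

# 1024 128 MiB
# 2048 512 MiB
# 4096 2048 MiB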

This motivates efficient self-attention variants, including:

| Method | Idea |
| --- | --- |
| Local attention | Attend only to nearby positions |
| Sparse attention | Use selected long-range links |
| Linear attention | Avoid the explicit T \times T attention matrix |
| FlashAttention | Compute exact attention with better memory behavior |
| Sliding-window attention | Use fixed-size local windows |
| State-space hybrids | Combine recurrence-like memory with attention |

The best method depends on the task, hardware, and sequence length.
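
As an illustration of the local and sliding-window entries in the table above, a banded mask restricts each query to nearby keys (a sketch with a hypothetical window size). The full attention matrix is still formed here, so this shows only the pattern, not the memory savings.

T, window = 8, 2                                              # hypothetical length and window
idx = torch.arange(T)

# True where |i - j| <= window: attend only to nearby positions.
local_mask = (idx[:, None] - idx[None, :]).abs() <= window    # [T, T]

# Adding the causal constraint gives a sliding-window mask for decoders.
sliding_mask = local_mask & torch.tril(torch.ones(T, T, dtype=torch.bool))

print(sliding_mask.int())

Either mask can be passed to the SelfAttention module above in place of the full causal mask.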

Self-Attention as a Learned Routing Mechanism

Self-attention can be viewed as routing information between positions.

Each token decides where to read from. The attention weights determine which value vectors influence the output.

This routing is content-dependent. If the input changes, the attention matrix changes.

This is one of the main reasons transformers are powerful. The computation graph between tokens adapts to the example.

For language, this supports dependencies such as:

  • pronoun resolution,
  • subject-verb agreement,
  • modifier attachment,
  • long-range topic tracking,
  • retrieval of earlier definitions.

For images, it supports relationships such as:

  • object-part connections,
  • symmetry,
  • global shape,
  • spatial grouping,
  • scene-level context.

Summary

Self-attention applies attention within a single sequence. Queries, keys, and values are all computed from the same input.

Each position forms a contextual representation by comparing itself with all other positions and combining their value vectors. This gives direct access to long-range information and enables highly parallel computation.

Self-attention can be bidirectional, as in transformer encoders, or causal, as in autoregressive language models. Its main limitation is quadratic cost in sequence length, which motivates efficient attention variants.