Dot-product attention uses an inner product to measure how well a query matches a key. It is simpler than additive attention and maps efficiently to matrix multiplication. This efficiency is one reason it became the standard attention mechanism in transformers.
The mechanism follows the same retrieval pattern introduced earlier:
- Compare queries with keys.
- Normalize the comparison scores.
- Use the resulting weights to combine values.
The difference is the scoring function. Dot-product attention uses vector similarity directly.
Query-Key Similarity
Let $q \in \mathbb{R}^{d_k}$ be a query vector, and let $k_i \in \mathbb{R}^{d_k}$ be the key vector for item $i$.
Dot-product attention computes the score

$$ s_i = q^\top k_i. $$

The score is large when $q$ and $k_i$ point in similar directions and have large norms. The score is small or negative when they point in unrelated or opposite directions.
This score measures compatibility. A compatible key receives more attention.
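As a quick sketch with made-up numbers (a hypothetical query and two hypothetical keys, $d_k = 4$), the score is a single inner product:

```python
import torch

q = torch.tensor([1.0, 0.0, 2.0, -1.0])           # query
k_similar = torch.tensor([1.0, 0.0, 1.0, -1.0])   # points in a similar direction
k_opposite = -k_similar                           # points in the opposite direction

s_similar = q @ k_similar    # s_i = q^T k_i as a single inner product
s_opposite = q @ k_opposite

print(s_similar.item())      # 4.0: compatible key, large positive score
print(s_opposite.item())     # -4.0: incompatible key, negative score
```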
Scores for Multiple Keys
Suppose we have $T$ keys $k_1, k_2, \dots, k_T$.
Stack them into a matrix:

$$ K = \begin{bmatrix} - & k_1^\top & - \\ - & k_2^\top & - \\ & \vdots & \\ - & k_T^\top & - \end{bmatrix} \in \mathbb{R}^{T\times d_k}. $$

A single query compares with all keys by

$$ s = K q \in \mathbb{R}^{T}. $$

Equivalently, if we store the query as a row vector $q^\top$, the scores are

$$ s^\top = q^\top K^\top. $$

The result is a vector $s = (s_1, \dots, s_T)^\top \in \mathbb{R}^{T}$.
Each entry $s_i = q^\top k_i$ is the dot product between the query and one key.
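A small sketch with made-up numbers ($T = 3$, $d_k = 2$) shows that $Kq$ reproduces the individual dot products:

```python
import torch

q = torch.tensor([1.0, 2.0])
K = torch.stack([
    torch.tensor([1.0, 0.0]),   # k_1
    torch.tensor([0.0, 1.0]),   # k_2
    torch.tensor([1.0, 1.0]),   # k_3
])                              # K has shape [T, d_k] = [3, 2]

s = K @ q                       # s = K q, shape [T]
print(s)                        # tensor([1., 2., 3.]): one dot product per key
```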
From Scores to Attention Weights
The raw scores are converted to probabilities with softmax:

$$ \alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{T} \exp(s_j)}. $$

The weights are nonnegative and sum to one:

$$ \alpha_i \ge 0, \qquad \sum_{i=1}^{T} \alpha_i = 1. $$

The attention output is a weighted sum of values:

$$ z = \sum_{i=1}^{T} \alpha_i v_i. $$

Here $v_i \in \mathbb{R}^{d_v}$ is the value vector associated with key $k_i$.
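Putting the three steps together for a single query (toy scores and values, chosen only for illustration):

```python
import torch

s = torch.tensor([1.0, 2.0, 3.0])     # raw scores s_i for T = 3 keys
alpha = torch.softmax(s, dim=-1)      # attention weights, sum to 1

V = torch.tensor([[1.0, 0.0],         # v_1
                  [0.0, 1.0],         # v_2
                  [1.0, 1.0]])        # v_3; V has shape [T, d_v] = [3, 2]

z = alpha @ V                         # weighted sum of value vectors
print(alpha.sum())                    # sums to 1 (up to float error)
print(z.shape)                        # torch.Size([2]): one d_v-dim output
```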
Matrix Form
Dot-product attention is usually computed in matrix form.
Let $Q \in \mathbb{R}^{T_q \times d_k}$ be a matrix of queries, $K \in \mathbb{R}^{T_k \times d_k}$ be a matrix of keys, and $V \in \mathbb{R}^{T_k \times d_v}$ be a matrix of values.
The score matrix is

$$ S = Q K^\top. $$

The shape is $T_q \times T_k$.
Each row corresponds to one query. Each column corresponds to one key.
After row-wise softmax, the attention output is

$$ Z = \operatorname{softmax}(QK^\top)V. $$

The output has shape $T_q \times d_v$.
Each output row is a weighted combination of value vectors.
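The matrix form can be checked end to end with random tensors (unscaled, matching the formula above; the shapes are illustrative):

```python
import torch

T_q, T_k, d_k, d_v = 3, 5, 8, 6
Q = torch.randn(T_q, d_k)
K = torch.randn(T_k, d_k)
V = torch.randn(T_k, d_v)

S = Q @ K.T                      # score matrix, shape [T_q, T_k]
A = torch.softmax(S, dim=-1)     # row-wise softmax: one distribution per query
Z = A @ V                        # output, shape [T_q, d_v]

print(S.shape, Z.shape)
print(A.sum(dim=-1))             # each row of A sums to 1
```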
Batch Form
In PyTorch, we normally use batched tensors. Suppose:
- `Q` has shape `[B, T_q, d_k]`,
- `K` has shape `[B, T_k, d_k]`,
- `V` has shape `[B, T_k, d_v]`.

Then the score tensor is $S = QK^\top$ over the last two dimensions, with shape `[B, T_q, T_k]`.
In code:

```python
scores = Q @ K.transpose(-2, -1)
```

The transpose swaps the last two dimensions of `K`, so the multiplication compares every query with every key.
Then:

```python
weights = torch.softmax(scores, dim=-1)
Z = weights @ V
```

The resulting tensor `Z` has shape `[B, T_q, d_v]`. The last dimension of `weights`, namely `T_k`, is summed against the key sequence dimension of `V`.
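A quick batched check (arbitrary shapes chosen for the sketch):

```python
import torch

B, T_q, T_k, d_k, d_v = 2, 4, 6, 8, 5
Q = torch.randn(B, T_q, d_k)
K = torch.randn(B, T_k, d_k)
V = torch.randn(B, T_k, d_v)

scores = Q @ K.transpose(-2, -1)         # [B, T_q, T_k]
weights = torch.softmax(scores, dim=-1)  # row-wise over keys
Z = weights @ V                          # [B, T_q, d_v]

print(scores.shape)   # torch.Size([2, 4, 6])
print(Z.shape)        # torch.Size([2, 4, 5])
```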
Why Scaling Is Needed
Plain dot products grow in magnitude as the key dimension $d_k$ increases.
Assume the entries of $q$ and $k$ are independent random variables with mean 0 and variance 1. The dot product is

$$ q^\top k = \sum_{j=1}^{d_k} q_j k_j. $$

Its variance grows linearly with $d_k$ (under these assumptions it is exactly $d_k$). Large dot products push the softmax into saturated regions. When softmax saturates, one position receives nearly all probability mass and gradients become small.
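The variance claim can be checked empirically; this sketch samples random unit-variance vectors at a few key dimensions (the sample count is arbitrary):

```python
import torch

torch.manual_seed(0)
n_samples = 20_000

for d_k in [16, 64, 256]:
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)
    dots = (q * k).sum(dim=-1)    # one dot product per sample
    # The empirical variance is close to d_k, and dividing the scores
    # by sqrt(d_k) brings it back to roughly 1.
    print(d_k, dots.var().item(), (dots / d_k ** 0.5).var().item())
```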
To control this effect, transformers use scaled dot-product attention:
$$ S = \frac{QK^\top}{\sqrt{d_k}}. $$
The scale factor $1/\sqrt{d_k}$ keeps score magnitudes more stable as the key dimension changes.
Scaled Dot-Product Attention
The full scaled dot-product attention formula is
$$ \operatorname{Attention}(Q,K,V)=\operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V. $$
This is the core operation inside transformer attention layers.
The formula has three stages:
| Stage | Operation | Result |
|---|---|---|
| Score | $S = QK^\top$ | Pairwise query-key similarities |
| Scale and normalize | $A = \operatorname{softmax}(S/\sqrt{d_k})$ | Attention weights |
| Retrieve | $Z = AV$ | Weighted value combinations |
The softmax is applied row by row, so each query produces a probability distribution over keys.
Masks
Attention often needs masks.
A mask prevents attention to certain positions. Common cases include:
| Mask type | Purpose |
|---|---|
| Padding mask | Ignore padding tokens |
| Causal mask | Prevent attending to future tokens |
| Block mask | Restrict attention to a local or structured region |
A padding mask is needed because batches often contain sequences of different lengths. Shorter sequences are padded to match the longest sequence. The model should not treat padding tokens as real content.
A causal mask is used in autoregressive language modeling. Token $t$ may attend only to tokens at positions $\le t$. It cannot attend to future tokens because those tokens should not be known during next-token prediction.
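A causal mask can be built from a lower-triangular matrix; this sketch uses `torch.tril` for a sequence of length 4:

```python
import torch

T = 4
# 1 where attention is allowed, 0 where it is forbidden.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.long))
print(causal_mask)
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
# Row t has ones only at positions <= t: token t cannot see the future.
```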
In practice, masking is often implemented by adding a large negative value to forbidden score entries before softmax:
```python
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```

After softmax, forbidden positions receive attention weight near zero.
PyTorch Implementation
A minimal scaled dot-product attention function is:
```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: [B, T_q, d_k]
    K: [B, T_k, d_k]
    V: [B, T_k, d_v]
    mask: broadcastable to [B, T_q, T_k]
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1)
    scores = scores / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    output = weights @ V
    return output, weights
```

This code mirrors the mathematical definition.
The critical shape relation is:

```
[B, T_q, d_k] @ [B, d_k, T_k] -> [B, T_q, T_k]
```

Then:

```
[B, T_q, T_k] @ [B, T_k, d_v] -> [B, T_q, d_v]
```

Using PyTorch Built-In Attention
PyTorch includes optimized implementations of scaled dot-product attention.
A direct call can be written as:
```python
import torch.nn.functional as F

Z = F.scaled_dot_product_attention(Q, K, V)
```

For causal attention:

```python
Z = F.scaled_dot_product_attention(
    Q,
    K,
    V,
    is_causal=True,
)
```

This built-in function may use optimized kernels when available. In practice, it is usually preferable to the manual version for production training or inference.
The manual implementation remains useful for learning and debugging.
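One way to build confidence in the manual version is to compare it against the built-in on random inputs. The sketch below repeats the manual computation inline so it is self-contained; backend kernels may differ by small floating-point amounts, hence the tolerance:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, d = 2, 5, 8
Q, K, V = torch.randn(B, T, d), torch.randn(B, T, d), torch.randn(B, T, d)

# Manual scaled dot-product attention.
scores = Q @ K.transpose(-2, -1) / math.sqrt(d)
Z_manual = torch.softmax(scores, dim=-1) @ V

# PyTorch built-in.
Z_builtin = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(Z_manual, Z_builtin, atol=1e-5))
```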
Numerical Stability
Attention uses softmax, and softmax can overflow if scores are too large. Implementations usually subtract the maximum score before exponentiation:

$$ \operatorname{softmax}(s)_i = \frac{\exp\left(s_i - \max_j s_j\right)}{\sum_{l} \exp\left(s_l - \max_j s_j\right)}. $$

This transformation preserves the probabilities while improving numerical stability.
PyTorch’s torch.softmax already handles this internally in a stable way.
Masks also require care. If every position in a row is masked, softmax receives all negative infinity values and may produce NaN. Robust implementations avoid fully masked rows or define special behavior for them.
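The fully-masked failure mode is easy to reproduce (a deliberately broken sketch with toy scores):

```python
import torch

scores = torch.tensor([[0.5, 1.0],
                       [2.0, 0.3]])
mask = torch.tensor([[1, 1],
                     [0, 0]])   # second row is fully masked

scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights)
# Row 0 is a valid distribution; row 1 is all NaN,
# because softmax over [-inf, -inf] is 0/0.
```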
Computational Cost
Standard dot-product attention forms a score matrix with shape $T_q \times T_k$.
For self-attention, $T_q = T_k = T$, so the score matrix has size $T \times T$.
The computational and memory cost grow quadratically with sequence length.
For a batch size $B$, number of heads $H$, and sequence length $T$, the attention weight tensor has shape $[B, H, T, T]$.
This can become large for long-context models.
For example, if $B = 8$, $H = 16$, and $T = 4096$, then the attention weight tensor contains $8 \times 16 \times 4096^2 = 2{,}147{,}483{,}648$ entries. This is more than two billion values.
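The growth is easy to tabulate; this sketch counts entries and float32 bytes for a few sequence lengths (the batch and head counts are illustrative assumptions):

```python
B, H = 8, 16  # illustrative batch size and head count

for T in [1024, 2048, 4096]:
    entries = B * H * T * T          # size of the [B, H, T, T] weight tensor
    gib = entries * 4 / 2**30        # float32 bytes -> GiB
    print(f"T={T}: {entries:,} entries, {gib:.1f} GiB")
# At T=4096 this is 2,147,483,648 entries (8.0 GiB in float32).
```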
This quadratic cost motivates efficient attention methods, including sliding-window attention, sparse attention, low-rank attention, memory-efficient attention kernels, and linear attention.
Dot-Product Attention Versus Additive Attention
Dot-product attention is less expressive as a scoring function than additive attention, but it is much faster on modern hardware.
| Property | Additive attention | Dot-product attention |
|---|---|---|
| Score function | Learned nonlinear network | Inner product |
| Main operation | MLP over pairs | Matrix multiplication |
| Hardware efficiency | Lower | Higher |
| Common use | Early seq2seq models | Transformers |
| Scaling factor needed | No standard factor | Usually $1/\sqrt{d_k}$ |
The hardware efficiency matters. Matrix multiplication is one of the most optimized operations on GPUs and TPUs. Dot-product attention expresses pairwise comparison as matrix multiplication, making it suitable for large-scale training.
Interpretation
Dot-product attention retrieves values whose keys align with the query.
If a query vector represents “what information this token needs,” then each key represents “what information this token can offer.” The dot product measures compatibility. The values carry the content that will be mixed into the output.
In self-attention, every token produces its own query, key, and value. Thus each token asks a question, offers an address, and provides content.
This viewpoint is useful but approximate. Queries, keys, and values are learned internal representations, not human-readable records.
Summary
Dot-product attention computes attention scores by taking inner products between queries and keys. It then applies softmax and uses the resulting weights to combine values.
Scaled dot-product attention divides the scores by $\sqrt{d_k}$, which stabilizes softmax behavior for large key dimensions. Its matrix form is efficient and maps directly to optimized hardware kernels.
This mechanism is the foundation of transformer self-attention and cross-attention.