
Dot-Product Attention

Dot-product attention uses an inner product to measure how well a query matches a key. It is simpler than additive attention and maps efficiently to matrix multiplication. This efficiency is one reason it became the standard attention mechanism in transformers.

The mechanism follows the same retrieval pattern introduced earlier:

  1. Compare queries with keys.
  2. Normalize the comparison scores.
  3. Use the resulting weights to combine values.

The difference is the scoring function. Dot-product attention uses vector similarity directly.

Query-Key Similarity

Let

$$q \in \mathbb{R}^{d_k}$$

be a query vector, and let

$$k_i \in \mathbb{R}^{d_k}$$

be the key vector for item $i$.

Dot-product attention computes the score

$$s_i = q^\top k_i.$$

genui{“math_block_widget_always_prefetch_v2”:{“content”:“s_i=q^\top k_i”}}

The score is large when $q$ and $k_i$ point in similar directions and have large norms. The score is small or negative when they point in unrelated or opposite directions.

This score measures compatibility. A compatible key receives more attention.

Scores for Multiple Keys

Suppose we have $T$ keys:

$$k_1, k_2, \ldots, k_T.$$

Stack them into a matrix:

$$K = \begin{bmatrix} - & k_1^\top & - \\ - & k_2^\top & - \\ & \vdots & \\ - & k_T^\top & - \end{bmatrix} \in \mathbb{R}^{T\times d_k}.$$

A single query $q\in\mathbb{R}^{d_k}$ compares with all keys by

$$s = Kq.$$

Equivalently, if we store the query as a row vector, the scores are

$$s = qK^\top.$$

The result is a vector

$$s\in\mathbb{R}^{T}.$$

Each entry $s_i$ is the dot product between the query and one key.

From Scores to Attention Weights

The raw scores are converted to probabilities with softmax:

$$\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{T}\exp(s_j)}.$$

The weights are nonnegative and sum to one:

$$\alpha_i \ge 0, \qquad \sum_{i=1}^{T}\alpha_i = 1.$$

The attention output is a weighted sum of values:

$$z = \sum_{i=1}^{T} \alpha_i v_i.$$

Here $v_i$ is the value vector associated with key $k_i$.
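As a concrete check, here is a minimal sketch of this single-query computation in PyTorch (the sizes and tensor names are illustrative, and no scaling is applied yet):

import torch

d_k, d_v, T = 4, 3, 5

q = torch.randn(d_k)       # one query
K = torch.randn(T, d_k)    # T keys stacked as rows
V = torch.randn(T, d_v)    # T values stacked as rows

s = K @ q                          # scores s_i = q . k_i, shape [T]
alpha = torch.softmax(s, dim=-1)   # attention weights, nonnegative, sum to 1
z = alpha @ V                      # weighted sum of values, shape [d_v]

print(alpha.sum())   # approximately 1
print(z.shape)       # torch.Size([3])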

Matrix Form

Dot-product attention is usually computed in matrix form.

Let

$$Q\in\mathbb{R}^{T_q\times d_k}$$

be a matrix of queries,

$$K\in\mathbb{R}^{T_k\times d_k}$$

be a matrix of keys, and

$$V\in\mathbb{R}^{T_k\times d_v}$$

be a matrix of values.

The score matrix is

$$S = QK^\top.$$

The shape is

$$S\in\mathbb{R}^{T_q\times T_k}.$$

Each row corresponds to one query. Each column corresponds to one key.

After row-wise softmax, the attention output is

$$Z = \operatorname{softmax}(QK^\top)V.$$

genui{“math_block_widget_always_prefetch_v2”:{“content”:“Z=\operatorname{softmax}(QK^\top)V”}}

The output has shape

$$Z\in\mathbb{R}^{T_q\times d_v}.$$

Each output row is a weighted combination of value vectors.
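As an unbatched sketch of the matrix form (sizes are illustrative; the scaling factor is introduced later):

import torch

T_q, T_k, d_k, d_v = 2, 5, 4, 3

Q = torch.randn(T_q, d_k)
K = torch.randn(T_k, d_k)
V = torch.randn(T_k, d_v)

S = Q @ K.T                          # score matrix, shape [T_q, T_k]
weights = torch.softmax(S, dim=-1)   # row-wise softmax
Z = weights @ V                      # output, shape [T_q, d_v]

print(S.shape, Z.shape)   # torch.Size([2, 5]) torch.Size([2, 3])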

Batch Form

In PyTorch, we normally use batched tensors. Suppose:

$$Q\in\mathbb{R}^{B\times T_q\times d_k}, \qquad K\in\mathbb{R}^{B\times T_k\times d_k}, \qquad V\in\mathbb{R}^{B\times T_k\times d_v}.$$

Then the score tensor is

$$S = QK^\top$$

with shape

$$[B, T_q, T_k].$$

In code:

scores = Q @ K.transpose(-2, -1)

The transpose swaps the last two dimensions of K, so the multiplication compares every query with every key.

Then:

weights = torch.softmax(scores, dim=-1)
Z = weights @ V

The resulting tensor Z has shape:

[B, T_q, d_v]

The last dimension of weights, namely T_k, is summed against the key sequence dimension of V.

Why Scaling Is Needed

Plain dot products grow in magnitude as the key dimension $d_k$ increases.

Assume the entries of $q$ and $k$ are independent random variables with mean 0 and variance 1. The dot product is

$$q^\top k = \sum_{r=1}^{d_k} q_r k_r.$$

Under these assumptions its variance is $d_k$, so typical score magnitudes grow like $\sqrt{d_k}$. Large dot products push the softmax into saturated regions. When softmax saturates, one position receives nearly all probability mass and gradients become small.

To control this effect, transformers use scaled dot-product attention:

$$S = \frac{QK^\top}{\sqrt{d_k}}.$$

genui{“math_block_widget_always_prefetch_v2”:{“content”:“S=\frac{QK^\top}{\sqrt{d_k}}”}}

The scale factor keeps score magnitudes more stable as the hidden dimension changes.
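A quick way to see the effect is to compare the spread of raw and scaled scores for random vectors; the following sketch (with illustrative sizes) typically shows the unscaled standard deviation growing like $\sqrt{d_k}$ while the scaled one stays near 1:

import torch

for d_k in [16, 256, 4096]:
    q = torch.randn(1000, d_k)
    k = torch.randn(1000, d_k)
    scores = (q * k).sum(dim=-1)      # 1000 independent dot products
    scaled = scores / d_k ** 0.5
    print(d_k, scores.std().item(), scaled.std().item())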

Scaled Dot-Product Attention

The full scaled dot-product attention formula is

$$\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$

genui{“math_block_widget_always_prefetch_v2”:{“content”:"\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V"}}

This is the core operation inside transformer attention layers.

The formula has three stages:

| Stage | Operation | Result |
| --- | --- | --- |
| Score | $QK^\top$ | Pairwise query-key similarities |
| Scale and normalize | $\operatorname{softmax}(S / \sqrt{d_k})$ | Attention weights |
| Retrieve | $\operatorname{softmax}(S)V$ | Weighted value combinations |

The softmax is applied row by row, so each query produces a probability distribution over keys.

Masks

Attention often needs masks.

A mask prevents attention to certain positions. Common cases include:

| Mask type | Purpose |
| --- | --- |
| Padding mask | Ignore padding tokens |
| Causal mask | Prevent attending to future tokens |
| Block mask | Restrict attention to a local or structured region |

A padding mask is needed because batches often contain sequences of different lengths. Shorter sequences are padded to match the longest sequence. The model should not treat padding tokens as real content.

A causal mask is used in autoregressive language modeling. Token $t$ may attend only to tokens at positions $1,\ldots,t$. It cannot attend to future tokens because those tokens should not be known during next-token prediction.
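A causal mask can be built from a lower-triangular matrix, with 1 marking allowed positions and 0 marking forbidden ones (a minimal sketch matching the masked_fill convention used below):

import torch

T = 5
causal_mask = torch.tril(torch.ones(T, T))   # 1 where the key position <= the query position
print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])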

In practice, masking is often implemented by adding a large negative value to forbidden score entries before softmax:

scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)

After softmax, forbidden positions receive attention weight near zero.

PyTorch Implementation

A minimal scaled dot-product attention function is:

import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: [B, T_q, d_k]
    K: [B, T_k, d_k]
    V: [B, T_k, d_v]
    mask: broadcastable to [B, T_q, T_k]
    """

    d_k = Q.size(-1)

    scores = Q @ K.transpose(-2, -1)
    scores = scores / math.sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))

    weights = torch.softmax(scores, dim=-1)

    output = weights @ V

    return output, weights

This code mirrors the mathematical definition.

The critical shape relation is:

[B, T_q, d_k] @ [B, d_k, T_k] -> [B, T_q, T_k]

Then:

[B, T_q, T_k] @ [B, T_k, d_v] -> [B, T_q, d_v]
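A quick usage check of the function above with random tensors (the sizes are illustrative):

B, T_q, T_k, d_k, d_v = 2, 4, 6, 8, 16

Q = torch.randn(B, T_q, d_k)
K = torch.randn(B, T_k, d_k)
V = torch.randn(B, T_k, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)

print(output.shape)          # torch.Size([2, 4, 16])
print(weights.shape)         # torch.Size([2, 4, 6])
print(weights.sum(dim=-1))   # each row sums to approximately 1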

Using PyTorch Built-In Attention

PyTorch includes optimized implementations of scaled dot-product attention.

A direct call can be written as:

import torch.nn.functional as F

Z = F.scaled_dot_product_attention(Q, K, V)

For causal attention:

Z = F.scaled_dot_product_attention(
    Q,
    K,
    V,
    is_causal=True,
)
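An explicit mask can also be passed through the attn_mask argument; one accepted form is a boolean mask, broadcastable to [B, T_q, T_k], with True marking positions that may be attended to. A minimal sketch, reusing Q, K, and V from above:

T_q, T_k = Q.size(-2), K.size(-2)
attn_mask = torch.tril(torch.ones(T_q, T_k, dtype=torch.bool))

Z = F.scaled_dot_product_attention(Q, K, V, attn_mask=attn_mask)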

This built-in function may use optimized kernels when available. In practice, it is usually preferable to the manual version for production training or inference.

The manual implementation remains useful for learning and debugging.

Numerical Stability

Attention uses softmax, and softmax can overflow if scores are too large. Implementations usually subtract the maximum score before exponentiation:

$$\operatorname{softmax}(s_i) = \frac{\exp(s_i - m)}{\sum_j \exp(s_j - m)}, \qquad m = \max_j s_j.$$

This transformation preserves the probabilities while improving numerical stability.

PyTorch’s torch.softmax already handles this internally in a stable way.

Masks also require care. If every position in a row is masked, softmax receives all negative infinity values and may produce NaN. Robust implementations avoid fully masked rows or define special behavior for them.
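A small sketch of the problem and one possible guard (zeroing out fully masked rows after softmax); the numbers here are illustrative:

import torch

scores = torch.tensor([[0.5, 1.2, -0.3],
                       [2.0, 0.1,  0.7]])
mask = torch.tensor([[1, 1, 0],
                     [0, 0, 0]])   # second row is fully masked

masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
print(weights)   # the fully masked row comes out as NaN

weights = torch.nan_to_num(weights, nan=0.0)   # one option: make that row contribute nothing
print(weights)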

Computational Cost

Standard dot-product attention forms a score matrix with shape

$$T_q\times T_k.$$

For self-attention, $T_q = T_k = T$, so the score matrix is $T \times T$ and contains $T^2$ entries.

The computational and memory cost grow quadratically with sequence length.

For a batch size $B$, number of heads $H$, and sequence length $T$, the attention weight tensor has shape:

$$[B, H, T, T].$$

This can become large for long-context models.

For example, if $B=1$, $H=32$, and $T=8192$, then the attention weight tensor contains

$$1 \cdot 32 \cdot 8192^2$$

entries. This is more than two billion values.
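In float32 this corresponds to roughly 8 GiB for the attention weights alone, as a back-of-the-envelope check shows:

B, H, T = 1, 32, 8192
entries = B * H * T * T     # 2,147,483,648 entries
bytes_fp32 = entries * 4    # 4 bytes per float32 value
print(bytes_fp32 / 2**30)   # 8.0 (GiB)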

This quadratic cost motivates efficient attention methods, including sliding-window attention, sparse attention, low-rank attention, memory-efficient attention kernels, and linear attention.

Dot-Product Attention Versus Additive Attention

Dot-product attention is less expressive as a scoring function than additive attention, but it is much faster on modern hardware.

| Property | Additive attention | Dot-product attention |
| --- | --- | --- |
| Score function | Learned nonlinear network | Inner product |
| Main operation | MLP over pairs | Matrix multiplication |
| Hardware efficiency | Lower | Higher |
| Common use | Early seq2seq models | Transformers |
| Scaling factor needed | No standard factor | Usually $1/\sqrt{d_k}$ |

The hardware efficiency matters. Matrix multiplication is one of the most optimized operations on GPUs and TPUs. Dot-product attention expresses pairwise comparison as matrix multiplication, making it suitable for large-scale training.

Interpretation

Dot-product attention retrieves values whose keys align with the query.

If a query vector represents “what information this token needs,” then each key represents “what information this token can offer.” The dot product measures compatibility. The values carry the content that will be mixed into the output.

In self-attention, every token produces its own query, key, and value. Thus each token asks a question, offers an address, and provides content.

This viewpoint is useful but approximate. Queries, keys, and values are learned internal representations, not human-readable records.

Summary

Dot-product attention computes attention scores by taking inner products between queries and keys. It then applies softmax and uses the resulting weights to combine values.

Scaled dot-product attention divides the scores by $\sqrt{d_k}$, which stabilizes softmax behavior for large key dimensions. Its matrix form is efficient and maps directly to optimized hardware kernels.

This mechanism is the foundation of transformer self-attention and cross-attention.