# Dot-Product Attention

Dot-product attention uses an inner product to measure how well a query matches a key. It is simpler than additive attention and maps efficiently to matrix multiplication. This efficiency is one reason it became the standard attention mechanism in transformers.

The mechanism follows the same retrieval pattern introduced earlier:

1. Compare queries with keys.
2. Normalize the comparison scores.
3. Use the resulting weights to combine values.

The difference is the scoring function. Dot-product attention uses vector similarity directly.

### Query-Key Similarity

Let

$$
q \in \mathbb{R}^{d_k}
$$

be a query vector, and let

$$
k_i \in \mathbb{R}^{d_k}
$$

be the key vector for item $i$.

Dot-product attention computes the score

$$
s_i = q^\top k_i.
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"s_i=q^\\top k_i"}}

The score is large when $q$ and $k_i$ point in similar directions and have large norms. The score is small or negative when they point in unrelated or opposite directions.

This score measures compatibility. A compatible key receives more attention.
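
A tiny numeric check, with made-up vectors, illustrates this:

```python
import torch

q = torch.tensor([1.0, 0.0, 1.0])             # query
k_similar = torch.tensor([1.0, 0.0, 1.0])     # points in the same direction
k_opposite = torch.tensor([-1.0, 0.0, -1.0])  # points the opposite way

print((q @ k_similar).item())   # 2.0: high compatibility
print((q @ k_opposite).item())  # -2.0: incompatible
```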

### Scores for Multiple Keys

Suppose we have $T$ keys:

$$
k_1, k_2, \ldots, k_T.
$$

Stack them into a matrix:

$$
K =
\begin{bmatrix}
- & k_1^\top & - \\
- & k_2^\top & - \\
& \vdots & \\
- & k_T^\top & -
\end{bmatrix}
\in \mathbb{R}^{T\times d_k}.
$$

A single query $q\in\mathbb{R}^{d_k}$ compares with all keys by

$$
s = Kq.
$$

Equivalently, if we store the query as a row vector, the scores are

$$
s = qK^\top.
$$

The result is a vector

$$
s\in\mathbb{R}^{T}.
$$

Each entry $s_i$ is the dot product between the query and one key.
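
In code, comparing one query against a stack of keys is a single matrix-vector product (shapes here are illustrative):

```python
import torch

torch.manual_seed(0)
T, d_k = 4, 3
K = torch.randn(T, d_k)   # one key per row
q = torch.randn(d_k)      # a single query

s = K @ q                 # score vector, shape [T]
assert s.shape == (T,)
# each entry equals the dot product between the query and one key
assert torch.allclose(s[0], K[0] @ q)
```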

### From Scores to Attention Weights

The raw scores are converted to probabilities with softmax:

$$
\alpha_i =
\frac{\exp(s_i)}
{\sum_{j=1}^{T}\exp(s_j)}.
$$

The weights are nonnegative and sum to one:

$$
\alpha_i \ge 0,
\qquad
\sum_{i=1}^{T}\alpha_i = 1.
$$

The attention output is a weighted sum of values:

$$
z = \sum_{i=1}^{T} \alpha_i v_i.
$$

Here $v_i$ is the value vector associated with key $k_i$.
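
The three steps for a single query can be sketched as follows (random tensors, illustrative shapes):

```python
import torch

torch.manual_seed(0)
T, d_k, d_v = 5, 4, 2
K = torch.randn(T, d_k)   # keys
V = torch.randn(T, d_v)   # values
q = torch.randn(d_k)      # query

s = K @ q                          # 1. compare query with keys
alpha = torch.softmax(s, dim=0)    # 2. normalize scores into weights
z = alpha @ V                      # 3. weighted sum of values, shape [d_v]

# weights are nonnegative and sum to one
assert torch.isclose(alpha.sum(), torch.tensor(1.0))
assert z.shape == (d_v,)
```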

### Matrix Form

Dot-product attention is usually computed in matrix form.

Let

$$
Q\in\mathbb{R}^{T_q\times d_k}
$$

be a matrix of queries,

$$
K\in\mathbb{R}^{T_k\times d_k}
$$

be a matrix of keys, and

$$
V\in\mathbb{R}^{T_k\times d_v}
$$

be a matrix of values.

The score matrix is

$$
S = QK^\top.
$$

The shape is

$$
S\in\mathbb{R}^{T_q\times T_k}.
$$

Each row corresponds to one query. Each column corresponds to one key.

After row-wise softmax, the attention output is

$$
Z = \operatorname{softmax}(QK^\top)V.
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"Z=\\operatorname{softmax}(QK^\\top)V"}}

The output has shape

$$
Z\in\mathbb{R}^{T_q\times d_v}.
$$

Each output row is a weighted combination of value vectors.
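
The matrix form translates directly into a few lines (a minimal unbatched sketch with illustrative shapes):

```python
import torch

torch.manual_seed(0)
T_q, T_k, d_k, d_v = 3, 5, 4, 2
Q = torch.randn(T_q, d_k)
K = torch.randn(T_k, d_k)
V = torch.randn(T_k, d_v)

S = Q @ K.T                        # [T_q, T_k] score matrix
Z = torch.softmax(S, dim=-1) @ V   # [T_q, d_v] output

assert S.shape == (T_q, T_k)
assert Z.shape == (T_q, d_v)
```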

### Batch Form

In PyTorch, we normally use batched tensors. Suppose:

$$
Q\in\mathbb{R}^{B\times T_q\times d_k},
$$

$$
K\in\mathbb{R}^{B\times T_k\times d_k},
$$

$$
V\in\mathbb{R}^{B\times T_k\times d_v}.
$$

Then the score tensor is

$$
S = QK^\top
$$

with shape

$$
[B, T_q, T_k].
$$

In code:

```python
scores = Q @ K.transpose(-2, -1)
```

The transpose swaps the last two dimensions of `K`, so the multiplication compares every query with every key.

Then:

```python
weights = torch.softmax(scores, dim=-1)
Z = weights @ V
```

The resulting tensor `Z` has shape:

```python
[B, T_q, d_v]
```

The last dimension of `weights`, namely `T_k`, is summed against the key sequence dimension of `V`.

### Why Scaling Is Needed

Plain dot products grow in magnitude as the key dimension $d_k$ increases.

Assume the entries of $q$ and $k$ are independent random variables with mean 0 and variance 1. The dot product is

$$
q^\top k = \sum_{r=1}^{d_k} q_r k_r.
$$

Under these assumptions its variance equals $d_k$, so typical score magnitudes grow like $\sqrt{d_k}$. Large dot products push the softmax into saturated regions. When softmax saturates, one position receives nearly all probability mass and gradients become small.

To control this effect, transformers use scaled dot-product attention:

$$
S = \frac{QK^\top}{\sqrt{d_k}}.
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"S=\\frac{QK^\\top}{\\sqrt{d_k}}"}}

The scale factor keeps score magnitudes more stable as the hidden dimension changes.
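
This effect is easy to verify empirically. The sketch below estimates the variance of raw and scaled dot products over random unit-variance vectors:

```python
import torch

torch.manual_seed(0)
n = 10_000  # number of sampled (q, k) pairs
for d_k in (16, 256):
    q = torch.randn(n, d_k)
    k = torch.randn(n, d_k)
    raw = (q * k).sum(dim=-1)     # unscaled dot products, variance ~ d_k
    scaled = raw / d_k ** 0.5     # scaled dot products, variance ~ 1
    print(d_k, raw.var().item(), scaled.var().item())
```

The raw variance tracks $d_k$, while the scaled variance stays near 1 regardless of dimension.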

### Scaled Dot-Product Attention

The full scaled dot-product attention formula is

$$
\operatorname{Attention}(Q,K,V) =
\operatorname{softmax}
\left(
\frac{QK^\top}{\sqrt{d_k}}
\right)V.
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"\\operatorname{Attention}(Q,K,V)=\\operatorname{softmax}\\left(\\frac{QK^\\top}{\\sqrt{d_k}}\\right)V"}}

This is the core operation inside transformer attention layers.

The formula has three stages:

| Stage | Operation | Result |
|---|---|---|
| Score | $S = QK^\top$ | Pairwise query-key similarities |
| Scale and normalize | $A = \operatorname{softmax}(S / \sqrt{d_k})$ | Attention weights |
| Retrieve | $Z = AV$ | Weighted value combinations |

The softmax is applied row by row, so each query produces a probability distribution over keys.

### Masks

Attention often needs masks.

A mask prevents attention to certain positions. Common cases include:

| Mask type | Purpose |
|---|---|
| Padding mask | Ignore padding tokens |
| Causal mask | Prevent attending to future tokens |
| Block mask | Restrict attention to a local or structured region |

A padding mask is needed because batches often contain sequences of different lengths. Shorter sequences are padded to match the longest sequence. The model should not treat padding tokens as real content.
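
For example, a key-padding mask can be built from the true sequence lengths (a sketch assuming right-padding; the name `lengths` is illustrative):

```python
import torch

T = 5
lengths = torch.tensor([5, 3, 4])  # true lengths of three sequences padded to T=5

# pad_mask[b, t] is True where token t is real content, False where it is padding
pad_mask = torch.arange(T) < lengths[:, None]   # shape [B, T]

# add a query dimension so the mask broadcasts against scores of shape [B, T_q, T_k]
key_mask = pad_mask[:, None, :]
```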

A causal mask is used in autoregressive language modeling. Token $t$ may attend only to tokens at positions $1,\ldots,t$. It cannot attend to future tokens because those tokens should not be known during next-token prediction.

In practice, masking is often implemented by adding a large negative value to forbidden score entries before softmax:

```python
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```

After softmax, forbidden positions receive attention weight near zero.
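
A causal mask can be built with a lower-triangular matrix (a minimal single-sequence sketch):

```python
import torch

T = 4
# causal_mask[t, s] = 1 where position t may attend to position s <= t
causal_mask = torch.tril(torch.ones(T, T))

scores = torch.randn(T, T)
scores = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)

# the first token can only attend to itself
assert weights[0, 1:].sum().item() == 0.0
# each row is still a valid probability distribution
assert torch.allclose(weights.sum(dim=-1), torch.ones(T))
```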

### PyTorch Implementation

A minimal scaled dot-product attention function is:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: [B, T_q, d_k]
    K: [B, T_k, d_k]
    V: [B, T_k, d_v]
    mask: broadcastable to [B, T_q, T_k]
    """

    d_k = Q.size(-1)

    scores = Q @ K.transpose(-2, -1)
    scores = scores / math.sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))

    weights = torch.softmax(scores, dim=-1)

    output = weights @ V

    return output, weights
```

This code mirrors the mathematical definition.

The critical shape relation is:

```python
[B, T_q, d_k] @ [B, d_k, T_k] -> [B, T_q, T_k]
```

Then:

```python
[B, T_q, T_k] @ [B, T_k, d_v] -> [B, T_q, d_v]
```

### Using PyTorch Built-In Attention

PyTorch includes optimized implementations of scaled dot-product attention.

A direct call can be written as:

```python
import torch.nn.functional as F

Z = F.scaled_dot_product_attention(Q, K, V)
```

For causal attention:

```python
Z = F.scaled_dot_product_attention(
    Q,
    K,
    V,
    is_causal=True,
)
```

This built-in function may use optimized kernels when available. In practice, it is usually preferable to the manual version for production training or inference.

The manual implementation remains useful for learning and debugging.
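
As a sanity check, the manual computation can be compared against the built-in version (assuming default arguments, i.e. no mask and no dropout):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, d_k, d_v = 2, 5, 8, 8
Q = torch.randn(B, T, d_k)
K = torch.randn(B, T, d_k)
V = torch.randn(B, T, d_v)

# manual scaled dot-product attention
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
manual = torch.softmax(scores, dim=-1) @ V

# PyTorch built-in
builtin = F.scaled_dot_product_attention(Q, K, V)

assert torch.allclose(manual, builtin, atol=1e-5)
```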

### Numerical Stability

Attention uses softmax, and softmax can overflow if scores are too large. Implementations usually subtract the maximum score before exponentiation:

$$
\operatorname{softmax}(s_i) =
\frac{\exp(s_i - m)}
{\sum_j \exp(s_j - m)},
\qquad
m = \max_j s_j.
$$

This transformation preserves the probabilities while improving numerical stability.

PyTorch’s `torch.softmax` already handles this internally in a stable way.

Masks also require care. If every position in a row is masked, softmax receives all negative infinity values and may produce `NaN`. Robust implementations avoid fully masked rows or define special behavior for them.
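
The failure mode is easy to reproduce (a minimal sketch; `nan_to_num` is one possible workaround, not the only one):

```python
import torch

scores = torch.full((1, 3), float("-inf"))  # a fully masked row
weights = torch.softmax(scores, dim=-1)
assert torch.isnan(weights).all()           # every entry is NaN

# one workaround: zero out rows that had no valid positions
weights = torch.nan_to_num(weights, nan=0.0)
assert weights.sum().item() == 0.0
```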

### Computational Cost

Standard dot-product attention forms a score matrix with shape

$$
T_q\times T_k.
$$

For self-attention, $T_q=T_k=T$, so the score matrix has size

$$
T^2.
$$

The computational and memory cost grow quadratically with sequence length.

For a batch size $B$, number of heads $H$, and sequence length $T$, the attention weight tensor has shape:

$$
[B, H, T, T].
$$

This can become large for long-context models.

For example, if $B=1$, $H=32$, and $T=8192$, then the attention weight tensor contains:

$$
1 \cdot 32 \cdot 8192^2
$$

entries. This is more than two billion values.
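
The count can be checked directly (the float16 storage figure is an illustrative assumption):

```python
# number of attention-weight entries for B=1, H=32, T=8192
entries = 1 * 32 * 8192 ** 2
print(entries)                # 2147483648

# at 2 bytes per float16 entry, that is 4 GiB for the weights alone
print(entries * 2 / 2 ** 30)  # 4.0
```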

This quadratic cost motivates efficient attention methods, including sliding-window attention, sparse attention, low-rank attention, memory-efficient attention kernels, and linear attention.

### Dot-Product Attention Versus Additive Attention

Dot-product attention is less expressive as a scoring function than additive attention, but it is much faster on modern hardware.

| Property | Additive attention | Dot-product attention |
|---|---|---|
| Score function | Learned nonlinear network | Inner product |
| Main operation | MLP over pairs | Matrix multiplication |
| Hardware efficiency | Lower | Higher |
| Common use | Early seq2seq models | Transformers |
| Scaling factor needed | No standard factor | Usually $1/\sqrt{d_k}$ |

The hardware efficiency matters. Matrix multiplication is one of the most optimized operations on GPUs and TPUs. Dot-product attention expresses pairwise comparison as matrix multiplication, making it suitable for large-scale training.

### Interpretation

Dot-product attention retrieves values whose keys align with the query.

If a query vector represents “what information this token needs,” then each key represents “what information this token can offer.” The dot product measures compatibility. The values carry the content that will be mixed into the output.

In self-attention, every token produces its own query, key, and value. Thus each token asks a question, offers an address, and provides content.

This viewpoint is useful but approximate. Queries, keys, and values are learned internal representations, not human-readable records.

### Summary

Dot-product attention computes attention scores by taking inner products between queries and keys. It then applies softmax and uses the resulting weights to combine values.

Scaled dot-product attention divides the scores by $\sqrt{d_k}$, which stabilizes softmax behavior for large key dimensions. Its matrix form is efficient and maps directly to optimized hardware kernels.

This mechanism is the foundation of transformer self-attention and cross-attention.

