Attention is a differentiable retrieval mechanism. A query asks for information, keys define where information can be found, and values carry the content returned to the model. The output is a weighted combination of values, where the weights are computed from query-key compatibility through learned projections.
The chapter began with the motivation for attention: fixed-vector sequence representations are too restrictive for many tasks. Attention removes this bottleneck by giving each output position direct access to relevant input positions.
Additive attention computes compatibility with a learned nonlinear scoring function. Dot-product attention computes compatibility with inner products and is much more efficient on modern accelerators. Scaled dot-product attention divides the inner products by the square root of the key dimension so the softmax stays well-conditioned as dimensionality grows; it became the standard transformer mechanism because it combines mathematical simplicity with hardware efficiency.
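As a minimal sketch of the two scoring rules on a single query-key pair (the dimension `d` and the projection names here are illustrative, not the chapter's notation):

```python
import math
import torch
import torch.nn as nn

d = 64
q, k = torch.randn(d), torch.randn(d)

# Additive (Bahdanau-style) score: a small learned network judges compatibility.
W_q, W_k = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
v = nn.Linear(d, 1, bias=False)
additive_score = v(torch.tanh(W_q(q) + W_k(k)))   # scalar score

# Scaled dot-product score: a single inner product, divided by sqrt(d).
dot_score = (q @ k) / math.sqrt(d)                 # scalar score
```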
Self-attention applies attention within one sequence. Each token can read from other tokens and produce a contextualized representation. Cross-attention applies attention across two sources. One sequence supplies queries, while another supplies keys and values. Multi-head attention runs several attention mechanisms in parallel, allowing the model to learn multiple interaction patterns.
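A minimal multi-head self-attention sketch, under assumed toy dimensions (all names here are illustrative); for cross-attention, the keys and values would simply be projected from a second sequence:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D, H = 2, 16, 64, 4            # batch, sequence length, model dim, heads
X = torch.randn(B, T, D)
qkv_proj = nn.Linear(D, 3 * D)
out_proj = nn.Linear(D, D)

# Project once, then split into Q, K, V and reshape into H heads of size D // H.
q, k, v = qkv_proj(X).chunk(3, dim=-1)                                   # each [B, T, D]
q, k, v = (t.view(B, T, H, D // H).transpose(1, 2) for t in (q, k, v))   # [B, H, T, D/H]

# Each head attends independently; heads are then concatenated and mixed.
heads = F.scaled_dot_product_attention(q, k, v)                          # [B, H, T, D/H]
Z = out_proj(heads.transpose(1, 2).reshape(B, T, D))                     # [B, T, D]
```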
The main limitation is cost. Standard self-attention forms a score matrix, so compute and memory grow quadratically with sequence length. This motivates FlashAttention, sparse attention, sliding-window attention, low-rank attention, linear attention, retrieval, and hybrid recurrent-attention models.
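A rough, purely illustrative estimate (assumed batch size and head count) shows how the float32 score matrix alone grows with the square of the sequence length:

```python
# Memory for the [B, H, T, T] float32 attention-score matrix alone (illustrative sizes).
B, H = 1, 8
for T in (1_024, 8_192, 65_536):
    gib = B * H * T * T * 4 / 2**30
    print(f"T = {T:>6}: {gib:10.2f} GiB")   # 0.03, 2.00, 128.00 -> quadratic in T
```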
A compact view of the chapter is:
| Section | Main idea |
|---|---|
| 20.1 Motivation | Attention solves selective access and fixed-vector bottlenecks |
| 20.2 Additive attention | A learned neural score function computes alignment |
| 20.3 Dot-product attention | Inner products provide efficient query-key matching |
| 20.4 Self-attention | Positions inside one sequence exchange information |
| 20.5 Cross-attention | One sequence reads from another sequence |
| 20.6 Multi-head attention | Several attention patterns run in parallel |
| 20.7 Complexity | Standard attention has quadratic sequence cost |
Further Reading
The most important papers for this chapter are:
| Work | Contribution |
|---|---|
| Bahdanau, Cho, and Bengio, 2015 | Introduced additive attention for neural machine translation |
| Luong, Pham, and Manning, 2015 | Developed effective attention variants for translation |
| Vaswani et al., 2017 | Introduced the Transformer and scaled dot-product multi-head attention |
| Devlin et al., 2018 | Applied bidirectional transformer encoders to language understanding |
| Radford et al., 2018 to 2019 | Popularized decoder-only transformer language models |
| Dosovitskiy et al., 2020 | Applied transformer self-attention to image patches |
| Dao et al., 2022 | Introduced FlashAttention for memory-efficient exact attention |
Practical Notes
When implementing attention in PyTorch, always track shapes. Most attention bugs are shape, mask, or broadcasting errors.
The standard self-attention flow is:

```python
import math
import torch

# Assumes X: [B, T, D] and learned projection layers q_proj, k_proj, v_proj
# (e.g. nn.Linear) mapping D to d_k, d_k, and d_v respectively.
Q = q_proj(X)                              # [B, T, d_k]
K = k_proj(X)                              # [B, T, d_k]
V = v_proj(X)                              # [B, T, d_v]
scores = Q @ K.transpose(-2, -1)           # [B, T, T]
scores = scores / math.sqrt(d_k)           # scale to keep softmax inputs well-conditioned
weights = torch.softmax(scores, dim=-1)    # [B, T, T], each row sums to 1
Z = weights @ V                            # [B, T, d_v]
```

For production code, prefer PyTorch's optimized attention kernels where possible:
```python
import torch.nn.functional as F

Z = F.scaled_dot_product_attention(Q, K, V)
```

The manual version is useful for study. The built-in version is usually better for performance.
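Masking is one of the most common sources of attention bugs. For autoregressive models, a causal mask can be requested from the built-in kernel or applied by hand before the softmax; the sketch below assumes the Q, K, V, and scores tensors from the snippets above:

```python
# Built-in kernel: each position attends only to itself and earlier positions.
Z = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

# Manual path: mask out future positions before the softmax.
T = scores.size(-1)
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```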
Chapter 21 builds on these ideas and studies complete transformer architectures: encoder blocks, decoder blocks, positional encoding, feedforward networks, residual paths, normalization, and scaling behavior.