Attention is a differentiable retrieval mechanism. A query asks for information, keys define where information can be found, and values carry the content returned to the model. The output is a weighted combination of values, where the weights are computed from query-key compatibility through learned projections.
The chapter began with the motivation for attention: fixed-vector sequence representations are too restrictive for many tasks. Attention removes this bottleneck by giving each output position direct access to relevant input positions.
Additive attention computes compatibility with a learned nonlinear scoring function. Dot-product attention computes compatibility with inner products and is much more efficient on modern accelerators. Scaled dot-product attention divides the inner products by the square root of the key dimension so the softmax stays well-conditioned as dimensionality grows; it became the standard transformer mechanism because it combines mathematical simplicity with hardware efficiency.
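As a minimal sketch of the two scoring rules on a single query-key pair (the dimension `d` and the projection names here are illustrative, not the chapter's notation):

```python
import math
import torch
import torch.nn as nn

d = 64
q, k = torch.randn(d), torch.randn(d)

# Additive (Bahdanau-style) score: a small learned network judges compatibility.
W_q, W_k = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
v = nn.Linear(d, 1, bias=False)
additive_score = v(torch.tanh(W_q(q) + W_k(k)))   # scalar score

# Scaled dot-product score: a single inner product, divided by sqrt(d).
dot_score = (q @ k) / math.sqrt(d)                 # scalar score
```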
Self-attention applies attention within one sequence. Each token can read from other tokens and produce a contextualized representation. Cross-attention applies attention across two sources. One sequence supplies queries, while another supplies keys and values. Multi-head attention runs several attention mechanisms in parallel, allowing the model to learn multiple interaction patterns.
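A minimal multi-head self-attention sketch, under assumed toy dimensions (all names here are illustrative); for cross-attention, the keys and values would simply be projected from a second sequence:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D, H = 2, 16, 64, 4            # batch, sequence length, model dim, heads
X = torch.randn(B, T, D)
qkv_proj = nn.Linear(D, 3 * D)
out_proj = nn.Linear(D, D)

# Project once, then split into Q, K, V and reshape into H heads of size D // H.
q, k, v = qkv_proj(X).chunk(3, dim=-1)                                   # each [B, T, D]
q, k, v = (t.view(B, T, H, D // H).transpose(1, 2) for t in (q, k, v))   # [B, H, T, D/H]

# Each head attends independently; heads are then concatenated and mixed.
heads = F.scaled_dot_product_attention(q, k, v)                          # [B, H, T, D/H]
Z = out_proj(heads.transpose(1, 2).reshape(B, T, D))                     # [B, T, D]
```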
The main limitation is cost. Standard self-attention forms a score matrix, so compute and memory grow quadratically with sequence length. This motivates FlashAttention, sparse attention, sliding-window attention, low-rank attention, linear attention, retrieval, and hybrid recurrent-attention models.
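A rough, purely illustrative estimate (assumed batch size and head count) shows how the float32 score matrix alone grows with the square of the sequence length:

```python
# Memory for the [B, H, T, T] float32 attention-score matrix alone (illustrative sizes).
B, H = 1, 8
for T in (1_024, 8_192, 65_536):
    gib = B * H * T * T * 4 / 2**30
    print(f"T = {T:>6}: {gib:10.2f} GiB")   # 0.03, 2.00, 128.00 -> quadratic in T
```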
A compact view of the chapter is:
| Section | Main idea |
|---|---|
| 20.1 Motivation | Attention solves selective access and fixed-vector bottlenecks |
| 20.2 Additive attention | A learned neural score function computes alignment |
| 20.3 Dot-product attention | Inner products provide efficient query-key matching |
| 20.4 Self-attention | Positions inside one sequence exchange information |
| 20.5 Cross-attention | One sequence reads from another sequence |
| 20.6 Multi-head attention | Several attention patterns run in parallel |
| 20.7 Complexity | Standard attention has quadratic sequence cost |
Further Reading
The most important papers for this chapter are:
| Work | Contribution |
|---|---|
| Bahdanau, Cho, and Bengio, 2015 | Introduced additive attention for neural machine translation |
| Luong, Pham, and Manning, 2015 | Developed effective attention variants for translation |
| Vaswani et al., 2017 | Introduced the Transformer and scaled dot-product multi-head attention |
| Devlin et al., 2018 | Applied bidirectional transformer encoders to language understanding |
| Radford et al., 2018 to 2019 | Popularized decoder-only transformer language models |
| Dosovitskiy et al., 2020 | Applied transformer self-attention to image patches |
| Dao et al., 2022 | Introduced FlashAttention for memory-efficient exact attention |
Practical Notes
When implementing attention in PyTorch, always track shapes. Most attention bugs are shape, mask, or broadcasting errors.
The standard self-attention flow is:

```python
import math
import torch

# Assumes X: [B, T, D] and learned projection layers q_proj, k_proj, v_proj
# (e.g. nn.Linear) mapping D to d_k, d_k, and d_v respectively.
Q = q_proj(X)                              # [B, T, d_k]
K = k_proj(X)                              # [B, T, d_k]
V = v_proj(X)                              # [B, T, d_v]
scores = Q @ K.transpose(-2, -1)           # [B, T, T]
scores = scores / math.sqrt(d_k)           # scale to keep softmax inputs well-conditioned
weights = torch.softmax(scores, dim=-1)    # [B, T, T], each row sums to 1
Z = weights @ V                            # [B, T, d_v]
```

For production code, prefer PyTorch's optimized attention kernels where possible:
```python
import torch.nn.functional as F

Z = F.scaled_dot_product_attention(Q, K, V)
```

The manual version is useful for study. The built-in version is usually better for performance.
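Masking is one of the most common sources of attention bugs. For autoregressive models, a causal mask can be requested from the built-in kernel or applied by hand before the softmax; the sketch below assumes the Q, K, V, and scores tensors from the snippets above:

```python
# Built-in kernel: each position attends only to itself and earlier positions.
Z = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

# Manual path: mask out future positions before the softmax.
T = scores.size(-1)
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```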
Chapter 21 builds on these ideas and studies complete transformer architectures: encoder blocks, decoder blocks, positional encoding, feedforward networks, residual paths, normalization, and scaling behavior.