Summary and Further Reading

Attention is a differentiable retrieval mechanism. A query asks for information, keys define where information can be found, and values carry the content returned to the model. The output is a weighted combination of values, where the weights are learned from query-key compatibility.

The chapter began with the motivation for attention: fixed-vector sequence representations are too restrictive for many tasks. Attention removes this bottleneck by giving each output position direct access to relevant input positions.

Additive attention computes compatibility with a learned nonlinear scoring function. Dot-product attention computes compatibility with inner products and is much more efficient on modern accelerators. Scaled dot-product attention became the standard transformer mechanism because it combines mathematical simplicity with hardware efficiency.
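
In symbols, with $Q$, $K$, and $V$ the stacked query, key, and value matrices and $d_k$ the key dimension, scaled dot-product attention is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where the softmax is applied row-wise, so each output is a weighted combination of value vectors.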

Self-attention applies attention within one sequence. Each token can read from other tokens and produce a contextualized representation. Cross-attention applies attention across two sources. One sequence supplies queries, while another supplies keys and values. Multi-head attention runs several attention mechanisms in parallel, allowing the model to learn multiple interaction patterns.
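
As a small illustration, cross-attention can be exercised directly with PyTorch's built-in nn.MultiheadAttention; the tensors enc and dec below are illustrative stand-ins for encoder and decoder states:

import torch
import torch.nn as nn

# Cross-attention: one sequence (dec) supplies queries,
# another (enc) supplies keys and values.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

dec = torch.randn(2, 10, 64)  # queries:          [B, T_dec, D]
enc = torch.randn(2, 20, 64)  # keys and values:  [B, T_enc, D]

out, weights = attn(query=dec, key=enc, value=enc)
print(out.shape)      # torch.Size([2, 10, 64])
print(weights.shape)  # torch.Size([2, 10, 20]), averaged over heads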

The main limitation is cost. Standard self-attention forms a $T \times T$ score matrix, so compute and memory grow quadratically with sequence length. This cost motivates FlashAttention, sparse attention, sliding-window attention, low-rank attention, linear attention, retrieval-based approaches, and hybrid recurrent-attention models.
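
To make the quadratic cost concrete, here is a back-of-the-envelope calculation of the fp32 memory needed for a single head's score matrix at a few sequence lengths:

# One T x T score matrix in fp32 takes 4 * T^2 bytes.
for T in (1_024, 8_192, 65_536):
    gib = T * T * 4 / 2**30
    print(f"T={T:>6}: {gib:.2f} GiB")
# T=  1024: 0.00 GiB
# T=  8192: 0.25 GiB
# T= 65536: 16.00 GiB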

A compact view of the chapter is:

| Section | Main idea |
| --- | --- |
| 20.1 Motivation | Attention solves selective access and fixed-vector bottlenecks |
| 20.2 Additive attention | A learned neural score function computes alignment |
| 20.3 Dot-product attention | Inner products provide efficient query-key matching |
| 20.4 Self-attention | Positions inside one sequence exchange information |
| 20.5 Cross-attention | One sequence reads from another sequence |
| 20.6 Multi-head attention | Several attention patterns run in parallel |
| 20.7 Complexity | Standard attention has quadratic sequence cost |

Further Reading

The most important papers for this chapter are:

| Work | Contribution |
| --- | --- |
| Bahdanau, Cho, and Bengio, 2015 | Introduced additive attention for neural machine translation |
| Luong, Pham, and Manning, 2015 | Developed effective attention variants for translation |
| Vaswani et al., 2017 | Introduced the Transformer and scaled dot-product multi-head attention |
| Devlin et al., 2018 | Applied bidirectional transformer encoders to language understanding |
| Radford et al., 2018 and 2019 | Popularized decoder-only transformer language models |
| Dosovitskiy et al., 2020 | Applied transformer self-attention to image patches |
| Dao et al., 2022 | Introduced FlashAttention for memory-efficient exact attention |

Practical Notes

When implementing attention in PyTorch, track tensor shapes at every step: most attention bugs are shape, mask, or broadcasting errors.

The standard self-attention flow is:

import math
import torch
import torch.nn as nn

B, T, D = 2, 16, 64   # batch size, sequence length, model dimension
d_k, d_v = 64, 64     # query/key and value dimensions
X = torch.randn(B, T, D)  # X: [B, T, D]

q_proj = nn.Linear(D, d_k)
k_proj = nn.Linear(D, d_k)
v_proj = nn.Linear(D, d_v)

Q = q_proj(X)  # [B, T, d_k]
K = k_proj(X)  # [B, T, d_k]
V = v_proj(X)  # [B, T, d_v]

scores = Q @ K.transpose(-2, -1)  # [B, T, T]
scores = scores / math.sqrt(d_k)  # scale to keep the softmax well-behaved

weights = torch.softmax(scores, dim=-1)  # [B, T, T], rows sum to 1
Z = weights @ V                          # [B, T, d_v]
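
For decoder-style (causal) attention, a mask is applied to the scores before the softmax so that each position reads only from itself and earlier positions. A minimal sketch, continuing from the scores tensor above:

# Boolean mask that is True strictly above the diagonal (future positions).
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # broadcasts over the batch dim
weights = torch.softmax(scores, dim=-1)           # masked positions get weight 0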

For production code, prefer PyTorch’s optimized attention kernels where possible:

import torch.nn.functional as F

Z = F.scaled_dot_product_attention(Q, K, V)
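
Note that scaled_dot_product_attention applies the $1/\sqrt{d_k}$ scaling internally, so it should receive the unscaled Q, K, and V. It also accepts an is_causal flag for causal masking and dispatches to fused kernels such as FlashAttention when they are available.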

The manual version is useful for study; the built-in version is usually faster and more memory-efficient in practice.

Chapter 21 builds on these ideas and studies complete transformer architectures: encoder blocks, decoder blocks, positional encoding, feedforward networks, residual paths, normalization, and scaling behavior.