## Motivation for Attention

Sequence models often need to decide which parts of an input are relevant to a particular output. When translating a sentence, for example, each output word typically depends on only a few input words.
## Additive Attention

Additive attention was one of the first successful neural attention mechanisms. It was introduced for neural machine translation (Bahdanau et al., 2015) to allow a decoder to selectively focus on different encoder states during generation. It scores a query against each key with a small feed-forward network: score(q, k) = v^T tanh(W_q q + W_k k).
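A minimal NumPy sketch of that scoring function for a single decoder query; the function name and shapes are my own, and the randomly initialized `W_q`, `W_k`, and `v` are illustrative stand-ins for learned parameters:

```python
import numpy as np

def additive_attention(query, keys, values, W_q, W_k, v):
    """Additive (Bahdanau-style) attention for a single query.

    query:  (d_q,)    decoder state
    keys:   (n, d_k)  encoder states used for scoring
    values: (n, d_v)  encoder states used for the output
    """
    # Project query and keys into a shared space, add, and squash.
    hidden = np.tanh(query @ W_q + keys @ W_k)   # (n, d_h)
    scores = hidden @ v                          # (n,)
    # Softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The context vector is the weighted sum of the values.
    return weights @ values, weights

# Toy usage; encoder states serve as both keys and values here.
rng = np.random.default_rng(0)
d_q, d_k, d_h, n = 8, 8, 16, 5
W_q = rng.normal(size=(d_q, d_h))
W_k = rng.normal(size=(d_k, d_h))
v = rng.normal(size=(d_h,))
query = rng.normal(size=(d_q,))
keys = rng.normal(size=(n, d_k))
context, weights = additive_attention(query, keys, keys, W_q, W_k, v)
print(weights)  # sums to 1 across the 5 encoder positions
```

The name "additive" refers to the sum inside the tanh; because the scorer is a small learned network, queries and keys may even have different dimensions.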
## Dot-Product Attention

Dot-product attention uses an inner product to measure how well a query matches a key. In the scaled variant, scores are divided by the square root of the key dimension before the softmax so that their magnitude does not grow with that dimension.
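A sketch of the scaled variant in NumPy; the function name and shapes are my own, but the computation follows the standard softmax(QK^T / sqrt(d_k))V form:

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (n_q, d_k)  queries
    K: (n_k, d_k)  keys
    V: (n_k, d_v)  values
    """
    d_k = Q.shape[-1]
    # Inner product of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k)
    # Softmax over the key axis turns scores into weights.
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # (n_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 4))
print(dot_product_attention(Q, K, V).shape)  # (3, 4)
```

Because the score is a single matrix product, this form is much cheaper on modern hardware than the feed-forward scorer of additive attention.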
## Self-Attention

Self-attention is attention applied within a single sequence: every position produces its own query, key, and value, and each position attends over all positions of the same sequence, including itself.
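The same computation as above, specialized to the self-attention case: one sequence `X` supplies the queries, keys, and values through three projections. The projection matrices here are random stand-ins for learned weights:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Self-attention: queries, keys, and values all come from X.

    X: (n, d_model)  a single input sequence
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # (n, n)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model, d_head, n = 16, 8, 6
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (6, 8): one output per position
```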
## Cross-Attention

Cross-attention is attention between two different sequences or sources of information. The queries come from one sequence, while the keys and values come from another.
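A sketch of cross-attention under the same assumptions, with queries drawn from a hypothetical decoder sequence and keys and values from an encoder sequence:

```python
import numpy as np

def cross_attention(X_q, X_kv, W_q, W_k, W_v):
    """Cross-attention: queries from one sequence, keys/values from another.

    X_q:  (n_q, d_model)   e.g. decoder states
    X_kv: (n_kv, d_model)  e.g. encoder states
    """
    Q, K, V = X_q @ W_q, X_kv @ W_k, X_kv @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # (n_q, n_kv)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model, d_head = 16, 8
X_dec, X_enc = rng.normal(size=(4, d_model)), rng.normal(size=(7, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(cross_attention(X_dec, X_enc, W_q, W_k, W_v).shape)  # (4, 8)
```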
## Multi-Head Attention

Multi-head attention runs several attention operations in parallel. Each head has its own query, key, and value projections. The outputs of the heads are concatenated and projected back to the model dimension.
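A compact NumPy sketch of multi-head self-attention, assuming the per-head projections are stacked into single d_model-by-d_model matrices; names and shapes are illustrative:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention.

    X:             (n, d_model)
    W_q, W_k, W_v: (d_model, d_model)  per-head projections stacked together
    W_o:           (d_model, d_model)  output projection
    """
    n, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then split the last axis into heads: (n_heads, n, d_head).
    def split(M):
        return (X @ M).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    # Scaled dot-product attention per head, batched over the head axis.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                   # (n_heads, n, d_head)
    # Concatenate the heads and project back to the model dimension.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)  # (6, 16)
```

Splitting d_model across heads keeps the total cost close to that of a single full-width head while letting each head compute its own attention pattern.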
## Attention Complexity

Attention gives a model direct access between any pair of positions in a sequence, but that access has a price: the score matrix holds one entry per query-key pair, so time and memory grow quadratically, O(n²), in the sequence length n.
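A back-of-envelope illustration of that quadratic growth, using my own example sequence lengths and float32 scores:

```python
# Size of the attention score matrix per head, in float32,
# for a few illustrative sequence lengths.
for n in (512, 4096, 32768):
    entries = n * n                  # one score per query-key pair
    megabytes = entries * 4 / 1e6    # 4 bytes per float32
    print(f"n={n:>6}: {entries:>13,} scores, ~{megabytes:,.0f} MB")
# n=   512:       262,144 scores, ~1 MB
# n=  4096:    16,777,216 scores, ~67 MB
# n= 32768: 1,073,741,824 scores, ~4,295 MB
```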
## Summary and Further Reading

Attention is a differentiable retrieval mechanism. A query asks for information, keys define where information can be found, and values carry the content returned to the model.