Attention Mechanisms

Attention is a method for letting a model choose which parts of an input are most relevant when producing an output.
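This selection can be sketched as scaled dot-product attention: a query is compared against a set of keys, the comparison scores are turned into weights with a softmax, and the output is the weighted sum of the values. A minimal NumPy sketch (the specific vectors below are illustrative, not from the text):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) @ V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (num_queries, num_keys) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows sum to 1
    return weights @ V                             # weighted sum of values

# One query attending over three key/value pairs.
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0], [2.0], [3.0]])
out = attention(Q, K, V)
```

The query matches the first and third keys equally well, so the output blends their values more heavily than the second.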
Self-Attention

Self-attention is attention applied within a single sequence. The same input supplies the queries, keys, and values. Each position builds a new representation by reading from other positions in the same sequence.
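Concretely, the queries, keys, and values are all projections of the same input matrix. A sketch with random stand-ins for the learned projection matrices (the shapes and seed below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                        # sequence length, model dimension
X = rng.normal(size=(T, d))        # one input sequence

# Queries, keys, and values all come from the same X,
# via separate projection matrices (random stand-ins for learned weights).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                  # each row mixes information from all positions
```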
Multi-Head Attention

Multi-head attention runs several attention operations in parallel. Each head can attend to a different aspect of the input; the head outputs are concatenated to form the final representation.
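One way to sketch this is to split the model dimension into equal slices, attend within each slice independently, and concatenate. This simplified version omits the per-head learned projections a real implementation would use:

```python
import numpy as np

def multi_head_attention(X, num_heads):
    """Split features into heads, attend per head, concatenate the results.
    Simplified: per-head query/key/value projections are omitted."""
    T, d = X.shape
    assert d % num_heads == 0
    dh = d // num_heads
    # (num_heads, T, dh): each head sees its own slice of the features.
    heads = X.reshape(T, num_heads, dh).transpose(1, 0, 2)
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(dh)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = w @ heads                                # (num_heads, T, dh)
    return out.transpose(1, 0, 2).reshape(T, d)    # concatenate heads
```

Because each head works in a lower-dimensional slice, the total cost is close to that of one full-width attention operation.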
Positional Encoding

Self-attention compares tokens by content. By itself, it has no built-in notion of token order. Positional encodings address this by adding position-dependent signals to the token embeddings, so that the model can distinguish where each token sits in the sequence.
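A common choice is the sinusoidal encoding, where each position is mapped to sines and cosines of geometrically spaced frequencies. A sketch:

```python
import numpy as np

def sinusoidal_encoding(T, d):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))"""
    pos = np.arange(T)[:, None]                 # (T, 1) positions
    i = np.arange(d // 2)[None, :]              # (1, d/2) frequency indices
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)                # even feature dims
    pe[:, 1::2] = np.cos(angles)                # odd feature dims
    return pe
```

The encoding is simply added to the token embeddings before the first attention layer, giving otherwise identical tokens distinct, position-dependent representations.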
Transformer Encoders

A transformer encoder is a stack of layers that maps an input sequence to a contextual sequence representation. Each layer combines self-attention with a position-wise feed-forward network, wrapping both in residual connections and layer normalization.
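A single encoder layer can be sketched as follows. This is a simplification: the attention projections are omitted, and the layer norm has no learned scale or shift:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean, unit variance (no learned scale/shift)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_layer(X, W1, W2):
    """One encoder layer: self-attention, then a position-wise feed-forward
    network, each wrapped in a residual connection and layer norm."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)              # self-attention (projections omitted)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    X = layer_norm(X + w @ X)                  # residual + norm around attention
    ffn = np.maximum(X @ W1, 0) @ W2           # two-layer ReLU feed-forward network
    return layer_norm(X + ffn)                 # residual + norm around the FFN
```

Stacking several such layers, each reading the previous layer's output, yields the full encoder.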
Transformer Decoders

A transformer decoder maps a partial output sequence to predictions for the next token or next output step. Its self-attention is masked so that each position can only attend to earlier positions, which keeps predictions from depending on tokens that have not been produced yet.
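The mask can be implemented by setting the scores for future positions to negative infinity before the softmax, so their weights become zero. A sketch (again with the learned projections omitted):

```python
import numpy as np

def causal_self_attention(X):
    """Masked self-attention: position t attends only to positions <= t."""
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # block future positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                     # future weights are exactly 0
    return w @ X
```

The first position can attend only to itself, so its output is unchanged; later positions mix in progressively more context.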
Efficient Attention Methods

Standard self-attention compares every token with every other token. For a sequence of length $T$, this produces a $T \times T$ attention matrix, so the cost grows quadratically with sequence length. Efficient attention methods reduce this cost, for example by restricting which token pairs are compared or by approximating the softmax.
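One family of approximations replaces the softmax with a positive feature map, so the $T \times T$ matrix is never formed. A sketch of this "linear attention" idea, using an elu-plus-one feature map as an illustrative choice:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized attention sketch: approximate softmax(Q K^T) V with
    phi(Q) (phi(K)^T V), where phi is a positive feature map.
    Computing phi(K)^T V first costs O(T d^2) instead of O(T^2 d)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, always positive
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                    # (d, d_v) summary, size independent of T
    Z = Qf @ Kf.sum(axis=0)          # per-query normalizer
    return (Qf @ KV) / Z[:, None]
```

Because the feature map is positive, the implied attention weights are nonnegative and normalized, but they only approximate the softmax pattern; the trade-off is linear cost in $T$ against some loss of fidelity.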