Transformer Encoders

A transformer encoder is a neural network block that maps a sequence of input vectors to a sequence of contextualized output vectors.
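As a concrete illustration, here is a minimal encoder block sketched in PyTorch. The dimensions (`d_model`, `n_heads`, `d_ff`) and the post-LayerNorm arrangement are illustrative assumptions, not prescribed by the text:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: self-attention + feedforward, each wrapped
    in a residual connection and LayerNorm (illustrative sizes)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every position attends to every other position (bidirectional).
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)   # residual + norm (post-LN variant)
        x = self.norm2(x + self.ff(x))
        return x

x = torch.randn(2, 10, 512)    # (batch, seq_len, d_model)
y = EncoderBlock()(x)          # same shape, now contextualized
print(y.shape)                 # torch.Size([2, 10, 512])
```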
Transformer Decoders

A transformer decoder is a neural network block that maps a prefix sequence to a sequence of next-token representations. It is used when the model must generate output one step at a time.
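The distinguishing ingredient is the causal mask. A minimal decoder-only sketch, again with assumed dimensions, showing how position $t$ is prevented from attending to later positions:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Decoder-only block: causally masked self-attention + feedforward
    (no cross-attention; sizes are illustrative)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True = blocked. Position t may attend only to
        # positions <= t, so each output is a next-token representation.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x
```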
Positional Encoding

Self-attention compares tokens to other tokens, but by itself it has no built-in notion of order. Positional encodings inject order information so the model can distinguish different orderings of the same tokens.
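One standard remedy is the fixed sinusoidal encoding of Vaswani et al. (2017), added to the token embeddings. A small sketch (the function name is ours, and it assumes an even `d_model`):

```python
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal encoding:
    PE[t, 2i]   = sin(t / 10000^(2i / d_model))
    PE[t, 2i+1] = cos(t / 10000^(2i / d_model))
    Assumes d_model is even."""
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (T, 1)
    i = torch.arange(0, d_model, 2).float()            # even dimensions
    freq = pos / (10000 ** (i / d_model))              # (T, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(freq)
    pe[:, 1::2] = torch.cos(freq)
    return pe

# Added to token embeddings so attention can tell positions apart:
# x = token_embeddings + sinusoidal_positions(T, d_model)
```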
Residual and Normalization Layers

Transformer layers are deep stacks of attention and feedforward blocks. Residual connections and normalization layers are what keep such deep stacks stable and trainable.
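A common arrangement is the pre-LayerNorm residual, $x + f(\mathrm{LayerNorm}(x))$, which preserves an identity path for gradients through the whole stack. A sketch with assumed sizes:

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Pre-LN residual: x + f(LayerNorm(x)). The identity path keeps
    gradients flowing even through very deep stacks."""
    def __init__(self, d_model: int, fn: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fn = fn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fn(self.norm(x))

# A deep stack is then just repeated residual blocks:
d = 256
ff = lambda: nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
stack = nn.Sequential(*[PreNormResidual(d, ff()) for _ in range(24)])
out = stack(torch.randn(2, 10, d))   # stable forward pass through 24 blocks
```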
Scaling Transformers

Scaling a transformer means increasing its capacity, data exposure, context length, training compute, or serving throughput.
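For the capacity axis, a common back-of-envelope estimate (assuming $d_{ff} = 4 \cdot d_{model}$ and ignoring biases and LayerNorm parameters) puts a decoder-style transformer at roughly $12 \cdot L \cdot d^2$ non-embedding parameters:

```python
def approx_params(n_layers: int, d_model: int, vocab: int = 50_000) -> int:
    """Rough parameter count, assuming d_ff = 4 * d_model:
    per layer ~ 4*d^2 (attention projections) + 8*d^2 (feedforward),
    plus the token embedding matrix."""
    return 12 * n_layers * d_model**2 + vocab * d_model

# e.g. a GPT-2-small-like shape: 12 layers, d_model = 768
print(f"{approx_params(12, 768):,}")   # ~85M non-embedding + ~38M embedding
```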
Efficient Transformers

Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size $T \times T$, so time and memory grow as $O(T^2)$.
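One family of remedies restricts each query to a local window of keys, as in sliding-window (Longformer-style) attention, cutting the cost to $O(T \cdot w)$. A toy sketch that still materializes the full matrix purely for demonstration:

```python
import torch

def sliding_window_mask(T: int, window: int) -> torch.Tensor:
    """Boolean mask (True = blocked): each query attends only to keys
    within `window` positions, so work drops from O(T^2) to O(T * window)."""
    idx = torch.arange(T)
    dist = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs()
    return dist > window

T = 8
scores = torch.randn(T, T)   # the dense T x T score matrix
scores = scores.masked_fill(sliding_window_mask(T, window=2), float("-inf"))
weights = scores.softmax(dim=-1)   # each row now mixes at most 5 keys
# (A real implementation avoids materializing the full T x T matrix.)
```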
Sparse Expert Architectures

Dense transformers activate every parameter for every token. Sparse expert architectures instead activate only a subset of parameters per token, so capacity can grow without a proportional increase in per-token compute.
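A sparse mixture-of-experts layer realizes this by routing each token to a small number of expert feedforward networks. A simplified top-k routing sketch (real systems add load balancing and capacity limits, omitted here):

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sparse mixture-of-experts layer: a router picks k experts per token,
    so only a fraction of the layer's parameters is active per token."""
    def __init__(self, d_model: int = 256, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its top-k experts.
        gates = self.router(x).softmax(dim=-1)      # (tokens, n_experts)
        top_g, top_i = gates.topk(self.k, dim=-1)   # (tokens, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                hit = top_i[:, slot] == e           # tokens routed to expert e
                if hit.any():
                    out[hit] += top_g[hit, slot].unsqueeze(-1) * expert(x[hit])
        return out

y = TopKMoE()(torch.randn(32, 256))   # 32 tokens, only 2 of 8 experts each
```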