Statistical Language Models

A language model assigns probabilities to sequences of tokens. The tokens may be words, subwords, characters, bytes, or other discrete symbols. In the classical setting, a sentence is represented as a finite sequence of tokens $w_1, w_2, \dots, w_n$, and the model assigns the sequence a joint probability $P(w_1, w_2, \dots, w_n)$, which the chain rule factors into a product of conditional probabilities $\prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$.
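To make the factorization concrete, here is a minimal sketch (not from the source; the probability table and tokens are invented for illustration) that scores a toy sequence by multiplying hand-specified conditional probabilities along the chain rule:

```python
# Hypothetical conditional probabilities P(token | history) for a toy model.
# The values are made up purely for illustration.
cond_prob = {
    ("<s>",): {"the": 0.6, "a": 0.4},
    ("<s>", "the"): {"cat": 0.5, "dog": 0.5},
    ("<s>", "the", "cat"): {"</s>": 1.0},
}

def sequence_probability(tokens):
    """Chain rule: P(w1..wn) = product over i of P(wi | w1..w(i-1))."""
    history = ("<s>",)
    prob = 1.0
    for tok in tokens:
        prob *= cond_prob.get(history, {}).get(tok, 0.0)
        history = history + (tok,)
    return prob

print(sequence_probability(["the", "cat", "</s>"]))  # 0.6 * 0.5 * 1.0 = 0.3
```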
Neural Language Models

Statistical language models estimate probabilities from discrete counts. Neural language models instead learn continuous representations of tokens and contexts, which lets them assign sensible probabilities to sequences never observed during training.
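For contrast with the neural approach, the sketch below shows what count-based estimation looks like: a bigram model whose probabilities are normalized co-occurrence counts. The three-line corpus is hypothetical, chosen only to make the arithmetic visible.

```python
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus (assumption)

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens[:-1])            # count each context token
    bigrams.update(zip(tokens, tokens[1:])) # count adjacent token pairs

def p_bigram(w, prev):
    """Maximum-likelihood estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("cat", "the"))  # 2/3: "the cat" occurs twice among 3 uses of "the"
```

A bigram unseen in the corpus gets probability zero here, which is exactly the sparsity problem that smoothing and, later, neural models address.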
Autoregressive Modeling

Autoregressive modeling is the dominant formulation for modern language generation. The model predicts the next token from the previous tokens, and repeating this prediction step produces a sequence.
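The loop below is a minimal sketch of that generation procedure. The next-token table is a stand-in assumption: it conditions only on the previous token, whereas a real model conditions on the full prefix.

```python
import random

# Hypothetical next-token distributions (a real model computes these from
# the entire prefix, not just the last token).
NEXT = {
    "<s>": {"the": 0.7, "a": 0.3},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"bird": 1.0},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}

def generate(max_len=10, seed=0):
    """Autoregressive sampling: repeatedly draw the next token, append it,
    and feed the extended sequence back in as context."""
    rng = random.Random(seed)
    tokens, current = [], "<s>"
    for _ in range(max_len):
        choices, weights = zip(*NEXT[current].items())
        current = rng.choices(choices, weights=weights)[0]
        if current == "</s>":
            break
        tokens.append(current)
    return " ".join(tokens)

print(generate())
```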
Masked Language Modeling

Masked language modeling trains a model to recover missing tokens from their surrounding context.
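The sketch below prepares one training example for this objective by randomly hiding tokens. The 15% masking rate is a commonly used value, not something specified in this text.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; the model must predict the originals."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)    # the model is trained to recover this token
        else:
            inputs.append(tok)
            targets.append(None)   # no prediction loss at unmasked positions
    return inputs, targets

inputs, targets = mask_tokens("the cat sat on the mat".split())
print(inputs)
print(targets)
```

Because the model sees context on both sides of each mask, this objective yields bidirectional representations rather than a left-to-right generator.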
Tokenization Systems

A language model does not read raw text directly. It reads tokens. Tokenization is the process that maps a string of text into a sequence of discrete symbols, and later maps generated symbols back into text.
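A character-level tokenizer is the simplest illustration of this round trip; the sketch below (illustrative only, since real systems typically use subwords) builds a vocabulary from sample text and maps strings to IDs and back.

```python
class CharTokenizer:
    """Minimal character-level tokenizer: one ID per distinct character."""

    def __init__(self, text):
        vocab = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(vocab)}   # string -> id
        self.itos = {i: ch for ch, i in self.stoi.items()}  # id -> string

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
print(ids, tok.decode(ids))  # the IDs round-trip back to "hello"
```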
Subword Methods

Subword methods split text into units smaller than words but usually larger than single characters.
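Byte-pair encoding is one widely used subword method; since the text does not name a specific algorithm, treat the following as an assumed example. The sketch performs a single BPE merge step: find the most frequent adjacent symbol pair across a toy word-frequency table and fuse it into one symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace each occurrence of the pair with its concatenation.
    Note: naive string replace; production BPE uses boundary-aware matching."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in words.items()}

# Toy word frequencies with characters separated by spaces (assumption).
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
pair = most_frequent_pair(words)
print(pair)                    # ('w', 'e'), occurring 8 times
print(merge_pair(words, pair)) # 'w e' fused into the single symbol 'we'
```

Repeating this merge step a fixed number of times yields the final subword vocabulary.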
Embeddings and Output Projections

After tokenization, text is represented as integer token IDs. An embedding matrix maps each ID to a dense vector the model can process, and an output projection maps the model's final hidden states back to a score for every token in the vocabulary.
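The sketch below traces that path with NumPy. The sizes are toy values, and tying the output projection to the embedding matrix is an assumed (though common) design choice, not something stated here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16              # toy sizes (assumption)

E = rng.normal(size=(vocab_size, d_model)) # input embedding matrix

token_ids = np.array([3, 17, 42])          # integer token IDs from the tokenizer
x = E[token_ids]                           # embedding lookup: (3, d_model)

# Stand-in for the model's hidden states; a real model transforms x first.
hidden = x

# Output projection tied to the embeddings: hidden states -> vocabulary logits.
logits = hidden @ E.T                      # (3, vocab_size)

# Softmax turns each position's logits into a next-token distribution.
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
print(probs.shape)                         # (3, 100)
```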
Pretraining Objectives

A pretraining objective defines the prediction task used to train a model before it is adapted to a downstream use case.
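For autoregressive models, the standard objective is next-token cross-entropy: the negative log-probability the model assigns to each token that actually came next. The sketch below computes it from scratch on invented logits and targets.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of the next-token objective.
    logits: (seq_len, vocab_size) scores for each position's next token.
    targets: (seq_len,) IDs of the tokens that actually came next."""
    shifted = logits - logits.max(-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100))      # toy model outputs (assumption)
targets = rng.integers(0, 100, size=5)  # toy reference next tokens
print(next_token_loss(logits, targets))
```

Minimizing this loss over a large corpus is what drives the model toward the conditional distributions described above; masked language modeling swaps in a different prediction task but uses the same cross-entropy machinery.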