Self-supervised learning is a form of learning where the training signal is created from the data itself. The dataset does not need human-written labels, but the model still receives a prediction task.
The usual dataset contains only inputs:

$$\mathcal{D} = \{x_1, x_2, \dots, x_N\}$$

A self-supervised method transforms each input into an input-target pair:

$$x \;\longmapsto\; (\tilde{x}, y)$$

Here $\tilde{x}$ is the model input and $y$ is a target derived automatically from $x$.
The model learns by solving this artificial task. The goal is not the artificial task itself. The goal is to learn representations that transfer to real tasks.
## Why Self-Supervision Matters
Human labels are expensive. Unlabeled data is abundant. Text, images, audio, video, logs, and web pages exist at a much larger scale than carefully labeled datasets.
Self-supervised learning makes this data useful. It creates supervision from structure already present in the input.
Examples:
| Data type | Self-supervised signal |
|---|---|
| Text | Predict missing or next tokens |
| Image | Predict masked patches or match augmented views |
| Audio | Predict masked frames or future segments |
| Video | Predict temporal order or future frames |
| Graph | Predict masked nodes or edges |
This is why self-supervised learning is central to modern foundation models. Large language models, vision-language models, speech models, and many representation models rely on self-supervised pretraining.
## Pretext Tasks
A pretext task is an automatically generated task used for pretraining.
Examples include:
| Pretext task | Model must learn |
|---|---|
| Next-token prediction | Predict the next token from previous tokens |
| Masked-token prediction | Recover hidden tokens from context |
| Image inpainting | Recover missing image regions |
| Contrastive matching | Bring related views together |
| Rotation prediction | Predict image rotation |
| Temporal order prediction | Recover sequence order |
| Audio masking | Recover hidden audio frames |
A good pretext task forces the model to learn useful structure. A poor pretext task may be easy to solve using shallow cues.
For example, if an image model learns to predict rotation, it may need to understand object orientation. But it may also exploit dataset artifacts. Modern methods prefer pretext tasks that scale well and align with downstream use.
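As a concrete illustration, here is a minimal sketch of the rotation pretext task in PyTorch. The helper name `make_rotation_batch` is hypothetical; the point is that the label comes from the transformation itself, not from a human annotator.

```python
import torch

def make_rotation_batch(images):
    """Rotation pretext task: the label is produced by the data transformation.

    images: [B, C, H, W]
    Returns the rotated images and the rotation class (0, 1, 2, 3) for each.
    """
    rotations = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([
        torch.rot90(img, k=int(k), dims=(-2, -1))   # rotate by k * 90 degrees
        for img, k in zip(images, rotations)
    ])
    return rotated, rotations
```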
## Autoregressive Prediction
Autoregressive self-supervision predicts the next element from previous elements.
For a sequence $x_1, x_2, \dots, x_T$, the model factorizes the sequence probability as

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$

Training minimizes the negative log-likelihood:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t})$$
This is the main training objective for autoregressive language models. The target token is taken directly from the sequence.
In PyTorch-style language modeling, the input is shifted by one position:
```python
# token_ids shape: [B, T]
inputs = token_ids[:, :-1]    # [B, T-1]
targets = token_ids[:, 1:]    # [B, T-1]

logits = model(inputs)        # [B, T-1, V]
loss = torch.nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)),
    targets.reshape(-1),
)
```

The model receives earlier tokens and predicts later tokens.
## Masked Prediction
Masked prediction hides part of the input and asks the model to reconstruct it.
For text, some tokens are replaced by a mask token. The model predicts the hidden tokens from surrounding context.
For images, patches may be removed or replaced. The model reconstructs the missing pixels, patches, or latent representations.
A masked prediction pipeline has three steps:
- Select part of the input to hide.
- Feed the corrupted input to the model.
- Compute loss only on the hidden parts.
For token data:
```python
# token_ids shape: [B, T]
mask = torch.rand(token_ids.shape, device=token_ids.device) < 0.15

masked_inputs = token_ids.clone()
masked_inputs[mask] = mask_token_id

logits = model(masked_inputs)   # [B, T, V]
loss = torch.nn.functional.cross_entropy(
    logits[mask],       # predictions at masked positions: [num_masked, V]
    token_ids[mask],    # original tokens at masked positions: [num_masked]
)
```

Masked prediction encourages the model to learn context. Unlike autoregressive prediction, it can use both left and right context.
## Contrastive Learning
Contrastive learning trains a model by comparing examples.
The core idea is simple. Two views of the same example should have similar representations. Views from different examples should have different representations.
Given an input $x$, we create two augmented views:

$$v_1 = t_1(x), \qquad v_2 = t_2(x)$$

An encoder maps them to representations:

$$z_1 = f(v_1), \qquad z_2 = f(v_2)$$

The objective pulls $z_1$ and $z_2$ together while pushing away representations from other examples.

A common similarity score is cosine similarity:

$$\mathrm{sim}(z_i, z_j) = \frac{z_i^\top z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$$

A common contrastive loss is InfoNCE:

$$\mathcal{L}_i = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+}) / \tau\big)}{\sum_{j} \exp\!\big(\mathrm{sim}(z_i, z_j) / \tau\big)}$$

Here $z_i^{+}$ is a positive example for $z_i$, and $\tau$ is a temperature parameter.
In PyTorch-like form:
```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)

    logits = z1 @ z2.T            # pairwise cosine similarities: [B, B]
    logits = logits / temperature

    # The i-th row of z1 matches the i-th row of z2.
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```

The diagonal pairs are positives. Off-diagonal pairs act as negatives.
## Data Augmentation as Supervision
In contrastive learning, augmentation defines what the model should ignore.
For images, common augmentations include cropping, flipping, color jitter, blur, and grayscale conversion. If two augmented views should match, then the model learns invariance to those changes.
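A minimal sketch of such an image augmentation pipeline, assuming torchvision is available; the parameter values are illustrative rather than prescriptive.

```python
from torchvision import transforms

# Each call applies a fresh random crop, flip, color jitter, grayscale
# conversion, and blur, so calling it twice on one image yields two views.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

# view1, view2 = augment(image), augment(image)
```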
For audio, augmentations may include noise, time masking, frequency masking, or speed perturbation.
For text, augmentation is harder because small word changes can alter meaning. Common choices include dropout noise, span masking, back-translation, or using related passages.
The augmentation policy is part of the learning objective. Bad augmentations can destroy useful information. Weak augmentations may make the task too easy.
## Predictive Representation Learning
Not all self-supervised learning reconstructs raw data. Some methods predict latent representations.
Instead of predicting pixels, the model predicts features produced by another network. This often improves representation quality.
A common pattern is teacher-student learning:
The student predicts the teacher’s representation from a corrupted or augmented input.
The teacher may be updated by an exponential moving average of the student's weights:

$$\theta_{\text{teacher}} \leftarrow m\,\theta_{\text{teacher}} + (1 - m)\,\theta_{\text{student}}$$
This design appears in many modern self-supervised vision and speech systems.
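A minimal sketch of this update in PyTorch, assuming `teacher` and `student` are two copies of the same architecture; the momentum value is illustrative. Only the student receives gradients; the teacher is refreshed with this rule after each optimizer step.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # theta_teacher <- m * theta_teacher + (1 - m) * theta_student
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```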
## Self-Supervised Learning in Vision
Self-supervised vision models commonly use one of three strategies.
| Strategy | Example objective |
|---|---|
| Contrastive learning | Match two views of the same image |
| Masked image modeling | Reconstruct masked patches |
| Teacher-student prediction | Predict teacher features |
Masked image modeling became especially important with vision transformers. An image is divided into patches. Some patches are hidden. The model learns to reconstruct the missing information.
A batch of images has shape

$$[B, C, H, W]$$

After patching, it may become

$$[B, N, D]$$

where $N$ is the number of patches and $D$ is the patch embedding dimension.
The same idea used for text tokens can be applied to image patches.
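A minimal sketch of turning an image batch into flattened patches and masking a subset, using plain tensor operations; the patch size and mask ratio are illustrative. A linear layer would then map each flattened patch to the embedding dimension $D$.

```python
import torch

B, C, H, W = 8, 3, 224, 224
P = 16                                             # patch size; H and W divisible by P
images = torch.randn(B, C, H, W)

# Split into non-overlapping P x P patches and flatten each patch.
patches = images.unfold(2, P, P).unfold(3, P, P)   # [B, C, H/P, W/P, P, P]
patches = patches.permute(0, 2, 3, 1, 4, 5)        # [B, H/P, W/P, C, P, P]
patches = patches.reshape(B, -1, C * P * P)        # [B, N, C*P*P]

# Hide a random subset of patches; the model reconstructs what is hidden.
mask = torch.rand(B, patches.size(1)) < 0.75       # roughly 75% of patches masked
```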
## Self-Supervised Learning in Language
Language modeling is the dominant self-supervised method for text.
An autoregressive language model predicts the next token:

$$p(x_t \mid x_1, \dots, x_{t-1})$$

A masked language model predicts hidden tokens from the surrounding context:

$$p(x_m \mid x_{\setminus M}) \quad \text{for each masked position } m \in M$$
The input text itself supplies the target. No human needs to label each sentence.
This makes web-scale language modeling possible. A model can train on books, code, articles, documentation, dialogue, and other text sources by turning every sequence into a prediction problem.
## Self-Supervised Learning in Audio and Speech
Audio models often use masked prediction.
A waveform or spectrogram is partially hidden. The model predicts missing frames or latent speech units.
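A minimal sketch of hiding a block of time frames in a spectrogram batch; the shapes and span length are illustrative.

```python
import torch

spec = torch.randn(4, 80, 200)          # [B, F, T]: 80 mel bins, 200 time frames

# Hide a contiguous block of time frames; the model would predict the
# original content of the masked region from the surrounding frames.
start, span = 50, 20
masked = spec.clone()
masked[:, :, start:start + span] = 0.0
```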
Audio has strong temporal structure. Nearby frames are correlated, and long-range dependencies carry linguistic or musical information.
Self-supervised speech models can learn useful representations from unlabeled audio, then transfer to speech recognition, speaker identification, emotion detection, and translation.
## Transfer Learning
Self-supervised learning is usually a pretraining stage.
The common workflow is:
- Pretrain on a large unlabeled dataset.
- Fine-tune on a smaller labeled dataset.
- Evaluate on downstream tasks.
Pretraining learns general representations. Fine-tuning adapts those representations to a specific task.
For example:
| Pretraining | Fine-tuning |
|---|---|
| Predict next token on web text | Instruction following |
| Mask image patches | Image classification |
| Contrast image-text pairs | Visual question answering |
| Mask speech frames | Speech recognition |
This workflow reduces the need for task-specific labeled data.
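A minimal sketch of the fine-tuning step, with a toy stand-in for the pretrained encoder; in practice the encoder would be loaded from a self-supervised checkpoint.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained encoder (in practice: loaded from a checkpoint).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())

# Attach a small task-specific head and train on the labeled downstream data.
head = nn.Linear(256, 10)
model = nn.Sequential(encoder, head)

# Optionally freeze the encoder and train only the head (linear probing).
for p in encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```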
## Benefits and Limitations
Self-supervised learning has major advantages.
It can use large unlabeled datasets. It learns transferable representations. It often improves sample efficiency. It supports foundation models that can be adapted to many tasks.
It also has limitations.
The pretext objective may not align with the downstream task. Large-scale training is expensive. The model may learn dataset biases. Evaluation can be difficult before fine-tuning. In generative systems, good predictive performance does not guarantee factuality, reasoning, or safety.
Self-supervision creates a training signal, but it does not solve the whole learning problem.
## Summary
Self-supervised learning creates targets from the data itself. It sits between supervised and unsupervised learning: there are targets, but they are generated automatically rather than written by humans.
Autoregressive prediction, masked prediction, contrastive learning, and teacher-student prediction are the main forms. These methods power modern language models, vision models, speech models, and multimodal foundation models.