Self-supervised learning is a form of learning where the training signal is created from the data itself. The dataset does not need human-written labels, but the model still receives a prediction task.
The usual dataset contains only inputs:

$$\mathcal{D} = \{x_1, x_2, \dots, x_N\}$$

A self-supervised method transforms each input into an input-target pair:

$$x \;\longmapsto\; (\tilde{x}, y)$$

Here $\tilde{x}$ is the model input and $y$ is a target derived automatically from $x$.
The model learns by solving this artificial task. The goal is not the artificial task itself. The goal is to learn representations that transfer to real tasks.
## Why Self-Supervision Matters
Human labels are expensive. Unlabeled data is abundant. Text, images, audio, video, logs, and web pages exist at a much larger scale than carefully labeled datasets.
Self-supervised learning makes this data useful. It creates supervision from structure already present in the input.
Examples:
| Data type | Self-supervised signal |
|---|---|
| Text | Predict missing or next tokens |
| Image | Predict masked patches or match augmented views |
| Audio | Predict masked frames or future segments |
| Video | Predict temporal order or future frames |
| Graph | Predict masked nodes or edges |
This is why self-supervised learning is central to modern foundation models. Large language models, vision-language models, speech models, and many representation models rely on self-supervised pretraining.
## Pretext Tasks
A pretext task is an automatically generated task used for pretraining.
Examples include:
| Pretext task | Model must learn |
|---|---|
| Next-token prediction | Predict the next token from previous tokens |
| Masked-token prediction | Recover hidden tokens from context |
| Image inpainting | Recover missing image regions |
| Contrastive matching | Bring related views together |
| Rotation prediction | Predict image rotation |
| Temporal order prediction | Recover sequence order |
| Audio masking | Recover hidden audio frames |
A good pretext task forces the model to learn useful structure. A poor pretext task may be easy to solve using shallow cues.
For example, if an image model learns to predict rotation, it may need to understand object orientation. But it may also exploit dataset artifacts. Modern methods prefer pretext tasks that scale well and align with downstream use.
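As a concrete illustration, here is a minimal sketch of the rotation pretext task in PyTorch. The helper name `make_rotation_batch` is hypothetical; the point is that the label comes from the transformation itself, not from a human annotator.

```python
import torch

def make_rotation_batch(images):
    """Rotation pretext task: the label is produced by the data transformation.

    images: [B, C, H, W]
    Returns the rotated images and the rotation class (0, 1, 2, 3) for each.
    """
    rotations = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([
        torch.rot90(img, k=int(k), dims=(-2, -1))   # rotate by k * 90 degrees
        for img, k in zip(images, rotations)
    ])
    return rotated, rotations
```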
## Autoregressive Prediction
Autoregressive self-supervision predicts the next element from previous elements.
For a sequence $x_1, x_2, \dots, x_T$, the model factorizes the sequence probability as

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$

Training minimizes the negative log-likelihood:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t})$$
This is the main training objective for autoregressive language models. The target token is taken directly from the sequence.
In PyTorch-style language modeling, the input is shifted by one position:
```python
# token_ids shape: [B, T]
inputs = token_ids[:, :-1]    # [B, T-1]
targets = token_ids[:, 1:]    # [B, T-1]

logits = model(inputs)        # [B, T-1, V]
loss = torch.nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)),
    targets.reshape(-1),
)
```

The model receives earlier tokens and predicts later tokens.
## Masked Prediction
Masked prediction hides part of the input and asks the model to reconstruct it.
For text, some tokens are replaced by a mask token. The model predicts the hidden tokens from surrounding context.
For images, patches may be removed or replaced. The model reconstructs the missing pixels, patches, or latent representations.
A masked prediction pipeline has three steps:
- Select part of the input to hide.
- Feed the corrupted input to the model.
- Compute loss only on the hidden parts.
For token data:
```python
# token_ids shape: [B, T]
mask = torch.rand(token_ids.shape, device=token_ids.device) < 0.15

masked_inputs = token_ids.clone()
masked_inputs[mask] = mask_token_id

logits = model(masked_inputs)   # [B, T, V]
loss = torch.nn.functional.cross_entropy(
    logits[mask],       # predictions at masked positions: [num_masked, V]
    token_ids[mask],    # original tokens at masked positions: [num_masked]
)
```

Masked prediction encourages the model to learn context. Unlike autoregressive prediction, it can use both left and right context.
## Contrastive Learning
Contrastive learning trains a model by comparing examples.
The core idea is simple. Two views of the same example should have similar representations. Views from different examples should have different representations.
Given an input $x$, we create two augmented views:

$$v_1 = t_1(x), \qquad v_2 = t_2(x)$$

An encoder maps them to representations:

$$z_1 = f(v_1), \qquad z_2 = f(v_2)$$

The objective pulls $z_1$ and $z_2$ together while pushing away representations from other examples.

A common similarity score is cosine similarity:

$$\mathrm{sim}(z_i, z_j) = \frac{z_i^\top z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$$

A common contrastive loss is InfoNCE:

$$\mathcal{L}_i = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+}) / \tau\big)}{\sum_{j} \exp\!\big(\mathrm{sim}(z_i, z_j) / \tau\big)}$$

Here $z_i^{+}$ is a positive example for $z_i$, and $\tau$ is a temperature parameter.
In PyTorch-like form:
```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)

    logits = z1 @ z2.T            # pairwise cosine similarities: [B, B]
    logits = logits / temperature

    # The i-th row of z1 matches the i-th row of z2.
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```

The diagonal pairs are positives. Off-diagonal pairs act as negatives.
## Data Augmentation as Supervision
In contrastive learning, augmentation defines what the model should ignore.
For images, common augmentations include cropping, flipping, color jitter, blur, and grayscale conversion. If two augmented views should match, then the model learns invariance to those changes.
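A minimal sketch of such an image augmentation pipeline, assuming torchvision is available; the parameter values are illustrative rather than prescriptive.

```python
from torchvision import transforms

# Each call applies a fresh random crop, flip, color jitter, grayscale
# conversion, and blur, so calling it twice on one image yields two views.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

# view1, view2 = augment(image), augment(image)
```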
For audio, augmentations may include noise, time masking, frequency masking, or speed perturbation.
For text, augmentation is harder because small word changes can alter meaning. Common choices include dropout noise, span masking, back-translation, or using related passages.
The augmentation policy is part of the learning objective. Bad augmentations can destroy useful information. Weak augmentations may make the task too easy.
## Predictive Representation Learning
Not all self-supervised learning reconstructs raw data. Some methods predict latent representations.
Instead of predicting pixels, the model predicts features produced by another network. This often improves representation quality.
A common pattern is teacher-student learning:
The student predicts the teacher’s representation from a corrupted or augmented input.
The teacher may be updated by an exponential moving average of the student's weights:

$$\theta_{\text{teacher}} \leftarrow m\,\theta_{\text{teacher}} + (1 - m)\,\theta_{\text{student}}$$
This design appears in many modern self-supervised vision and speech systems.
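A minimal sketch of this update in PyTorch, assuming `teacher` and `student` are two copies of the same architecture; the momentum value is illustrative. Only the student receives gradients; the teacher is refreshed with this rule after each optimizer step.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # theta_teacher <- m * theta_teacher + (1 - m) * theta_student
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```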
## Self-Supervised Learning in Vision
Self-supervised vision models commonly use one of three strategies.
| Strategy | Example objective |
|---|---|
| Contrastive learning | Match two views of the same image |
| Masked image modeling | Reconstruct masked patches |
| Teacher-student prediction | Predict teacher features |
Masked image modeling became especially important with vision transformers. An image is divided into patches. Some patches are hidden. The model learns to reconstruct the missing information.
A batch of images has shape

$$[B, C, H, W]$$

After patching, it may become

$$[B, N, D]$$

where $N$ is the number of patches and $D$ is the patch embedding dimension.
The same idea used for text tokens can be applied to image patches.
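A minimal sketch of turning an image batch into flattened patches and masking a subset, using plain tensor operations; the patch size and mask ratio are illustrative. A linear layer would then map each flattened patch to the embedding dimension $D$.

```python
import torch

B, C, H, W = 8, 3, 224, 224
P = 16                                             # patch size; H and W divisible by P
images = torch.randn(B, C, H, W)

# Split into non-overlapping P x P patches and flatten each patch.
patches = images.unfold(2, P, P).unfold(3, P, P)   # [B, C, H/P, W/P, P, P]
patches = patches.permute(0, 2, 3, 1, 4, 5)        # [B, H/P, W/P, C, P, P]
patches = patches.reshape(B, -1, C * P * P)        # [B, N, C*P*P]

# Hide a random subset of patches; the model reconstructs what is hidden.
mask = torch.rand(B, patches.size(1)) < 0.75       # roughly 75% of patches masked
```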
## Self-Supervised Learning in Language
Language modeling is the dominant self-supervised method for text.
An autoregressive language model predicts the next token:

$$p(x_t \mid x_1, \dots, x_{t-1})$$

A masked language model predicts hidden tokens from the surrounding context:

$$p(x_m \mid x_{\setminus M}) \quad \text{for each masked position } m \in M$$
The input text itself supplies the target. No human needs to label each sentence.
This makes web-scale language modeling possible. A model can train on books, code, articles, documentation, dialogue, and other text sources by turning every sequence into a prediction problem.
## Self-Supervised Learning in Audio and Speech
Audio models often use masked prediction.
A waveform or spectrogram is partially hidden. The model predicts missing frames or latent speech units.
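A minimal sketch of hiding a block of time frames in a spectrogram batch; the shapes and span length are illustrative.

```python
import torch

spec = torch.randn(4, 80, 200)          # [B, F, T]: 80 mel bins, 200 time frames

# Hide a contiguous block of time frames; the model would predict the
# original content of the masked region from the surrounding frames.
start, span = 50, 20
masked = spec.clone()
masked[:, :, start:start + span] = 0.0
```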
Audio has strong temporal structure. Nearby frames are correlated, and long-range dependencies carry linguistic or musical information.
Self-supervised speech models can learn useful representations from unlabeled audio, then transfer to speech recognition, speaker identification, emotion detection, and translation.
## Transfer Learning
Self-supervised learning is usually a pretraining stage.
The common workflow is:
- Pretrain on a large unlabeled dataset.
- Fine-tune on a smaller labeled dataset.
- Evaluate on downstream tasks.
Pretraining learns general representations. Fine-tuning adapts those representations to a specific task.
For example:
| Pretraining | Fine-tuning |
|---|---|
| Predict next token on web text | Instruction following |
| Mask image patches | Image classification |
| Contrast image-text pairs | Visual question answering |
| Mask speech frames | Speech recognition |
This workflow reduces the need for task-specific labeled data.
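A minimal sketch of the fine-tuning step, with a toy stand-in for the pretrained encoder; in practice the encoder would be loaded from a self-supervised checkpoint.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained encoder (in practice: loaded from a checkpoint).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())

# Attach a small task-specific head and train on the labeled downstream data.
head = nn.Linear(256, 10)
model = nn.Sequential(encoder, head)

# Optionally freeze the encoder and train only the head (linear probing).
for p in encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```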
## Benefits and Limitations
Self-supervised learning has major advantages.
It can use large unlabeled datasets. It learns transferable representations. It often improves sample efficiency. It supports foundation models that can be adapted to many tasks.
It also has limitations.
The pretext objective may not align with the downstream task. Large-scale training is expensive. The model may learn dataset biases. Evaluation can be difficult before fine-tuning. In generative systems, good predictive performance does not guarantee factuality, reasoning, or safety.
Self-supervision creates a training signal, but it does not solve the whole learning problem.
## Summary
Self-supervised learning creates targets from the data itself. It sits between supervised and unsupervised learning: there are targets, but they are generated automatically rather than written by humans.
Autoregressive prediction, masked prediction, contrastive learning, and teacher-student prediction are the main forms. These methods power modern language models, vision models, speech models, and multimodal foundation models.