Self-Supervised Objectives

Self-supervised learning trains a model using supervision constructed from the data itself. Instead of requiring human labels, the training task is derived from structure already present in the input.

A language model predicts missing or future tokens. A vision model may predict whether two augmented images came from the same source image. An audio model may predict masked spectrogram frames. A multimodal model may match an image with its caption.

The target still exists. The difference is how the target is produced. In supervised learning, humans or external systems provide labels. In self-supervised learning, the data generation procedure provides labels.

Pretext Tasks

A self-supervised objective usually defines a pretext task. A pretext task is an artificial training task designed to make the model learn useful representations.

Examples include:

Domain        Pretext task
Text          Predict the next token
Text          Predict masked tokens
Vision        Match two augmented views
Vision        Reconstruct masked patches
Audio         Predict masked acoustic frames
Multimodal    Match paired image and text
Graphs        Predict missing nodes or edges

The pretext task should force the model to learn structure useful beyond the task itself. For example, next-token prediction forces a language model to learn syntax, semantics, facts, style, and reasoning patterns because all of these help predict text.

Autoregressive Prediction

Autoregressive self-supervision predicts the next element from previous elements.

For a sequence

x = (x_1, x_2, \ldots, x_T),

the model factorizes the probability as

p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}).

The training objective is negative log-likelihood:

L = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).

This is the standard objective for GPT-style language models.

In PyTorch, logits often have shape [B, T, V], where V is the vocabulary size:

import torch
import torch.nn.functional as F

B, T, V = 4, 8, 1000

logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

loss = F.cross_entropy(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

This objective teaches the model to assign high probability to the observed next token.

Masked Prediction

Masked prediction hides part of the input and trains the model to reconstruct it.

For text, some tokens are replaced by mask tokens or otherwise hidden. The model predicts the original tokens from surrounding context.

For images, some patches are hidden. The model reconstructs pixels, patch embeddings, or discrete visual tokens.

For audio, some time-frequency regions are hidden. The model reconstructs masked acoustic content.

The general objective is

L = -\sum_{t \in M} \log p_\theta(x_t \mid x_{\setminus M}),

where M is the set of masked positions, and x_{\setminus M} denotes the visible part of the input.

A simple masked-token loss in PyTorch:

B, T, V = 4, 16, 1000

logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

mask = torch.rand(B, T) < 0.15

loss_per_token = F.cross_entropy(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
    reduction="none",
).reshape(B, T)

loss = (loss_per_token * mask).sum() / mask.sum().clamp_min(1)

Only masked positions contribute to the loss.

Denoising Objectives

A denoising objective corrupts the input and trains the model to recover the original.

Let x be clean data and \tilde{x} be a corrupted version. The model learns

f_\theta(\tilde{x}) \approx x.

A simple denoising loss is

L = \|f_\theta(\tilde{x}) - x\|_2^2.

In PyTorch:

x = torch.randn(32, 100)

noise = 0.1 * torch.randn_like(x)
x_noisy = x + noise

# "model" stands in for any denoising network; a small MLP is used here as a placeholder
model = torch.nn.Sequential(
    torch.nn.Linear(100, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 100),
)
x_hat = model(x_noisy)

loss = F.mse_loss(x_hat, x)

Denoising appears in denoising autoencoders, masked autoencoders, and diffusion models. The key idea is that recovering the clean signal requires the model to learn regularities in the data distribution.

Contrastive Self-Supervision

Contrastive self-supervision creates positive and negative pairs automatically.

For images, two augmentations of the same image are positives. Augmentations of different images are negatives.

Let

z_i^{(1)} = f_\theta(x_i^{(1)}), \qquad z_i^{(2)} = f_\theta(x_i^{(2)}).

The objective pulls together z_i^{(1)} and z_i^{(2)}, while pushing apart embeddings from different images.

Using in-batch negatives:

# two augmented views of the same batch: row i of z1 and row i of z2 form a positive pair
z1 = F.normalize(torch.randn(32, 128), dim=-1)
z2 = F.normalize(torch.randn(32, 128), dim=-1)

temperature = 0.07
logits = z1 @ z2.T / temperature      # scaled cosine similarities between all pairs
targets = torch.arange(z1.shape[0])   # the correct match for row i is column i

loss = F.cross_entropy(logits, targets)

This is the same basic structure as InfoNCE. It turns representation learning into a classification problem over matching pairs.

Reconstruction Objectives

Reconstruction objectives train a model to reproduce its input or parts of its input.

An autoencoder consists of an encoder and decoder:

z = f_\theta(x), \qquad \hat{x} = g_\phi(z).

The reconstruction loss may be

L = \|x - \hat{x}\|_2^2.

For binary data, binary cross-entropy may be more appropriate:

L = -\sum_j \left[ x_j \log \hat{x}_j + (1 - x_j) \log(1 - \hat{x}_j) \right].
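
A minimal autoencoder reconstruction loss in PyTorch (a sketch; the layer sizes, input dimension, and choice of MSE are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

# illustrative sizes; real encoders and decoders are task-specific
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(64, 784)        # batch of flattened inputs in [0, 1]
z = encoder(x)                 # latent code
x_hat = decoder(z)             # reconstruction

loss = F.mse_loss(x_hat, x)    # for binary data, a BCE-with-logits loss could replace this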

Reconstruction is useful when the model must preserve information about the input. But naive reconstruction can waste capacity on low-level details. For representation learning, masked reconstruction often works better because the model must infer missing structure rather than copy visible input.

Predictive Coding

Predictive coding trains a model to predict future, missing, or latent parts of the input.

In sequence data, a model may encode previous observations and predict future observations:

z_t = f_\theta(x_{\leq t}), \qquad \hat{x}_{t+k} = g_\phi(z_t).

The loss may be a reconstruction loss, a classification loss, or a contrastive loss.
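
A minimal predictive-coding sketch in PyTorch, using a regression loss (the GRU encoder, linear predictor, and horizon k are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D, k = 8, 20, 32, 3

x = torch.randn(B, T, D)                   # observed sequence

encoder = nn.GRU(D, 64, batch_first=True)  # f_theta: summarize the past
predictor = nn.Linear(64, D)               # g_phi: predict a future observation

context = x[:, : T - k]                    # x_{<= t} with t = T - k - 1
h, _ = encoder(context)
z_t = h[:, -1]                             # z_t = f_theta(x_{<= t})
x_pred = predictor(z_t)                    # hat{x}_{t+k} = g_phi(z_t)

loss = F.mse_loss(x_pred, x[:, T - 1])     # the observation k steps ahead of t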

Predictive coding is useful when data has temporal or spatial structure. It appears in audio representation learning, video modeling, reinforcement learning, and world models.

Self-Supervision in Language Models

Language modeling is the most important example of self-supervised learning.

Given raw text, no manual label is needed. The sequence itself provides targets. For an autoregressive model, each token is a target for the previous context.

For a sequence of token IDs:

tokens = torch.tensor([
    [10, 15, 92, 31, 7],
    [44, 12, 8, 19, 3],
])

The input and target are shifted versions of the same sequence:

inputs = tokens[:, :-1]
targets = tokens[:, 1:]

The model receives

10 15 92 31

and predicts

15 92 31 7

This simple shift creates billions or trillions of training labels from raw text.

Self-Supervision in Vision Models

Vision models often use augmentation or masking.

Contrastive vision methods rely on augmentations. The model should produce similar representations for two views of the same image, even if one is cropped, color-shifted, or blurred.

Masked image modeling hides image patches and trains the model to reconstruct them. If an image is divided into patches, a random subset is removed. The model must infer missing patches from visible patches.

The objective may reconstruct raw pixels:

L = \|\hat{x}_M - x_M\|_2^2,

or predict discrete patch tokens using cross-entropy.
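
A sketch of a masked-patch pixel reconstruction loss (the patch count, patch size, 75% mask ratio, and random predictions are illustrative; a real model would encode the visible patches and predict the hidden ones):

import torch

B, N, P = 4, 196, 768               # batch, patches per image, pixels per patch

patches = torch.randn(B, N, P)      # ground-truth patch pixels
pred = torch.randn(B, N, P)         # model predictions (placeholder)

mask = torch.rand(B, N) < 0.75      # hide 75% of patches

loss_per_patch = ((pred - patches) ** 2).mean(dim=-1)   # MSE per patch
loss = (loss_per_patch * mask).sum() / mask.sum().clamp_min(1)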

Masked reconstruction is effective because vision contains strong spatial redundancy. Nearby patches and global structure help predict missing content.

Self-Supervision in Multimodal Models

Multimodal self-supervision uses natural pairings in data.

An image and its caption form a positive pair. An audio clip and its transcript form a positive pair. A video frame and its narration form a positive pair.

A common objective is symmetric contrastive learning:

L = \frac{1}{2} \left( L_{\text{image}\to\text{text}} + L_{\text{text}\to\text{image}} \right).

This trains both modalities into a shared embedding space. Once trained, the model can perform image search with text queries, text search with images, zero-shot classification, and retrieval-augmented generation.
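
A sketch of the symmetric loss, reusing the in-batch negatives pattern from the contrastive section (the embedding size and temperature are illustrative):

import torch
import torch.nn.functional as F

img = F.normalize(torch.randn(32, 512), dim=-1)   # image embeddings
txt = F.normalize(torch.randn(32, 512), dim=-1)   # paired text embeddings

temperature = 0.07
logits = img @ txt.T / temperature
targets = torch.arange(img.shape[0])

loss_i2t = F.cross_entropy(logits, targets)       # image -> text
loss_t2i = F.cross_entropy(logits.T, targets)     # text -> image
loss = 0.5 * (loss_i2t + loss_t2i)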

Teacher-Student Objectives

Some self-supervised methods use a teacher model to generate targets for a student model.

The teacher may be an exponential moving average of the student:

\theta_{\text{teacher}} \leftarrow \alpha \theta_{\text{teacher}} + (1 - \alpha) \theta_{\text{student}}.

The student learns to match the teacher’s outputs under different views or augmentations.

A simple consistency loss is

L = \|f_{\theta_s}(x_1) - f_{\theta_t}(x_2)\|_2^2.
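
A sketch of the EMA update and consistency loss (the encoder architecture, momentum value, and random views are illustrative assumptions):

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)        # the teacher is updated by EMA, not by gradients

alpha = 0.99
x1 = torch.randn(16, 128)          # placeholder views; in practice two augmentations of the same batch
x2 = torch.randn(16, 128)

loss = F.mse_loss(student(x1), teacher(x2).detach())

# after each optimizer step, move the teacher toward the student
with torch.no_grad():
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_((1 - alpha) * p_s)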

Teacher-student methods can reduce the need for explicit negatives, but they require careful design to avoid collapse.

Avoiding Representation Collapse

Representation collapse occurs when the model maps many different inputs to the same representation. A collapsed representation has low information content.

Contrastive learning avoids collapse by using negatives. If all embeddings are identical, the model cannot identify correct pairs among negatives.

Non-contrastive methods use other mechanisms, such as stop-gradient, predictor heads, variance regularization, covariance regularization, clustering targets, or teacher-student averaging.

A practical diagnostic is to monitor embedding variance. If every embedding dimension has near-zero variance across a batch, the representation may be collapsing.

z = F.normalize(torch.randn(256, 128), dim=-1)

std_per_dim = z.std(dim=0)
collapse_score = std_per_dim.mean()

print(collapse_score)

This does not prove representation quality, but it can reveal a severe failure mode.

Data Augmentation as Objective Design

In self-supervised learning, augmentations define invariances.

If two augmented views are treated as the same example, the model is encouraged to ignore the differences between them.

For images, random crops teach some translation and scale invariance. Color jitter teaches partial color invariance. Blur teaches robustness to local detail. But overly strong augmentation can destroy semantic content.
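
A typical image augmentation pipeline for contrastive pretraining, sketched with torchvision (the specific transforms and parameters are illustrative):

from torchvision import transforms

# two stochastic views of the same image define a positive pair
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                           # translation and scale invariance
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),                  # partial color invariance
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),   # robustness to local detail
    transforms.ToTensor(),
])

# view1, view2 = augment(img), augment(img)   # img is a PIL image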

For audio, time masking and frequency masking teach robustness to missing acoustic regions. For text, augmentation is harder because small changes can alter meaning.

Augmentation design is therefore part of objective design. The loss function alone does not define the training signal. The data transformation pipeline also defines what the model learns to ignore and what it must preserve.

Self-Supervised Pretraining and Fine-Tuning

A common workflow has two stages.

First, pretrain the model with a self-supervised objective on large unlabeled data:

\theta_0 = \arg\min_\theta L_{\text{self-supervised}}(\theta).

Second, fine-tune the model on a smaller labeled dataset:

\theta^* = \arg\min_\theta L_{\text{supervised}}(\theta).

Pretraining learns general representations. Fine-tuning adapts them to a specific task.
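
A sketch of the second stage, attaching a task head to a pretrained encoder (the encoder, head, checkpoint path, and hyperparameters are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 128))
# encoder.load_state_dict(torch.load("pretrained_encoder.pt"))   # hypothetical checkpoint

head = nn.Linear(128, 10)          # task-specific classification head
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4
)

x = torch.rand(64, 784)            # small labeled batch
y = torch.randint(0, 10, (64,))

logits = head(encoder(x))
loss = F.cross_entropy(logits, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()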

This pattern is used in language, vision, audio, biology, code, robotics, and multimodal learning.

When Self-Supervision Works Well

Self-supervised learning works best when unlabeled data is abundant, labels are expensive, the data has strong internal structure, and the pretext task aligns with downstream tasks.

It is especially effective when scale matters. Large language models show that simple self-supervised next-token prediction can produce broad capabilities when trained on enough data with enough model capacity.

But self-supervision is not automatic. A poorly chosen pretext task may lead to representations that solve the artificial task while transferring poorly to real tasks.

Practical Guidelines

Use autoregressive prediction for sequence generation. Use masked prediction when bidirectional context is useful. Use contrastive learning for retrieval and embedding spaces. Use reconstruction when preserving input structure matters. Use teacher-student or variance-regularized objectives when negatives are unavailable or expensive.

Always evaluate self-supervised representations on downstream tasks. The pretraining loss alone is insufficient. A lower reconstruction loss or contrastive loss does not guarantee better transfer.

Self-supervised objectives matter because they convert raw data into training signal. They are one of the main reasons modern deep learning can scale beyond manually labeled datasets.