Contrastive objectives train a model by comparing examples. Instead of learning only from an input and its target, the model learns which examples should be close together and which examples should be far apart.
These objectives are central in self-supervised learning, metric learning, retrieval, representation learning, multimodal learning, and modern embedding systems.
The basic idea is: pull the representations of positive pairs together, and push the representations of negative pairs apart.
A contrastive objective needs three elements:
| Element | Meaning |
|---|---|
| Encoder | Maps inputs to representations |
| Similarity function | Measures how close two representations are |
| Positive and negative pairs | Defines what should be close or far apart |
Encoders and Representations
Let an encoder network $f_\theta$ map an input $x$ to an embedding vector:

$$z = f_\theta(x)$$

The embedding $z$ is usually a dense vector in $\mathbb{R}^d$. The goal is to place semantically related examples near each other in this vector space.
For example, in image representation learning, two augmented views of the same image should produce nearby embeddings. In text retrieval, a query and a relevant document should produce nearby embeddings. In vision-language learning, an image and its caption should produce nearby embeddings.
In PyTorch, an encoder may look like:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_features, hidden_features, embedding_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, embedding_dim),
        )

    def forward(self, x):
        z = self.net(x)
        return F.normalize(z, dim=-1)
```

The normalization step maps embeddings to unit length. This is common when using cosine similarity or dot-product similarity.
Similarity Functions
A contrastive loss depends on a similarity function $s(u, v)$. Common choices include dot product, cosine similarity, and negative Euclidean distance.

The dot product is

$$s(u, v) = u^\top v$$

Cosine similarity is

$$s(u, v) = \frac{u^\top v}{\|u\|\,\|v\|}$$

If embeddings are normalized to unit length, cosine similarity equals the dot product:

$$\|u\| = \|v\| = 1 \;\Rightarrow\; \frac{u^\top v}{\|u\|\,\|v\|} = u^\top v$$
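This equivalence is easy to verify numerically; a small sketch:

```python
import torch
import torch.nn.functional as F

# Two random vectors, normalized to unit length.
u = F.normalize(torch.randn(128), dim=-1)
v = F.normalize(torch.randn(128), dim=-1)

# For unit vectors, cosine similarity and the dot product coincide.
cos = F.cosine_similarity(u, v, dim=-1)
dot = u @ v
assert torch.allclose(cos, dot, atol=1e-5)
```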
In PyTorch:
```python
z1 = F.normalize(torch.randn(32, 128), dim=-1)
z2 = F.normalize(torch.randn(32, 128), dim=-1)

similarity = z1 @ z2.T
print(similarity.shape)  # torch.Size([32, 32])
```

The result is a pairwise similarity matrix. Entry $(i, j)$ measures the similarity between example $i$ from the first batch and example $j$ from the second batch.
Positive and Negative Pairs
A positive pair consists of two examples that should be close. A negative pair consists of two examples that should be separated.
The definition depends on the task.
| Task | Positive pair | Negative pair |
|---|---|---|
| Image self-supervision | Two augmentations of same image | Views from different images |
| Text retrieval | Query and relevant document | Query and irrelevant document |
| Image-text learning | Image and matching caption | Image and nonmatching caption |
| Face verification | Same identity | Different identity |
| Audio-text learning | Audio clip and transcript | Audio clip and wrong transcript |
Contrastive learning often avoids manual labels by constructing positives automatically. For example, two random augmentations of the same image are treated as a positive pair.
This makes contrastive learning useful for self-supervised pretraining.
InfoNCE Loss
InfoNCE is one of the most important contrastive objectives. It trains the model to identify the correct positive example among many negatives.
Suppose an anchor representation $z$ has a positive representation $z^+$ and a set of candidate representations $\{z_1, \dots, z_K\}$ containing one positive and many negatives. The loss is

$$\mathcal{L} = -\log \frac{\exp\big(s(z, z^+)/\tau\big)}{\sum_{k=1}^{K} \exp\big(s(z, z_k)/\tau\big)}$$

Here $\tau$ is the temperature parameter.
The numerator increases when the anchor is similar to its positive. The denominator includes both the positive and the negatives, so the objective also decreases similarity to negatives.
This is cross-entropy over similarities. The model must classify which candidate is the positive match.
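The correspondence can be checked directly: writing InfoNCE as a negative log-softmax gives the same value as `F.cross_entropy` with the positive's index as the target (a toy sketch, assuming the positive sits at index 0):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
temperature = 0.1

# Anchor and K = 8 candidates, unit-normalized; candidate 0 plays the positive.
anchor = F.normalize(torch.randn(128), dim=-1)
candidates = F.normalize(torch.randn(8, 128), dim=-1)

# InfoNCE written out: negative log-softmax probability of the positive.
logits = candidates @ anchor / temperature
loss_manual = -torch.log_softmax(logits, dim=0)[0]

# The same value via cross-entropy with target index 0.
loss_ce = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
assert torch.allclose(loss_manual, loss_ce, atol=1e-6)
```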
Temperature
The temperature $\tau$ controls the sharpness of the softmax distribution.
A small temperature makes the softmax sharper. The model focuses more on the hardest negatives. A large temperature makes the distribution smoother. The model spreads probability more evenly across candidates.
The logits used by InfoNCE are

$$\ell_k = \frac{s(z, z_k)}{\tau}$$

If $\tau$ is too small, training can become unstable. If $\tau$ is too large, the contrast between positives and negatives may become weak.

Temperature is usually tuned as a hyperparameter. Common values often fall between 0.05 and 0.2, but the best value depends on embedding normalization, batch size, model size, and task.
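The effect is easy to see with fixed similarity scores (the numbers below are illustrative):

```python
import torch

# Fixed similarity scores: the first (positive) slightly above the rest.
scores = torch.tensor([0.9, 0.7, 0.5, 0.3])

# Small temperature: sharp distribution, mass concentrates on the top score.
p_sharp = torch.softmax(scores / 0.05, dim=0)

# Large temperature: smooth distribution, mass spreads across candidates.
p_smooth = torch.softmax(scores / 0.5, dim=0)

print(p_sharp)   # heavily peaked on the first entry
print(p_smooth)  # much flatter
```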
In-Batch Negatives
A practical strength of contrastive learning is the use of in-batch negatives. If a batch contains $B$ matched pairs, each example's positive is its paired example, and the other $B - 1$ examples act as negatives.
For example, suppose we have query embeddings

$$q_1, \dots, q_B$$

and document embeddings

$$d_1, \dots, d_B$$

The similarity matrix is

$$S_{ij} = q_i^\top d_j$$

The correct match for query $i$ is document $i$. Thus the target labels are

$$y_i = i$$

The contrastive loss is ordinary cross-entropy over the rows of $S$ (scaled by $1/\tau$).
In PyTorch:
```python
import torch
import torch.nn.functional as F

B, d = 32, 128
queries = F.normalize(torch.randn(B, d), dim=-1)
docs = F.normalize(torch.randn(B, d), dim=-1)

temperature = 0.07
logits = queries @ docs.T / temperature
targets = torch.arange(B)
loss = F.cross_entropy(logits, targets)
print(loss)
```

This is the core of many dual-encoder retrieval systems.
Symmetric Contrastive Loss
For paired modalities, such as image and text, training often uses a symmetric loss. The image should retrieve the correct text, and the text should retrieve the correct image.
Let

$$u_1, \dots, u_B$$

be image embeddings and

$$v_1, \dots, v_B$$

be text embeddings. The similarity matrix is

$$S_{ij} = \frac{u_i^\top v_j}{\tau}$$

The image-to-text loss applies cross-entropy row-wise:

$$\mathcal{L}_{\text{i2t}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{B} \exp(S_{ij})}$$

The text-to-image loss applies cross-entropy column-wise:

$$\mathcal{L}_{\text{t2i}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{B} \exp(S_{ji})}$$

The final loss is

$$\mathcal{L} = \tfrac{1}{2}\left(\mathcal{L}_{\text{i2t}} + \mathcal{L}_{\text{t2i}}\right)$$
In PyTorch:
```python
def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

This is the standard form used in many image-text contrastive models.
Contrastive Learning for Self-Supervised Vision
In self-supervised vision, labels are often unavailable. Contrastive learning creates supervision through augmentation.
For each image $x$, generate two different random views:

$$\tilde{x}_1 = t_1(x), \qquad \tilde{x}_2 = t_2(x)$$

where $t_1$ and $t_2$ are randomly sampled augmentations.
These two views form a positive pair. Other images in the batch act as negatives.
The encoder maps each view to an embedding:

$$z_1 = f_\theta(\tilde{x}_1), \qquad z_2 = f_\theta(\tilde{x}_2)$$
The objective pulls together embeddings from the same original image and pushes apart embeddings from different images.
Common augmentations include random crop, color jitter, blur, horizontal flip, grayscale conversion, and noise.
The model learns representations useful for downstream tasks because it must preserve semantic content while ignoring nuisance variation introduced by augmentation.
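A toy sketch of this pipeline, with Gaussian noise standing in for real augmentations and a linear layer standing in for a real encoder (both are placeholders, not a recommended setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def augment(x):
    # Placeholder for real augmentations (random crop, color jitter, blur):
    # small Gaussian noise produces two distinct random views for illustration.
    return x + 0.1 * torch.randn_like(x)

encoder = nn.Linear(64, 32)      # placeholder encoder
images = torch.randn(16, 64)     # toy "images" as flat vectors

# Two random views of each image; view i in z1 and view i in z2 are positives.
z1 = F.normalize(encoder(augment(images)), dim=-1)
z2 = F.normalize(encoder(augment(images)), dim=-1)

# In-batch InfoNCE: row i's target is column i.
logits = z1 @ z2.T / 0.1
loss = F.cross_entropy(logits, torch.arange(16))
print(loss)
```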
Contrastive Learning for Retrieval
In retrieval, contrastive objectives train embeddings for nearest-neighbor search.
A query encoder maps a query to a vector:

$$q = f_q(\text{query})$$

A document encoder maps a document to a vector:

$$d = f_d(\text{document})$$

The score is often the dot product:

$$\text{score}(q, d) = q^\top d$$
Training uses positive query-document pairs and negative documents. At inference time, document embeddings can be indexed in a vector database, and queries retrieve documents with high similarity.
This design is used in semantic search, retrieval-augmented generation, recommendation, code search, and question answering.
The key benefit is that retrieval becomes a fast approximate nearest-neighbor problem after embeddings have been computed.
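A brute-force sketch of this lookup (toy random embeddings; a real system would use an approximate nearest-neighbor index instead of scoring every document):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Precomputed, normalized document embeddings (toy random data).
doc_emb = F.normalize(torch.randn(1000, 128), dim=-1)

# Embed a query, score it against every document, take the top k.
query = F.normalize(torch.randn(128), dim=-1)
scores = doc_emb @ query                # dot-product scores, shape (1000,)
topk = torch.topk(scores, k=5)
print(topk.indices)                     # ids of the 5 most similar documents
```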
Contrastive Learning for Language Models
Contrastive objectives also appear in language modeling and representation learning.
Examples include sentence embeddings, instruction retrieval, preference learning, and representation alignment.
For sentence embedding, a positive pair may be two paraphrases. A negative pair may be unrelated sentences. The model learns an embedding space where semantic similarity corresponds to vector similarity.
For instruction tuning or preference modeling, contrastive objectives can compare preferred and rejected responses:

$$s(x, y^{+}) > s(x, y^{-})$$

where $y^{+}$ is the preferred response and $y^{-}$ the rejected response for input $x$.
A margin or logistic contrastive loss can train the model or reward model to score preferred outputs higher than rejected outputs.
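A sketch of both variants on toy scalar scores (the numbers are made up, standing in for reward-model outputs):

```python
import torch
import torch.nn.functional as F

# Made-up scalar scores standing in for reward-model outputs.
score_preferred = torch.tensor([1.2, 0.4, 2.0])
score_rejected = torch.tensor([0.3, 0.9, 1.5])

# Logistic contrastive loss: -log sigmoid(score gap).
loss_logistic = -F.logsigmoid(score_preferred - score_rejected).mean()

# Margin variant: penalize pairs whose gap falls below the margin.
margin = 0.5
loss_margin = F.relu(margin - (score_preferred - score_rejected)).mean()
print(loss_logistic, loss_margin)
```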
Supervised Contrastive Loss
Supervised contrastive learning uses labels to define positives and negatives. Examples with the same class are positives. Examples with different classes are negatives.
For an anchor $z_i$, let $P(i)$ be the set of other examples in the batch with the same label. The supervised contrastive loss is

$$\mathcal{L}_i = -\frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\big(s(z_i, z_p)/\tau\big)}{\sum_{a \neq i} \exp\big(s(z_i, z_a)/\tau\big)}$$
This generalizes InfoNCE from one positive per anchor to multiple positives per anchor.
Supervised contrastive learning can produce embeddings with better class structure than ordinary cross-entropy, especially when transfer or nearest-neighbor evaluation matters.
Implementing Supervised Contrastive Loss
A simple PyTorch implementation:
```python
def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    embeddings = F.normalize(embeddings, dim=-1)
    logits = embeddings @ embeddings.T / temperature

    # Positive mask: same label, excluding self-pairs.
    labels = labels.view(-1, 1)
    mask = labels.eq(labels.T)
    batch_size = embeddings.shape[0]
    eye = torch.eye(batch_size, dtype=torch.bool, device=embeddings.device)
    mask = mask & ~eye

    # Exclude self-similarity from the softmax denominator.
    logits = logits.masked_fill(eye, float("-inf"))
    log_probs = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average log-probability over each anchor's positives. Selecting with
    # masked_fill (rather than multiplying by the mask) avoids -inf * 0 = NaN
    # on the diagonal.
    positives_per_anchor = mask.sum(dim=1)
    valid = positives_per_anchor > 0
    loss = -log_probs.masked_fill(~mask, 0.0).sum(dim=1) / positives_per_anchor.clamp_min(1)
    return loss[valid].mean()
```

This implementation excludes each example from being its own positive. It also handles anchors that have no positive examples in the batch.
Collapse
A major risk in representation learning is collapse. Collapse occurs when the encoder maps many or all inputs to the same embedding.
If all embeddings are identical, the representation contains little useful information.
Contrastive losses reduce collapse by using negatives. If all embeddings are the same, the model cannot distinguish positives from negatives, so the loss remains high.
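This is visible in the loss value itself: if every embedding collapses to the same vector, all pairwise similarities are equal, the in-batch softmax is uniform, and the loss is pinned at $\log B$:

```python
import math
import torch
import torch.nn.functional as F

B = 32
collapsed = F.normalize(torch.ones(B, 128), dim=-1)  # every embedding identical

# All pairwise similarities are equal, so the softmax over each row is
# uniform and the cross-entropy loss equals log B.
logits = collapsed @ collapsed.T / 0.07
loss = F.cross_entropy(logits, torch.arange(B))
assert abs(loss.item() - math.log(B)) < 1e-3
```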
Some modern self-supervised methods avoid explicit negatives through architectural tricks, stop-gradient operations, variance regularization, clustering objectives, or teacher-student networks. Even then, preventing collapse remains a central design problem.
Batch Size and Negative Quality
Contrastive learning is sensitive to batch construction. Larger batches provide more in-batch negatives. More negatives usually improve the contrastive signal, especially for retrieval and image-text pretraining.
However, not all negatives are useful. Random negatives may be too easy. Hard negatives can improve learning but may introduce false negatives.
A false negative is an example treated as negative even though it is semantically related to the anchor. For example, two different captions may correctly describe the same image concept. Treating them as negatives can harm representation learning.
Good contrastive training often depends on:
| Factor | Why it matters |
|---|---|
| Batch size | Controls number of in-batch negatives |
| Data quality | Defines meaningful positives |
| Augmentation | Controls invariances learned |
| Hard negatives | Improve discrimination |
| Temperature | Controls gradient concentration |
| Embedding normalization | Stabilizes similarity scale |
Contrastive Loss Versus Cross-Entropy
Contrastive objectives often reduce to cross-entropy over similarity scores. The difference is in what the classes mean.
In ordinary classification, classes are fixed labels such as “cat,” “dog,” or “truck.”
In contrastive learning, the “class” for an anchor is its matching example among candidates. The labels are often created by the batch structure.
Thus, contrastive learning can use cross-entropy without a fixed classifier head.
| Objective | Compared items | Target |
|---|---|---|
| Cross-entropy classification | Example versus class logits | Correct class |
| Contrastive learning | Example versus example similarities | Correct match |
| Ranking loss | Positive versus negative scores | Positive ranks higher |
| Triplet loss | Anchor-positive-negative distances | Positive closer than negative |
Contrastive objectives are therefore a bridge between classification, metric learning, and retrieval.
Practical Guidelines
Use contrastive objectives when the goal is to learn a representation space rather than only predict a fixed label.
For retrieval, use dual encoders with in-batch negatives and symmetric contrastive loss when the relation is bidirectional. For self-supervised vision, choose augmentations carefully because they define the invariances the model learns. For supervised embedding learning, use labels to create multiple positives per anchor when possible.
Normalize embeddings unless there is a reason not to. Tune the temperature. Use large and diverse batches when feasible. Monitor retrieval metrics, nearest-neighbor examples, and embedding collapse indicators rather than relying only on training loss.
Contrastive objectives are powerful because they make representation learning comparative. The model learns by asking: among many candidates, which example belongs with this one?