Contrastive Objectives

Contrastive objectives train a model by comparing examples. Instead of learning only from an input and its target, the model learns which examples should be close together and which examples should be far apart.

These objectives are central to self-supervised learning, metric learning, retrieval, representation learning, multimodal learning, and modern embedding systems.

The basic idea is that similar examples should have similar representations, and dissimilar examples should have dissimilar representations.

A contrastive objective needs three elements:

| Element | Meaning |
| --- | --- |
| Encoder | Maps inputs to representations |
| Similarity function | Measures how close two representations are |
| Positive and negative pairs | Defines what should be close or far apart |

Encoders and Representations

Let an encoder network map an input x to an embedding vector:

z = f_\theta(x).

The embedding z is usually a dense vector in \mathbb{R}^d. The goal is to place semantically related examples near each other in this vector space.

For example, in image representation learning, two augmented views of the same image should produce nearby embeddings. In text retrieval, a query and a relevant document should produce nearby embeddings. In vision-language learning, an image and its caption should produce nearby embeddings.

In PyTorch, an encoder may look like:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_features, hidden_features, embedding_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, embedding_dim),
        )

    def forward(self, x):
        z = self.net(x)
        return F.normalize(z, dim=-1)

The normalization step maps embeddings to unit length. This is common when using cosine similarity or dot-product similarity.

Similarity Functions

A contrastive loss depends on a similarity function s(z_i, z_j). Common choices include the dot product, cosine similarity, and negative Euclidean distance.

The dot product is

s(z_i, z_j) = z_i^\top z_j.

Cosine similarity is

s(z_i, z_j) = \frac{z_i^\top z_j}{\|z_i\|_2 \|z_j\|_2}.

If embeddings are normalized to unit length, cosine similarity equals the dot product:

\|z_i\|_2 = \|z_j\|_2 = 1 \quad \Rightarrow \quad s(z_i, z_j) = z_i^\top z_j.

In PyTorch:

z1 = F.normalize(torch.randn(32, 128), dim=-1)
z2 = F.normalize(torch.randn(32, 128), dim=-1)

similarity = z1 @ z2.T
print(similarity.shape)  # torch.Size([32, 32])

The result is a pairwise similarity matrix. Entry (i, j) measures the similarity between example i from the first batch and example j from the second batch.

Positive and Negative Pairs

A positive pair consists of two examples that should be close. A negative pair consists of two examples that should be separated.

The definition depends on the task.

| Task | Positive pair | Negative pair |
| --- | --- | --- |
| Image self-supervision | Two augmentations of same image | Views from different images |
| Text retrieval | Query and relevant document | Query and irrelevant document |
| Image-text learning | Image and matching caption | Image and nonmatching caption |
| Face verification | Same identity | Different identity |
| Audio-text learning | Audio clip and transcript | Audio clip and wrong transcript |

Contrastive learning often avoids manual labels by constructing positives automatically. For example, two random augmentations of the same image are treated as a positive pair.

This makes contrastive learning useful for self-supervised pretraining.

InfoNCE Loss

InfoNCE is one of the most important contrastive objectives. It trains the model to identify the correct positive example among many negatives.

Suppose an anchor representation z_i has a positive representation z_i^+ and a set of candidate representations containing one positive and many negatives. The loss is

L_i = -\log \frac{\exp(s(z_i, z_i^+)/\tau)}{\sum_{j=1}^{N} \exp(s(z_i, z_j)/\tau)}.

Here \tau > 0 is the temperature parameter.

The numerator increases when the anchor is similar to its positive. The denominator includes both the positive and the negatives, so the objective also pushes down similarity to negatives.

This is cross-entropy over similarities. The model must classify which candidate is the positive match.
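
To make this concrete, here is a minimal sketch of InfoNCE for a single anchor, assuming purely for illustration that the positive sits at index 0 of the candidate set:

import torch
import torch.nn.functional as F

def info_nce(anchor, candidates, tau=0.1):
    # anchor: (d,), candidates: (N, d), positive assumed at index 0
    logits = candidates @ anchor / tau
    target = torch.zeros(1, dtype=torch.long)  # index of the positive
    return F.cross_entropy(logits.unsqueeze(0), target)

anchor = F.normalize(torch.randn(128), dim=-1)
candidates = F.normalize(torch.randn(10, 128), dim=-1)
print(info_nce(anchor, candidates))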

Temperature

The temperature \tau controls the sharpness of the softmax distribution.

A small temperature makes the softmax sharper. The model focuses more on the hardest negatives. A large temperature makes the distribution smoother. The model spreads probability more evenly across candidates.

The logits used by InfoNCE are

\frac{s(z_i, z_j)}{\tau}.

If \tau is too small, training can become unstable. If \tau is too large, the contrast between positives and negatives may become weak.

Temperature is usually tuned as a hyperparameter. Common values are often between 0.01 and 0.2, but the best value depends on embedding normalization, batch size, model size, and task.
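
A quick numerical illustration of the sharpening effect, using made-up similarity scores:

import torch

# one positive (0.9) and two negatives (0.7, 0.2), as illustrative cosine similarities
logits = torch.tensor([0.9, 0.7, 0.2])

for tau in (0.01, 0.1, 1.0):
    probs = torch.softmax(logits / tau, dim=0)
    print(tau, probs)
# small tau concentrates nearly all probability on the top candidate;
# large tau spreads probability across all candidates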

In-Batch Negatives

A practical strength of contrastive learning is the use of in-batch negatives. If a batch contains B matched pairs, each example’s positive is its paired example, and the other B - 1 examples act as negatives.

For example, suppose we have query embeddings

Q \in \mathbb{R}^{B \times d}

and document embeddings

D \in \mathbb{R}^{B \times d}.

The similarity matrix is

S = QD^\top.

The correct match for query i is document i. Thus the target labels are

[0, 1, 2, \ldots, B-1].

The contrastive loss is ordinary cross-entropy over the rows of S.

In PyTorch:

import torch
import torch.nn.functional as F

B, d = 32, 128

queries = F.normalize(torch.randn(B, d), dim=-1)
docs = F.normalize(torch.randn(B, d), dim=-1)

temperature = 0.07
logits = queries @ docs.T / temperature

targets = torch.arange(B)  # the positive for query i is document i

loss = F.cross_entropy(logits, targets)
print(loss)

This is the core of many dual-encoder retrieval systems.

Symmetric Contrastive Loss

For paired modalities, such as image and text, training often uses a symmetric loss. The image should retrieve the correct text, and the text should retrieve the correct image.

Let

I \in \mathbb{R}^{B \times d}

be image embeddings and

T \in \mathbb{R}^{B \times d}

be text embeddings. The similarity matrix is

S = IT^\top.

The image-to-text loss applies cross-entropy row-wise:

L_{I \to T} = \mathrm{CE}(S, [0, 1, \ldots, B-1]).

The text-to-image loss applies cross-entropy column-wise:

L_{T \to I} = \mathrm{CE}(S^\top, [0, 1, \ldots, B-1]).

The final loss is

L = \frac{1}{2}\left(L_{I \to T} + L_{T \to I}\right).

In PyTorch:

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)

    return 0.5 * (loss_i2t + loss_t2i)

This is the standard form used in many image-text contrastive models.

Contrastive Learning for Self-Supervised Vision

In self-supervised vision, labels are often unavailable. Contrastive learning creates supervision through augmentation.

For each image, generate two different random views:

x_i^{(1)}, x_i^{(2)}.

These two views form a positive pair. Other images in the batch act as negatives.

The encoder maps each view to an embedding:

z_i^{(1)} = f_\theta(x_i^{(1)}), \qquad z_i^{(2)} = f_\theta(x_i^{(2)}).

The objective pulls together embeddings from the same original image and pushes apart embeddings from different images.

Common augmentations include random crop, color jitter, blur, horizontal flip, grayscale conversion, and noise.
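
In PyTorch, a two-view pipeline might look like the following sketch; the transforms and parameter values here are illustrative choices rather than a prescription from any particular method:

from torchvision import transforms

# illustrative augmentation pipeline; parameters are example values
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

# image is assumed to be a PIL image; two independent draws form a positive pair
view1 = augment(image)
view2 = augment(image)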

The model learns representations useful for downstream tasks because it must preserve semantic content while ignoring nuisance variation introduced by augmentation.

Contrastive Learning for Retrieval

In retrieval, contrastive objectives train embeddings for nearest-neighbor search.

A query encoder maps a query to a vector:

q = f_\theta(x_q).

A document encoder maps a document to a vector:

d = g_\phi(x_d).

The score is often the dot product:

s(q, d) = q^\top d.

Training uses positive query-document pairs and negative documents. At inference time, document embeddings can be indexed in a vector database, and queries retrieve documents with high similarity.

This design is used in semantic search, retrieval-augmented generation, recommendation, code search, and question answering.

The key benefit is that retrieval becomes a fast approximate nearest-neighbor problem after embeddings have been computed.
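
As a minimal sketch of this inference-time pattern (a production system would typically replace the exact matrix product with an approximate nearest-neighbor index):

import torch
import torch.nn.functional as F

num_docs, d = 10_000, 128

# document embeddings, computed once offline and stored in an index
doc_index = F.normalize(torch.randn(num_docs, d), dim=-1)

# embed the query, then score every document with a dot product
query = F.normalize(torch.randn(d), dim=-1)
scores = doc_index @ query
top_scores, top_ids = scores.topk(k=5)
print(top_ids)  # indices of the 5 highest-scoring documents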

Contrastive Learning for Language Models

Contrastive objectives also appear in language modeling and representation learning.

Examples include sentence embeddings, instruction retrieval, preference learning, and representation alignment.

For sentence embedding, a positive pair may be two paraphrases. A negative pair may be unrelated sentences. The model learns an embedding space where semantic similarity corresponds to vector similarity.

For instruction tuning or preference modeling, contrastive objectives can compare preferred and rejected responses:

s(x, y^+) > s(x, y^-).

A margin or logistic contrastive loss can train the model or reward model to score preferred outputs higher than rejected outputs.
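
Here is a minimal sketch of the logistic variant, assuming scalar scores for the preferred and rejected responses are already available:

import torch
import torch.nn.functional as F

def preference_loss(score_pos, score_neg):
    # -log sigmoid(s(x, y+) - s(x, y-)): low when preferred outputs score higher
    return -F.logsigmoid(score_pos - score_neg).mean()

score_pos = torch.randn(8)  # scores for preferred responses
score_neg = torch.randn(8)  # scores for rejected responses
print(preference_loss(score_pos, score_neg))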

Supervised Contrastive Loss

Supervised contrastive learning uses labels to define positives and negatives. Examples with the same class are positives. Examples with different classes are negatives.

For an anchor i, let P(i) be the set of other examples in the batch with the same label. The supervised contrastive loss is

L_i = -\frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(s(z_i, z_p)/\tau)}{\sum_{a \neq i} \exp(s(z_i, z_a)/\tau)}.

This generalizes InfoNCE from one positive per anchor to multiple positives per anchor.

Supervised contrastive learning can produce embeddings with better class structure than ordinary cross-entropy, especially when transfer or nearest-neighbor evaluation matters.

Implementing Supervised Contrastive Loss

A simple PyTorch implementation:

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    embeddings = F.normalize(embeddings, dim=-1)
    logits = embeddings @ embeddings.T / temperature

    labels = labels.view(-1, 1)
    mask = labels.eq(labels.T)

    batch_size = embeddings.shape[0]
    eye = torch.eye(batch_size, dtype=torch.bool, device=embeddings.device)

    # exclude each example from its own positive set and from the denominator
    mask = mask & ~eye
    logits = logits.masked_fill(eye, float("-inf"))

    log_probs = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # zero the -inf diagonal so that 0 * -inf cannot produce NaN below
    log_probs = log_probs.masked_fill(eye, 0.0)

    positives_per_anchor = mask.sum(dim=1)
    valid = positives_per_anchor > 0

    loss = -(log_probs * mask).sum(dim=1) / positives_per_anchor.clamp_min(1)
    return loss[valid].mean()

This implementation excludes each example from being its own positive. It also handles anchors that have no positive examples in the batch.
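
A quick usage sketch with random data:

embeddings = torch.randn(16, 128)
labels = torch.randint(0, 4, (16,))
print(supervised_contrastive_loss(embeddings, labels))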

Collapse

A major risk in representation learning is collapse. Collapse occurs when the encoder maps many or all inputs to the same embedding.

If all embeddings are identical, the representation contains little useful information.

Contrastive losses reduce collapse by using negatives. If all embeddings are the same, the model cannot distinguish positives from negatives, so the loss remains high.

Some modern self-supervised methods avoid explicit negatives through architectural tricks, stop-gradient operations, variance regularization, clustering objectives, or teacher-student networks. Even then, preventing collapse remains a central design problem.
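
One simple heuristic collapse indicator is the mean off-diagonal cosine similarity within a batch of embeddings; values near 1 suggest the encoder is mapping everything to the same direction. A minimal sketch:

import torch
import torch.nn.functional as F

def mean_pairwise_cosine(embeddings):
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T
    n = sim.shape[0]
    # select off-diagonal entries only
    off_diag = sim[~torch.eye(n, dtype=torch.bool, device=sim.device)]
    return off_diag.mean()  # values near 1.0 suggest collapse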

Batch Size and Negative Quality

Contrastive learning is sensitive to batch construction. Larger batches provide more in-batch negatives. More negatives usually improve the contrastive signal, especially for retrieval and image-text pretraining.

However, not all negatives are useful. Random negatives may be too easy. Hard negatives can improve learning but may introduce false negatives.

A false negative is an example treated as negative even though it is semantically related to the anchor. For example, two different captions may correctly describe the same image concept. Treating them as negatives can harm representation learning.

Good contrastive training often depends on:

| Factor | Why it matters |
| --- | --- |
| Batch size | Controls number of in-batch negatives |
| Data quality | Defines meaningful positives |
| Augmentation | Controls invariances learned |
| Hard negatives | Improve discrimination |
| Temperature | Controls gradient concentration |
| Embedding normalization | Stabilizes similarity scale |

Contrastive Loss Versus Cross-Entropy

Contrastive objectives often reduce to cross-entropy over similarity scores. The difference is in what the classes mean.

In ordinary classification, classes are fixed labels such as “cat,” “dog,” or “truck.”

In contrastive learning, the “class” for an anchor is its matching example among candidates. The labels are often created by the batch structure.

Thus, contrastive learning can use cross-entropy without a fixed classifier head.

| Objective | Compared items | Target |
| --- | --- | --- |
| Cross-entropy classification | Example versus class logits | Correct class |
| Contrastive learning | Example versus example similarities | Correct match |
| Ranking loss | Positive versus negative scores | Positive ranks higher |
| Triplet loss | Anchor-positive-negative distances | Positive closer than negative |

Contrastive objectives are therefore a bridge between classification, metric learning, and retrieval.

Practical Guidelines

Use contrastive objectives when the goal is to learn a representation space rather than only predict a fixed label.

For retrieval, use dual encoders with in-batch negatives and symmetric contrastive loss when the relation is bidirectional. For self-supervised vision, choose augmentations carefully because they define the invariances the model learns. For supervised embedding learning, use labels to create multiple positives per anchor when possible.

Normalize embeddings unless there is a reason not to. Tune the temperature. Use large and diverse batches when feasible. Monitor retrieval metrics, nearest-neighbor examples, and embedding collapse indicators rather than relying only on training loss.

Contrastive objectives are powerful because they make representation learning comparative. The model learns by asking: among many candidates, which example belongs with this one?