Representation learning is the study of how models learn useful internal descriptions of data. Instead of relying only on hand-designed features, a neural network learns features from examples. These learned features are then used for prediction, reconstruction, retrieval, generation, or control.
An input may begin as raw data $x$. A model maps it to a representation:

$$h = f_\theta(x)$$

where $h \in \mathbb{R}^d$ is produced by a network with parameters $\theta$.
The vector may be called a feature vector, embedding, hidden state, code, or representation. The name depends on the context, but the role is the same: it is a transformed version of the input that should make useful structure easier to access.
What Makes a Representation Useful?
A useful representation preserves information that matters and removes information that does not. The relevant information depends on the task.
For image classification, a useful representation should preserve object identity and discard small changes in lighting, position, or background. For speech recognition, it should preserve phonetic content and reduce irrelevant speaker or microphone variation. For retrieval, it should place semantically similar items near each other. For generation, it should preserve enough information to reconstruct or synthesize data.
A good representation often has several properties:
| Property | Meaning |
|---|---|
| Compactness | Uses fewer or better-organized coordinates |
| Invariance | Ignores nuisance variation |
| Equivariance | Changes predictably under transformations |
| Separability | Makes classes or concepts easier to distinguish |
| Smoothness | Nearby representations decode or behave similarly |
| Compositionality | Combines simpler factors into richer meanings |
No single property is always best. The right representation depends on the objective.
Hand-Designed Features Versus Learned Features
Before deep learning became dominant, many systems depended on hand-designed features. In vision, engineers designed edge detectors, texture descriptors, and shape features. In language, systems used word counts, part-of-speech tags, syntactic patterns, and dictionaries. In speech, systems used spectral features such as MFCCs.
Deep learning changes this pattern. The model learns multiple layers of features jointly with the task objective.
A simple classifier can be written as

$$\hat{y} = g_\phi\big(f_\theta(x)\big)$$

where $f_\theta$ is the representation learner and $g_\phi$ is the prediction head.
Training adjusts both $\theta$ and $\phi$, so the representation becomes useful for prediction.
This joint learning is one reason deep networks perform well on raw or lightly processed data. The features adapt to the data distribution and the task.
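To make this concrete, here is a minimal PyTorch sketch of the composition $\hat{y} = g_\phi(f_\theta(x))$. The `Encoder` and `Head` modules and all dimensions are illustrative choices, not fixed by the text.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """f_theta: maps raw inputs to a representation h."""
    def __init__(self, input_dim: int, hidden_dim: int, repr_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, repr_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Head(nn.Module):
    """g_phi: maps the representation h to predictions."""
    def __init__(self, repr_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(repr_dim, num_classes)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.classifier(h)

f = Encoder(input_dim=784, hidden_dim=256, repr_dim=64)
g = Head(repr_dim=64, num_classes=10)

x = torch.randn(32, 784)        # a batch of raw inputs
h = f(x)                        # representation
y_hat = g(h)                    # prediction
print(h.shape, y_hat.shape)     # torch.Size([32, 64]) torch.Size([32, 10])
```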
Layered Representations
Deep networks build representations in layers. Each layer transforms the previous representation:

$$h^{(l)} = f^{(l)}\big(h^{(l-1)}\big), \qquad h^{(0)} = x$$
Early layers often capture local or low-level patterns. Later layers often capture more abstract structure.
In a convolutional network for images, early layers may detect edges and textures. Middle layers may detect parts. Later layers may detect object-level patterns.
In a transformer language model, early layers may capture token identity and local syntax. Middle layers may capture phrases, entities, and dependencies. Later layers may capture task-relevant abstractions and output-oriented features.
This hierarchy is not perfectly clean, but it is a useful organizing principle.
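A minimal sketch of this layered view, using a small fully connected stack as a stand-in; each intermediate tensor is one layer's representation $h^{(l)}$.

```python
import torch
from torch import nn

layers = nn.ModuleList([
    nn.Sequential(nn.Linear(784, 256), nn.ReLU()),   # h1: lower-level features
    nn.Sequential(nn.Linear(256, 128), nn.ReLU()),   # h2: intermediate features
    nn.Sequential(nn.Linear(128, 64), nn.ReLU()),    # h3: more abstract features
])

x = torch.randn(8, 784)
h = x
representations = []
for layer in layers:
    h = layer(h)                 # h^(l) = f^(l)(h^(l-1))
    representations.append(h)

for i, r in enumerate(representations, start=1):
    print(f"layer {i}: {tuple(r.shape)}")
```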
Supervised Representation Learning
In supervised learning, representations are learned from labeled examples $\{(x_i, y_i)\}_{i=1}^{N}$.
The model is trained to predict $y$ from $x$. The loss may be cross-entropy for classification:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\phi\big(y_i \mid f_\theta(x_i)\big)$$
The hidden layers learn features that help reduce prediction error.
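As a sketch, one supervised training step might look like the following. The tiny model, dimensions, and optimizer settings are illustrative; the point is that a single gradient step updates the hidden layers and the prediction head together.

```python
import torch
from torch import nn

# A small model: hidden layers (representation learner) followed by a linear prediction head.
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),   # representation learner
    nn.Linear(128, 10),               # prediction head
)
loss_fn = nn.CrossEntropyLoss()       # cross-entropy over class logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 784)              # batch of inputs
y = torch.randint(0, 10, (32,))       # integer class labels

logits = model(x)
loss = loss_fn(logits, y)             # -log p(y | f_theta(x))
loss.backward()                       # gradients flow through head and hidden layers
optimizer.step()
```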
For example, an image classifier trained on animal labels may learn features for fur, eyes, ears, body shape, and background context. These features emerge because they help predict labels.
Supervised representations are often highly useful for the training task. They may transfer to related tasks, but they can also become too specialized. If the labels are narrow, the learned representation may ignore useful information outside the labeled objective.
Self-Supervised Representation Learning
Self-supervised learning creates training signals from the data itself. Instead of requiring human labels, the model predicts missing, transformed, or related parts of the input.
Examples include:
| Data type | Self-supervised task |
|---|---|
| Text | Predict masked or next tokens |
| Images | Predict missing patches |
| Audio | Predict masked frames |
| Video | Predict future frames |
| Graphs | Predict missing edges or attributes |
The model learns representations because solving the pretext task requires understanding structure in the data.
A masked prediction objective can be written as

$$\mathcal{L} = -\sum_{i \in M} \log p_\theta\big(x_i \mid x_{\setminus M}\big)$$

where $M$ is the set of masked positions and $x_{\setminus M}$ is the visible context.
This objective is central to many modern foundation models. It allows models to learn from large unlabeled datasets.
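A toy sketch of this objective, assuming a simple per-position predictor instead of a full transformer encoder and a hypothetical mask token ID; the loss is computed only at the masked positions.

```python
import torch
from torch import nn

vocab_size, dim, seq_len = 1000, 64, 16

# Toy per-position predictor: embedding followed by a linear map back to the vocabulary.
# A real masked language model would use a transformer encoder here.
embed = nn.Embedding(vocab_size, dim)
to_vocab = nn.Linear(dim, vocab_size)

tokens = torch.randint(0, vocab_size, (4, seq_len))     # (batch, seq_len)
mask = torch.rand(4, seq_len) < 0.15                    # M: positions to mask
mask_id = 0                                             # assumed [MASK] token id
inputs = tokens.masked_fill(mask, mask_id)              # hide the masked tokens

logits = to_vocab(embed(inputs))                        # (batch, seq_len, vocab)
loss = nn.functional.cross_entropy(
    logits[mask],                                       # predictions at masked positions
    tokens[mask],                                       # original tokens to recover
)
loss.backward()
```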
Contrastive Representation Learning
Contrastive learning trains representations by comparing examples. The model pulls related examples closer and pushes unrelated examples apart.
Suppose $x$ and $x^{+}$ are two augmented views of the same example, while $x^{-}_1, \dots, x^{-}_K$ are unrelated examples. Let

$$h = f_\theta(x), \qquad h^{+} = f_\theta(x^{+}), \qquad h^{-}_k = f_\theta(x^{-}_k).$$

A contrastive loss encourages

$$\mathrm{sim}(h, h^{+})$$

to be larger than

$$\mathrm{sim}(h, h^{-}_k).$$

A common objective is InfoNCE:

$$\mathcal{L} = -\log \frac{\exp\big(\mathrm{sim}(h, h^{+}) / \tau\big)}{\exp\big(\mathrm{sim}(h, h^{+}) / \tau\big) + \sum_{k=1}^{K} \exp\big(\mathrm{sim}(h, h^{-}_k) / \tau\big)}$$

Here $\tau$ is a temperature parameter.
Contrastive learning is useful when meaningful invariances can be specified through data augmentation. For images, two crops of the same image should often have similar representations. For text, two paraphrases should be close. For audio, two views of the same utterance should be close.
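A minimal sketch of the InfoNCE computation for a single anchor, using cosine similarity; the perturbed positive view and random negatives are stand-ins for real augmented data.

```python
import torch
import torch.nn.functional as F

def info_nce(h: torch.Tensor, h_pos: torch.Tensor, h_neg: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE for one anchor: h (d,), h_pos (d,), h_neg (K, d)."""
    h, h_pos, h_neg = F.normalize(h, dim=-1), F.normalize(h_pos, dim=-1), F.normalize(h_neg, dim=-1)
    pos = (h * h_pos).sum() / tau                      # sim(h, h+) / tau
    neg = (h_neg @ h) / tau                            # sim(h, h-_k) / tau for each negative
    logits = torch.cat([pos.unsqueeze(0), neg])        # positive first, then K negatives
    return -F.log_softmax(logits, dim=0)[0]            # -log probability of the positive

d, K = 128, 32
h = torch.randn(d)
h_pos = h + 0.1 * torch.randn(d)                       # a perturbed "view" of the same example
h_neg = torch.randn(K, d)                              # unrelated examples
print(info_nce(h, h_pos, h_neg).item())
```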
Embeddings
An embedding is a learned vector representation of a discrete or structured object. Words, tokens, users, items, graph nodes, documents, images, and queries can all be embedded.
For token embeddings, an embedding table has shape

$$E \in \mathbb{R}^{V \times d}$$

where $V$ is the vocabulary size and $d$ is the embedding dimension. A token ID $t$ selects one row:

$$h = E[t]$$
In PyTorch:
```python
import torch
from torch import nn

embedding = nn.Embedding(num_embeddings=50000, embedding_dim=768)
token_ids = torch.tensor([101, 2047, 2003, 102])
h = embedding(token_ids)
print(h.shape)  # torch.Size([4, 768])
```

Embeddings turn discrete IDs into dense vectors that can be optimized by gradient descent.
Representation Learning in Autoencoders
Autoencoders learn representations by reconstruction. The encoder produces a latent code:

$$z = f_\theta(x)$$

and the decoder reconstructs:

$$\hat{x} = g_\phi(z)$$

The representation is useful if $z$ preserves the important information needed to recover $x$.
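A minimal undercomplete autoencoder sketch; the layer sizes are illustrative, and the reconstruction loss is what pressures $z$ to keep the information needed to recover $x$.

```python
import torch
from torch import nn

# A minimal undercomplete autoencoder: the bottleneck z is smaller than the input.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))   # z = f_theta(x)
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))   # x_hat = g_phi(z)

x = torch.randn(32, 784)
z = encoder(x)
x_hat = decoder(z)
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction error shapes what z must store
loss.backward()
```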
Different autoencoder variants impose different pressures:
| Model | Representation pressure |
|---|---|
| Undercomplete autoencoder | Compress through a small bottleneck |
| Sparse autoencoder | Use few active latent units |
| Denoising autoencoder | Preserve stable structure under corruption |
| Variational autoencoder | Organize latents according to a prior |
These pressures shape what the representation stores.
Transfer Learning
A representation learned on one dataset can often be reused on another. This is transfer learning.
Let a pretrained model produce representations

$$h = f_\theta(x).$$

For a new task, we can train a small head

$$\hat{y} = g_\phi(h)$$

while keeping $f_\theta$ fixed, or fine-tune both $\theta$ and $\phi$.
Frozen representations are useful when the new dataset is small. Fine-tuning is useful when the new task differs enough from pretraining that the representation should adapt.
Transfer learning works when the pretrained representation captures general structure shared across tasks.
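A sketch of both options in PyTorch, assuming a stand-in backbone rather than a real pretrained checkpoint; freezing trains only the head, while the commented-out variant fine-tunes both.

```python
import torch
from torch import nn

# Assumed pretrained backbone f_theta; here a stand-in MLP rather than a real checkpoint.
backbone = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
head = nn.Linear(64, 5)                      # new task with 5 classes

# Option 1: freeze the representation and train only the head.
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Option 2 (fine-tuning): leave requires_grad=True and optimize both,
# often with a smaller learning rate for the backbone:
# optimizer = torch.optim.Adam([
#     {"params": backbone.parameters(), "lr": 1e-5},
#     {"params": head.parameters(), "lr": 1e-3},
# ])

x = torch.randn(16, 784)
y = torch.randint(0, 5, (16,))
loss = nn.functional.cross_entropy(head(backbone(x)), y)
loss.backward()
optimizer.step()
```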
Linear Probing
A linear probe tests whether information is easily accessible in a representation. We freeze the representation model and train only a linear classifier:

$$\hat{y} = W h + b$$
If the linear probe performs well, the representation already makes the target information linearly separable.
Linear probing is diagnostic. It does not prove that the model uses the information in the same way during its original task. It only shows that the information can be extracted simply.
In PyTorch:
```python
import torch
from torch import nn

class LinearProbe(nn.Module):
    def __init__(self, representation_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(representation_dim, num_classes)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.classifier(h)
```

Representation Collapse
Representation collapse occurs when many inputs map to the same or nearly the same representation:

$$f_\theta(x_i) \approx c$$

for many different examples $x_i$.
Collapse is a failure because the representation loses information. It can occur in self-supervised or contrastive systems when the objective allows trivial solutions.
For example, if all inputs map to the zero vector, some similarity objectives are trivially minimized, so the training design must explicitly prevent this collapse.
Common anti-collapse mechanisms include:
| Method | Purpose |
|---|---|
| Negative examples | Push unrelated examples apart |
| Stop-gradient branches | Stabilize asymmetric objectives |
| Variance regularization | Prevent constant embeddings |
| Whitening or decorrelation | Spread information across dimensions |
| Predictor heads | Avoid degenerate fixed points |
A good representation learning objective must reward useful invariance without allowing all examples to become indistinguishable.
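As one illustration of the variance-regularization idea from the table above (in the spirit of, but not identical to, published methods such as VICReg), a hinge penalty on per-dimension batch standard deviation directly punishes constant embeddings:

```python
import torch

def variance_penalty(h: torch.Tensor, target_std: float = 1.0, eps: float = 1e-4) -> torch.Tensor:
    """Penalize embedding dimensions whose batch standard deviation falls below target_std."""
    std = torch.sqrt(h.var(dim=0) + eps)              # per-dimension std over the batch
    return torch.relu(target_std - std).mean()        # hinge: only low-variance dims are penalized

h_healthy = torch.randn(256, 128)                     # spread-out embeddings
h_collapsed = torch.zeros(256, 128)                   # every input maps to the same vector
print(variance_penalty(h_healthy).item())             # near 0
print(variance_penalty(h_collapsed).item())           # close to target_std
```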
Invariance and Equivariance
A representation is invariant to a transformation if it stays the same after the transformation.
If $T$ is a transformation, invariance means

$$f_\theta\big(T(x)\big) = f_\theta(x).$$
For example, an image classifier should often be mostly invariant to small translations or changes in brightness.
A representation is equivariant if it changes in a predictable way:

$$f_\theta\big(T(x)\big) = T'\big(f_\theta(x)\big)$$

where $T'$ is a corresponding transformation in representation space.
For example, in a segmentation model, shifting the input image should shift the output mask. The representation should not ignore the shift; it should preserve its structure.
Invariance is useful for classification. Equivariance is useful for localization, geometry, control, and structured prediction.
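One simple empirical check of invariance is to compare the representations of an input and a transformed copy. This sketch uses an untrained stand-in encoder, so the number only illustrates the measurement, not the behavior of a trained model.

```python
import torch
from torch import nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64))   # stand-in f_theta

x = torch.rand(1, 1, 28, 28)                 # a toy "image"
x_shift = torch.roll(x, shifts=2, dims=-1)   # T(x): shift two pixels to the right

h, h_shift = encoder(x), encoder(x_shift)
similarity = F.cosine_similarity(h, h_shift).item()
print(f"cosine(f(x), f(T(x))) = {similarity:.3f}")   # 1.0 would mean perfect invariance
```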
Evaluation of Representations
Representations can be evaluated in several ways.
| Evaluation method | What it measures |
|---|---|
| Linear probing | Whether labels are linearly accessible |
| Fine-tuning | Adaptability to downstream tasks |
| Nearest neighbors | Semantic structure of embedding space |
| Clustering | Grouping quality |
| Retrieval metrics | Search usefulness |
| Reconstruction | Information preservation |
| Robustness tests | Stability under distribution shift |
| Calibration | Reliability of predictions |
No single metric is sufficient. A representation that performs well for classification may perform poorly for retrieval. A representation that reconstructs well may encode nuisance details. Evaluation should match the intended use.
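As an example of the nearest-neighbor style of evaluation, here is a sketch with random stand-in embeddings and labels; with real trained embeddings, precision among the nearest neighbors indicates how well the space groups semantically related items.

```python
import torch
import torch.nn.functional as F

# Toy embedding matrix: one row per item, assumed to come from a trained encoder.
embeddings = F.normalize(torch.randn(1000, 128), dim=1)
labels = torch.randint(0, 10, (1000,))

query = embeddings[0]
scores = embeddings @ query                      # cosine similarity (rows are unit norm)
topk = scores.topk(6).indices[1:]                # 5 nearest neighbors, excluding the query itself
precision_at_5 = (labels[topk] == labels[0]).float().mean().item()
print(f"precision@5 for this query: {precision_at_5:.2f}")
```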
Representation Learning Across Domains
Representation learning appears differently across domains.
In vision, representations often organize images by objects, scenes, texture, shape, and spatial layout.
In language, representations encode syntax, semantics, entities, discourse relations, and task instructions.
In audio, representations capture pitch, speaker identity, phonemes, rhythm, and acoustic environment.
In graphs, representations encode node attributes, edge structure, community membership, and relational patterns.
In reinforcement learning, representations encode state information useful for predicting future reward and choosing actions.
The mathematical idea remains the same: transform raw observations into a space where useful structure is easier to model.
Summary
Representation learning is the central mechanism by which deep networks become useful. A model transforms raw inputs into internal vectors that expose structure relevant to a task.
Autoencoders learn representations through reconstruction. Supervised models learn them through labels. Self-supervised models learn them from prediction tasks derived from raw data. Contrastive models learn them through similarity and difference.
A good representation preserves useful information, removes nuisance variation, supports transfer, and makes downstream tasks easier. In modern deep learning, much of the power of large models comes from learning representations that generalize across many tasks and domains.