Representation learning is the study of how models learn useful internal descriptions of data. Instead of relying only on hand-designed features, a neural network learns features from examples. These learned features are then used for prediction, reconstruction, retrieval, generation, or control.
An input may begin as raw data $x$. A model maps it to a representation:

$$h = f_\theta(x)$$

where $h \in \mathbb{R}^d$ is produced by a network with parameters $\theta$.
The vector may be called a feature vector, embedding, hidden state, code, or representation. The name depends on the context, but the role is the same: it is a transformed version of the input that should make useful structure easier to access.
What Makes a Representation Useful?
A useful representation preserves information that matters and removes information that does not. The relevant information depends on the task.
For image classification, a useful representation should preserve object identity and discard small changes in lighting, position, or background. For speech recognition, it should preserve phonetic content and reduce irrelevant speaker or microphone variation. For retrieval, it should place semantically similar items near each other. For generation, it should preserve enough information to reconstruct or synthesize data.
A good representation often has several properties:
| Property | Meaning |
|---|---|
| Compactness | Uses fewer or better-organized coordinates |
| Invariance | Ignores nuisance variation |
| Equivariance | Changes predictably under transformations |
| Separability | Makes classes or concepts easier to distinguish |
| Smoothness | Nearby representations decode or behave similarly |
| Compositionality | Combines simpler factors into richer meanings |
No single property is always best. The right representation depends on the objective.
Hand-Designed Features Versus Learned Features
Before deep learning became dominant, many systems depended on hand-designed features. In vision, engineers designed edge detectors, texture descriptors, and shape features. In language, systems used word counts, part-of-speech tags, syntactic patterns, and dictionaries. In speech, systems used spectral features such as MFCCs.
Deep learning changes this pattern. The model learns multiple layers of features jointly with the task objective.
A simple classifier can be written as

$$\hat{y} = g_\phi\big(f_\theta(x)\big)$$

where $f_\theta$ is the representation learner and $g_\phi$ is the prediction head.
Training adjusts both $\theta$ and $\phi$, so the representation becomes useful for prediction.
This joint learning is one reason deep networks perform well on raw or lightly processed data. The features adapt to the data distribution and the task.
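To make this concrete, here is a minimal PyTorch sketch of the composition $\hat{y} = g_\phi(f_\theta(x))$. The `Encoder` and `Head` modules and all dimensions are illustrative choices, not fixed by the text.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """f_theta: maps raw inputs to a representation h."""
    def __init__(self, input_dim: int, hidden_dim: int, repr_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, repr_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Head(nn.Module):
    """g_phi: maps the representation h to predictions."""
    def __init__(self, repr_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(repr_dim, num_classes)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.classifier(h)

f = Encoder(input_dim=784, hidden_dim=256, repr_dim=64)
g = Head(repr_dim=64, num_classes=10)

x = torch.randn(32, 784)        # a batch of raw inputs
h = f(x)                        # representation
y_hat = g(h)                    # prediction
print(h.shape, y_hat.shape)     # torch.Size([32, 64]) torch.Size([32, 10])
```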
Layered Representations
Deep networks build representations in layers. Each layer transforms the previous representation:

$$h^{(l)} = f^{(l)}\big(h^{(l-1)}\big), \qquad h^{(0)} = x$$
Early layers often capture local or low-level patterns. Later layers often capture more abstract structure.
In a convolutional network for images, early layers may detect edges and textures. Middle layers may detect parts. Later layers may detect object-level patterns.
In a transformer language model, early layers may capture token identity and local syntax. Middle layers may capture phrases, entities, and dependencies. Later layers may capture task-relevant abstractions and output-oriented features.
This hierarchy is not perfectly clean, but it is a useful organizing principle.
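A minimal sketch of this layered view, using a small fully connected stack as a stand-in; each intermediate tensor is one layer's representation $h^{(l)}$.

```python
import torch
from torch import nn

layers = nn.ModuleList([
    nn.Sequential(nn.Linear(784, 256), nn.ReLU()),   # h1: lower-level features
    nn.Sequential(nn.Linear(256, 128), nn.ReLU()),   # h2: intermediate features
    nn.Sequential(nn.Linear(128, 64), nn.ReLU()),    # h3: more abstract features
])

x = torch.randn(8, 784)
h = x
representations = []
for layer in layers:
    h = layer(h)                 # h^(l) = f^(l)(h^(l-1))
    representations.append(h)

for i, r in enumerate(representations, start=1):
    print(f"layer {i}: {tuple(r.shape)}")
```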
Supervised Representation Learning
In supervised learning, representations are learned from labeled examples $\{(x_i, y_i)\}_{i=1}^{N}$.
The model is trained to predict $y$ from $x$. The loss may be cross-entropy for classification:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\phi\big(y_i \mid f_\theta(x_i)\big)$$
The hidden layers learn features that help reduce prediction error.
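As a sketch, one supervised training step might look like the following. The tiny model, dimensions, and optimizer settings are illustrative; the point is that a single gradient step updates the hidden layers and the prediction head together.

```python
import torch
from torch import nn

# A small model: hidden layers (representation learner) followed by a linear prediction head.
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),   # representation learner
    nn.Linear(128, 10),               # prediction head
)
loss_fn = nn.CrossEntropyLoss()       # cross-entropy over class logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 784)              # batch of inputs
y = torch.randint(0, 10, (32,))       # integer class labels

logits = model(x)
loss = loss_fn(logits, y)             # -log p(y | f_theta(x))
loss.backward()                       # gradients flow through head and hidden layers
optimizer.step()
```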
For example, an image classifier trained on animal labels may learn features for fur, eyes, ears, body shape, and background context. These features emerge because they help predict labels.
Supervised representations are often highly useful for the training task. They may transfer to related tasks, but they can also become too specialized. If the labels are narrow, the learned representation may ignore useful information outside the labeled objective.
Self-Supervised Representation Learning
Self-supervised learning creates training signals from the data itself. Instead of requiring human labels, the model predicts missing, transformed, or related parts of the input.
Examples include:
| Data type | Self-supervised task |
|---|---|
| Text | Predict masked or next tokens |
| Images | Predict missing patches |
| Audio | Predict masked frames |
| Video | Predict future frames |
| Graphs | Predict missing edges or attributes |
The model learns representations because solving the pretext task requires understanding structure in the data.
A masked prediction objective can be written as

$$\mathcal{L} = -\sum_{i \in M} \log p_\theta\big(x_i \mid x_{\setminus M}\big)$$

where $M$ is the set of masked positions and $x_{\setminus M}$ is the visible context.
This objective is central to many modern foundation models. It allows models to learn from large unlabeled datasets.
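A toy sketch of this objective, assuming a simple per-position predictor instead of a full transformer encoder and a hypothetical mask token ID; the loss is computed only at the masked positions.

```python
import torch
from torch import nn

vocab_size, dim, seq_len = 1000, 64, 16

# Toy per-position predictor: embedding followed by a linear map back to the vocabulary.
# A real masked language model would use a transformer encoder here.
embed = nn.Embedding(vocab_size, dim)
to_vocab = nn.Linear(dim, vocab_size)

tokens = torch.randint(0, vocab_size, (4, seq_len))     # (batch, seq_len)
mask = torch.rand(4, seq_len) < 0.15                    # M: positions to mask
mask_id = 0                                             # assumed [MASK] token id
inputs = tokens.masked_fill(mask, mask_id)              # hide the masked tokens

logits = to_vocab(embed(inputs))                        # (batch, seq_len, vocab)
loss = nn.functional.cross_entropy(
    logits[mask],                                       # predictions at masked positions
    tokens[mask],                                       # original tokens to recover
)
loss.backward()
```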
Contrastive Representation Learning
Contrastive learning trains representations by comparing examples. The model pulls related examples closer and pushes unrelated examples apart.
Suppose $x$ and $x^{+}$ are two augmented views of the same example, while $x^{-}_1, \dots, x^{-}_K$ are unrelated examples. Let

$$h = f_\theta(x), \qquad h^{+} = f_\theta(x^{+}), \qquad h^{-}_k = f_\theta(x^{-}_k).$$

A contrastive loss encourages

$$\mathrm{sim}(h, h^{+})$$

to be larger than

$$\mathrm{sim}(h, h^{-}_k).$$

A common objective is InfoNCE:

$$\mathcal{L} = -\log \frac{\exp\big(\mathrm{sim}(h, h^{+}) / \tau\big)}{\exp\big(\mathrm{sim}(h, h^{+}) / \tau\big) + \sum_{k=1}^{K} \exp\big(\mathrm{sim}(h, h^{-}_k) / \tau\big)}$$

Here $\tau$ is a temperature parameter.
Contrastive learning is useful when meaningful invariances can be specified through data augmentation. For images, two crops of the same image should often have similar representations. For text, two paraphrases should be close. For audio, two views of the same utterance should be close.
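A minimal sketch of the InfoNCE computation for a single anchor, using cosine similarity; the perturbed positive view and random negatives are stand-ins for real augmented data.

```python
import torch
import torch.nn.functional as F

def info_nce(h: torch.Tensor, h_pos: torch.Tensor, h_neg: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE for one anchor: h (d,), h_pos (d,), h_neg (K, d)."""
    h, h_pos, h_neg = F.normalize(h, dim=-1), F.normalize(h_pos, dim=-1), F.normalize(h_neg, dim=-1)
    pos = (h * h_pos).sum() / tau                      # sim(h, h+) / tau
    neg = (h_neg @ h) / tau                            # sim(h, h-_k) / tau for each negative
    logits = torch.cat([pos.unsqueeze(0), neg])        # positive first, then K negatives
    return -F.log_softmax(logits, dim=0)[0]            # -log probability of the positive

d, K = 128, 32
h = torch.randn(d)
h_pos = h + 0.1 * torch.randn(d)                       # a perturbed "view" of the same example
h_neg = torch.randn(K, d)                              # unrelated examples
print(info_nce(h, h_pos, h_neg).item())
```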
Embeddings
An embedding is a learned vector representation of a discrete or structured object. Words, tokens, users, items, graph nodes, documents, images, and queries can all be embedded.
For token embeddings, an embedding table has shape

$$E \in \mathbb{R}^{V \times d}$$

where $V$ is the vocabulary size and $d$ is the embedding dimension. A token ID $t$ selects one row:

$$h = E[t]$$
In PyTorch:
```python
import torch
from torch import nn

embedding = nn.Embedding(num_embeddings=50000, embedding_dim=768)
token_ids = torch.tensor([101, 2047, 2003, 102])
h = embedding(token_ids)
print(h.shape)  # torch.Size([4, 768])
```

Embeddings turn discrete IDs into dense vectors that can be optimized by gradient descent.
Representation Learning in Autoencoders
Autoencoders learn representations by reconstruction. The encoder produces a latent code:

$$z = f_\theta(x)$$

and the decoder reconstructs:

$$\hat{x} = g_\phi(z)$$

The representation is useful if $z$ preserves the important information needed to recover $x$.
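A minimal undercomplete autoencoder sketch; the layer sizes are illustrative, and the reconstruction loss is what pressures $z$ to keep the information needed to recover $x$.

```python
import torch
from torch import nn

# A minimal undercomplete autoencoder: the bottleneck z is smaller than the input.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))   # z = f_theta(x)
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))   # x_hat = g_phi(z)

x = torch.randn(32, 784)
z = encoder(x)
x_hat = decoder(z)
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction error shapes what z must store
loss.backward()
```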
Different autoencoder variants impose different pressures:
| Model | Representation pressure |
|---|---|
| Undercomplete autoencoder | Compress through a small bottleneck |
| Sparse autoencoder | Use few active latent units |
| Denoising autoencoder | Preserve stable structure under corruption |
| Variational autoencoder | Organize latents according to a prior |
These pressures shape what the representation stores.
Transfer Learning
A representation learned on one dataset can often be reused on another. This is transfer learning.
Let a pretrained model produce representations

$$h = f_\theta(x).$$

For a new task, we can train a small head

$$\hat{y} = g_\phi(h)$$

while keeping $f_\theta$ fixed, or fine-tune both $\theta$ and $\phi$.
Frozen representations are useful when the new dataset is small. Fine-tuning is useful when the new task differs enough from pretraining that the representation should adapt.
Transfer learning works when the pretrained representation captures general structure shared across tasks.
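A sketch of both options in PyTorch, assuming a stand-in backbone rather than a real pretrained checkpoint; freezing trains only the head, while the commented-out variant fine-tunes both.

```python
import torch
from torch import nn

# Assumed pretrained backbone f_theta; here a stand-in MLP rather than a real checkpoint.
backbone = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
head = nn.Linear(64, 5)                      # new task with 5 classes

# Option 1: freeze the representation and train only the head.
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Option 2 (fine-tuning): leave requires_grad=True and optimize both,
# often with a smaller learning rate for the backbone:
# optimizer = torch.optim.Adam([
#     {"params": backbone.parameters(), "lr": 1e-5},
#     {"params": head.parameters(), "lr": 1e-3},
# ])

x = torch.randn(16, 784)
y = torch.randint(0, 5, (16,))
loss = nn.functional.cross_entropy(head(backbone(x)), y)
loss.backward()
optimizer.step()
```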
Linear Probing
A linear probe tests whether information is easily accessible in a representation. We freeze the representation model and train only a linear classifier:

$$\hat{y} = W h + b$$
If the linear probe performs well, the representation already makes the target information linearly separable.
Linear probing is diagnostic. It does not prove that the model uses the information in the same way during its original task. It only shows that the information can be extracted simply.
In PyTorch:
```python
import torch
from torch import nn

class LinearProbe(nn.Module):
    def __init__(self, representation_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(representation_dim, num_classes)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.classifier(h)
```

Representation Collapse
Representation collapse occurs when many inputs map to the same or nearly the same representation:

$$f_\theta(x_i) \approx c$$

for many different examples $x_i$.
Collapse is a failure because the representation loses information. It can occur in self-supervised or contrastive systems when the objective allows trivial solutions.
For example, if all inputs map to the zero vector, some similarity objectives are trivially minimized, so the training design must explicitly prevent this collapse.
Common anti-collapse mechanisms include:
| Method | Purpose |
|---|---|
| Negative examples | Push unrelated examples apart |
| Stop-gradient branches | Stabilize asymmetric objectives |
| Variance regularization | Prevent constant embeddings |
| Whitening or decorrelation | Spread information across dimensions |
| Predictor heads | Avoid degenerate fixed points |
A good representation learning objective must reward useful invariance without allowing all examples to become indistinguishable.
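As one illustration of the variance-regularization idea from the table above (in the spirit of, but not identical to, published methods such as VICReg), a hinge penalty on per-dimension batch standard deviation directly punishes constant embeddings:

```python
import torch

def variance_penalty(h: torch.Tensor, target_std: float = 1.0, eps: float = 1e-4) -> torch.Tensor:
    """Penalize embedding dimensions whose batch standard deviation falls below target_std."""
    std = torch.sqrt(h.var(dim=0) + eps)              # per-dimension std over the batch
    return torch.relu(target_std - std).mean()        # hinge: only low-variance dims are penalized

h_healthy = torch.randn(256, 128)                     # spread-out embeddings
h_collapsed = torch.zeros(256, 128)                   # every input maps to the same vector
print(variance_penalty(h_healthy).item())             # near 0
print(variance_penalty(h_collapsed).item())           # close to target_std
```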
Invariance and Equivariance
A representation is invariant to a transformation if it stays the same after the transformation.
If $T$ is a transformation, invariance means

$$f_\theta\big(T(x)\big) = f_\theta(x).$$
For example, an image classifier should often be mostly invariant to small translations or changes in brightness.
A representation is equivariant if it changes in a predictable way:

$$f_\theta\big(T(x)\big) = T'\big(f_\theta(x)\big)$$

where $T'$ is a corresponding transformation in representation space.
For example, in a segmentation model, shifting the input image should shift the output mask. The representation should not ignore the shift; it should preserve its structure.
Invariance is useful for classification. Equivariance is useful for localization, geometry, control, and structured prediction.
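One simple empirical check of invariance is to compare the representations of an input and a transformed copy. This sketch uses an untrained stand-in encoder, so the number only illustrates the measurement, not the behavior of a trained model.

```python
import torch
from torch import nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64))   # stand-in f_theta

x = torch.rand(1, 1, 28, 28)                 # a toy "image"
x_shift = torch.roll(x, shifts=2, dims=-1)   # T(x): shift two pixels to the right

h, h_shift = encoder(x), encoder(x_shift)
similarity = F.cosine_similarity(h, h_shift).item()
print(f"cosine(f(x), f(T(x))) = {similarity:.3f}")   # 1.0 would mean perfect invariance
```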
Evaluation of Representations
Representations can be evaluated in several ways.
| Evaluation method | What it measures |
|---|---|
| Linear probing | Whether labels are linearly accessible |
| Fine-tuning | Adaptability to downstream tasks |
| Nearest neighbors | Semantic structure of embedding space |
| Clustering | Grouping quality |
| Retrieval metrics | Search usefulness |
| Reconstruction | Information preservation |
| Robustness tests | Stability under distribution shift |
| Calibration | Reliability of predictions |
No single metric is sufficient. A representation that performs well for classification may perform poorly for retrieval. A representation that reconstructs well may encode nuisance details. Evaluation should match the intended use.
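As an example of the nearest-neighbor style of evaluation, here is a sketch with random stand-in embeddings and labels; with real trained embeddings, precision among the nearest neighbors indicates how well the space groups semantically related items.

```python
import torch
import torch.nn.functional as F

# Toy embedding matrix: one row per item, assumed to come from a trained encoder.
embeddings = F.normalize(torch.randn(1000, 128), dim=1)
labels = torch.randint(0, 10, (1000,))

query = embeddings[0]
scores = embeddings @ query                      # cosine similarity (rows are unit norm)
topk = scores.topk(6).indices[1:]                # 5 nearest neighbors, excluding the query itself
precision_at_5 = (labels[topk] == labels[0]).float().mean().item()
print(f"precision@5 for this query: {precision_at_5:.2f}")
```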
Representation Learning Across Domains
Representation learning appears differently across domains.
In vision, representations often organize images by objects, scenes, texture, shape, and spatial layout.
In language, representations encode syntax, semantics, entities, discourse relations, and task instructions.
In audio, representations capture pitch, speaker identity, phonemes, rhythm, and acoustic environment.
In graphs, representations encode node attributes, edge structure, community membership, and relational patterns.
In reinforcement learning, representations encode state information useful for predicting future reward and choosing actions.
The mathematical idea remains the same: transform raw observations into a space where useful structure is easier to model.
Summary
Representation learning is the central mechanism by which deep networks become useful. A model transforms raw inputs into internal vectors that expose structure relevant to a task.
Autoencoders learn representations through reconstruction. Supervised models learn them through labels. Self-supervised models learn them from prediction tasks derived from raw data. Contrastive models learn them through similarity and difference.
A good representation preserves useful information, removes nuisance variation, supports transfer, and makes downstream tasks easier. In modern deep learning, much of the power of large models comes from learning representations that generalize across many tasks and domains.