# Representation Learning

Representation learning is the study of how a model converts raw data into useful internal variables. In an autoencoder, the representation is the latent code $z$. More generally, it may be a hidden state, embedding vector, feature map, memory state, graph embedding, or sequence of token representations.

The central question is simple: what information should the model keep, and in what form should it keep it?

A raw input often contains more detail than a downstream task needs. An image contains exact pixels, lighting, compression artifacts, sensor noise, and background objects. A language input contains token identities, syntax, topic, style, speaker intent, and formatting. A good representation preserves useful structure while reducing irrelevant variation.

### From Features to Learned Representations

Classical machine learning often relied on hand-designed features. A practitioner would decide which measurements to compute from the raw input. For images, these might include edges, corners, color histograms, or texture descriptors. For text, they might include word counts, n-grams, or syntactic features.

Deep learning changes this workflow. The model learns features from data. Early layers often learn local patterns. Later layers combine these patterns into more abstract representations.

For an input $x$, a neural network computes a sequence of representations:

$$
h_1 = f_1(x),
\qquad
h_2 = f_2(h_1),
\qquad
\cdots
\qquad
h_L = f_L(h_{L-1}).
$$

Each layer transforms the previous representation. In a classifier, the final representation $h_L$ is used to predict a label. In an autoencoder, the latent representation $z$ is used to reconstruct the input. In a language model, token representations are used to predict future or masked tokens.
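As a concrete sketch, the loop below computes a stack of representations layer by layer. The layer sizes and the choice of `nn.Linear` plus ReLU are illustrative assumptions, not part of any particular architecture.

```python
import torch
from torch import nn

# Layer sizes here are arbitrary illustrative choices.
layers = nn.ModuleList([
    nn.Sequential(nn.Linear(784, 256), nn.ReLU()),  # f_1
    nn.Sequential(nn.Linear(256, 128), nn.ReLU()),  # f_2
    nn.Sequential(nn.Linear(128, 64), nn.ReLU()),   # f_3
])

x = torch.randn(32, 784)      # a batch of raw inputs
h = x
representations = []
for f in layers:
    h = f(h)                  # h_l = f_l(h_{l-1})
    representations.append(h)

print([r.shape for r in representations])
# [torch.Size([32, 256]), torch.Size([32, 128]), torch.Size([32, 64])]
```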

The important point is that representations are shaped by the objective. A model trained for reconstruction learns information useful for reconstruction. A model trained for classification learns information useful for classification. A model trained for contrastive learning learns information useful for distinguishing related from unrelated examples.

### What Makes a Good Representation

A good representation depends on the task, but several properties are common.

A representation should be informative. It should preserve the variables needed for prediction, reconstruction, retrieval, generation, or control.

A representation should remove nuisance variation. Small changes in lighting, noise, formatting, or sensor conditions should not strongly change the representation when those changes are irrelevant.

A representation should organize similar examples near each other when similarity matters. This is especially important for retrieval, clustering, and nearest-neighbor search.

A representation should support simple downstream models. If a linear classifier can solve a task using the representation, then the representation has made the task geometrically simple.

A representation should generalize. It should work on unseen data drawn from the same or related distributions.

A representation should be stable. Small input perturbations should not cause arbitrary changes in representation unless the perturbation changes the meaning of the input.

These criteria can conflict. A representation that is highly invariant may discard information needed for reconstruction. A representation that preserves every detail may contain too much nuisance variation for classification.

### Invariance and Equivariance

Two core ideas in representation learning are invariance and equivariance.

A representation is invariant to a transformation if the representation stays the same when the input is transformed.

Let $T$ be a transformation, such as a small image translation. A representation $f$ is invariant to $T$ if

$$
f(Tx) = f(x).
$$

For example, an image classifier should usually recognize a cat whether the cat appears slightly to the left or to the right in the image. The class representation should be approximately invariant to small translations.

A representation is equivariant to a transformation if the representation changes in a predictable way when the input is transformed. If $T$ transforms the input and $S_T$ is the corresponding transformation in representation space, then

$$
f(Tx) = S_T f(x).
$$

For example, in a segmentation model, shifting the image should shift the segmentation map. The output should not stay identical; it should move with the object.

Invariance is useful for decisions that should ignore a transformation. Equivariance is useful when the output should preserve spatial, temporal, or structural relationships.
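These properties can be checked numerically. The sketch below uses a single convolutional layer, a two-pixel horizontal shift, and global average pooling, all of which are illustrative assumptions: the feature map is (approximately) equivariant to the shift, while the pooled representation is (approximately) invariant to it.

```python
import torch
from torch import nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # illustrative encoder layer

x = torch.randn(1, 1, 32, 32)
x_shifted = torch.roll(x, shifts=2, dims=-1)        # translate the image by 2 pixels

feat = conv(x)
feat_shifted = conv(x_shifted)

# Equivariance: shifting the input shifts the feature map the same way
# (compared away from the border, where padding and wraparound interfere).
shifted_feat = torch.roll(feat, shifts=2, dims=-1)
equivariance_gap = (shifted_feat[..., 4:-4] - feat_shifted[..., 4:-4]).abs().max()

# Approximate invariance: global average pooling discards position.
pooled = feat.mean(dim=(-2, -1))
pooled_shifted = feat_shifted.mean(dim=(-2, -1))
invariance_gap = (pooled - pooled_shifted).abs().max()

print(equivariance_gap.item(), invariance_gap.item())
```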

### Hierarchical Representations

Deep networks learn hierarchical representations. Lower layers tend to represent local and simple patterns. Higher layers tend to represent larger and more abstract patterns.

In vision models, early layers may detect edges and color contrasts. Middle layers may detect textures and object parts. Later layers may represent object identity or scene structure.

In language models, early layers may encode lexical and positional information. Middle layers may encode syntactic patterns. Later layers may encode semantic, discourse, or task-specific information.

In audio models, early layers may encode local frequency patterns. Later layers may encode phonemes, words, speaker identity, or acoustic events.

This hierarchy arises because each layer builds on the previous one. A deep model composes simple functions into complex functions.

For a network with layers

$$
h_l = f_l(h_{l-1}),
$$

the representation $h_l$ depends on all earlier transformations. Depth gives the model a way to build representations at multiple levels of abstraction.

### Linear Separability

A common test for representation quality is linear separability.

Suppose a representation function maps inputs to vectors:

$$
z_i = f_\theta(x_i).
$$

A task is linearly separable in representation space if a linear classifier can separate the classes:

$$
\hat{y} =
\operatorname{softmax}(Wz + b).
$$

If the learned representation makes classes linearly separable, the downstream classifier can be simple.

This is one reason deep learning works well. The deep network transforms raw data into a space where simple decision rules become effective. The final classifier may be linear, but the representation before it is highly nonlinear.

A linear probe is a common diagnostic. We freeze the representation model and train only a linear classifier on top. Strong linear-probe performance suggests that the representation already contains task-relevant information in an accessible form.
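A minimal linear-probe sketch looks like the following. The encoder here is an untrained placeholder standing in for a pretrained representation model, the data are dummy tensors, and the dimensions are arbitrary.

```python
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
for p in encoder.parameters():
    p.requires_grad_(False)        # freeze the representation model

probe = nn.Linear(128, 10)         # only the linear probe is trained
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 784)           # dummy batch for illustration
y = torch.randint(0, 10, (64,))

with torch.no_grad():
    z = encoder(x)                 # frozen features
loss = loss_fn(probe(z), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```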

### Autoencoders as Representation Learners

Autoencoders learn representations through reconstruction. The encoder compresses the input:

$$
z = f_\theta(x),
$$

and the decoder reconstructs it:

$$
\hat{x} = g_\phi(z).
$$

The latent code must contain information needed to rebuild $x$. This can produce useful features, especially when the bottleneck, architecture, or corruption process prevents trivial copying.
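A minimal sketch of this setup, with illustrative dimensions:

```python
import torch
from torch import nn

# Dimensions are illustrative; 32 latent units make the autoencoder undercomplete.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.randn(64, 784)
z = encoder(x)                              # z = f_theta(x)
x_hat = decoder(z)                          # x_hat = g_phi(z)
loss = nn.functional.mse_loss(x_hat, x)     # reconstruction objective
```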

Different autoencoder variants shape the representation differently.

| Autoencoder type | Representation pressure |
|---|---|
| Undercomplete autoencoder | Limited latent dimension |
| Sparse autoencoder | Few active features |
| Denoising autoencoder | Robustness to corruption |
| Variational autoencoder | Smooth probabilistic latent space |
| Vector-quantized autoencoder | Discrete latent codes |

A plain autoencoder trained only with pixel reconstruction may learn low-level features. A denoising or variational autoencoder may learn smoother and more robust representations. A sparse autoencoder may learn more interpretable feature directions.

### Supervised Representation Learning

In supervised learning, labels shape the representation. For classification, the model learns features that separate classes.

Given examples $(x_i,y_i)$, the network produces logits

$$
s_i = W f_\theta(x_i) + b.
$$

The cross-entropy loss encourages the representation $f_\theta(x_i)$ to contain information useful for predicting $y_i$.

Supervised representations can be highly effective, but they inherit limitations from the labels. If labels are coarse, the model may discard information not needed for those labels. If labels contain bias, the representation may encode that bias. If the dataset is small, the representation may overfit.

For example, a classifier trained only to distinguish dog breeds may learn strong dog-specific features but weak representations for scenes, tools, or human actions.

### Self-Supervised Representation Learning

Self-supervised learning creates training signals from the data itself. Instead of requiring human labels, it defines a prediction problem using the structure of the input.

Examples include:

| Domain | Self-supervised task |
|---|---|
| Text | Predict masked or next tokens |
| Images | Predict missing patches or match augmented views |
| Audio | Predict masked acoustic frames |
| Video | Predict future frames or temporal order |
| Graphs | Predict masked nodes, edges, or subgraphs |

The model learns representations because solving the pretext task requires understanding structure in the data.

Masked language modeling is a denoising objective. Autoregressive language modeling is a next-token prediction objective. Contrastive image learning pulls augmented views of the same image together in representation space and pushes different images apart.
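The corruption step of masked language modeling can be sketched as follows. The 15% mask rate and the mask token id are illustrative assumptions; `-100` is the default `ignore_index` of `nn.CrossEntropyLoss`, so only masked positions contribute to the loss.

```python
import torch

MASK_ID = 0
mask_prob = 0.15

token_ids = torch.randint(1, 50_000, (2, 16))      # dummy token batch
mask = torch.rand(token_ids.shape) < mask_prob     # choose positions to corrupt

inputs = token_ids.clone()
inputs[mask] = MASK_ID                             # corrupted input fed to the model
targets = token_ids.clone()
targets[~mask] = -100                              # loss is computed only at masked positions
```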

Self-supervised learning has become central because it scales to large unlabeled datasets.

### Contrastive Representation Learning

Contrastive learning trains representations by comparing examples.

The model receives positive pairs and negative pairs. A positive pair contains two related views of the same underlying example. A negative pair contains unrelated examples.

For an anchor representation $z_i$, a positive representation $z_i^+$, and negatives $z_j^-$, the objective encourages

$$
\operatorname{sim}(z_i,z_i^+)
$$

to be larger than

$$
\operatorname{sim}(z_i,z_j^-).
$$

A common loss is InfoNCE:

$$
L_i =
-\log
\frac{
\exp(\operatorname{sim}(z_i,z_i^+)/\tau)
}{
\exp(\operatorname{sim}(z_i,z_i^+)/\tau)
+
\sum_j
\exp(\operatorname{sim}(z_i,z_j^-)/\tau)
}.
$$

Here $\tau$ is a temperature parameter.
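A minimal InfoNCE sketch is shown below. It uses in-batch negatives (the other examples in the batch serve as negatives for each anchor) and cosine similarity on normalized vectors; both are common choices rather than consequences of the formula above, and the temperature value is arbitrary.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, z_pos, tau: float = 0.1):
    # z, z_pos: (N, d) representations of two views of the same N examples.
    # For anchor i, z_pos[i] is the positive; z_pos[j], j != i, act as negatives.
    z = F.normalize(z, dim=1)
    z_pos = F.normalize(z_pos, dim=1)
    logits = z @ z_pos.t() / tau               # cosine similarities scaled by temperature
    targets = torch.arange(z.size(0))          # the positive for anchor i is at column i
    return F.cross_entropy(logits, targets)

# Usage with dummy representations:
z = torch.randn(32, 128)
z_pos = torch.randn(32, 128)
loss = info_nce_loss(z, z_pos)
```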

Contrastive learning is effective when good data augmentations are available. For images, augmentations may include cropping, color jitter, blur, and flipping. For text, augmentations are more difficult because small token changes can alter meaning.

### Non-Contrastive Self-Supervised Learning

Some self-supervised methods learn without explicit negative examples. They train two networks or two views of the same input and encourage their representations to match.

The risk is collapse. Collapse occurs when the model maps all inputs to the same representation. A constant representation gives perfect agreement but contains no useful information.

Non-contrastive methods avoid collapse through architectural asymmetry, stop-gradient operations, prediction heads, variance regularization, redundancy reduction, or teacher-student updates.

The general goal is to learn representations that are invariant to harmless augmentations while retaining enough diversity to distinguish different examples.
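A minimal sketch of the stop-gradient and predictor idea, in the spirit of methods such as SimSiam, follows. The single linear layers and the 128-dimensional representation are purely illustrative.

```python
import torch
from torch import nn
import torch.nn.functional as F

encoder = nn.Linear(784, 128)
predictor = nn.Linear(128, 128)

x1 = torch.randn(64, 784)     # two augmented views of the same batch
x2 = torch.randn(64, 784)     # (dummy tensors here)

z1, z2 = encoder(x1), encoder(x2)
p1, p2 = predictor(z1), predictor(z2)

def neg_cosine(p, z):
    # The stop-gradient (detach) on the target branch is one of the
    # mechanisms that helps prevent collapse.
    return -F.cosine_similarity(p, z.detach(), dim=1).mean()

loss = 0.5 * (neg_cosine(p1, z2) + neg_cosine(p2, z1))
```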

### Embeddings

An embedding is a representation vector for a discrete or structured object. Words, tokens, documents, users, products, images, proteins, and graph nodes can all have embeddings.

For a vocabulary of size $V$, an embedding table is a matrix

$$
E \in \mathbb{R}^{V \times d}.
$$

The token with index $i$ has embedding

$$
E_i \in \mathbb{R}^d.
$$

In PyTorch:

```python
import torch
from torch import nn

# Embedding table: one learnable 768-dimensional vector per vocabulary entry.
embedding = nn.Embedding(num_embeddings=50_000, embedding_dim=768)

# A (batch, sequence) tensor of token IDs; each ID is looked up in the table.
token_ids = torch.tensor([[12, 104, 91], [87, 3, 502]])
x = embedding(token_ids)

print(x.shape)  # torch.Size([2, 3, 768])
```

Embeddings turn discrete IDs into continuous vectors. This allows gradient-based optimization to learn semantic or functional similarity.

### Representation Geometry

Representation geometry studies distances, angles, clusters, and directions in embedding space.

Two common similarity measures are Euclidean distance and cosine similarity:

$$
d(z_i,z_j)=\|z_i-z_j\|_2,
$$

$$
\operatorname{cos}(z_i,z_j) =
\frac{z_i^\top z_j}{\|z_i\|_2\|z_j\|_2}.
$$

Cosine similarity is often used for embeddings because it compares direction rather than magnitude.
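In PyTorch, both measures are simple to compute; the dimensions here are arbitrary:

```python
import torch
import torch.nn.functional as F

z_i = torch.randn(128)
z_j = torch.randn(128)

euclidean = torch.dist(z_i, z_j)                    # ||z_i - z_j||_2
cosine = F.cosine_similarity(z_i, z_j, dim=0)       # compares direction only

# Pairwise cosine similarities for a batch of embeddings:
Z = F.normalize(torch.randn(100, 128), dim=1)
pairwise = Z @ Z.t()                                # (100, 100) similarity matrix
```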

Representation spaces may contain meaningful directions. For example, a direction may correspond to sentiment, tense, formality, image brightness, or object pose. However, such directions are empirical properties of the learned model. They should be validated rather than assumed.

### Representation Collapse

Representation collapse occurs when many inputs map to identical or nearly identical vectors.

In extreme collapse,

$$
f_\theta(x_i) = c
$$

for all $i$, where $c$ is a constant vector.

A collapsed representation is useless for discrimination because it removes differences between examples.

Collapse can be caused by poorly designed self-supervised objectives, excessive regularization, overly strong invariance objectives, or degenerate autoencoder training.

Diagnostics include:

- very low variance across embeddings,
- near-identical pairwise similarities,
- poor retrieval performance,
- poor linear-probe accuracy,
- low effective rank of the embedding matrix.

A practical check is to compute per-dimension standard deviation:

```python
@torch.no_grad()
def embedding_statistics(z):
    # z: (N, d) matrix of embeddings, one row per example.
    return {
        "mean_norm": z.norm(dim=1).mean().item(),  # average embedding length
        "mean_std": z.std(dim=0).mean().item(),    # average per-dimension spread
        "min_std": z.std(dim=0).min().item(),
        "max_std": z.std(dim=0).max().item(),
    }
```

If `mean_std` is close to zero, the representation may have collapsed.
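A complementary check is the effective rank of the embedding matrix. The sketch below uses one common definition, the exponential of the entropy of the normalized singular values; a value far below the embedding dimension suggests that the representations occupy a low-dimensional subspace.

```python
@torch.no_grad()
def effective_rank(z, eps: float = 1e-12):
    # z: (N, d) embedding matrix, centered before the singular value decomposition.
    s = torch.linalg.svdvals(z - z.mean(dim=0))
    p = s / (s.sum() + eps)                    # normalized singular values
    entropy = -(p * (p + eps).log()).sum()
    return entropy.exp().item()
```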

### Evaluation of Representations

Representations should be evaluated according to their intended use.

For classification, train a linear probe or a small supervised head. For retrieval, measure recall at $k$, mean reciprocal rank, or normalized discounted cumulative gain. For clustering, compare clusters with labels using adjusted Rand index or normalized mutual information. For generation, evaluate decoded sample quality and latent interpolation. For transfer learning, fine-tune the model on a new dataset and measure performance.

| Evaluation method | What it measures |
|---|---|
| Linear probe | Accessibility of label information |
| k-nearest neighbors | Local semantic structure |
| Retrieval metrics | Ranking quality |
| Clustering metrics | Group structure |
| Transfer learning | General usefulness |
| Robustness tests | Stability under perturbation |
| Interpolation | Latent continuity |
| Visualization | Qualitative structure |

No single metric fully characterizes a representation. A representation may be excellent for retrieval and poor for reconstruction, or strong for classification and weak for generation.
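As an example of the k-nearest-neighbor evaluation, the sketch below classifies each test embedding by a majority vote over its nearest training embeddings. The variable names stand in for embeddings and labels produced by a frozen encoder, and cosine similarity with k = 5 is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def knn_accuracy(train_z, train_y, test_z, test_y, k: int = 5):
    # Cosine-similarity k-NN classification in embedding space.
    train_z = F.normalize(train_z, dim=1)
    test_z = F.normalize(test_z, dim=1)
    sims = test_z @ train_z.t()                 # (num_test, num_train) similarities
    idx = sims.topk(k, dim=1).indices           # k nearest training examples
    preds = train_y[idx].mode(dim=1).values     # majority vote over neighbor labels
    return (preds == test_y).float().mean().item()
```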

### PyTorch Example: Learning an Embedding Model

The following example trains a small representation model with a classification head. The representation is the vector before the final classifier.

```python
import torch
from torch import nn

class RepresentationNet(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, repr_dim: int, num_classes: int):
        super().__init__()

        # Encoder: maps the raw input to the representation z.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, repr_dim),
            nn.ReLU(),
        )

        # Classification head applied on top of the representation.
        self.classifier = nn.Linear(repr_dim, num_classes)

    def forward(self, x, return_repr: bool = False):
        z = self.encoder(x)              # representation used for downstream tasks
        logits = self.classifier(z)

        if return_repr:
            return logits, z

        return logits
```

Training step:

```python
model = RepresentationNet(
    input_dim=784,
    hidden_dim=512,
    repr_dim=128,
    num_classes=10,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 784)           # dummy batch standing in for real data
y = torch.randint(0, 10, (64,))

logits, z = model(x, return_repr=True)
loss = loss_fn(logits, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()

print(z.shape)  # torch.Size([64, 128])
```

After training, the classifier can be removed and the encoder can be used as a feature extractor:

```python
@torch.no_grad()
def embed(model, x):
    _, z = model(x, return_repr=True)
    return z
```

This pattern is common in transfer learning. A pretrained encoder becomes a general representation function for downstream tasks.

### Summary

Representation learning replaces hand-designed features with learned internal variables. These variables may be latent codes, embeddings, hidden states, or feature maps.

Autoencoders learn representations by reconstructing inputs. Supervised models learn representations by predicting labels. Self-supervised models learn representations by solving prediction tasks derived from the data itself. Contrastive and masked objectives are two major families of self-supervised representation learning.

A useful representation preserves task-relevant information, removes nuisance variation, organizes similar examples, supports simple downstream models, and generalizes beyond the training data.

