Unsupervised Learning

Unsupervised learning studies data without explicit target labels. The dataset contains inputs only:

\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}.

There is no given y. The model must discover useful structure in the data itself.

Unsupervised learning is used for clustering, dimensionality reduction, density estimation, anomaly detection, representation learning, and generative modeling.

The Goal of Unsupervised Learning

In supervised learning, the target tells the model what to predict. In unsupervised learning, the target is implicit.

The model may try to learn:

Goal | Meaning
Clusters | Which examples are similar
Representations | Useful hidden features
Low-dimensional structure | A compact version of the data
Probability density | How likely an example is
Generative structure | How to produce new examples
Anomalies | Which examples are unusual

For example, given many images without labels, an unsupervised method may learn that images contain edges, textures, shapes, parts, and object-like regions. No human label says “edge” or “object.” The structure comes from the data distribution.

Clustering

Clustering groups similar examples together.

Given data points

x^{(1)}, x^{(2)}, \dots, x^{(N)},

a clustering algorithm assigns each point to a cluster:

c^{(i)} \in \{1, 2, \dots, K\}.

The classic example is k-means clustering. It learns K cluster centers:

\mu_1, \mu_2, \dots, \mu_K.

Each data point is assigned to the nearest center:

c^{(i)} = \arg\min_k \|x^{(i)} - \mu_k\|^2.

The objective, minimized over both the cluster assignments and the centers, is

\sum_{i=1}^{N} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2.

In deep learning, clustering is often applied to learned embeddings rather than raw data. For example, a neural network may first map an image to an embedding vector, and clustering may then group embeddings by visual similarity.
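To make the algorithm concrete, here is a minimal k-means sketch in PyTorch. The function name kmeans, the number of clusters, and the iteration count are illustrative choices, and the centers are initialized from random data points:

import torch

def kmeans(x, K=3, n_iters=20):
    # x: (N, D) data matrix; initialize centers from K random data points
    N, D = x.shape
    centers = x[torch.randperm(N)[:K]].clone()

    for _ in range(n_iters):
        # Assignment step: each point goes to the nearest center
        dists = torch.cdist(x, centers)      # (N, K) pairwise distances
        assign = dists.argmin(dim=1)         # (N,) cluster index per point

        # Update step: each center becomes the mean of its assigned points
        for k in range(K):
            mask = assign == k
            if mask.any():
                centers[k] = x[mask].mean(dim=0)

    return assign, centers

x = torch.randn(500, 2)
assign, centers = kmeans(x, K=3)
print(assign.shape, centers.shape)   # torch.Size([500]) torch.Size([3, 2])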

Dimensionality Reduction

Dimensionality reduction maps high-dimensional data into a lower-dimensional space.

Suppose

x \in \mathbb{R}^{D}

and D is large. We want a lower-dimensional representation

z \in \mathbb{R}^{d}, \quad d < D.

The mapping is usually written as

z = f(x).

The goal is to preserve important structure while removing redundancy or noise.

Principal component analysis, or PCA, is the classical linear method. Autoencoders are the deep learning version.
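As a rough illustration of the linear case, PCA can be computed from a singular value decomposition of the centered data. The data, dimensions, and the choice d = 10 below are arbitrary:

X = torch.randn(1000, 50)            # N = 1000 examples, D = 50 features
X_centered = X - X.mean(dim=0)       # center each feature

# SVD of the centered data; rows of Vh are the principal directions
U, S, Vh = torch.linalg.svd(X_centered, full_matrices=False)

d = 10
Z = X_centered @ Vh[:d].T            # project onto the top d components
print(Z.shape)                       # torch.Size([1000, 10])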

An autoencoder has two parts:

z = f_\theta(x)

and

\hat{x} = g_\phi(z).

The encoder f_\theta maps the input to a compact representation. The decoder g_\phi reconstructs the input from that representation.

The reconstruction loss is often

L(x,\hat{x}) = \|x - \hat{x}\|^2.

In PyTorch:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, hidden_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z

model = Autoencoder(input_dim=784, hidden_dim=32)

x = torch.randn(64, 784)
x_hat, z = model(x)

loss = ((x - x_hat) ** 2).mean()

print(z.shape)     # torch.Size([64, 32])
print(loss.shape)  # torch.Size([])

Here the model compresses 784-dimensional inputs into 32-dimensional representations.

Representation Learning

Representation learning is the process of learning useful features from data.

A representation is a transformed version of an input:

z = f_\theta(x).

The representation z should keep information that matters and discard information that does not.

For example:

Input | Useful representation may capture
Image | Shapes, textures, object parts
Sentence | Meaning, syntax, entities
Audio | Phonemes, speaker traits, rhythm
Graph | Node roles, communities, connectivity
User behavior | Preferences, intent, habits

Deep learning is powerful because it can learn representations instead of relying only on manually designed features.

In modern systems, representations are often learned on large unlabeled datasets and reused for supervised tasks. A model trained on unlabeled text may learn embeddings useful for classification, retrieval, question answering, and summarization.
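One common reuse pattern, sketched here under the assumption that the Autoencoder above has already been trained, is to freeze its encoder and fit only a small supervised head on top of the learned representations. The label tensor y is hypothetical and only for illustration:

# Reuse a pretrained encoder for a downstream classification task.
encoder = model.encoder               # from the Autoencoder defined above
for p in encoder.parameters():
    p.requires_grad = False           # freeze the representation

classifier = nn.Linear(32, 10)        # small supervised head on top of z
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

x = torch.randn(64, 784)
y = torch.randint(0, 10, (64,))       # hypothetical labels for illustration

with torch.no_grad():
    z = encoder(x)                    # frozen 32-dimensional representations

logits = classifier(z)
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
optimizer.step()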

Density Estimation

Density estimation tries to learn the probability distribution that generated the data.

The goal is to model

p_\theta(x).

If the model assigns high probability to realistic examples and low probability to unrealistic examples, it has learned something about the data distribution.

Density estimation is central to generative modeling. A language model, for example, estimates the probability of a token sequence:

p_\theta(x_1, x_2, \dots, x_T).

Using the chain rule of probability, this can be written as

p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1}).

This objective uses no external human label. The sequence itself provides the training signal.
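To make the factorization concrete, the sketch below scores a token sequence with a toy next-token model. For simplicity it conditions only on the previous token rather than the full history, and the vocabulary size, embedding size, and sequence length are arbitrary:

vocab_size, emb_dim = 100, 16

# Toy next-token model (conditions only on the previous token, a deliberate
# simplification of the full history x_1, ..., x_{t-1}).
embed = nn.Embedding(vocab_size, emb_dim)
head = nn.Linear(emb_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 12))    # one sequence of length T = 12

inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict x_t from x_{t-1}
logits = head(embed(inputs))                      # (1, T-1, vocab_size)
log_probs = torch.log_softmax(logits, dim=-1)

# Summing log p(x_t | context) over the sequence gives the log of the product above
seq_log_prob = log_probs.gather(-1, targets.unsqueeze(-1)).sum()
print(seq_log_prob)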

Generative Modeling

A generative model learns to produce new data that resembles the training data.

Examples include:

Data type | Generated output
Text | Articles, code, answers
Images | Photorealistic pictures
Audio | Speech or music
Video | Motion sequences
Molecules | Candidate chemical structures

A generative model may learn either an explicit probability distribution or an implicit sampling process.

Important families include:

Model family | Core idea
Autoregressive models | Generate one element at a time
Variational autoencoders | Learn latent variables
Generative adversarial networks | Train generator against discriminator
Normalizing flows | Learn invertible transformations
Diffusion models | Learn to reverse a noise process

Generative modeling is often unsupervised because training examples do not require external labels. The model learns from the structure of the data itself.

Anomaly Detection

Anomaly detection identifies examples that differ from normal data.

A model is trained on ordinary examples. At inference time, unusual examples receive high anomaly scores.

For an autoencoder, one simple anomaly score is reconstruction error:

s(x) = \|x - \hat{x}\|^2.

If the model reconstructs normal examples well but reconstructs unusual examples poorly, then high reconstruction error indicates an anomaly.
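A minimal sketch of this score, reusing the Autoencoder defined earlier (the threshold below is purely illustrative, not a recommended setting):

def anomaly_score(model, x):
    # Per-example reconstruction error used as the anomaly score s(x)
    with torch.no_grad():
        x_hat, _ = model(x)
    return ((x - x_hat) ** 2).mean(dim=1)      # one score per example

x = torch.randn(64, 784)
scores = anomaly_score(model, x)

threshold = scores.mean() + 3 * scores.std()   # illustrative threshold only
flagged = scores > threshold
print(scores.shape, flagged.sum())             # torch.Size([64]) and the number flagged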

Applications include:

Domain | Anomaly
Cybersecurity | Suspicious network behavior
Manufacturing | Defective parts
Finance | Fraudulent transactions
Medicine | Unusual scans
Infrastructure | Sensor failures

Anomaly detection is difficult because anomalies are rare and diverse. The model may see many examples of normal behavior but few examples of failure.

Unsupervised Learning in PyTorch

A basic unsupervised training loop looks similar to supervised training. The difference is that there may be no external label.

For an autoencoder:

for x_batch in dataloader:
    optimizer.zero_grad()

    x_hat, z = model(x_batch)

    loss = ((x_batch - x_hat) ** 2).mean()

    loss.backward()
    optimizer.step()

The input itself acts as the target. This pattern appears in many unsupervised models.

For contrastive or self-supervised methods, the training loop may create artificial views of the same input:

for x_batch in dataloader:
    x1 = augment(x_batch)
    x2 = augment(x_batch)

    z1 = encoder(x1)
    z2 = encoder(x2)

    loss = contrastive_loss(z1, z2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The dataset still contains only x, but the training procedure constructs a learning signal from transformations of x.
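The functions augment and contrastive_loss above are placeholders. One common choice for the loss is a simplified InfoNCE objective, in which matching views within a batch are positives and all other pairs are negatives. The temperature value here is an arbitrary illustration:

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # z1, z2: (B, d) embeddings of two augmented views of the same batch.
    # Row i of z1 and row i of z2 are a positive pair; other pairs are negatives.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)

    logits = z1 @ z2.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(z1.shape[0])       # the positive for row i is column i
    return F.cross_entropy(logits, targets)

z1 = torch.randn(64, 128)
z2 = torch.randn(64, 128)
print(contrastive_loss(z1, z2))               # a scalar loss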

Unsupervised Versus Supervised Learning

The practical difference is the source of the training signal.

Property | Supervised learning | Unsupervised learning
Data | Inputs and targets | Inputs only
Example | (x, y) | x
Objective | Predict labels or values | Discover structure
Common tasks | Classification, regression | Clustering, compression, generation
Cost | Often needs labels | Can use unlabeled data
Risk | Label bias, overfitting | Harder evaluation

Unsupervised learning can use much larger datasets because unlabeled data is abundant. This makes it important for modern deep learning, where large-scale pretraining often depends on weak, implicit, or self-generated training signals.

Limitations

Unsupervised learning has several limitations.

First, the objective may not match the final task. A model may learn structure that is mathematically valid but practically useless.

Second, evaluation is harder. In classification, accuracy is easy to measure. In unsupervised learning, there may be no single correct answer.

Third, learned representations may encode unwanted biases from the data.

Fourth, generative models may learn to imitate surface statistics without learning deeper causal structure.

Finally, unsupervised methods often require large datasets and careful objective design.

Summary

Unsupervised learning learns from inputs without explicit labels. It aims to discover structure in the data distribution.

The main tasks include clustering, dimensionality reduction, representation learning, density estimation, generative modeling, and anomaly detection.

In deep learning, unsupervised learning is important because it can use large unlabeled datasets. Autoencoders, language models, diffusion models, and contrastive systems all rely on the idea that useful training signals can be extracted from the data itself.