Unsupervised Learning

Unsupervised learning studies data without explicit target labels. The dataset contains inputs only:

\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}.

There is no given y. The model must discover useful structure in the data itself.

Unsupervised learning is used for clustering, dimensionality reduction, density estimation, anomaly detection, representation learning, and generative modeling.

The Goal of Unsupervised Learning

In supervised learning, the target tells the model what to predict. In unsupervised learning, the target is implicit.

The model may try to learn:

Goal | Meaning
Clusters | Which examples are similar
Representations | Useful hidden features
Low-dimensional structure | A compact version of the data
Probability density | How likely an example is
Generative structure | How to produce new examples
Anomalies | Which examples are unusual

For example, given many images without labels, an unsupervised method may learn that images contain edges, textures, shapes, parts, and object-like regions. No human label says “edge” or “object.” The structure comes from the data distribution.

Clustering

Clustering groups similar examples together.

Given data points

x^{(1)}, x^{(2)}, \dots, x^{(N)},

a clustering algorithm assigns each point to a cluster:

c^{(i)} \in \{1, 2, \dots, K\}.

The classic example is k-means clustering. It learns K cluster centers:

\mu_1, \mu_2, \dots, \mu_K.

Each data point is assigned to the nearest center:

c^{(i)} = \arg\min_k \|x^{(i)} - \mu_k\|^2.

The objective, minimized over both the cluster assignments and the centers, is

\sum_{i=1}^{N} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2.

In deep learning, clustering is often applied to learned embeddings rather than raw data. For example, a neural network may first map an image to an embedding vector, and clustering may then group embeddings by visual similarity.
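To make the algorithm concrete, here is a minimal k-means sketch in PyTorch. The function name kmeans, the number of clusters, and the iteration count are illustrative choices, and the centers are initialized from random data points:

import torch

def kmeans(x, K=3, n_iters=20):
    # x: (N, D) data matrix; initialize centers from K random data points
    N, D = x.shape
    centers = x[torch.randperm(N)[:K]].clone()

    for _ in range(n_iters):
        # Assignment step: each point goes to the nearest center
        dists = torch.cdist(x, centers)      # (N, K) pairwise distances
        assign = dists.argmin(dim=1)         # (N,) cluster index per point

        # Update step: each center becomes the mean of its assigned points
        for k in range(K):
            mask = assign == k
            if mask.any():
                centers[k] = x[mask].mean(dim=0)

    return assign, centers

x = torch.randn(500, 2)
assign, centers = kmeans(x, K=3)
print(assign.shape, centers.shape)   # torch.Size([500]) torch.Size([3, 2])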

Dimensionality Reduction

Dimensionality reduction maps high-dimensional data into a lower-dimensional space.

Suppose

x \in \mathbb{R}^{D}

and D is large. We want a lower-dimensional representation

z \in \mathbb{R}^{d}, \quad d < D.

The mapping is usually written as

z = f(x).

The goal is to preserve important structure while removing redundancy or noise.

Principal component analysis, or PCA, is the classical linear method. Autoencoders are the deep learning version.
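As a rough illustration of the linear case, PCA can be computed from a singular value decomposition of the centered data. The data, dimensions, and the choice d = 10 below are arbitrary:

X = torch.randn(1000, 50)            # N = 1000 examples, D = 50 features
X_centered = X - X.mean(dim=0)       # center each feature

# SVD of the centered data; rows of Vh are the principal directions
U, S, Vh = torch.linalg.svd(X_centered, full_matrices=False)

d = 10
Z = X_centered @ Vh[:d].T            # project onto the top d components
print(Z.shape)                       # torch.Size([1000, 10])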

An autoencoder has two parts:

z = f_\theta(x)

and

\hat{x} = g_\phi(z).

The encoder f_\theta maps the input to a compact representation. The decoder g_\phi reconstructs the input from that representation.

The reconstruction loss is often

L(x,\hat{x}) = \|x - \hat{x}\|^2.

In PyTorch:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, hidden_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z

model = Autoencoder(input_dim=784, hidden_dim=32)

x = torch.randn(64, 784)
x_hat, z = model(x)

loss = ((x - x_hat) ** 2).mean()

print(z.shape)     # torch.Size([64, 32])
print(loss.shape)  # torch.Size([])

Here the model compresses 784-dimensional inputs into 32-dimensional representations.

Representation Learning

Representation learning is the process of learning useful features from data.

A representation is a transformed version of an input:

z = f_\theta(x).

The representation z should keep information that matters and discard information that does not.

For example:

Input | Useful representation may capture
Image | Shapes, textures, object parts
Sentence | Meaning, syntax, entities
Audio | Phonemes, speaker traits, rhythm
Graph | Node roles, communities, connectivity
User behavior | Preferences, intent, habits

Deep learning is powerful because it can learn representations instead of relying only on manually designed features.

In modern systems, representations are often learned on large unlabeled datasets and reused for supervised tasks. A model trained on unlabeled text may learn embeddings useful for classification, retrieval, question answering, and summarization.
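One common reuse pattern, sketched here under the assumption that the Autoencoder above has already been trained, is to freeze its encoder and fit only a small supervised head on top of the learned representations. The label tensor y is hypothetical and only for illustration:

# Reuse a pretrained encoder for a downstream classification task.
encoder = model.encoder               # from the Autoencoder defined above
for p in encoder.parameters():
    p.requires_grad = False           # freeze the representation

classifier = nn.Linear(32, 10)        # small supervised head on top of z
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

x = torch.randn(64, 784)
y = torch.randint(0, 10, (64,))       # hypothetical labels for illustration

with torch.no_grad():
    z = encoder(x)                    # frozen 32-dimensional representations

logits = classifier(z)
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
optimizer.step()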

Density Estimation

Density estimation tries to learn the probability distribution that generated the data.

The goal is to model

p_\theta(x).

If the model assigns high probability to realistic examples and low probability to unrealistic examples, it has learned something about the data distribution.

Density estimation is central to generative modeling. A language model, for example, estimates the probability of a token sequence:

p_\theta(x_1, x_2, \dots, x_T).

Using the chain rule of probability, this can be written as

p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1}).

This objective uses no external human label. The sequence itself provides the training signal.
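To make the factorization concrete, the sketch below scores a token sequence with a toy next-token model. For simplicity it conditions only on the previous token rather than the full history, and the vocabulary size, embedding size, and sequence length are arbitrary:

vocab_size, emb_dim = 100, 16

# Toy next-token model (conditions only on the previous token, a deliberate
# simplification of the full history x_1, ..., x_{t-1}).
embed = nn.Embedding(vocab_size, emb_dim)
head = nn.Linear(emb_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 12))    # one sequence of length T = 12

inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict x_t from x_{t-1}
logits = head(embed(inputs))                      # (1, T-1, vocab_size)
log_probs = torch.log_softmax(logits, dim=-1)

# Summing log p(x_t | context) over the sequence gives the log of the product above
seq_log_prob = log_probs.gather(-1, targets.unsqueeze(-1)).sum()
print(seq_log_prob)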

Generative Modeling

A generative model learns to produce new data that resembles the training data.

Examples include:

Data type | Generated output
Text | Articles, code, answers
Images | Photorealistic pictures
Audio | Speech or music
Video | Motion sequences
Molecules | Candidate chemical structures

A generative model may learn either an explicit probability distribution or an implicit sampling process.

Important families include:

Model family | Core idea
Autoregressive models | Generate one element at a time
Variational autoencoders | Learn latent variables
Generative adversarial networks | Train generator against discriminator
Normalizing flows | Learn invertible transformations
Diffusion models | Learn to reverse a noise process

Generative modeling is often unsupervised because training examples do not require external labels. The model learns from the structure of the data itself.

Anomaly Detection

Anomaly detection identifies examples that differ from normal data.

A model is trained on ordinary examples. At inference time, unusual examples receive high anomaly scores.

For an autoencoder, one simple anomaly score is reconstruction error:

s(x) = \|x - \hat{x}\|^2.

If the model reconstructs normal examples well but reconstructs unusual examples poorly, then high reconstruction error indicates an anomaly.
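A minimal sketch of this score, reusing the Autoencoder defined earlier (the threshold below is purely illustrative, not a recommended setting):

def anomaly_score(model, x):
    # Per-example reconstruction error used as the anomaly score s(x)
    with torch.no_grad():
        x_hat, _ = model(x)
    return ((x - x_hat) ** 2).mean(dim=1)      # one score per example

x = torch.randn(64, 784)
scores = anomaly_score(model, x)

threshold = scores.mean() + 3 * scores.std()   # illustrative threshold only
flagged = scores > threshold
print(scores.shape, flagged.sum())             # torch.Size([64]) and the number flagged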

Applications include:

Domain | Anomaly
Cybersecurity | Suspicious network behavior
Manufacturing | Defective parts
Finance | Fraudulent transactions
Medicine | Unusual scans
Infrastructure | Sensor failures

Anomaly detection is difficult because anomalies are rare and diverse. The model may see many examples of normal behavior but few examples of failure.

Unsupervised Learning in PyTorch

A basic unsupervised training loop looks similar to supervised training. The difference is that there may be no external label.

For an autoencoder:

for x_batch in dataloader:
    optimizer.zero_grad()

    x_hat, z = model(x_batch)

    loss = ((x_batch - x_hat) ** 2).mean()

    loss.backward()
    optimizer.step()

The input itself acts as the target. This pattern appears in many unsupervised models.

For contrastive or self-supervised methods, the training loop may create artificial views of the same input:

for x_batch in dataloader:
    x1 = augment(x_batch)
    x2 = augment(x_batch)

    z1 = encoder(x1)
    z2 = encoder(x2)

    loss = contrastive_loss(z1, z2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The dataset still contains only x, but the training procedure constructs a learning signal from transformations of x.
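The functions augment and contrastive_loss above are placeholders. One common choice for the loss is a simplified InfoNCE objective, in which matching views within a batch are positives and all other pairs are negatives. The temperature value here is an arbitrary illustration:

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # z1, z2: (B, d) embeddings of two augmented views of the same batch.
    # Row i of z1 and row i of z2 are a positive pair; other pairs are negatives.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)

    logits = z1 @ z2.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(z1.shape[0])       # the positive for row i is column i
    return F.cross_entropy(logits, targets)

z1 = torch.randn(64, 128)
z2 = torch.randn(64, 128)
print(contrastive_loss(z1, z2))               # a scalar loss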

Unsupervised Versus Supervised Learning

The practical difference is the source of the training signal.

Property | Supervised learning | Unsupervised learning
Data | Inputs and targets | Inputs only
Example | (x, y) | x
Objective | Predict labels or values | Discover structure
Common tasks | Classification, regression | Clustering, compression, generation
Cost | Often needs labels | Can use unlabeled data
Risk | Label bias, overfitting | Harder evaluation

Unsupervised learning can use much larger datasets because unlabeled data is abundant. This makes it important for modern deep learning, where large-scale pretraining often depends on weak, implicit, or self-generated training signals.

Limitations

Unsupervised learning has several limitations.

First, the objective may not match the final task. A model may learn structure that is mathematically valid but practically useless.

Second, evaluation is harder. In classification, accuracy is easy to measure. In unsupervised learning, there may be no single correct answer.

Third, learned representations may encode unwanted biases from the data.

Fourth, generative models may learn to imitate surface statistics without learning deeper causal structure.

Finally, unsupervised methods often require large datasets and careful objective design.

Summary

Unsupervised learning learns from inputs without explicit labels. It aims to discover structure in the data distribution.

The main tasks include clustering, dimensionality reduction, representation learning, density estimation, generative modeling, and anomaly detection.

In deep learning, unsupervised learning is important because it can use large unlabeled datasets. Autoencoders, language models, diffusion models, and contrastive systems all rely on the idea that useful training signals can be extracted from the data itself.