Dimensionality Reduction

High-dimensional data often contains structure that can be described with fewer variables than the raw representation suggests. An image with $224 \times 224 \times 3$ pixels has 150,528 numerical values, but natural images occupy a much smaller part of that space. A sentence may contain many tokens, but its meaning can often be summarized by a shorter vector. A user profile may contain thousands of observed interactions, but many of those interactions reflect a smaller number of latent preferences.

Dimensionality reduction is the study of mapping data from a high-dimensional space to a lower-dimensional space while preserving the information needed for a task. In deep learning, this idea appears in autoencoders, embeddings, latent variable models, representation learning, compression, visualization, retrieval, and generative modeling.

An encoder maps an input $x$ to a lower-dimensional representation $z$:

z = f_\theta(x).

The representation $z$ is often called a latent vector, code, bottleneck representation, or embedding. A decoder maps this representation back to the input space:

\hat{x} = g_\phi(z).

The model is trained so that $\hat{x}$ is close to $x$. The lower-dimensional vector $z$ must therefore retain the information needed to reconstruct the input.

The Basic Problem

Let the input data lie in $\mathbb{R}^D$. A single example is

x \in \mathbb{R}^D.

Dimensionality reduction seeks a representation

z \in \mathbb{R}^d

where

d < D.

The mapping

f: \mathbb{R}^D \to \mathbb{R}^d

compresses the data, and the mapping

g: \mathbb{R}^d \to \mathbb{R}^D

attempts to reconstruct it.

The reconstruction is

\hat{x} = g(f(x)).

A common training objective is the squared reconstruction error:

L(x,\hat{x}) = \|x - \hat{x}\|_2^2.

For a dataset $\{x_i\}_{i=1}^N$, the empirical objective is

\min_{\theta,\phi} \frac{1}{N} \sum_{i=1}^N \|x_i - g_\phi(f_\theta(x_i))\|_2^2.

This objective says that the encoder and decoder should preserve as much information as possible about the original examples, subject to the bottleneck dimension $d$.

The bottleneck matters. If $d$ is too small, the model cannot store enough information and reconstruction quality is poor. If $d$ is too large, the model may learn an almost identity mapping and fail to produce a useful compressed representation.

Compression and Representation

Dimensionality reduction has two related goals.

The first goal is compression. We want a smaller representation that uses fewer numbers than the original input. Compression is useful when storing, transmitting, indexing, or comparing data.

The second goal is representation learning. We want a representation that exposes useful structure. A good latent vector may group similar inputs together, separate different classes, remove noise, or reveal factors of variation.

These goals overlap, but they differ. A representation can compress the input while preserving irrelevant details. Another representation may discard some input details but preserve the information needed for classification, retrieval, or generation.

For example, an image representation may ignore small pixel noise but preserve object identity. A speech representation may ignore microphone artifacts but preserve phonetic content. A text representation may ignore formatting but preserve semantic meaning.

Linear Dimensionality Reduction

The classical reference point is linear dimensionality reduction. In a linear method, the representation is obtained by projecting the input onto a lower-dimensional subspace.

Assume the data has been centered so that its empirical mean is zero. A linear encoder can be written as

z = W^\top x,

where

W \in \mathbb{R}^{D \times d}.

The columns of $W$ define directions in the original space. The vector $z$ contains the coordinates of $x$ in those directions.

A corresponding linear decoder can be written as

\hat{x} = Wz = WW^\top x.

If the columns of $W$ are orthonormal, then $WW^\top x$ is the projection of $x$ onto the $d$-dimensional subspace spanned by those columns.
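
As a quick sanity check of this statement, the following sketch builds a random matrix with orthonormal columns via a QR decomposition (an arbitrary choice for illustration) and verifies that applying $WW^\top$ twice gives the same result as applying it once:

import torch

# Arbitrary sizes for illustration only.
D, d = 10, 3
W, _ = torch.linalg.qr(torch.randn(D, d))  # W has orthonormal columns

x = torch.randn(D)
z = W.T @ x       # encode: coordinates in the d-dimensional subspace
x_hat = W @ z     # decode: projection of x onto the subspace

# Projecting the projection returns the same point.
print(torch.allclose(W @ (W.T @ x_hat), x_hat, atol=1e-6))  # True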

Principal component analysis, or PCA, chooses the subspace that minimizes squared reconstruction error among all $d$-dimensional linear subspaces. Equivalently, PCA chooses directions of maximum variance.

For centered data, PCA solves

\min_W \sum_{i=1}^N \|x_i - WW^\top x_i\|_2^2

subject to

W^\top W = I.

The solution is given by the top $d$ eigenvectors of the data covariance matrix. These eigenvectors are the principal components.
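
The following sketch computes PCA this way on a random data matrix; the sizes N, D, and d are arbitrary choices for illustration:

import torch

# Hypothetical data matrix of shape [N, D].
N, D, d = 1000, 50, 8
X = torch.randn(N, D)

# Center the data so the empirical mean is zero.
X = X - X.mean(dim=0, keepdim=True)

# Covariance matrix and its eigendecomposition (eigenvalues ascending).
cov = X.T @ X / (N - 1)
eigvals, eigvecs = torch.linalg.eigh(cov)

# Principal components: eigenvectors with the largest eigenvalues.
W = eigvecs[:, -d:]   # [D, d]

Z = X @ W             # encode: [N, d]
X_hat = Z @ W.T       # decode: [N, D]

recon_error = ((X - X_hat) ** 2).sum(dim=1).mean()
print(W.shape, Z.shape, recon_error.item())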

PCA is important because it gives a clean baseline. If a nonlinear model does no better than PCA, the data may be mostly linear, the model may be too weak, or the training objective may be poorly chosen.

Autoencoders as Nonlinear Dimensionality Reduction

An autoencoder generalizes linear dimensionality reduction by replacing the linear encoder and decoder with neural networks.

The encoder is

z = f_\theta(x),

and the decoder is

\hat{x} = g_\phi(z).

Both $f_\theta$ and $g_\phi$ may contain nonlinear layers. This allows the model to represent curved manifolds rather than only flat subspaces.

For example, suppose the data lies near a curved surface in a high-dimensional space. PCA can only approximate this structure with a flat subspace. A nonlinear autoencoder can learn coordinates along the curved surface.

A simple autoencoder has the form

x \to h_1 \to z \to h_2 \to \hat{x}.

The layer $z$ is the bottleneck. It forces the network to choose a compact internal representation.

In PyTorch, a small fully connected autoencoder can be written as follows:

import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z

A training step minimizes reconstruction error:

model = Autoencoder(input_dim=784, latent_dim=32)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 784)

x_hat, z = model(x)
loss = loss_fn(x_hat, x)

optimizer.zero_grad()
loss.backward()
optimizer.step()

print(z.shape)  # torch.Size([64, 32])

For images such as MNIST, the input dimension 784 corresponds to $28 \times 28$ pixels. The latent dimension of 32 forces the model to describe each image using only 32 numbers.

Reconstruction Losses

The reconstruction loss should match the data type and modeling assumption.

For real-valued inputs, mean squared error is common:

L(x,\hat{x}) = \|x - \hat{x}\|_2^2.

This corresponds to a Gaussian observation model when the variance is fixed.

For binary inputs or normalized pixels interpreted as Bernoulli probabilities, binary cross-entropy may be used:

L(x,\hat{x}) = -\sum_{j=1}^D \left[ x_j \log \hat{x}_j + (1-x_j)\log(1-\hat{x}_j) \right].

For categorical outputs, cross-entropy is used. For images, perceptual losses may compare features extracted by another network rather than raw pixels. For audio, losses may be applied in waveform space, spectrogram space, or both.
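
As a small sketch, the mean squared error and binary cross-entropy choices look like this in PyTorch, assuming x and x_hat are batches of pixel values normalized to [0, 1] (random stand-ins here):

import torch
from torch import nn

x = torch.rand(64, 784)       # targets, e.g. normalized pixels
x_hat = torch.rand(64, 784)   # hypothetical decoder output in (0, 1)

# Mean squared error: Gaussian observation model with fixed variance.
mse = nn.MSELoss()(x_hat, x)

# Binary cross-entropy: pixels interpreted as Bernoulli probabilities.
# Requires x_hat in (0, 1), e.g. from a Sigmoid output layer.
bce = nn.BCELoss()(x_hat, x)

print(mse.item(), bce.item())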

The loss determines what information the latent representation must preserve. Pixel-level losses encourage exact reconstruction. Perceptual losses may allow small pixel differences while preserving visual structure. Task-specific losses may encourage representations useful for prediction rather than full reconstruction.

The Bottleneck

The bottleneck is the part of the model that limits information flow. In an undercomplete autoencoder, the latent dimension is smaller than the input dimension:

d < D.

This constraint prevents the model from copying the input directly through a simple identity function. The encoder must learn which information to keep.

However, dimensionality alone is not the only bottleneck. A model can also be constrained by noise, sparsity, quantization, regularization, or probabilistic structure. Later sections discuss sparse autoencoders, denoising autoencoders, and variational autoencoders. Each imposes a different kind of pressure on the latent representation.

A useful bottleneck should remove nuisance variation while preserving meaningful factors. In images, nuisance variation may include sensor noise or tiny pixel shifts. Meaningful factors may include object shape, color, pose, and identity.

Manifold View

Many high-dimensional datasets are believed to concentrate near lower-dimensional manifolds. A manifold is a space that may be curved globally but looks locally low-dimensional.

For example, images of a rotating object may live in a very high-dimensional pixel space, but the rotation angle may be one of the main factors of variation. If lighting and position are fixed, the intrinsic dimension may be much smaller than the number of pixels.

Dimensionality reduction can be viewed as learning coordinates on such a manifold. The encoder maps a high-dimensional point to latent coordinates. The decoder maps latent coordinates back to the data space.

This view explains why dimensionality reduction can work. The raw input dimension may be large, but the data distribution may occupy only a structured subset of the full space.

It also explains a limitation. If the data contains many independent factors of variation, an overly small latent space cannot represent them all. The model must discard information. Which information gets discarded depends on the architecture, the loss, and the training data.

Visualization

Dimensionality reduction is often used for visualization. A high-dimensional representation can be mapped to two or three dimensions and plotted.

For example, an encoder may produce embeddings

z_i \in \mathbb{R}^{128}.

A visualization method may map them to

u_i \in \mathbb{R}^2.

The resulting plot can reveal clusters, outliers, or class structure.

Common visualization methods include PCA, t-SNE, and UMAP. PCA is linear and preserves directions of maximum variance. t-SNE emphasizes local neighborhood structure. UMAP also emphasizes neighborhood structure and is often used for large embedding sets.
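
A minimal sketch of the first two methods, assuming scikit-learn is available and the embeddings are stored in a NumPy array Z:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical embeddings: 500 vectors of dimension 128.
Z = np.random.randn(500, 128)

# Linear projection onto the top two principal components.
u_pca = PCA(n_components=2).fit_transform(Z)    # [500, 2]

# Nonlinear embedding that emphasizes local neighborhood structure.
u_tsne = TSNE(n_components=2).fit_transform(Z)  # [500, 2]

print(u_pca.shape, u_tsne.shape)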

Visualization must be interpreted carefully. A two-dimensional plot can distort distances and densities. Points that appear close may not be close in the original space, and points that appear separated may still overlap in high-dimensional space. Such plots are diagnostic tools, not proofs of class separation.

Retrieval and Similarity

Low-dimensional representations are useful for retrieval. Instead of comparing raw inputs, a system compares embeddings.

Given two inputs $x_i$ and $x_j$, the encoder produces

z_i = f_\theta(x_i), \quad z_j = f_\theta(x_j).

Similarity can be measured with Euclidean distance,

\|z_i - z_j\|_2,

or cosine similarity,

\frac{z_i^\top z_j}{\|z_i\|_2 \|z_j\|_2}.

Retrieval systems use this principle for images, documents, audio clips, products, and user profiles. The quality of retrieval depends on whether the embedding space reflects the desired notion of similarity.
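
A minimal sketch of cosine-similarity retrieval over a hypothetical database of precomputed embeddings:

import torch
import torch.nn.functional as F

# Hypothetical database of 10,000 embeddings of dimension 128 and one query.
db = torch.randn(10_000, 128)
query = torch.randn(128)

# Normalize so that a dot product equals cosine similarity.
db_n = F.normalize(db, dim=1)
query_n = F.normalize(query, dim=0)

scores = db_n @ query_n          # [10000] cosine similarities
topk = torch.topk(scores, k=5)   # the 5 most similar embeddings

print(topk.indices)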

An autoencoder trained only for reconstruction may learn features that preserve low-level details rather than semantic similarity. Contrastive learning and supervised metric learning often produce better retrieval embeddings when semantic similarity is the target.

Dimensionality Reduction Versus Feature Learning

Dimensionality reduction and feature learning are closely related. Both transform data into a representation. The difference lies in the objective.

Dimensionality reduction usually emphasizes compactness. It asks: can we describe the input with fewer coordinates?

Feature learning emphasizes usefulness. It asks: can we produce a representation that makes a downstream task easier?

An autoencoder may serve both purposes. If the latent vector is compact and useful for classification, retrieval, clustering, or generation, then it is both a reduced representation and a learned feature vector.

In modern deep learning, the best representations are often learned from large-scale predictive or contrastive objectives rather than from reconstruction alone. Still, autoencoders remain important because they provide a direct and interpretable framework for studying compression, latent variables, and generative modeling.

Practical Shape Conventions

Suppose an input batch has shape

[B, D]

where $B$ is batch size and $D$ is input dimension. The encoder returns

[B, d]

where $d$ is the latent dimension. The decoder returns

[B, D]

The reconstruction loss compares tensors with the same shape:

x.shape      == torch.Size([B, D])
x_hat.shape  == torch.Size([B, D])
z.shape      == torch.Size([B, d])

For images, a convolutional autoencoder may use

[B, C, H, W]

as input. The latent representation may be a vector

[B, d]

or a smaller feature map

[B, C_latent, H_latent, W_latent]

A feature-map latent representation preserves spatial layout. A vector latent representation discards explicit spatial structure and forces a more global compression.
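
As a sketch of the second option, a feature-map latent can be converted to a vector latent by flattening and projecting; the sizes below are arbitrary choices for illustration:

import torch
from torch import nn

B = 8
feature_map = torch.randn(B, 128, 8, 8)   # hypothetical encoder output

# Flatten the spatial layout and project to a compact vector latent.
to_vector = nn.Sequential(
    nn.Flatten(),                # [B, 128 * 8 * 8]
    nn.Linear(128 * 8 * 8, 64),  # [B, 64]
)

z = to_vector(feature_map)
print(z.shape)  # torch.Size([8, 64])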

Convolutional Autoencoders

For images, fully connected autoencoders ignore spatial locality. A convolutional autoencoder uses convolutional layers in the encoder and decoder.

The encoder reduces spatial resolution while increasing channel depth:

[B, 3, 64, 64]
-> [B, 32, 32, 32]
-> [B, 64, 16, 16]
-> [B, 128, 8, 8]

The decoder reverses this process:

[B, 128, 8, 8]
-> [B, 64, 16, 16]
-> [B, 32, 32, 32]
-> [B, 3, 64, 64]

A minimal convolutional autoencoder in PyTorch:

import torch
from torch import nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
        )

        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z

For inputs of shape [B, 3, 64, 64], this model produces a latent feature map of shape [B, 128, 8, 8]. The decoder reconstructs an image of the original shape.
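
A quick shape check with random inputs (a sketch, not a training run) confirms this:

model = ConvAutoencoder()
x = torch.randn(8, 3, 64, 64)

x_hat, z = model(x)
print(z.shape)      # torch.Size([8, 128, 8, 8])
print(x_hat.shape)  # torch.Size([8, 3, 64, 64])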

The final Sigmoid is appropriate when pixels are scaled to the interval $[0, 1]$. If images are standardized to have negative and positive values, a linear output or Tanh may be more suitable.

Choosing the Latent Dimension

The latent dimension controls the tradeoff between compression and reconstruction.

A smaller latent dimension gives stronger compression. It may encourage the model to learn abstract factors, but it may also discard useful detail.

A larger latent dimension gives better reconstruction. It may preserve more information, but it may also allow the model to memorize or copy inputs.

In practice, one often trains models with several latent dimensions and compares reconstruction quality, downstream performance, and embedding structure.

For example:

Latent dimension    Expected behavior
2                   Useful for visualization, often poor reconstruction
16                  Strong compression, coarse structure
64                  Moderate compression, better detail
256                 Weak compression, high reconstruction quality
1024                May approach identity mapping for simple data

There is no universally correct latent dimension. It depends on the intrinsic complexity of the data and the purpose of the representation.
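
A minimal sketch of such a comparison, reusing the fully connected Autoencoder defined earlier on hypothetical random data, might look like this:

import torch
from torch import nn

# Hypothetical data: 1,024 examples of dimension 784.
data = torch.rand(1024, 784)
loss_fn = nn.MSELoss()

for latent_dim in [2, 16, 64, 256]:
    # Autoencoder is the fully connected model defined earlier.
    model = Autoencoder(input_dim=784, latent_dim=latent_dim)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # A very short training loop, just to compare trends.
    for _ in range(100):
        x_hat, _ = model(data)
        loss = loss_fn(x_hat, data)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(latent_dim, loss.item())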

Failure Modes

Dimensionality reduction can fail in several ways.

The latent space may preserve the wrong information. If the reconstruction loss rewards low-level detail, the encoder may preserve texture while discarding semantic structure.

The decoder may be too powerful. A strong decoder can sometimes reconstruct plausible outputs while relying weakly on the latent code. This issue appears strongly in some probabilistic autoencoders.

The latent space may be discontinuous. Nearby points in latent space may decode to very different outputs, making interpolation poor.

The model may memorize. If the dataset is small and the model is large, the autoencoder may learn training examples without discovering reusable structure.

The bottleneck may be too severe. If the latent dimension is too small, the reconstruction may blur, collapse, or lose important factors.

These failures are diagnosed by reconstruction error, held-out validation loss, latent visualization, interpolation tests, nearest-neighbor retrieval, and downstream task evaluation.

Summary

Dimensionality reduction maps high-dimensional data to a lower-dimensional representation. In deep learning, this representation is usually learned by an encoder. An autoencoder adds a decoder and trains the representation by reconstructing the input.

Linear dimensionality reduction leads to PCA. Nonlinear dimensionality reduction leads naturally to autoencoders. The latent representation can support compression, visualization, retrieval, denoising, and generative modeling.

The central design choice is the bottleneck. A useful bottleneck keeps information that explains the data while discarding irrelevant variation. In later sections, we will study undercomplete, sparse, denoising, and variational autoencoders as different ways to shape this bottleneck.