Deep learning often begins with data that has many coordinates. An image may contain hundreds of thousands of pixel values. A document may be represented by thousands of token counts. A biological measurement may contain expressions for tens of thousands of genes. A user profile, a graph node, or a sensor trace may also have a high-dimensional representation.
Dimensionality reduction is the problem of replacing a high-dimensional representation with a lower-dimensional one while preserving the information needed for a task. If the original input is a vector

$$x \in \mathbb{R}^d,$$

we want to construct a lower-dimensional representation

$$z \in \mathbb{R}^k, \quad k < d.$$
The vector $z$ is often called a latent representation, code, embedding, or bottleneck representation. The choice of term depends on the model family and application.
In representation learning, dimensionality reduction has two goals. The first goal is compression: represent the input using fewer numbers. The second goal is abstraction: remove irrelevant variation and preserve meaningful structure. A useful low-dimensional representation does more than shrink the data. It exposes regularities that are easier for a model to use.
Why Reduce Dimension?
High-dimensional data is expensive to store, process, visualize, and learn from. If each example is a vector in $\mathbb{R}^d$, then many algorithms require computation or memory that grows with $d$. Reducing the dimension can reduce cost.
There is also a statistical reason. In high dimensions, data can become sparse. Distances become harder to interpret. Many regions of the input space contain no training examples. A learning algorithm may need many examples to estimate a reliable function over such a space.
This problem is sometimes called the curse of dimensionality. It does not mean high-dimensional learning is impossible. Modern deep networks learn successfully in very high-dimensional spaces. The point is more precise: without structure, high-dimensional problems require too much data. Dimensionality reduction works when the data has lower-dimensional structure.
For example, a $28 \times 28$ grayscale image has

$$28 \times 28 = 784$$

pixel values. But handwritten digits do not fill all of $\mathbb{R}^{784}$. Most 784-dimensional vectors do not look like digits. Real digit images lie near a smaller set of structured configurations: strokes, curves, thickness, slant, and position. A lower-dimensional representation can capture these factors.
Intrinsic Dimension
The ambient dimension is the number of coordinates used to store the data. The intrinsic dimension is the number of degrees of freedom needed to describe the meaningful variation in the data.
A point on a line in three-dimensional space has ambient dimension 3 but intrinsic dimension 1. A point on a surface has ambient dimension 3 but intrinsic dimension 2. Similarly, an image may have many pixels but fewer meaningful factors of variation.
Suppose images of one object are generated by changing its rotation angle, horizontal position, vertical position, lighting, and scale. Although each image may have thousands of pixels, the generating process may depend on only a few variables. Dimensionality reduction attempts to recover or approximate such lower-dimensional factors.
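The gap between ambient and intrinsic dimension can be made concrete with a small synthetic example (NumPy; the sizes and the random linear "rendering" map are illustrative assumptions, not part of the text above):

```python
import numpy as np

rng = np.random.default_rng(0)

# One generating factor: an angle theta (intrinsic dimension 1).
theta = rng.uniform(0, 2 * np.pi, size=500)

# Embed the 2-D circle coordinates into a 100-dimensional ambient space
# with a fixed random linear map (a stand-in for pixel rendering).
A = rng.normal(size=(100, 2))
X = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ A.T

print(X.shape)                       # (500, 100): ambient dimension 100
# All variation lives in a 2-D subspace, so the data matrix has rank 2
# even though each point is stored with 100 coordinates.
print(np.linalg.matrix_rank(X))      # 2
```

Each point is described by 100 numbers, but a single angle determines it completely.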
In deep learning, we often do not explicitly know the intrinsic variables. Instead, we learn a function

$$f_\theta : \mathbb{R}^d \to \mathbb{R}^k$$

that maps an input $x$ to a representation

$$z = f_\theta(x).$$

The parameters $\theta$ are learned from data.
Linear Dimensionality Reduction
The simplest dimensionality reduction methods use linear maps. A linear encoder maps

$$x \in \mathbb{R}^d$$

to

$$z \in \mathbb{R}^k$$

by

$$z = W x,$$

where

$$W \in \mathbb{R}^{k \times d}.$$

The rows of $W$ define directions in the input space. Each coordinate of $z$ is a projection of $x$ onto one of those directions.
A linear decoder maps the low-dimensional vector back to the original space:

$$\hat{x} = V z,$$

where

$$V \in \mathbb{R}^{d \times k}.$$

The reconstructed input $\hat{x}$ is an approximation of $x$. The quality of the representation can be measured by reconstruction error:

$$\|x - \hat{x}\|_2^2.$$

If $k$ is much smaller than $d$, exact reconstruction is generally impossible. The model must preserve the most important variation and discard the rest.
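The linear encode/decode pipeline can be sketched in a few lines (NumPy; the dimensions and random matrices are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3

x = rng.normal(size=d)         # input in R^d
W = rng.normal(size=(k, d))    # encoder matrix
V = rng.normal(size=(d, k))    # decoder matrix

z = W @ x                      # latent code in R^k
x_hat = V @ z                  # reconstruction in R^d

err = np.sum((x - x_hat) ** 2) # squared reconstruction error
print(z.shape, x_hat.shape, err)
```

With random, untrained matrices the error is large; training chooses $W$ and $V$ to make it small.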
Principal Component Analysis
Principal component analysis, or PCA, is the classical linear method for dimensionality reduction. PCA finds directions of maximum variance in the data. These directions are called principal components.
Let the data be centered, so its empirical mean is zero. Given examples

$$x_1, \dots, x_n \in \mathbb{R}^d,$$

PCA seeks a low-dimensional subspace that minimizes squared reconstruction error.

For a $k$-dimensional subspace with orthonormal basis vectors stored in a matrix

$$U \in \mathbb{R}^{d \times k}, \quad U^\top U = I,$$

the encoding is

$$z_i = U^\top x_i,$$

and the reconstruction is

$$\hat{x}_i = U U^\top x_i.$$

The PCA objective is

$$\min_{U} \sum_{i=1}^{n} \|x_i - U U^\top x_i\|_2^2.$$

The solution is given by the top eigenvectors of the empirical covariance matrix. If

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top,$$

then PCA uses the eigenvectors corresponding to the $k$ largest eigenvalues of $\Sigma$.
Large eigenvalues indicate directions along which the data varies strongly. Small eigenvalues indicate directions with little variation, often noise or fine detail.
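This procedure can be sketched directly from the covariance eigendecomposition (NumPy; the random data and small dimensions are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 2

X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                     # center the data

Sigma = (X.T @ X) / n                      # empirical covariance, d x d
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order

U = eigvecs[:, -k:]                        # top-k eigenvectors as columns
Z = X @ U                                  # encodings, n x k
X_hat = Z @ U.T                            # reconstructions, n x d

err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(U.shape, Z.shape, err)
```

Note that `np.linalg.eigh` is appropriate here because the covariance matrix is symmetric.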
PCA as a Linear Autoencoder
An autoencoder is a neural network trained to reconstruct its input. It contains an encoder and a decoder:

$$z = f_\theta(x), \qquad \hat{x} = g_\phi(z).$$

The model is trained by minimizing reconstruction loss:

$$\mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \|x_i - g_\phi(f_\theta(x_i))\|_2^2.$$
If the encoder and decoder are linear, the bottleneck dimension is $k$, and the loss is mean squared reconstruction error, then the learned subspace is closely related to PCA. Under standard assumptions, a linear autoencoder learns the same principal subspace as PCA, although the basis inside that subspace need not match the PCA basis exactly.
This connection is important. It shows that autoencoders generalize classical dimensionality reduction. PCA uses a linear projection. A nonlinear autoencoder uses learned nonlinear functions:

$$z = f_\theta(x), \qquad \hat{x} = g_\phi(z),$$

where $f_\theta$ and $g_\phi$ are nonlinear neural networks.
The nonlinear case can represent curved low-dimensional structures that a linear method cannot capture.
Nonlinear Dimensionality Reduction
Many real datasets do not lie near a flat linear subspace. They may lie near a curved manifold. In that case, a linear projection can distort the structure.
Consider images of a rotating object. The true factor may be an angle. As the angle changes, the pixel vector traces a curved path in the high-dimensional input space. PCA can approximate this curve with a flat subspace, but a nonlinear model can represent it more naturally.
A nonlinear encoder has the form

$$z = f_\theta(x),$$

where $f_\theta$ is a neural network. A nonlinear decoder reconstructs the input:

$$\hat{x} = g_\phi(z).$$

The training objective is usually

$$\min_{\theta, \phi} \; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(x_i, \, g_\phi(f_\theta(x_i))\bigr).$$
For real-valued inputs, the loss is often mean squared error. For binary inputs, the loss may be binary cross-entropy. For images, losses may combine pixel error, perceptual error, adversarial loss, or diffusion-based objectives.
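The first two loss choices can be sketched as follows (PyTorch; the tensors are random placeholders, not real data):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Real-valued inputs: mean squared error between input and reconstruction.
x_real = torch.randn(4, 784)            # real-valued inputs
x_hat = torch.randn(4, 784)             # decoder output (raw values)
mse = nn.MSELoss()(x_hat, x_real)

# Binary inputs: binary cross-entropy, with the decoder emitting logits.
x_bin = torch.randint(0, 2, (4, 784)).float()
logits = torch.randn(4, 784)            # decoder output (logits)
bce = nn.BCEWithLogitsLoss()(logits, x_bin)

print(mse.item(), bce.item())
```

`BCEWithLogitsLoss` is used rather than `BCELoss` so the decoder can output unbounded logits without a final sigmoid.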
The Bottleneck Principle
A bottleneck is a layer whose dimension is smaller than the input dimension. If the encoder maps

$$\mathbb{R}^d \to \mathbb{R}^k$$

with

$$k < d,$$
then the model must compress information. The decoder cannot reconstruct the input perfectly unless the representation preserves the necessary information.
This pressure makes the bottleneck useful. The encoder must learn features that summarize the input. For images, such features might include edges, shapes, pose, and object identity. For text, they might include topic, syntax, sentiment, or semantic content.
The bottleneck alone does not guarantee useful representations. A model can learn poor codes if the objective is poorly chosen. For example, pixel reconstruction may force the model to preserve low-level detail while ignoring semantic structure. A good dimensionality reduction method must match the training objective to the desired representation.
Undercomplete Autoencoders
An undercomplete autoencoder has a latent dimension smaller than the input dimension:

$$k < d.$$

This is the most direct neural form of dimensionality reduction. The encoder compresses the input into $z$, and the decoder reconstructs the input from $z$.
A simple PyTorch autoencoder for flattened images can be written as follows:
```python
import torch
from torch import nn


class Autoencoder(nn.Module):
    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat
```

A training step minimizes reconstruction error:
```python
model = Autoencoder(input_dim=784, latent_dim=32)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 784)
x_hat = model(x)
loss = loss_fn(x_hat, x)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Here the model compresses each 784-dimensional input into a 32-dimensional latent vector.
Choosing the Latent Dimension
The latent dimension $k$ controls the compression strength.

If $k$ is too small, the representation cannot preserve enough information. Reconstructions become blurry, incomplete, or inaccurate. If $k$ is too large, the model may copy the input too easily. The representation may preserve noise and irrelevant details.

There is no universal correct value of $k$. It depends on the dataset, model capacity, loss function, and downstream task.
A common procedure is to train several models with different latent dimensions and measure reconstruction error or downstream performance.
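Such a sweep can be sketched as follows (PyTorch; the toy data, network sizes, and step count are arbitrary illustrative choices):

```python
import torch
from torch import nn


def make_autoencoder(d: int, k: int):
    """Build a small encoder/decoder pair with bottleneck k."""
    enc = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, k))
    dec = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, d))
    return enc, dec


torch.manual_seed(0)
x = torch.randn(256, 32)          # toy dataset (random, for illustration)
loss_fn = nn.MSELoss()
results = {}

for k in [2, 8, 16]:              # candidate latent dimensions
    enc, dec = make_autoencoder(32, k)
    params = list(enc.parameters()) + list(dec.parameters())
    opt = torch.optim.AdamW(params, lr=1e-3)
    for _ in range(200):          # short training loop per candidate
        x_hat = dec(enc(x))
        loss = loss_fn(x_hat, x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    results[k] = loss.item()      # final reconstruction error for this k

print(results)
```

Comparing the final errors (or downstream metrics) across candidates guides the choice of $k$.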
| Latent dimension | Expected behavior |
|---|---|
| Very small | Strong compression, high reconstruction error |
| Moderate | Useful abstraction and acceptable reconstruction |
| Very large | Weak compression, possible memorization |
For representation learning, downstream performance may matter more than reconstruction quality. A representation that reconstructs every pixel accurately may still be less useful for classification than a representation that discards nuisance variation.
Dimensionality Reduction for Visualization
Dimensionality reduction is often used to visualize high-dimensional data. A representation may be reduced to two or three dimensions and plotted.
Classical methods include PCA, t-SNE, and UMAP. PCA preserves global linear variance. t-SNE and UMAP are nonlinear methods that often reveal local clusters.
Visualization methods should be interpreted carefully. A two-dimensional plot can suggest structure, but it can also introduce distortion. Distances, cluster sizes, and empty regions may not correspond directly to the original high-dimensional geometry.
In deep learning practice, a common workflow is:
- Train a model.
- Extract hidden representations.
- Reduce them to two dimensions.
- Plot them with labels or metadata.
- Inspect whether semantically similar examples appear near each other.
This can reveal whether a model has learned useful structure, but it should not replace quantitative evaluation.
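The reduce-and-plot step of this workflow can be sketched with a PCA projection to two dimensions (NumPy; the random matrix stands in for extracted hidden representations):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 64))    # placeholder for hidden representations

H = H - H.mean(axis=0)            # center before projecting
# Top-2 right singular vectors give the top-2 principal directions.
_, _, Vt = np.linalg.svd(H, full_matrices=False)
coords = H @ Vt[:2].T             # n x 2 coordinates for a scatter plot

print(coords.shape)               # (500, 2)
```

The resulting `coords` array can be passed to any plotting library, colored by labels or metadata.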
Dimensionality Reduction Versus Feature Selection
Dimensionality reduction and feature selection both reduce the amount of information passed to a model, but they do so differently.
Feature selection chooses a subset of original features. Dimensionality reduction constructs new features from the original ones.
For example, if the original vector is

$$x = (x_1, x_2, x_3, x_4, x_5),$$

feature selection might keep

$$(x_2, x_5).$$

Dimensionality reduction might construct

$$z_1 = w_1^\top x, \quad z_2 = w_2^\top x,$$

where each new coordinate mixes several original coordinates.
Feature selection is easier to interpret because the selected coordinates retain their original meaning. Dimensionality reduction is more flexible because the new coordinates can combine information from many input features.
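The contrast can be made concrete (NumPy; the vector and weight values are made up for illustration):

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # original 5-dimensional vector

# Feature selection: keep a subset of the original coordinates.
selected = x[[1, 4]]                       # keep the 2nd and 5th features

# Dimensionality reduction: build new coordinates as combinations.
W = np.array([[0.5, 0.0, 0.5, 0.0, 0.0],
              [0.0, 0.2, 0.0, 0.2, 0.6]])
constructed = W @ x                        # each entry mixes several inputs

print(selected)      # [1. 5.]
print(constructed)   # [3.5 3.4]
```

The selected features keep their original meaning; the constructed ones do not, but each can draw on information from every input coordinate.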
Dimensionality Reduction in Modern Deep Learning
Modern deep learning uses dimensionality reduction in several forms.
In convolutional networks, pooling and strided convolution reduce spatial dimensions while increasing feature abstraction. An image may begin as a tensor of shape

$$H \times W \times 3$$

and gradually become a tensor of shape

$$H' \times W' \times C, \quad H' < H, \; W' < W, \; C > 3.$$

The spatial resolution is reduced, while the channel dimension grows.
In transformers, token embeddings project discrete token IDs into continuous vectors. A vocabulary of size $V$ is represented through an embedding table

$$E \in \mathbb{R}^{V \times m},$$

where $m$ is the embedding dimension. A token ID selects one row of this table. Although this is not dimensionality reduction in the same sense as PCA, it maps symbolic sparse inputs into dense learned representations.
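A minimal sketch of such a lookup with PyTorch's `nn.Embedding` (the vocabulary size and embedding dimension are arbitrary):

```python
import torch
from torch import nn

vocab_size, embed_dim = 1000, 64
table = nn.Embedding(vocab_size, embed_dim)   # V x m embedding table

token_ids = torch.tensor([5, 42, 5])          # a short token sequence
vectors = table(token_ids)                    # each ID selects one row

print(vectors.shape)                          # (3, 64)
# Repeated IDs map to the same row of the table:
print(torch.equal(vectors[0], vectors[2]))    # True
```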
In large models, dimensionality reduction also appears in bottleneck adapters, low-rank adaptation, quantization, distillation, and retrieval embeddings. These methods reduce computation, memory, or representational complexity while preserving useful behavior.
Compression, Reconstruction, and Semantics
A central tension in dimensionality reduction is the difference between reconstruction and semantics.
A model trained only to reconstruct the input may preserve details that are irrelevant for a downstream task. For example, an image autoencoder may spend capacity reconstructing background texture rather than object identity. A text autoencoder may preserve surface form rather than meaning.
A representation is useful when it preserves the factors needed for the task. For classification, this might mean class identity. For retrieval, it might mean semantic similarity. For generation, it might mean enough information to synthesize a plausible sample. For control, it might mean state variables that predict future outcomes.
This is why modern representation learning often uses objectives beyond plain reconstruction: contrastive learning, masked prediction, denoising, clustering, temporal prediction, or supervised auxiliary losses.
Practical Checks
When using dimensionality reduction, inspect both the representation and the reconstruction.
Useful checks include:
| Check | Question |
|---|---|
| Reconstruction loss | Does the decoder recover the input accurately enough? |
| Latent dimension sweep | How does performance change as $k$ changes? |
| Nearest neighbors | Do nearby latent vectors correspond to similar examples? |
| Interpolation | Do paths between latent vectors decode smoothly? |
| Downstream task | Does the representation improve classification, retrieval, or prediction? |
| Robustness | Does the code ignore noise and nuisance variation? |
Latent interpolation is especially useful for generative models. Given two latent vectors $z_1$ and $z_2$, define

$$z(\alpha) = (1 - \alpha) z_1 + \alpha z_2, \quad \alpha \in [0, 1].$$
If decoding produces a smooth transition, the latent space has learned a meaningful geometry. If the decoded samples become invalid between endpoints, the latent space may be poorly organized.
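The interpolation itself can be sketched as follows (PyTorch; the untrained decoder is a placeholder for a trained one):

```python
import torch
from torch import nn

torch.manual_seed(0)
# Placeholder decoder: in practice this would be a trained g_phi.
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 784))

z1 = torch.randn(8)                   # two latent vectors
z2 = torch.randn(8)

# Decode points along the straight line between z1 and z2.
alphas = torch.linspace(0.0, 1.0, steps=5)
path = torch.stack([(1 - a) * z1 + a * z2 for a in alphas])
decoded = decoder(path)               # one decoded sample per alpha

print(decoded.shape)                  # (5, 784)
```

With a trained decoder, the five decoded samples would be inspected (e.g. reshaped into images) to judge whether the transition is smooth.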
Summary
Dimensionality reduction maps high-dimensional inputs to lower-dimensional representations. The goal is to preserve useful structure while removing redundancy, noise, or irrelevant variation.
Linear methods such as PCA find low-dimensional subspaces. Autoencoders generalize this idea by learning nonlinear encoders and decoders. Undercomplete autoencoders impose compression through a bottleneck. The quality of the learned representation depends on the latent dimension, model capacity, data distribution, and training objective.
In deep learning, dimensionality reduction is not only a preprocessing tool. It is part of how models learn abstractions. Layers transform raw input into progressively more useful representations, often reducing some dimensions while increasing others. The central question is always the same: what information should be preserved, and what information can be discarded?