A variational autoencoder, or VAE, is a generative latent variable model trained with neural networks. Like an ordinary autoencoder, it has an encoder and a decoder. Unlike an ordinary autoencoder, it treats the latent representation as a random variable.
The goal is not only to compress and reconstruct data. The goal is to learn a latent probability model that can generate new samples.
An ordinary autoencoder learns a deterministic code:

$$z = f(x)$$

A variational autoencoder learns a distribution over codes:

$$q_\phi(z \mid x)$$

The decoder then models the probability of data given a latent variable:

$$p_\theta(x \mid z)$$
This probabilistic formulation makes VAEs useful for generation, interpolation, uncertainty estimation, and representation learning.
## Latent Variable Models

A latent variable model assumes that observed data $x$ is generated from an unobserved latent variable $z$.
The generative story is:
- Sample a latent vector from a prior distribution: $z \sim p(z)$
- Generate data from a conditional distribution: $x \sim p_\theta(x \mid z)$

A common prior is the standard normal distribution:

$$p(z) = \mathcal{N}(0, I)$$
The decoder is a neural network that maps $z$ to the parameters of a distribution over $x$. For real-valued data, this may be a Gaussian distribution. For binary data, this may be a Bernoulli distribution.
The marginal likelihood of an observed example $x$ is

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
This integral is usually intractable for neural decoders. VAEs solve this using approximate inference.
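To make the intractability concrete, here is a minimal sketch of a naive Monte Carlo estimate of $\log p_\theta(x)$ using a toy analytic decoder (the function names and the toy model are illustrative, not from the text). In two latent dimensions the estimate is accurate, but the number of samples needed grows rapidly with latent dimension, which is one way to see why VAEs resort to approximate inference instead.

```python
import math
import torch

torch.manual_seed(0)

def log_marginal_mc(x, decoder_log_prob, latent_dim=2, n_samples=20_000):
    """Naive Monte Carlo estimate of log p(x) = log E_{z ~ p(z)}[p(x | z)]."""
    z = torch.randn(n_samples, latent_dim)   # samples from the prior N(0, I)
    log_px_given_z = decoder_log_prob(x, z)  # shape: (n_samples,)
    # log-mean-exp for numerical stability
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(n_samples)

# Toy decoder: p(x | z) = N(x; z, I), so the true marginal is N(x; 0, 2I).
def toy_log_prob(x, z):
    d = z.shape[-1]
    return -0.5 * ((x - z) ** 2).sum(dim=-1) - 0.5 * d * math.log(2 * math.pi)

x = torch.zeros(2)
estimate = log_marginal_mc(x, toy_log_prob)  # close to log N(0; 0, 2I) ≈ -2.53
```

With a neural decoder there is no analytic `toy_log_prob`, and in high-dimensional latent spaces almost all prior samples contribute negligibly, so this estimator becomes useless in practice.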
## Approximate Posterior

The true posterior distribution over latent variables is

$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$$

This tells us which latent variables are likely to have generated the observed input $x$. But computing it requires the marginal likelihood $p_\theta(x)$, which contains an intractable integral.
A VAE introduces an encoder distribution

$$q_\phi(z \mid x)$$
to approximate the true posterior. This is called the variational posterior or approximate posterior.
A common choice is a diagonal Gaussian:

$$q_\phi(z \mid x) = \mathcal{N}\!\left(z;\ \mu_\phi(x),\ \mathrm{diag}(\sigma^2_\phi(x))\right)$$

The encoder network outputs two vectors: the mean $\mu_\phi(x)$ and the log variance $\log \sigma^2_\phi(x)$.
These parameterize the distribution over latent codes.
## The Evidence Lower Bound
Training a VAE maximizes a lower bound on the log likelihood of the data. This lower bound is called the evidence lower bound, or ELBO.
For one example $x$, the ELBO is

$$\mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$
The first term is the reconstruction term. It encourages the decoder to reconstruct $x$ from latent samples $z \sim q_\phi(z \mid x)$.

The second term is the regularization term. It encourages the approximate posterior $q_\phi(z \mid x)$ to stay close to the prior $p(z)$.
Training usually minimizes the negative ELBO:

$$\mathcal{L}(x) = -\mathrm{ELBO}(x)$$

So the loss can be written as

$$\mathcal{L}(x) = -\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] + \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$
This form looks similar to a regularized autoencoder, but the interpretation is probabilistic.
## Reconstruction Term
The reconstruction term depends on the likelihood model.
For binary data, we may use a Bernoulli likelihood:

$$p_\theta(x \mid z) = \prod_i \hat{x}_i^{\,x_i}\,(1 - \hat{x}_i)^{\,1 - x_i},$$

where $\hat{x}$ is the decoder output.
The negative log likelihood becomes binary cross-entropy.
For real-valued data, we may use a Gaussian likelihood:

$$p_\theta(x \mid z) = \mathcal{N}\!\left(x;\ \hat{x},\ \sigma^2 I\right)$$

If $\sigma^2$ is fixed, the negative log likelihood is proportional to mean squared error:

$$-\log p_\theta(x \mid z) = \frac{1}{2\sigma^2}\,\|x - \hat{x}\|^2 + \text{const}$$
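This proportionality can be checked numerically with `torch.distributions` (a sketch; the dimensions and variable names are illustrative): the Gaussian negative log likelihood and half the summed squared error differ only by a constant.

```python
import math
import torch
from torch.distributions import Normal

torch.manual_seed(0)
x = torch.randn(784)      # target
x_hat = torch.randn(784)  # decoder mean; sigma is fixed at 1

# NLL under N(x_hat, I) differs from 0.5 * (sum of squared errors)
# only by a constant that depends on neither x nor x_hat.
nll = -Normal(x_hat, 1.0).log_prob(x).sum()
half_sse = 0.5 * ((x - x_hat) ** 2).sum()
constant = nll - half_sse  # equals 0.5 * 784 * log(2 * pi)
```

Because the constant has zero gradient, minimizing this NLL and minimizing the squared error drive the decoder toward the same parameters.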
Thus, VAEs often use reconstruction losses that look familiar from ordinary autoencoders. The difference is that reconstruction is performed from a sampled latent variable.
## KL Divergence Term

For the common case

$$q_\phi(z \mid x) = \mathcal{N}\!\left(z;\ \mu,\ \mathrm{diag}(\sigma^2)\right)$$

and

$$p(z) = \mathcal{N}(0, I),$$

the KL divergence has a closed form:

$$\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right) = -\frac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

This term pushes $\mu$ toward zero and $\sigma^2$ toward one. It prevents the encoder from placing each example in an isolated latent region. This makes sampling possible: if the decoder has learned to reconstruct from latent codes close to the prior, then new latent samples from $p(z)$ can produce meaningful outputs.
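As a sanity check, the closed form can be compared against the KL computed by `torch.distributions` (a sketch; the 32-dimensional shapes are illustrative):

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(32)
logvar = torch.randn(32)

# Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions.
kl_closed = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

# The same quantity via torch.distributions.
q = Normal(mu, torch.exp(0.5 * logvar))  # scale is sigma, not sigma^2
p = Normal(torch.zeros(32), torch.ones(32))
kl_lib = kl_divergence(q, p).sum()
```

Both computations agree to numerical precision, and the result is always non-negative, as a KL divergence must be.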
## The Reparameterization Trick
Sampling from creates a problem for gradient descent. We need gradients to flow through the sampling operation into the encoder parameters.
The reparameterization trick rewrites the sample as a deterministic function of the encoder outputs and external noise.
Instead of sampling

$$z \sim \mathcal{N}\!\left(\mu,\ \mathrm{diag}(\sigma^2)\right),$$

we sample

$$\epsilon \sim \mathcal{N}(0, I)$$

and compute

$$z = \mu + \sigma \odot \epsilon.$$

Now randomness enters through $\epsilon$, which is independent of $\mu$ and $\sigma$. Gradients can flow through the deterministic computation of $z$.
In PyTorch:

```python
def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```

The encoder usually outputs `logvar` rather than `std` directly because log variance is numerically more stable.
## VAE Architecture in PyTorch
A simple VAE for flattened images can be written as follows:
```python
import torch
from torch import nn
import torch.nn.functional as F


class VAE(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
        )
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, input_dim),
            nn.Sigmoid(),
        )

    def encode(self, x: torch.Tensor):
        h = self.encoder(x)
        return self.mu(h), self.logvar(h)

    def reparameterize(
        self,
        mu: torch.Tensor,
        logvar: torch.Tensor,
    ) -> torch.Tensor:
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z: torch.Tensor):
        return self.decoder(z)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_hat = self.decode(z)
        return x_hat, mu, logvar
```

This model has four conceptual pieces:
| Component | Role |
|---|---|
| Encoder trunk | Extracts features from input |
| Mean head | Outputs $\mu_\phi(x)$ |
| Log-variance head | Outputs $\log \sigma^2_\phi(x)$ |
| Decoder | Reconstructs data from sampled $z$ |
## VAE Loss in PyTorch
The VAE loss combines reconstruction loss and KL loss.
For normalized binary-like images, binary cross-entropy is common:
```python
def vae_loss(x_hat, x, mu, logvar):
    recon = F.binary_cross_entropy(
        x_hat,
        x,
        reduction="sum",
    )
    kl = -0.5 * torch.sum(
        1 + logvar - mu.pow(2) - logvar.exp()
    )
    return recon + kl
```

The KL expression is equivalent to

$$-\frac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right).$$
The code uses `logvar.exp()` to recover $\sigma^2$.
A training step:
```python
model = VAE(input_dim=784, latent_dim=32)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.rand(128, 784)

x_hat, mu, logvar = model(x)
loss = vae_loss(x_hat, x, mu, logvar)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

For reporting, it is useful to track reconstruction loss and KL loss separately. This helps diagnose collapse or over-regularization.
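One minimal way to do this is a helper that returns the two terms instead of their sum (a sketch; the function name and the dummy tensors standing in for a real batch are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_loss_terms(x_hat, x, mu, logvar):
    """Return the reconstruction and KL terms separately for logging."""
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon, kl

# Dummy tensors standing in for a batch and encoder outputs.
torch.manual_seed(0)
x = torch.rand(128, 784)
x_hat = torch.rand(128, 784)
mu = torch.randn(128, 32)
logvar = torch.randn(128, 32)

recon, kl = vae_loss_terms(x_hat, x, mu, logvar)
loss = recon + kl  # same total as before, but both parts are visible
```

Logging `recon` and `kl` per epoch makes it easy to spot a KL term stuck near zero (a collapse symptom) or one that dominates the total (over-regularization).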
## Sampling from a VAE

After training, a VAE can generate new samples by drawing from the prior:

$$z \sim \mathcal{N}(0, I)$$

and decoding:

$$x \sim p_\theta(x \mid z)$$
In code:
```python
model.eval()
with torch.no_grad():
    z = torch.randn(16, 32)
    samples = model.decode(z)
```

The tensor `samples` contains generated outputs.
This is the main difference between a standard autoencoder and a VAE. A standard autoencoder may reconstruct well, but its latent space may have gaps. Random latent vectors may decode to invalid outputs. A VAE regularizes the latent space so that samples from the prior are likely to decode into valid data.
## Latent Interpolation

VAEs often produce smooth latent spaces. Given two examples $x_a$ and $x_b$, we can encode them into latent means:

$$\mu_a = \mu_\phi(x_a), \qquad \mu_b = \mu_\phi(x_b)$$

Then interpolate:

$$z_t = (1 - t)\,\mu_a + t\,\mu_b, \qquad t \in [0, 1]$$
Decoding should produce a smooth transition between the two examples.
In PyTorch:
```python
def interpolate(model, x_a, x_b, steps: int = 10):
    model.eval()
    with torch.no_grad():
        mu_a, _ = model.encode(x_a)
        mu_b, _ = model.encode(x_b)
        outputs = []
        for i in range(steps):
            t = i / (steps - 1)
            z = (1 - t) * mu_a + t * mu_b
            x_hat = model.decode(z)
            outputs.append(x_hat)
        return torch.cat(outputs, dim=0)
```

Interpolation is useful for inspecting the geometry of the latent space. Smooth transitions indicate that nearby latent points decode to related outputs.
## The Reconstruction and Regularization Tradeoff
The VAE objective contains a tradeoff.
The reconstruction term wants the latent variable to preserve detailed information about the input. The KL term wants the approximate posterior to stay close to the prior. If the KL term is too strong, the latent code may contain too little information. If the reconstruction term dominates, the latent space may become poorly organized for sampling.
This tradeoff is often controlled by a coefficient $\beta$ on the KL term:

$$\mathcal{L}_\beta(x) = -\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] + \beta\,\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$

This model is called a beta-VAE when $\beta$ is treated as an explicit hyperparameter.

If $\beta > 1$, the model puts more pressure on the latent space to match the prior. This can encourage disentangled representations, but it can also reduce reconstruction quality.

If $\beta < 1$, the model allows the latent code to carry more information, often improving reconstruction but weakening generative sampling.
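A minimal sketch of the beta-weighted loss, reusing a binary cross-entropy reconstruction term (the function name, default $\beta$, and dummy tensors are illustrative):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x_hat, x, mu, logvar, beta: float = 4.0):
    """Negative ELBO with the KL term scaled by beta."""
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

# Dummy tensors standing in for a batch and encoder outputs.
torch.manual_seed(0)
x = torch.rand(8, 784)
x_hat = torch.rand(8, 784)
mu = torch.randn(8, 32)
logvar = torch.randn(8, 32)

loss_std = beta_vae_loss(x_hat, x, mu, logvar, beta=1.0)   # ordinary VAE loss
loss_beta = beta_vae_loss(x_hat, x, mu, logvar, beta=4.0)  # stronger KL pressure
```

With $\beta = 1$ this reduces exactly to the ordinary VAE loss, so a single implementation covers both cases.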
## Posterior Collapse

Posterior collapse occurs when the encoder distribution becomes nearly equal to the prior:

$$q_\phi(z \mid x) \approx p(z)$$

Then $z$ carries little or no information about $x$. The decoder ignores the latent variable and learns to generate outputs without using it.
This problem is common when the decoder is very powerful, such as an autoregressive language model. The decoder can model the data directly and has little incentive to use $z$.
Symptoms include:
| Symptom | Meaning |
|---|---|
| KL loss near zero | Latent code matches prior too closely |
| Similar reconstructions for different inputs | Encoder carries little information |
| Decoder ignores changes in $z$ | Latent space is unused |
Common mitigation strategies include KL annealing, weaker decoders, skip connections, free bits, and changing the training objective.
KL annealing gradually increases the KL coefficient during training:

$$\mathcal{L}_t(x) = \text{recon} + \beta_t\,\mathrm{KL},$$

where $\beta_t$ starts near zero and increases toward one.
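A linear schedule is a common minimal choice for $\beta_t$; the helper below is a sketch (the name and warmup length are illustrative):

```python
def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    """Linear KL annealing: beta_t rises from 0 to 1 over warmup_steps,
    then stays at 1 for the rest of training."""
    return min(1.0, step / warmup_steps)
```

Inside a training loop, the loss at step `step` would then be computed as `recon + kl_weight(step) * kl`, letting the model learn to use the latent code before the KL pressure fully kicks in.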
## Disentanglement
A disentangled representation separates independent factors of variation into different latent coordinates.
For example, in an image dataset, different latent dimensions might represent:
| Latent coordinate | Possible factor |
|---|---|
| $z_1$ | Rotation |
| $z_2$ | Stroke thickness |
| $z_3$ | Object size |
| $z_4$ | Lighting |
| $z_5$ | Background |
Beta-VAEs are often studied for disentanglement because stronger KL pressure can encourage simpler, more factorized latent representations.
However, disentanglement is difficult to define and measure. It often depends on the dataset, inductive biases, supervision, and evaluation procedure. Unsupervised disentanglement has fundamental limitations without assumptions about the data-generating process.
## Conditional VAEs

A conditional VAE models data conditioned on an observed variable $y$, such as a class label.

The generative model becomes

$$p_\theta(x \mid z, y)\, p(z)$$

The encoder may also condition on $y$:

$$q_\phi(z \mid x, y)$$

The decoder receives both $z$ and $y$, allowing class-controlled generation.

For example, on a digit dataset, $y$ may specify the desired digit class. Sampling proceeds by choosing a class $y$, drawing

$$z \sim p(z),$$

and decoding from $p_\theta(x \mid z, y)$.
Conditional VAEs are useful when generation should be controlled by labels, attributes, text, or other side information.
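A minimal conditional VAE sketch, assuming the label is supplied as a one-hot vector and simply concatenated to both the encoder and decoder inputs (the class name `CVAE`, the layer sizes, and this particular conditioning scheme are illustrative choices, not prescribed by the text):

```python
import torch
from torch import nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """Conditional VAE sketch: the one-hot label y is concatenated
    to both the encoder input and the decoder input."""

    def __init__(self, input_dim: int = 784, num_classes: int = 10,
                 latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim + num_classes, 256),
            nn.ReLU(),
        )
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + num_classes, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        h = self.encoder(torch.cat([x, y], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        x_hat = self.decoder(torch.cat([z, y], dim=-1))
        return x_hat, mu, logvar

# Class-controlled sampling: fix a class, draw z from the prior, decode.
torch.manual_seed(0)
model = CVAE()
y = F.one_hot(torch.full((16,), 3), num_classes=10).float()  # 16 samples of class 3
z = torch.randn(16, 32)
with torch.no_grad():
    samples = model.decoder(torch.cat([z, y], dim=-1))
```

Because the class is injected at the decoder input, every sample in the batch is conditioned on the same chosen label while the prior noise `z` varies the appearance.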
## Limits of VAEs
VAEs are elegant and stable to train, but they have limitations.
First, reconstructions may be blurry when using simple Gaussian likelihoods or pixelwise losses.
Second, the prior may be too simple. A standard normal prior may not match the true aggregate latent distribution well.
Third, posterior collapse can make the latent code unused.
Fourth, VAEs may produce lower visual fidelity than GANs or diffusion models on image generation tasks.
Fifth, likelihood and sample quality may not align. A model can assign reasonable likelihood while producing poor-looking samples.
These limitations led to many extensions, including hierarchical VAEs, normalizing-flow priors, vector-quantized VAEs, diffusion decoders, and hybrid models.
## Summary
A variational autoencoder is a probabilistic autoencoder. The encoder approximates a posterior distribution over latent variables. The decoder models data generation from those latent variables.
The VAE objective combines reconstruction quality with a KL penalty that regularizes the latent distribution toward a prior. The reparameterization trick allows stochastic latent sampling while preserving gradient-based training.
VAEs provide a principled bridge between autoencoders and probabilistic generative modeling. They remain important for latent representation learning, controlled generation, uncertainty-aware modeling, and as components inside larger generative systems.