Variational Autoencoders

A variational autoencoder, or VAE, is a generative latent variable model trained with neural networks. Like an ordinary autoencoder, it has an encoder and a decoder. Unlike an ordinary autoencoder, it treats the latent representation as a random variable.

The goal is not only to compress and reconstruct data. The goal is to learn a latent probability model that can generate new samples.

An ordinary autoencoder learns a deterministic code:

z = f_\theta(x).

A variational autoencoder learns a distribution over codes:

q_\phi(z \mid x).

The decoder then models the probability of data given a latent variable:

p_\theta(x \mid z).

This probabilistic formulation makes VAEs useful for generation, interpolation, uncertainty estimation, and representation learning.

Latent Variable Models

A latent variable model assumes that observed data x is generated from an unobserved variable z.

The generative story is:

  1. Sample a latent vector from a prior distribution:
z \sim p(z).
  2. Generate data from a conditional distribution:
x \sim p_\theta(x \mid z).

A common prior is the standard normal distribution:

p(z) = \mathcal{N}(0, I).

The decoder is a neural network that maps z to the parameters of a distribution over x. For real-valued data, this may be a Gaussian distribution. For binary data, this may be a Bernoulli distribution.

The marginal likelihood of an observed example is

p_\theta(x) = \int p_\theta(x \mid z)p(z)\,dz.

This integral is usually intractable for neural decoders. VAEs solve this using approximate inference.
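The cost of this integral can be made concrete with a naive Monte Carlo estimate, which averages the decoder likelihood over prior samples. The sketch below uses a small hypothetical `decoder` with a Bernoulli likelihood; in realistic dimensions this estimator needs an impractical number of samples, which is one reason approximate inference is used instead.

```python
import torch

torch.manual_seed(0)

latent_dim, data_dim = 8, 20

# Hypothetical decoder: maps z to Bernoulli means over x.
decoder = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, data_dim),
    torch.nn.Sigmoid(),
)

x = torch.randint(0, 2, (data_dim,)).float()  # one binary observation

# Naive Monte Carlo: p(x) ~ (1/K) sum_k p(x | z_k), with z_k ~ p(z) = N(0, I)
K = 10_000
z = torch.randn(K, latent_dim)
probs = decoder(z)  # (K, data_dim) Bernoulli means

# log p(x | z_k), summed over data dimensions
log_px_given_z = (
    x * probs.clamp_min(1e-8).log()
    + (1 - x) * (1 - probs).clamp_min(1e-8).log()
).sum(dim=1)

# log of the Monte Carlo average, computed stably with logsumexp
log_px = torch.logsumexp(log_px_given_z, dim=0) - torch.log(torch.tensor(float(K)))
```

Even in this tiny example, the estimate is noisy; with realistic latent and data dimensions the variance grows so fast that direct estimation is hopeless.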

Approximate Posterior

The true posterior distribution over latent variables is

p_\theta(z \mid x) = \frac{p_\theta(x \mid z)p(z)}{p_\theta(x)}.

This tells us which latent variables are likely to have generated the observed input x. But computing it requires the marginal likelihood p_\theta(x), which contains an intractable integral.

A VAE introduces an encoder distribution

q_\phi(z \mid x)

to approximate the true posterior. This is called the variational posterior or approximate posterior.

A common choice is a diagonal Gaussian:

q_\phi(z \mid x) = \mathcal{N} \left( z; \mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x)) \right).

The encoder network outputs two vectors:

\mu_\phi(x)

and

\log \sigma_\phi^2(x).

These parameterize the distribution over latent codes.

The Evidence Lower Bound

Training a VAE maximizes a lower bound on the log likelihood of the data. This lower bound is called the evidence lower bound, or ELBO.

For one example x, the ELBO is

\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{z\sim q_\phi(z\mid x)} [ \log p_\theta(x\mid z) ] - \operatorname{KL} \left( q_\phi(z\mid x) \| p(z) \right).

The first term is the reconstruction term. It encourages the decoder to reconstruct x from latent samples z.

The second term is the regularization term. It encourages the approximate posterior q_\phi(z \mid x) to stay close to the prior p(z).

Training usually minimizes the negative ELBO:

L_{\text{VAE}} = - \mathcal{L}(\theta,\phi;x).

So the loss can be written as

L_{\text{VAE}} = L_{\text{recon}} + L_{\text{KL}}.

This form looks similar to a regularized autoencoder, but the interpretation is probabilistic.

Reconstruction Term

The reconstruction term depends on the likelihood model.

For binary data, we may use a Bernoulli likelihood:

p_\theta(x\mid z) = \prod_i \operatorname{Bernoulli}(x_i;\hat{x}_i).

The negative log likelihood becomes binary cross-entropy.

For real-valued data, we may use a Gaussian likelihood:

p_\theta(x\mid z) = \mathcal{N} (x; \hat{x}, \sigma^2 I).

If \sigma^2 is fixed, the negative log likelihood is proportional to mean squared error:

\|x-\hat{x}\|^2.
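This proportionality can be checked numerically. With \sigma^2 = 1, the Gaussian negative log likelihood equals half the squared error plus an additive normalizing constant. A small sanity check using torch.distributions:

```python
import math
import torch
from torch.distributions import Normal

torch.manual_seed(0)
x = torch.randn(5)
x_hat = torch.randn(5)

# Gaussian NLL with fixed sigma = 1
nll = -Normal(loc=x_hat, scale=1.0).log_prob(x).sum()

# Half squared error plus the log-normalizer constant
mse_half = 0.5 * (x - x_hat).pow(2).sum()
const = 0.5 * x.numel() * math.log(2 * math.pi)

assert torch.allclose(nll, mse_half + const)
```

The constant does not affect gradients, which is why training with plain MSE corresponds to a fixed-variance Gaussian likelihood.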

Thus, VAEs often use reconstruction losses that look familiar from ordinary autoencoders. The difference is that reconstruction is performed from a sampled latent variable.

KL Divergence Term

For the common case

q_\phi(z\mid x) = \mathcal{N} (\mu, \operatorname{diag}(\sigma^2))

and

p(z)=\mathcal{N}(0,I),

the KL divergence has a closed form:

\operatorname{KL} (q_\phi(z\mid x)\|p(z)) = \frac{1}{2} \sum_j \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right).

This term pushes \mu toward zero and \sigma^2 toward one. It prevents the encoder from placing each example in an isolated latent region. This makes sampling possible: if the decoder has learned to reconstruct from latent codes close to the prior, then new latent samples from p(z) can produce meaningful outputs.
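The closed form can be verified against torch.distributions, which implements the Gaussian-to-Gaussian KL directly:

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(4)
logvar = torch.randn(4)
sigma2 = logvar.exp()

# Closed form: 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
kl_closed = 0.5 * (mu.pow(2) + sigma2 - logvar - 1).sum()

# Same quantity via torch.distributions
q = Normal(mu, sigma2.sqrt())
p = Normal(torch.zeros(4), torch.ones(4))
kl_lib = kl_divergence(q, p).sum()

assert torch.allclose(kl_closed, kl_lib, atol=1e-6)
```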

The Reparameterization Trick

Sampling from q_\phi(z \mid x) creates a problem for gradient descent. We need gradients to flow through the sampling operation into the encoder parameters.

The reparameterization trick rewrites the sample as a deterministic function of the encoder outputs and external noise.

Instead of sampling

z \sim \mathcal{N}(\mu, \sigma^2 I),

we sample

\epsilon \sim \mathcal{N}(0,I)

and compute

z = \mu + \sigma \odot \epsilon.

Now randomness enters through \epsilon, which is independent of \mu and \sigma. Gradients can flow through the deterministic computation of z.

In PyTorch:

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

The encoder usually outputs logvar rather than std directly, because the log variance is unconstrained: the network never has to produce a strictly positive value, and exponentiating is numerically stable.
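To see that gradients really do reach the encoder outputs through the sample, a quick standalone check:

```python
import torch

torch.manual_seed(0)
mu = torch.zeros(3, requires_grad=True)
logvar = torch.zeros(3, requires_grad=True)

# Reparameterized sample: deterministic in mu/logvar, random only through eps
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
z = mu + std * eps

# Any downstream loss now backpropagates into the encoder parameters
loss = z.pow(2).sum()
loss.backward()

assert mu.grad is not None
assert logvar.grad is not None
```

If z were produced by sampling from a distribution object directly, autograd would see no path from the loss back to mu and logvar.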

VAE Architecture in PyTorch

A simple VAE for flattened 28×28 images can be written as follows:

import torch
from torch import nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
        )

        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, input_dim),
            nn.Sigmoid(),
        )

    def encode(self, x: torch.Tensor):
        h = self.encoder(x)
        return self.mu(h), self.logvar(h)

    def reparameterize(
        self,
        mu: torch.Tensor,
        logvar: torch.Tensor,
    ) -> torch.Tensor:
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z: torch.Tensor):
        return self.decoder(z)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_hat = self.decode(z)
        return x_hat, mu, logvar

This model has four conceptual pieces:

| Component         | Role                             |
| ----------------- | -------------------------------- |
| Encoder trunk     | Extracts features from input     |
| Mean head         | Outputs \mu_\phi(x)              |
| Log-variance head | Outputs \log \sigma_\phi^2(x)    |
| Decoder           | Reconstructs data from sampled z |

VAE Loss in PyTorch

The VAE loss combines reconstruction loss and KL loss.

For normalized binary-like images, binary cross-entropy is common:

def vae_loss(x_hat, x, mu, logvar):
    recon = F.binary_cross_entropy(
        x_hat,
        x,
        reduction="sum",
    )

    kl = -0.5 * torch.sum(
        1 + logvar - mu.pow(2) - logvar.exp()
    )

    return recon + kl

The KL expression is equivalent to

\frac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1).

The code uses logvar.exp() to recover \sigma^2.

A training step:

model = VAE(input_dim=784, latent_dim=32)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.rand(128, 784)

x_hat, mu, logvar = model(x)
loss = vae_loss(x_hat, x, mu, logvar)

optimizer.zero_grad()
loss.backward()
optimizer.step()

For reporting, it is useful to track reconstruction loss and KL loss separately. This helps diagnose collapse or over-regularization.
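A minor variant of the loss above returns the two terms separately so they can be logged; the helper name is illustrative:

```python
import torch
import torch.nn.functional as F

def vae_loss_terms(x_hat, x, mu, logvar):
    """Return (total, recon, kl) so the two terms can be logged separately."""
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl, recon, kl

# Example with dummy batch data
torch.manual_seed(0)
x = torch.rand(4, 784)
x_hat = torch.rand(4, 784)
mu = torch.randn(4, 32)
logvar = torch.randn(4, 32)

total, recon, kl = vae_loss_terms(x_hat, x, mu, logvar)
assert torch.allclose(total, recon + kl)
assert kl.item() >= 0.0  # the Gaussian KL to N(0, I) is never negative
```

A KL value stuck near zero, with the reconstruction term barely improving, is the classic signature of posterior collapse discussed below.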

Sampling from a VAE

After training, a VAE can generate new samples by drawing from the prior:

z \sim \mathcal{N}(0,I)

and decoding:

x \sim p_\theta(x\mid z).

In code:

model.eval()

with torch.no_grad():
    z = torch.randn(16, 32)
    samples = model.decode(z)

The tensor samples contains generated outputs.

This is the main difference between a standard autoencoder and a VAE. A standard autoencoder may reconstruct well, but its latent space may have gaps. Random latent vectors may decode to invalid outputs. A VAE regularizes the latent space so that samples from the prior are likely to decode into valid data.

Latent Interpolation

VAEs often produce smooth latent spaces. Given two examples x_a and x_b, we can encode them into latent means:

\mu_a = \mu_\phi(x_a), \quad \mu_b = \mu_\phi(x_b).

Then interpolate:

z(t) = (1-t)\mu_a + t\mu_b, \quad 0 \le t \le 1.

Decoding z(t) should produce a smooth transition between the two examples.

In PyTorch:

def interpolate(model, x_a, x_b, steps: int = 10):
    model.eval()

    with torch.no_grad():
        mu_a, _ = model.encode(x_a)
        mu_b, _ = model.encode(x_b)

        outputs = []

        for i in range(steps):
            t = i / (steps - 1)
            z = (1 - t) * mu_a + t * mu_b
            x_hat = model.decode(z)
            outputs.append(x_hat)

        return torch.cat(outputs, dim=0)

Interpolation is useful for inspecting the geometry of the latent space. Smooth transitions indicate that nearby latent points decode to related outputs.

The Reconstruction and Regularization Tradeoff

The VAE objective contains a tradeoff.

The reconstruction term wants the latent variable to preserve detailed information about the input. The KL term wants the approximate posterior to stay close to the prior. If the KL term is too strong, the latent code may contain too little information. If the reconstruction term dominates, the latent space may become poorly organized for sampling.

This tradeoff is often controlled by a coefficient:

L = L_{\text{recon}} + \beta L_{\text{KL}}.

This model is called a beta-VAE when \beta is treated as an explicit hyperparameter.

If

\beta > 1,

the model puts more pressure on the latent space to match the prior. This can encourage disentangled representations, but it can also reduce reconstruction quality.

If

\beta < 1,

the model allows the latent code to carry more information, often improving reconstruction but weakening generative sampling.
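Both regimes can be explored with a single beta-weighted loss. The sketch below reuses the same reconstruction and KL terms as the earlier loss function; the default beta value is arbitrary:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x_hat, x, mu, logvar, beta: float = 4.0):
    """Reconstruction + beta-weighted KL; beta = 1 recovers the plain VAE loss."""
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

# Example: a larger beta strictly increases the loss when KL > 0
torch.manual_seed(0)
x = torch.rand(2, 784)
x_hat = torch.rand(2, 784)
mu = torch.randn(2, 8)
logvar = torch.randn(2, 8)

loss_strong = beta_vae_loss(x_hat, x, mu, logvar, beta=4.0)
loss_plain = beta_vae_loss(x_hat, x, mu, logvar, beta=1.0)
```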

Posterior Collapse

Posterior collapse occurs when the encoder distribution becomes nearly equal to the prior:

q_\phi(z\mid x) \approx p(z).

Then z carries little or no information about x. The decoder ignores the latent variable and learns to generate outputs without using it.

This problem is common when the decoder is very powerful, such as an autoregressive language model. The decoder can model the data directly and has little incentive to use z.

Symptoms include:

| Symptom                                      | Meaning                               |
| -------------------------------------------- | ------------------------------------- |
| KL loss near zero                            | Latent code matches prior too closely |
| Similar reconstructions for different inputs | Encoder carries little information    |
| Decoder ignores changes in z                 | Latent space is unused                |

Common mitigation strategies include KL annealing, weaker decoders, skip connections, free bits, and changing the training objective.
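Of these, free bits has a particularly compact form: the per-dimension KL is clamped from below at a floor, so the optimizer gains nothing by shrinking a dimension's KL further and has less incentive to collapse it. A sketch of the idea, with the floor value as an assumed hyperparameter:

```python
import torch

def kl_free_bits(mu, logvar, free_bits: float = 0.5):
    """Per-dimension Gaussian KL to N(0, I), clamped below at `free_bits`."""
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())  # (batch, latent)
    kl_per_dim = kl_per_dim.mean(dim=0)           # average over the batch
    return kl_per_dim.clamp_min(free_bits).sum()  # floor each dimension, then sum

# A fully collapsed posterior (mu = 0, logvar = 0) still pays the floor:
mu = torch.zeros(4, 8)
logvar = torch.zeros(4, 8)
kl = kl_free_bits(mu, logvar, free_bits=0.5)  # 8 dims * 0.5 = 4.0
```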

KL annealing gradually increases the KL coefficient during training:

L = L_{\text{recon}} + \beta(t)L_{\text{KL}},

where \beta(t) starts near zero and increases toward one.
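A minimal linear schedule for the coefficient might look like this; the warmup length is an assumed hyperparameter:

```python
def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    """Linearly increase the KL coefficient from 0 to 1, then hold at 1."""
    return min(1.0, step / warmup_steps)

# kl_weight(0) -> 0.0, kl_weight(5_000) -> 0.5, kl_weight(20_000) -> 1.0
```

In practice the schedule is often cyclical or tied to a validation metric rather than purely linear.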

Disentanglement

A disentangled representation separates independent factors of variation into different latent coordinates.

For example, in an image dataset, different latent dimensions might represent:

| Latent coordinate | Possible factor  |
| ----------------- | ---------------- |
| z_1               | Rotation         |
| z_2               | Stroke thickness |
| z_3               | Object size      |
| z_4               | Lighting         |
| z_5               | Background       |

Beta-VAEs are often studied for disentanglement because stronger KL pressure can encourage simpler, more factorized latent representations.

However, disentanglement is difficult to define and measure. It often depends on the dataset, inductive biases, supervision, and evaluation procedure. Unsupervised disentanglement has fundamental limitations without assumptions about the data-generating process.

Conditional VAEs

A conditional VAE models data conditioned on an observed variable y, such as a class label.

The generative model becomes

p_\theta(x\mid z,y).

The encoder may also condition on y:

q_\phi(z\mid x,y).

The decoder receives both z and y, allowing class-controlled generation.

For example, on a digit dataset, y may specify the desired digit class. Sampling proceeds by choosing a class y, drawing

z \sim \mathcal{N}(0,I),

and decoding from (z,y).

Conditional VAEs are useful when generation should be controlled by labels, attributes, text, or other side information.
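A common implementation simply concatenates a one-hot encoding of y to the encoder input and to the latent code. The following is a minimal sketch, not a reference implementation, assuming 10 classes and the same flattened-image sizes as the earlier model:

```python
import torch
from torch import nn

class CVAE(nn.Module):
    """Minimal conditional VAE: the label y is concatenated to x and to z."""

    def __init__(self, input_dim: int = 784, latent_dim: int = 32, num_classes: int = 10):
        super().__init__()
        self.num_classes = num_classes
        self.encoder = nn.Sequential(
            nn.Linear(input_dim + num_classes, 256),
            nn.ReLU(),
        )
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + num_classes, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        y_onehot = torch.nn.functional.one_hot(y, self.num_classes).float()
        h = self.encoder(torch.cat([x, y_onehot], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)   # reparameterized sample
        x_hat = self.decoder(torch.cat([z, y_onehot], dim=1))
        return x_hat, mu, logvar

# Shape check with a dummy batch
model = CVAE()
x = torch.rand(8, 784)
y = torch.randint(0, 10, (8,))
x_hat, mu, logvar = model(x, y)
```

At generation time, the same decoder is called with a prior sample z and the desired label's one-hot vector.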

Limits of VAEs

VAEs are elegant and stable to train, but they have limitations.

First, reconstructions may be blurry when using simple Gaussian likelihoods or pixelwise losses.

Second, the prior may be too simple. A standard normal prior may not match the true aggregate latent distribution well.

Third, posterior collapse can make the latent code unused.

Fourth, VAEs may produce lower visual fidelity than GANs or diffusion models on image generation tasks.

Fifth, likelihood and sample quality may not align. A model can assign reasonable likelihood while producing poor-looking samples.

These limitations led to many extensions, including hierarchical VAEs, normalizing-flow priors, vector-quantized VAEs, diffusion decoders, and hybrid models.

Summary

A variational autoencoder is a probabilistic autoencoder. The encoder approximates a posterior distribution over latent variables. The decoder models data generation from those latent variables.

The VAE objective combines reconstruction quality with a KL penalty that regularizes the latent distribution toward a prior. The reparameterization trick allows stochastic latent sampling while preserving gradient-based training.

VAEs provide a principled bridge between autoencoders and probabilistic generative modeling. They remain important for latent representation learning, controlled generation, uncertainty-aware modeling, and as components inside larger generative systems.