# Variational Autoencoders

A variational autoencoder, or VAE, is a generative latent variable model built from neural networks. Like an ordinary autoencoder, it has an encoder and a decoder. Unlike an ordinary autoencoder, it treats the latent representation as a random variable.

The goal is not only to compress and reconstruct data. The goal is to learn a latent probability model that can generate new samples.

An ordinary autoencoder learns a deterministic code:

$$
z = f_\theta(x).
$$

A variational autoencoder learns a distribution over codes:

$$
q_\phi(z \mid x).
$$

The decoder then models the probability of data given a latent variable:

$$
p_\theta(x \mid z).
$$

This probabilistic formulation makes VAEs useful for generation, interpolation, uncertainty estimation, and representation learning.

### Latent Variable Models

A latent variable model assumes that observed data $x$ is generated from an unobserved variable $z$.

The generative story is:

1. Sample a latent vector from a prior distribution:

$$
z \sim p(z).
$$

2. Generate data from a conditional distribution:

$$
x \sim p_\theta(x \mid z).
$$

A common prior is the standard normal distribution:

$$
p(z) = \mathcal{N}(0, I).
$$

The decoder is a neural network that maps $z$ to the parameters of a distribution over $x$. For real-valued data, this may be a Gaussian distribution. For binary data, this may be a Bernoulli distribution.

The marginal likelihood of an observed example is

$$
p_\theta(x) =
\int p_\theta(x \mid z)p(z)\,dz.
$$

This integral is usually intractable when the decoder is a neural network. VAEs work around this with approximate inference.

### Approximate Posterior

The true posterior distribution over latent variables is

$$
p_\theta(z \mid x) =
\frac{p_\theta(x \mid z)p(z)}{p_\theta(x)}.
$$

This tells us which latent variables are likely to have generated the observed input $x$. But computing it requires the marginal likelihood $p_\theta(x)$, which contains an intractable integral.

A VAE introduces an encoder distribution

$$
q_\phi(z \mid x)
$$

to approximate the true posterior. This is called the variational posterior or approximate posterior.

A common choice is a diagonal Gaussian:

$$
q_\phi(z \mid x) =
\mathcal{N}
\left(
z;
\mu_\phi(x),
\operatorname{diag}(\sigma_\phi^2(x))
\right).
$$

The encoder network outputs two vectors:

$$
\mu_\phi(x)
$$

and

$$
\log \sigma_\phi^2(x).
$$

These parameterize the distribution over latent codes.

### The Evidence Lower Bound

Training a VAE maximizes a lower bound on the log likelihood of the data. This lower bound is called the evidence lower bound, or ELBO.

For one example $x$, the ELBO is

$$
\mathcal{L}(\theta,\phi;x) =
\mathbb{E}_{z\sim q_\phi(z\mid x)}
[
\log p_\theta(x\mid z)
] -
\operatorname{KL}
\left(
q_\phi(z\mid x)
\| p(z)
\right).
$$
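To see why this is a lower bound, note that the log likelihood decomposes as

$$
\log p_\theta(x) =
\mathcal{L}(\theta,\phi;x) +
\operatorname{KL}
\left(
q_\phi(z\mid x)
\|
p_\theta(z\mid x)
\right).
$$

Since the KL divergence is nonnegative, $\log p_\theta(x) \ge \mathcal{L}(\theta,\phi;x)$, with equality exactly when the approximate posterior matches the true posterior. Maximizing the ELBO therefore raises the data likelihood while pulling $q_\phi(z\mid x)$ toward $p_\theta(z\mid x)$.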

The first term is the reconstruction term. It encourages the decoder to reconstruct $x$ from latent samples $z$.

The second term is the regularization term. It encourages the approximate posterior $q_\phi(z\mid x)$ to stay close to the prior $p(z)$.

Training usually minimizes the negative ELBO:

$$
L_{\text{VAE}} = -
\mathcal{L}(\theta,\phi;x).
$$

So the loss can be written as

$$
L_{\text{VAE}} =
L_{\text{recon}}
+
L_{\text{KL}}.
$$

This form looks similar to a regularized autoencoder, but the interpretation is probabilistic.

### Reconstruction Term

The reconstruction term depends on the likelihood model.

For binary data, we may use a Bernoulli likelihood:

$$
p_\theta(x\mid z) =
\prod_i
\operatorname{Bernoulli}(x_i;\hat{x}_i).
$$

The negative log likelihood becomes binary cross-entropy.

For real-valued data, we may use a Gaussian likelihood:

$$
p_\theta(x\mid z) =
\mathcal{N}
(x; \hat{x}, \sigma^2 I).
$$

If $\sigma^2$ is fixed, the negative log likelihood is proportional to mean squared error:

$$
\|x-\hat{x}\|^2.
$$

Thus, VAEs often use reconstruction losses that look familiar from ordinary autoencoders. The difference is that reconstruction is performed from a sampled latent variable.
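In code, the two likelihood choices map onto familiar PyTorch losses. The sketch below uses placeholder tensors in place of real data and decoder outputs, so only the loss calls themselves are the point of the example:

```python
import torch
import torch.nn.functional as F

x = torch.rand(8, 784)      # placeholder targets in [0, 1)
x_hat = torch.rand(8, 784)  # placeholder decoder output, e.g. after a Sigmoid

# Bernoulli likelihood -> binary cross-entropy, summed as in the per-example ELBO.
recon_bce = F.binary_cross_entropy(x_hat, x, reduction="sum")

# Gaussian likelihood with fixed variance -> mean squared error
# (up to a constant scale and an additive constant).
recon_mse = F.mse_loss(x_hat, x, reduction="sum")
```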

### KL Divergence Term

For the common case

$$
q_\phi(z\mid x) =
\mathcal{N}
(\mu, \operatorname{diag}(\sigma^2))
$$

and

$$
p(z)=\mathcal{N}(0,I),
$$

the KL divergence has a closed form:

$$
\operatorname{KL}
(q_\phi(z\mid x)\|p(z)) =
\frac{1}{2}
\sum_j
\left(
\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1
\right).
$$

This term pushes $\mu$ toward zero and $\sigma^2$ toward one. It prevents the encoder from placing each example in an isolated latent region. This makes sampling possible: if the decoder has learned to reconstruct from latent codes close to the prior, then new latent samples from $p(z)$ can produce meaningful outputs.
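Translated directly into code, assuming the encoder outputs `mu` and `logvar` as described below (the function name is only illustrative):

```python
import torch

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # 0.5 * sum_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1)
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
```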

### The Reparameterization Trick

Sampling from $q_\phi(z\mid x)$ creates a problem for gradient descent. We need gradients to flow through the sampling operation into the encoder parameters.

The reparameterization trick rewrites the sample as a deterministic function of the encoder outputs and external noise.

Instead of sampling

$$
z \sim \mathcal{N}(\mu, \sigma^2 I),
$$

we sample

$$
\epsilon \sim \mathcal{N}(0,I)
$$

and compute

$$
z = \mu + \sigma \odot \epsilon.
$$

Now randomness enters through $\epsilon$, which is independent of $\mu$ and $\sigma$. Gradients can flow through the deterministic computation of $z$.

In PyTorch:

```python
import torch

def reparameterize(mu, logvar):
    # Convert the predicted log-variance to a standard deviation.
    std = torch.exp(0.5 * logvar)
    # Draw noise that does not depend on the encoder outputs.
    eps = torch.randn_like(std)
    return mu + std * eps
```

The encoder usually outputs `logvar` rather than `std` directly because the log variance is unconstrained: any real-valued output maps to a positive variance through `exp`, which avoids enforcing a positivity constraint and is numerically more stable.

### VAE Architecture in PyTorch

A simple VAE for flattened $28 \times 28$ images can be written as follows:

```python
import torch
from torch import nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
        )

        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, input_dim),
            nn.Sigmoid(),
        )

    def encode(self, x: torch.Tensor):
        h = self.encoder(x)
        return self.mu(h), self.logvar(h)

    def reparameterize(
        self,
        mu: torch.Tensor,
        logvar: torch.Tensor,
    ) -> torch.Tensor:
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def decode(self, z: torch.Tensor):
        return self.decoder(z)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_hat = self.decode(z)
        return x_hat, mu, logvar
```

This model has four conceptual pieces:

| Component | Role |
|---|---|
| Encoder trunk | Extracts features from input |
| Mean head | Outputs $\mu_\phi(x)$ |
| Log-variance head | Outputs $\log\sigma_\phi^2(x)$ |
| Decoder | Reconstructs data from sampled $z$ |

### VAE Loss in PyTorch

The VAE loss combines reconstruction loss and KL loss.

For normalized binary-like images, binary cross-entropy is common:

```python
def vae_loss(x_hat, x, mu, logvar):
    # Reconstruction term: binary cross-entropy summed over pixels and batch.
    recon = F.binary_cross_entropy(
        x_hat,
        x,
        reduction="sum",
    )

    # KL term: closed form for a diagonal Gaussian posterior and a N(0, I) prior.
    kl = -0.5 * torch.sum(
        1 + logvar - mu.pow(2) - logvar.exp()
    )

    return recon + kl
```

The KL expression is equivalent to

$$
\frac{1}{2}
\sum_j
(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1).
$$

The code uses `logvar.exp()` to recover $\sigma^2$.

A training step:

```python
model = VAE(input_dim=784, latent_dim=32)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.rand(128, 784)  # random stand-in for a batch of flattened images in [0, 1)

x_hat, mu, logvar = model(x)
loss = vae_loss(x_hat, x, mu, logvar)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

For reporting, it is useful to track reconstruction loss and KL loss separately. This helps diagnose collapse or over-regularization.
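One way to do this is to compute the two terms separately and only sum them for the backward pass. A minimal variant of the loss above, reusing the model and batch from the training step (the `vae_loss_terms` name is just for illustration):

```python
def vae_loss_terms(x_hat, x, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon, kl

x_hat, mu, logvar = model(x)
recon, kl = vae_loss_terms(x_hat, x, mu, logvar)
loss = recon + kl

print(f"recon: {recon.item():.1f}  kl: {kl.item():.1f}")
```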

### Sampling from a VAE

After training, a VAE can generate new samples by drawing from the prior:

$$
z \sim \mathcal{N}(0,I)
$$

and decoding:

$$
x \sim p_\theta(x\mid z).
$$

In code:

```python
model.eval()

with torch.no_grad():
    z = torch.randn(16, 32)  # 16 latent vectors; 32 matches the model's latent_dim
    samples = model.decode(z)
```

The tensor `samples` contains generated outputs.

This is the main difference between a standard autoencoder and a VAE. A standard autoencoder may reconstruct well, but its latent space may have gaps. Random latent vectors may decode to invalid outputs. A VAE regularizes the latent space so that samples from the prior are likely to decode into valid data.

### Latent Interpolation

VAEs often produce smooth latent spaces. Given two examples $x_a$ and $x_b$, we can encode them into latent means:

$$
\mu_a = \mu_\phi(x_a),
\quad
\mu_b = \mu_\phi(x_b).
$$

Then interpolate:

$$
z(t) = (1-t)\mu_a + t\mu_b,
\quad 0 \le t \le 1.
$$

Decoding $z(t)$ should produce a smooth transition between the two examples.

In PyTorch:

```python
def interpolate(model, x_a, x_b, steps: int = 10):
    model.eval()

    with torch.no_grad():
        mu_a, _ = model.encode(x_a)
        mu_b, _ = model.encode(x_b)

        outputs = []

        for i in range(steps):
            t = i / (steps - 1)
            z = (1 - t) * mu_a + t * mu_b
            x_hat = model.decode(z)
            outputs.append(x_hat)

        return torch.cat(outputs, dim=0)
```

Interpolation is useful for inspecting the geometry of the latent space. Smooth transitions indicate that nearby latent points decode to related outputs.

### The Reconstruction and Regularization Tradeoff

The VAE objective contains a tradeoff.

The reconstruction term wants the latent variable to preserve detailed information about the input. The KL term wants the approximate posterior to stay close to the prior. If the KL term is too strong, the latent code may contain too little information. If the reconstruction term dominates, the latent space may become poorly organized for sampling.

This tradeoff is often controlled by a coefficient:

$$
L =
L_{\text{recon}}
+
\beta L_{\text{KL}}.
$$

This model is called a beta-VAE when $\beta$ is treated as an explicit hyperparameter.

If

$$
\beta > 1,
$$

the model puts more pressure on the latent space to match the prior. This can encourage disentangled representations, but it can also reduce reconstruction quality.

If

$$
\beta < 1,
$$

the model allows the latent code to carry more information, often improving reconstruction but weakening generative sampling.
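A minimal way to add this coefficient is to scale the KL term in the loss defined earlier; the default value of `beta` below is only illustrative:

```python
def beta_vae_loss(x_hat, x, mu, logvar, beta: float = 4.0):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```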

### Posterior Collapse

Posterior collapse occurs when the encoder distribution becomes nearly equal to the prior:

$$
q_\phi(z\mid x) \approx p(z).
$$

Then $z$ carries little or no information about $x$. The decoder ignores the latent variable and learns to generate outputs without using it.

This problem is common when the decoder is very powerful, such as an autoregressive language model. The decoder can model the data directly and has little incentive to use $z$.

Symptoms include:

| Symptom | Meaning |
|---|---|
| KL loss near zero | Latent code matches prior too closely |
| Similar reconstructions for different inputs | Encoder carries little information |
| Decoder ignores changes in $z$ | Latent space is unused |

Common mitigation strategies include KL annealing, weaker decoders, skip connections, free bits, and changing the training objective.

KL annealing gradually increases the KL coefficient during training:

$$
L =
L_{\text{recon}}
+
\beta(t)L_{\text{KL}},
$$

where $\beta(t)$ starts near zero and increases toward one.
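A simple linear warm-up is one common choice. The sketch below assumes a fixed number of warm-up steps; both the step count and the linear shape are illustrative:

```python
def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    # beta(t): ramps linearly from 0 to 1, then stays at 1.
    return min(1.0, step / warmup_steps)

# Inside the training loop:
# beta = kl_weight(global_step)
# loss = recon + beta * kl
```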

### Disentanglement

A disentangled representation separates independent factors of variation into different latent coordinates.

For example, in an image dataset, different latent dimensions might represent:

| Latent coordinate | Possible factor |
|---|---|
| $z_1$ | Rotation |
| $z_2$ | Stroke thickness |
| $z_3$ | Object size |
| $z_4$ | Lighting |
| $z_5$ | Background |

Beta-VAEs are often studied for disentanglement because stronger KL pressure can encourage simpler, more factorized latent representations.

However, disentanglement is difficult to define and measure. It often depends on the dataset, inductive biases, supervision, and evaluation procedure. Unsupervised disentanglement has fundamental limitations without assumptions about the data-generating process.

### Conditional VAEs

A conditional VAE models data conditioned on an observed variable $y$, such as a class label.

The generative model becomes

$$
p_\theta(x\mid z,y).
$$

The encoder may also condition on $y$:

$$
q_\phi(z\mid x,y).
$$

The decoder receives both $z$ and $y$, allowing class-controlled generation.

For example, on a digit dataset, $y$ may specify the desired digit class. Sampling proceeds by choosing a class $y$, drawing

$$
z \sim \mathcal{N}(0,I),
$$

and decoding from $(z,y)$.
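One common implementation conditions both networks by concatenating a one-hot label to their inputs. The sketch below adapts the earlier architecture; the `ConditionalVAE` class, its layer sizes, and the label handling are illustrative assumptions rather than a fixed recipe:

```python
import torch
from torch import nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    def __init__(self, input_dim: int = 784, num_classes: int = 10, latent_dim: int = 32):
        super().__init__()
        self.num_classes = num_classes

        # The encoder sees the input concatenated with a one-hot label.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim + num_classes, 256),
            nn.ReLU(),
        )
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

        # The decoder sees the latent code concatenated with the same label.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + num_classes, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        y_onehot = F.one_hot(y, self.num_classes).float()
        h = self.encoder(torch.cat([x, y_onehot], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.decoder(torch.cat([z, y_onehot], dim=1))
        return x_hat, mu, logvar

cvae = ConditionalVAE()

# Class-controlled sampling: fix a label, draw z from the prior, decode.
y = torch.full((16,), 3)  # generate sixteen examples of class 3
z = torch.randn(16, 32)
samples = cvae.decoder(torch.cat([z, F.one_hot(y, 10).float()], dim=1))
```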

Conditional VAEs are useful when generation should be controlled by labels, attributes, text, or other side information.

### Limits of VAEs

VAEs are elegant and stable to train, but they have limitations.

First, reconstructions may be blurry when using simple Gaussian likelihoods or pixelwise losses.

Second, the prior may be too simple. A standard normal prior may not match the aggregate posterior over latent codes that the encoder actually produces.

Third, posterior collapse can make the latent code unused.

Fourth, VAEs may produce lower visual fidelity than GANs or diffusion models on image generation tasks.

Fifth, likelihood and sample quality may not align. A model can assign reasonable likelihood while producing poor-looking samples.

These limitations led to many extensions, including hierarchical VAEs, normalizing-flow priors, vector-quantized VAEs, diffusion decoders, and hybrid models.

### Summary

A variational autoencoder is a probabilistic autoencoder. The encoder approximates a posterior distribution over latent variables. The decoder models data generation from those latent variables.

The VAE objective combines reconstruction quality with a KL penalty that regularizes the latent distribution toward a prior. The reparameterization trick allows stochastic latent sampling while preserving gradient-based training.

VAEs provide a principled bridge between autoencoders and probabilistic generative modeling. They remain important for latent representation learning, controlled generation, uncertainty-aware modeling, and as components inside larger generative systems.

