# Latent Diffusion

Early diffusion models operated directly in pixel space. These models generated images by iteratively denoising tensors of shape

```python id="e4z5zr"
[B, 3, 512, 512]
```

where $B$ is the batch size and the remaining dimensions represent RGB images.

Although these models produced high-quality outputs, they were computationally expensive. Every denoising step required neural network computation over large high-resolution tensors. Training and inference therefore consumed large amounts of memory, compute, and time.

Latent diffusion addresses this problem by moving the diffusion process into a compressed latent representation. Instead of diffusing pixels, the model diffuses latent tensors produced by an autoencoder.

This idea became foundational in modern text-to-image systems such as Stable Diffusion.

### Motivation for Latent Diffusion

Pixel-space diffusion is expensive for several reasons.

First, image tensors are large. A $512\times512$ RGB image contains:

$$
3\times512\times512 =
786{,}432
$$

values.

Second, diffusion requires many denoising steps. Each step runs a large neural network over the full spatial resolution.

Third, much of the pixel information is locally redundant. Neighboring pixels often contain highly correlated structure.

The key observation is that many image details are compressible. Instead of modeling raw pixels directly, we can learn a lower-dimensional latent space that preserves semantic structure.

The diffusion process then operates on compressed latent representations:

| Space | Example shape |
|---|---|
| Pixel space | `[B, 3, 512, 512]` |
| Latent space | `[B, 4, 64, 64]` |

This dramatically reduces computation.

### Autoencoder Compression

Latent diffusion uses an encoder-decoder architecture.

The encoder maps images into latent tensors:

$$
z_0 = \mathcal{E}(x_0).
$$

The decoder reconstructs images from latent representations:

$$
\hat{x}_0 = \mathcal{D}(z_0).
$$

Here:

| Symbol | Meaning |
|---|---|
| $\mathcal{E}$ | Encoder |
| $\mathcal{D}$ | Decoder |
| $x_0$ | Original image |
| $z_0$ | Latent representation |

The encoder compresses the image into a lower-dimensional representation while preserving visually important information.

The diffusion model operates entirely on $z_0$, not on $x_0$.
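
As a minimal sketch, a convolutional encoder can halve the spatial resolution three times to map `[B, 3, 512, 512]` images to `[B, 4, 64, 64]` latents, with the decoder mirroring it. The architecture below is a toy illustration, not the autoencoder of any particular system:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy encoder: three stride-2 convolutions, 512x512x3 -> 64x64x4."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),   # 512 -> 256
            nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),  # 256 -> 128
            nn.SiLU(),
            nn.Conv2d(64, 4, 3, stride=2, padding=1),   # 128 -> 64
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Toy decoder: three stride-2 transposed convolutions, 64x64x4 -> 512x512x3."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1),   # 64 -> 128
            nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 128 -> 256
            nn.SiLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # 256 -> 512
        )

    def forward(self, z):
        return self.net(z)

x0 = torch.randn(2, 3, 512, 512)
z0 = Encoder()(x0)       # [2, 4, 64, 64]
x_hat = Decoder()(z0)    # [2, 3, 512, 512]
```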

### Variational Autoencoder Foundations

Most latent diffusion systems use a variational autoencoder-like structure.

The encoder predicts a latent distribution:

$$
q(z\mid x).
$$

Typically:

$$
q(z\mid x) =
\mathcal{N}
\left(
z;
\mu(x),
\sigma(x)^2 I
\right).
$$

The latent is sampled using the reparameterization trick:

$$
z =
\mu(x)
+
\sigma(x)\epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I).
$$

The decoder reconstructs the image:

$$
\hat{x} =
\mathcal{D}(z).
$$

Training minimizes:

$$
\mathcal{L} =
\mathcal{L}_\text{recon}
+
\lambda
D_{\mathrm{KL}}
\left(
q(z\mid x)\|p(z)
\right).
$$

The KL regularization encourages the latent space to remain approximately Gaussian and well-structured.
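
Both the reparameterization and the KL term have simple closed forms. A sketch, with illustrative latent shapes:

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    sigma = torch.exp(0.5 * logvar)
    return mu + sigma * torch.randn_like(sigma)

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(logvar.exp() + mu**2 - 1.0 - logvar, dim=[1, 2, 3])

# Hypothetical encoder outputs: per-position latent mean and log-variance
mu = torch.randn(8, 4, 64, 64)
logvar = torch.randn(8, 4, 64, 64)

z = reparameterize(mu, logvar)                 # [8, 4, 64, 64]
kl = kl_to_standard_normal(mu, logvar).mean()  # scalar regularizer
```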

### Diffusion in Latent Space

Once the encoder-decoder system is trained, the diffusion process operates on latent tensors:

$$
z_t =
\sqrt{\bar{\alpha}_t}z_0
+
\sqrt{1-\bar{\alpha}_t}\epsilon.
$$

This equation is identical to pixel-space diffusion. The only difference is that the variables now represent latent tensors rather than pixel tensors.

The reverse process learns:

$$
p_\theta(z_{t-1}\mid z_t,c),
$$

where $c$ may represent text conditioning or other guidance information.

Sampling proceeds as:

$$
z_T
\rightarrow
z_{T-1}
\rightarrow
\cdots
\rightarrow
z_0.
$$

The decoder then converts the final latent into an image:

$$
x_0 =
\mathcal{D}(z_0).
$$
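
In code, the reverse chain is an iterative loop. The sketch below follows the standard DDPM ancestral-sampling update; `unet`, the conditioning `c`, and the schedule tensors `alphas`, `alpha_bars`, `betas` are hypothetical names assumed to be defined elsewhere:

```python
import torch

@torch.no_grad()
def sample_latents(unet, c, shape, T, alphas, alpha_bars, betas):
    """Ancestral DDPM sampling in latent space: z_T -> z_{T-1} -> ... -> z_0."""
    z = torch.randn(shape)  # z_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = unet(z, t_batch, c)  # predicted noise
        # Posterior mean of the DDPM reverse step
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (z - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            z = mean + torch.sqrt(betas[t]) * torch.randn_like(z)  # add noise
        else:
            z = mean  # final step is deterministic
    return z  # z_0, ready for the decoder
```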

### Compression Ratios

Latent spaces are usually spatially compressed.

For example, an encoder may reduce a $512\times512$ image to a $64\times64$ latent representation.

This corresponds to an 8x reduction along each spatial dimension:

$$
512 / 64 = 8.
$$

The total compression factor becomes:

$$
8\times8 = 64.
$$

If the latent tensor uses 4 channels instead of 3 RGB channels, then the total representation size becomes:

$$
4\times64\times64 = 16{,}384.
$$

Compare this with pixel space:

$$
3\times512\times512 = 786{,}432.
$$

The latent representation is therefore dramatically smaller.

| Representation | Number of values |
|---|---:|
| Pixel image | 786,432 |
| Latent tensor | 16,384 |

This reduction makes diffusion much cheaper.
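
Note that counting raw values, the reduction is $786{,}432 / 16{,}384 = 48$, slightly less than the $64\times$ spatial factor because the channel count rises from 3 to 4. A quick check:

```python
pixel_values = 3 * 512 * 512   # 786,432 values per image
latent_values = 4 * 64 * 64    # 16,384 values per latent

print(pixel_values / latent_values)  # 48.0 -> ~48x fewer values overall
```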

### Why Latent Diffusion Works

A good latent representation separates semantic information from pixel redundancy.

The encoder learns to preserve:

| Preserved information | Examples |
|---|---|
| Object structure | Faces, cars, buildings |
| Spatial layout | Relative positions |
| Global semantics | Scene identity |
| Important textures | Materials and edges |

The encoder discards:

| Reduced information | Examples |
|---|---|
| High-frequency noise | Pixel-level randomness |
| Redundant detail | Similar neighboring pixels |
| Compression-insensitive features | Imperceptible variation |

The diffusion model therefore focuses on modeling semantic structure rather than low-level pixel statistics.

### Architecture of Latent Diffusion Models

A latent diffusion system typically contains three major components.

| Component | Purpose |
|---|---|
| Autoencoder | Compress and reconstruct images |
| Diffusion U-Net | Perform denoising in latent space |
| Conditioning model | Encode prompts or other guidance |

The workflow becomes:

$$
x_0
\rightarrow
z_0
\rightarrow
z_t
\rightarrow
\hat{z}_0
\rightarrow
\hat{x}_0.
$$

The diffusion model itself never directly processes full-resolution images.

### Cross-Attention Conditioning

Modern latent diffusion systems condition generation on text.

Suppose a text encoder produces embeddings:

$$
c = \mathrm{TextEncoder}(y),
$$

where $y$ is the prompt.

The denoising model predicts:

$$
\epsilon_\theta(z_t,t,c).
$$

Cross-attention allows latent features to attend to text embeddings.

The attention mechanism computes:

$$
\mathrm{Attention}(Q,K,V) =
\mathrm{softmax}
\left(
\frac{QK^\top}{\sqrt{d}}
\right)V.
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"\\mathrm{Attention}(Q,K,V)=\\mathrm{softmax}\\left(\\frac{QK^\\top}{\\sqrt{d}}\\right)V"}}

Here:

| Symbol | Meaning |
|---|---|
| $Q$ | Queries from latent features |
| $K$ | Keys from text embeddings |
| $V$ | Values from text embeddings |

Cross-attention enables the latent image representation to incorporate prompt semantics during denoising.
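
A minimal single-head version of this mechanism is sketched below; the dimensions are illustrative, and real systems use multi-head attention inside U-Net blocks:

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: latent features attend to text embeddings."""
    def __init__(self, latent_dim, text_dim, d):
        super().__init__()
        self.q = nn.Linear(latent_dim, d)  # queries from latent features
        self.k = nn.Linear(text_dim, d)    # keys from text embeddings
        self.v = nn.Linear(text_dim, d)    # values from text embeddings
        self.d = d

    def forward(self, latent_tokens, text_tokens):
        # latent_tokens: [B, H*W, latent_dim], text_tokens: [B, T, text_dim]
        Q = self.q(latent_tokens)
        K = self.k(text_tokens)
        V = self.v(text_tokens)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d)  # [B, H*W, T]
        return torch.softmax(scores, dim=-1) @ V              # [B, H*W, d]

attn = CrossAttention(latent_dim=320, text_dim=768, d=320)
latent_tokens = torch.randn(8, 64 * 64, 320)  # flattened latent feature map
text_tokens = torch.randn(8, 77, 768)         # prompt embeddings
out = attn(latent_tokens, text_tokens)        # [8, 4096, 320]
```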

### Stable Diffusion Pipeline

A simplified latent diffusion pipeline looks like this:

1. Encode text prompt into embeddings
2. Sample latent Gaussian noise
3. Iteratively denoise latent representation
4. Decode latent tensor into image

Mathematically:

$$
c = \mathrm{TextEncoder}(y),
$$

$$
z_T\sim\mathcal{N}(0,I),
$$

$$
z_{t-1}\sim p_\theta(z_{t-1}\mid z_t,c),
$$

$$
x_0 = \mathcal{D}(z_0).
$$

The latent tensor evolves gradually from noise into structured semantic content.
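
For reference, libraries such as Hugging Face `diffusers` wrap all four steps in a single call. A usage sketch (the model identifier is illustrative and its availability may change):

```python
# Requires: pip install diffusers transformers torch
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id
    torch_dtype=torch.float16,
).to("cuda")

# Text encoding, noise sampling, latent denoising, and decoding all happen here
image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```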

### Latent Tensor Shapes

In many latent diffusion systems, tensor shapes follow conventions such as:

| Tensor | Shape |
|---|---|
| Image | `[B, 3, 512, 512]` |
| Latent | `[B, 4, 64, 64]` |
| Text embeddings | `[B, T, D]` |

Example:

```python id="0z9pn7"
images = torch.randn(8, 3, 512, 512)

latents = torch.randn(8, 4, 64, 64)

text_embeddings = torch.randn(8, 77, 768)
```

The diffusion U-Net processes latent tensors rather than pixel tensors.

### Training Procedure

Training latent diffusion involves multiple stages.

#### Stage 1: Train the Autoencoder

The encoder-decoder system learns to reconstruct images.

Losses may include:

| Loss | Purpose |
|---|---|
| Reconstruction loss | Pixel fidelity |
| Perceptual loss | Semantic similarity |
| KL loss | Latent regularization |
| Adversarial loss | Sharper outputs |
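
A sketch of how these terms might be combined into a single objective; the weights and the `lpips_fn`/`disc` hooks are illustrative placeholders, not the recipe of any specific system:

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(x, x_hat, mu, logvar, lpips_fn=None, disc=None,
                     lambda_kl=1e-6, lambda_perc=1.0, lambda_adv=0.5):
    """Illustrative combination of the four loss terms above."""
    loss = F.mse_loss(x_hat, x)                                # reconstruction
    kl = 0.5 * (mu**2 + logvar.exp() - 1.0 - logvar).mean()
    loss = loss + lambda_kl * kl                               # KL regularization
    if lpips_fn is not None:
        loss = loss + lambda_perc * lpips_fn(x_hat, x).mean()  # perceptual
    if disc is not None:
        loss = loss - lambda_adv * disc(x_hat).mean()          # adversarial (generator side)
    return loss

x = torch.randn(2, 3, 512, 512)
x_hat = torch.randn(2, 3, 512, 512)
mu, logvar = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
loss = autoencoder_loss(x, x_hat, mu, logvar)
```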

#### Stage 2: Freeze the Autoencoder

After training, the encoder and decoder are fixed.

#### Stage 3: Train the Diffusion Model

Images are encoded into latents:

$$
z_0=\mathcal{E}(x_0).
$$

Noise is added:

$$
z_t =
\sqrt{\bar{\alpha}_t}z_0
+
\sqrt{1-\bar{\alpha}_t}\epsilon.
$$

The diffusion model predicts noise:

$$
\epsilon_\theta(z_t,t,c).
$$

The training objective becomes:

$$
\mathcal{L} =
\mathbb{E}
\left[
\|
\epsilon -
\epsilon_\theta(z_t,t,c)
\|_2^2
\right].
$$

### Latent Scaling

Latent representations may have arbitrary variance depending on encoder training.

To stabilize diffusion training, latent vectors are often rescaled:

$$
z'_0 = s z_0,
$$

where $s$ is a constant scaling factor.

For example, some systems normalize latent standard deviation so that diffusion noise schedules behave consistently.
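
Stable Diffusion, for instance, multiplies VAE latents by a factor of approximately 0.18215, chosen so the scaled latents have roughly unit standard deviation. A sketch with a synthetic high-variance latent:

```python
import torch

SCALE = 0.18215  # e.g., Stable Diffusion v1's latent scaling factor

# Hypothetical raw encoder output with std ~5.5 (far from unit variance)
raw_latents = 5.5 * torch.randn(8, 4, 64, 64)

z0 = SCALE * raw_latents   # scaled latents fed to the diffusion process
print(z0.std())            # ~1.0 after scaling

# After sampling, undo the scaling before decoding
z_for_decoder = z0 / SCALE
```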

Without proper scaling:

| Problem | Consequence |
|---|---|
| Latents too large | Noise becomes too weak |
| Latents too small | Signal disappears too quickly |
| Inconsistent variance | Training instability |

### Advantages of Latent Diffusion

Latent diffusion provides several major advantages.

| Advantage | Explanation |
|---|---|
| Lower compute cost | Smaller tensors |
| Lower memory usage | Reduced spatial resolution |
| Faster training | Less expensive denoising |
| Faster inference | Smaller U-Net operations |
| Better scalability | Larger images become feasible |
| Semantic modeling | Focus on high-level structure |

This efficiency enabled practical open-source large-scale text-to-image generation.

### Limitations of Latent Diffusion

Latent compression also introduces limitations.

| Limitation | Cause |
|---|---|
| Loss of fine detail | Compression bottleneck |
| Reconstruction artifacts | Imperfect decoder |
| Semantic drift | Encoder information loss |
| Decoder dependence | Final quality limited by decoder |
| Compression bias | Latent space may favor certain textures |

The decoder becomes part of the generative pipeline. Even perfect latent denoising cannot exceed decoder reconstruction quality.

### Pixel Diffusion Versus Latent Diffusion

| Property | Pixel Diffusion | Latent Diffusion |
|---|---|---|
| Operating space | Pixels | Compressed latents |
| Tensor size | Large | Small |
| Compute cost | High | Lower |
| Memory usage | High | Lower |
| Fine detail modeling | Strong | Decoder-limited |
| Scalability | Harder | Easier |
| Sampling speed | Slower | Faster |

Pixel diffusion may preserve fine textures more directly. Latent diffusion is usually more practical for large-scale systems.

### PyTorch Example: Encoding and Diffusion

Suppose an autoencoder is defined as:

```python id="n0j6c6"
encoder = AutoencoderEncoder()
decoder = AutoencoderDecoder()
```

Encode images:

```python id="2qukgq"
images = torch.randn(8, 3, 512, 512)

latents = encoder(images)

print(latents.shape)
# torch.Size([8, 4, 64, 64])
```

Add diffusion noise:

```python id="7l6ol8"
noise = torch.randn_like(latents)

t = torch.randint(0, T, (8,))

alpha_bar_t = extract(alpha_bars, t, latents.shape)

z_t = (
    torch.sqrt(alpha_bar_t) * latents
    +
    torch.sqrt(1 - alpha_bar_t) * noise
)
```

Predict noise:

```python id="62n0c4"
pred_noise = unet(z_t, t, text_embeddings)

loss = torch.nn.functional.mse_loss(
    pred_noise,
    noise
)
```

Decode generated latent:

```python id="3i5h5h"
generated_images = decoder(latents)
```

This structure is the core workflow of many modern latent diffusion systems.

### Classifier-Free Guidance in Latent Space

Latent diffusion commonly uses classifier-free guidance.

The model predicts:

$$
\epsilon_\theta(z_t,t,c)
$$

and

$$
\epsilon_\theta(z_t,t,\varnothing).
$$

The guided prediction becomes:

$$
\hat{\epsilon} =
\epsilon_\text{uncond}
+
s
(
\epsilon_\text{cond} -
\epsilon_\text{uncond}
).
$$

The guidance scale $s$ controls prompt strength.

| Guidance scale | Effect |
|---|---|
| Small | More diversity |
| Moderate | Better prompt adherence |
| Large | Sharper but less diverse outputs |

Very large guidance scales may produce oversaturated or unstable images.
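
A sketch of the guided prediction, assuming a `unet` that takes a conditioning tensor and a null embedding `null_c` for the unconditional branch (both hypothetical names):

```python
import torch

def guided_noise(unet, z_t, t, c, null_c, guidance_scale=7.5):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    eps_cond = unet(z_t, t, c)         # conditioned on the prompt
    eps_uncond = unet(z_t, t, null_c)  # conditioned on the empty/null prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In practice, both predictions are often computed in a single batched forward pass by concatenating the conditional and unconditional inputs along the batch dimension.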

### Latent Diffusion Beyond Images

The latent diffusion idea generalizes beyond image generation.

Applications include:

| Domain | Latent representation |
|---|---|
| Video generation | Spatiotemporal latent tensors |
| Audio synthesis | Spectrogram or audio latents |
| 3D generation | Geometry or radiance field latents |
| Motion generation | Pose or trajectory latents |
| Molecular generation | Graph or embedding latents |

The key principle remains unchanged:

1. Learn a compressed representation
2. Diffuse in latent space
3. Decode into the original domain

### Why Latent Diffusion Became Dominant

Latent diffusion balanced three competing requirements:

| Requirement | Challenge |
|---|---|
| High visual quality | Requires expressive models |
| Large image resolution | Requires large tensors |
| Practical compute cost | Requires efficiency |

Pixel-space diffusion achieved quality but was expensive. GANs were fast but often unstable. Autoregressive image models scaled poorly with resolution.

Latent diffusion provided a practical compromise:

| Feature | Result |
|---|---|
| Compression | Lower compute |
| Diffusion training | Stable optimization |
| Attention conditioning | Strong prompt control |
| U-Net denoising | High-quality structure generation |

This combination made large-scale open text-to-image systems feasible.

### Summary

Latent diffusion performs diffusion in a compressed latent representation rather than directly in pixel space.

An encoder maps images into latent tensors:

$$
z_0=\mathcal{E}(x_0).
$$

The diffusion process operates on these latent representations:

$$
z_t =
\sqrt{\bar{\alpha}_t}z_0
+
\sqrt{1-\bar{\alpha}_t}\epsilon.
$$

A denoising network learns the reverse process in latent space. After denoising, a decoder reconstructs the final image.

Latent diffusion greatly reduces computational cost while preserving semantic structure. This architecture became foundational in modern text-to-image generation systems because it combines efficient compression, stable diffusion training, and flexible conditioning mechanisms.

