
Latent Diffusion

Early diffusion models operated directly in pixel space. A model generated images by iteratively denoising tensors such as

[B, 3, 512, 512]

where B is the batch size and the remaining dimensions represent RGB images.

Although these models produced high-quality outputs, they were computationally expensive. Every denoising step required neural network computation over large high-resolution tensors. Training and inference therefore consumed large amounts of memory, compute, and time.

Latent diffusion addresses this problem by moving the diffusion process into a compressed latent representation. Instead of diffusing pixels, the model diffuses latent tensors produced by an autoencoder.

This idea became foundational in modern text-to-image systems such as Stable Diffusion.

Motivation for Latent Diffusion

Pixel-space diffusion is expensive for several reasons.

First, image tensors are large. A 512×512 RGB image contains:

3 \times 512 \times 512 = 786{,}432

values.

Second, diffusion requires many denoising steps. Each step runs a large neural network over the full spatial resolution.

Third, much of the pixel information is locally redundant. Neighboring pixels often contain highly correlated structure.

The key observation is that many image details are compressible. Instead of modeling raw pixels directly, we can learn a lower-dimensional latent space that preserves semantic structure.

The diffusion process then operates on compressed latent representations:

| Space | Example shape |
| --- | --- |
| Pixel space | [B, 3, 512, 512] |
| Latent space | [B, 4, 64, 64] |

This dramatically reduces computation.

Autoencoder Compression

Latent diffusion uses an encoder-decoder architecture.

The encoder maps images into latent tensors:

z_0 = \mathcal{E}(x_0).

The decoder reconstructs images from latent representations:

\hat{x}_0 = \mathcal{D}(z_0).

Here:

| Symbol | Meaning |
| --- | --- |
| \mathcal{E} | Encoder |
| \mathcal{D} | Decoder |
| x_0 | Original image |
| z_0 | Latent representation |

The encoder compresses the image into a lower-dimensional representation while preserving visually important information.

The diffusion model operates entirely on z_0, not on x_0.
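To make the shapes concrete, here is a minimal convolutional encoder-decoder sketch with an 8× spatial downsampling factor. The layer counts, channel widths, and class names are illustrative assumptions; production autoencoders are far deeper and include residual and attention blocks.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps [B, 3, 512, 512] images to [B, 4, 64, 64] latents (8x downsampling)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1),    # 512 -> 256
            nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),  # 256 -> 128
            nn.SiLU(),
            nn.Conv2d(128, 4, 3, stride=2, padding=1),   # 128 -> 64
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Maps [B, 4, 64, 64] latents back to [B, 3, 512, 512] images."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(4, 128, 4, stride=2, padding=1),   # 64 -> 128
            nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 128 -> 256
            nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),    # 256 -> 512
        )

    def forward(self, z):
        return self.net(z)

x0 = torch.randn(2, 3, 512, 512)
z0 = Encoder()(x0)        # z_0 = E(x_0)
x0_hat = Decoder()(z0)    # x_hat_0 = D(z_0)
print(z0.shape, x0_hat.shape)  # [2, 4, 64, 64], [2, 3, 512, 512]
```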

Variational Autoencoder Foundations

Most latent diffusion systems use a variational autoencoder-like structure.

The encoder predicts a latent distribution:

q(z \mid x).

Typically:

q(z \mid x) = \mathcal{N}\left(z;\, \mu(x),\, \sigma(x)^2 I\right).

The latent is sampled using the reparameterization trick:

z = \mu(x) + \sigma(x)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).

The decoder reconstructs the image:

\hat{x} = \mathcal{D}(z).

Training minimizes:

\mathcal{L} = \mathcal{L}_\text{recon} + \lambda\, D_{\mathrm{KL}}\left(q(z \mid x)\,\|\,p(z)\right).

The KL regularization encourages the latent space to remain approximately Gaussian and well-structured.
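As a sketch of these pieces in PyTorch, assuming hypothetical mu and log_var tensors produced by an encoder (the kl_weight value is an arbitrary example; latent diffusion autoencoders typically keep the KL weight very small):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I); keeps gradients flowing through mu and sigma."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var, kl_weight=1e-4):
    """Reconstruction term plus KL(q(z|x) || N(0, I)) for a diagonal Gaussian."""
    recon = F.mse_loss(x_hat, x)
    # Closed-form KL divergence between N(mu, sigma^2 I) and N(0, I)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl_weight * kl
```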

Diffusion in Latent Space

Once the encoder-decoder system is trained, the diffusion process operates on latent tensors:

z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.

This equation is identical to pixel-space diffusion. The only difference is that the variables now represent latent tensors rather than pixel tensors.

The reverse process learns:

p_\theta(z_{t-1} \mid z_t, c),

where c may represent text conditioning or other guidance information.

Sampling proceeds as:

z_T \rightarrow z_{T-1} \rightarrow \cdots \rightarrow z_0.

The decoder then converts the final latent into an image:

x_0 = \mathcal{D}(z_0).
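As a sketch, ancestral sampling in latent space can be written as follows. The unet, decoder, and schedule tensors (alphas, alpha_bars, betas) are assumed to exist, and the update rule is the standard DDPM posterior step:

```python
import torch

@torch.no_grad()
def sample_latents(unet, decoder, c, T, alphas, alpha_bars, betas,
                   shape=(1, 4, 64, 64)):
    """DDPM-style ancestral sampling in latent space (sketch)."""
    z = torch.randn(shape)  # z_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = unet(z, t_batch, c)  # predicted noise
        # Posterior mean: (z_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (z - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        # Add noise at every step except the last
        z = mean + torch.sqrt(betas[t]) * torch.randn_like(z) if t > 0 else mean
    return decoder(z)  # x_0 = D(z_0)
```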

Compression Ratios

Latent spaces are usually spatially compressed.

For example, an encoder may reduce a 512×512 image to a 64×64 latent representation.

This corresponds to an 8x reduction along each spatial dimension:

512 / 64 = 8.

The total spatial compression factor is therefore:

8 \times 8 = 64.

If the latent tensor uses 4 channels instead of 3 RGB channels, then the total representation size becomes:

4 \times 64 \times 64 = 16{,}384.

Compare this with pixel space:

3 \times 512 \times 512 = 786{,}432.

The latent representation is therefore roughly 48 times smaller (786{,}432 / 16{,}384 = 48).

| Representation | Number of values |
| --- | --- |
| Pixel image | 786,432 |
| Latent tensor | 16,384 |

This reduction makes diffusion much cheaper.
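These counts are easy to verify directly:

```python
pixel_values = 3 * 512 * 512    # 786,432
latent_values = 4 * 64 * 64     # 16,384
print(pixel_values / latent_values)  # 48.0
```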

Why Latent Diffusion Works

A good latent representation separates semantic information from pixel redundancy.

The encoder learns to preserve:

| Preserved information | Examples |
| --- | --- |
| Object structure | Faces, cars, buildings |
| Spatial layout | Relative positions |
| Global semantics | Scene identity |
| Important textures | Materials and edges |

The encoder discards:

| Reduced information | Examples |
| --- | --- |
| High-frequency noise | Pixel-level randomness |
| Redundant detail | Similar neighboring pixels |
| Compression-insensitive features | Imperceptible variation |

The diffusion model therefore focuses on modeling semantic structure rather than low-level pixel statistics.

Architecture of Latent Diffusion Models

A latent diffusion system typically contains three major components.

| Component | Purpose |
| --- | --- |
| Autoencoder | Compress and reconstruct images |
| Diffusion U-Net | Perform denoising in latent space |
| Conditioning model | Encode prompts or other guidance |

The workflow becomes:

x_0 \rightarrow z_0 \rightarrow z_t \rightarrow \hat{z}_0 \rightarrow \hat{x}_0.

The diffusion model itself never directly processes full-resolution images.

Cross-Attention Conditioning

Modern latent diffusion systems condition generation on text.

Suppose a text encoder produces embeddings:

c = \mathrm{TextEncoder}(y),

where y is the prompt.

The denoising model predicts:

\epsilon_\theta(z_t, t, c).

Cross-attention allows latent features to attend to text embeddings.

The attention mechanism computes:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V.

Here:

| Symbol | Meaning |
| --- | --- |
| Q | Queries from latent features |
| K | Keys from text embeddings |
| V | Values from text embeddings |

Cross-attention enables the latent image representation to incorporate prompt semantics during denoising.
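A minimal single-head cross-attention layer can be sketched as follows. The dimensions (320-channel latent features, 768-dimensional text embeddings) and the module name are illustrative assumptions; real systems use multi-head attention inside U-Net blocks.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: latent features attend to text embeddings (sketch)."""
    def __init__(self, latent_dim=320, text_dim=768, attn_dim=64):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, attn_dim, bias=False)  # queries from latents
        self.to_k = nn.Linear(text_dim, attn_dim, bias=False)    # keys from text
        self.to_v = nn.Linear(text_dim, attn_dim, bias=False)    # values from text
        self.to_out = nn.Linear(attn_dim, latent_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, latent_tokens, text_embeddings):
        # latent_tokens: [B, HW, latent_dim]; text_embeddings: [B, T, text_dim]
        q = self.to_q(latent_tokens)
        k = self.to_k(text_embeddings)
        v = self.to_v(text_embeddings)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # [B, HW, T]
        return self.to_out(attn @ v)

attn = CrossAttention()
out = attn(torch.randn(2, 64 * 64, 320), torch.randn(2, 77, 768))
print(out.shape)  # torch.Size([2, 4096, 320])
```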

Stable Diffusion Pipeline

A simplified latent diffusion pipeline looks like this:

  1. Encode text prompt into embeddings
  2. Sample latent Gaussian noise
  3. Iteratively denoise latent representation
  4. Decode latent tensor into image

Mathematically:

c = \mathrm{TextEncoder}(y),

z_T \sim \mathcal{N}(0, I),

z_{t-1} \sim p_\theta(z_{t-1} \mid z_t, c),

x_0 = \mathcal{D}(z_0).

The latent tensor evolves gradually from noise into structured semantic content.
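For reference, the Hugging Face diffusers library packages exactly this pipeline. A minimal usage sketch (the model identifier is an example and its availability may change):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id
    torch_dtype=torch.float16,
).to("cuda")

# Text encoding, latent denoising, and decoding all happen inside this call.
image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("astronaut.png")
```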

Latent Tensor Shapes

In many latent diffusion systems, tensor shapes follow conventions such as:

| Tensor | Shape |
| --- | --- |
| Image | [B, 3, 512, 512] |
| Latent | [B, 4, 64, 64] |
| Text embeddings | [B, T, D] |

Example:

```python
import torch

images = torch.randn(8, 3, 512, 512)        # pixel-space batch
latents = torch.randn(8, 4, 64, 64)         # latent-space batch
text_embeddings = torch.randn(8, 77, 768)   # e.g., 77 tokens with 768-dim embeddings
```

The diffusion U-Net processes latent tensors rather than pixel tensors.

Training Procedure

Training latent diffusion involves multiple stages.

Stage 1: Train the Autoencoder

The encoder-decoder system learns to reconstruct images.

Losses may include:

| Loss | Purpose |
| --- | --- |
| Reconstruction loss | Pixel fidelity |
| Perceptual loss | Semantic similarity |
| KL loss | Latent regularization |
| Adversarial loss | Sharper outputs |

Stage 2: Freeze the Autoencoder

After training, the encoder and decoder are fixed.

Stage 3: Train the Diffusion Model

Images are encoded into latents:

z_0 = \mathcal{E}(x_0).

Noise is added:

z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.

The diffusion model predicts noise:

\epsilon_\theta(z_t, t, c).

The training objective becomes:

\mathcal{L} = \mathbb{E}\left[\,\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2\,\right].
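Putting the stages together, a single Stage 3 training step might look like this sketch; the encoder, unet, noise-schedule tensor alpha_bars, and text embeddings are assumed to exist, with the autoencoder frozen:

```python
import torch
import torch.nn.functional as F

def training_step(x0, text_embeddings, encoder, unet, alpha_bars, T):
    """One latent-diffusion training step (sketch)."""
    with torch.no_grad():                      # Stage 2: the autoencoder is frozen
        z0 = encoder(x0)
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)    # broadcast over [B, C, H, W]
    z_t = torch.sqrt(a_bar) * z0 + torch.sqrt(1 - a_bar) * noise
    pred = unet(z_t, t, text_embeddings)       # predict the added noise
    return F.mse_loss(pred, noise)
```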

Latent Scaling

Latent representations may have arbitrary variance depending on encoder training.

To stabilize diffusion training, latent vectors are often rescaled:

z'_0 = s\, z_0,

where s is a constant scaling factor.

For example, some systems normalize latent standard deviation so that diffusion noise schedules behave consistently.

Without proper scaling:

| Problem | Consequence |
| --- | --- |
| Latents too large | Noise becomes too weak |
| Latents too small | Signal disappears too quickly |
| Inconsistent variance | Training instability |
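For instance, Stable Diffusion multiplies its VAE latents by a fixed factor (0.18215 in the original release) chosen so that scaled latents have roughly unit standard deviation. A sketch of estimating such a factor from data:

```python
import torch

# Stand-in for a large batch of encoder outputs z_0.
latents = torch.randn(1024, 4, 64, 64) * 5.0

s = 1.0 / latents.flatten().std()   # choose s so that s * z_0 has ~unit std
scaled_latents = s * latents        # z'_0 = s * z_0
# Divide by s again before passing generated latents to the decoder.
print(scaled_latents.flatten().std())  # ~1.0
```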

Advantages of Latent Diffusion

Latent diffusion provides several major advantages.

| Advantage | Explanation |
| --- | --- |
| Lower compute cost | Smaller tensors |
| Lower memory usage | Reduced spatial resolution |
| Faster training | Less expensive denoising |
| Faster inference | Smaller U-Net operations |
| Better scalability | Larger images become feasible |
| Semantic modeling | Focus on high-level structure |

This efficiency enabled practical open-source large-scale text-to-image generation.

Limitations of Latent Diffusion

Latent compression also introduces limitations.

| Limitation | Cause |
| --- | --- |
| Loss of fine detail | Compression bottleneck |
| Reconstruction artifacts | Imperfect decoder |
| Semantic drift | Encoder information loss |
| Decoder dependence | Final quality limited by decoder |
| Compression bias | Latent space may favor certain textures |

The decoder becomes part of the generative pipeline. Even perfect latent denoising cannot exceed decoder reconstruction quality.

Pixel Diffusion Versus Latent Diffusion

| Property | Pixel Diffusion | Latent Diffusion |
| --- | --- | --- |
| Operating space | Pixels | Compressed latents |
| Tensor size | Large | Small |
| Compute cost | High | Lower |
| Memory usage | High | Lower |
| Fine detail modeling | Strong | Decoder-limited |
| Scalability | Harder | Easier |
| Sampling speed | Slower | Faster |

Pixel diffusion may preserve fine textures more directly. Latent diffusion is usually more practical for large-scale systems.

PyTorch Example: Encoding and Diffusion

Suppose an autoencoder is defined as:

```python
encoder = AutoencoderEncoder()
decoder = AutoencoderDecoder()
```

Encode images:

```python
images = torch.randn(8, 3, 512, 512)

latents = encoder(images)
print(latents.shape)
# torch.Size([8, 4, 64, 64])
```

Add diffusion noise:

```python
noise = torch.randn_like(latents)
t = torch.randint(0, T, (8,))                        # one random timestep per sample

# extract() gathers alpha_bars[t] per sample and reshapes it to broadcast
# over the latent tensor's [B, C, H, W] dimensions.
alpha_bar_t = extract(alpha_bars, t, latents.shape)

z_t = (
    torch.sqrt(alpha_bar_t) * latents
    + torch.sqrt(1 - alpha_bar_t) * noise
)
```

Predict noise:

```python
pred_noise = unet(z_t, t, text_embeddings)

loss = torch.nn.functional.mse_loss(pred_noise, noise)
```

Decode generated latent:

```python
# At generation time the fully denoised latent is decoded;
# here `latents` stands in for that final z_0.
generated_images = decoder(latents)
```

This structure is the core workflow of many modern latent diffusion systems.

Classifier-Free Guidance in Latent Space

Latent diffusion commonly uses classifier-free guidance.

The model predicts:

\epsilon_\theta(z_t, t, c)

and

\epsilon_\theta(z_t, t, \varnothing).

The guided prediction becomes:

\hat{\epsilon} = \epsilon_\text{uncond} + s\,(\epsilon_\text{cond} - \epsilon_\text{uncond}).

The guidance scale s controls prompt strength.

| Guidance scale | Effect |
| --- | --- |
| Small | More diversity |
| Moderate | Better prompt adherence |
| Large | Sharper but less diverse outputs |

Very large guidance scales may produce oversaturated or unstable images.
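In code, classifier-free guidance combines two forward passes of the denoiser. A sketch assuming a unet and precomputed conditional/unconditional text embeddings:

```python
import torch

def guided_noise(unet, z_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    # Run both passes in a single batch for efficiency.
    z_in = torch.cat([z_t, z_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    emb = torch.cat([uncond_emb, cond_emb], dim=0)
    eps_uncond, eps_cond = unet(z_in, t_in, emb).chunk(2, dim=0)
    # eps_hat = eps_uncond + s * (eps_cond - eps_uncond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```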

Latent Diffusion Beyond Images

The latent diffusion idea generalizes beyond image generation.

Applications include:

| Domain | Latent representation |
| --- | --- |
| Video generation | Spatiotemporal latent tensors |
| Audio synthesis | Spectrogram or audio latents |
| 3D generation | Geometry or radiance field latents |
| Motion generation | Pose or trajectory latents |
| Molecular generation | Graph or embedding latents |

The key principle remains unchanged:

  1. Learn a compressed representation
  2. Diffuse in latent space
  3. Decode into the original domain

Why Latent Diffusion Became Dominant

Latent diffusion balanced three competing requirements:

| Requirement | Challenge |
| --- | --- |
| High visual quality | Requires expressive models |
| Large image resolution | Requires large tensors |
| Practical compute cost | Requires efficiency |

Pixel-space diffusion achieved quality but was expensive. GANs were fast but often unstable. Autoregressive image models scaled poorly with resolution.

Latent diffusion provided a practical compromise:

| Feature | Result |
| --- | --- |
| Compression | Lower compute |
| Diffusion training | Stable optimization |
| Attention conditioning | Strong prompt control |
| U-Net denoising | High-quality structure generation |

This combination made large-scale open text-to-image systems feasible.

Summary

Latent diffusion performs diffusion in a compressed latent representation rather than directly in pixel space.

An encoder maps images into latent tensors:

z_0 = \mathcal{E}(x_0).

The diffusion process operates on these latent representations:

z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.

A denoising network learns the reverse process in latent space. After denoising, a decoder reconstructs the final image.

Latent diffusion greatly reduces computational cost while preserving semantic structure. This architecture became foundational in modern text-to-image generation systems because it combines efficient compression, stable diffusion training, and flexible conditioning mechanisms.