
Latent Diffusion

Early diffusion models operated directly in pixel space. A model generated images by iteratively denoising tensors such as

[B, 3, 512, 512]

where B is the batch size and the remaining dimensions represent RGB images.

Although these models produced high-quality outputs, they were computationally expensive. Every denoising step required neural network computation over large high-resolution tensors. Training and inference therefore consumed large amounts of memory, compute, and time.

Latent diffusion addresses this problem by moving the diffusion process into a compressed latent representation. Instead of diffusing pixels, the model diffuses latent tensors produced by an autoencoder.

This idea became foundational in modern text-to-image systems such as Stable Diffusion.

Motivation for Latent Diffusion

Pixel-space diffusion is expensive for several reasons.

First, image tensors are large. A 512×512 RGB image contains:

3 \times 512 \times 512 = 786{,}432

values.

Second, diffusion requires many denoising steps. Each step runs a large neural network over the full spatial resolution.

Third, much of the pixel information is locally redundant. Neighboring pixels often contain highly correlated structure.

The key observation is that many image details are compressible. Instead of modeling raw pixels directly, we can learn a lower-dimensional latent space that preserves semantic structure.

The diffusion process then operates on compressed latent representations:

| Space | Example shape |
| --- | --- |
| Pixel space | [B, 3, 512, 512] |
| Latent space | [B, 4, 64, 64] |

This dramatically reduces computation.

Autoencoder Compression

Latent diffusion uses an encoder-decoder architecture.

The encoder maps images into latent tensors:

z_0 = \mathcal{E}(x_0).

The decoder reconstructs images from latent representations:

\hat{x}_0 = \mathcal{D}(z_0).

Here:

| Symbol | Meaning |
| --- | --- |
| \mathcal{E} | Encoder |
| \mathcal{D} | Decoder |
| x_0 | Original image |
| z_0 | Latent representation |

The encoder compresses the image into a lower-dimensional representation while preserving visually important information.

The diffusion model operates entirely on z_0, not on x_0.
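To make the shapes concrete, here is a minimal convolutional encoder-decoder sketch with an 8× spatial downsampling factor. The layer counts, channel widths, and class names are illustrative assumptions; production autoencoders are far deeper and include residual and attention blocks.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps [B, 3, 512, 512] images to [B, 4, 64, 64] latents (8x downsampling)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1),    # 512 -> 256
            nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),  # 256 -> 128
            nn.SiLU(),
            nn.Conv2d(128, 4, 3, stride=2, padding=1),   # 128 -> 64
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Maps [B, 4, 64, 64] latents back to [B, 3, 512, 512] images."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(4, 128, 4, stride=2, padding=1),   # 64 -> 128
            nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 128 -> 256
            nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),    # 256 -> 512
        )

    def forward(self, z):
        return self.net(z)

x0 = torch.randn(2, 3, 512, 512)
z0 = Encoder()(x0)        # z_0 = E(x_0)
x0_hat = Decoder()(z0)    # x_hat_0 = D(z_0)
print(z0.shape, x0_hat.shape)  # [2, 4, 64, 64], [2, 3, 512, 512]
```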

Variational Autoencoder Foundations

Most latent diffusion systems use a variational autoencoder-like structure.

The encoder predicts a latent distribution:

q(z \mid x).

Typically:

q(z \mid x) = \mathcal{N}\left(z;\, \mu(x),\, \sigma(x)^2 I\right).

The latent is sampled using the reparameterization trick:

z = \mu(x) + \sigma(x)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).

The decoder reconstructs the image:

\hat{x} = \mathcal{D}(z).

Training minimizes:

\mathcal{L} = \mathcal{L}_\text{recon} + \lambda\, D_{\mathrm{KL}}\left(q(z \mid x)\,\|\,p(z)\right).

The KL regularization encourages the latent space to remain approximately Gaussian and well-structured.
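As a sketch of these pieces in PyTorch, assuming hypothetical mu and log_var tensors produced by an encoder (the kl_weight value is an arbitrary example; latent diffusion autoencoders typically keep the KL weight very small):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I); keeps gradients flowing through mu and sigma."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var, kl_weight=1e-4):
    """Reconstruction term plus KL(q(z|x) || N(0, I)) for a diagonal Gaussian."""
    recon = F.mse_loss(x_hat, x)
    # Closed-form KL divergence between N(mu, sigma^2 I) and N(0, I)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl_weight * kl
```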

Diffusion in Latent Space

Once the encoder-decoder system is trained, the diffusion process operates on latent tensors:

z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.

This equation is identical to pixel-space diffusion. The only difference is that the variables now represent latent tensors rather than pixel tensors.

The reverse process learns:

p_\theta(z_{t-1} \mid z_t, c),

where c may represent text conditioning or other guidance information.

Sampling proceeds as:

z_T \rightarrow z_{T-1} \rightarrow \cdots \rightarrow z_0.

The decoder then converts the final latent into an image:

x_0 = \mathcal{D}(z_0).
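As a sketch, ancestral sampling in latent space can be written as follows. The unet, decoder, and schedule tensors (alphas, alpha_bars, betas) are assumed to exist, and the update rule is the standard DDPM posterior step:

```python
import torch

@torch.no_grad()
def sample_latents(unet, decoder, c, T, alphas, alpha_bars, betas,
                   shape=(1, 4, 64, 64)):
    """DDPM-style ancestral sampling in latent space (sketch)."""
    z = torch.randn(shape)  # z_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = unet(z, t_batch, c)  # predicted noise
        # Posterior mean: (z_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (z - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        # Add noise at every step except the last
        z = mean + torch.sqrt(betas[t]) * torch.randn_like(z) if t > 0 else mean
    return decoder(z)  # x_0 = D(z_0)
```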

Compression Ratios

Latent spaces are usually spatially compressed.

For example, an encoder may reduce a 512×512 image to a 64×64 latent representation.

This corresponds to an 8x reduction along each spatial dimension:

512 / 64 = 8.

The total spatial compression factor is therefore:

8 \times 8 = 64.

If the latent tensor uses 4 channels instead of 3 RGB channels, then the total representation size becomes:

4 \times 64 \times 64 = 16{,}384.

Compare this with pixel space:

3 \times 512 \times 512 = 786{,}432.

The latent representation is therefore roughly 48 times smaller (786{,}432 / 16{,}384 = 48).

| Representation | Number of values |
| --- | --- |
| Pixel image | 786,432 |
| Latent tensor | 16,384 |

This reduction makes diffusion much cheaper.
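These counts are easy to verify directly:

```python
pixel_values = 3 * 512 * 512    # 786,432
latent_values = 4 * 64 * 64     # 16,384
print(pixel_values / latent_values)  # 48.0
```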

Why Latent Diffusion Works

A good latent representation separates semantic information from pixel redundancy.

The encoder learns to preserve:

| Preserved information | Examples |
| --- | --- |
| Object structure | Faces, cars, buildings |
| Spatial layout | Relative positions |
| Global semantics | Scene identity |
| Important textures | Materials and edges |

The encoder discards:

| Reduced information | Examples |
| --- | --- |
| High-frequency noise | Pixel-level randomness |
| Redundant detail | Similar neighboring pixels |
| Compression-insensitive features | Imperceptible variation |

The diffusion model therefore focuses on modeling semantic structure rather than low-level pixel statistics.

Architecture of Latent Diffusion Models

A latent diffusion system typically contains three major components.

| Component | Purpose |
| --- | --- |
| Autoencoder | Compress and reconstruct images |
| Diffusion U-Net | Perform denoising in latent space |
| Conditioning model | Encode prompts or other guidance |

The workflow becomes:

x_0 \rightarrow z_0 \rightarrow z_t \rightarrow \hat{z}_0 \rightarrow \hat{x}_0.

The diffusion model itself never directly processes full-resolution images.

Cross-Attention Conditioning

Modern latent diffusion systems condition generation on text.

Suppose a text encoder produces embeddings:

c = \mathrm{TextEncoder}(y),

where y is the prompt.

The denoising model predicts:

\epsilon_\theta(z_t, t, c).

Cross-attention allows latent features to attend to text embeddings.

The attention mechanism computes:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V.

Here:

| Symbol | Meaning |
| --- | --- |
| Q | Queries from latent features |
| K | Keys from text embeddings |
| V | Values from text embeddings |

Cross-attention enables the latent image representation to incorporate prompt semantics during denoising.
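A minimal single-head cross-attention layer can be sketched as follows. The dimensions (320-channel latent features, 768-dimensional text embeddings) and the module name are illustrative assumptions; real systems use multi-head attention inside U-Net blocks.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: latent features attend to text embeddings (sketch)."""
    def __init__(self, latent_dim=320, text_dim=768, attn_dim=64):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, attn_dim, bias=False)  # queries from latents
        self.to_k = nn.Linear(text_dim, attn_dim, bias=False)    # keys from text
        self.to_v = nn.Linear(text_dim, attn_dim, bias=False)    # values from text
        self.to_out = nn.Linear(attn_dim, latent_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, latent_tokens, text_embeddings):
        # latent_tokens: [B, HW, latent_dim]; text_embeddings: [B, T, text_dim]
        q = self.to_q(latent_tokens)
        k = self.to_k(text_embeddings)
        v = self.to_v(text_embeddings)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # [B, HW, T]
        return self.to_out(attn @ v)

attn = CrossAttention()
out = attn(torch.randn(2, 64 * 64, 320), torch.randn(2, 77, 768))
print(out.shape)  # torch.Size([2, 4096, 320])
```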

Stable Diffusion Pipeline

A simplified latent diffusion pipeline looks like this:

  1. Encode text prompt into embeddings
  2. Sample latent Gaussian noise
  3. Iteratively denoise latent representation
  4. Decode latent tensor into image

Mathematically:

c = \mathrm{TextEncoder}(y),

z_T \sim \mathcal{N}(0, I),

z_{t-1} \sim p_\theta(z_{t-1} \mid z_t, c),

x_0 = \mathcal{D}(z_0).

The latent tensor evolves gradually from noise into structured semantic content.
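For reference, the Hugging Face diffusers library packages exactly this pipeline. A minimal usage sketch (the model identifier is an example and its availability may change):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id
    torch_dtype=torch.float16,
).to("cuda")

# Text encoding, latent denoising, and decoding all happen inside this call.
image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("astronaut.png")
```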

Latent Tensor Shapes

In many latent diffusion systems, tensor shapes follow conventions such as:

| Tensor | Shape |
| --- | --- |
| Image | [B, 3, 512, 512] |
| Latent | [B, 4, 64, 64] |
| Text embeddings | [B, T, D] |

Example:

```python
import torch

images = torch.randn(8, 3, 512, 512)        # pixel-space batch
latents = torch.randn(8, 4, 64, 64)         # latent-space batch
text_embeddings = torch.randn(8, 77, 768)   # e.g., 77 tokens with 768-dim embeddings
```

The diffusion U-Net processes latent tensors rather than pixel tensors.

Training Procedure

Training latent diffusion involves multiple stages.

Stage 1: Train the Autoencoder

The encoder-decoder system learns to reconstruct images.

Losses may include:

| Loss | Purpose |
| --- | --- |
| Reconstruction loss | Pixel fidelity |
| Perceptual loss | Semantic similarity |
| KL loss | Latent regularization |
| Adversarial loss | Sharper outputs |

Stage 2: Freeze the Autoencoder

After training, the encoder and decoder are fixed.

Stage 3: Train the Diffusion Model

Images are encoded into latents:

z_0 = \mathcal{E}(x_0).

Noise is added:

z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.

The diffusion model predicts noise:

\epsilon_\theta(z_t, t, c).

The training objective becomes:

\mathcal{L} = \mathbb{E}\left[\,\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2\,\right].
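Putting the stages together, a single Stage 3 training step might look like this sketch; the encoder, unet, noise-schedule tensor alpha_bars, and text embeddings are assumed to exist, with the autoencoder frozen:

```python
import torch
import torch.nn.functional as F

def training_step(x0, text_embeddings, encoder, unet, alpha_bars, T):
    """One latent-diffusion training step (sketch)."""
    with torch.no_grad():                      # Stage 2: the autoencoder is frozen
        z0 = encoder(x0)
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)    # broadcast over [B, C, H, W]
    z_t = torch.sqrt(a_bar) * z0 + torch.sqrt(1 - a_bar) * noise
    pred = unet(z_t, t, text_embeddings)       # predict the added noise
    return F.mse_loss(pred, noise)
```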

Latent Scaling

Latent representations may have arbitrary variance depending on encoder training.

To stabilize diffusion training, latent vectors are often rescaled:

z'_0 = s\, z_0,

where s is a constant scaling factor.

For example, some systems normalize latent standard deviation so that diffusion noise schedules behave consistently.

Without proper scaling:

| Problem | Consequence |
| --- | --- |
| Latents too large | Noise becomes too weak |
| Latents too small | Signal disappears too quickly |
| Inconsistent variance | Training instability |
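For instance, Stable Diffusion multiplies its VAE latents by a fixed factor (0.18215 in the original release) chosen so that scaled latents have roughly unit standard deviation. A sketch of estimating such a factor from data:

```python
import torch

# Stand-in for a large batch of encoder outputs z_0.
latents = torch.randn(1024, 4, 64, 64) * 5.0

s = 1.0 / latents.flatten().std()   # choose s so that s * z_0 has ~unit std
scaled_latents = s * latents        # z'_0 = s * z_0
# Divide by s again before passing generated latents to the decoder.
print(scaled_latents.flatten().std())  # ~1.0
```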

Advantages of Latent Diffusion

Latent diffusion provides several major advantages.

| Advantage | Explanation |
| --- | --- |
| Lower compute cost | Smaller tensors |
| Lower memory usage | Reduced spatial resolution |
| Faster training | Less expensive denoising |
| Faster inference | Smaller U-Net operations |
| Better scalability | Larger images become feasible |
| Semantic modeling | Focus on high-level structure |

This efficiency enabled practical open-source large-scale text-to-image generation.

Limitations of Latent Diffusion

Latent compression also introduces limitations.

| Limitation | Cause |
| --- | --- |
| Loss of fine detail | Compression bottleneck |
| Reconstruction artifacts | Imperfect decoder |
| Semantic drift | Encoder information loss |
| Decoder dependence | Final quality limited by decoder |
| Compression bias | Latent space may favor certain textures |

The decoder becomes part of the generative pipeline. Even perfect latent denoising cannot exceed decoder reconstruction quality.

Pixel Diffusion Versus Latent Diffusion

| Property | Pixel Diffusion | Latent Diffusion |
| --- | --- | --- |
| Operating space | Pixels | Compressed latents |
| Tensor size | Large | Small |
| Compute cost | High | Lower |
| Memory usage | High | Lower |
| Fine detail modeling | Strong | Decoder-limited |
| Scalability | Harder | Easier |
| Sampling speed | Slower | Faster |

Pixel diffusion may preserve fine textures more directly. Latent diffusion is usually more practical for large-scale systems.

PyTorch Example: Encoding and Diffusion

Suppose an autoencoder is defined as:

```python
encoder = AutoencoderEncoder()
decoder = AutoencoderDecoder()
```

Encode images:

```python
images = torch.randn(8, 3, 512, 512)

latents = encoder(images)
print(latents.shape)
# torch.Size([8, 4, 64, 64])
```

Add diffusion noise:

```python
noise = torch.randn_like(latents)
t = torch.randint(0, T, (8,))                        # one random timestep per sample

# extract() gathers alpha_bars[t] per sample and reshapes it to broadcast
# over the latent tensor's [B, C, H, W] dimensions.
alpha_bar_t = extract(alpha_bars, t, latents.shape)

z_t = (
    torch.sqrt(alpha_bar_t) * latents
    + torch.sqrt(1 - alpha_bar_t) * noise
)
```

Predict noise:

```python
pred_noise = unet(z_t, t, text_embeddings)

loss = torch.nn.functional.mse_loss(pred_noise, noise)
```

Decode generated latent:

```python
# At generation time the fully denoised latent is decoded;
# here `latents` stands in for that final z_0.
generated_images = decoder(latents)
```

This structure is the core workflow of many modern latent diffusion systems.

Classifier-Free Guidance in Latent Space

Latent diffusion commonly uses classifier-free guidance.

The model predicts:

\epsilon_\theta(z_t, t, c)

and

\epsilon_\theta(z_t, t, \varnothing).

The guided prediction becomes:

\hat{\epsilon} = \epsilon_\text{uncond} + s\,(\epsilon_\text{cond} - \epsilon_\text{uncond}).

The guidance scale s controls prompt strength.

| Guidance scale | Effect |
| --- | --- |
| Small | More diversity |
| Moderate | Better prompt adherence |
| Large | Sharper but less diverse outputs |

Very large guidance scales may produce oversaturated or unstable images.
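In code, classifier-free guidance combines two forward passes of the denoiser. A sketch assuming a unet and precomputed conditional/unconditional text embeddings:

```python
import torch

def guided_noise(unet, z_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    # Run both passes in a single batch for efficiency.
    z_in = torch.cat([z_t, z_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    emb = torch.cat([uncond_emb, cond_emb], dim=0)
    eps_uncond, eps_cond = unet(z_in, t_in, emb).chunk(2, dim=0)
    # eps_hat = eps_uncond + s * (eps_cond - eps_uncond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```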

Latent Diffusion Beyond Images

The latent diffusion idea generalizes beyond image generation.

Applications include:

| Domain | Latent representation |
| --- | --- |
| Video generation | Spatiotemporal latent tensors |
| Audio synthesis | Spectrogram or audio latents |
| 3D generation | Geometry or radiance field latents |
| Motion generation | Pose or trajectory latents |
| Molecular generation | Graph or embedding latents |

The key principle remains unchanged:

  1. Learn a compressed representation
  2. Diffuse in latent space
  3. Decode into the original domain

Why Latent Diffusion Became Dominant

Latent diffusion balanced three competing requirements:

| Requirement | Challenge |
| --- | --- |
| High visual quality | Requires expressive models |
| Large image resolution | Requires large tensors |
| Practical compute cost | Requires efficiency |

Pixel-space diffusion achieved quality but was expensive. GANs were fast but often unstable. Autoregressive image models scaled poorly with resolution.

Latent diffusion provided a practical compromise:

| Feature | Result |
| --- | --- |
| Compression | Lower compute |
| Diffusion training | Stable optimization |
| Attention conditioning | Strong prompt control |
| U-Net denoising | High-quality structure generation |

This combination made large-scale open text-to-image systems feasible.

Summary

Latent diffusion performs diffusion in a compressed latent representation rather than directly in pixel space.

An encoder maps images into latent tensors:

z_0 = \mathcal{E}(x_0).

The diffusion process operates on these latent representations:

z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.

A denoising network learns the reverse process in latent space. After denoising, a decoder reconstructs the final image.

Latent diffusion greatly reduces computational cost while preserving semantic structure. This architecture became foundational in modern text-to-image generation systems because it combines efficient compression, stable diffusion training, and flexible conditioning mechanisms.