Early diffusion systems used convolutional U-Nets as denoising networks. U-Nets worked well because images contain strong local structure, and convolutions efficiently model nearby spatial relationships.
However, transformers became increasingly attractive because they scale effectively with model size, support flexible conditioning, and capture long-range dependencies more naturally than convolutional architectures.
Diffusion Transformers, often abbreviated DiTs, replace or augment convolutional U-Nets with transformer-based architectures. Instead of treating images as grids processed by convolutions, DiTs treat latent representations as token sequences processed by self-attention.
This transition mirrors the broader shift from convolutional models to transformers in computer vision and language modeling.
From U-Nets to Transformers
A standard diffusion U-Net processes tensors such as:

[B, C, H, W]

using convolutional layers, residual blocks, and attention modules.
A diffusion transformer instead converts latent tensors into tokens:

[B, N, D]

where:

| Symbol | Meaning |
|---|---|
| B | Batch size |
| N | Number of tokens |
| D | Embedding dimension |
The transformer then processes these tokens using self-attention and feedforward layers.
This changes the denoising problem from spatial convolution to sequence modeling.
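As a concrete sketch of that change, the reshaping from a spatial grid to a token sequence can be done with plain tensor operations (patch size 2 and the toy dimensions here are assumptions for illustration, with no learned projection yet):

```python
import torch

B, C, H, W = 2, 4, 64, 64
p = 2  # assumed patch size, for illustration only

latents = torch.randn(B, C, H, W)

# Split the spatial grid into p x p patches and flatten each one:
# [B, C, H, W] -> [B, N, C * p * p], where N = (H // p) * (W // p)
patches = latents.unfold(2, p, p).unfold(3, p, p)   # [B, C, H//p, W//p, p, p]
patches = patches.permute(0, 2, 3, 1, 4, 5)         # [B, H//p, W//p, C, p, p]
tokens = patches.reshape(B, (H // p) * (W // p), C * p * p)

print(tokens.shape)  # torch.Size([2, 1024, 16])
```

A real model would follow this with a linear projection of each flattened patch into the embedding dimension D.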
Patch Tokenization
Diffusion transformers usually operate on latent patches rather than individual pixels.
Suppose a latent tensor has shape:

[B, C, H, W]

For example:

[B, 4, 64, 64]

The tensor is divided into patches.

If the patch size is:

p × p

then the number of patches becomes:

N = (H / p) × (W / p)
Each patch is flattened and projected into an embedding vector.
For example:
| Latent size | Patch size | Number of tokens |
|---|---|---|
| 64 × 64 | 2 | 1024 |
| 64 × 64 | 4 | 256 |
| 32 × 32 | 2 | 256 |
Patch embeddings transform spatial tensors into transformer token sequences.
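The patch arithmetic is easy to verify directly; a small helper (hypothetical, for illustration):

```python
def num_tokens(h, w, p):
    """Number of patch tokens for an h x w latent grid with patch size p."""
    assert h % p == 0 and w % p == 0, "latent size must be divisible by patch size"
    return (h // p) * (w // p)

print(num_tokens(64, 64, 2))  # 1024
print(num_tokens(64, 64, 4))  # 256
print(num_tokens(32, 32, 2))  # 256
```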
Transformer Denoising Objective
The diffusion objective remains unchanged.
Given a noisy latent:

z_t

the transformer predicts:

ε_θ(z_t, t, c)

The loss is still:

L = E[ ‖ ε − ε_θ(z_t, t, c) ‖² ]
The main difference lies in the network architecture, not in the diffusion mathematics.
Self-Attention in Diffusion
Transformers process tokens using self-attention.
The attention mechanism computes:
Attention(Q, K, V) = softmax(QKᵀ / √d) V
For diffusion transformers:
| Tensor | Meaning |
|---|---|
| Q | Query embeddings |
| K | Key embeddings |
| V | Value embeddings |
Every latent token can attend to every other token.
This provides:
| Benefit | Explanation |
|---|---|
| Global receptive field | Long-range dependencies modeled directly |
| Flexible conditioning | Easy integration of text tokens |
| Better scaling | Transformer scaling laws often favorable |
| Unified architecture | Similarity to language models |
Unlike convolutions, attention does not rely on local neighborhoods.
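Written directly from the attention formula, a single-head pass over latent tokens might look like this (toy sizes, no learned projections):

```python
import torch
import torch.nn.functional as F

B, N, d = 2, 16, 8  # batch, tokens, embedding dim (toy sizes)
Q = torch.randn(B, N, d)
K = torch.randn(B, N, d)
V = torch.randn(B, N, d)

# softmax(Q K^T / sqrt(d)) V
scores = Q @ K.transpose(-2, -1) / d ** 0.5  # [B, N, N]: every token scores every token
weights = F.softmax(scores, dim=-1)          # rows sum to 1
out = weights @ V                            # [B, N, d]

print(out.shape)  # torch.Size([2, 16, 8])
```

The [B, N, N] score matrix is what gives attention its global receptive field, and also its quadratic cost.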
Positional Embeddings
Transformers require positional information because self-attention alone is permutation invariant.
Diffusion transformers add positional embeddings to the patch tokens:

xᵢ = eᵢ + pᵢ

where eᵢ is the patch embedding and the positional vector pᵢ encodes spatial location.
Common methods include:
| Method | Description |
|---|---|
| Learned embeddings | Trainable position vectors |
| Sinusoidal embeddings | Fixed Fourier-like encoding |
| Rotary embeddings | Relative rotational encoding |
| 2D positional embeddings | Separate height and width structure |
Without positional embeddings, the model would not know where image patches belong spatially.
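As a sketch of the 2D variant, assuming fixed sinusoidal encodings computed separately for rows and columns and then concatenated (a common choice, though implementations differ):

```python
import torch

def sincos_1d(positions, dim):
    """Fixed sinusoidal encoding for a 1D sequence of positions."""
    freqs = torch.exp(
        -torch.arange(0, dim, 2).float() * (torch.log(torch.tensor(10000.0)) / dim)
    )
    angles = positions[:, None].float() * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # [len, dim]

def sincos_2d(h, w, dim):
    """2D positional embedding: half the channels encode rows, half encode columns."""
    row = sincos_1d(torch.arange(h), dim // 2)  # [h, dim//2]
    col = sincos_1d(torch.arange(w), dim // 2)  # [w, dim//2]
    grid = torch.cat([
        row[:, None, :].expand(h, w, dim // 2),
        col[None, :, :].expand(h, w, dim // 2),
    ], dim=-1)
    return grid.reshape(h * w, dim)  # one vector per patch position

pos = sincos_2d(32, 32, 768)  # 32 x 32 patch grid -> 1024 position vectors
print(pos.shape)  # torch.Size([1024, 768])
```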
Timestep Conditioning
The denoising network must know the diffusion timestep.
A timestep embedding:

emb(t)

is injected into transformer blocks.

Sinusoidal timestep embeddings are common:

emb(t)[2i] = sin(t / 10000^(2i/d)),  emb(t)[2i+1] = cos(t / 10000^(2i/d))
These embeddings are passed through learned projection layers before conditioning the transformer.
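A sketch of this pipeline, assuming a sinusoidal embedding followed by a small learned MLP (the dimensions 256 and 768 are illustrative):

```python
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Sinusoidal embedding of diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(
        -torch.log(torch.tensor(10000.0)) * torch.arange(half).float() / half
    )
    angles = t[:, None].float() * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # [B, dim]

# Learned projection applied before conditioning the transformer
proj = nn.Sequential(nn.Linear(256, 768), nn.SiLU(), nn.Linear(768, 768))

t = torch.randint(0, 1000, (8,))  # one timestep per batch element
emb = proj(timestep_embedding(t, 256))
print(emb.shape)  # torch.Size([8, 768])
```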
Adaptive Layer Normalization
Many diffusion transformers use adaptive layer normalization, often called AdaLN.
Instead of fixed normalization parameters, timestep and conditioning embeddings modulate activations.
Standard layer normalization is:

LN(x) = γ · (x − μ) / σ + β

Adaptive layer normalization replaces γ and β with conditioning-dependent parameters:

AdaLN(x, c) = γ(c) · (x − μ) / σ + β(c)
This allows prompt and timestep information to influence the transformer at every layer.
AdaLN became important because it provides stable conditioning without requiring heavy cross-attention everywhere.
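A minimal AdaLN sketch, where a conditioning vector predicts per-channel scale and shift (zero-initializing the modulation layer follows common DiT practice but is an implementation choice):

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Layer norm whose scale and shift are predicted from a conditioning embedding."""

    def __init__(self, dim, cond_dim):
        super().__init__()
        # Normalize without learned affine parameters; conditioning supplies them.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)
        # Zero init: the block starts as plain layer norm and learns modulation.
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x, cond):
        # x: [B, N, dim], cond: [B, cond_dim]
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale[:, None, :]) + shift[:, None, :]

ada = AdaLN(dim=768, cond_dim=768)
x = torch.randn(2, 16, 768)
cond = torch.randn(2, 768)
print(ada(x, cond).shape)  # torch.Size([2, 16, 768])
```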
Diffusion Transformer Block
A diffusion transformer block usually contains:
- Layer normalization
- Self-attention
- Residual connection
- Feedforward network
- Conditioning modulation
A simplified structure applies normalization and conditioning before each sublayer, with residual connections around both the attention and feedforward stages.
Conditioning enters through:
| Conditioning source | Mechanism |
|---|---|
| Timestep | AdaLN or embedding injection |
| Text prompt | Cross-attention |
| Class labels | Embedding modulation |
| Image conditioning | Additional tokens |
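Putting these pieces together, a simplified block might look like the following sketch (conditioning is injected additively here for brevity; real DiT blocks typically use AdaLN modulation, and cross-attention is omitted):

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Simplified diffusion transformer block: norm -> attention -> residual,
    then norm -> feedforward -> residual."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, cond):
        # Timestep/prompt conditioning added to every token (simplified).
        x = x + cond[:, None, :]
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]    # self-attention with residual connection
        x = x + self.mlp(self.norm2(x))  # feedforward with residual connection
        return x

block = DiTBlock()
tokens = torch.randn(2, 64, 768)
cond = torch.randn(2, 768)
print(block(tokens, cond).shape)  # torch.Size([2, 64, 768])
```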
Cross-Attention Conditioning
Text-to-image diffusion transformers usually use cross-attention.
Text embeddings produced by the text encoder interact with the image latent tokens.
The transformer computes:
| Tensor | Source |
|---|---|
| Queries (Q) | Image tokens |
| Keys (K) | Text tokens |
| Values (V) | Text tokens |
This lets image generation depend on language semantics.
Compared with convolutional U-Nets, transformers integrate multimodal conditioning naturally because all modalities become token sequences.
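A cross-attention sketch with queries from image tokens and keys/values from text tokens (the shapes, including the 77-token prompt length, are illustrative):

```python
import torch
import torch.nn as nn

dim = 768
cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)

image_tokens = torch.randn(2, 1024, dim)  # queries: latent patch tokens
text_tokens = torch.randn(2, 77, dim)     # keys and values: prompt embeddings

out, attn_weights = cross_attn(
    query=image_tokens,
    key=text_tokens,
    value=text_tokens,
)
print(out.shape)           # torch.Size([2, 1024, 768])
print(attn_weights.shape)  # torch.Size([2, 1024, 77])
```

Each image token produces a distribution over the 77 text tokens, which is how prompt semantics steer individual image regions.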
DiT Architecture
A canonical diffusion transformer architecture typically includes:
| Stage | Purpose |
|---|---|
| Patch embedding | Convert latent patches into tokens |
| Positional embedding | Encode spatial structure |
| Transformer blocks | Perform denoising computation |
| Conditioning layers | Inject timestep and text information |
| Output projection | Convert tokens back to latent patches |
The overall process runs: patchify the latent, add positional embeddings, apply conditioned transformer blocks, then project tokens back to patches.
The transformer predicts latent noise, which is reshaped back into latent spatial form.
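The final unpatchify step can be sketched as follows (the patch size, channel count, and dimensions are assumptions matching the earlier examples):

```python
import torch
import torch.nn as nn

B, N, D = 2, 1024, 768
C, p = 4, 2                  # latent channels and patch size (assumed)
h = w = int(N ** 0.5)        # 32 x 32 patch grid

# Output projection: each token predicts one p x p latent patch for all channels
proj = nn.Linear(D, C * p * p)
tokens = torch.randn(B, N, D)

x = proj(tokens).reshape(B, h, w, C, p, p)
x = x.permute(0, 3, 1, 4, 2, 5)               # [B, C, h, p, w, p]
latent_noise = x.reshape(B, C, h * p, w * p)  # [B, C, 64, 64]

print(latent_noise.shape)  # torch.Size([2, 4, 64, 64])
```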
Scaling Properties
Transformers often scale better than convolutional U-Nets as model size increases.
Empirically:
| Increase | Effect |
|---|---|
| More parameters | Better sample quality |
| More data | Stronger generalization |
| More compute | Improved prompt adherence |
| Larger context | Better compositionality |
This resembles scaling behavior in language models.
Large diffusion transformers may learn richer semantic structure and stronger multimodal alignment than smaller convolutional models.
Computational Complexity
Transformers also introduce challenges.
Self-attention has quadratic complexity:

O(N²)

with respect to the number of tokens N.
For high-resolution images, token count becomes large.
Example (patch size 2 over the latent grid):

| Latent grid | Patch size | Tokens |
|---|---|---|
| 64 × 64 | 2 | 1024 |
| 128 × 128 | 2 | 4096 |
| 256 × 256 | 2 | 16384 |
Attention cost grows rapidly.
Efficient attention methods therefore become important.
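The quadratic growth is easy to quantify by counting pairwise token interactions:

```python
def attention_pairs(num_tokens):
    """Number of pairwise interactions in full self-attention."""
    return num_tokens ** 2

for n in [1024, 4096, 16384]:
    print(n, attention_pairs(n))
# 1024 1048576
# 4096 16777216
# 16384 268435456
```

A 16× increase in tokens means a 256× increase in attention work, which is why efficient variants matter at high resolution.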
Efficient Attention Methods
Modern diffusion transformers use attention optimizations such as:
| Method | Purpose |
|---|---|
| FlashAttention | Faster memory-efficient attention |
| Windowed attention | Restrict local attention |
| Sparse attention | Reduce pairwise interactions |
| Linear attention | Approximate quadratic attention |
| Multi-query attention | Reduce memory usage |
| Token merging | Reduce sequence length |
These techniques make large diffusion transformers practical at high resolution.
Latent Space and Transformers
Most diffusion transformers operate in latent space rather than pixel space.
This reduces token count dramatically.
Suppose a latent tensor has shape:

[B, 4, 64, 64]

Using patches of size 2, this produces only 1024 tokens.
Without latent compression, operating directly on images would require vastly more tokens.
Latent diffusion and transformers therefore complement each other:
| Technique | Benefit |
|---|---|
| Latent diffusion | Smaller spatial representation |
| Transformers | Flexible global modeling |
Together they enable scalable generative systems.
Diffusion Transformers for Video
Transformers naturally extend to video because videos can also be represented as token sequences.
A latent video tensor:

[B, C, F, H, W]

can be patchified into spatiotemporal tokens.
The transformer then models:
| Dependency type | Example |
|---|---|
| Spatial | Relationships within a frame |
| Temporal | Motion across frames |
| Cross-modal | Text-video conditioning |
Video diffusion transformers often use factorized attention:
| Attention type | Scope |
|---|---|
| Spatial attention | Within frame |
| Temporal attention | Across frames |
| Cross-attention | Prompt conditioning |
This improves efficiency relative to full attention across all video tokens.
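Factorized attention can be sketched by reshaping the token grid so that each attention pass sees a short sequence (toy sizes; a single attention module is reused here purely for brevity, whereas real models use separate spatial and temporal layers):

```python
import torch
import torch.nn as nn

B, F, HW, D = 2, 8, 256, 256  # batch, frames, patches per frame, dim (toy sizes)
tokens = torch.randn(B, F, HW, D)

attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

# Spatial attention: each frame attends within itself
x = tokens.reshape(B * F, HW, D)
x = x + attn(x, x, x)[0]

# Temporal attention: each spatial location attends across frames
x = x.reshape(B, F, HW, D).permute(0, 2, 1, 3).reshape(B * HW, F, D)
x = x + attn(x, x, x)[0]

out = x.reshape(B, HW, F, D).permute(0, 2, 1, 3)  # back to [B, F, HW, D]
print(out.shape)  # torch.Size([2, 8, 256, 256])
```

Each pass attends over HW or F tokens instead of F × HW, which is the source of the efficiency gain.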
Training Diffusion Transformers
Training resembles standard diffusion training.
Given a noisy latent:

z_t

the transformer predicts the noise:

ε_θ(z_t, t, c)
PyTorch example:

```python
pred_noise = dit(
    latents,
    timesteps,
    text_embeddings,
)

loss = torch.nn.functional.mse_loss(
    pred_noise,
    noise,
)
```

The objective is unchanged from U-Net diffusion systems.
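The snippet above assumes latents, noise, timesteps, and a dit model already exist; a fuller training step with a stand-in denoiser and a toy linear noise schedule (all hypothetical choices for illustration) might look like:

```python
import torch
import torch.nn as nn

# Stand-in denoiser: any module mapping (latents, t, text) -> predicted noise.
class TinyDiT(nn.Module):
    def __init__(self, c=4):
        super().__init__()
        self.net = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, latents, timesteps, text_embeddings):
        return self.net(latents)

dit = TinyDiT()
opt = torch.optim.AdamW(dit.parameters(), lr=1e-4)

latents = torch.randn(8, 4, 64, 64)
text_embeddings = torch.randn(8, 77, 768)
timesteps = torch.randint(0, 1000, (8,))
noise = torch.randn_like(latents)

# Forward diffusion with a toy linear schedule, then predict the added noise
alpha = (1 - timesteps.float() / 1000).view(-1, 1, 1, 1)
noisy_latents = alpha.sqrt() * latents + (1 - alpha).sqrt() * noise

pred_noise = dit(noisy_latents, timesteps, text_embeddings)
loss = nn.functional.mse_loss(pred_noise, noise)

loss.backward()
opt.step()
print(pred_noise.shape)  # torch.Size([8, 4, 64, 64])
```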
Advantages of Diffusion Transformers
Diffusion transformers provide several benefits.
| Advantage | Explanation |
|---|---|
| Global context modeling | Attention connects distant regions |
| Strong scaling behavior | Similar to large language models |
| Unified multimodal architecture | Text and image tokens integrate naturally |
| Flexible conditioning | Multiple modalities become token streams |
| Better compositionality | Improved concept interaction |
Transformers also simplify architectural unification across text, image, audio, and video generation.
Limitations of Diffusion Transformers
Transformers also have weaknesses.
| Limitation | Cause |
|---|---|
| High memory usage | Quadratic attention |
| Large compute cost | Long token sequences |
| Training instability | Very deep transformer optimization |
| Slow inference | Many denoising steps plus attention cost |
| Data hunger | Large transformers require large datasets |
Efficient training and inference remain active research areas.
Relationship to Foundation Models
Diffusion transformers move generative modeling closer to foundation-model architectures.
A single transformer architecture can potentially process:
| Modality | Token type |
|---|---|
| Text | Wordpiece tokens |
| Images | Patch tokens |
| Video | Spatiotemporal tokens |
| Audio | Spectrogram tokens |
| 3D scenes | Spatial tokens |
This motivates unified multimodal generative systems.
Modern research increasingly explores:
| Direction | Goal |
|---|---|
| Unified token spaces | Shared multimodal representations |
| Joint training | Multiple modalities together |
| World models | Predictive generative simulation |
| Large multimodal transformers | General-purpose generative systems |
Diffusion transformers fit naturally into this trend.
PyTorch Patch Embedding Example
A simple patch embedding layer:
```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    def __init__(
        self,
        in_channels=4,
        patch_size=2,
        embed_dim=768,
    ):
        super().__init__()
        # A strided convolution splits the latent into patches and projects
        # each one into the embedding dimension in a single operation.
        self.proj = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size,
        )

    def forward(self, x):
        """
        x: [B, C, H, W]
        """
        x = self.proj(x)        # [B, D, H/p, W/p]
        B, D, H, W = x.shape
        x = x.flatten(2)        # [B, D, N]
        x = x.transpose(1, 2)   # [B, N, D]
        return x
```

Usage:

```python
latents = torch.randn(8, 4, 64, 64)
patch_embed = PatchEmbed()
tokens = patch_embed(latents)
print(tokens.shape)
# torch.Size([8, 1024, 768])
```

The latent image becomes a sequence of transformer tokens.
Summary
Diffusion transformers replace convolutional denoising networks with transformer architectures operating on latent token sequences.
Latent tensors are patchified into tokens, processed using self-attention, conditioned on timestep and prompt embeddings, and projected back into latent space for denoising.
The diffusion objective remains unchanged:

L = E[ ‖ ε − ε_θ(z_t, t, c) ‖² ]
Transformers provide strong global modeling, flexible conditioning, and favorable scaling behavior. Combined with latent diffusion, they form a scalable architecture for modern multimodal generative systems.