Diffusion Transformers

Early diffusion systems used convolutional U-Nets as denoising networks. U-Nets worked well because images contain strong local structure, and convolutions efficiently model nearby spatial relationships.

However, transformers became increasingly attractive because they scale effectively with model size, support flexible conditioning, and capture long-range dependencies more naturally than convolutional architectures.

Diffusion Transformers, often abbreviated DiTs, replace or augment convolutional U-Nets with transformer-based architectures. Instead of treating images as grids processed by convolutions, DiTs treat latent representations as token sequences processed by self-attention.

This transition mirrors the broader shift from convolutional models to transformers in computer vision and language modeling.

From U-Nets to Transformers

A standard diffusion U-Net processes tensors such as:

[B, C, H, W]

using convolutional layers, residual blocks, and attention modules.

A diffusion transformer instead converts latent tensors into tokens:

z \in \mathbb{R}^{B \times N \times D}

where:

| Symbol | Meaning |
| --- | --- |
| B | Batch size |
| N | Number of tokens |
| D | Embedding dimension |

The transformer then processes these tokens using self-attention and feedforward layers.

This changes the denoising problem from spatial convolution to sequence modeling.

Patch Tokenization

Diffusion transformers usually operate on latent patches rather than individual pixels.

Suppose a latent tensor has shape:

[B, C, H, W]

For example:

[B, 4, 64, 64]

The tensor is divided into patches.

If the patch size is:

P \times P,

then the number of patches becomes:

N = \frac{H}{P} \cdot \frac{W}{P}.

Each patch is flattened and projected into an embedding vector.

For example:

| Latent size | Patch size | Number of tokens |
| --- | --- | --- |
| 64×64 | 2×2 | 1024 |
| 64×64 | 4×4 | 256 |
| 32×32 | 2×2 | 256 |

Patch embeddings transform spatial tensors into transformer token sequences.

Transformer Denoising Objective

The diffusion objective remains unchanged.

Given noisy latent:

z_t = \sqrt{\bar{\alpha}_t} z_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,

the transformer predicts:

\epsilon_\theta(z_t, t, c).

The loss is still:

\mathcal{L} = \mathbb{E}\left[ \| \epsilon - \epsilon_\theta(z_t, t, c) \|_2^2 \right].

The main difference lies in the network architecture, not in the diffusion mathematics.
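
As a concrete sketch, the noisy latent can be formed directly from the formula above. The linear alphas_cumprod schedule below is a placeholder for illustration, not a schedule used by any particular model:

import torch

# Placeholder schedule holding \bar{alpha}_t for each timestep (illustrative values only).
alphas_cumprod = torch.linspace(0.9999, 0.005, 1000)

z0 = torch.randn(8, 4, 64, 64)               # clean latents
noise = torch.randn_like(z0)                 # epsilon ~ N(0, I)
t = torch.randint(0, 1000, (8,))             # a random timestep per sample

abar = alphas_cumprod[t].view(-1, 1, 1, 1)   # broadcast \bar{alpha}_t over channels and spatial dims
z_t = abar.sqrt() * z0 + (1 - abar).sqrt() * noise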

Self-Attention in Diffusion

Transformers process tokens using self-attention.

The attention mechanism computes:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d}} \right) V.

genui{“math_block_widget_always_prefetch_v2”:{“content”:"\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V"}}

For diffusion transformers:

| Tensor | Meaning |
| --- | --- |
| Q | Query embeddings |
| K | Key embeddings |
| V | Value embeddings |

Every latent token can attend to every other token.

This provides:

| Benefit | Explanation |
| --- | --- |
| Global receptive field | Long-range dependencies modeled directly |
| Flexible conditioning | Easy integration of text tokens |
| Better scaling | Transformer scaling laws often favorable |
| Unified architecture | Similarity to language models |

Unlike convolutions, attention does not rely on local neighborhoods.
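
A minimal single-head self-attention sketch over latent tokens, with hypothetical shapes, makes the formula concrete:

import torch
import torch.nn as nn

B, N, D = 8, 1024, 768
tokens = torch.randn(B, N, D)

# Learned projections for queries, keys, and values.
q_proj, k_proj, v_proj = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)
Q, K, V = q_proj(tokens), k_proj(tokens), v_proj(tokens)

# softmax(Q K^T / sqrt(d)) V, computed over all token pairs.
attn = torch.softmax(Q @ K.transpose(1, 2) / D ** 0.5, dim=-1)
out = attn @ V   # [B, N, D]: each token aggregates information from every other token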

Positional Embeddings

Transformers require positional information because self-attention alone is permutation invariant.

Diffusion transformers add positional embeddings:

z'_i = z_i + p_i.

The positional vector p_i encodes spatial location.

Common methods include:

| Method | Description |
| --- | --- |
| Learned embeddings | Trainable position vectors |
| Sinusoidal embeddings | Fixed Fourier-like encoding |
| Rotary embeddings | Relative rotational encoding |
| 2D positional embeddings | Separate height and width structure |

Without positional embeddings, the model would not know where image patches belong spatially.
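
A minimal sketch of learned 2D positional embeddings with separate height and width tables; the even split of the embedding dimension is an illustrative choice, not a prescribed one:

import torch
import torch.nn as nn

class Learned2DPosEmbed(nn.Module):
    """Adds z'_i = z_i + p_i, with p_i built from row and column embeddings."""

    def __init__(self, grid_size=32, embed_dim=768):
        super().__init__()
        # 32x32 grid matches a 64x64 latent with 2x2 patches (1024 tokens).
        self.row = nn.Embedding(grid_size, embed_dim // 2)
        self.col = nn.Embedding(grid_size, embed_dim // 2)
        self.grid_size = grid_size

    def forward(self, tokens):
        # tokens: [B, N, D] with N = grid_size * grid_size in row-major order
        idx = torch.arange(self.grid_size, device=tokens.device)
        rows = self.row(idx)[:, None, :].expand(-1, self.grid_size, -1)
        cols = self.col(idx)[None, :, :].expand(self.grid_size, -1, -1)
        pos = torch.cat([rows, cols], dim=-1).reshape(1, -1, tokens.shape[-1])
        return tokens + pos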

Timestep Conditioning

The denoising network must know the diffusion timestep.

A timestep embedding:

e_t = \mathrm{Embed}(t)

is injected into transformer blocks.

Sinusoidal timestep embeddings are common:

\mathrm{PE}(t, 2i) = \sin\left( \frac{t}{10000^{2i/d}} \right), \qquad \mathrm{PE}(t, 2i+1) = \cos\left( \frac{t}{10000^{2i/d}} \right).

These embeddings are passed through learned projection layers before conditioning the transformer.
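
A sketch of the sinusoidal timestep embedding followed by a small learned projection; the MLP shape here is an assumption for illustration:

import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim=256):
    """Sinusoidal embedding: PE(t, 2i) = sin(t / 10000^(2i/d)), PE(t, 2i+1) = cos(...)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # [B, dim]

# Learned projection applied before the embedding conditions the transformer blocks.
time_mlp = nn.Sequential(nn.Linear(256, 768), nn.SiLU(), nn.Linear(768, 768))

t = torch.randint(0, 1000, (8,))
e_t = time_mlp(timestep_embedding(t))   # [8, 768]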

Adaptive Layer Normalization

Many diffusion transformers use adaptive layer normalization, often called AdaLN.

Instead of fixed normalization parameters, timestep and conditioning embeddings modulate activations.

Standard layer normalization is:

\mathrm{LN}(x) = \gamma \frac{x - \mu}{\sigma} + \beta.

Adaptive layer normalization replaces γ and β with conditioning-dependent parameters:

\gamma = \gamma(c, t), \qquad \beta = \beta(c, t).

This allows prompt and timestep information to influence the transformer at every layer.

AdaLN became important because it provides stable conditioning without requiring heavy cross-attention everywhere.
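
A minimal AdaLN sketch in the spirit of the equations above, where a conditioning vector such as the timestep embedding produces per-channel scale and shift. The zero-initialized projection is a common stabilization choice assumed here, not taken from any specific implementation:

import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim=768, cond_dim=768):
        super().__init__()
        # elementwise_affine=False: gamma and beta come from the conditioning instead.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x, cond):
        # x: [B, N, D] tokens, cond: [B, cond_dim] timestep/class embedding
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale[:, None, :]) + shift[:, None, :]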

Diffusion Transformer Block

A diffusion transformer block usually contains:

  1. Layer normalization
  2. Self-attention
  3. Residual connection
  4. Feedforward network
  5. Conditioning modulation

A simplified structure:

x \rightarrow \mathrm{Attention} \rightarrow \mathrm{MLP} \rightarrow x'.

Conditioning enters through:

| Conditioning source | Mechanism |
| --- | --- |
| Timestep | AdaLN or embedding injection |
| Text prompt | Cross-attention |
| Class labels | Embedding modulation |
| Image conditioning | Additional tokens |
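
A simplified sketch of such a block, combining AdaLN-style timestep modulation with self-attention and an MLP; the exact layer layout in published DiT models differs in detail:

import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Simplified block: AdaLN-style modulation, self-attention, MLP, residuals."""

    def __init__(self, dim=768, heads=12, cond_dim=768):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Conditioning (e.g. timestep embedding) produces scale/shift for both norms.
        self.modulation = nn.Linear(cond_dim, 4 * dim)

    def forward(self, x, t_emb):
        # x: [B, N, D] tokens, t_emb: [B, cond_dim] conditioning embedding
        s1, b1, s2, b2 = self.modulation(t_emb)[:, None, :].chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        h = self.norm2(x) * (1 + s2) + b2
        x = x + self.mlp(h)                                  # feedforward + residual
        return x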

Cross-Attention Conditioning

Text-to-image diffusion transformers usually use cross-attention.

Text embeddings:

c \in \mathbb{R}^{B \times T \times D}

interact with image latent tokens.

The transformer computes:

| Tensor | Source |
| --- | --- |
| Queries | Image tokens |
| Keys | Text tokens |
| Values | Text tokens |

This lets image generation depend on language semantics.

Compared with convolutional U-Nets, transformers integrate multimodal conditioning naturally because all modalities become token sequences.
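
A cross-attention sketch in which queries come from image tokens and keys and values from text tokens; the shapes are hypothetical:

import torch
import torch.nn as nn

B, N, T, D = 8, 1024, 77, 768
image_tokens = torch.randn(B, N, D)
text_tokens = torch.randn(B, T, D)   # e.g. the output of a frozen text encoder

cross_attn = nn.MultiheadAttention(D, num_heads=12, batch_first=True)

# Queries from image tokens; keys and values from text tokens.
out, _ = cross_attn(query=image_tokens, key=text_tokens, value=text_tokens)
print(out.shape)   # torch.Size([8, 1024, 768]): image tokens updated with text semantics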

DiT Architecture

A canonical diffusion transformer architecture typically includes:

| Stage | Purpose |
| --- | --- |
| Patch embedding | Convert latent patches into tokens |
| Positional embedding | Encode spatial structure |
| Transformer blocks | Perform denoising computation |
| Conditioning layers | Inject timestep and text information |
| Output projection | Convert tokens back to latent patches |

The overall process:

z_t \rightarrow \text{tokens} \rightarrow \text{transformer} \rightarrow \hat{\epsilon}

The transformer predicts latent noise, which is reshaped back into latent spatial form.
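
A sketch of the output projection and the reshape back to latent spatial form, assuming 2×2 patches on a 4-channel 64×64 latent:

import torch
import torch.nn as nn

B, N, D = 8, 1024, 768
C, P, H, W = 4, 2, 64, 64
tokens = torch.randn(B, N, D)   # transformer output

# Project each token to a flattened patch of predicted noise.
to_patch = nn.Linear(D, P * P * C)
patches = to_patch(tokens)      # [B, N, P*P*C]

# Reshape the token sequence back into the [B, C, H, W] latent grid.
h, w = H // P, W // P
eps_hat = (
    patches.view(B, h, w, P, P, C)
    .permute(0, 5, 1, 3, 2, 4)
    .reshape(B, C, H, W)
)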

Scaling Properties

Transformers often scale better than convolutional U-Nets as model size increases.

Empirically:

| Increase | Effect |
| --- | --- |
| More parameters | Better sample quality |
| More data | Stronger generalization |
| More compute | Improved prompt adherence |
| Larger context | Better compositionality |

This resembles scaling behavior in language models.

Large diffusion transformers may learn richer semantic structure and stronger multimodal alignment than smaller convolutional models.

Computational Complexity

Transformers also introduce challenges.

Self-attention has quadratic complexity:

O(N^2)

with respect to the number of tokens.

For high-resolution images, token count becomes large.

Example:

| Resolution | Patch size | Tokens |
| --- | --- | --- |
| 64×64 | 2×2 | 1024 |
| 128×128 | 2×2 | 4096 |
| 256×256 | 2×2 | 16384 |

Attention cost grows rapidly.
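
A quick back-of-the-envelope calculation shows how the token count and the number of pairwise interactions grow (resolutions as in the table above, 2×2 patches):

patch = 2
for res in (64, 128, 256):
    tokens = (res // patch) ** 2
    pairs = tokens ** 2   # O(N^2) pairwise attention interactions
    print(f"{res}x{res}: {tokens} tokens, {pairs:,} attention pairs")

# 64x64: 1024 tokens, 1,048,576 attention pairs
# 128x128: 4096 tokens, 16,777,216 attention pairs
# 256x256: 16384 tokens, 268,435,456 attention pairs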

Efficient attention methods therefore become important.

Efficient Attention Methods

Modern diffusion transformers use attention optimizations such as:

| Method | Purpose |
| --- | --- |
| FlashAttention | Exact attention computed with far less memory traffic |
| Windowed attention | Restrict attention to local windows |
| Sparse attention | Reduce the number of pairwise interactions |
| Linear attention | Approximate full attention at sub-quadratic cost |
| Multi-query attention | Share keys and values across heads to reduce memory |
| Token merging | Shorten the token sequence |

These techniques make large diffusion transformers practical at high resolution.
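
For example, PyTorch's built-in scaled_dot_product_attention (available since PyTorch 2.0) can dispatch to fused memory-efficient or FlashAttention kernels when the hardware and input shapes allow it:

import torch
import torch.nn.functional as F

B, heads, N, d = 8, 12, 4096, 64
q = torch.randn(B, heads, N, d)
k = torch.randn(B, heads, N, d)
v = torch.randn(B, heads, N, d)

# Uses a fused kernel (e.g. FlashAttention) when supported, avoiding the
# explicit [N, N] attention matrix in memory.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)   # torch.Size([8, 12, 4096, 64])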

Latent Space and Transformers

Most diffusion transformers operate in latent space rather than pixel space.

This reduces token count dramatically.

Suppose a latent tensor has shape:

[B, 4, 64, 64]

Using 2×2 patches:

N = (64/2)^2 = 1024.

Without latent compression, operating directly on 512×512 images would require vastly more tokens.

Latent diffusion and transformers therefore complement each other:

| Technique | Benefit |
| --- | --- |
| Latent diffusion | Smaller spatial representation |
| Transformers | Flexible global modeling |

Together they enable scalable generative systems.

Diffusion Transformers for Video

Transformers naturally extend to video because videos can also be represented as token sequences.

A latent video tensor:

[B, C, F, H, W]

can be patchified into spatiotemporal tokens.

The transformer then models:

| Dependency type | Example |
| --- | --- |
| Spatial | Relationships within a frame |
| Temporal | Motion across frames |
| Cross-modal | Text-video conditioning |

Video diffusion transformers often use factorized attention:

| Attention type | Scope |
| --- | --- |
| Spatial attention | Within frame |
| Temporal attention | Across frames |
| Cross-attention | Prompt conditioning |

This improves efficiency relative to full attention across all video tokens.
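
A sketch of factorized attention over video tokens, alternating spatial attention within each frame and temporal attention across frames; the shapes and reshaping scheme are illustrative assumptions:

import torch
import torch.nn as nn

B, F_, N, D = 2, 16, 1024, 768          # F_ frames, N tokens per frame
video_tokens = torch.randn(B, F_, N, D)

spatial_attn = nn.MultiheadAttention(D, num_heads=12, batch_first=True)
temporal_attn = nn.MultiheadAttention(D, num_heads=12, batch_first=True)

# Spatial attention: attend over the N tokens within each frame.
x = video_tokens.reshape(B * F_, N, D)
x = x + spatial_attn(x, x, x, need_weights=False)[0]

# Temporal attention: attend over the F_ frames at each spatial location.
x = x.reshape(B, F_, N, D).permute(0, 2, 1, 3).reshape(B * N, F_, D)
x = x + temporal_attn(x, x, x, need_weights=False)[0]
x = x.reshape(B, N, F_, D).permute(0, 2, 1, 3)   # back to [B, F_, N, D]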

Training Diffusion Transformers

Training resembles standard diffusion training.

Given latent:

z_t = \sqrt{\bar{\alpha}_t} z_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,

the transformer predicts noise:

\epsilon_\theta(z_t, t, c).

PyTorch example:

# The DiT predicts the noise added to the latents,
# conditioned on the timestep and text embeddings.
pred_noise = dit(
    latents,
    timesteps,
    text_embeddings
)

# Standard epsilon-prediction MSE loss.
loss = torch.nn.functional.mse_loss(
    pred_noise,
    noise
)

The objective is unchanged from U-Net diffusion systems.

Advantages of Diffusion Transformers

Diffusion transformers provide several benefits.

| Advantage | Explanation |
| --- | --- |
| Global context modeling | Attention connects distant regions |
| Strong scaling behavior | Similar to large language models |
| Unified multimodal architecture | Text and image tokens integrate naturally |
| Flexible conditioning | Multiple modalities become token streams |
| Better compositionality | Improved concept interaction |

Transformers also simplify architectural unification across text, image, audio, and video generation.

Limitations of Diffusion Transformers

Transformers also have weaknesses.

| Limitation | Cause |
| --- | --- |
| High memory usage | Quadratic attention |
| Large compute cost | Long token sequences |
| Training instability | Very deep transformer optimization |
| Slow inference | Many denoising steps plus attention cost |
| Data hunger | Large transformers require large datasets |

Efficient training and inference remain active research areas.

Relationship to Foundation Models

Diffusion transformers move generative modeling closer to foundation-model architectures.

A single transformer architecture can potentially process:

| Modality | Token type |
| --- | --- |
| Text | Wordpiece tokens |
| Images | Patch tokens |
| Video | Spatiotemporal tokens |
| Audio | Spectrogram tokens |
| 3D scenes | Spatial tokens |

This motivates unified multimodal generative systems.

Modern research increasingly explores:

| Direction | Goal |
| --- | --- |
| Unified token spaces | Shared multimodal representations |
| Joint training | Multiple modalities together |
| World models | Predictive generative simulation |
| Large multimodal transformers | General-purpose generative systems |

Diffusion transformers fit naturally into this trend.

PyTorch Patch Embedding Example

A simple patch embedding layer:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(
        self,
        in_channels=4,
        patch_size=2,
        embed_dim=768,
    ):
        super().__init__()

        self.proj = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size,
        )

    def forward(self, x):
        """
        x: [B, C, H, W] latent tensor
        returns: [B, N, D] token sequence
        """

        x = self.proj(x)        # [B, D, H/P, W/P]

        B, D, H, W = x.shape

        x = x.flatten(2)        # [B, D, N] with N = H * W patches
        x = x.transpose(1, 2)   # [B, N, D]

        return x

Usage:

latents = torch.randn(8, 4, 64, 64)

patch_embed = PatchEmbed()

tokens = patch_embed(latents)

print(tokens.shape)
# torch.Size([8, 1024, 768])

The latent image becomes a sequence of transformer tokens.

Summary

Diffusion transformers replace convolutional denoising networks with transformer architectures operating on latent token sequences.

Latent tensors are patchified into tokens, processed using self-attention, conditioned on timestep and prompt embeddings, and projected back into latent space for denoising.

The diffusion objective remains unchanged:

\mathcal{L} = \mathbb{E}\left[ \| \epsilon - \epsilon_\theta(z_t, t, c) \|_2^2 \right].

Transformers provide strong global modeling, flexible conditioning, and favorable scaling behavior. Combined with latent diffusion, they form a scalable architecture for modern multimodal generative systems.