Video Diffusion Systems

Video diffusion extends image diffusion from still images to moving sequences. Instead of generating one image, the model generates a sequence of frames that should remain visually coherent over time.

A video sample can be represented as a tensor:

x_0 \in \mathbb{R}^{B \times C \times F \times H \times W}

where B is the batch size, C the number of channels, F the number of frames, H the height, and W the width.

For example:

video = torch.randn(2, 3, 16, 256, 256)

This represents a batch of 2 videos, each with 3 color channels, 16 frames, and a spatial resolution of 256×256.

From Image Generation to Video Generation

Image diffusion models learn a distribution over images:

p_\theta(x)

Video diffusion models learn a distribution over frame sequences:

p_\theta(x_{1:F})

where x_{1:F} denotes all frames in the video.

The added difficulty is temporal coherence. A good video model must satisfy both spatial and temporal constraints.

| Requirement | Meaning |
| --- | --- |
| Spatial quality | Each frame should look realistic |
| Temporal coherence | Objects should remain consistent across frames |
| Motion realism | Movement should follow plausible dynamics |
| Long-range consistency | Scene identity should persist over time |
| Prompt alignment | Video should match the text or conditioning input |

A model that generates good individual frames may still fail as a video model if objects flicker, identities change, or motion appears unstable.

Forward Diffusion for Video

The forward diffusion process is the same as image diffusion, but applied to video tensors.

x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon

where

\epsilon \sim \mathcal{N}(0, I)

has the same shape as the video.

In PyTorch:

import torch

def q_sample_video(x0, t, alpha_bars):
    # x0: [B, C, F, H, W]; t: [B] integer timesteps; alpha_bars: [num_steps]
    noise = torch.randn_like(x0)

    # Gather each sample's cumulative alpha and reshape to broadcast over C, F, H, W
    alpha_bar_t = alpha_bars[t].view(-1, 1, 1, 1, 1)

    xt = (
        torch.sqrt(alpha_bar_t) * x0
        + torch.sqrt(1.0 - alpha_bar_t) * noise
    )

    return xt, noise

If x0 has shape [B, C, F, H, W], then xt and noise have the same shape.

The mathematics is unchanged. The challenge lies in the denoising network, which must model correlations across both space and time.

Latent Video Diffusion

Pixel-space video diffusion is extremely expensive. A short video may contain dozens of high-resolution frames.

For example, a 16-frame RGB video at 512×512 resolution contains:

16 \times 3 \times 512 \times 512 = 12{,}582{,}912

values per sample.

Latent video diffusion reduces this cost by encoding frames into latent representations.

An image autoencoder maps each frame into a latent:

z_0 = \mathcal{E}(x_0)

For video, the latent tensor may have shape:

[B, C_z, F, H_z, W_z]

For example:

latents = torch.randn(2, 4, 16, 64, 64)

Diffusion then operates on latent videos:

z_t = \sqrt{\bar{\alpha}_t} z_0 + \sqrt{1-\bar{\alpha}_t} \epsilon

After denoising, a decoder converts latent frames back to pixel frames.
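
Because the autoencoder is typically a 2D image model, frames can be encoded independently by folding the frame axis into the batch axis. A minimal sketch, assuming a hypothetical encoder that maps [B*F, C, H, W] images to [B*F, C_z, H_z, W_z] latents:

import torch

def encode_video(encoder, x0):
    # x0: [B, C, F, H, W] -> fold frames into the batch for a 2D encoder
    B, C, F, H, W = x0.shape
    frames = x0.permute(0, 2, 1, 3, 4).reshape(B * F, C, H, W)

    z = encoder(frames)  # [B*F, C_z, H_z, W_z]
    _, C_z, H_z, W_z = z.shape

    # Restore the frame axis: [B, C_z, F, H_z, W_z]
    return z.reshape(B, F, C_z, H_z, W_z).permute(0, 2, 1, 3, 4)

Decoding reverses the same reshaping.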

Temporal Modeling

A video diffusion model needs temporal structure. The model must understand how information changes across frames.

Common temporal modeling methods include:

| Method | Description |
| --- | --- |
| 3D convolutions | Apply convolution over time, height, and width |
| Temporal attention | Let frames attend to other frames |
| Factorized attention | Separate spatial attention and temporal attention |
| Recurrent state | Carry information between frames |
| Motion modules | Add temporal layers to an image diffusion backbone |
| Transformer blocks | Model spatiotemporal token sequences |

The simplest extension is a 3D U-Net. It replaces 2D convolution layers with 3D convolution layers:

[B, C, H, W] \rightarrow [B, C, F, H, W]

However, full 3D computation is costly. Many modern systems instead start from a strong image diffusion model and add temporal modules.
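
One common pattern, sometimes called a motion module, inserts a residual temporal layer after each spatial layer of a pretrained image backbone. A minimal sketch; the zero initialization is an assumption, chosen so the block starts as an identity and leaves the image model's behavior unchanged at the start of fine-tuning:

import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Convolve along the frame axis only: kernel (3, 1, 1)
        self.conv = nn.Conv3d(
            channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)
        )
        # Zero init: the block initially passes features through unchanged
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        # x: [B, C, F, H, W]
        return x + self.conv(x)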

Factorized Space-Time Attention

Full attention over video tokens is expensive.

Suppose a latent video has shape:

F \times H \times W

The number of tokens is:

N = FHW.

Full self-attention has cost:

O(N^2).

Because N grows linearly with the number of frames, this cost becomes prohibitive quickly for video.

Factorized attention reduces cost by separating spatial and temporal attention.

Spatial attention attends within each frame:

O(F \cdot (HW)^2)

Temporal attention attends across frames at each spatial location:

O(HW \cdot F^2)

This is usually cheaper than:

O((FHW)^2).

The model can first learn spatial relationships within each frame, then learn how features move over time.
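
A minimal sketch of this factorization using torch.nn.MultiheadAttention, assuming video features have already been arranged as per-frame token grids of shape [B, F, HW, D]:

import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: [B, F, HW, D]
        B, F, HW, D = x.shape

        # Spatial attention: each frame attends within itself -> O(F * (HW)^2)
        s = x.reshape(B * F, HW, D)
        s, _ = self.spatial(s, s, s)
        x = s.reshape(B, F, HW, D)

        # Temporal attention: each location attends across frames -> O(HW * F^2)
        t = x.permute(0, 2, 1, 3).reshape(B * HW, F, D)
        t, _ = self.temporal(t, t, t)
        return t.reshape(B, HW, F, D).permute(0, 2, 1, 3)

Each sub-attention sees a much shorter sequence than full spatiotemporal attention would.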

Text-to-Video Conditioning

Text-to-video generation conditions the reverse process on a prompt:

p_\theta(z_{t-1} \mid z_t, c)

where

c = \mathrm{TextEncoder}(y).

The prompt may specify:

| Prompt element | Video effect |
| --- | --- |
| Subject | What appears |
| Action | What moves |
| Scene | Where it happens |
| Camera motion | How viewpoint changes |
| Style | Visual appearance |
| Duration hints | Event structure |

Examples:

"A panda surfing on a wave, cinematic lighting"
"A drone shot flying over a futuristic city at sunset"

Text conditioning is usually injected using cross-attention, as in text-to-image systems.
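
A minimal sketch of such a cross-attention layer, with video tokens as queries and text embeddings as keys and values; the module is illustrative, not a specific system's layer:

import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, text_embeddings):
        # video_tokens: [B, N, D] (N = F * H * W latent tokens)
        # text_embeddings: [B, T_text, D]
        out, _ = self.attn(video_tokens, text_embeddings, text_embeddings)
        return video_tokens + out  # residual connection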

Image-to-Video Generation

Image-to-video models start from a still image and generate motion.

Given an image x_\text{ref}, the model learns:

p_\theta(x_{1:F} \mid x_\text{ref})

The first frame, appearance, or identity should remain consistent with the reference image.

Common conditioning methods include:

| Conditioning method | Purpose |
| --- | --- |
| Reference image embedding | Preserve identity and style |
| First-frame conditioning | Anchor the generated video |
| Depth or pose control | Guide motion geometry |
| Optical flow hints | Guide frame-to-frame movement |
| Camera trajectory | Control viewpoint changes |

Image-to-video is often easier than pure text-to-video because the model receives concrete visual structure at the start.
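
One simple mechanism is to encode the reference image and concatenate it to the noisy latents along the channel axis; a sketch under that assumption (the denoiser's input channels would be widened to match):

import torch

def concat_reference(z_t, z_ref):
    # z_t:   [B, C_z, F, H_z, W_z]  noisy video latents
    # z_ref: [B, C_z, H_z, W_z]     encoded reference image
    F = z_t.shape[2]

    # Repeat the reference latent across all frames
    ref = z_ref.unsqueeze(2).expand(-1, -1, F, -1, -1)

    # Concatenate along channels: [B, 2 * C_z, F, H_z, W_z]
    return torch.cat([z_t, ref], dim=1)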

Motion Consistency

Motion consistency is the central problem in video generation.

A model must preserve:

| Consistency type | Example |
| --- | --- |
| Object identity | Same person or object across frames |
| Geometry | Stable shape and viewpoint |
| Texture | Clothing, fur, material consistency |
| Lighting | Stable illumination |
| Background | Scene remains coherent |
| Camera motion | Smooth viewpoint movement |

Without temporal modeling, an image diffusion model applied independently to each frame produces flicker. Each frame may look plausible, but the sequence fails as video.

Temporal layers reduce flicker by sharing information across frames.
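
A crude way to see this effect is to measure the mean absolute difference between consecutive frames; a minimal diagnostic sketch, not a standard benchmark metric:

import torch

def frame_difference(video):
    # video: [B, C, F, H, W]
    # Mean absolute change between consecutive frames; high values suggest flicker
    diffs = (video[:, :, 1:] - video[:, :, :-1]).abs()
    return diffs.mean()

Genuine motion also raises this value, so flow-compensated comparisons are often preferred in practice.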

Training Data for Video Diffusion

Video diffusion models require large datasets of video clips.

Training data usually includes:

| Data | Use |
| --- | --- |
| Video frames | Visual supervision |
| Captions | Text conditioning |
| Timestamps | Temporal order |
| Motion metadata | Optional control |
| Audio | Optional multimodal conditioning |

Video data is harder to curate than image data because it has more failure modes:

| Issue | Effect |
| --- | --- |
| Low resolution | Weak visual detail |
| Compression artifacts | Learned artifacts |
| Watermarks | Undesired generations |
| Poor captions | Weak prompt alignment |
| Scene cuts | Broken temporal continuity |
| Camera shake | Noisy motion patterns |

Good video training data should contain coherent clips, accurate captions, and diverse motion.

Frame Rate and Duration

Video models must choose a frame rate and duration.

A model may generate:

16 frames at 8 fps

which gives 2 seconds of video.

Or:

64 frames at 24 fps

which gives about 2.67 seconds of video.

Higher frame rate improves smoothness but increases compute. Longer duration improves usefulness but makes long-range consistency harder.

| Design choice | Tradeoff |
| --- | --- |
| More frames | Longer duration, higher compute |
| Higher resolution | Better quality, higher memory |
| Higher fps | Smoother motion, harder modeling |
| Longer clips | Better storytelling, harder consistency |

Sliding Window Generation

Long video generation often uses sliding windows.

Instead of generating all frames at once, the model generates a short clip, then conditions the next clip on previous frames.

Example:

x_{1:16} \rightarrow x_{9:24} \rightarrow x_{17:32}

The overlapping frames help maintain continuity.

However, errors can accumulate. If the generated video drifts, later windows may become inconsistent with earlier ones.
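
A sketch of the loop, assuming a hypothetical generate_clip(model, context_frames, num_frames) sampler that denoises a fixed-length clip conditioned on optional context frames:

import torch

def generate_long_video(model, num_windows, clip_len=16, overlap=8):
    video_frames = []
    context = None  # no conditioning for the first window

    for _ in range(num_windows):
        # Hypothetical sampler: returns [B, C, clip_len, H, W]
        clip = generate_clip(model, context_frames=context, num_frames=clip_len)

        if context is None:
            new_frames = clip
        else:
            # Keep only the non-overlapping frames of later windows
            new_frames = clip[:, :, overlap:]
        video_frames.append(new_frames)

        # Condition the next window on the last `overlap` frames
        context = clip[:, :, -overlap:]

    return torch.cat(video_frames, dim=2)  # concatenate along the frame axis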

Multi-Stage Video Generation

Many systems use multiple stages.

A typical pipeline:

  1. Generate low-resolution video latents
  2. Upscale spatial resolution
  3. Interpolate or refine frames
  4. Apply temporal super-resolution
  5. Decode to final video

This separates global motion from fine detail.

| Stage | Goal |
| --- | --- |
| Base generation | Scene and motion |
| Spatial upsampling | Higher resolution |
| Temporal upsampling | More frames |
| Refinement | Remove artifacts |

Multi-stage systems are easier to scale because each stage solves a narrower problem.
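
A pipeline sketch with hypothetical stage functions, each standing in for a separately trained model:

def generate_video(prompt):
    # Hypothetical stage functions; each stands in for a trained model
    latents = base_model(prompt)          # low-res latents: [B, C_z, F, H_z, W_z]
    latents = spatial_upsampler(latents)  # higher H_z, W_z
    latents = temporal_upsampler(latents) # more frames F
    latents = refiner(latents)            # remove artifacts
    return decoder(latents)               # pixel video: [B, C, F', H, W]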

Video Diffusion Loss

The basic noise prediction loss remains:

\mathcal{L} = \mathbb{E}\left[ \|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2 \right].

For video tensors, this loss averages over channels, frames, height, and width.

In PyTorch:

# Predict the noise added at timestep t, conditioned on the text prompt
pred_noise = model(z_t, t, text_embeddings)

loss = torch.nn.functional.mse_loss(
    pred_noise,
    noise
)

Additional losses may encourage temporal smoothness:

| Loss | Purpose |
| --- | --- |
| Optical flow consistency | Preserve motion |
| Perceptual loss | Improve visual quality |
| Temporal adversarial loss | Reduce flicker |
| Frame interpolation loss | Improve smooth transitions |
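
As an illustration, a minimal difference-based smoothness penalty; real systems more often compare flow-warped frames, so this is an assumption-level sketch:

import torch

def temporal_smoothness_loss(pred_video):
    # pred_video: [B, C, F, H, W]
    # Penalize large frame-to-frame changes (a crude stand-in for flow consistency)
    diffs = pred_video[:, :, 1:] - pred_video[:, :, :-1]
    return diffs.pow(2).mean()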

Many modern systems rely primarily on diffusion loss plus strong temporal architecture rather than explicit handcrafted temporal losses.

Efficient Training Techniques

Video diffusion is memory-intensive. Common efficiency techniques include:

| Technique | Purpose |
| --- | --- |
| Latent diffusion | Reduce spatial size |
| Mixed precision | Lower memory use |
| Gradient checkpointing | Trade compute for memory |
| Factorized attention | Reduce attention cost |
| Frame subsampling | Reduce temporal length |
| Distributed training | Scale across GPUs |
| Low-rank adaptation | Fine-tune cheaply |
| Model distillation | Reduce inference steps |

Training often uses small clips first, then increases resolution or duration during later stages.
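
For example, mixed precision with torch.cuda.amp is a standard way to lower memory use. A minimal sketch of one training step; the model signature follows the shape example below, and the optimizer and batch are assumed to exist:

import torch

def train_step(model, optimizer, scaler, z_t, t, noise, text_embeddings):
    optimizer.zero_grad()

    # Run the forward pass in half precision where it is numerically safe
    with torch.cuda.amp.autocast():
        pred = model(z_t, t, encoder_hidden_states=text_embeddings)
        loss = torch.nn.functional.mse_loss(pred, noise)

    # Scale the loss to avoid fp16 gradient underflow, then step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

The scaler (torch.cuda.amp.GradScaler()) is created once, outside the training loop.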

Common Failure Modes

Video diffusion has characteristic failures.

| Failure mode | Description |
| --- | --- |
| Flicker | Frame-to-frame appearance changes |
| Identity drift | Subject changes over time |
| Geometry collapse | Shapes deform implausibly |
| Motion blur | Weak temporal detail |
| Frozen motion | Image changes too little |
| Prompt drift | Video stops following prompt |
| Scene cuts | Abrupt unintended transitions |
| Texture swimming | Surface textures move incorrectly |

These failures reflect the difficulty of modeling coherent 4D structure: time plus 3D appearance.

PyTorch Shape Example

A minimal video diffusion training step has the same structure as image diffusion.

def video_diffusion_loss(model, video, text_embeddings, schedule):
    """
    video: [B, C, F, H, W]
    text_embeddings: [B, T_text, D]
    """
    batch_size = video.shape[0]
    device = video.device

    # Sample a random diffusion timestep for each video in the batch
    t = torch.randint(
        0,
        schedule.num_steps,
        (batch_size,),
        device=device,
    )

    # Noise the clean video to timestep t using the forward process
    noise = torch.randn_like(video)
    x_t = schedule.q_sample(video, t, noise)

    # Predict the added noise, conditioned on the text embeddings
    pred_noise = model(
        x_t,
        t,
        encoder_hidden_states=text_embeddings,
    )

    return torch.nn.functional.mse_loss(pred_noise, noise)

For latent video diffusion, replace video with a latent tensor:

latents.shape
# torch.Size([B, 4, F, H_z, W_z])

The training loop remains the same.

Relationship to World Models

Video generation is related to world modeling. A video model must learn how scenes evolve.

However, text-to-video diffusion models are usually generative simulators rather than explicit physical simulators. They can learn common motion patterns, but they may violate physical consistency.

For example, a model may generate plausible waves, walking, or camera pans, but still fail at:

| Physical property | Possible failure |
| --- | --- |
| Object permanence | Objects disappear |
| Conservation | Objects change mass or shape |
| Contact dynamics | Hands pass through objects |
| Causality | Effects precede causes |
| Long-horizon planning | Actions lose coherence |

This makes video diffusion useful for synthesis, editing, and design, but limited as a precise simulator.

Summary

Video diffusion extends diffusion models from images to frame sequences. The forward process adds Gaussian noise to video tensors or latent video tensors. The reverse model learns to denoise while preserving both spatial quality and temporal coherence.

The main architectural challenge is temporal modeling. Systems use 3D convolutions, temporal attention, factorized attention, motion modules, and multi-stage generation pipelines to produce coherent motion.

Video diffusion is computationally expensive because it models space and time together. Latent representations, efficient attention, short clips, and staged generation make the problem more tractable.