Video diffusion extends image diffusion from still images to moving sequences. Instead of generating one image, the model generates a sequence of frames that should remain visually coherent over time.
A video sample can be represented as a tensor:

$$x \in \mathbb{R}^{B \times C \times F \times H \times W}$$

where $B$ is the batch size, $C$ is the number of channels, $F$ is the number of frames, $H$ is the height, and $W$ is the width.
For example:
```python
video = torch.randn(2, 3, 16, 256, 256)
```

This represents a batch of 2 videos, each with 3 color channels, 16 frames, and spatial resolution 256 × 256.
From Image Generation to Video Generation
Image diffusion models learn a distribution over images:

$$p_\theta(x)$$

Video diffusion models learn a distribution over frame sequences:

$$p_\theta(x_{1:F})$$

where $x_{1:F}$ denotes all $F$ frames in the video.
The added difficulty is temporal coherence. A good video model must satisfy both spatial and temporal constraints.
| Requirement | Meaning |
|---|---|
| Spatial quality | Each frame should look realistic |
| Temporal coherence | Objects should remain consistent across frames |
| Motion realism | Movement should follow plausible dynamics |
| Long-range consistency | Scene identity should persist over time |
| Prompt alignment | Video should match the text or conditioning input |
A model that generates good individual frames may still fail as a video model if objects flicker, identities change, or motion appears unstable.
Forward Diffusion for Video
The forward diffusion process is the same as image diffusion, but applied to video tensors:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$ has the same shape as the video.
In PyTorch:
```python
import torch


def extract(values, t, shape):
    # One possible implementation of the `extract` helper used below:
    # gather per-timestep values and reshape to broadcast over [B, C, F, H, W].
    out = values.gather(-1, t)
    return out.reshape(t.shape[0], *([1] * (len(shape) - 1)))


def q_sample_video(x0, t, alpha_bars):
    # x0: clean video [B, C, F, H, W]; t: integer timesteps [B]
    noise = torch.randn_like(x0)
    alpha_bar_t = extract(alpha_bars, t, x0.shape)
    xt = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * noise
    return xt, noise
```

If `x0` has shape `[B, C, F, H, W]`, then `xt` and `noise` have the same shape.
The mathematics is unchanged. The challenge lies in the denoising network, which must model correlations across both space and time.
Latent Video Diffusion
Pixel-space video diffusion is extremely expensive. A short video may contain dozens of high-resolution frames.
For example, a 16-frame RGB video at 256 × 256 resolution contains:

$$16 \times 3 \times 256 \times 256 = 3{,}145{,}728$$

values per sample.
Latent video diffusion reduces this cost by encoding frames into latent representations.
An image autoencoder maps each frame into a latent:

$$z_f = E(x_f)$$

For video, the latent tensor may have shape:

```
[B, C_z, F, H_z, W_z]
```

For example:

```python
latents = torch.randn(2, 4, 16, 64, 64)
```

Diffusion then operates on these latent videos rather than on pixel frames.
After denoising, a decoder converts latent frames back to pixel frames.
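In practice, a pretrained 2D image autoencoder is often applied frame by frame by folding the frame dimension into the batch dimension. The sketch below assumes a hypothetical `vae` object with `encode` and `decode` methods; the interface and shapes are assumptions, not a specific library's API.

```python
import torch


def encode_video_per_frame(vae, video):
    # video: [B, C, F, H, W] -> latents: [B, C_z, F, H_z, W_z]
    # Fold frames into the batch so a 2D image VAE can process them.
    b, c, f, h, w = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
    latents = vae.encode(frames)               # assumed: [B*F, C_z, H_z, W_z]
    cz, hz, wz = latents.shape[1:]
    return latents.reshape(b, f, cz, hz, wz).permute(0, 2, 1, 3, 4)


def decode_video_per_frame(vae, latents):
    # latents: [B, C_z, F, H_z, W_z] -> video: [B, C, F, H, W]
    b, cz, f, hz, wz = latents.shape
    flat = latents.permute(0, 2, 1, 3, 4).reshape(b * f, cz, hz, wz)
    frames = vae.decode(flat)                  # assumed: [B*F, C, H, W]
    c, h, w = frames.shape[1:]
    return frames.reshape(b, f, c, h, w).permute(0, 2, 1, 3, 4)
```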
Temporal Modeling
A video diffusion model needs temporal structure. The model must understand how information changes across frames.
Common temporal modeling methods include:
| Method | Description |
|---|---|
| 3D convolutions | Apply convolution over time, height, and width |
| Temporal attention | Let frames attend to other frames |
| Factorized attention | Separate spatial attention and temporal attention |
| Recurrent state | Carry information between frames |
| Motion modules | Add temporal layers to an image diffusion backbone |
| Transformer blocks | Model spatiotemporal token sequences |
The simplest extension is a 3D U-Net, which replaces 2D convolution layers with 3D convolution layers.
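As a minimal illustration of the idea (a single layer, not a full 3D U-Net), `nn.Conv2d` becomes `nn.Conv3d` so that each kernel also spans neighboring frames:

```python
import torch
import torch.nn as nn

# A 3D convolution mixes information over (frames, height, width).
conv3d = nn.Conv3d(
    in_channels=4,
    out_channels=128,
    kernel_size=(3, 3, 3),   # (time, height, width)
    padding=(1, 1, 1),
)

latents = torch.randn(2, 4, 16, 64, 64)   # [B, C_z, F, H_z, W_z]
features = conv3d(latents)                # [2, 128, 16, 64, 64]
```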
However, full 3D computation is costly. Many modern systems instead start from a strong image diffusion model and add temporal modules.
Factorized Space-Time Attention
Full attention over video tokens is expensive.
Suppose a latent video has shape (ignoring batch and channel dimensions):

$$[F, H_z, W_z]$$

The number of tokens is:

$$N = F \cdot H_z \cdot W_z$$

Full self-attention has cost:

$$O(N^2) = O\big((F \cdot H_z \cdot W_z)^2\big)$$
For video, this becomes expensive quickly.
Factorized attention reduces cost by separating spatial and temporal attention.
Spatial attention attends within each frame, at cost:

$$O\big(F \cdot (H_z W_z)^2\big)$$

Temporal attention attends across frames at each spatial location, at cost:

$$O\big(H_z W_z \cdot F^2\big)$$

Together, this is usually much cheaper than the full-attention cost:

$$O\big((F \cdot H_z \cdot W_z)^2\big)$$
The model can first learn spatial relationships within each frame, then learn how features move over time.
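A minimal sketch of a factorized space-time attention block, assuming a token tensor of shape `[B, F, S, D]` with `S = H_z * W_z` spatial positions per frame; the layout and module choices are illustrative, not taken from a specific model.

```python
import torch
import torch.nn as nn


class FactorizedSpaceTimeAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: [B, F, S, D]
        b, f, s, d = tokens.shape

        # Spatial attention: each frame attends within itself.
        x = tokens.reshape(b * f, s, d)
        x, _ = self.spatial_attn(x, x, x)
        x = x.reshape(b, f, s, d)

        # Temporal attention: each spatial position attends across frames.
        x = x.permute(0, 2, 1, 3).reshape(b * s, f, d)
        x, _ = self.temporal_attn(x, x, x)
        x = x.reshape(b, s, f, d).permute(0, 2, 1, 3)
        return x


tokens = torch.randn(2, 16, 16 * 16, 128)   # [B, F, S, D]
out = FactorizedSpaceTimeAttention(dim=128, num_heads=8)(tokens)
```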
Text-to-Video Conditioning
Text-to-video generation conditions the reverse process on a prompt:

$$p_\theta(x_{1:F} \mid c)$$

where $c$ is an embedding of the text prompt.
The prompt may specify:
| Prompt element | Video effect |
|---|---|
| Subject | What appears |
| Action | What moves |
| Scene | Where it happens |
| Camera motion | How viewpoint changes |
| Style | Visual appearance |
| Duration hints | Event structure |
Examples:
"A panda surfing on a wave, cinematic lighting""A drone shot flying over a futuristic city at sunset"Text conditioning is usually injected using cross-attention, as in text-to-image systems.
Image-to-Video Generation
Image-to-video models start from a still image and generate motion.
Given a reference image $x_{\text{ref}}$, the model learns:

$$p_\theta(x_{1:F} \mid x_{\text{ref}})$$
The first frame, appearance, or identity should remain consistent with the reference image.
Common conditioning methods include:
| Conditioning method | Purpose |
|---|---|
| Reference image embedding | Preserve identity and style |
| First-frame conditioning | Anchor the generated video |
| Depth or pose control | Guide motion geometry |
| Optical flow hints | Guide frame-to-frame movement |
| Camera trajectory | Control viewpoint changes |
Image-to-video is often easier than pure text-to-video because the model receives concrete visual structure at the start.
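As a sketch of one of these methods, first-frame conditioning can be implemented by repeating the encoded reference image across the frame dimension and concatenating it with the noisy latents along the channel dimension; this is a simplified illustration, not a specific model's recipe.

```python
import torch


def add_first_frame_condition(noisy_latents, ref_latent):
    # noisy_latents: [B, C_z, F, H_z, W_z]
    # ref_latent:    [B, C_z, H_z, W_z] encoded reference image
    f = noisy_latents.shape[2]
    ref = ref_latent.unsqueeze(2).expand(-1, -1, f, -1, -1)  # repeat across frames
    # The denoiser then takes 2 * C_z input channels per frame.
    return torch.cat([noisy_latents, ref], dim=1)
```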
Motion Consistency
Motion consistency is the central problem in video generation.
A model must preserve:
| Consistency type | Example |
|---|---|
| Object identity | Same person or object across frames |
| Geometry | Stable shape and viewpoint |
| Texture | Clothing, fur, material consistency |
| Lighting | Stable illumination |
| Background | Scene remains coherent |
| Camera motion | Smooth viewpoint movement |
Without temporal modeling, an image diffusion model applied independently to each frame produces flicker. Each frame may look plausible, but the sequence fails as video.
Temporal layers reduce flicker by sharing information across frames.
Training Data for Video Diffusion
Video diffusion models require large datasets of video clips.
Training data usually includes:
| Data | Use |
|---|---|
| Video frames | Visual supervision |
| Captions | Text conditioning |
| Timestamps | Temporal order |
| Motion metadata | Optional control |
| Audio | Optional multimodal conditioning |
Video data is harder to curate than image data because it has more failure modes:
| Issue | Effect |
|---|---|
| Low resolution | Weak visual detail |
| Compression artifacts | Learned artifacts |
| Watermarks | Undesired generations |
| Poor captions | Weak prompt alignment |
| Scene cuts | Broken temporal continuity |
| Camera shake | Noisy motion patterns |
Good video training data should contain coherent clips, accurate captions, and diverse motion.
Frame Rate and Duration
Video models must choose a frame rate and duration.
A model may generate:
16 frames at 8 fps, which gives 2 seconds of video.

Or:

64 frames at 24 fps, which gives about 2.67 seconds of video.
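A quick check of the arithmetic (clip duration is the number of frames divided by frames per second):

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    # duration in seconds = number of frames / frames per second
    return num_frames / fps


print(clip_duration_seconds(16, 8))    # 2.0
print(clip_duration_seconds(64, 24))   # ~2.67
```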
Higher frame rate improves smoothness but increases compute. Longer duration improves usefulness but makes long-range consistency harder.
| Design choice | Tradeoff |
|---|---|
| More frames | Better duration, higher compute |
| Higher resolution | Better quality, higher memory |
| Higher fps | Smoother motion, harder modeling |
| Longer clips | Better storytelling, harder consistency |
Sliding Window Generation
Long video generation often uses sliding windows.
Instead of generating all frames at once, the model generates a short clip, then conditions the next clip on previous frames.
Example: generate the first 16 frames, then condition the next window on the last few frames of the previous clip, so that consecutive windows overlap.

The overlapping frames help maintain continuity.
However, errors can accumulate. If the generated video drifts, later windows may become inconsistent with earlier ones.
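A minimal sketch of this loop, assuming a hypothetical `generate_clip(condition_frames, num_new_frames)` sampler; the function name, window length, and overlap are illustrative.

```python
import torch


def generate_long_video(generate_clip, total_frames, window=16, overlap=4):
    # generate_clip(cond, n) is assumed to return a [B, C, n, H, W] tensor,
    # optionally conditioned on the previous frames `cond`.
    frames = generate_clip(None, window)
    while frames.shape[2] < total_frames:
        cond = frames[:, :, -overlap:]                # last frames of previous window
        new = generate_clip(cond, window - overlap)   # generate only the new frames
        frames = torch.cat([frames, new], dim=2)
    return frames[:, :, :total_frames]
```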
Multi-Stage Video Generation
Many systems use multiple stages.
A typical pipeline:
- Generate low-resolution video latents
- Upscale spatial resolution
- Interpolate or refine frames
- Apply temporal super-resolution
- Decode to final video
This separates global motion from fine detail.
| Stage | Goal |
|---|---|
| Base generation | Scene and motion |
| Spatial upsampling | Higher resolution |
| Temporal upsampling | More frames |
| Refinement | Remove artifacts |
Multi-stage systems are easier to scale because each stage solves a narrower problem.
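A schematic of such a staged pipeline, with every stage as a hypothetical placeholder function (the names and latent shapes here are illustrative, not from a specific system):

```python
def generate_video(prompt, base_model, spatial_upscaler, temporal_upscaler, refiner, decoder):
    # 1. Base generation: low-resolution latents capture scene and motion.
    latents = base_model(prompt)              # e.g. [B, C_z, 16, 32, 32]

    # 2. Spatial upsampling: increase the resolution of every frame.
    latents = spatial_upscaler(latents)       # e.g. [B, C_z, 16, 64, 64]

    # 3. Temporal upsampling: interpolate additional frames.
    latents = temporal_upscaler(latents)      # e.g. [B, C_z, 64, 64, 64]

    # 4. Refinement: remove artifacts at full resolution.
    latents = refiner(latents, prompt)

    # 5. Decode latents back to pixel frames.
    return decoder(latents)
```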
Video Diffusion Loss
The basic noise prediction loss remains:

$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\left[\, \lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert^2 \,\right]$$
For video tensors, this loss averages over channels, frames, height, and width.
In PyTorch:
```python
pred_noise = model(z_t, t, text_embeddings)
loss = torch.nn.functional.mse_loss(pred_noise, noise)
```

Additional losses may encourage temporal smoothness; a minimal sketch of one such term follows the table below:
| Loss | Purpose |
|---|---|
| Optical flow consistency | Preserve motion |
| Perceptual loss | Improve visual quality |
| Temporal adversarial loss | Reduce flicker |
| Frame interpolation loss | Improve smooth transitions |
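As one illustration of a handcrafted temporal term (an assumption for this sketch, not a loss taken from a specific system), a simple penalty on frame-to-frame differences in the predicted video discourages flicker:

```python
import torch


def temporal_smoothness_loss(pred_video):
    # pred_video: [B, C, F, H, W]
    # Penalize differences between consecutive frames.
    diff = pred_video[:, :, 1:] - pred_video[:, :, :-1]
    return (diff ** 2).mean()
```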
Many modern systems rely primarily on diffusion loss plus strong temporal architecture rather than explicit handcrafted temporal losses.
Efficient Training Techniques
Video diffusion is memory-intensive. Common efficiency techniques include:
| Technique | Purpose |
|---|---|
| Latent diffusion | Reduce spatial size |
| Mixed precision | Lower memory use |
| Gradient checkpointing | Trade compute for memory |
| Factorized attention | Reduce attention cost |
| Frame subsampling | Reduce temporal length |
| Distributed training | Scale across GPUs |
| Low-rank adaptation | Fine-tune cheaply |
| Model distillation | Reduce inference steps |
Training often uses small clips first, then increases resolution or duration during later stages.
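A sketch combining two of these techniques in one training step, mixed precision and gradient checkpointing, using standard PyTorch utilities; the `model` interface and tensor shapes are placeholders.

```python
import torch
from torch.utils.checkpoint import checkpoint

scaler = torch.cuda.amp.GradScaler()


def training_step(model, optimizer, x_t, t, text_embeddings, noise):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                   # mixed precision forward pass
        # Gradient checkpointing recomputes activations during backward
        # instead of storing them, trading compute for memory.
        pred_noise = checkpoint(model, x_t, t, text_embeddings, use_reentrant=False)
        loss = torch.nn.functional.mse_loss(pred_noise, noise)
    scaler.scale(loss).backward()                     # scale to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```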
Common Failure Modes
Video diffusion has characteristic failures.
| Failure mode | Description |
|---|---|
| Flicker | Frame-to-frame appearance changes |
| Identity drift | Subject changes over time |
| Geometry collapse | Shapes deform implausibly |
| Motion blur | Weak temporal detail |
| Frozen motion | Image changes too little |
| Prompt drift | Video stops following prompt |
| Scene cuts | Abrupt unintended transitions |
| Texture swimming | Surface textures move incorrectly |
These failures reflect the difficulty of modeling coherent 4D structure: time plus 3D appearance.
PyTorch Shape Example
A minimal video diffusion training step has the same structure as image diffusion.
```python
import torch


def video_diffusion_loss(model, video, text_embeddings, schedule):
    """
    video:            [B, C, F, H, W]
    text_embeddings:  [B, T_text, D]
    schedule:         noise schedule exposing num_steps and q_sample
    """
    batch_size = video.shape[0]
    device = video.device

    # Sample a random diffusion timestep for each video in the batch.
    t = torch.randint(0, schedule.num_steps, (batch_size,), device=device)

    # Add noise to the clean video according to the forward process.
    noise = torch.randn_like(video)
    x_t = schedule.q_sample(video, t, noise)

    # Predict the noise, conditioned on the text embeddings.
    pred_noise = model(x_t, t, encoder_hidden_states=text_embeddings)
    return torch.nn.functional.mse_loss(pred_noise, noise)
```

For latent video diffusion, replace `video` with a latent tensor:
```python
latents.shape
# torch.Size([B, 4, F, H_z, W_z])
```

The training loop remains the same.
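A toy invocation with stand-in components, purely to illustrate the expected shapes (the dummy schedule and model below are placeholders, not real components):

```python
import torch


class DummySchedule:
    num_steps = 1000

    def q_sample(self, x0, t, noise):
        # Constant mixing weights, just to exercise the shapes.
        return 0.5 * x0 + 0.5 * noise


def dummy_model(x_t, t, encoder_hidden_states=None):
    return torch.zeros_like(x_t)   # stand-in noise prediction


video = torch.randn(2, 3, 16, 64, 64)        # [B, C, F, H, W]
text_embeddings = torch.randn(2, 77, 768)    # [B, T_text, D]
loss = video_diffusion_loss(dummy_model, video, text_embeddings, DummySchedule())
print(loss.item())
```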
Relationship to World Models
Video generation is related to world modeling. A video model must learn how scenes evolve.
However, text-to-video diffusion models are usually generative simulators rather than explicit physical simulators. They can learn common motion patterns, but they may violate physical consistency.
For example, a model may generate plausible waves, walking, or camera pans, but still fail at:
| Physical property | Possible failure |
|---|---|
| Object permanence | Objects disappear |
| Conservation | Objects change mass or shape |
| Contact dynamics | Hands pass through objects |
| Causality | Effects precede causes |
| Long-horizon planning | Actions lose coherence |
This makes video diffusion useful for synthesis, editing, and design, but limited as a precise simulator.
Summary
Video diffusion extends diffusion models from images to frame sequences. The forward process adds Gaussian noise to video tensors or latent video tensors. The reverse model learns to denoise while preserving both spatial quality and temporal coherence.
The main architectural challenge is temporal modeling. Systems use 3D convolutions, temporal attention, factorized attention, motion modules, and multi-stage generation pipelines to produce coherent motion.
Video diffusion is computationally expensive because it models space and time together. Latent representations, efficient attention, short clips, and staged generation make the problem more tractable.