A diffusion model needs a rule for how noise increases during the forward process. This rule is called the noise schedule. It determines how quickly clean data is corrupted, how much signal remains at each timestep, and how difficult each denoising task becomes.
The forward process is

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right).
$$
The sequence

$$
\beta_1, \beta_2, \dots, \beta_T
$$

is the noise schedule. Each $\beta_t$ controls the variance of the Gaussian noise added at step $t$.
A good schedule should corrupt data gradually. Early timesteps should preserve most of the signal. Later timesteps should remove almost all information so that $x_T$ becomes close to standard Gaussian noise.
From Step Noise to Cumulative Noise
The one-step noise coefficient is $\beta_t$. It is often more useful to track the cumulative signal that remains after many steps.

Define

$$
\alpha_t = 1 - \beta_t.
$$

Then define the cumulative product

$$
\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s.
$$

The direct sampling formula is

$$
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}).
$$

The term $\bar{\alpha}_t$ controls the remaining signal power. The term $1 - \bar{\alpha}_t$ controls the noise power.
Thus the schedule can be described in two equivalent ways:
| Quantity | Meaning |
|---|---|
| $\beta_t$ | Noise added at one step |
| $\alpha_t = 1 - \beta_t$ | Signal retained at one step |
| $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ | Signal retained after $t$ steps |
| $1 - \bar{\alpha}_t$ | Noise accumulated after $t$ steps |
Most schedule design is easier to understand through $\bar{\alpha}_t$ or the signal-to-noise ratio.
Signal-to-Noise Ratio
The signal-to-noise ratio at timestep $t$ is

$$
\mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}.
$$

When $\bar{\alpha}_t$ is close to 1, the SNR is high. The noisy sample still resembles the original data. When $\bar{\alpha}_t$ is close to 0, the SNR is low. The sample is mostly noise.
SNR gives a clearer view of task difficulty.
| Region | SNR | Denoising task |
|---|---|---|
| Early timesteps | High | Remove small perturbations |
| Middle timesteps | Moderate | Recover structure and texture |
| Late timesteps | Low | Infer global semantics from weak signal |
The reverse model must learn all three regimes. If the schedule spends too few steps in one region, the model may become weak there.
Linear Beta Schedule
The simplest schedule increases $\beta_t$ linearly:

$$
\beta_t = \beta_{\min} + \frac{t - 1}{T - 1} \left( \beta_{\max} - \beta_{\min} \right).
$$

A common choice is

$$
\beta_{\min} = 10^{-4}, \qquad \beta_{\max} = 0.02, \qquad T = 1000.
$$
In PyTorch:
```python
import torch

T = 1000
beta_min = 1e-4
beta_max = 2e-2

betas = torch.linspace(beta_min, beta_max, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
```

The linear beta schedule is easy to implement and works reasonably well. It was used in early denoising diffusion probabilistic models.
However, linear growth in $\beta_t$ does not imply linear change in perceptual noise or SNR. Because $\bar{\alpha}_t$ is a product over many values, the cumulative signal may decay unevenly.
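To see this concretely, the SNR induced by the linear schedule can be inspected at a few timesteps. A minimal sketch (self-contained; it rebuilds the same linear `betas` as above, and the chosen timesteps are arbitrary):

```python
import torch

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for the linear schedule above
betas = torch.linspace(1e-4, 2e-2, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)
snr = alpha_bars / (1.0 - alpha_bars)

# Print a few points to see how unevenly the cumulative signal decays
for t in (0, 249, 499, 749, 999):
    print(f"t={t:4d}  alpha_bar={alpha_bars[t]:.5f}  SNR={snr[t]:.5f}")
```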
Cosine Schedule
The cosine schedule defines the cumulative signal directly:

$$
\bar{\alpha}_t = \frac{f(t)}{f(0)},
$$

where

$$
f(t) = \cos\!\left( \frac{t/T + s}{1 + s} \cdot \frac{\pi}{2} \right)^{2}.
$$

The small constant $s$ prevents the schedule from changing too abruptly near $t = 0$.
This schedule tends to preserve signal more gently at early timesteps and produce better sample quality in many image models.
In PyTorch:
```python
import math
import torch

def cosine_alpha_bars(T, s=0.008):
    steps = torch.arange(T + 1, dtype=torch.float32)
    x = steps / T
    alpha_bars = torch.cos(
        ((x + s) / (1 + s)) * math.pi / 2
    ) ** 2
    alpha_bars = alpha_bars / alpha_bars[0]
    return alpha_bars
```

To recover $\beta_t$, use the relation

$$
\beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}.
$$
In code:
```python
def betas_from_alpha_bars(alpha_bars, max_beta=0.999):
    betas = []
    for t in range(1, len(alpha_bars)):
        beta = 1.0 - alpha_bars[t] / alpha_bars[t - 1]
        beta = min(beta.item(), max_beta)
        betas.append(beta)
    return torch.tensor(betas, dtype=torch.float32)

alpha_bars = cosine_alpha_bars(T)
betas = betas_from_alpha_bars(alpha_bars)
```

The cosine schedule is commonly used because it allocates noise levels more evenly in terms of useful denoising difficulty.
Quadratic and Sigmoid Schedules
A quadratic schedule makes $\beta_t$ grow slowly at first and faster later.

One simple construction is

$$
\beta_t = \left( \sqrt{\beta_{\min}} + \frac{t - 1}{T - 1} \left( \sqrt{\beta_{\max}} - \sqrt{\beta_{\min}} \right) \right)^{2}.
$$
In PyTorch:
```python
def quadratic_beta_schedule(T, beta_min=1e-4, beta_max=2e-2):
    return torch.linspace(
        beta_min ** 0.5,
        beta_max ** 0.5,
        T
    ) ** 2
```

A sigmoid schedule changes $\beta_t$ slowly near the beginning and end, and faster in the middle:
```python
def sigmoid_beta_schedule(T, beta_min=1e-4, beta_max=2e-2):
    x = torch.linspace(-6, 6, T)
    betas = torch.sigmoid(x)
    betas = betas * (beta_max - beta_min) + beta_min
    return betas
```

These schedules are less canonical than linear and cosine schedules, but they illustrate the design freedom. What matters most is the induced path of $\bar{\alpha}_t$ and SNR.
Schedules in Continuous Time
Discrete diffusion uses timesteps

$$
t \in \{1, 2, \dots, T\}.
$$

Continuous-time diffusion replaces the discrete schedule with continuous functions. Instead of $\beta_t$, we define a noise rate $\beta(t)$, where $t \in [0, 1]$.

A common continuous forward SDE is

$$
dx = -\tfrac{1}{2}\, \beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw.
$$

The function $\beta(t)$ controls the rate at which noise is added.
Continuous schedules are useful because they allow the reverse process to be solved with numerical ODE or SDE solvers. They also separate the training noise distribution from the number of sampling steps.
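For the variance-preserving SDE above, the cumulative signal has a closed form, $\bar{\alpha}(t) = \exp\!\big(-\int_0^t \beta(s)\, ds\big)$. A minimal sketch assuming a linear noise rate $\beta(t)$ on $t \in [0, 1]$; the endpoint values are illustrative, not canonical:

```python
import torch

def vp_alpha_bar(t, beta_min=0.1, beta_max=20.0):
    # alpha_bar(t) = exp(-integral of beta(s) ds from 0 to t)
    # with beta(s) = beta_min + s * (beta_max - beta_min), the integral
    # is beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
    return torch.exp(-integral)

t = torch.linspace(0.0, 1.0, 5)
print(vp_alpha_bar(t))  # decays smoothly from 1 toward nearly 0
```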
Variance-Preserving and Variance-Exploding Schedules
Two major families of score-based diffusion processes are variance-preserving and variance-exploding schedules.
In a variance-preserving process, the total variance of $x_t$ remains approximately constant. The DDPM forward process is variance-preserving because it scales the signal while adding noise:

$$
x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_t.
$$

If $x_{t-1}$ and $\epsilon_t$ have unit variance, then $x_t$ also has unit variance.

In a variance-exploding process, noise variance increases over time without shrinking the original signal in the same way:

$$
x_t = x_0 + \sigma_t\, \epsilon.
$$

Here $\sigma_t$ grows from a small value to a large value.
| Process | Form | Behavior |
|---|---|---|
| Variance-preserving | $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ | Signal shrinks, noise grows |
| Variance-exploding | $x_t = x_0 + \sigma_t\, \epsilon$ | Signal remains, noise scale grows |
| Sub-VP | Modified VP process | Often used for likelihood and SDE variants |
These families lead to different reverse dynamics and sampler designs.
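The two families differ only in whether the signal term is rescaled as noise is added. A minimal sketch of both forward corruptions (function names and constants are illustrative):

```python
import torch

def vp_forward(x0, alpha_bar_t, noise):
    # Variance-preserving: the signal is scaled down as noise is added
    return alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * noise

def ve_forward(x0, sigma_t, noise):
    # Variance-exploding: the signal is kept and the noise scale grows
    return x0 + sigma_t * noise

x0 = torch.randn(4, 3, 32, 32)
noise = torch.randn_like(x0)
x_vp = vp_forward(x0, torch.tensor(0.5), noise)
x_ve = ve_forward(x0, torch.tensor(10.0), noise)
```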
Log-SNR Parameterization
Many modern diffusion formulations parameterize noise using log-SNR:

$$
\lambda_t = \log \mathrm{SNR}(t) = \log \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}.
$$

This is useful because $\mathrm{SNR}(t)$ can span many orders of magnitude. Taking the logarithm gives a more numerically manageable scale.

From log-SNR, we can recover:

$$
\bar{\alpha}_t = \sigma(\lambda_t),
$$

where $\sigma$ is the logistic sigmoid function:

$$
\sigma(\lambda) = \frac{1}{1 + e^{-\lambda}}.
$$
Log-SNR is especially useful in continuous-time diffusion, velocity prediction, and modern sampler analysis.
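In code, converting between $\bar{\alpha}_t$ and log-SNR is just a logit/sigmoid pair. A small sketch (the helper names are illustrative):

```python
import torch

def alpha_bar_to_log_snr(alpha_bars):
    # lambda = log(alpha_bar / (1 - alpha_bar)), i.e. logit(alpha_bar)
    return torch.log(alpha_bars) - torch.log(1.0 - alpha_bars)

def log_snr_to_alpha_bar(log_snr):
    # alpha_bar = sigmoid(lambda)
    return torch.sigmoid(log_snr)

alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
log_snr = alpha_bar_to_log_snr(alpha_bars)
recovered = log_snr_to_alpha_bar(log_snr)
assert torch.allclose(recovered, alpha_bars, atol=1e-5)
```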
Timestep Sampling During Training
The schedule defines available noise levels, but training also requires choosing which timesteps to sample.
The simplest choice is uniform timestep sampling:

$$
t \sim \mathrm{Uniform}\{1, \dots, T\}.
$$
In PyTorch:
```python
t = torch.randint(0, T, (batch_size,), device=device)
```

Uniform sampling works well enough for many models. However, not all timesteps contribute equally to learning. Some noise levels may have larger gradients or more difficult prediction targets.
Alternative strategies include:
| Strategy | Idea |
|---|---|
| Uniform sampling | Sample all timesteps equally |
| Loss-aware sampling | Sample timesteps with high loss more often |
| SNR-weighted objectives | Reweight loss by noise level |
| Importance sampling | Allocate training to useful noise regimes |
| Continuous noise sampling | Sample $t$ or log-SNR from a continuous distribution |
Timestep sampling and loss weighting are closely linked. Changing either one changes which noise levels dominate training.
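As an example of the last strategy in the table, a continuous scheme can sample a log-SNR value directly and map it back to a signal level. A sketch with illustrative bounds:

```python
import torch

def sample_log_snr(batch_size, lambda_min=-10.0, lambda_max=10.0):
    # Uniform log-SNR sampling; the bounds here are illustrative
    u = torch.rand(batch_size)
    return lambda_min + u * (lambda_max - lambda_min)

log_snr = sample_log_snr(16)
alpha_bar = torch.sigmoid(log_snr)      # signal power at the sampled noise level
sigma = torch.sqrt(1.0 - alpha_bar)     # corresponding noise scale
```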
Loss Weighting and Schedule Interaction
The usual noise prediction loss is

$$
\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon,\, t} \left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right].
$$
Although this loss appears uniform across timesteps, the effective learning pressure depends on the schedule.
At high SNR, the model sees samples close to data and must predict small corruption. At low SNR, the model sees almost pure noise and must infer structure from little signal.
Some training objectives use explicit SNR weights:

$$
\mathcal{L}_w = \mathbb{E}_{x_0,\, \epsilon,\, t} \left[ w(t) \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right].
$$
Common choices downweight very high or very low SNR regions to prevent unstable or unhelpful gradients.
For example, min-SNR weighting clips the SNR contribution so that intermediate noise levels receive stronger emphasis.
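One way to implement such a clipped weight, shown here as a sketch of the general idea rather than any paper's exact formula:

```python
import torch

def min_snr_weights(alpha_bars, t, gamma=5.0):
    # Clipped SNR weight for noise prediction: w(t) = min(SNR(t), gamma) / SNR(t)
    # (the exact form and the value of gamma vary between formulations)
    snr = alpha_bars[t] / (1.0 - alpha_bars[t])
    return torch.clamp(snr, max=gamma) / snr

# weighted_loss = (min_snr_weights(alpha_bars, t) * per_sample_mse).mean()
```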
Schedule Effects on Sample Quality
The schedule affects generation in several ways.
If noise grows too quickly, the model loses information early. Denoising becomes difficult because adjacent timesteps differ too much.
If noise grows too slowly, many timesteps are wasted on almost identical noise levels. Training and sampling become inefficient.
If the final noise level is insufficient, $x_T$ retains data information. This breaks the assumption that generation can start from standard Gaussian noise.
A well-designed schedule should satisfy:
| Requirement | Reason |
|---|---|
| Smooth corruption | Reverse transitions remain learnable |
| Full destruction by $t = T$ | Sampling can start from Gaussian noise |
| Useful SNR coverage | Model learns coarse and fine denoising |
| Numerical stability | Avoid extreme coefficients |
| Compatibility with sampler | Reverse solver works accurately |
Schedules for Latent Diffusion
Latent diffusion applies the diffusion process in a compressed latent space instead of pixel space.
The forward equation is unchanged:

$$
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
$$

where $z_0$ is the encoded latent.
However, the latent distribution may have different statistics than pixel data. If the autoencoder normalizes latents carefully, a standard schedule may work. If latent scale differs, the schedule may need adjustment.
In practice, latent diffusion often uses schedules inherited from image diffusion but tuned for the latent representation and sampler.
The key point is that schedules operate on the representation being diffused, not on the original raw data.
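A hypothetical sketch of the same direct sampling step applied to encoder outputs; `encoder` and `latent_scale` are placeholders for an autoencoder and a normalizing factor, not a specific library API:

```python
import torch

def q_sample_latent(encoder, x0, t, alpha_bars, latent_scale=1.0, noise=None):
    # Diffuse the latent representation rather than the pixels.
    # latent_scale rescales latents to roughly unit variance so that a
    # schedule designed for normalized data still behaves as intended.
    z0 = encoder(x0) * latent_scale
    if noise is None:
        noise = torch.randn_like(z0)
    shape = (-1,) + (1,) * (z0.dim() - 1)
    a = alpha_bars[t].sqrt().view(shape)
    b = (1.0 - alpha_bars[t]).sqrt().view(shape)
    return a * z0 + b * noise
```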
Schedules and Fast Samplers
Training may use many timesteps, such as $T = 1000$. Sampling often uses fewer steps, such as 20 to 50.

Fast samplers select a subset of noise levels from the training schedule:

$$
\{t_1, t_2, \dots, t_K\} \subset \{1, \dots, T\}, \qquad K \ll T.
$$
The quality of fast sampling depends on how these timesteps are spaced.
Common spacing methods include:
| Spacing | Behavior |
|---|---|
| Uniform in timestep | Simple but may waste steps |
| Uniform in log-SNR | Often better coverage of denoising difficulty |
| Quadratic spacing | More steps near low-noise regions |
| Solver-adaptive spacing | Chosen by numerical solver |
For few-step generation, schedule design becomes more important. Each step must cover a larger interval in noise space.
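Two of these spacings are easy to construct directly from timestep indices. A minimal sketch, assuming zero-based indices with $t = 0$ as the lowest noise level (the function names are illustrative):

```python
import torch

def uniform_timesteps(T, K):
    # K evenly spaced indices taken from the training schedule
    return torch.linspace(0, T - 1, K).round().long()

def quadratic_timesteps(T, K):
    # Quadratic spacing concentrates steps near the low-noise end
    return ((torch.linspace(0, 1, K) ** 2) * (T - 1)).round().long()

steps_uniform = uniform_timesteps(1000, 20)
steps_quadratic = quadratic_timesteps(1000, 20)
```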
Practical PyTorch Schedule Class
A small schedule helper can centralize the required tensors:
```python
import torch

class DiffusionSchedule:
    def __init__(self, betas):
        # Precompute every per-timestep coefficient needed for q_sample
        self.betas = betas
        self.alphas = 1.0 - betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alpha_bars = torch.sqrt(self.alpha_bars)
        self.sqrt_one_minus_alpha_bars = torch.sqrt(
            1.0 - self.alpha_bars
        )

    def to(self, device):
        # Move all schedule tensors to the target device
        for name, value in vars(self).items():
            if torch.is_tensor(value):
                setattr(self, name, value.to(device))
        return self

    def extract(self, values, t, x_shape):
        # Gather per-sample values at timesteps t and reshape for broadcasting
        batch_size = t.shape[0]
        out = values.gather(0, t)
        return out.reshape(
            batch_size,
            *((1,) * (len(x_shape) - 1))
        )

    def q_sample(self, x0, t, noise=None):
        # Direct forward sampling:
        # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
        if noise is None:
            noise = torch.randn_like(x0)
        a = self.extract(
            self.sqrt_alpha_bars,
            t,
            x0.shape
        )
        b = self.extract(
            self.sqrt_one_minus_alpha_bars,
            t,
            x0.shape
        )
        return a * x0 + b * noise
```

Usage:
```python
betas = torch.linspace(1e-4, 2e-2, 1000)
schedule = DiffusionSchedule(betas).to(device)
x_t = schedule.q_sample(x0, t)
```

This structure keeps schedule math separate from model code.
Common Implementation Errors
Noise schedules are simple, but implementation mistakes are common.
| Error | Consequence |
|---|---|
| Off-by-one timestep indexing | Wrong noise level during training or sampling |
| Forgetting cumulative product | Uses one-step noise instead of total noise |
| Wrong device placement | CPU/GPU tensor mismatch |
| Wrong broadcast shape | Schedule values applied incorrectly |
| Excessive $\beta_t$ values | Numerical instability |
| Final $\bar{\alpha}_T$ too large | Endpoint still contains signal |
| Mixing zero-based and one-based notation | Incorrect formulas in code |
A useful check is to inspect $\beta_t$, $\bar{\alpha}_t$, and $\mathrm{SNR}(t)$ over time. They should change smoothly and monotonically: $\beta_t$ should increase, while $\bar{\alpha}_t$ and the SNR should decrease.
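A small helper that automates these checks; the tolerance and thresholds are illustrative:

```python
import torch

def check_schedule(betas, tol=1e-3):
    # Simple sanity checks on a candidate schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    snr = alpha_bars / (1.0 - alpha_bars)

    assert (betas > 0).all() and (betas < 1).all(), "betas must lie in (0, 1)"
    assert (alpha_bars[1:] <= alpha_bars[:-1]).all(), "alpha_bar must be non-increasing"
    assert alpha_bars[-1] < tol, "final alpha_bar too large: x_T still contains signal"
    return alpha_bars, snr

alpha_bars, snr = check_schedule(torch.linspace(1e-4, 2e-2, 1000))
```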
Summary
A noise schedule defines how the forward diffusion process corrupts data. The basic schedule is the sequence of one-step variances $\beta_t$, but most analysis is clearer through the cumulative signal coefficient $\bar{\alpha}_t$ and the SNR.
Linear schedules are simple. Cosine schedules often allocate denoising difficulty more effectively. Continuous-time models describe schedules with smooth noise functions or log-SNR curves.
Schedule design affects training stability, sampling quality, and inference speed. A good schedule makes the reverse process learnable, destroys the data distribution by the final timestep, and covers useful noise levels for both coarse structure and fine detail.