Noise Schedules

A diffusion model needs a rule for how noise increases during the forward process. This rule is called the noise schedule. It determines how quickly clean data is corrupted, how much signal remains at each timestep, and how difficult each denoising task becomes.

The forward process is

$$ q(x_t \mid x_{t-1}) = \mathcal{N}\left( x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I \right). $$

The sequence

$$ \beta_1, \beta_2, \ldots, \beta_T $$

is the noise schedule. Each $\beta_t$ controls the variance of the Gaussian noise added at step $t$.

A good schedule should corrupt data gradually. Early timesteps should preserve most of the signal. Later timesteps should remove almost all information so that $x_T$ becomes close to standard Gaussian noise.

From Step Noise to Cumulative Noise

The one-step noise coefficient is $\beta_t$. It is often more useful to track the cumulative signal that remains after many steps.

Define

$$ \alpha_t = 1 - \beta_t. $$

Then define the cumulative product

$$ \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s. $$

The direct sampling formula is

$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). $$

The term $\bar{\alpha}_t$ controls the remaining signal power. The term $1-\bar{\alpha}_t$ controls the noise power.

Thus the schedule can be described in two equivalent ways:

| Quantity | Meaning |
| --- | --- |
| $\beta_t$ | Noise added at one step |
| $\alpha_t$ | Signal retained at one step |
| $\bar{\alpha}_t$ | Signal retained after $t$ steps |
| $1-\bar{\alpha}_t$ | Noise accumulated after $t$ steps |

Most schedule design is easier to understand through $\bar{\alpha}_t$ or the signal-to-noise ratio.

Signal-to-Noise Ratio

The signal-to-noise ratio at timestep $t$ is

$$ \mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}. $$

When $\bar{\alpha}_t$ is close to 1, the SNR is high. The noisy sample still resembles the original data. When $\bar{\alpha}_t$ is close to 0, the SNR is low. The sample is mostly noise.

SNR gives a clearer view of task difficulty.

| Region | SNR | Denoising task |
| --- | --- | --- |
| Early timesteps | High | Remove small perturbations |
| Middle timesteps | Moderate | Recover structure and texture |
| Late timesteps | Low | Infer global semantics from weak signal |

The reverse model must learn all three regimes. If the schedule spends too few steps in one region, the model may become weak there.
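As a quick sketch, the SNR curve can be computed directly from the cumulative products. The linearly spaced $\beta_t$ values here are illustrative, not a canonical choice:

```python
import torch

# Illustrative linearly spaced one-step variances.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t); it falls
# monotonically from a high value toward zero.
snr = alpha_bars / (1.0 - alpha_bars)
```

Plotting `snr` (or its logarithm) against $t$ makes the three regimes above easy to see.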

Linear Beta Schedule

The simplest schedule increases $\beta_t$ linearly:

$$ \beta_t = \beta_{\min} + \frac{t-1}{T-1}\,(\beta_{\max} - \beta_{\min}). $$

A common choice is

$$ \beta_{\min} = 10^{-4}, \qquad \beta_{\max} = 2 \times 10^{-2}. $$

In PyTorch:

```python
import torch

T = 1000

beta_min = 1e-4
beta_max = 2e-2

betas = torch.linspace(beta_min, beta_max, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
```

The linear beta schedule is easy to implement and works reasonably well. It was used in early denoising diffusion probabilistic models.

However, linear growth in $\beta_t$ does not imply linear change in perceptual noise or SNR. Because $\bar{\alpha}_t$ is a product over many $\alpha_t$ values, the cumulative signal may decay unevenly.

Cosine Schedule

The cosine schedule defines the cumulative signal directly:

$$ \bar{\alpha}_t = \frac{f(t)}{f(0)}, $$

where

$$ f(t) = \cos^2\left( \frac{t/T + s}{1+s} \cdot \frac{\pi}{2} \right). $$

The small offset $s$ prevents $\beta_t$ from being too small near $t = 0$.

This schedule tends to preserve signal more gently at early timesteps and produce better sample quality in many image models.

In PyTorch:

```python
import math
import torch

def cosine_alpha_bars(T, s=0.008):
    # Returns T + 1 values of alpha_bar for t = 0, ..., T.
    steps = torch.arange(T + 1, dtype=torch.float32)
    x = steps / T

    alpha_bars = torch.cos(
        ((x + s) / (1 + s)) * math.pi / 2
    ) ** 2

    # Normalize so that alpha_bar at t = 0 equals 1.
    alpha_bars = alpha_bars / alpha_bars[0]
    return alpha_bars
```

To recover $\beta_t$, use the relation

$$ \beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}. $$

In code:

```python
def betas_from_alpha_bars(alpha_bars, max_beta=0.999):
    betas = []

    for t in range(1, len(alpha_bars)):
        beta = 1.0 - alpha_bars[t] / alpha_bars[t - 1]
        # Clip to avoid numerical instability near t = T.
        beta = min(beta.item(), max_beta)
        betas.append(beta)

    return torch.tensor(betas, dtype=torch.float32)

alpha_bars = cosine_alpha_bars(T)
betas = betas_from_alpha_bars(alpha_bars)
```

The cosine schedule is commonly used because it allocates noise levels more evenly in terms of useful denoising difficulty.

Quadratic and Sigmoid Schedules

A quadratic schedule makes $\beta_t$ grow slowly at first and faster later.

One simple construction is

$$ \beta_t = \left( \sqrt{\beta_{\min}} + \frac{t-1}{T-1}\,\bigl(\sqrt{\beta_{\max}} - \sqrt{\beta_{\min}}\bigr) \right)^2. $$

In PyTorch:

```python
def quadratic_beta_schedule(T, beta_min=1e-4, beta_max=2e-2):
    return torch.linspace(
        beta_min ** 0.5,
        beta_max ** 0.5,
        T
    ) ** 2
```

A sigmoid schedule changes slowly near the beginning and end, and faster in the middle:

```python
def sigmoid_beta_schedule(T, beta_min=1e-4, beta_max=2e-2):
    x = torch.linspace(-6, 6, T)
    betas = torch.sigmoid(x)
    betas = betas * (beta_max - beta_min) + beta_min
    return betas
```

These schedules are less canonical than the linear and cosine schedules, but they illustrate the design freedom. What matters most is the induced path of $\bar{\alpha}_t$ and SNR.

Schedules in Continuous Time

Discrete diffusion uses timesteps

$$ t = 1, \ldots, T. $$

Continuous-time diffusion replaces the discrete schedule with continuous functions. Instead of $\beta_t$, we define a noise rate $\beta(t)$, where

$$ t \in [0, 1]. $$

A common continuous forward SDE is

$$ dx = -\frac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw. $$

The function $\beta(t)$ controls the rate at which noise is added.

Continuous schedules are useful because they allow the reverse process to be solved with numerical ODE or SDE solvers. They also separate the training noise distribution from the number of sampling steps.
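A minimal Euler-Maruyama simulation of this forward SDE can make the continuous view concrete. The linear $\beta(t)$ below, with endpoints 0.1 and 20, is an assumption chosen only for illustration:

```python
import torch

def beta_fn(t, beta_min=0.1, beta_max=20.0):
    # Linear continuous noise rate on t in [0, 1] (illustrative values).
    return beta_min + t * (beta_max - beta_min)

def forward_sde_sample(x0, n_steps=500):
    # Euler-Maruyama discretization of
    # dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw.
    x = x0.clone()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        b = beta_fn(t)
        noise = torch.randn_like(x)
        x = x + (-0.5 * b * x) * dt + (b * dt) ** 0.5 * noise
    return x

x0 = torch.randn(8, 3)
x1 = forward_sde_sample(x0)  # approximately standard Gaussian at t = 1
```

With these rates, essentially no signal survives to $t = 1$, matching the requirement that the endpoint be close to pure noise.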

Variance-Preserving and Variance-Exploding Schedules

Two major families of score-based diffusion processes are variance-preserving and variance-exploding schedules.

In a variance-preserving process, the total variance of $x_t$ remains approximately constant. The DDPM forward process is variance-preserving because it scales the signal while adding noise:

$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon. $$

If $x_0$ and $\epsilon$ have unit variance, then $x_t$ also has unit variance.

In a variance-exploding process, noise variance increases over time without shrinking the original signal in the same way:

$$ x_t = x_0 + \sigma(t)\,\epsilon. $$

Here $\sigma(t)$ grows from a small value to a large value.

| Process | Form | Behavior |
| --- | --- | --- |
| Variance-preserving | $\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ | Signal shrinks, noise grows |
| Variance-exploding | $x_0 + \sigma(t)\,\epsilon$ | Signal remains, noise scale grows |
| Sub-VP | Modified VP process | Often used for likelihood and SDE variants |

These families lead to different reverse dynamics and sampler designs.
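The variance behavior of the two families can be verified numerically. This is a sketch with arbitrary $\bar{\alpha}_t$ and $\sigma(t)$ values chosen for illustration:

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(100_000)   # unit-variance "data"
eps = torch.randn(100_000)

# Variance-preserving: total variance stays near 1 for any alpha_bar.
alpha_bar = 0.3
x_vp = alpha_bar ** 0.5 * x0 + (1 - alpha_bar) ** 0.5 * eps

# Variance-exploding: variance grows as 1 + sigma^2.
sigma = 5.0
x_ve = x0 + sigma * eps
```

Here `x_vp.var()` stays near 1, while `x_ve.var()` is near $1 + \sigma^2 = 26$.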

Log-SNR Parameterization

Many modern diffusion formulations parameterize noise using log-SNR:

$$ \lambda_t = \log \frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}. $$

This is useful because $\mathrm{SNR}(t)$ can span many orders of magnitude. Taking the logarithm gives a more numerically manageable scale.

From log-SNR, we can recover:

$$ \bar{\alpha}_t = \sigma(\lambda_t), $$

where $\sigma$ is the logistic sigmoid function:

$$ \sigma(\lambda) = \frac{1}{1+\exp(-\lambda)}. $$

Log-SNR is especially useful in continuous-time diffusion, velocity prediction, and modern sampler analysis.
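The mapping between $\bar{\alpha}_t$ and $\lambda_t$ is easy to implement. A short sketch, with the function names being assumptions for illustration:

```python
import torch

def log_snr_from_alpha_bar(alpha_bar):
    # lambda = log(alpha_bar / (1 - alpha_bar)), i.e. the logit.
    return torch.log(alpha_bar) - torch.log1p(-alpha_bar)

def alpha_bar_from_log_snr(lam):
    # Inverse mapping: alpha_bar = sigmoid(lambda).
    return torch.sigmoid(lam)

alpha_bar = torch.tensor([0.999, 0.5, 0.001])
lam = log_snr_from_alpha_bar(alpha_bar)
recovered = alpha_bar_from_log_snr(lam)
```

Note that $\bar{\alpha}_t = 0.5$ corresponds to $\lambda_t = 0$, the point where signal and noise power are equal.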

Timestep Sampling During Training

The schedule defines available noise levels, but training also requires choosing which timesteps to sample.

The simplest choice is uniform timestep sampling:

$$ t \sim \mathrm{Uniform}\{1, \ldots, T\}. $$

In PyTorch:

```python
# Zero-based indices 0, ..., T-1 that index into the schedule tensors.
t = torch.randint(0, T, (batch_size,), device=device)
```

Uniform sampling works well enough for many models. However, not all timesteps contribute equally to learning. Some noise levels may have larger gradients or more difficult prediction targets.

Alternative strategies include:

| Strategy | Idea |
| --- | --- |
| Uniform sampling | Sample all timesteps equally |
| Loss-aware sampling | Sample timesteps with high loss more often |
| SNR-weighted objectives | Reweight loss by noise level |
| Importance sampling | Allocate training to useful noise regimes |
| Continuous noise sampling | Sample $\sigma$ or log-SNR from a continuous distribution |

Timestep sampling and loss weighting are closely linked. Changing either one changes which noise levels dominate training.
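As one example of the last strategy, continuous-time training often draws the noise scale from a log-normal distribution so that mid-range noise levels dominate. The mean and standard deviation below are illustrative assumptions, not canonical values:

```python
import torch

def sample_sigmas(batch_size, loc=-1.2, scale=1.2):
    # Draw log(sigma) ~ Normal(loc, scale), then exponentiate,
    # concentrating training on intermediate noise levels.
    log_sigma = loc + scale * torch.randn(batch_size)
    return torch.exp(log_sigma)

sigmas = sample_sigmas(256)
```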

Loss Weighting and Schedule Interaction

The usual noise prediction loss is

$$ \mathcal{L} = \mathbb{E}\left[ \|\epsilon - \epsilon_\theta(x_t, t)\|_2^2 \right]. $$

Although this loss appears uniform across timesteps, the effective learning pressure depends on the schedule.

At high SNR, the model sees samples close to data and must predict small corruption. At low SNR, the model sees almost pure noise and must infer structure from little signal.

Some training objectives use explicit SNR weights:

$$ \mathcal{L} = w(t)\, \|\epsilon - \epsilon_\theta(x_t, t)\|_2^2. $$

Common choices downweight very high or very low SNR regions to prevent unstable or unhelpful gradients.

For example, min-SNR weighting clips the SNR contribution so that intermediate noise levels receive stronger emphasis.
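A sketch of min-SNR style weighting for noise prediction follows. The clipping threshold `gamma` and the exact weight form are assumptions based on the common formulation, so treat the details as illustrative:

```python
import torch

def min_snr_weight(alpha_bars, t, gamma=5.0):
    # For epsilon prediction, a min-SNR style weight is
    # w(t) = min(SNR(t), gamma) / SNR(t),
    # which downweights very high-SNR (nearly clean) timesteps.
    snr = alpha_bars[t] / (1.0 - alpha_bars[t])
    return torch.clamp(snr, max=gamma) / snr

betas = torch.linspace(1e-4, 2e-2, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

t = torch.tensor([0, 500, 999])
w = min_snr_weight(alpha_bars, t)  # small at t=0, exactly 1 at t=999
```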

Schedule Effects on Sample Quality

The schedule affects generation in several ways.

If noise grows too quickly, the model loses information early. Denoising becomes difficult because adjacent timesteps differ too much.

If noise grows too slowly, many timesteps are wasted on almost identical noise levels. Training and sampling become inefficient.

If the final noise level is insufficient, $x_T$ retains data information. This breaks the assumption that generation can start from standard Gaussian noise.

A well-designed schedule should satisfy:

| Requirement | Reason |
| --- | --- |
| Smooth corruption | Reverse transitions remain learnable |
| Full destruction by $T$ | Sampling can start from Gaussian noise |
| Useful SNR coverage | Model learns coarse and fine denoising |
| Numerical stability | Avoid extreme coefficients |
| Compatibility with sampler | Reverse solver works accurately |

Schedules for Latent Diffusion

Latent diffusion applies the diffusion process in a compressed latent space instead of pixel space.

The forward equation is unchanged:

$$ z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon. $$

However, the latent distribution may have different statistics than pixel data. If the autoencoder normalizes latents carefully, a standard schedule may work. If latent scale differs, the schedule may need adjustment.

In practice, latent diffusion often uses schedules inherited from image diffusion but tuned for the latent representation and sampler.

The key point is that schedules operate on the representation being diffused, not on the original raw data.

Schedules and Fast Samplers

Training may use many timesteps, such as $T = 1000$. Sampling often uses fewer steps, such as 20 to 50.

Fast samplers select a subset of noise levels from the training schedule:

$$ T = t_K > t_{K-1} > \cdots > t_0 = 0. $$

The quality of fast sampling depends on how these timesteps are spaced.

Common spacing methods include:

| Spacing | Behavior |
| --- | --- |
| Uniform in timestep | Simple but may waste steps |
| Uniform in log-SNR | Often better coverage of denoising difficulty |
| Quadratic spacing | More steps near low-noise regions |
| Solver-adaptive spacing | Chosen by numerical solver |

For few-step generation, schedule design becomes more important. Each step must cover a larger interval in noise space.
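Selecting a timestep subset under two of these spacings can be sketched as follows. The helper name and exact rounding are assumptions for illustration:

```python
import torch

def select_timesteps(T, K, spacing="uniform"):
    # Returns K timestep indices in descending order (T-1 down to 0).
    if spacing == "uniform":
        ts = torch.linspace(T - 1, 0, K)
    elif spacing == "quadratic":
        # Squaring a linear ramp puts more steps near t = 0,
        # i.e. the low-noise end of the schedule.
        ts = torch.linspace((T - 1) ** 0.5, 0, K) ** 2
    else:
        raise ValueError(f"unknown spacing: {spacing}")
    return ts.round().long()

steps = select_timesteps(1000, 20, spacing="quadratic")
```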

Practical PyTorch Schedule Class

A small schedule helper can centralize the required tensors:

```python
import torch

class DiffusionSchedule:
    def __init__(self, betas):
        self.betas = betas
        self.alphas = 1.0 - betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)

        self.sqrt_alpha_bars = torch.sqrt(self.alpha_bars)
        self.sqrt_one_minus_alpha_bars = torch.sqrt(
            1.0 - self.alpha_bars
        )

    def to(self, device):
        for name, value in vars(self).items():
            if torch.is_tensor(value):
                setattr(self, name, value.to(device))
        return self

    def extract(self, values, t, x_shape):
        # Gather per-sample schedule values and reshape for broadcasting.
        batch_size = t.shape[0]
        out = values.gather(0, t)
        return out.reshape(
            batch_size,
            *((1,) * (len(x_shape) - 1))
        )

    def q_sample(self, x0, t, noise=None):
        if noise is None:
            noise = torch.randn_like(x0)

        a = self.extract(self.sqrt_alpha_bars, t, x0.shape)
        b = self.extract(self.sqrt_one_minus_alpha_bars, t, x0.shape)

        return a * x0 + b * noise
```

Usage:

```python
betas = torch.linspace(1e-4, 2e-2, 1000)
schedule = DiffusionSchedule(betas).to(device)

x_t = schedule.q_sample(x0, t)
```

This structure keeps schedule math separate from model code.

Common Implementation Errors

Noise schedules are simple, but implementation mistakes are common.

| Error | Consequence |
| --- | --- |
| Off-by-one timestep indexing | Wrong noise level during training or sampling |
| Forgetting the cumulative product | Uses one-step noise instead of total noise |
| Wrong device placement | CPU/GPU tensor mismatch |
| Wrong broadcast shape | Schedule values applied incorrectly |
| Excessive $\beta_t$ | Numerical instability |
| Final $\bar{\alpha}_T$ too large | Endpoint still contains signal |
| Mixing zero-based and one-based notation | Incorrect formulas in code |

A useful check is to inspect $\bar{\alpha}_t$, $1-\bar{\alpha}_t$, and SNR over time. They should change smoothly and monotonically.
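These checks are easy to automate. A minimal sketch, assuming a discrete schedule stored as a tensor of $\beta_t$ values:

```python
import torch

def check_schedule(betas, max_final_alpha_bar=1e-2):
    # Basic sanity checks for a discrete noise schedule.
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    snr = alpha_bars / (1.0 - alpha_bars)

    assert (betas > 0).all() and (betas < 1).all()
    # Cumulative signal should decay monotonically...
    assert (alpha_bars[1:] < alpha_bars[:-1]).all()
    # ...and be close to zero at the endpoint.
    assert alpha_bars[-1] < max_final_alpha_bar
    # SNR should also decrease monotonically.
    assert (snr[1:] < snr[:-1]).all()

check_schedule(torch.linspace(1e-4, 2e-2, 1000))
```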

Summary

A noise schedule defines how the forward diffusion process corrupts data. The basic schedule is the sequence of one-step variances $\beta_t$, but most analysis is clearer through the cumulative signal coefficient $\bar{\alpha}_t$ and the SNR.

Linear schedules are simple. Cosine schedules often allocate denoising difficulty more effectively. Continuous-time models describe schedules with smooth noise functions or log-SNR curves.

Schedule design affects training stability, sampling quality, and inference speed. A good schedule makes the reverse process learnable, destroys the data distribution by the final timestep, and covers useful noise levels for both coarse structure and fine detail.