Noise Schedules

A diffusion model needs a rule for how noise increases during the forward process. This rule is called the noise schedule. It determines how quickly clean data is corrupted, how much signal remains at each timestep, and how difficult each denoising task becomes.

The forward process is

$$ q(x_t \mid x_{t-1}) = \mathcal{N}\left( x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I \right). $$

The sequence

$$ \beta_1, \beta_2, \ldots, \beta_T $$

is the noise schedule. Each $\beta_t$ controls the variance of the Gaussian noise added at step $t$.

A good schedule should corrupt data gradually. Early timesteps should preserve most of the signal. Later timesteps should remove almost all information so that $x_T$ becomes close to standard Gaussian noise.

From Step Noise to Cumulative Noise

The one-step noise coefficient is $\beta_t$. It is often more useful to track the cumulative signal that remains after many steps.

Define

$$ \alpha_t = 1 - \beta_t. $$

Then define the cumulative product

$$ \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s. $$

The direct sampling formula is

$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). $$

The term $\bar{\alpha}_t$ controls the remaining signal power. The term $1-\bar{\alpha}_t$ controls the noise power.

Thus the schedule can be described in two equivalent ways:

| Quantity | Meaning |
| --- | --- |
| $\beta_t$ | Noise added at one step |
| $\alpha_t$ | Signal retained at one step |
| $\bar{\alpha}_t$ | Signal retained after $t$ steps |
| $1-\bar{\alpha}_t$ | Noise accumulated after $t$ steps |

Most schedule design is easier to understand through $\bar{\alpha}_t$ or the signal-to-noise ratio.

Signal-to-Noise Ratio

The signal-to-noise ratio at timestep $t$ is

$$ \mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}. $$

When $\bar{\alpha}_t$ is close to 1, the SNR is high. The noisy sample still resembles the original data. When $\bar{\alpha}_t$ is close to 0, the SNR is low. The sample is mostly noise.

SNR gives a clearer view of task difficulty.

| Region | SNR | Denoising task |
| --- | --- | --- |
| Early timesteps | High | Remove small perturbations |
| Middle timesteps | Moderate | Recover structure and texture |
| Late timesteps | Low | Infer global semantics from weak signal |

The reverse model must learn all three regimes. If the schedule spends too few steps in one region, the model may become weak there.
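As a quick sketch, the SNR curve can be computed directly from the cumulative products. The linearly spaced $\beta_t$ values here are illustrative, not a canonical choice:

```python
import torch

# Illustrative linearly spaced one-step variances.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t); it falls
# monotonically from a high value toward zero.
snr = alpha_bars / (1.0 - alpha_bars)
```

Plotting `snr` (or its logarithm) against $t$ makes the three regimes above easy to see.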

Linear Beta Schedule

The simplest schedule increases $\beta_t$ linearly:

$$ \beta_t = \beta_{\min} + \frac{t-1}{T-1}\,(\beta_{\max} - \beta_{\min}). $$

A common choice is

$$ \beta_{\min} = 10^{-4}, \qquad \beta_{\max} = 2 \times 10^{-2}. $$

In PyTorch:

```python
import torch

T = 1000

beta_min = 1e-4
beta_max = 2e-2

betas = torch.linspace(beta_min, beta_max, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
```

The linear beta schedule is easy to implement and works reasonably well. It was used in early denoising diffusion probabilistic models.

However, linear growth in $\beta_t$ does not imply linear change in perceptual noise or SNR. Because $\bar{\alpha}_t$ is a product over many $\alpha_t$ values, the cumulative signal may decay unevenly.

Cosine Schedule

The cosine schedule defines the cumulative signal directly:

$$ \bar{\alpha}_t = \frac{f(t)}{f(0)}, $$

where

$$ f(t) = \cos^2\left( \frac{t/T + s}{1+s} \cdot \frac{\pi}{2} \right). $$

The small offset $s$ prevents $\beta_t$ from being too small near $t = 0$.

This schedule tends to preserve signal more gently at early timesteps and produce better sample quality in many image models.

In PyTorch:

```python
import math
import torch

def cosine_alpha_bars(T, s=0.008):
    # Returns T + 1 values of alpha_bar for t = 0, ..., T.
    steps = torch.arange(T + 1, dtype=torch.float32)
    x = steps / T

    alpha_bars = torch.cos(
        ((x + s) / (1 + s)) * math.pi / 2
    ) ** 2

    # Normalize so that alpha_bar at t = 0 equals 1.
    alpha_bars = alpha_bars / alpha_bars[0]
    return alpha_bars
```

To recover $\beta_t$, use the relation

$$ \beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}. $$

In code:

```python
def betas_from_alpha_bars(alpha_bars, max_beta=0.999):
    betas = []

    for t in range(1, len(alpha_bars)):
        beta = 1.0 - alpha_bars[t] / alpha_bars[t - 1]
        # Clip to avoid numerical instability near t = T.
        beta = min(beta.item(), max_beta)
        betas.append(beta)

    return torch.tensor(betas, dtype=torch.float32)

alpha_bars = cosine_alpha_bars(T)
betas = betas_from_alpha_bars(alpha_bars)
```

The cosine schedule is commonly used because it allocates noise levels more evenly in terms of useful denoising difficulty.

Quadratic and Sigmoid Schedules

A quadratic schedule makes $\beta_t$ grow slowly at first and faster later.

One simple construction is

$$ \beta_t = \left( \sqrt{\beta_{\min}} + \frac{t-1}{T-1}\,\bigl(\sqrt{\beta_{\max}} - \sqrt{\beta_{\min}}\bigr) \right)^2. $$

In PyTorch:

```python
def quadratic_beta_schedule(T, beta_min=1e-4, beta_max=2e-2):
    return torch.linspace(
        beta_min ** 0.5,
        beta_max ** 0.5,
        T
    ) ** 2
```

A sigmoid schedule changes slowly near the beginning and end, and faster in the middle:

```python
def sigmoid_beta_schedule(T, beta_min=1e-4, beta_max=2e-2):
    x = torch.linspace(-6, 6, T)
    betas = torch.sigmoid(x)
    betas = betas * (beta_max - beta_min) + beta_min
    return betas
```

These schedules are less canonical than the linear and cosine schedules, but they illustrate the design freedom. What matters most is the induced path of $\bar{\alpha}_t$ and SNR.

Schedules in Continuous Time

Discrete diffusion uses timesteps

$$ t = 1, \ldots, T. $$

Continuous-time diffusion replaces the discrete schedule with continuous functions. Instead of $\beta_t$, we define a noise rate $\beta(t)$, where

$$ t \in [0, 1]. $$

A common continuous forward SDE is

$$ dx = -\frac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw. $$

The function $\beta(t)$ controls the rate at which noise is added.

Continuous schedules are useful because they allow the reverse process to be solved with numerical ODE or SDE solvers. They also separate the training noise distribution from the number of sampling steps.
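A minimal Euler-Maruyama simulation of this forward SDE can make the continuous view concrete. The linear $\beta(t)$ below, with endpoints 0.1 and 20, is an assumption chosen only for illustration:

```python
import torch

def beta_fn(t, beta_min=0.1, beta_max=20.0):
    # Linear continuous noise rate on t in [0, 1] (illustrative values).
    return beta_min + t * (beta_max - beta_min)

def forward_sde_sample(x0, n_steps=500):
    # Euler-Maruyama discretization of
    # dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw.
    x = x0.clone()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        b = beta_fn(t)
        noise = torch.randn_like(x)
        x = x + (-0.5 * b * x) * dt + (b * dt) ** 0.5 * noise
    return x

x0 = torch.randn(8, 3)
x1 = forward_sde_sample(x0)  # approximately standard Gaussian at t = 1
```

With these rates, essentially no signal survives to $t = 1$, matching the requirement that the endpoint be close to pure noise.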

Variance-Preserving and Variance-Exploding Schedules

Two major families of score-based diffusion processes are variance-preserving and variance-exploding schedules.

In a variance-preserving process, the total variance of $x_t$ remains approximately constant. The DDPM forward process is variance-preserving because it scales the signal while adding noise:

$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon. $$

If $x_0$ and $\epsilon$ have unit variance, then $x_t$ also has unit variance.

In a variance-exploding process, noise variance increases over time without shrinking the original signal in the same way:

$$ x_t = x_0 + \sigma(t)\,\epsilon. $$

Here $\sigma(t)$ grows from a small value to a large value.

| Process | Form | Behavior |
| --- | --- | --- |
| Variance-preserving | $\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ | Signal shrinks, noise grows |
| Variance-exploding | $x_0 + \sigma(t)\,\epsilon$ | Signal remains, noise scale grows |
| Sub-VP | Modified VP process | Often used for likelihood and SDE variants |

These families lead to different reverse dynamics and sampler designs.
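The variance behavior of the two families can be verified numerically. This is a sketch with arbitrary $\bar{\alpha}_t$ and $\sigma(t)$ values chosen for illustration:

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(100_000)   # unit-variance "data"
eps = torch.randn(100_000)

# Variance-preserving: total variance stays near 1 for any alpha_bar.
alpha_bar = 0.3
x_vp = alpha_bar ** 0.5 * x0 + (1 - alpha_bar) ** 0.5 * eps

# Variance-exploding: variance grows as 1 + sigma^2.
sigma = 5.0
x_ve = x0 + sigma * eps
```

Here `x_vp.var()` stays near 1, while `x_ve.var()` is near $1 + \sigma^2 = 26$.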

Log-SNR Parameterization

Many modern diffusion formulations parameterize noise using log-SNR:

$$ \lambda_t = \log \frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}. $$

This is useful because $\mathrm{SNR}(t)$ can span many orders of magnitude. Taking the logarithm gives a more numerically manageable scale.

From log-SNR, we can recover:

$$ \bar{\alpha}_t = \sigma(\lambda_t), $$

where $\sigma$ is the logistic sigmoid function:

$$ \sigma(\lambda) = \frac{1}{1+\exp(-\lambda)}. $$

Log-SNR is especially useful in continuous-time diffusion, velocity prediction, and modern sampler analysis.
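The mapping between $\bar{\alpha}_t$ and $\lambda_t$ is easy to implement. A short sketch, with the function names being assumptions for illustration:

```python
import torch

def log_snr_from_alpha_bar(alpha_bar):
    # lambda = log(alpha_bar / (1 - alpha_bar)), i.e. the logit.
    return torch.log(alpha_bar) - torch.log1p(-alpha_bar)

def alpha_bar_from_log_snr(lam):
    # Inverse mapping: alpha_bar = sigmoid(lambda).
    return torch.sigmoid(lam)

alpha_bar = torch.tensor([0.999, 0.5, 0.001])
lam = log_snr_from_alpha_bar(alpha_bar)
recovered = alpha_bar_from_log_snr(lam)
```

Note that $\bar{\alpha}_t = 0.5$ corresponds to $\lambda_t = 0$, the point where signal and noise power are equal.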

Timestep Sampling During Training

The schedule defines available noise levels, but training also requires choosing which timesteps to sample.

The simplest choice is uniform timestep sampling:

$$ t \sim \mathrm{Uniform}\{1, \ldots, T\}. $$

In PyTorch:

```python
# Zero-based indices 0, ..., T-1 that index into the schedule tensors.
t = torch.randint(0, T, (batch_size,), device=device)
```

Uniform sampling works well enough for many models. However, not all timesteps contribute equally to learning. Some noise levels may have larger gradients or more difficult prediction targets.

Alternative strategies include:

| Strategy | Idea |
| --- | --- |
| Uniform sampling | Sample all timesteps equally |
| Loss-aware sampling | Sample timesteps with high loss more often |
| SNR-weighted objectives | Reweight loss by noise level |
| Importance sampling | Allocate training to useful noise regimes |
| Continuous noise sampling | Sample $\sigma$ or log-SNR from a continuous distribution |

Timestep sampling and loss weighting are closely linked. Changing either one changes which noise levels dominate training.
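As one example of the last strategy, continuous-time training often draws the noise scale from a log-normal distribution so that mid-range noise levels dominate. The mean and standard deviation below are illustrative assumptions, not canonical values:

```python
import torch

def sample_sigmas(batch_size, loc=-1.2, scale=1.2):
    # Draw log(sigma) ~ Normal(loc, scale), then exponentiate,
    # concentrating training on intermediate noise levels.
    log_sigma = loc + scale * torch.randn(batch_size)
    return torch.exp(log_sigma)

sigmas = sample_sigmas(256)
```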

Loss Weighting and Schedule Interaction

The usual noise prediction loss is

$$ \mathcal{L} = \mathbb{E}\left[ \|\epsilon - \epsilon_\theta(x_t, t)\|_2^2 \right]. $$

Although this loss appears uniform across timesteps, the effective learning pressure depends on the schedule.

At high SNR, the model sees samples close to data and must predict small corruption. At low SNR, the model sees almost pure noise and must infer structure from little signal.

Some training objectives use explicit SNR weights:

$$ \mathcal{L} = w(t)\, \|\epsilon - \epsilon_\theta(x_t, t)\|_2^2. $$

Common choices downweight very high or very low SNR regions to prevent unstable or unhelpful gradients.

For example, min-SNR weighting clips the SNR contribution so that intermediate noise levels receive stronger emphasis.
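A sketch of min-SNR style weighting for noise prediction follows. The clipping threshold `gamma` and the exact weight form are assumptions based on the common formulation, so treat the details as illustrative:

```python
import torch

def min_snr_weight(alpha_bars, t, gamma=5.0):
    # For epsilon prediction, a min-SNR style weight is
    # w(t) = min(SNR(t), gamma) / SNR(t),
    # which downweights very high-SNR (nearly clean) timesteps.
    snr = alpha_bars[t] / (1.0 - alpha_bars[t])
    return torch.clamp(snr, max=gamma) / snr

betas = torch.linspace(1e-4, 2e-2, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

t = torch.tensor([0, 500, 999])
w = min_snr_weight(alpha_bars, t)  # small at t=0, exactly 1 at t=999
```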

Schedule Effects on Sample Quality

The schedule affects generation in several ways.

If noise grows too quickly, the model loses information early. Denoising becomes difficult because adjacent timesteps differ too much.

If noise grows too slowly, many timesteps are wasted on almost identical noise levels. Training and sampling become inefficient.

If the final noise level is insufficient, $x_T$ retains data information. This breaks the assumption that generation can start from standard Gaussian noise.

A well-designed schedule should satisfy:

| Requirement | Reason |
| --- | --- |
| Smooth corruption | Reverse transitions remain learnable |
| Full destruction by $T$ | Sampling can start from Gaussian noise |
| Useful SNR coverage | Model learns coarse and fine denoising |
| Numerical stability | Avoid extreme coefficients |
| Compatibility with sampler | Reverse solver works accurately |

Schedules for Latent Diffusion

Latent diffusion applies the diffusion process in a compressed latent space instead of pixel space.

The forward equation is unchanged:

$$ z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon. $$

However, the latent distribution may have different statistics than pixel data. If the autoencoder normalizes latents carefully, a standard schedule may work. If latent scale differs, the schedule may need adjustment.

In practice, latent diffusion often uses schedules inherited from image diffusion but tuned for the latent representation and sampler.

The key point is that schedules operate on the representation being diffused, not on the original raw data.

Schedules and Fast Samplers

Training may use many timesteps, such as $T = 1000$. Sampling often uses fewer steps, such as 20 to 50.

Fast samplers select a subset of noise levels from the training schedule:

$$ T = t_K > t_{K-1} > \cdots > t_0 = 0. $$

The quality of fast sampling depends on how these timesteps are spaced.

Common spacing methods include:

| Spacing | Behavior |
| --- | --- |
| Uniform in timestep | Simple but may waste steps |
| Uniform in log-SNR | Often better coverage of denoising difficulty |
| Quadratic spacing | More steps near low-noise regions |
| Solver-adaptive spacing | Chosen by numerical solver |

For few-step generation, schedule design becomes more important. Each step must cover a larger interval in noise space.
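Selecting a timestep subset under two of these spacings can be sketched as follows. The helper name and exact rounding are assumptions for illustration:

```python
import torch

def select_timesteps(T, K, spacing="uniform"):
    # Returns K timestep indices in descending order (T-1 down to 0).
    if spacing == "uniform":
        ts = torch.linspace(T - 1, 0, K)
    elif spacing == "quadratic":
        # Squaring a linear ramp puts more steps near t = 0,
        # i.e. the low-noise end of the schedule.
        ts = torch.linspace((T - 1) ** 0.5, 0, K) ** 2
    else:
        raise ValueError(f"unknown spacing: {spacing}")
    return ts.round().long()

steps = select_timesteps(1000, 20, spacing="quadratic")
```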

Practical PyTorch Schedule Class

A small schedule helper can centralize the required tensors:

```python
import torch

class DiffusionSchedule:
    def __init__(self, betas):
        self.betas = betas
        self.alphas = 1.0 - betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)

        self.sqrt_alpha_bars = torch.sqrt(self.alpha_bars)
        self.sqrt_one_minus_alpha_bars = torch.sqrt(
            1.0 - self.alpha_bars
        )

    def to(self, device):
        for name, value in vars(self).items():
            if torch.is_tensor(value):
                setattr(self, name, value.to(device))
        return self

    def extract(self, values, t, x_shape):
        # Gather per-sample schedule values and reshape for broadcasting.
        batch_size = t.shape[0]
        out = values.gather(0, t)
        return out.reshape(
            batch_size,
            *((1,) * (len(x_shape) - 1))
        )

    def q_sample(self, x0, t, noise=None):
        if noise is None:
            noise = torch.randn_like(x0)

        a = self.extract(self.sqrt_alpha_bars, t, x0.shape)
        b = self.extract(self.sqrt_one_minus_alpha_bars, t, x0.shape)

        return a * x0 + b * noise
```

Usage:

```python
betas = torch.linspace(1e-4, 2e-2, 1000)
schedule = DiffusionSchedule(betas).to(device)

x_t = schedule.q_sample(x0, t)
```

This structure keeps schedule math separate from model code.

Common Implementation Errors

Noise schedules are simple, but implementation mistakes are common.

| Error | Consequence |
| --- | --- |
| Off-by-one timestep indexing | Wrong noise level during training or sampling |
| Forgetting the cumulative product | Uses one-step noise instead of total noise |
| Wrong device placement | CPU/GPU tensor mismatch |
| Wrong broadcast shape | Schedule values applied incorrectly |
| Excessive $\beta_t$ | Numerical instability |
| Final $\bar{\alpha}_T$ too large | Endpoint still contains signal |
| Mixing zero-based and one-based notation | Incorrect formulas in code |

A useful check is to inspect $\bar{\alpha}_t$, $1-\bar{\alpha}_t$, and SNR over time. They should change smoothly and monotonically.
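These checks are easy to automate. A minimal sketch, assuming a discrete schedule stored as a tensor of $\beta_t$ values:

```python
import torch

def check_schedule(betas, max_final_alpha_bar=1e-2):
    # Basic sanity checks for a discrete noise schedule.
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    snr = alpha_bars / (1.0 - alpha_bars)

    assert (betas > 0).all() and (betas < 1).all()
    # Cumulative signal should decay monotonically...
    assert (alpha_bars[1:] < alpha_bars[:-1]).all()
    # ...and be close to zero at the endpoint.
    assert alpha_bars[-1] < max_final_alpha_bar
    # SNR should also decrease monotonically.
    assert (snr[1:] < snr[:-1]).all()

check_schedule(torch.linspace(1e-4, 2e-2, 1000))
```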

Summary

A noise schedule defines how the forward diffusion process corrupts data. The basic schedule is the sequence of one-step variances $\beta_t$, but most analysis is clearer through the cumulative signal coefficient $\bar{\alpha}_t$ and the SNR.

Linear schedules are simple. Cosine schedules often allocate denoising difficulty more effectively. Continuous-time models describe schedules with smooth noise functions or log-SNR curves.

Schedule design affects training stability, sampling quality, and inference speed. A good schedule makes the reverse process learnable, destroys the data distribution by the final timestep, and covers useful noise levels for both coarse structure and fine detail.