# Noise Schedules

A diffusion model needs a rule for how noise increases during the forward process. This rule is called the noise schedule. It determines how quickly clean data is corrupted, how much signal remains at each timestep, and how difficult each denoising task becomes.

The forward process is

$$
q(x_t\mid x_{t-1}) =
\mathcal{N}
\left(
x_t;
\sqrt{1-\beta_t}x_{t-1},
\beta_t I
\right).
$$

The sequence

$$
\beta_1,\beta_2,\ldots,\beta_T
$$

is the noise schedule. Each $\beta_t$ controls the variance of Gaussian noise added at step $t$.

A good schedule should corrupt data gradually. Early timesteps should preserve most of the signal. Later timesteps should remove almost all information so that $x_T$ becomes close to standard Gaussian noise.

### From Step Noise to Cumulative Noise

The one-step noise coefficient is $\beta_t$. It is often more useful to track the cumulative signal that remains after many steps.

Define

$$
\alpha_t = 1-\beta_t.
$$

Then define the cumulative product

$$
\bar{\alpha}_t =
\prod_{s=1}^{t}\alpha_s.
$$

The direct sampling formula is

$$
x_t =
\sqrt{\bar{\alpha}_t}x_0
+
\sqrt{1-\bar{\alpha}_t}\epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I).
$$

The term $\bar{\alpha}_t$ controls the remaining signal power. The term $1-\bar{\alpha}_t$ controls the noise power.

Thus the schedule can be described in two equivalent ways:

| Quantity | Meaning |
|---|---|
| $\beta_t$ | Noise added at one step |
| $\alpha_t$ | Signal retained at one step |
| $\bar{\alpha}_t$ | Signal retained after $t$ steps |
| $1-\bar{\alpha}_t$ | Noise accumulated after $t$ steps |

Most schedule design is easier to understand through $\bar{\alpha}_t$ or signal-to-noise ratio.
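
The cumulative product appears because Gaussian noise from successive steps composes into a single Gaussian. A quick numerical sketch, using two arbitrary $\beta$ values, confirms that the variance from composing two forward steps matches the closed form $1-\bar{\alpha}_2$:

```python
import torch

# two arbitrary one-step noise levels
betas = torch.tensor([0.1, 0.2])
alphas = 1.0 - betas

# composing two forward steps:
# x2 = sqrt(a2) * (sqrt(a1) * x0 + sqrt(b1) * e1) + sqrt(b2) * e2
# so the total noise variance is a2 * b1 + b2
var_composed = alphas[1] * betas[0] + betas[1]

# closed form: 1 - alpha_bar_2
var_closed = 1.0 - alphas[0] * alphas[1]

assert torch.isclose(var_composed, var_closed)
```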

### Signal-to-Noise Ratio

The signal-to-noise ratio at timestep $t$ is

$$
\mathrm{SNR}(t) =
\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}.
$$

When $\bar{\alpha}_t$ is close to 1, the SNR is high. The noisy sample still resembles the original data. When $\bar{\alpha}_t$ is close to 0, the SNR is low. The sample is mostly noise.

SNR gives a clearer view of task difficulty.

| Region | SNR | Denoising task |
|---|---:|---|
| Early timesteps | High | Remove small perturbations |
| Middle timesteps | Moderate | Recover structure and texture |
| Late timesteps | Low | Infer global semantics from weak signal |

The reverse model must learn all three regimes. If the schedule allocates too few steps to one region, the model may learn that regime poorly.
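
A quick numerical sketch (assuming the standard linear $\beta$ schedule from $10^{-4}$ to $2\times 10^{-2}$ over 1000 steps, introduced below) shows that the SNR falls monotonically across these regimes, from thousands down to nearly zero:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

snr = alpha_bars / (1.0 - alpha_bars)

# SNR decreases monotonically over the forward process
assert torch.all(snr[1:] < snr[:-1])

# high at the start, near zero at the end
assert snr[0] > 1000
assert snr[-1] < 1e-3
```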

### Linear Beta Schedule

The simplest schedule increases $\beta_t$ linearly:

$$
\beta_t =
\beta_{\min}
+
\frac{t-1}{T-1}
(\beta_{\max}-\beta_{\min}).
$$

A common choice is

$$
\beta_{\min}=10^{-4},
\qquad
\beta_{\max}=2\times10^{-2}.
$$

In PyTorch:

```python
import torch

T = 1000

beta_min = 1e-4
beta_max = 2e-2

betas = torch.linspace(beta_min, beta_max, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
```

The linear beta schedule is easy to implement and works reasonably well. It was used in early denoising diffusion probabilistic models.

However, linear growth in $\beta_t$ does not imply linear change in perceptual noise or SNR. Because $\bar{\alpha}_t$ is a product of many $\alpha_t$ values, the cumulative signal decays unevenly: most of it is destroyed in the first portion of the process, while the final timesteps are nearly indistinguishable pure noise.
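
This uneven decay is easy to observe numerically. The sketch below recomputes the linear schedule and locates the timestep where half the signal power is already gone:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# the linear schedule destroys signal front-loaded:
# alpha_bar drops below 0.5 well before the halfway timestep
halfway_index = int((alpha_bars < 0.5).float().argmax().item())
assert halfway_index < T // 2

# and the endpoint is essentially pure noise
assert alpha_bars[-1] < 1e-3
```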

### Cosine Schedule

The cosine schedule defines the cumulative signal directly:

$$
\bar{\alpha}_t =
\frac{
f(t)
}{
f(0)
},
$$

where

$$
f(t) =
\cos^2
\left(
\frac{t/T+s}{1+s}
\cdot
\frac{\pi}{2}
\right).
$$

The small offset $s$ keeps $\beta_t$ from being vanishingly small near $t=0$, so the first steps still add a noticeable amount of noise.

This schedule tends to preserve signal more gently at early timesteps and produce better sample quality in many image models.

In PyTorch:

```python
import math
import torch

def cosine_alpha_bars(T, s=0.008):
    steps = torch.arange(T + 1, dtype=torch.float32)
    x = steps / T

    alpha_bars = torch.cos(
        ((x + s) / (1 + s)) * math.pi / 2
    ) ** 2

    alpha_bars = alpha_bars / alpha_bars[0]
    return alpha_bars
```

To recover $\beta_t$, use the relation

$$
\beta_t =
1-
\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}.
$$

In code:

```python
def betas_from_alpha_bars(alpha_bars, max_beta=0.999):
    betas = []

    for t in range(1, len(alpha_bars)):
        beta = 1.0 - alpha_bars[t] / alpha_bars[t - 1]
        beta = min(beta.item(), max_beta)
        betas.append(beta)

    return torch.tensor(betas, dtype=torch.float32)

alpha_bars = cosine_alpha_bars(T)
betas = betas_from_alpha_bars(alpha_bars)
```

The cosine schedule is commonly used because it allocates noise levels more evenly in terms of useful denoising difficulty.
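
A short sketch comparing the two schedules shows the difference directly: both end with almost no signal, but the cosine schedule keeps substantially more signal at early-to-middle timesteps.

```python
import math
import torch

T = 1000

# linear schedule
betas_lin = torch.linspace(1e-4, 2e-2, T)
abar_lin = torch.cumprod(1.0 - betas_lin, dim=0)

# cosine schedule (same construction as above)
s = 0.008
x = torch.arange(T + 1, dtype=torch.float32) / T
abar_cos = torch.cos(((x + s) / (1 + s)) * math.pi / 2) ** 2
abar_cos = abar_cos / abar_cos[0]

# both end with almost no signal
assert abar_lin[-1] < 1e-3
assert abar_cos[-1] < 1e-3

# the cosine schedule preserves more signal a quarter of the way through
assert abar_cos[T // 4] > abar_lin[T // 4]
```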

### Quadratic and Sigmoid Schedules

A quadratic schedule makes $\beta_t$ grow slowly at first and faster later.

One simple construction is

$$
\beta_t =
\left(
\sqrt{\beta_{\min}}
+
\frac{t-1}{T-1}
(\sqrt{\beta_{\max}}-\sqrt{\beta_{\min}})
\right)^2.
$$

In PyTorch:

```python
def quadratic_beta_schedule(T, beta_min=1e-4, beta_max=2e-2):
    return torch.linspace(
        beta_min ** 0.5,
        beta_max ** 0.5,
        T
    ) ** 2
```

A sigmoid schedule changes slowly near the beginning and end, and faster in the middle:

```python
def sigmoid_beta_schedule(T, beta_min=1e-4, beta_max=2e-2):
    x = torch.linspace(-6, 6, T)
    betas = torch.sigmoid(x)
    betas = betas * (beta_max - beta_min) + beta_min
    return betas
```

These schedules are less canonical than linear and cosine schedules, but they illustrate the design freedom. What matters most is the induced path of $\bar{\alpha}_t$ and SNR.

### Schedules in Continuous Time

Discrete diffusion uses timesteps

$$
t=1,\ldots,T.
$$

Continuous-time diffusion replaces the discrete schedule with continuous functions. Instead of $\beta_t$, we define a noise rate $\beta(t)$, where

$$
t\in[0,1].
$$

A common continuous forward SDE is

$$
dx =
-\frac{1}{2}\beta(t)x\,dt
+
\sqrt{\beta(t)}\,dw.
$$

The function $\beta(t)$ controls the rate at which noise is added.

Continuous schedules are useful because they allow the reverse process to be solved with numerical ODE or SDE solvers. They also separate the training noise distribution from the number of sampling steps.
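
As a sketch, the forward SDE above can be simulated with a simple Euler-Maruyama discretization. The `beta_min` and `beta_max` values here are common VP-SDE choices, assumed only for illustration. Starting from unit-variance data, the variance should stay near 1 throughout, and by $t=1$ almost no signal should remain:

```python
import torch

def simulate_vp_forward(x0, n_steps=1000, beta_min=0.1, beta_max=20.0):
    # Euler-Maruyama discretization of
    # dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw
    x = x0.clone()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        beta_t = beta_min + t * (beta_max - beta_min)
        noise = torch.randn_like(x)
        x = x - 0.5 * beta_t * x * dt + (beta_t * dt) ** 0.5 * noise
    return x

torch.manual_seed(0)
x0 = torch.randn(50_000)
xT = simulate_vp_forward(x0)

# variance-preserving: total variance stays close to 1
assert abs(xT.var().item() - 1.0) < 0.2

# almost no correlation with the original data remains
corr = (x0 * xT).mean() / (x0.std() * xT.std())
assert corr.abs().item() < 0.05
```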

### Variance-Preserving and Variance-Exploding Schedules

Two major families of score-based diffusion processes are variance-preserving and variance-exploding schedules.

In a variance-preserving process, the total variance of $x_t$ remains approximately constant. The DDPM forward process is variance-preserving because it scales the signal while adding noise:

$$
x_t =
\sqrt{\bar{\alpha}_t}x_0
+
\sqrt{1-\bar{\alpha}_t}\epsilon.
$$

If $x_0$ and $\epsilon$ have unit variance, then $x_t$ also has unit variance.

In a variance-exploding process, noise variance increases over time without shrinking the original signal in the same way:

$$
x_t = x_0 + \sigma(t)\epsilon.
$$

Here $\sigma(t)$ grows from a small value to a large value.

| Process | Form | Behavior |
|---|---|---|
| Variance-preserving | $\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon$ | Signal shrinks, noise grows |
| Variance-exploding | $x_0+\sigma(t)\epsilon$ | Signal remains, noise scale grows |
| Sub-VP | VP variant with smaller noise variance | Often used for likelihood-focused SDE models |

These families lead to different reverse dynamics and sampler designs.
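
The variance behavior of the two families can be verified with a quick Monte Carlo sketch; the particular values of $\bar{\alpha}$ and $\sigma$ below are arbitrary illustrations:

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(100_000)
eps = torch.randn(100_000)

# variance-preserving sample at alpha_bar = 0.5
abar = 0.5
x_vp = abar ** 0.5 * x0 + (1 - abar) ** 0.5 * eps

# variance-exploding sample at sigma = 3
sigma = 3.0
x_ve = x0 + sigma * eps

# VP keeps total variance near 1; VE variance grows to 1 + sigma^2
assert abs(x_vp.var().item() - 1.0) < 0.05
assert abs(x_ve.var().item() - (1.0 + sigma ** 2)) < 0.5
```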

### Log-SNR Parameterization

Many modern diffusion formulations parameterize noise using log-SNR:

$$
\lambda_t =
\log
\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}.
$$

This is useful because $\mathrm{SNR}(t)$ can span many orders of magnitude. Taking the logarithm gives a more numerically manageable scale.

From log-SNR, we can recover:

$$
\bar{\alpha}_t =
\sigma(\lambda_t),
$$

where $\sigma$ is the logistic sigmoid function:

$$
\sigma(\lambda) =
\frac{1}{1+\exp(-\lambda)}.
$$

Log-SNR is especially useful in continuous-time diffusion, velocity prediction, and modern sampler analysis.
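
The round trip between $\bar{\alpha}_t$ and $\lambda_t$ is easy to check numerically, as in this sketch using the linear schedule from earlier:

```python
import torch

betas = torch.linspace(1e-4, 2e-2, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# log-SNR
lam = torch.log(alpha_bars / (1.0 - alpha_bars))

# recovering alpha_bar via the logistic sigmoid
recovered = torch.sigmoid(lam)
assert torch.allclose(recovered, alpha_bars, atol=1e-5)

# log-SNR spans a modest range even though SNR spans orders of magnitude
assert lam[0] > 5 and lam[-1] < -5
```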

### Timestep Sampling During Training

The schedule defines available noise levels, but training also requires choosing which timesteps to sample.

The simplest choice is uniform timestep sampling:

$$
t \sim \mathrm{Uniform}\{1,\ldots,T\}.
$$

In PyTorch:

```python
# zero-based indexing: code timestep t in {0, ..., T-1}
# corresponds to formula timestep t + 1 in {1, ..., T}
t = torch.randint(0, T, (batch_size,), device=device)
```

Uniform sampling works well enough for many models. However, not all timesteps contribute equally to learning. Some noise levels may have larger gradients or more difficult prediction targets.

Alternative strategies include:

| Strategy | Idea |
|---|---|
| Uniform sampling | Sample all timesteps equally |
| Loss-aware sampling | Sample timesteps with high loss more often |
| SNR-weighted objectives | Reweight loss by noise level |
| Importance sampling | Allocate training to useful noise regimes |
| Continuous noise sampling | Sample $\sigma$ or log-SNR from a continuous distribution |

Timestep sampling and loss weighting are closely linked. Changing either one changes which noise levels dominate training.
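
As an illustration of loss-aware sampling, the hypothetical minimal sketch below (not any specific published implementation) keeps an exponential moving average of the loss at each timestep and samples timesteps proportionally to it:

```python
import torch

class LossAwareSampler:
    """Sample timesteps proportionally to a running per-timestep loss."""

    def __init__(self, T, smoothing=0.99):
        self.smoothing = smoothing
        self.avg_loss = torch.ones(T)  # optimistic init: all timesteps equal

    def sample(self, batch_size):
        probs = self.avg_loss / self.avg_loss.sum()
        return torch.multinomial(probs, batch_size, replacement=True)

    def update(self, t, losses):
        # exponential moving average of observed losses per timestep
        for ti, loss in zip(t.tolist(), losses.tolist()):
            self.avg_loss[ti] = (
                self.smoothing * self.avg_loss[ti]
                + (1 - self.smoothing) * loss
            )

sampler = LossAwareSampler(T=1000)
t = sampler.sample(batch_size=8)
assert t.shape == (8,)
assert (t >= 0).all() and (t < 1000).all()
```

Note that sampling non-uniformly biases the objective; practical versions compensate with importance weights.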

### Loss Weighting and Schedule Interaction

The usual noise prediction loss is

$$
\mathcal{L} =
\mathbb{E}
\left[
\|\epsilon-\epsilon_\theta(x_t,t)\|_2^2
\right].
$$

Although this loss appears uniform across timesteps, the effective learning pressure depends on the schedule.

At high SNR, the model sees samples close to data and must predict small corruption. At low SNR, the model sees almost pure noise and must infer structure from little signal.

Some training objectives use explicit SNR weights:

$$
\mathcal{L} =
w(t)
\|\epsilon-\epsilon_\theta(x_t,t)\|_2^2.
$$

Common choices downweight very high or very low SNR regions to prevent unstable or unhelpful gradients.

For example, min-SNR weighting clips the SNR contribution so that intermediate noise levels receive stronger emphasis.
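
One common form of min-SNR weighting for $\epsilon$-prediction divides the clipped SNR by the full SNR, so very high-SNR timesteps are downweighted while low-SNR timesteps keep weight 1. The sketch below assumes this form; the clipping constant `gamma` is a tunable hyperparameter, often around 5:

```python
import torch

def min_snr_weight(alpha_bars, gamma=5.0):
    # w(t) = min(SNR, gamma) / SNR  (epsilon-prediction form)
    snr = alpha_bars / (1.0 - alpha_bars)
    return torch.clamp(snr, max=gamma) / snr

betas = torch.linspace(1e-4, 2e-2, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)
w = min_snr_weight(alpha_bars)

# high-SNR (early) timesteps are strongly downweighted
assert w[0] < 0.01
# low-SNR (late) timesteps keep full weight
assert torch.isclose(w[-1], torch.tensor(1.0))
assert (w <= 1.0 + 1e-6).all()
```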

### Schedule Effects on Sample Quality

The schedule affects generation in several ways.

If noise grows too quickly, the model loses information early. Denoising becomes difficult because adjacent timesteps differ too much.

If noise grows too slowly, many timesteps are wasted on almost identical noise levels. Training and sampling become inefficient.

If the final noise level is insufficient, $x_T$ retains data information. This breaks the assumption that generation can start from standard Gaussian noise.

A well-designed schedule should satisfy:

| Requirement | Reason |
|---|---|
| Smooth corruption | Reverse transitions remain learnable |
| Full destruction by $T$ | Sampling can start from Gaussian noise |
| Useful SNR coverage | Model learns coarse and fine denoising |
| Numerical stability | Avoid extreme coefficients |
| Compatibility with sampler | Reverse solver works accurately |

### Schedules for Latent Diffusion

Latent diffusion applies the diffusion process in a compressed latent space instead of pixel space.

The forward equation is unchanged:

$$
z_t =
\sqrt{\bar{\alpha}_t}z_0
+
\sqrt{1-\bar{\alpha}_t}\epsilon.
$$

However, the latent distribution may have different statistics than pixel data. If the autoencoder normalizes latents carefully, a standard schedule may work. If latent scale differs, the schedule may need adjustment.

In practice, latent diffusion often uses schedules inherited from image diffusion but tuned for the latent representation and sampler.

The key point is that schedules operate on the representation being diffused, not on the original raw data.

### Schedules and Fast Samplers

Training may use many timesteps, such as $T=1000$. Sampling often uses fewer steps, such as 20 to 50.

Fast samplers select a subset of noise levels from the training schedule:

$$
T=t_K>t_{K-1}>\cdots>t_0=0.
$$

The quality of fast sampling depends on how these timesteps are spaced.

Common spacing methods include:

| Spacing | Behavior |
|---|---|
| Uniform in timestep | Simple but may waste steps |
| Uniform in log-SNR | Often better coverage of denoising difficulty |
| Quadratic spacing | More steps near low-noise regions |
| Solver-adaptive spacing | Chosen by numerical solver |

For few-step generation, schedule design becomes more important. Each step must cover a larger interval in noise space.
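
Spacing steps uniformly in log-SNR can be sketched as follows: compute the log-SNR of every training timestep, lay down evenly spaced targets, and pick the nearest training timestep to each target. The helper name here is illustrative, not a library function.

```python
import torch

def logsnr_spaced_timesteps(alpha_bars, K):
    # pick K timesteps whose log-SNR values are evenly spaced
    lam = torch.log(alpha_bars / (1.0 - alpha_bars))
    targets = torch.linspace(lam[0], lam[-1], K)
    # nearest training timestep for each target log-SNR
    return torch.argmin((lam[None, :] - targets[:, None]).abs(), dim=1)

betas = torch.linspace(1e-4, 2e-2, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)
steps = logsnr_spaced_timesteps(alpha_bars, K=20)

# covers the full schedule, in order
assert steps[0] == 0 and steps[-1] == len(alpha_bars) - 1
assert (steps[1:] >= steps[:-1]).all()
```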

### Practical PyTorch Schedule Class

A small schedule helper can centralize the required tensors:

```python
import torch

class DiffusionSchedule:
    def __init__(self, betas):
        self.betas = betas
        self.alphas = 1.0 - betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)

        self.sqrt_alpha_bars = torch.sqrt(self.alpha_bars)
        self.sqrt_one_minus_alpha_bars = torch.sqrt(
            1.0 - self.alpha_bars
        )

    def to(self, device):
        for name, value in vars(self).items():
            if torch.is_tensor(value):
                setattr(self, name, value.to(device))
        return self

    def extract(self, values, t, x_shape):
        batch_size = t.shape[0]
        out = values.gather(0, t)
        return out.reshape(
            batch_size,
            *((1,) * (len(x_shape) - 1))
        )

    def q_sample(self, x0, t, noise=None):
        if noise is None:
            noise = torch.randn_like(x0)

        a = self.extract(
            self.sqrt_alpha_bars,
            t,
            x0.shape
        )

        b = self.extract(
            self.sqrt_one_minus_alpha_bars,
            t,
            x0.shape
        )

        return a * x0 + b * noise
```

Usage:

```python
betas = torch.linspace(1e-4, 2e-2, 1000)
schedule = DiffusionSchedule(betas).to(device)

x_t = schedule.q_sample(x0, t)
```

This structure keeps schedule math separate from model code.

### Common Implementation Errors

Noise schedules are simple, but implementation mistakes are common.

| Error | Consequence |
|---|---|
| Off-by-one timestep indexing | Wrong noise level during training or sampling |
| Forgetting cumulative product | Uses one-step noise instead of total noise |
| Wrong device placement | CPU/GPU tensor mismatch |
| Wrong broadcast shape | Schedule values applied incorrectly |
| Excessive $\beta_t$ | Numerical instability |
| Final $\bar{\alpha}_T$ too large | Endpoint still contains signal |
| Mixing zero-based and one-based notation | Incorrect formulas in code |

A useful check is to inspect $\bar{\alpha}_t$, $1-\bar{\alpha}_t$, and SNR over time. They should change smoothly and monotonically.

### Summary

A noise schedule defines how the forward diffusion process corrupts data. The basic schedule is the sequence of one-step variances $\beta_t$, but most analysis is clearer through the cumulative signal coefficient $\bar{\alpha}_t$ and SNR.

Linear schedules are simple. Cosine schedules often allocate denoising difficulty more effectively. Continuous-time models describe schedules with smooth noise functions or log-SNR curves.

Schedule design affects training stability, sampling quality, and inference speed. A good schedule makes the reverse process learnable, destroys the data distribution by the final timestep, and covers useful noise levels for both coarse structure and fine detail.

