# Forward Diffusion Processes

Diffusion models are generative models built around a simple idea: learn to reverse a gradual corruption process. The forward process starts with a clean data sample and repeatedly adds noise. After many small noise steps, the sample becomes almost indistinguishable from pure Gaussian noise. The model is then trained to invert this process, step by step, until noise becomes data.

In this section, we study the forward diffusion process. In the standard denoising diffusion probabilistic model, this process is fixed: it has no learned parameters. Its job is to define how data is destroyed; the reverse model will later learn how to undo that destruction.

### Data as a Random Variable

Let $x_0$ denote a clean data sample. For an image model, $x_0$ may be an image tensor. For an audio model, it may be a waveform. For a latent diffusion model, it may be a latent representation produced by an encoder.

We treat $x_0$ as a random variable drawn from the data distribution:

$$
x_0 \sim q(x_0).
$$

The distribution $q(x_0)$ is the true data distribution. We cannot usually write it down in closed form. We only have samples from it, such as images in a dataset.

The purpose of a generative model is to learn a distribution $p_\theta(x_0)$ that approximates $q(x_0)$. Diffusion models do this indirectly. Instead of learning to produce clean data in one step, they learn a chain of denoising transitions.

The forward process defines a sequence of increasingly noisy variables:

$$
x_0, x_1, x_2, \ldots, x_T.
$$

Here $x_t$ is the noisy version of the original data after $t$ diffusion steps. The index $t$ is often called the timestep.

### The Markov Forward Process

The standard forward diffusion process is a Markov chain. This means that each noisy state depends only on the previous state:

$$
q(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_0) =
q(x_t \mid x_{t-1}).
$$

The full forward process factors as

$$
q(x_{1:T}\mid x_0) =
\prod_{t=1}^{T} q(x_t\mid x_{t-1}).
$$

Each transition adds a small amount of Gaussian noise:

$$
q(x_t\mid x_{t-1}) =
\mathcal{N}
\left(
x_t;
\sqrt{1-\beta_t}\,x_{t-1},
\beta_t I
\right).
$$

The scalar $\beta_t\in(0,1)$ controls how much noise is added at step $t$. The covariance $\beta_t I$ means that independent Gaussian noise with variance $\beta_t$ is added to each coordinate of the tensor.

Equivalently, we can sample $x_t$ by writing

$$
x_t =
\sqrt{1-\beta_t}\,x_{t-1}
+
\sqrt{\beta_t}\,\epsilon_t,
\qquad
\epsilon_t\sim\mathcal{N}(0,I).
$$

This equation says that $x_t$ is a weighted combination of a slightly scaled copy of $x_{t-1}$ and fresh Gaussian noise.

The scaling factor $\sqrt{1-\beta_t}$ is important. Without it, the variance of $x_t$ would grow without bound as noise accumulates. With it, the process is variance-preserving: if $x_{t-1}$ has unit variance, then $\mathrm{Var}(x_t) = (1-\beta_t) + \beta_t = 1$.
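
For intuition, the chain can also be simulated literally, one noisy step at a time. The sketch below assumes a precomputed 1-D tensor `betas` of per-step noise levels (a concrete schedule is defined in the implementation section later); the function `forward_chain` is introduced here only for illustration.

```python
import torch

def forward_chain(x0, betas):
    """Simulate x_1, ..., x_T step by step from x_0.

    Each step applies x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps.
    This is for illustration only; training uses the closed-form marginal
    derived later instead of running the full chain.
    """
    trajectory = [x0]
    x = x0
    for beta in betas:
        eps = torch.randn_like(x)
        x = torch.sqrt(1.0 - beta) * x + torch.sqrt(beta) * eps
        trajectory.append(x)
    return trajectory
```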

### Noise Schedules

The sequence

$$
\beta_1,\beta_2,\ldots,\beta_T
$$

is called the noise schedule.

A small $\beta_t$ adds little noise. A large $\beta_t$ adds more noise. Usually, the schedule starts with small noise and increases over time.

Common schedules include:

| Schedule | Description |
|---|---|
| Linear schedule | $\beta_t$ increases linearly from a small value to a larger value |
| Cosine schedule | Noise grows according to a cosine-shaped cumulative schedule |
| Quadratic schedule | Noise increases slowly at first and faster later |
| Sigmoid schedule | Noise changes slowly near the beginning and end, faster in the middle |

The schedule affects both training and sampling. If noise is added too quickly, the reverse process becomes hard to learn. If noise is added too slowly, training and sampling require many steps.

In early diffusion models, $T$ was often set to 1000. Modern samplers may use fewer reverse steps during inference, but the forward training formulation still depends on a chosen continuous or discrete noise schedule.
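
As a concrete example, here is a minimal sketch of the cosine schedule proposed by Nichol and Dhariwal (2021); the offset `s = 0.008` and the clipping of $\beta_t$ at 0.999 follow their suggestions.

```python
import math
import torch

def cosine_schedule(T, s=0.008):
    # alpha_bar follows a squared-cosine curve in t, normalized so that
    # alpha_bar_0 = 1; the betas are recovered from consecutive ratios.
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bars = f / f[0]
    betas = 1.0 - alpha_bars[1:] / alpha_bars[:-1]
    return betas.clamp(max=0.999)  # avoid beta_t = 1 at the final steps
```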

### Alpha Notation

It is convenient to define

$$
\alpha_t = 1-\beta_t.
$$

Then the one-step transition becomes

$$
q(x_t\mid x_{t-1}) =
\mathcal{N}
\left(
x_t;
\sqrt{\alpha_t}\,x_{t-1},
(1-\alpha_t)I
\right).
$$

We also define the cumulative product

$$
\bar{\alpha}_t =
\prod_{s=1}^{t}\alpha_s.
$$

The quantity $\bar{\alpha}_t$ measures how much of the original signal remains after $t$ noising steps. Since each $\alpha_s$ is less than 1, $\bar{\alpha}_t$ decreases as $t$ increases.

When $t=0$, we define

$$
\bar{\alpha}_0 = 1.
$$

At small $t$, $\bar{\alpha}_t$ is close to 1, so most of the original signal remains. At large $t$, $\bar{\alpha}_t$ is close to 0, so very little of the original signal remains.
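
For a sense of scale, suppose the noise level were held constant at $\beta_t = 0.02$ (a simplification used only for this illustration). Then

$$
\bar{\alpha}_t = 0.98^{\,t},
\qquad
\bar{\alpha}_{100} \approx 0.13,
\qquad
\bar{\alpha}_{500} \approx 4\times 10^{-5},
$$

so after a few hundred steps almost none of the original signal survives.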

### Direct Sampling from Any Timestep

A useful property of the forward process is that we can sample $x_t$ directly from $x_0$, without simulating all intermediate steps.

The marginal distribution is

$$
q(x_t\mid x_0) =
\mathcal{N}
\left(
x_t;
\sqrt{\bar{\alpha}_t}\,x_0,
(1-\bar{\alpha}_t)I
\right).
$$

Equivalently,

$$
x_t =
\sqrt{\bar{\alpha}_t}\,x_0
+
\sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I).
$$

This formula is one of the main reasons diffusion training is practical. During training, we can sample a timestep $t$, sample noise $\epsilon$, construct $x_t$ directly, and ask the model to predict the noise or the clean data.

We do not need to run the forward Markov chain step by step for every training example.
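
To see where the marginal comes from, compose two forward steps and use the fact that a sum of independent Gaussians is again Gaussian, with variances adding:

$$
\begin{aligned}
x_2 &= \sqrt{\alpha_2}\,x_1 + \sqrt{1-\alpha_2}\,\epsilon_2 \\
&= \sqrt{\alpha_2}\left(\sqrt{\alpha_1}\,x_0 + \sqrt{1-\alpha_1}\,\epsilon_1\right) + \sqrt{1-\alpha_2}\,\epsilon_2 \\
&= \sqrt{\alpha_1\alpha_2}\,x_0 + \sqrt{1-\alpha_1\alpha_2}\,\epsilon,
\end{aligned}
$$

since the two noise terms are independent and their variances sum to $\alpha_2(1-\alpha_1) + (1-\alpha_2) = 1 - \alpha_1\alpha_2$. Repeating the argument for $t$ steps gives the marginal with $\bar{\alpha}_t$.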

### Signal-to-Noise Ratio

The noised sample

$$
x_t =
\sqrt{\bar{\alpha}_t}\,x_0
+
\sqrt{1-\bar{\alpha}_t}\,\epsilon
$$

contains two components: signal and noise.

The signal coefficient is

$$
\sqrt{\bar{\alpha}_t}.
$$

The noise coefficient is

$$
\sqrt{1-\bar{\alpha}_t}.
$$

A useful measure is the signal-to-noise ratio:

$$
\mathrm{SNR}(t) =
\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}.
$$

When $t$ is small, $\bar{\alpha}_t$ is close to 1, so the SNR is high. The sample still looks close to data. When $t$ is large, $\bar{\alpha}_t$ is close to 0, so the SNR is low. The sample is mostly noise.

The model must learn denoising behavior across a wide range of SNR values. At early timesteps, the task is local refinement. At late timesteps, the task requires global structure.
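
As a small numerical illustration, the SNR can be computed directly from the cumulative products. The snippet below uses the same linear schedule that appears in the implementation section that follows; the printed values only indicate the overall trend.

```python
import torch

# Linear beta schedule, then the signal-to-noise ratio at each timestep.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)
snr = alpha_bars / (1.0 - alpha_bars)

print(snr[0].item(), snr[T // 2].item(), snr[-1].item())
# Very high SNR at the first step, low SNR by the midpoint,
# near-zero SNR at the final step.
```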

### What Happens as $T$ Becomes Large

The forward process is designed so that the distribution of $x_T$ is close to a standard Gaussian:

$$
q(x_T \mid x_0) \approx \mathcal{N}(x_T;\, 0, I).
$$

This is achieved by choosing a schedule such that

$$
\bar{\alpha}_T \approx 0.
$$

Using the direct sampling formula,

$$
x_T =
\sqrt{\bar{\alpha}_T}\,x_0
+
\sqrt{1-\bar{\alpha}_T}\,\epsilon.
$$

If $\bar{\alpha}_T$ is near zero, then

$$
x_T \approx \epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I).
$$

Thus the endpoint of the forward process is easy to sample from. This is crucial. To generate new data, the reverse process can start from Gaussian noise and gradually denoise it.

### PyTorch Implementation of Forward Diffusion

The forward diffusion process can be implemented directly in PyTorch.

First define a noise schedule:

```python
import torch

T = 1000
beta_start = 1e-4
beta_end = 2e-2

betas = torch.linspace(beta_start, beta_end, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
```

Here `betas[t]` is $\beta_{t+1}$ under zero-based Python indexing, and `alpha_bars[t]` likewise stores the cumulative product $\bar{\alpha}_{t+1}$.

Now define a function that samples $x_t$ from $x_0$:

```python
def extract(a, t, x_shape):
    """
    Select values from a 1D schedule tensor and reshape them
    for broadcasting over an input batch.
    """
    batch_size = t.shape[0]
    out = a.gather(0, t)
    return out.reshape(batch_size, *((1,) * (len(x_shape) - 1)))

def q_sample(x0, t, alpha_bars):
    """
    Sample x_t directly from x_0 using

        x_t = sqrt(alpha_bar_t) x_0
              + sqrt(1 - alpha_bar_t) epsilon.
    """
    noise = torch.randn_like(x0)

    alpha_bar_t = extract(alpha_bars, t, x0.shape)

    mean_coeff = torch.sqrt(alpha_bar_t)
    noise_coeff = torch.sqrt(1.0 - alpha_bar_t)

    xt = mean_coeff * x0 + noise_coeff * noise
    return xt, noise
```

Suppose `x0` is a batch of images (here a random tensor stands in for a batch of normalized images):

```python
x0 = torch.randn(32, 3, 64, 64)

t = torch.randint(0, T, (32,))
xt, noise = q_sample(x0, t, alpha_bars)

print(xt.shape)     # torch.Size([32, 3, 64, 64])
print(noise.shape)  # torch.Size([32, 3, 64, 64])
```

Each image in the batch may use a different timestep. The function `extract` reshapes schedule values so that they broadcast correctly over image dimensions.
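
A quick sanity check, assuming `x0` is roughly unit-variance as with the random stand-in above: at the last timestep the noised sample should have standard deviation close to 1 and carry almost no signal.

```python
t_late = torch.full((32,), T - 1)
xt_late, _ = q_sample(x0, t_late, alpha_bars)

print(xt_late.std().item())     # close to 1.0
print(alpha_bars[-1].item())    # roughly 4e-5 for this linear schedule
```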

### Training View of the Forward Process

The forward process gives us a supervised learning problem.

Given clean data $x_0$, choose a random timestep $t$, sample Gaussian noise $\epsilon$, and form

$$
x_t =
\sqrt{\bar{\alpha}_t}\,x_0
+
\sqrt{1-\bar{\alpha}_t}\epsilon.
$$

Then train a neural network $\epsilon_\theta(x_t,t)$ to predict the noise $\epsilon$.

A common objective is

$$
\mathcal{L}(\theta) =
\mathbb{E}_{x_0,t,\epsilon}
\left[
\left\|
\epsilon -
\epsilon_\theta(x_t,t)
\right\|_2^2
\right].
$$

This objective asks the model to identify the noise component that was added to the data. Once the model can predict noise, it can be used to denoise $x_t$ during the reverse process.

In PyTorch, one training step may look like this:

```python
def diffusion_loss(model, x0, alpha_bars, T):
    batch_size = x0.shape[0]
    device = x0.device

    t = torch.randint(0, T, (batch_size,), device=device)
    xt, noise = q_sample(x0, t, alpha_bars.to(device))

    pred_noise = model(xt, t)

    loss = torch.nn.functional.mse_loss(pred_noise, noise)
    return loss
```

This is the core of denoising diffusion training. The full model still needs timestep embeddings, a denoising architecture such as a U-Net or transformer, and a reverse sampler. But the training signal comes directly from the forward corruption process.
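
To show how the pieces fit together, here is a minimal, hypothetical usage sketch. `ToyDenoiser` is a placeholder that ignores the timestep; a real model would be a U-Net or transformer conditioned on $t$.

```python
import torch.nn as nn

class ToyDenoiser(nn.Module):
    # Placeholder network: a single convolution that ignores t.
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x, t):
        return self.net(x)

model = ToyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x0 = torch.randn(8, 3, 32, 32)  # stand-in for a normalized image batch
loss = diffusion_loss(model, x0, alpha_bars, T)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```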

### Shape Conventions

For image diffusion, common PyTorch shapes are:

| Tensor | Shape | Meaning |
|---|---|---|
| `x0` | `[B, C, H, W]` | Clean image batch |
| `t` | `[B]` | Timestep for each image |
| `noise` | `[B, C, H, W]` | Gaussian noise |
| `xt` | `[B, C, H, W]` | Noisy image batch |
| `pred_noise` | `[B, C, H, W]` | Model prediction |

The timestep tensor `t` is integer-valued. The image and noise tensors are floating-point. Usually the image values are normalized, often to the range $[-1,1]$, before applying diffusion.
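
For example, 8-bit images with values in $[0,255]$ are commonly mapped to $[-1,1]$ before noising; this is one common convention rather than a requirement.

```python
def to_model_range(images_uint8):
    # Map 8-bit pixel values in [0, 255] to the [-1, 1] range
    # typically assumed for x_0.
    return images_uint8.float() / 127.5 - 1.0
```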

For latent diffusion, the same formulas apply, but `x0` may represent latent tensors rather than pixel images. A latent tensor might have shape

```python
[B, 4, 64, 64]
```

instead of

```python
[B, 3, 512, 512]
```

The diffusion process is mathematically unchanged. Only the representation space changes.

### Why Gaussian Noise Is Used

Gaussian noise has several useful properties.

First, it is mathematically tractable. Linear combinations of Gaussian variables remain Gaussian, which gives the closed-form expression for $q(x_t\mid x_0)$.

Second, the endpoint distribution is easy to sample from. We can start generation from

$$
x_T\sim\mathcal{N}(0,I).
$$

Third, Gaussian noise works well with squared error prediction objectives. Predicting the added noise with mean squared error is simple and stable.

Fourth, Gaussian diffusion connects to score matching. The reverse process can be interpreted as learning the score, which is the gradient of the log density with respect to the noisy input.
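
Concretely, the conditional score of the Gaussian marginal $q(x_t \mid x_0)$ has a closed form, so a network that predicts the noise $\epsilon$ is, up to a scaling factor, also estimating a score:

$$
\nabla_{x_t}\log q(x_t\mid x_0)
= -\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{1-\bar{\alpha}_t}
= -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}.
$$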

These properties make Gaussian diffusion a convenient foundation for high-quality generative modeling.

### Discrete Versus Continuous Time

The process described above uses discrete timesteps

$$
t=1,2,\ldots,T.
$$

Many modern formulations use continuous time instead. In continuous-time diffusion, the corruption process is described using stochastic differential equations. The discrete schedule becomes a continuous noise function.
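
For reference, in the limit of many small steps the discrete chain above corresponds to the variance-preserving stochastic differential equation of Song et al. (2021), where $\beta(t)$ is a continuous noise-rate function and $w$ is a standard Wiener process:

$$
\mathrm{d}x = -\tfrac{1}{2}\beta(t)\,x\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}w.
$$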

The main idea remains the same. Data is gradually transformed into noise. The model learns a reverse-time process that maps noise back to data.

Discrete-time diffusion is easier to implement and understand, so it is the usual starting point. Continuous-time formulations become important for advanced samplers, score-based models, and probability flow ODEs.

### Forward Diffusion as Information Destruction

The forward process gradually removes information about the original sample.

At early timesteps, fine details are degraded, but the object structure may remain visible. At middle timesteps, local texture and many edges disappear. At late timesteps, almost all recognizable structure is gone.

This matters because the reverse model must learn different denoising tasks at different timesteps. Near $t=T$, it must create global structure from weak signal. Near $t=0$, it must restore fine details.

The timestep $t$ therefore acts as a conditioning variable. The denoising model must know how much noise is present. Without timestep conditioning, the same input value could require different denoising behavior depending on the noise level.
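
One common way to expose the timestep to the network is a sinusoidal embedding, similar to transformer positional encodings. The sketch below is illustrative; the embedding dimension and frequency constants vary across implementations.

```python
import math
import torch

def timestep_embedding(t, dim=128):
    # Map integer timesteps [B] to vectors [B, dim] using sine/cosine
    # features at geometrically spaced frequencies.
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half
    )
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```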

### Summary

The forward diffusion process is a fixed Markov chain that gradually adds Gaussian noise to clean data. Each transition has the form

$$
q(x_t\mid x_{t-1}) =
\mathcal{N}
\left(
x_t;
\sqrt{1-\beta_t}x_{t-1},
\beta_t I
\right).
$$

By defining $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s$, we can sample any noisy timestep directly:

$$
x_t =
\sqrt{\bar{\alpha}_t}x_0
+
\sqrt{1-\bar{\alpha}_t}\epsilon.
$$

This direct formula makes diffusion training efficient. The model receives a noisy sample $x_t$ and the timestep $t$, and learns to predict the added noise, the clean data, or another equivalent parameterization.

The forward process does not generate data by itself. It defines the corruption path that the reverse model must learn to undo.

