# Denoising Autoencoders

A denoising autoencoder learns to reconstruct a clean input from a corrupted version of that input. Rather than receiving $x$ and simply copying it to $\hat{x}$, the model receives a noisy input $\tilde{x}$ and must recover the original $x$.

The encoder maps the corrupted input to a latent representation:

$$
z = f_\theta(\tilde{x}).
$$

The decoder reconstructs the clean input:

$$
\hat{x} = g_\phi(z).
$$

The training objective is

$$
\min_{\theta,\phi}
\frac{1}{N}
\sum_{i=1}^N
\|x_i - g_\phi(f_\theta(\tilde{x}_i))\|_2^2.
$$

The corruption process forces the model to learn stable structure in the data rather than memorizing individual entries. This makes denoising autoencoders important for representation learning, robustness, and generative modeling.

### Corruption as a Learning Signal

A standard autoencoder can learn an identity-like mapping when its capacity is high. A denoising autoencoder avoids this by hiding or corrupting part of the input.

The model must infer missing or damaged information from context. For images, it may infer missing pixels from neighboring pixels. For text, it may infer masked tokens from nearby words. For audio, it may recover speech structure from noisy waveforms.

A corruption process samples a noisy input:

$$
\tilde{x} \sim q(\tilde{x}\mid x).
$$

Then the model learns to reconstruct $x$ from $\tilde{x}$. The conditional distribution $q(\tilde{x}\mid x)$ is chosen by the designer.

Common corruption methods include:

| Corruption type | Example |
|---|---|
| Gaussian noise | Add random continuous noise |
| Masking noise | Set some input entries to zero |
| Salt-and-pepper noise | Randomly set pixels to 0 or 1 |
| Dropout noise | Randomly remove features |
| Token masking | Replace tokens with a mask symbol |
| Cropping or erasing | Remove spatial regions |

The corruption should be strong enough to prevent trivial copying but not so strong that reconstruction becomes impossible.
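As a concrete instance of one row in the table, here is a hedged sketch of salt-and-pepper corruption for inputs normalized to $[0, 1]$; the function name and `flip_prob` parameter are illustrative:

```python
import torch

def add_salt_and_pepper(x, flip_prob: float = 0.1):
    # Corrupt a random fraction flip_prob of entries:
    # half become 1.0 ("salt"), half become 0.0 ("pepper").
    corrupted = torch.rand_like(x) < flip_prob
    salt = torch.rand_like(x) < 0.5
    x_noisy = x.clone()
    x_noisy[corrupted & salt] = 1.0
    x_noisy[corrupted & ~salt] = 0.0
    return x_noisy
```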

### Gaussian Noise

For continuous inputs, a common corruption process is additive Gaussian noise:

$$
\tilde{x} = x + \epsilon,
\quad
\epsilon \sim \mathcal{N}(0,\sigma^2 I).
$$

Here $\sigma$ controls the noise strength. Small $\sigma$ creates lightly corrupted examples. Large $\sigma$ creates heavily corrupted examples.

In PyTorch:

```python
def add_gaussian_noise(x, sigma: float = 0.1):
    noise = sigma * torch.randn_like(x)
    return x + noise
```

For normalized images, the noisy result is often clipped to the valid range:

```python
x_noisy = add_gaussian_noise(x, sigma=0.2)
x_noisy = x_noisy.clamp(0.0, 1.0)
```

Gaussian corruption teaches the model to smooth out small perturbations. It is useful when the input has continuous structure, such as images, audio features, sensor measurements, and embeddings.

### Masking Noise

Masking noise randomly removes entries from the input:

$$
\tilde{x} = m \odot x,
$$

where $m$ is a binary mask and $\odot$ denotes elementwise multiplication.

Each mask entry may be sampled as

$$
m_j \sim \mathrm{Bernoulli}(p).
$$

If $m_j = 1$, the entry is kept; if $m_j = 0$, it is set to zero.

In PyTorch:

```python
def apply_masking_noise(x, keep_prob: float = 0.8):
    # Bernoulli(keep_prob) mask: 1 keeps an entry, 0 zeroes it out.
    mask = (torch.rand_like(x) < keep_prob).to(x.dtype)
    return x * mask
```

Masking noise is especially useful when some features are missing at inference time or when the model should learn dependencies between input dimensions.

For images, masking individual pixels may be less effective than masking patches. Patch masking forces the model to infer larger spatial structure.
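A minimal sketch of patch masking, under the assumption that the patches tile the image evenly (names and parameters are illustrative):

```python
import torch

def apply_patch_masking(x, patch_size: int = 4, keep_prob: float = 0.75):
    # x: [B, C, H, W]; assumes H and W are divisible by patch_size.
    B, C, H, W = x.shape
    gh, gw = H // patch_size, W // patch_size
    # One keep/drop decision per patch, shared across channels.
    mask = (torch.rand(B, 1, gh, gw, device=x.device) < keep_prob).to(x.dtype)
    # Expand the patch-level mask to pixel resolution.
    mask = mask.repeat_interleave(patch_size, dim=2)
    mask = mask.repeat_interleave(patch_size, dim=3)
    return x * mask
```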

### Denoising Objective

The denoising loss compares the reconstructed output $\hat{x}$ with the clean input $x$, not the corrupted input $\tilde{x}$:

$$
L(x,\hat{x}) =
\|x-\hat{x}\|_2^2.
$$

The full training step is:

```python
import torch
from torch import nn

# Autoencoder is assumed defined elsewhere; its forward pass returns (x_hat, z).
model = Autoencoder(input_dim=784, latent_dim=64)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(128, 784)

x_noisy = x + 0.2 * torch.randn_like(x)
x_noisy = x_noisy.clamp(0.0, 1.0)

x_hat, z = model(x_noisy)
loss = loss_fn(x_hat, x)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The important detail is that `model` receives `x_noisy`, but the loss compares against `x`.

This changes the learning problem. The model must learn a function that maps corrupted samples back toward the data manifold.

### Denoising and the Data Manifold

A useful way to understand denoising autoencoders is through the data manifold.

Clean data tends to occupy a structured region of input space. Noise pushes examples away from this region. The denoising model learns to move corrupted examples back toward likely clean examples.

For an image, random noise may produce unnatural local variation. The denoising autoencoder learns that natural images usually contain smooth regions, edges, textures, and coherent object shapes. It uses these regularities to remove noise.

For text, a masked or corrupted token sequence can often be repaired using grammar and semantics. The model learns linguistic structure because it must predict missing content from context.

This idea connects denoising autoencoders to self-supervised learning. The supervision comes from the input itself: corrupt the input, then train the model to recover the original.

### Relation to Masked Modeling

Masked modeling is a denoising task. In masked language modeling, some tokens are hidden, and the model predicts them from context.

For a token sequence

$$
x = (x_1, x_2, \ldots, x_T),
$$

we construct a corrupted sequence $\tilde{x}$ by replacing some tokens with a mask token. The model predicts the original tokens at masked positions.

The loss is usually computed only on masked positions:

$$
L =
-\sum_{t\in M}
\log p_\theta(x_t \mid \tilde{x}),
$$

where $M$ is the set of masked positions.

This is the core idea behind many encoder-based language models. Similar ideas appear in masked image modeling, masked audio modeling, and multimodal representation learning.

A denoising autoencoder can therefore be viewed as a general form of masked prediction.
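A hedged sketch of this loss in PyTorch, assuming token logits from some encoder (all tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, targets, masked_positions):
    # logits: [B, T, V] token scores, targets: [B, T] original token ids,
    # masked_positions: [B, T] boolean, True where a token was masked.
    logits_at_mask = logits[masked_positions]    # [num_masked, V]
    targets_at_mask = targets[masked_positions]  # [num_masked]
    # Cross entropy only over masked positions, matching the loss above.
    return F.cross_entropy(logits_at_mask, targets_at_mask)
```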

### Relation to Diffusion Models

Denoising autoencoders also provide intuition for diffusion models.

A diffusion model repeatedly corrupts data with noise, then trains a neural network to reverse the corruption. The difference is that diffusion models usually define a sequence of noise levels and learn a conditional denoising process for each level.

A denoising autoencoder typically learns one-step reconstruction:

$$
\tilde{x} \to \hat{x}.
$$

A diffusion model learns many denoising steps:

$$
x_T \to x_{T-1} \to \cdots \to x_0.
$$

Both rely on the same basic principle: learn how to recover clean structure from corrupted observations.
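As a rough illustration only (the linear schedule below is a toy assumption, not any particular model's), the forward corruption at noise level $t$ interpolates between data and Gaussian noise:

```python
import torch

def diffuse(x0, t: int, T: int = 1000):
    # Toy linear schedule: alpha_bar decays from ~1 (almost clean)
    # toward 0 (almost pure noise) as t approaches T.
    alpha_bar = 1.0 - t / T
    eps = torch.randn_like(x0)
    # Interpolate between the clean input and Gaussian noise.
    x_t = (alpha_bar ** 0.5) * x0 + ((1.0 - alpha_bar) ** 0.5) * eps
    return x_t, eps
```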

### Convolutional Denoising Autoencoder

For images, a convolutional denoising autoencoder is usually preferable to a fully connected model. It preserves spatial locality and shares filters across positions.

```python
import torch
from torch import nn

class ConvDenoisingAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),
            nn.ReLU(),

            nn.ConvTranspose2d(32, 1, kernel_size=2, stride=2),
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z
```

Training step:

```python
model = ConvDenoisingAutoencoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 1, 28, 28)

x_noisy = x + 0.3 * torch.randn_like(x)
x_noisy = x_noisy.clamp(0.0, 1.0)

x_hat, z = model(x_noisy)

loss = loss_fn(x_hat, x)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The input and output have the same shape:

```python
x.shape       # [64, 1, 28, 28]
x_noisy.shape # [64, 1, 28, 28]
x_hat.shape   # [64, 1, 28, 28]
```

The latent tensor $z$ has lower spatial resolution and higher channel depth.
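For the architecture above, the two pooling stages reduce the 28x28 input to a 7x7 grid with 64 channels:

```python
z.shape  # [64, 64, 7, 7]
```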

### Choosing the Noise Level

The noise level controls task difficulty.

If the noise is too weak, the model may still learn a near-identity mapping. It receives almost the original input and needs little abstraction.

If the noise is too strong, the target becomes ambiguous. Many clean inputs could have produced the same corrupted input.

Good corruption creates a useful prediction problem. The model should need structure, but the reconstruction should remain possible.

| Noise level | Effect |
|---:|---|
| Very low | Nearly trivial reconstruction |
| Moderate | Useful representation learning |
| High | Strong abstraction, possible blur |
| Extreme | Ambiguous or impossible reconstruction |

In practice, one often trains with random noise strengths. This teaches the model to handle a range of corruptions.

```python
sigma = torch.empty(1).uniform_(0.05, 0.3).item()
x_noisy = x + sigma * torch.randn_like(x)
x_noisy = x_noisy.clamp(0.0, 1.0)
```

### Robust Representations

Denoising improves robustness because the encoder must produce useful representations from imperfect inputs.

A robust representation changes slowly under small perturbations. If $x$ and $\tilde{x}$ are close, then their latent codes should also be close:

$$
f_\theta(x) \approx f_\theta(\tilde{x}).
$$

This does not mean all differences should be ignored. The model should ignore nuisance noise while preserving meaningful variation.

For example, in speech recognition, background noise should matter less than phonetic content. In image recognition, sensor noise should matter less than object shape. In document understanding, formatting noise should matter less than text meaning.
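One informal way to probe this property is to compare clean and noisy latent codes, reusing `model`, `x`, and `x_noisy` from the earlier training sketches:

```python
import torch
import torch.nn.functional as F

# Encode clean and corrupted versions of the same batch.
with torch.no_grad():
    _, z_clean = model(x)
    _, z_noisy = model(x_noisy)

# Cosine similarity between flattened latent codes; values near 1
# indicate representations that are stable under the corruption.
sim = F.cosine_similarity(z_clean.flatten(1), z_noisy.flatten(1), dim=1)
print(sim.mean().item())
```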

### Denoising Versus Dropout

Denoising autoencoders and dropout both use random corruption, but they apply it in different places.

| Method | Corruption target | Goal |
|---|---|---|
| Denoising autoencoder | Input | Recover clean data |
| Dropout | Hidden activations | Regularize representation |
| Masked modeling | Tokens or patches | Predict missing content |
| Diffusion model | Data at many noise levels | Generate samples |

Dropout is usually used as a regularizer inside supervised networks. Denoising is usually used as a reconstruction or self-supervised objective.

The two methods can also be combined. An autoencoder may receive noisy inputs and use dropout inside the encoder or decoder.
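A minimal sketch of that combination, with illustrative layer sizes: the input is corrupted before it reaches the network, and dropout additionally corrupts the hidden activations during training.

```python
from torch import nn

encoder = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.2),  # randomly zeroes hidden activations
    nn.Linear(256, 64),
)
```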

### Information Learned by Denoising

A denoising autoencoder learns dependencies among input dimensions.

If part of the input is corrupted, the model must use the remaining parts to infer the missing or noisy values. This forces it to learn statistical structure.

For image data, it learns that neighboring pixels are correlated and that edges have coherent geometry.

For language data, it learns that words depend on syntax, semantics, and discourse context.

For tabular data, it learns correlations among fields.

For graph data, it may learn relationships among node features and neighborhood structure.

This makes denoising a general-purpose self-supervised objective.

### Evaluation

A denoising autoencoder can be evaluated in several ways.

The simplest metric is reconstruction loss on held-out corrupted examples:

$$
\frac{1}{N}
\sum_{i=1}^N
\|x_i-\hat{x}_i\|_2^2.
$$

For images, one may also use PSNR, SSIM, or perceptual metrics. For text, token prediction accuracy or negative log-likelihood is more appropriate.
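For instance, PSNR follows directly from the reconstruction MSE; `max_val` is the assumed dynamic range of the inputs:

```python
import torch

def psnr(x, x_hat, max_val: float = 1.0):
    # Peak signal-to-noise ratio in decibels; higher is better.
    mse = torch.mean((x - x_hat) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```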

For representation learning, reconstruction quality alone is insufficient. A model may reconstruct well but produce poor features for downstream tasks. The latent representation should also be evaluated with classifiers, clustering metrics, retrieval quality, or transfer learning.

### Failure Modes

Denoising autoencoders have several common failure modes.

The first is identity learning. If corruption is too weak, the model learns to copy.

The second is over-smoothing. With strong noise and mean squared error, image reconstructions may become blurry because the model predicts an average of possible clean images.

The third is shortcut learning. The model may exploit artifacts of the corruption process rather than learning meaningful structure.

The fourth is poor generalization to different noise. A model trained only on Gaussian noise may fail on occlusion, compression artifacts, or missing patches.

The fifth is mismatch with the downstream task. A representation optimized for pixel denoising may not capture semantic features.

These failures are addressed by better corruption design, stronger architectures, perceptual losses, multi-scale objectives, and downstream evaluation.

### Summary

A denoising autoencoder reconstructs a clean input from a corrupted input. The corruption process creates a self-supervised learning signal and prevents trivial copying.

The model learns stable structure in the data by mapping noisy examples back toward clean examples. This supports robustness, representation learning, masked modeling, and generative modeling.

Denoising autoencoders connect naturally to masked language models, masked image models, and diffusion models. They form one of the central bridges between classical autoencoders and modern generative deep learning.

