
Denoising Autoencoders

A denoising autoencoder learns to recover a clean input from a corrupted version of that input.

Instead of copying $x$ to $\hat{x}$, the model receives a noisy input $\tilde{x}$ and learns to reconstruct the original clean input $x$.

$$\tilde{x} \sim q(\tilde{x}\mid x), \qquad z = f_\theta(\tilde{x}), \qquad \hat{x} = g_\phi(z).$$

The training objective is

$$L(x, \hat{x}) = \|x - \hat{x}\|^2.$$

The corruption process $q(\tilde{x}\mid x)$ may add Gaussian noise, mask coordinates, drop patches, shuffle local regions, or apply other transformations. The model must learn which parts of the input are stable structure and which parts are noise.

Motivation

A standard autoencoder can learn to copy the input, especially when the latent dimension is large or the model has high capacity. Denoising prevents this simple solution. The model cannot merely reproduce its input because the input has been damaged.

The encoder must infer the original signal from partial or noisy evidence. This encourages the latent representation to capture regularities in the data distribution.

For example, if a digit image is corrupted by random noise, the model must learn that strokes are coherent, backgrounds are mostly empty, and digit shapes follow recurring patterns. If a sentence has masked words, the model must learn grammar, local context, and semantic expectation.

Denoising therefore turns reconstruction into a prediction problem.

The Denoising Objective

Given a clean input $x$, we first sample a corrupted input:

$$\tilde{x} \sim q(\tilde{x}\mid x).$$

The autoencoder then computes

$$\hat{x} = g_\phi(f_\theta(\tilde{x})).$$

The objective minimizes expected reconstruction error:

$$\min_{\theta,\phi}\; \mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{\tilde{x} \sim q(\tilde{x}\mid x)} \left[ L\!\left(x,\; g_\phi(f_\theta(\tilde{x}))\right) \right].$$

The loss compares the reconstruction with the clean input $x$, not with the corrupted input $\tilde{x}$. This distinction is essential.

If the model were trained to reconstruct $\tilde{x}$, it would learn ordinary copying. Since it is trained to reconstruct $x$, it must remove the corruption.

Gaussian Noise Corruption

For continuous inputs, a common corruption process adds Gaussian noise:

$$\tilde{x} = x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I).$$

The noise level $\sigma$ controls task difficulty. Small noise produces an easy denoising task. Large noise forces the model to rely more heavily on learned structure.

For image tensors in PyTorch:

import torch

def add_gaussian_noise(x: torch.Tensor, sigma: float) -> torch.Tensor:
    noise = sigma * torch.randn_like(x)
    return x + noise

For normalized images, values should usually be clipped or rescaled after corruption:

def add_clipped_gaussian_noise(
    x: torch.Tensor,
    sigma: float,
) -> torch.Tensor:
    noisy = x + sigma * torch.randn_like(x)
    return noisy.clamp(0.0, 1.0)

This corruption is simple and effective for image denoising, signal recovery, and representation learning.

Masking Corruption

Another common corruption process randomly sets some coordinates to zero:

$$\tilde{x} = m \odot x,$$

where $m$ is a binary mask and $\odot$ denotes elementwise multiplication.

Each coordinate is kept with probability $p$:

$$m_i \sim \mathrm{Bernoulli}(p).$$

In PyTorch:

def mask_input(x: torch.Tensor, keep_prob: float) -> torch.Tensor:
    # Keep each coordinate independently with probability keep_prob.
    mask = torch.rand_like(x) < keep_prob
    return x * mask.to(x.dtype)

Masking is useful when missing information is a natural part of the problem. In images, masking may remove pixels or patches. In language, masking may remove tokens. In tabular data, masking may simulate missing values.

Masked reconstruction is central to many modern self-supervised models. Masked language modeling and masked image modeling can both be understood as denoising objectives.

Patch Corruption

For images, corrupting individual pixels may be too local. A model can sometimes reconstruct missing pixels from neighboring pixels without learning high-level structure. Patch corruption is harder.

Suppose an image has shape $[B, C, H, W]$.

A patch corruption process removes rectangular regions. The model must infer missing content from the surrounding context.

A simple patch mask can be written as:

def random_patch_mask(
    x: torch.Tensor,
    patch_size: int,
    num_patches: int,
) -> torch.Tensor:
    y = x.clone()
    batch_size, channels, height, width = y.shape

    for b in range(batch_size):
        for _ in range(num_patches):
            top = torch.randint(0, height - patch_size + 1, ()).item()
            left = torch.randint(0, width - patch_size + 1, ()).item()
            y[b, :, top:top + patch_size, left:left + patch_size] = 0

    return y

This simple version uses Python loops and is not optimized. It is still useful for explaining the idea.
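As a hedged sketch, the same corruption can be vectorized with broadcasting instead of Python loops. The function name and layout below are illustrative, not from the original:

```python
import torch

def random_patch_mask_vectorized(
    x: torch.Tensor,
    patch_size: int,
    num_patches: int,
) -> torch.Tensor:
    # Sample a top-left corner for every (image, patch) pair at once.
    b, _, h, w = x.shape
    tops = torch.randint(0, h - patch_size + 1, (b, num_patches, 1, 1))
    lefts = torch.randint(0, w - patch_size + 1, (b, num_patches, 1, 1))

    rows = torch.arange(h).view(1, 1, h, 1)
    cols = torch.arange(w).view(1, 1, 1, w)

    # A pixel is inside a patch if both its row and its column fall in range.
    in_rows = (rows >= tops) & (rows < tops + patch_size)
    in_cols = (cols >= lefts) & (cols < lefts + patch_size)
    hit = (in_rows & in_cols).any(dim=1)          # [B, H, W]

    return x * (~hit).unsqueeze(1).to(x.dtype)    # broadcast over channels
```

The broadcast builds a `[B, num_patches, H, W]` boolean grid and collapses it with `any`, so all patches for all images are zeroed in one pass.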

Patch corruption encourages the model to learn object-level and scene-level regularities rather than only local smoothing.

Denoising Autoencoder Architecture

A denoising autoencoder can use the same architecture as a standard autoencoder. The main change is the training input.

The encoder receives $\tilde{x}$. The loss compares the output with $x$.

For flattened image inputs:

import torch
from torch import nn
import torch.nn.functional as F

class DenoisingAutoencoder(nn.Module):
    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
            nn.ReLU(),
        )

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, input_dim),
            nn.Sigmoid(),
        )

    def forward(self, x_noisy: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x_noisy)
        x_hat = self.decoder(z)
        return x_hat

A training step:

model = DenoisingAutoencoder(input_dim=784, latent_dim=64)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.rand(128, 784)
x_noisy = add_clipped_gaussian_noise(x, sigma=0.25)

x_hat = model(x_noisy)
loss = F.mse_loss(x_hat, x)

optimizer.zero_grad()
loss.backward()
optimizer.step()

The model sees the noisy input but is graded against the clean input.

Convolutional Denoising Autoencoders

For images, convolutional architectures usually work better than fully connected architectures. They preserve spatial structure and share parameters across locations.

A simple convolutional denoising autoencoder:

class ConvDenoisingAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(
                64,
                32,
                kernel_size=2,
                stride=2,
            ),
            nn.ReLU(),

            nn.ConvTranspose2d(
                32,
                1,
                kernel_size=2,
                stride=2,
            ),
            nn.Sigmoid(),
        )

    def forward(self, x_noisy: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x_noisy)
        x_hat = self.decoder(z)
        return x_hat

For MNIST-like images with shape $[B, 1, 28, 28]$, the encoder reduces spatial resolution while increasing channel depth. The decoder upsamples back to the original image shape.
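A quick standalone check of this shape arithmetic, using one downsampling and one upsampling stage with the same layer settings as above (a sketch, not the full model):

```python
import torch
from torch import nn

x = torch.rand(4, 1, 28, 28)

# padding=1 with a 3x3 kernel preserves H and W; 2x2 max pooling halves them.
down = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
h = down(x)
print(h.shape)        # torch.Size([4, 32, 14, 14])

# A transposed convolution with kernel_size=2 and stride=2 doubles H and W.
up = nn.ConvTranspose2d(32, 1, kernel_size=2, stride=2)
print(up(h).shape)    # torch.Size([4, 1, 28, 28])
```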

Loss Functions for Denoising

The simplest denoising loss is mean squared error:

$$L_{\text{MSE}} = \|x - \hat{x}\|^2.$$

MSE works well when the output is continuous and noise is Gaussian. It encourages the model to predict the conditional mean of the clean input given the corrupted input.
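The conditional-mean claim can be stated precisely: for squared error, the reconstruction that minimizes the expected loss at a given corrupted input is the conditional mean of the clean input.

```latex
% Minimizer of expected squared error for a fixed corrupted input:
\hat{x}^*(\tilde{x})
  = \arg\min_{\hat{x}} \, \mathbb{E}\!\left[\, \|x - \hat{x}\|^2 \,\middle|\, \tilde{x} \,\right]
  = \mathbb{E}\!\left[\, x \mid \tilde{x} \,\right].
```

When several clean inputs are consistent with the same corrupted input, this mean averages over them, which is one source of blur.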

For binary or normalized pixel values, binary cross-entropy may be used:

$$L_{\text{BCE}} = -\sum_i \left[ x_i \log \hat{x}_i + (1 - x_i)\log(1 - \hat{x}_i) \right].$$

For images, pixel losses can produce blurry reconstructions when the missing content is ambiguous. More advanced systems may use perceptual losses, adversarial losses, or diffusion objectives to produce sharper samples.

For representation learning, the reconstruction loss is often only a proxy. The learned representation $z$ may be evaluated by downstream classification, retrieval, or clustering.
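As a hedged sketch of that workflow: after training, the decoder is discarded and the encoder alone produces features. The encoder below is a stand-in with the same dimensions as the earlier example, not a trained model:

```python
import torch
from torch import nn

# Stand-in for a trained encoder (same shape as the earlier DenoisingAutoencoder).
encoder = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 64),
    nn.ReLU(),
)

encoder.eval()
with torch.no_grad():
    # At evaluation time, features are usually computed from clean inputs.
    z = encoder(torch.rand(32, 784))

print(z.shape)   # torch.Size([32, 64]); feed z to a linear probe or clustering
```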

Denoising and Manifold Learning

A useful way to understand denoising autoencoders is through the data manifold.

The clean data distribution occupies a structured region of input space. Corruption moves examples away from this region. The denoising model learns to map corrupted points back toward likely clean points.

If $x$ lies near the data manifold and $\tilde{x}$ is a noisy version, then the reconstruction

$$\hat{x} = g_\phi(f_\theta(\tilde{x}))$$

should move $\tilde{x}$ toward the manifold.

This view connects denoising autoencoders to score-based generative modeling. Under certain assumptions, denoising teaches the model information about the direction in which data density increases. Diffusion models build a full generative process from repeated denoising steps.
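For Gaussian corruption, this connection can be made exact. By Tweedie's formula, the optimal denoiser recovers the score (gradient of the log-density) of the corrupted data distribution:

```latex
% For \tilde{x} = x + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma^2 I):
\mathbb{E}\!\left[\, x \mid \tilde{x} \,\right]
  = \tilde{x} + \sigma^2 \, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}),
```

where $p_\sigma$ denotes the density of the corrupted data. The residual $\hat{x} - \tilde{x}$ therefore points in the direction of increasing data density, which is exactly the quantity score-based generative models need.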

Relation to Diffusion Models

Diffusion models can be viewed as a more powerful and systematic form of denoising. They corrupt data through a sequence of noise levels and train a model to reverse the corruption.

A denoising autoencoder usually learns

$$\tilde{x} \to x$$

for one corruption level or a small set of corruption levels.

A diffusion model learns denoising across many noise levels:

$$x_t \to x_{t-1}$$

or predicts the noise added at timestep $t$.

The key shared idea is that learning to remove noise teaches the model the structure of the data distribution. The difference is that diffusion models turn this into an iterative sampling procedure.
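The noise-prediction variant can be written down compactly. In the common DDPM parameterization, the corrupted sample at timestep $t$ and the per-timestep training loss are:

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I),
\qquad
L_t = \mathbb{E}\!\left[ \, \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \, \right].
```

Structurally this is the same recipe as the denoising autoencoder objective above, with the noise level made explicit and conditioned on.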

Denoising for Text

Denoising is also central to text representation learning.

A text input may be corrupted by:

| Corruption | Example |
| --- | --- |
| Token masking | Replace tokens with `[MASK]` |
| Token deletion | Remove words |
| Token permutation | Shuffle spans |
| Span masking | Remove phrase-level spans |
| Infilling | Generate missing text spans |

The model learns to reconstruct the original sequence from the corrupted sequence.

This idea appears in masked language models, denoising sequence-to-sequence models, and text infilling systems. The corruption process forces the model to learn syntax, semantics, and discourse structure.

For token IDs, masking can be implemented by replacing selected positions with a special mask token ID:

def mask_tokens(
    token_ids: torch.Tensor,
    mask_token_id: int,
    mask_prob: float,
) -> torch.Tensor:
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id
    return corrupted

A language model may then predict the original token IDs at masked positions.
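A hedged sketch of how that loss can be restricted to masked positions, using random logits as a stand-in for a real model (`mask_token_id`, the 15% rate, and the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

vocab_size, mask_token_id = 100, 0
token_ids = torch.randint(1, vocab_size, (8, 16))   # clean token IDs

# Corrupt: replace roughly 15% of positions with the mask token.
mask = torch.rand(token_ids.shape) < 0.15
mask[0, 0] = True                                   # ensure at least one masked position
corrupted = token_ids.clone()
corrupted[mask] = mask_token_id

# Stand-in for model(corrupted): one logit vector per position.
logits = torch.randn(8, 16, vocab_size)

# ignore_index=-100 (the F.cross_entropy default) drops unmasked positions
# from the loss, so only masked tokens are predicted.
targets = token_ids.masked_fill(~mask, -100)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
```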

Choosing the Corruption Process

The corruption process determines what the model must learn.

Weak corruption may allow shallow copying. Strong corruption may make reconstruction too difficult. The best corruption level usually lies between these extremes.

| Corruption strength | Likely behavior |
| --- | --- |
| Too weak | Model learns near-identity mapping |
| Moderate | Model learns stable structure |
| Too strong | Model loses too much information |

The corruption should match the desired invariances. If small pixel noise should not matter, add pixel noise. If missing patches should be inferred from context, mask patches. If text meaning should survive word deletion, use token deletion or span masking.

A poorly chosen corruption process can teach the wrong invariances. For example, if color is important for the task, aggressive color corruption may remove useful information.

Practical Training Details

Denoising autoencoders are usually trained with fresh corruption on each batch. Instead of storing corrupted examples, generate corruption dynamically during training.

This exposes the model to many noisy versions of the same clean example and reduces overfitting to a fixed noise pattern.

A typical training loop has the following structure:

for x, _ in dataloader:
    x = x.to(device)

    x_noisy = add_clipped_gaussian_noise(x, sigma=0.25)

    x_hat = model(x_noisy)
    loss = F.mse_loss(x_hat, x)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

For image data, ensure that the model output range matches the target range. If the target image is normalized to $[0, 1]$, a sigmoid output is reasonable. If the target is standardized to mean 0 and variance 1, a linear output is usually better.

Failure Modes

Denoising autoencoders can fail in several ways.

If the model has too much capacity and corruption is weak, it may learn an almost identity function.

If corruption is too strong, the target may become ambiguous. The model may produce averaged reconstructions.

If the loss is pixelwise MSE, reconstructions may be smooth or blurry.

If the corruption process differs from real deployment noise, denoising performance may not transfer.

If the latent dimension is too small, the model may discard details needed for reconstruction.

These failures are usually addressed by tuning corruption strength, architecture, latent dimension, and loss function.

Summary

A denoising autoencoder reconstructs clean data from corrupted data. The corruption process prevents trivial copying and forces the model to learn stable structure in the data distribution.

The main training pattern is simple: corrupt $x$ to obtain $\tilde{x}$, feed $\tilde{x}$ into the model, and compare the output with $x$. Gaussian noise, masking, and patch removal are common corruption methods.

Denoising autoencoders connect classical autoencoders to modern self-supervised learning and diffusion models. Their central lesson is that useful representations can be learned by recovering missing or corrupted information.