
Adversarial Examples

An adversarial example is an input that has been deliberately modified so that a model makes a wrong prediction, while the modification is small enough that a human observer still sees the original object.

For an image classifier, an adversarial example may look like a normal image of a panda to a human, but the model may classify it as a gibbon. For a text classifier, a small character substitution or paraphrase may change the model’s output. For an audio system, a small perturbation may cause a speech recognizer to produce a different transcript.

Adversarial examples expose an important fact about deep learning systems: high test accuracy does not imply robustness. A model may perform well on natural test data while remaining fragile under carefully chosen perturbations.

Basic Definition

Let $f_\theta(x)$ be a neural network with parameters $\theta$, input $x$, and true label $y$. An adversarial example is a modified input

$$ x_{\text{adv}} = x + \delta, $$

where $\delta$ is a small perturbation chosen so that

$$ f_\theta(x_{\text{adv}}) \neq y. $$

The perturbation is constrained so that it remains small:

$$ \|\delta\| \leq \epsilon. $$

Here $\epsilon$ controls the attack budget. A smaller $\epsilon$ means the attacker is allowed to change the input only slightly.

The central optimization problem is:

$$ \max_{\|\delta\| \leq \epsilon} L(f_\theta(x+\delta), y). $$

The attacker tries to find a perturbation that maximizes the loss while staying inside the allowed perturbation region.

Why Adversarial Examples Exist

Neural networks often behave linearly in small local regions of input space. In high dimensions, many tiny changes can accumulate into a large change in the model’s internal score.

Suppose a classifier computes a logit using a linear function:

$$ z = w^\top x. $$

If we perturb the input by $\delta$, then the logit changes by

$$ w^\top (x+\delta) - w^\top x = w^\top \delta. $$

Even when each component of δ\delta is very small, the total dot product can be large if the input dimension is large. This explains why imperceptible perturbations may still produce large model changes.

The problem becomes more severe because modern inputs are high-dimensional. A $224 \times 224$ RGB image has

$$ 224 \cdot 224 \cdot 3 = 150{,}528 $$

input values. A small coordinated perturbation across many pixels can strongly affect a classifier.
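
A tiny numerical sketch makes this concrete. The weight vector, dimensionality, and $\epsilon$ below are arbitrary illustrative values, not taken from any particular model.

import torch

torch.manual_seed(0)

d = 224 * 224 * 3                # dimensionality of a 224x224 RGB image
w = torch.randn(d)               # an arbitrary weight vector, for illustration only
epsilon = 0.01                   # tiny per-component budget

delta = epsilon * w.sign()       # worst-case L_inf perturbation for a linear logit
print(delta.abs().max().item())  # every component changes by at most 0.01
print((w @ delta).item())        # yet the logit shifts by epsilon * ||w||_1, on the order of 10^3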

Untargeted and Targeted Attacks

Adversarial attacks are usually divided into two types.

An untargeted attack tries to make the model predict any incorrect class:

$$ f_\theta(x_{\text{adv}}) \neq y. $$

A targeted attack tries to force the model to predict a specific target class $t$:

$$ f_\theta(x_{\text{adv}}) = t. $$

Untargeted attacks are usually easier. The attacker only needs to cross any decision boundary. Targeted attacks require pushing the input toward a particular decision region.

For example, if the true class is “cat,” an untargeted attack succeeds if the model predicts “dog,” “fox,” “car,” or any other incorrect class. A targeted attack succeeds only if the model predicts the chosen target, such as “airplane.”
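
In code, the two success criteria differ only in the comparison. A minimal sketch, assuming a classifier model, a batch of adversarial inputs x_adv, true labels y, and target labels t:

with torch.no_grad():
    pred = model(x_adv).argmax(dim=1)

untargeted_success = (pred != y)   # any incorrect class counts
targeted_success = (pred == t)     # only the chosen target class counts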

Norm Constraints

The size of an adversarial perturbation is usually measured with a norm. The most common choices are $L_\infty$, $L_2$, and $L_1$.

The $L_\infty$ norm controls the maximum change to any single input component:

$$ \|\delta\|_\infty = \max_i |\delta_i|. $$

This is common for image attacks because it limits how much each pixel can change.

The $L_2$ norm controls the Euclidean size of the perturbation:

$$ \|\delta\|_2 = \sqrt{\sum_i \delta_i^2}. $$

This measures the total geometric magnitude of the perturbation.

The $L_1$ norm controls the sum of absolute changes:

$$ \|\delta\|_1 = \sum_i |\delta_i|. $$

This encourages sparse perturbations, where relatively few input components change.

Each norm defines a different threat model. An $L_\infty$ attack spreads small changes across many pixels. An $L_2$ attack controls total energy. An $L_1$ attack may change fewer features more strongly.
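
These norms are easy to compute directly on a perturbation tensor. A short sketch, assuming delta is a batch of perturbations with shape (batch, channels, height, width):

flat = delta.flatten(start_dim=1)     # one row per example

linf = flat.abs().max(dim=1).values   # L_inf: largest single change per example
l2 = flat.norm(p=2, dim=1)            # L_2: Euclidean magnitude per example
l1 = flat.norm(p=1, dim=1)            # L_1: sum of absolute changes per example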

Fast Gradient Sign Method

The Fast Gradient Sign Method, or FGSM, is one of the simplest adversarial attacks.

For an input $x$, label $y$, model $f_\theta$, and loss function $L$, FGSM computes

$$ x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\left( \nabla_x L(f_\theta(x), y) \right). $$

The gradient tells us how the loss changes with respect to the input. The sign keeps only the direction of increase for each input component. The parameter $\epsilon$ controls the perturbation size.

FGSM is a one-step attack. It is fast because it requires only one backward pass. It is also useful as a first robustness test.

In PyTorch:

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    # Work on a leaf copy of the input so gradients can be taken with respect to it.
    x_adv = x.clone().detach().requires_grad_(True)

    logits = model(x_adv)
    loss = F.cross_entropy(logits, y)

    # Clear stale parameter gradients, then backpropagate to get the input gradient.
    model.zero_grad(set_to_none=True)
    loss.backward()

    # One signed gradient step of size epsilon, then clip back to the valid pixel range.
    perturbation = epsilon * x_adv.grad.sign()
    x_adv = x_adv + perturbation
    x_adv = torch.clamp(x_adv, 0.0, 1.0)

    return x_adv.detach()

This code assumes image inputs lie in the range $[0,1]$. If the model expects standardized inputs (for example, mean and standard deviation normalization), the clipping range and $\epsilon$ must be adjusted accordingly.
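
A targeted variant takes the same single step but descends the loss computed against the target label instead of ascending the loss for the true label. A minimal sketch, assuming target labels t and the same $[0,1]$ input range:

def fgsm_targeted_attack(model, x, t, epsilon):
    x_adv = x.clone().detach().requires_grad_(True)

    logits = model(x_adv)
    loss = F.cross_entropy(logits, t)   # loss toward the target class

    model.zero_grad(set_to_none=True)
    loss.backward()

    # Step *down* the loss so the prediction moves toward the target class.
    x_adv = x_adv - epsilon * x_adv.grad.sign()
    x_adv = torch.clamp(x_adv, 0.0, 1.0)

    return x_adv.detach()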

Projected Gradient Descent

Projected Gradient Descent, or PGD, is a stronger iterative attack. Instead of taking one large step, PGD takes several smaller steps and projects the result back into the allowed perturbation set.

For an $L_\infty$ attack, the update is:

$$ x^{(k+1)} = \Pi_{B_\epsilon(x)} \left( x^{(k)} + \alpha \cdot \operatorname{sign}\left( \nabla_x L(f_\theta(x^{(k)}), y) \right) \right), $$

where $\alpha$ is the step size and $\Pi_{B_\epsilon(x)}$ projects the adversarial input back into the $\epsilon$-ball around the original input.

In PyTorch:

def pgd_attack(model, x, y, epsilon, alpha, steps):
    x_orig = x.detach()

    # Random start inside the epsilon-ball, clipped to the valid pixel range.
    x_adv = x_orig + torch.empty_like(x_orig).uniform_(-epsilon, epsilon)
    x_adv = torch.clamp(x_adv, 0.0, 1.0)

    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)

        logits = model(x_adv)
        loss = F.cross_entropy(logits, y)

        model.zero_grad(set_to_none=True)
        loss.backward()

        with torch.no_grad():
            # Signed gradient step of size alpha...
            x_adv = x_adv + alpha * x_adv.grad.sign()

            # ...then project back into the epsilon-ball and the valid pixel range.
            delta = torch.clamp(x_adv - x_orig, min=-epsilon, max=epsilon)
            x_adv = torch.clamp(x_orig + delta, 0.0, 1.0)

    return x_adv.detach()

PGD is commonly used as a robustness benchmark because it is stronger than FGSM. A model that survives FGSM may still fail under PGD.
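
A typical call, with illustrative $L_\infty$ settings often reported for small image benchmarks (these values are examples, not requirements):

x_adv = pgd_attack(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=10)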

White-Box and Black-Box Attacks

In a white-box attack, the attacker has access to the model architecture, parameters, and gradients. FGSM and PGD are white-box attacks when they use exact gradients from the target model.

In a black-box attack, the attacker does not have direct access to the model internals. The attacker may only observe inputs and outputs. Black-box attacks can still succeed through query-based estimation or transfer.

Transfer means that adversarial examples generated for one model often fool another model. This happens because different models trained on the same task may learn similar decision boundaries.

This property matters in deployment. Keeping model weights private reduces some attack options, but it does not eliminate adversarial risk.
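
A transfer attack can be sketched by crafting examples on a surrogate model and checking them against the target. Here surrogate_model and target_model are hypothetical stand-ins for two separately trained classifiers:

# Craft adversarial examples on a model the attacker controls...
x_adv = pgd_attack(surrogate_model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=10)

# ...then test whether they also fool the deployed target model.
with torch.no_grad():
    pred = target_model(x_adv).argmax(dim=1)

transfer_success_rate = (pred != y).float().mean().item()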

Adversarial Training

Adversarial training is the most widely used defense. Instead of training only on clean examples, the model is trained on adversarial examples.

The robust training objective is

$$ \min_\theta \, \mathbb{E}_{(x,y)} \left[ \max_{\|\delta\| \leq \epsilon} L(f_\theta(x+\delta), y) \right]. $$

The inner maximization finds adversarial perturbations. The outer minimization updates the model so that it classifies them correctly.

A simple adversarial training loop with FGSM looks like this:

def train_one_epoch_adversarial(model, loader, optimizer, epsilon, device):
    model.train()

    for x, y in loader:
        x = x.to(device)
        y = y.to(device)

        # Generate adversarial examples on the fly for the current model state.
        x_adv = fgsm_attack(model, x, y, epsilon)

        # Train on the attacked inputs with the original labels.
        logits = model(x_adv)
        loss = F.cross_entropy(logits, y)

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

PGD adversarial training is stronger but more expensive because each batch requires several backward passes to generate adversarial examples.
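
Swapping the attack inside the loop is enough to obtain PGD adversarial training; a sketch reusing pgd_attack from above, with illustrative step settings:

# Inside the training loop, replace the FGSM call with a multi-step PGD attack.
x_adv = pgd_attack(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=7)

logits = model(x_adv)
loss = F.cross_entropy(logits, y)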

Adversarial training often improves robustness but may reduce clean accuracy. This tradeoff is one of the main practical issues in robust deep learning.

Robustness and Accuracy Tradeoff

A standard model learns decision boundaries that work well for natural data. A robust model must classify not only each training point correctly, but also a neighborhood around each point.

This makes the learning problem harder. The model must reserve larger regions of input space for each class. If classes are close together, robust classification may conflict with clean accuracy.

The tradeoff can be understood geometrically. Standard training asks:

$$ f_\theta(x) = y. $$

Robust training asks:

$$ f_\theta(x+\delta) = y \quad \text{for all} \quad \|\delta\| \leq \epsilon. $$

The second condition is stronger. It requires local stability around each example.
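
The robust condition can be probed empirically by sampling perturbations inside the $\epsilon$-ball, although random sampling is only a weak check compared to gradient-based attacks. A minimal sketch:

def locally_stable(model, x, y, epsilon, num_samples=20):
    # Returns False if any sampled L_inf perturbation in the ball flips a prediction.
    # Random sampling is a weak probe; PGD searches the ball far more effectively.
    with torch.no_grad():
        for _ in range(num_samples):
            delta = torch.empty_like(x).uniform_(-epsilon, epsilon)
            x_pert = torch.clamp(x + delta, 0.0, 1.0)
            pred = model(x_pert).argmax(dim=1)
            if (pred != y).any():
                return False
    return True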

Gradient Masking

Some defenses appear robust because they make gradients unusable, not because they make the model truly robust. This failure mode is called gradient masking.

Gradient masking can occur when a defense introduces nondifferentiable preprocessing, saturating activations, random transformations, or numerical instability. Simple gradient attacks may fail, but stronger attacks may still succeed.

Signs of gradient masking include:

| Symptom | Interpretation |
| --- | --- |
| FGSM fails but black-box transfer succeeds | Gradients may be misleading |
| Increasing attack steps does not improve attack success | Optimization may be obstructed |
| Random restarts greatly improve attacks | Loss surface may contain poor local regions |
| Robustness disappears under adaptive attacks | Defense was attack-specific |

A defense should be evaluated against strong adaptive attacks, not only against the attack used during development.
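
One standard adaptive check is to run PGD from several random starting points and keep, per example, any restart that succeeds. A sketch built on pgd_attack from above:

def pgd_with_restarts(model, x, y, epsilon, alpha, steps, restarts=5):
    x_best = x.clone().detach()
    still_correct = torch.ones(x.size(0), dtype=torch.bool, device=x.device)

    for _ in range(restarts):
        x_adv = pgd_attack(model, x, y, epsilon, alpha, steps)

        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)

        # Keep the adversarial input for examples fooled for the first time.
        fooled = (pred != y) & still_correct
        x_best[fooled] = x_adv[fooled]
        still_correct &= (pred == y)

    return x_best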

Adversarial Examples Beyond Images

Adversarial examples are most often introduced with images, but the phenomenon is broader.

In text models, adversarial perturbations include character swaps, synonym substitutions, paraphrases, prompt injections, and carefully constructed instructions. Text attacks are constrained differently because discrete tokens cannot be changed by infinitesimal amounts.

In speech models, perturbations may be added to waveforms so that a system transcribes a different phrase. Physical-world conditions such as room acoustics, microphones, and background noise complicate the attack.

In reinforcement learning, small changes to observations can alter an agent’s policy. This can cause poor actions in control systems.

In retrieval and ranking systems, adversarial inputs may manipulate embeddings or search scores.

The common theme is the same: the attacker changes the input to produce an incorrect or undesired model behavior.

Evaluating Adversarial Robustness

A robustness evaluation should define a threat model. The threat model specifies what the attacker can know and do.

At minimum, one should specify:

| Component | Question |
| --- | --- |
| Access | Is the attack white-box or black-box? |
| Perturbation set | Which norm or transformation is allowed? |
| Budget | What is the maximum perturbation size? |
| Objective | Untargeted or targeted attack? |
| Attack strength | How many steps, restarts, or queries? |
| Evaluation metric | Robust accuracy, attack success rate, or loss? |
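
One lightweight way to keep these choices explicit is to record them alongside the evaluation code. The field names and defaults below are illustrative, not a standard API:

from dataclasses import dataclass

@dataclass
class ThreatModel:
    access: str = "white-box"        # or "black-box"
    norm: str = "linf"               # perturbation set
    epsilon: float = 8 / 255         # budget
    targeted: bool = False           # objective
    steps: int = 20                  # attack strength
    restarts: int = 5
    metric: str = "robust accuracy"  # evaluation metric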

Robust accuracy measures accuracy under attack. If a model has 92 percent clean accuracy and 38 percent robust accuracy under PGD, the second number better reflects security against that threat model.

However, robustness is always relative to the specified attack. A model robust to small $L_\infty$ image perturbations may still be vulnerable to rotations, translations, corruptions, patches, or semantic changes.

Practical Guidance

For PyTorch work, adversarial robustness should be treated as an evaluation discipline, not as a single technique.

A good workflow is:

  1. Train a standard baseline.
  2. Measure clean accuracy.
  3. Evaluate FGSM as a quick first attack.
  4. Evaluate PGD with multiple steps and random starts.
  5. Check for gradient masking.
  6. Try black-box transfer attacks.
  7. Train with adversarial examples.
  8. Compare clean accuracy and robust accuracy.
  9. Report the threat model clearly.

The main mistake is to claim robustness without defining the attacker. Robustness has no single universal meaning. It only has meaning under a stated constraint.

Minimal End-to-End Example

The following example shows the core structure of adversarial evaluation. It assumes a classifier, a test loader, and image inputs in $[0,1]$.

@torch.no_grad()
def accuracy(model, loader, device):
    model.eval()
    correct = 0
    total = 0

    for x, y in loader:
        x = x.to(device)
        y = y.to(device)

        logits = model(x)
        pred = logits.argmax(dim=1)

        correct += (pred == y).sum().item()
        total += y.numel()

    return correct / total

def robust_accuracy_fgsm(model, loader, epsilon, device):
    # Not decorated with @torch.no_grad(): the attack needs input gradients.
    model.eval()
    correct = 0
    total = 0

    for x, y in loader:
        x = x.to(device)
        y = y.to(device)

        x_adv = fgsm_attack(model, x, y, epsilon)

        # Gradients are only needed for crafting the attack, not for evaluation.
        with torch.no_grad():
            logits = model(x_adv)
            pred = logits.argmax(dim=1)

        correct += (pred == y).sum().item()
        total += y.numel()

    return correct / total
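
A typical evaluation then reports both numbers side by side. A short usage sketch, assuming a trained model, a test_loader, and a chosen epsilon:

clean_acc = accuracy(model, test_loader, device)
robust_acc = robust_accuracy_fgsm(model, test_loader, epsilon=8 / 255, device=device)

print(f"clean accuracy:  {clean_acc:.3f}")
print(f"robust accuracy: {robust_acc:.3f}")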

Clean accuracy measures performance on ordinary inputs. Robust accuracy measures performance on attacked inputs.

A robust model should be judged by both. Clean accuracy tells us whether the model solves the natural task. Robust accuracy tells us whether the model remains stable under the specified attack.

Summary

Adversarial examples are inputs deliberately modified to cause model failure. They show that neural networks can be accurate on test data while remaining sensitive to small, structured perturbations.

FGSM gives a fast one-step attack. PGD gives a stronger iterative attack. White-box attacks use model gradients. Black-box attacks rely on queries or transfer. Adversarial training improves robustness by training on attacked examples, but it increases cost and may reduce clean accuracy.

The key principle is precise evaluation. A robustness claim should always state the threat model, perturbation budget, attack method, and metric.