Attribution Methods

Attribution methods assign credit or blame to parts of an input, hidden representation, neuron, feature, or training example for a model output. Saliency maps are one family of attribution methods. This section treats attribution more broadly.

The central question is: which parts of the computation were responsible for this prediction?

For an image classifier, attribution may ask which pixels supported the class “dog.” For a text classifier, it may ask which tokens caused a toxic label. For a recommender, it may ask which user features or past events contributed most to a ranking score.

Attribution is useful for debugging, auditing, and scientific analysis. It has limits. A numerical attribution score depends on a chosen method, baseline, and model interface. It should be read as evidence, not as a complete explanation.

Attribution Targets

An attribution method needs a target quantity. Usually this is a scalar output from the model.

Examples include:

| Target | Meaning |
| --- | --- |
| Logit z_c | Score for class c before softmax |
| Probability p_c | Predicted probability for class c |
| Loss L | Training objective for an input-label pair |
| Embedding similarity | Retrieval score between two vectors |
| Reward estimate | Value used by a policy or agent |
| Token log probability | Language model score for a generated token |

For classification, logits are often better attribution targets than probabilities. Softmax probabilities couple all classes together and may saturate. A logit is more direct: it measures evidence for one class before normalization.

Let the target scalar be

F(x).

An attribution method returns scores

A_1, A_2, \ldots, A_d,

where A_i measures the contribution of input feature x_i to F(x).

Local Versus Global Attribution

Local attribution explains one prediction. Global attribution explains model behavior over a dataset.

A local explanation might say that pixels around a wheel contributed to a “car” prediction for one image. A global explanation might say that the model generally relies on wheels, road texture, and side-view contours when predicting “car.”

Local attribution is useful for case analysis. Global attribution is useful for auditing model behavior.

A simple way to estimate global attribution is to average the absolute values of local attribution scores over many examples (absolute values prevent positive and negative contributions from canceling):

\bar{A}_i = \frac{1}{N} \sum_{n=1}^N |A_i^{(n)}|.

This works best for tabular data, where feature i has the same meaning across examples. For images and text, features are position-dependent, so aggregation requires care.
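
In code, this is a loop over the dataset. A minimal sketch, assuming tabular batches of shape [B, D] and any local attribution function with the signature used in this section, such as the gradient_times_input function defined below:

def global_attribution(attr_fn, model, loader):
    # average absolute local attributions over a dataset to get
    # one global importance score per feature
    total, count = None, 0

    for x, _ in loader:
        a = attr_fn(model, x).abs()
        total = a.sum(dim=0) if total is None else total + a.sum(dim=0)
        count += a.shape[0]

    return total / count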

Gradient-Based Attribution

Gradient-based attribution uses derivatives of the target output with respect to the input.

The simplest form is

A_i = \frac{\partial F(x)}{\partial x_i}.

This measures local sensitivity. If A_i is large, a small change in x_i would strongly change F(x).

A related method is gradient times input:

A_i = x_i \frac{\partial F(x)}{\partial x_i}.

This combines sensitivity with feature magnitude. A feature receives high attribution when it is both present and influential.

In PyTorch:

import torch

def gradient_times_input(model, x, target_class=None):
    model.eval()

    # track gradients with respect to the input
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)

    # default to the predicted class
    if target_class is None:
        target_class = logits.argmax(dim=1)

    # scalar target: sum of the chosen logits over the batch
    target = logits.gather(1, target_class.view(-1, 1)).sum()

    model.zero_grad(set_to_none=True)
    target.backward()

    # elementwise product of input and its gradient
    attribution = x * x.grad

    return attribution.detach()

For images, this tensor is usually reduced across channels. For tabular inputs, each feature attribution can be inspected directly.

Perturbation-Based Attribution

Perturbation methods change parts of the input and measure the effect on the output.

If feature i is replaced by a baseline value x_i', the attribution can be estimated as

A_i = F(x) - F(x_{\setminus i}),

where x_{\setminus i} means the input with feature i removed, masked, or replaced.

This approach is model-agnostic. It does not need gradients. It works with neural networks, tree models, retrieval systems, and black-box APIs.

The drawback is cost. Evaluating each feature separately requires many forward passes. For high-dimensional inputs, perturbation methods often use groups of features, such as image patches or text spans.

@torch.no_grad()
def feature_ablation(model, x, target_class, baseline=0.0):
    model.eval()

    # score of the unmodified input
    logits = model(x)
    base_score = logits.gather(1, target_class.view(-1, 1)).squeeze(1)

    B, D = x.shape
    attributions = torch.zeros_like(x)

    # replace one feature at a time with the baseline value
    for j in range(D):
        x_masked = x.clone()
        x_masked[:, j] = baseline

        logits_masked = model(x_masked)
        score_masked = logits_masked.gather(
            1,
            target_class.view(-1, 1),
        ).squeeze(1)

        # attribution is the drop in the target score when feature j is removed
        attributions[:, j] = base_score - score_masked

    return attributions

This example assumes tabular input of shape [B, D].
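
For images, the same idea is usually applied to patches rather than single pixels. A minimal sketch, assuming input of shape [1, C, H, W] with H and W divisible by the patch size (patch_occlusion and its arguments are illustrative names, not a library API):

@torch.no_grad()
def patch_occlusion(model, x, target_class, patch=16, baseline=0.0):
    model.eval()
    _, _, H, W = x.shape
    base = model(x)[0, target_class]
    heatmap = torch.zeros(H // patch, W // patch)

    # mask one square patch at a time and record the score drop
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            x_occ = x.clone()
            x_occ[:, :, i:i + patch, j:j + patch] = baseline
            heatmap[i // patch, j // patch] = base - model(x_occ)[0, target_class]

    return heatmap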

Occlusion, Ablation, and Masking

Occlusion removes a region of the input. Ablation removes a feature, neuron, layer, head, or component. Masking replaces part of the input with a neutral value.

These methods share the same logic: remove something and observe the change.

For a model component h, an ablation score can be written as

A_h = F(x) - F_{\text{ablated } h}(x).

If removing h greatly changes the output, then h was important for that prediction.

In transformers, ablation may be applied to attention heads, MLP channels, residual stream directions, or whole layers. In convolutional networks, ablation may be applied to channels or spatial regions.

Ablation is closer to causal intervention than a pure gradient, but it still depends on how the component is removed. Replacing with zero, mean activation, noise, or a resampled value can produce different conclusions.
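
A hook-based sketch of zero ablation for a single component, assuming the chosen submodule returns a single tensor (for a tuple-returning block, the hook would need to rebuild the tuple):

def ablate_component(model, x, module, target_fn):
    model.eval()

    with torch.no_grad():
        base = target_fn(model(x))

    # returning a value from a forward hook replaces the module output
    def zero_output(mod, inputs, output):
        return torch.zeros_like(output)

    handle = module.register_forward_hook(zero_output)
    try:
        with torch.no_grad():
            ablated = target_fn(model(x))
    finally:
        handle.remove()

    return base - ablated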

Integrated Gradients Revisited

Integrated gradients are an attribution method with a completeness property. The total attribution approximately equals the change in model output from baseline to input:

\sum_i \operatorname{IG}_i(x) \approx F(x) - F(x').

Here x' is the baseline input.

This property makes integrated gradients more interpretable than raw gradients in many settings. The method distributes the output difference across input features.

The baseline remains important. For an image, a black image may be reasonable for some datasets and poor for others. For text, a padding-token baseline may not represent a neutral sentence. For tabular data, a population mean, zero vector, or domain-specific reference point may be better.
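
A minimal sketch of integrated gradients with a midpoint Riemann approximation of the path integral, following the same tabular conventions as the earlier examples:

def integrated_gradients(model, x, baseline, target_class, steps=32):
    model.eval()
    total_grads = torch.zeros_like(x)

    for k in range(steps):
        # evaluate the gradient at interpolation points between
        # the baseline and the input
        alpha = (k + 0.5) / steps
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)

        logits = model(point)
        target = logits.gather(1, target_class.view(-1, 1)).sum()

        model.zero_grad(set_to_none=True)
        target.backward()
        total_grads += point.grad

    # scale the average gradient by the input-baseline difference
    return (x - baseline) * total_grads / steps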

SHAP Values

SHAP values come from cooperative game theory. They attribute a model output to input features by averaging each feature’s marginal contribution across many possible feature subsets.

For feature i, the Shapley value is

\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (d - |S| - 1)!}{d!} \left[ F(S \cup \{i\}) - F(S) \right].

Here N is the set of all features, S is a subset of features, and d is the total number of features. F(S) denotes the model output when only the features in S are present.

The idea is simple: measure how much feature i helps when added to every possible context, then average those contributions fairly.

Exact SHAP is expensive because it requires evaluating many feature subsets. Practical implementations use approximations.
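
One common approximation samples random feature orderings and averages the marginal contributions along each ordering. A minimal sketch for one tabular example of shape [1, D], with a baseline of the same shape, where "removing" a feature means setting it to the baseline value (the function name and sampling scheme are illustrative):

def shapley_sampling(model, x, baseline, target_class, n_samples=64):
    model.eval()
    D = x.shape[1]
    phi = torch.zeros(D)

    def score(inp):
        with torch.no_grad():
            return model(inp)[0, target_class].item()

    for _ in range(n_samples):
        # add features in a random order and credit each feature
        # with the score change it causes
        perm = torch.randperm(D)
        current = baseline.clone()
        prev_score = score(current)
        for j in perm:
            current[0, j] = x[0, j]
            new_score = score(current)
            phi[j] += new_score - prev_score
            prev_score = new_score

    return phi / n_samples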

SHAP is most common for tabular models. It is less straightforward for images and language because “removing a feature” is ambiguous. Removing a word, pixel, or patch can create inputs that are outside the natural data distribution.

LIME

LIME explains a prediction by fitting a simple local surrogate model around the input.

The procedure is:

  1. Generate perturbed versions of the input.
  2. Evaluate the original model on those perturbations.
  3. Weight perturbations by similarity to the original input.
  4. Fit an interpretable model, often sparse linear regression.
  5. Use the surrogate model’s coefficients as attributions.

Mathematically, LIME solves a local approximation problem:

g^* = \arg\min_{g \in G} \sum_{z} \pi_x(z) \left( F(z) - g(z) \right)^2 + \Omega(g).

Here G is a class of interpretable models, \pi_x(z) gives higher weight to perturbations near x, and \Omega(g) penalizes complexity.

LIME is model-agnostic and intuitive. Its weakness is instability. Different perturbation choices, distance metrics, and random seeds may give different explanations.
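
A minimal LIME-style sketch for one tabular example of shape [1, D], using Gaussian perturbations, an exponential proximity kernel, and a weighted linear surrogate solved in closed form. The kernel width and sample count are illustrative choices, and the sketch omits the sparsity penalty \Omega(g):

def lime_tabular(model, x, target_class, n_samples=512, sigma=1.0):
    model.eval()
    D = x.shape[1]

    # perturb the input and query the black-box model
    Z = x + sigma * torch.randn(n_samples, D)
    with torch.no_grad():
        y = model(Z)[:, target_class]

    # weight perturbations by proximity to the original input
    dist2 = ((Z - x) ** 2).sum(dim=1)
    w = torch.exp(-dist2 / (2 * sigma ** 2 * D))

    # weighted least squares with an intercept column
    Z1 = torch.cat([Z, torch.ones(n_samples, 1)], dim=1)
    sw = w.sqrt().unsqueeze(1)
    coef = torch.linalg.lstsq(sw * Z1, (w.sqrt() * y).unsqueeze(1)).solution

    # surrogate weights serve as attributions
    return coef.squeeze(1)[:D]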

Attention as Attribution

Attention weights are sometimes interpreted as explanations. In a transformer, an attention head computes weights over tokens, so it is tempting to say that high attention means high importance.

This interpretation is incomplete. Attention weights show where information is read from inside one operation. They do not by themselves measure causal contribution to the final output.

A token can receive high attention but contribute little if the value vector carries little relevant information. A token can receive low attention in one layer and still matter through another path. Residual connections, MLP blocks, and later layers can change the computation.

Attention visualizations can be useful, but they should be combined with gradients, ablations, or causal interventions before making strong claims.

Attribution in Language Models

For language models, attribution often targets one output token. Suppose a model assigns probability to the next token y_t. We may ask which earlier tokens most contributed to that prediction.

The target can be the log probability

F(x) = \log p_\theta(y_t \mid x).

Attribution can then be computed over input token embeddings, attention heads, MLP activations, or residual stream components.
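
A minimal sketch of gradient-times-input over token embeddings, assuming embed maps token ids to embeddings of shape [1, T, H] and lm maps those embeddings to logits of shape [1, T, V]. Both names are placeholders rather than a specific library API:

def token_attribution(embed, lm, input_ids, t):
    # make the embeddings a leaf tensor so gradients flow to them
    emb = embed(input_ids).detach().requires_grad_(True)

    logits = lm(emb)
    # log p(y_t | x_{<t}) for a causal language model
    log_probs = torch.log_softmax(logits[0, t - 1], dim=-1)
    target = log_probs[input_ids[0, t]]

    target.backward()

    # reduce over the hidden dimension: one score per input token
    return (emb * emb.grad).sum(dim=-1).detach()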

Token-level attribution is useful for studying prompt sensitivity, retrieved context use, hallucination, refusal behavior, and instruction following.

For retrieval-augmented generation, attribution can ask whether the generated answer actually used the retrieved passage. A model may cite a passage without relying on it. Attribution and ablation tests can help detect this failure mode.

Attribution to Training Data

Feature attribution explains the role of input features. Data attribution explains the role of training examples.

The question is: which training examples most influenced this prediction?

Influence functions approximate how the prediction would change if a training example were upweighted or removed. A simplified form uses gradients:

I(z_i, z_{\text{test}}) \approx - \nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta}).

Here H_{\hat{\theta}} is the Hessian of the training objective at the learned parameters.

Exact influence computation is difficult for large neural networks. Practical alternatives include nearest neighbors in embedding space, gradient similarity, representer methods, and retraining-based approximations.
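
Gradient similarity is the simplest of these. A minimal sketch scoring each training example by the dot product between its loss gradient and the test example's loss gradient, where each example z is an (input, label) pair and the helper names are illustrative:

def gradient_similarity(model, loss_fn, z_test, z_train):
    model.eval()

    def flat_grad(z):
        x, y = z
        loss = loss_fn(model(x), y)
        params = [p for p in model.parameters() if p.requires_grad]
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.flatten() for g in grads])

    g_test = flat_grad(z_test)
    # a higher dot product means more aligned influence on the test loss
    return [torch.dot(flat_grad(z), g_test).item() for z in z_train]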

Data attribution is useful for debugging mislabeled data, understanding memorization, and auditing training-set influence.

Faithfulness and Plausibility

An explanation is plausible if it looks reasonable to a human. It is faithful if it accurately reflects the model’s actual computation.

These are different properties.

A heatmap over an object may look plausible, but the model may still rely on background texture. A sparse token explanation may look convincing, but changing those tokens may not alter the output.

Faithfulness should be tested. Common tests include:

| Test | Question |
| --- | --- |
| Deletion | Does removing high-attribution content reduce the target score? |
| Insertion | Does adding high-attribution content increase the target score? |
| Randomization | Does the explanation change when model weights are randomized? |
| Counterfactual edit | Does changing attributed content change the prediction? |
| Component ablation | Does removing an attributed component change behavior? |

A useful attribution method should pass at least some faithfulness checks for the intended use case.
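
A minimal deletion-test sketch for tabular input of shape [1, D]: mask features from most to least attributed and record how the target logit falls. A faithful attribution should drive the score down quickly.

@torch.no_grad()
def deletion_test(model, x, attributions, target_class, baseline=0.0, steps=10):
    model.eval()
    D = x.shape[1]
    order = attributions[0].abs().argsort(descending=True)
    scores = []

    # mask the top-k attributed features for increasing k
    for k in range(0, D + 1, max(1, D // steps)):
        x_masked = x.clone()
        x_masked[0, order[:k]] = baseline
        scores.append(model(x_masked)[0, target_class].item())

    return scores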

Practical PyTorch Pattern

Most PyTorch attribution methods follow the same pattern:

def attribution_pattern(model, x, target_fn):
    model.eval()

    # track gradients with respect to the input
    x = x.clone().detach().requires_grad_(True)

    output = model(x)
    # reduce the model output to a single scalar
    target = target_fn(output)

    model.zero_grad(set_to_none=True)
    target.backward()

    return x.grad.detach()

The important design choice is target_fn. It determines what the attribution explains.

For example:

def class_logit_target(class_id):
    def target_fn(logits):
        return logits[:, class_id].sum()
    return target_fn

Then:

grads = attribution_pattern(
    model,
    x,
    target_fn=class_logit_target(3),
)

This computes input gradients for class 3.

For structured models, the same idea applies to intermediate tensors. Use hooks to capture activations and gradients, then compute attribution on those tensors.

Reporting Attribution Results

Attribution results should be reported with enough detail to be reproducible.

A proper report includes:

| Item | Example |
| --- | --- |
| Target | Class logit for “malignant” |
| Method | Integrated gradients |
| Baseline | Black image or feature mean |
| Reduction | Absolute sum over channels |
| Model layer | Last convolutional block for Grad-CAM |
| Input preprocessing | Normalization, resizing, tokenization |
| Faithfulness check | Deletion test or ablation test |
| Dataset slice | Correct examples, errors, shifted examples |

Without these details, attribution results are hard to interpret and hard to compare.

Summary

Attribution methods assign contribution scores to inputs, features, components, or training examples. Gradient-based methods use local sensitivity. Perturbation methods remove or mask features. Integrated gradients distribute the difference between a baseline and the input. SHAP averages marginal contributions across feature subsets. LIME fits a local surrogate model. Ablation studies measure the effect of removing components.

The main distinction is faithfulness versus plausibility. An attribution that looks convincing may still misrepresent the model. For serious use, attribution should be paired with perturbation tests, ablations, and clear reporting of the target, baseline, and method.