Attribution methods assign credit or blame to parts of an input, hidden representation, neuron, feature, or training example for a model output. Saliency maps are one family of attribution methods. This section treats attribution more broadly.
The central question is: which parts of the computation were responsible for this prediction?
For an image classifier, attribution may ask which pixels supported the class “dog.” For a text classifier, it may ask which tokens caused a toxic label. For a recommender, it may ask which user features or past events contributed most to a ranking score.
Attribution is useful for debugging, auditing, and scientific analysis. It has limits. A numerical attribution score depends on a chosen method, baseline, and model interface. It should be read as evidence, not as a complete explanation.
## Attribution Targets
An attribution method needs a target quantity. Usually this is a scalar output from the model.
Examples include:
| Target | Meaning |
|---|---|
| Logit | Score for class before softmax |
| Probability | Predicted probability for class |
| Loss | Training objective for an input-label pair |
| Embedding similarity | Retrieval score between two vectors |
| Reward estimate | Value used by a policy or agent |
| Token log probability | Language model score for a generated token |
For classification, logits are often better attribution targets than probabilities. Softmax probabilities couple all classes together and may saturate. A logit is more direct: it measures evidence for one class before normalization.
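A quick numeric check makes the saturation problem concrete. The logit values below are illustrative, not taken from any model in this section:

```python
import torch

logits_a = torch.tensor([10.0, 0.0])
logits_b = torch.tensor([12.0, 0.0])  # class-0 evidence grows by 2.0

probs_a = torch.softmax(logits_a, dim=0)
probs_b = torch.softmax(logits_b, dim=0)

# The logit target registers the full change in evidence...
logit_change = (logits_b[0] - logits_a[0]).item()
# ...while the saturated probability barely moves.
prob_change = (probs_b[0] - probs_a[0]).item()
```

Here the logit for class 0 changes by 2.0, but the softmax probability is already near 1 and moves by less than 1e-4, so a probability-based attribution would report almost nothing.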
Let the target scalar be

$$f(x) \in \mathbb{R}, \qquad x = (x_1, \dots, x_d).$$

An attribution method returns scores

$$a(x) = (a_1, \dots, a_d),$$

where $a_i$ measures the contribution of input feature $x_i$ to $f(x)$.
## Local Versus Global Attribution
Local attribution explains one prediction. Global attribution explains model behavior over a dataset.
A local explanation might say that pixels around a wheel contributed to a “car” prediction for one image. A global explanation might say that the model generally relies on wheels, road texture, and side-view contours when predicting “car.”
Local attribution is useful for case analysis. Global attribution is useful for auditing model behavior.
A simple way to estimate global attribution is to average local attribution scores over many examples:

$$\bar{a}_j = \frac{1}{N} \sum_{n=1}^{N} a_j\!\left(x^{(n)}\right).$$

This works best for tabular data, where feature $j$ has the same meaning across examples. For images and text, features are position-dependent, so aggregation requires care.
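The averaging step can be sketched as follows. This is a minimal illustration for tabular data; `attribution_fn` is a stand-in for any local method that returns scores of shape `[B, D]`:

```python
import torch

def global_attribution(attribution_fn, model, dataset):
    """Average per-example local attribution scores over a dataset.

    `attribution_fn(model, x)` is assumed to return local scores of
    shape [B, D] for a batch x; the names here are illustrative.
    """
    total, count = None, 0
    for x in dataset:
        a = attribution_fn(model, x)
        total = a.sum(dim=0) if total is None else total + a.sum(dim=0)
        count += a.shape[0]
    return total / count  # mean score per feature, shape [D]
```

In practice the absolute value of the local scores is often averaged instead, since signed scores with opposite directions can cancel across examples.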
## Gradient-Based Attribution
Gradient-based attribution uses derivatives of the target output with respect to the input.
The simplest form is the raw gradient:

$$a_i = \frac{\partial f(x)}{\partial x_i}.$$

This measures local sensitivity. If $\partial f(x) / \partial x_i$ is large in magnitude, a small change in $x_i$ would strongly change $f(x)$.

A related method is gradient times input:

$$a_i = x_i \cdot \frac{\partial f(x)}{\partial x_i}.$$

This combines sensitivity with feature magnitude. A feature receives high attribution when it is both present and influential.
In PyTorch:

```python
import torch

def gradient_times_input(model, x, target_class=None):
    model.eval()
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    if target_class is None:
        target_class = logits.argmax(dim=1)
    target = logits.gather(1, target_class.view(-1, 1)).sum()
    model.zero_grad(set_to_none=True)
    target.backward()
    attribution = x * x.grad
    return attribution.detach()
```

For images, this tensor is usually reduced across channels. For tabular inputs, each feature attribution can be inspected directly.
## Perturbation-Based Attribution
Perturbation methods change parts of the input and measure the effect on the output.
If feature $i$ is replaced by a baseline value $b_i$, the attribution can be estimated as

$$a_i = f(x) - f(x_{\setminus i}),$$

where $x_{\setminus i}$ means the input with feature $i$ removed, masked, or replaced by $b_i$.
This approach is model-agnostic. It does not need gradients. It works with neural networks, tree models, retrieval systems, and black-box APIs.
The drawback is cost. Evaluating each feature separately requires many forward passes. For high-dimensional inputs, perturbation methods often use groups of features, such as image patches or text spans.
```python
@torch.no_grad()
def feature_ablation(model, x, target_class, baseline=0.0):
    model.eval()
    logits = model(x)
    base_score = logits.gather(1, target_class.view(-1, 1)).squeeze(1)
    B, D = x.shape
    attributions = torch.zeros_like(x)
    for j in range(D):
        x_masked = x.clone()
        x_masked[:, j] = baseline
        logits_masked = model(x_masked)
        score_masked = logits_masked.gather(
            1,
            target_class.view(-1, 1),
        ).squeeze(1)
        attributions[:, j] = base_score - score_masked
    return attributions
```

This example assumes tabular input of shape `[B, D]`.
## Occlusion, Ablation, and Masking
Occlusion removes a region of the input. Ablation removes a feature, neuron, layer, head, or component. Masking replaces part of the input with a neutral value.
These methods share the same logic: remove something and observe the change.
For a model component $c$, an ablation score can be written as

$$a_c = f(x) - f_{\setminus c}(x),$$

where $f_{\setminus c}$ denotes the model with component $c$ removed. If removing $c$ greatly changes the output, then $c$ was important for that prediction.
In transformers, ablation may be applied to attention heads, MLP channels, residual stream directions, or whole layers. In convolutional networks, ablation may be applied to channels or spatial regions.
Ablation is closer to causal intervention than a pure gradient, but it still depends on how the component is removed. Replacing with zero, mean activation, noise, or a resampled value can produce different conclusions.
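One way to implement component ablation in PyTorch is with a forward hook that replaces a module's output. This is a hedged sketch: `ablate_module_output`, its arguments, and the constant-replacement strategy are illustrative choices, and as noted above, other replacements (mean activation, noise, resampling) can change the conclusion:

```python
import torch

def ablate_module_output(model, x, module, target_fn, replacement=0.0):
    """Score a component by overwriting its output and measuring the change."""
    model.eval()
    with torch.no_grad():
        base = target_fn(model(x))

    def hook(mod, inputs, output):
        # Returning a tensor from a forward hook replaces the module output.
        return torch.full_like(output, replacement)

    handle = module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            ablated = target_fn(model(x))
    finally:
        handle.remove()  # always detach the hook, even on error
    return base - ablated
```

The same pattern extends to attention heads or channels by zeroing only a slice of `output` inside the hook instead of the whole tensor.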
## Integrated Gradients Revisited
Integrated gradients are an attribution method with a completeness property. The total attribution approximately equals the change in model output from baseline to input:

$$\sum_{i=1}^{d} \mathrm{IG}_i(x) \approx f(x) - f(x').$$

Here $x'$ is the baseline input.
This property makes integrated gradients more interpretable than raw gradients in many settings. The method distributes the output difference across input features.
The baseline remains important. For an image, a black image may be reasonable for some datasets and poor for others. For text, a padding-token baseline may not represent a neutral sentence. For tabular data, a population mean, zero vector, or domain-specific reference point may be better.
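A minimal implementation approximates the path integral with a Riemann sum. The step count and baseline below are choices, not fixed parts of the method:

```python
import torch

def integrated_gradients(model, x, baseline, target_fn, steps=50):
    """Approximate integrated gradients with a Riemann sum along the
    straight-line path from `baseline` to `x`."""
    model.eval()
    total_grads = torch.zeros_like(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        target = target_fn(model(point))
        total_grads += torch.autograd.grad(target, point)[0]
    # Scale the averaged gradient by the input-baseline difference.
    return (x - baseline) * total_grads / steps
```

For a linear model the gradient is constant along the path, so the sum of attributions matches $f(x) - f(x')$ exactly; for nonlinear models the match improves as `steps` grows.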
## SHAP Values
SHAP values come from cooperative game theory. They attribute a model output to input features by averaging each feature’s marginal contribution across many possible feature subsets.
For feature $i$, the Shapley value is

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\big(|F| - |S| - 1\big)!}{|F|!} \Big[ v\big(S \cup \{i\}\big) - v(S) \Big].$$

Here $F$ is the set of all features, $S$ is a subset of features not containing $i$, $|F|$ is the total number of features, and $v(S)$ is the model output when only the features in $S$ are present.

The idea is simple: measure how much feature $i$ helps when added to every possible context, then average those contributions fairly.
Exact SHAP is expensive because it requires evaluating many feature subsets. Practical implementations use approximations.
SHAP is most common for tabular models. It is less straightforward for images and language because “removing a feature” is ambiguous. Removing a word, pixel, or patch can create inputs that are outside the natural data distribution.
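One standard approximation samples random feature orderings and averages each feature's marginal contribution. This sketch assumes a tabular input and uses a fixed baseline to represent "absent" features, which, as noted above, is itself a modeling choice:

```python
import torch

def shapley_sample(model_fn, x, baseline, n_permutations=100):
    """Monte Carlo Shapley estimate for a single input vector x of shape [D]."""
    D = x.shape[0]
    phi = torch.zeros(D)
    for _ in range(n_permutations):
        order = torch.randperm(D)
        current = baseline.clone()
        prev_score = model_fn(current.unsqueeze(0)).item()
        for j in order:  # add features one at a time in random order
            current[j] = x[j]
            score = model_fn(current.unsqueeze(0)).item()
            phi[j] += score - prev_score  # marginal contribution of feature j
            prev_score = score
    return phi / n_permutations
```

For a linear model each marginal contribution is the same in every ordering, so the estimate is exact; for nonlinear models the variance shrinks with more permutations.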
## LIME
LIME explains a prediction by fitting a simple local surrogate model around the input.
The procedure is:
- Generate perturbed versions of the input.
- Evaluate the original model on those perturbations.
- Weight perturbations by similarity to the original input.
- Fit an interpretable model, often sparse linear regression.
- Use the surrogate model’s coefficients as attributions.
Mathematically, LIME solves a local approximation problem:

$$\xi(x) = \operatorname*{arg\,min}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g).$$

Here $G$ is a class of interpretable models, $\pi_x$ gives higher weight to perturbations near $x$, $\mathcal{L}(f, g, \pi_x)$ measures how poorly $g$ matches $f$ on those weighted perturbations, and $\Omega(g)$ penalizes complexity.
LIME is model-agnostic and intuitive. Its weakness is instability. Different perturbation choices, distance metrics, and random seeds may give different explanations.
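The five steps above can be sketched for tabular input with weighted least squares on binary masks. The function name, kernel, and sampling scheme here are illustrative simplifications of LIME, not its reference implementation:

```python
import numpy as np

def lime_tabular(model_fn, x, baseline, n_samples=500, kernel_width=0.75, seed=0):
    """Fit a weighted linear surrogate over random feature masks."""
    rng = np.random.default_rng(seed)
    D = x.shape[0]
    masks = rng.integers(0, 2, size=(n_samples, D))    # 1 = keep feature
    perturbed = np.where(masks == 1, x, baseline)       # perturbed inputs
    y = np.array([model_fn(row) for row in perturbed])  # query the model
    distance = 1.0 - masks.mean(axis=1)                 # fraction removed
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # Weighted least squares with an intercept column.
    X = np.hstack([masks, np.ones((n_samples, 1))])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[:D]  # surrogate coefficients = attributions
```

A sparsity penalty (the $\Omega(g)$ term) would replace the plain least-squares fit with Lasso in a fuller implementation.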
## Attention as Attribution
Attention weights are sometimes interpreted as explanations. In a transformer, an attention head computes weights over tokens, so it is tempting to say that high attention means high importance.
This interpretation is incomplete. Attention weights show where information is read from inside one operation. They do not by themselves measure causal contribution to the final output.
A token can receive high attention but contribute little if the value vector carries little relevant information. A token can receive low attention in one layer and still matter through another path. Residual connections, MLP blocks, and later layers can change the computation.
Attention visualizations can be useful, but they should be combined with gradients, ablations, or causal interventions before making strong claims.
## Attribution in Language Models
For language models, attribution often targets one output token. Suppose a model assigns probability $p_\theta(y_t \mid y_{<t})$ to the next token $y_t$. We may ask which earlier tokens most contributed to that prediction.

The target can be the log probability

$$f = \log p_\theta(y_t \mid y_{<t}).$$
Attribution can then be computed over input token embeddings, attention heads, MLP activations, or residual stream components.
Token-level attribution is useful for studying prompt sensitivity, retrieved context use, hallucination, refusal behavior, and instruction following.
For retrieval-augmented generation, attribution can ask whether the generated answer actually used the retrieved passage. A model may cite a passage without relying on it. Attribution and ablation tests can help detect this failure mode.
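Gradient-times-input over token embeddings is one way to realize this target. The sketch below assumes the model can be split into an embedding module and a body that maps embeddings to logits of shape `[B, T, V]`; both submodules are stand-ins for a real LM's components:

```python
import torch
import torch.nn as nn

def token_logprob_attribution(embed, body, input_ids, position, token_id):
    """Gradient-times-input over token embeddings for one token's log prob.

    `embed` maps ids to embeddings; `body` maps embeddings to logits
    [B, T, V]. Both names are illustrative, not a real LM API.
    """
    emb = embed(input_ids).detach().requires_grad_(True)
    logits = body(emb)
    logp = torch.log_softmax(logits[:, position, :], dim=-1)[:, token_id].sum()
    grad = torch.autograd.grad(logp, emb)[0]
    return (emb * grad).sum(dim=-1).detach()  # one score per input token
```

Detaching the embeddings makes them the leaf of the graph, so the gradient lands on the token representations rather than the embedding table.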
## Attribution to Training Data
Feature attribution explains the role of input features. Data attribution explains the role of training examples.
The question is: which training examples most influenced this prediction?
Influence functions approximate how the prediction would change if a training example were upweighted or removed. A simplified form uses gradients:

$$\mathcal{I}(z, z_{\text{test}}) \approx -\,\nabla_\theta \ell\big(z_{\text{test}}, \hat{\theta}\big)^{\top} H_{\hat{\theta}}^{-1}\, \nabla_\theta \ell\big(z, \hat{\theta}\big).$$

Here $z$ is a training example, $z_{\text{test}}$ is the test example, and $H_{\hat{\theta}}$ is the Hessian of the training objective at the learned parameters $\hat{\theta}$.
Exact influence computation is difficult for large neural networks. Practical alternatives include nearest neighbors in embedding space, gradient similarity, representer methods, and retraining-based approximations.
Data attribution is useful for debugging mislabeled data, understanding memorization, and auditing training-set influence.
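Of the practical alternatives above, gradient similarity is the simplest to sketch: score each training example by the cosine similarity between its loss gradient and the test example's loss gradient. The function below is a minimal illustration and ignores the Hessian term entirely:

```python
import torch

def grad_similarity_scores(model, loss_fn, train_examples, test_example):
    """Score training examples by gradient cosine similarity with a test point."""
    def flat_grad(example):
        x, y = example
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])

    g_test = flat_grad(test_example)
    scores = []
    for ex in train_examples:
        g_train = flat_grad(ex)
        scores.append(
            torch.nn.functional.cosine_similarity(g_test, g_train, dim=0)
        )
    return torch.stack(scores)  # high score = similar training influence
```

A training example identical to the test point produces an identical gradient, so it receives the maximum similarity of 1.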
## Faithfulness and Plausibility
An explanation is plausible if it looks reasonable to a human. It is faithful if it accurately reflects the model’s actual computation.
These are different properties.
A heatmap over an object may look plausible, but the model may still rely on background texture. A sparse token explanation may look convincing, but changing those tokens may not alter the output.
Faithfulness should be tested. Common tests include:
| Test | Question |
|---|---|
| Deletion | Does removing high-attribution content reduce the target score? |
| Insertion | Does adding high-attribution content increase the target score? |
| Randomization | Does the explanation change when model weights are randomized? |
| Counterfactual edit | Does changing attributed content change the prediction? |
| Component ablation | Does removing an attributed component change behavior? |
A useful attribution method should pass at least some faithfulness checks for the intended use case.
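The deletion test from the table can be sketched as follows for tabular input: remove features from highest to lowest attribution and record the target score after each removal. The function name and mean-absolute ranking are illustrative choices:

```python
import torch

@torch.no_grad()
def deletion_curve(model, x, attributions, target_fn, baseline=0.0):
    """Delete features from highest to lowest attribution; record the target."""
    model.eval()
    order = attributions.abs().mean(dim=0).argsort(descending=True)
    x_cur = x.clone()
    scores = [target_fn(model(x_cur)).item()]
    for j in order:
        x_cur[:, j] = baseline
        scores.append(target_fn(model(x_cur)).item())
    return scores  # a faithful attribution should make this drop quickly
```

Plotting this curve for the method under test against a random-order baseline is a common way to compare faithfulness.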
## Practical PyTorch Pattern
Most PyTorch attribution methods follow the same pattern:
```python
def attribution_pattern(model, x, target_fn):
    model.eval()
    x = x.clone().detach().requires_grad_(True)
    output = model(x)
    target = target_fn(output)
    model.zero_grad(set_to_none=True)
    target.backward()
    return x.grad.detach()
```

The important design choice is `target_fn`. It determines what the attribution explains.
For example:
```python
def class_logit_target(class_id):
    def target_fn(logits):
        return logits[:, class_id].sum()
    return target_fn
```

Then:

```python
grads = attribution_pattern(
    model,
    x,
    target_fn=class_logit_target(3),
)
```

This computes input gradients for class 3.
For structured models, the same idea applies to intermediate tensors. Use hooks to capture activations and gradients, then compute attribution on those tensors.
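A minimal version of that hook pattern captures one module's activation during the forward pass, retains its gradient through the backward pass, and forms gradient-times-activation on the intermediate tensor. The function name is illustrative:

```python
import torch

def intermediate_attribution(model, module, x, target_fn):
    """Gradient-times-activation on an intermediate module's output."""
    model.eval()
    captured = {}

    def forward_hook(mod, inputs, output):
        output.retain_grad()  # keep .grad on a non-leaf tensor
        captured["act"] = output

    handle = module.register_forward_hook(forward_hook)
    try:
        target = target_fn(model(x))
        model.zero_grad(set_to_none=True)
        target.backward()
    finally:
        handle.remove()
    act = captured["act"]
    return (act * act.grad).detach()
```

The same capture-then-attribute structure underlies layer methods such as Grad-CAM, which additionally pools the gradient over spatial positions.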
## Reporting Attribution Results
Attribution results should be reported with enough detail to be reproducible.
A proper report includes:
| Item | Example |
|---|---|
| Target | Class logit for “malignant” |
| Method | Integrated gradients |
| Baseline | Black image or feature mean |
| Reduction | Absolute sum over channels |
| Model layer | Last convolutional block for Grad-CAM |
| Input preprocessing | Normalization, resizing, tokenization |
| Faithfulness check | Deletion test or ablation test |
| Dataset slice | Correct examples, errors, shifted examples |
Without these details, attribution results are hard to interpret and hard to compare.
## Summary
Attribution methods assign contribution scores to inputs, features, components, or training examples. Gradient-based methods use local sensitivity. Perturbation methods remove or mask features. Integrated gradients distribute the difference between a baseline and the input. SHAP averages marginal contributions across feature subsets. LIME fits a local surrogate model. Ablation studies measure the effect of removing components.
The main distinction is faithfulness versus plausibility. An attribution that looks convincing may still misrepresent the model. For serious use, attribution should be paired with perturbation tests, ablations, and clear reporting of the target, baseline, and method.