Mechanistic Interpretability

Mechanistic interpretability studies neural networks by treating them as learned computational systems. The goal is to identify the internal mechanisms that produce model behavior: features, circuits, attention heads, neurons, residual stream directions, and layer-to-layer transformations.

Attribution methods ask which input parts contributed to an output. Mechanistic interpretability asks a deeper question: what algorithm did the model implement internally?

For example, a language model may answer a factual question correctly. Attribution can show which prompt tokens mattered. Mechanistic analysis tries to locate how the model retrieves the fact, moves the information through layers, selects the final token, and suppresses alternatives.

From Explanations to Mechanisms

Many interpretability methods produce explanations at the input level. Saliency maps, SHAP values, and token attributions assign scores to input features. These methods are useful, but they often say little about the model’s internal computation.

Mechanistic interpretability instead studies internal states.

A neural network computes a sequence of transformations:

h_0 = x, \qquad h_{l+1} = F_l(h_l),

where h_l is the representation at layer l. The output is produced from the final representation:

\hat{y} = G(h_L).

A mechanism is a structured subcomputation inside this chain. It may involve a small set of neurons, channels, attention heads, or representation directions that jointly implement a function.

Examples include:

Mechanism | Possible role
Edge detector | Detects local image boundaries
Induction head | Copies or continues repeated text patterns
Name mover head | Moves entity information to the output position
Negation feature | Represents whether a statement is negated
Syntax circuit | Tracks grammatical structure
Refusal feature | Activates for disallowed requests
Retrieval circuit | Routes information from context to answer

The aim is to move from vague explanations toward testable hypotheses about computation.

Features

A feature is a meaningful direction, neuron, channel, or pattern in representation space.

In early convolutional networks, some features are easy to visualize. A channel may respond to edges, textures, colors, or object parts. In language models, features are more abstract. They may represent syntax, entities, sentiment, code structure, factual relations, or behavioral policies.

A feature does not always correspond to a single neuron. Neural networks often use distributed representations. One feature may be represented by a direction across many neurons, and one neuron may participate in many features.

This creates superposition: the model stores more features than it has obvious dimensions by overlapping them in representation space.

A representation vector may be written as

h = \sum_i a_i v_i,

where v_i are feature directions and a_i are feature activations.

If the feature directions are not perfectly orthogonal, individual neuron activations can be difficult to interpret.

Neurons, Channels, and Directions

In older interpretability work, researchers often inspected individual neurons. For some models, this works. A neuron may activate for a human-readable concept.

In modern large models, single-neuron interpretation is often insufficient. Important concepts may be encoded as directions in activation space.

A direction v can be tested by projecting an activation h onto it:

a = h^\top v.

If a is large, the feature represented by v is strongly present.

Feature directions can be found in several ways:

Method | Idea
Activation statistics | Find directions that vary with known concepts
Linear probes | Train a simple classifier on hidden states
Sparse autoencoders | Decompose activations into sparse features
Contrastive directions | Compare activations from positive and negative examples
PCA or SVD | Find high-variance directions
Manual search | Inspect activations for known examples

A direction becomes meaningful only after validation. The analyst should test whether changing the direction changes model behavior.
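
As a small illustration of the contrastive approach, a candidate direction can be estimated as the difference between mean activations on examples with and without a concept, then checked by projection. The sketch below uses random tensors as stand-ins for hidden states collected from a model.

import torch

# Illustrative hidden states from examples with and without the concept.
pos_acts = torch.randn(128, 768)   # [num_examples, hidden_dim]
neg_acts = torch.randn(128, 768)

# Candidate feature direction: difference of means, normalized to unit length.
direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
direction = direction / direction.norm()

# Validation step: project hidden states onto the direction (a = h^T v)
# and check that the projection separates positive from negative examples.
scores_pos = pos_acts @ direction
scores_neg = neg_acts @ direction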

Circuits

A circuit is a set of components that work together to implement a behavior.

In a transformer, a circuit may involve attention heads, MLP neurons, residual stream directions, and layer normalization effects. In a convolutional network, a circuit may involve filters, channels, pooling layers, and downstream classifiers.

A circuit should satisfy three conditions:

Condition | Meaning
Localization | The mechanism is tied to identifiable components
Causal effect | Intervening on those components changes behavior
Specificity | The components explain the target behavior better than unrelated behavior

A circuit-level claim should be causal, not only correlational. If a head appears active during a task but ablating it has no effect, it may be incidental.

Transformer Internals

Mechanistic interpretability is especially active for transformers because their structure exposes useful internal objects.

A transformer block usually contains:

  1. A residual stream.
  2. Multi-head attention.
  3. An MLP block.
  4. Layer normalization.
  5. Residual additions.

The residual stream is the central communication channel. Each layer reads from it and writes back to it.

A simplified block is:

h_{l+1} = h_l + \operatorname{Attn}_l(h_l) + \operatorname{MLP}_l(h_l).

This additive structure makes it possible to study which components write which information into the residual stream.

Attention heads move information between token positions. MLPs often transform or store features at each position. The final unembedding maps the last hidden state into token logits.
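
A toy version of this structure, with standard PyTorch modules standing in for the real sublayers and layer normalization omitted for brevity, is sketched below.

import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, h):
        attn_out, _ = self.attn(h, h, h)  # attention moves information between positions
        h = h + attn_out                  # attention output is added to the residual stream
        h = h + self.mlp(h)               # MLP output is added to the residual stream
        return h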

Attention Heads as Components

An attention head computes a weighted sum of value vectors from previous positions. For each token position, it decides where to read information from.

A head has query, key, value, and output matrices:

Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V.

The attention pattern is

A = \operatorname{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right).

The head output is

O = AVW_O.

A mechanistic study may ask:

Question | Interpretation
Which positions does the head attend to? | Routing pattern
What information is stored in the value vectors? | Content moved by the head
Where does the output write in residual space? | Downstream effect
What happens if the head is ablated? | Causal role
Does the head compose with other heads? | Circuit structure

Attention patterns alone do not prove importance. The value and output projections matter as much as the attention weights.
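
For concreteness, the sketch below computes one head's attention pattern and output from scratch for a toy input, following the equations above; the matrices are randomly initialized and the causal mask used in decoder-only models is omitted.

import torch

seq_len, d_model, d_k = 8, 64, 16
X = torch.randn(seq_len, d_model)

# Per-head parameter matrices (random, for illustration only).
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))
W_O = torch.randn(d_k, d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Attention pattern: where each position reads from.
A = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)

# Head output written back into the residual stream.
O = A @ V @ W_O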

MLPs as Feature Transformers

Transformer MLPs apply position-wise nonlinear transformations. A common form is:

\operatorname{MLP}(h) = W_{\text{out}} \sigma(W_{\text{in}}h + b_{\text{in}}) + b_{\text{out}}.

The hidden units or directions inside the MLP can act as feature detectors. The output matrix then writes feature-related directions back into the residual stream.

In some language models, MLPs appear to store factual, lexical, or behavioral associations. This does not mean every fact is stored in one neuron. More often, information is distributed across many parameters and activated by context.

A useful question is: what inputs activate this MLP feature, and what output direction does it write?

Probing

A probe is a simple model trained on hidden representations to predict some property.

For example, we may collect hidden states h_l and train a linear classifier to predict whether the sentence is in the past tense.

\hat{z} = \operatorname{softmax}(Wh_l + b).

If the probe performs well, the representation contains information about the property.

In PyTorch:

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden):
        return self.classifier(hidden)
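
A minimal training loop for this probe might look like the following sketch, where the hidden states and labels are placeholder tensors standing in for activations collected from the model.

# Placeholder data: in practice, collect hidden states h_l from the model.
hidden_states = torch.randn(1024, 768)
labels = torch.randint(0, 2, (1024,))

probe = LinearProbe(hidden_dim=768, num_classes=2)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(probe(hidden_states), labels)
    loss.backward()
    optimizer.step()

# Evaluate on held-out data in practice; training accuracy is shown for brevity.
accuracy = (probe(hidden_states).argmax(dim=-1) == labels).float().mean()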

A probe must be interpreted carefully. High probe accuracy means the information is decodable. It does not prove the model uses that information for its prediction.

To strengthen the claim, combine probing with intervention. Remove or alter the probed direction and measure whether behavior changes.

Activation Patching

Activation patching is a causal method. It compares a clean run and a corrupted run.

Suppose a model answers correctly for a clean prompt but incorrectly for a corrupted prompt. We run both prompts, then replace one internal activation in the corrupted run with the corresponding activation from the clean run. If this restores the correct answer, that activation is likely important.

The procedure is:

  1. Run the clean input and cache activations.
  2. Run the corrupted input.
  3. Patch one activation from the clean cache into the corrupted run.
  4. Measure whether the output recovers.
  5. Repeat across layers, positions, and components.

This gives a causal map of where information is represented.

A simplified score is:

\text{patch effect} = F_{\text{patched}} - F_{\text{corrupt}},

where F is a scalar measure such as the correct-token logit difference.

Activation patching is expensive but powerful. It can locate the layer and position where a model carries task-relevant information.
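
A sketch of a single patching step, assuming a model whose blocks are reachable as model.blocks[layer] and return a plain tensor (both assumptions; real model classes expose their submodules differently):

import torch

def patch_one_activation(model, clean_ids, corrupt_ids, layer, position):
    block = model.blocks[layer]   # assumed module path; adapt to the actual model
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    def patch_hook(module, inputs, output):
        patched = output.clone()
        # Overwrite one position with the activation from the clean run.
        patched[:, position, :] = cache["clean"][:, position, :]
        return patched

    with torch.no_grad():
        handle = block.register_forward_hook(save_hook)
        model(clean_ids)                       # clean run: cache the activation
        handle.remove()

        corrupt_logits = model(corrupt_ids)    # corrupted run, no intervention

        handle = block.register_forward_hook(patch_hook)
        patched_logits = model(corrupt_ids)    # corrupted run with one patched activation
        handle.remove()

    # Compare a scalar measure F (e.g. the correct-token logit) before and after patching.
    return patched_logits, corrupt_logits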

Ablation

Ablation removes or modifies a component and measures the behavioral effect.

For a component c, we compare:

F_{\text{original}}(x)

with

F_{\text{ablated }c}(x).

The difference estimates the component’s causal contribution.

Common ablations include:

Ablation | Description
Zero ablation | Replace activation with zero
Mean ablation | Replace activation with dataset mean
Resample ablation | Replace activation with activation from another example
Head ablation | Remove one attention head
Neuron ablation | Remove one neuron or feature
Direction ablation | Remove projection onto a feature direction

Mean and resample ablation are often better than zero ablation because zero may be outside the normal activation distribution.
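
Two of these, sketched as activation transforms that could be applied inside a forward hook (v is assumed to be a unit-norm feature direction):

import torch

def ablate_direction(h, v):
    # Remove the component of h along the unit-norm direction v.
    return h - (h @ v).unsqueeze(-1) * v

def mean_ablate(h, dataset_mean):
    # Replace the activation with a precomputed dataset mean of matching shape.
    return dataset_mean.expand_as(h)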

Sparse Autoencoders

Sparse autoencoders are used to decompose dense activations into more interpretable features.

Given an activation vector h, a sparse autoencoder learns:

z = \sigma(W_e h + b_e), \qquad \hat{h} = W_d z + b_d,

where z is encouraged to be sparse.

The training objective is often:

\|h - \hat{h}\|_2^2 + \lambda \|z\|_1.

The hope is that each sparse feature z_i corresponds to a more interpretable concept than a raw neuron.

After training, each feature can be studied by finding examples that activate it strongly. The feature can also be ablated or amplified to test causal effect.

Sparse autoencoders are useful because they address superposition. They try to recover a larger set of sparse features from a smaller dense representation space.
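
A minimal sparse autoencoder matching the objective above (real implementations add details such as decoder weight normalization, which are omitted here):

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, hidden_dim, num_features):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, num_features)
        self.decoder = nn.Linear(num_features, hidden_dim)

    def forward(self, h):
        z = torch.relu(self.encoder(h))   # sparse feature activations
        h_hat = self.decoder(z)           # reconstruction of the original activation
        return h_hat, z

def sae_loss(h, h_hat, z, l1_coeff=1e-3):
    # Reconstruction error plus L1 penalty encouraging sparse feature activations.
    recon = ((h - h_hat) ** 2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity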

Logit Lens

The logit lens projects intermediate hidden states through the model’s final unembedding matrix to see what token predictions are already present.

If h_l is the hidden state at layer l, the logit lens computes:

z_l = W_U h_l,

where W_U is the unembedding matrix.

This gives a vocabulary-sized score vector at each layer.

The logit lens can show how a language model gradually forms an answer. Early layers may represent broad syntax or topic. Later layers may sharpen toward a specific token.

The method is simple, but it has limitations. Intermediate residual states may not be calibrated for the final unembedding. Tuned variants add normalization or learned affine maps.
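
A sketch of the basic, untuned logit lens, assuming hidden states have already been collected per layer and that the unembedding matrix is available as a [d_model, vocab_size] tensor W_U (how to extract it depends on the model):

import torch

def logit_lens(hidden_states, W_U, top_k=5):
    # hidden_states: list of [seq_len, d_model] tensors, one per layer.
    # W_U: [d_model, vocab_size] unembedding matrix.
    for layer, h in enumerate(hidden_states):
        logits = h @ W_U                            # z_l = W_U h_l at every position
        top_ids = logits[-1].topk(top_k).indices    # leading token ids at the last position
        print(f"layer {layer}: {top_ids.tolist()}")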

Steering and Representation Editing

If a feature direction controls a behavior, we can sometimes steer the model by adding or subtracting that direction from an activation.

For hidden state h, steering uses:

h' = h + \alpha v,

where v is a feature direction and \alpha controls strength.

Examples include increasing sentiment, reducing toxicity, changing style, or encouraging refusal. Steering is a causal test: if adding the direction reliably changes behavior, the direction likely participates in that behavior.

However, steering can have side effects. A direction may affect multiple features. Large interventions can push activations outside the model’s normal distribution.
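
A sketch of steering through a forward hook that adds a scaled direction to a layer's output; the module path and the direction itself are placeholders.

import torch

def steering_hook(direction, alpha):
    def hook(module, inputs, output):
        # h' = h + alpha * v, applied at every position of the layer output.
        return output + alpha * direction
    return hook

# Hypothetical usage; 'model.blocks[10]' and 'sentiment_direction' are placeholders.
# handle = model.blocks[10].register_forward_hook(steering_hook(sentiment_direction, alpha=4.0))
# output = model(input_ids)
# handle.remove()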

Practical PyTorch Hook Pattern

Mechanistic analysis often requires hooks. A hook lets us inspect or modify activations during a forward or backward pass.

A forward hook can cache activations:

class ActivationCache:
    def __init__(self):
        self.data = {}

    def save(self, name):
        def hook(module, inputs, output):
            self.data[name] = output.detach()
        return hook

Usage:

cache = ActivationCache()

handle = model.layer.register_forward_hook(
    cache.save("layer")
)

with torch.no_grad():
    output = model(x)

activation = cache.data["layer"]

handle.remove()

A modifying hook can patch activations:

def patch_activation(replacement):
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return replacement
    return hook
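
A usage sketch, reusing model.layer from the caching example; the replacement must match the shape and dtype of the activation it overwrites.

replacement = torch.zeros_like(cache.data["layer"])

handle = model.layer.register_forward_hook(
    patch_activation(replacement)
)

with torch.no_grad():
    output = model(x)

handle.remove()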

Hooks are powerful but must be used carefully. They can change tensor shapes, break gradient flow, or produce invalid activation distributions.

Limits of Mechanistic Interpretability

Mechanistic interpretability remains an active research area. Current methods work best on small models, specific behaviors, and localized circuits. Large foundation models contain many overlapping mechanisms, and their behavior may involve distributed computation across many layers and features.

Common limits include:

Limit | Explanation
Scale | Large models have too many components to inspect manually
Superposition | Features overlap in representation space
Polysemanticity | One neuron or feature may respond to several concepts
Dataset dependence | Findings may depend on chosen examples
Intervention artifacts | Ablations may create unnatural states
Partial explanations | A found circuit may explain only part of the behavior

A mechanistic result should therefore be stated narrowly. It should specify the model, task, dataset, components, and intervention used.

Summary

Mechanistic interpretability studies the internal computations learned by neural networks. It tries to identify features, directions, neurons, attention heads, MLPs, and circuits that causally produce behavior.

The main tools are probing, activation patching, ablation, sparse autoencoders, logit lens analysis, and representation steering. The strongest evidence comes from causal intervention: changing an internal component and observing a predictable change in output.

The field aims to move from surface-level explanations toward a detailed account of learned algorithms. Its current results are useful but usually local, partial, and model-specific.