# Mechanistic Interpretability

Mechanistic interpretability studies neural networks by treating them as learned computational systems. The goal is to identify the internal mechanisms that produce model behavior: features, circuits, attention heads, neurons, residual stream directions, and layer-to-layer transformations.

Attribution methods ask which input parts contributed to an output. Mechanistic interpretability asks a deeper question: what algorithm did the model implement internally?

For example, a language model may answer a factual question correctly. Attribution can show which prompt tokens mattered. Mechanistic analysis tries to locate how the model retrieves the fact, moves the information through layers, selects the final token, and suppresses alternatives.

### From Explanations to Mechanisms

Many interpretability methods produce explanations at the input level. Saliency maps, SHAP values, and token attributions assign scores to input features. These methods are useful, but they often say little about the model’s internal computation.

Mechanistic interpretability instead studies internal states.

A neural network computes a sequence of transformations:

$$
h_0 = x,
$$

$$
h_{l+1} = F_l(h_l),
$$

where $h_l$ is the representation at layer $l$. The output is produced from the final representation:

$$
\hat{y} = G(h_L).
$$

A mechanism is a structured subcomputation inside this chain. It may involve a small set of neurons, channels, attention heads, or representation directions that jointly implement a function.

Examples include:

| Mechanism | Possible role |
|---|---|
| Edge detector | Detects local image boundaries |
| Induction head | Copies or continues repeated text patterns |
| Name mover head | Moves entity information to the output position |
| Negation feature | Represents whether a statement is negated |
| Syntax circuit | Tracks grammatical structure |
| Refusal feature | Activates for disallowed requests |
| Retrieval circuit | Routes information from context to answer |

The aim is to move from vague explanations toward testable hypotheses about computation.

### Features

A feature is a meaningful direction, neuron, channel, or pattern in representation space.

In early convolutional networks, some features are easy to visualize. A channel may respond to edges, textures, colors, or object parts. In language models, features are more abstract. They may represent syntax, entities, sentiment, code structure, factual relations, or behavioral policies.

A feature does not always correspond to a single neuron. Neural networks often use distributed representations. One feature may be represented by a direction across many neurons, and one neuron may participate in many features.

This creates superposition: the model represents more features than it has dimensions by overlapping their directions in representation space.

A representation vector may be written as

$$
h = \sum_i a_i v_i,
$$

where $v_i$ are feature directions and $a_i$ are feature activations.

If the feature directions are not perfectly orthogonal, individual neuron activations can be difficult to interpret.
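This can be illustrated numerically. The sketch below uses synthetic data, not a trained model: it packs more random unit-norm feature directions than dimensions into one vector and reads a feature back with a dot product, showing both the recovered activation and the interference from overlap.

```python
import torch

torch.manual_seed(0)

d, n_features = 64, 256  # more features than dimensions

# Random unit vectors in high dimension are nearly, but not exactly,
# orthogonal. This is what makes superposition workable.
V = torch.randn(n_features, d)
V = V / V.norm(dim=1, keepdim=True)

# A sparse activation pattern: only a few features active at once.
a = torch.zeros(n_features)
a[[3, 17, 42]] = 1.0

h = a @ V  # h = sum_i a_i v_i

# Reading a feature back with a dot product recovers roughly its
# activation, plus interference from the other overlapping features.
readout = V @ h
print(readout[3].item())  # roughly 1 (active feature, plus noise)
print(readout[5].item())  # roughly 0 (inactive feature, plus noise)
```

The interference terms are exactly the non-orthogonality mentioned above: if the directions were orthogonal, the readout would be exact.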

### Neurons, Channels, and Directions

In older interpretability work, researchers often inspected individual neurons. For some models, this works. A neuron may activate for a human-readable concept.

In modern large models, single-neuron interpretation is often insufficient. Important concepts may be encoded as directions in activation space.

A direction $v$ can be tested by projecting an activation $h$ onto it:

$$
a = h^\top v.
$$

If $a$ is large, the feature represented by $v$ is strongly present.

Feature directions can be found in several ways:

| Method | Idea |
|---|---|
| Activation statistics | Find directions that vary with known concepts |
| Linear probes | Train a simple classifier on hidden states |
| Sparse autoencoders | Decompose activations into sparse features |
| Contrastive directions | Compare activations from positive and negative examples |
| PCA or SVD | Find high-variance directions |
| Manual search | Inspect activations for known examples |

A direction becomes meaningful only after validation. The analyst should test whether changing the direction changes model behavior.
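As an illustration of the contrastive approach from the table, the sketch below builds a direction from the difference of class means on synthetic activations (the data and the separation strength are invented for the example) and then validates it by projection.

```python
import torch

torch.manual_seed(0)
d = 32

# Synthetic setup: one hidden direction carries the concept.
true_v = torch.randn(d)
true_v = true_v / true_v.norm()

pos = torch.randn(100, d) + 2.0 * true_v  # concept present
neg = torch.randn(100, d)                 # concept absent

# Contrastive direction: difference of class means, normalized.
v = pos.mean(0) - neg.mean(0)
v = v / v.norm()

# Validation step: projections a = h^T v should separate the groups.
a_pos = pos @ v
a_neg = neg @ v
print(a_pos.mean().item(), a_neg.mean().item())
```

In a real analysis the validation would go further, e.g. ablating or steering along `v` and checking the behavioral effect.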

### Circuits

A circuit is a set of components that work together to implement a behavior.

In a transformer, a circuit may involve attention heads, MLP neurons, residual stream directions, and layer normalization effects. In a convolutional network, a circuit may involve filters, channels, pooling layers, and downstream classifiers.

A circuit should satisfy three conditions:

| Condition | Meaning |
|---|---|
| Localization | The mechanism is tied to identifiable components |
| Causal effect | Intervening on those components changes behavior |
| Specificity | The components account for the target behavior specifically, not behavior in general |

A circuit-level claim should be causal, not only correlational. If a head appears active during a task but ablating it has no effect, it may be incidental.

### Transformer Internals

Mechanistic interpretability is especially active for transformers because their structure exposes useful internal objects.

A transformer block usually contains:

1. A residual stream.
2. Multi-head attention.
3. An MLP block.
4. Layer normalization.
5. Residual additions.

The residual stream is the central communication channel. Each layer reads from it and writes back to it.

A simplified block is:

$$
h_{l+1} =
h_l
+
\operatorname{Attn}_l(h_l)
+
\operatorname{MLP}_l(h_l).
$$

This additive structure makes it possible to study which components write which information into the residual stream.

Attention heads move information between token positions. MLPs often transform or store features at each position. The final unembedding maps the last hidden state into token logits.
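The additive decomposition can be made concrete with a toy sketch. The components here are plain linear maps standing in for attention and the MLP, not a real transformer layer; the point is only the bookkeeping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16

# Toy stand-ins for one layer's components (illustration only).
attn = nn.Linear(d, d, bias=False)
mlp = nn.Linear(d, d, bias=False)

h = torch.randn(d)

attn_write = attn(h)  # what attention writes into the stream
mlp_write = mlp(h)    # what the MLP writes into the stream

h_next = h + attn_write + mlp_write

# Because the writes are additive, each component's contribution can
# be isolated exactly: subtracting one write undoes only that component.
h_without_mlp = h_next - mlp_write
assert torch.allclose(h_without_mlp, h + attn_write)
```

This separability is what makes component-level attribution in the residual stream tractable.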

### Attention Heads as Components

An attention head computes a weighted sum of value vectors from previous positions. For each token position, it decides where to read information from.

A head has query, key, value, and output matrices:

$$
Q = XW_Q,
$$

$$
K = XW_K,
$$

$$
V = XW_V.
$$

The attention pattern is

$$
A = \operatorname{softmax}
\left(
\frac{QK^\top}{\sqrt{d_k}}
\right).
$$

The head output is

$$
O = AVW_O.
$$

A mechanistic study may ask:

| Question | Interpretation |
|---|---|
| Which positions does the head attend to? | Routing pattern |
| What information is stored in the value vectors? | Content moved by the head |
| Where does the output write in residual space? | Downstream effect |
| What happens if the head is ablated? | Causal role |
| Does the head compose with other heads? | Circuit structure |

Attention patterns alone do not prove importance. The value and output projections matter as much as the attention weights.
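The head computation above can be sketched directly from the formulas. The weights are random, and a causal mask is added as in autoregressive models; everything else follows the equations.

```python
import torch

torch.manual_seed(0)
seq, d_model, d_k = 5, 16, 8

X = torch.randn(seq, d_model)
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)
W_O = torch.randn(d_k, d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Causal mask: each position may only attend to itself and earlier positions.
scores = Q @ K.T / d_k ** 0.5
mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

A = torch.softmax(scores, dim=-1)  # attention pattern (rows sum to 1)
O = A @ V @ W_O                    # head output written to the residual stream

print(A.sum(dim=-1))
```

Inspecting `A` answers the routing question in the table; inspecting `V` and `W_O` answers the content and downstream-write questions.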

### MLPs as Feature Transformers

Transformer MLPs apply position-wise nonlinear transformations. A common form is:

$$
\operatorname{MLP}(h) =
W_{\text{out}}
\sigma(W_{\text{in}}h + b_{\text{in}})
+
b_{\text{out}}.
$$

The hidden units or directions inside the MLP can act as feature detectors. The output matrix then writes feature-related directions back into the residual stream.

In some language models, MLPs appear to store factual, lexical, or behavioral associations. This does not mean every fact is stored in one neuron. More often, information is distributed across many parameters and activated by context.

A useful question is: what inputs activate this MLP feature, and what output direction does it write?
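With random weights, the correspondence between a hidden unit and its input and output directions can be sketched as follows. This is illustrative only; in real models, features are usually distributed across many units rather than tied to one.

```python
import torch

torch.manual_seed(0)
d_model, d_hidden = 16, 64

W_in = torch.randn(d_hidden, d_model)
b_in = torch.zeros(d_hidden)
W_out = torch.randn(d_model, d_hidden)
b_out = torch.zeros(d_model)

def mlp(h):
    return W_out @ torch.relu(W_in @ h + b_in) + b_out

# For hidden unit i, row i of W_in is the direction that activates it
# (its input feature), and column i of W_out is the direction it writes
# back into the residual stream (its output feature).
i = 7
input_dir = W_in[i]
output_dir = W_out[:, i]

# An input aligned with the unit's input direction activates it strongly
# relative to the other units.
h = input_dir / input_dir.norm()
act = torch.relu(W_in @ h + b_in)
print(act[i].item(), act.mean().item())
```

Answering the question in the text then amounts to characterizing which natural inputs align with `input_dir` and what `output_dir` does downstream.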

### Probing

A probe is a simple model trained on hidden representations to predict some property.

For example, we may collect hidden states $h_l$ and train a linear classifier to predict whether the sentence is in the past tense.

$$
\hat{z} = \operatorname{softmax}(Wh_l + b).
$$

If the probe performs well, the representation contains information about the property.

In PyTorch:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden):
        return self.classifier(hidden)
```

A probe must be interpreted carefully. High probe accuracy means the information is decodable. It does not prove the model uses that information for its prediction.

To strengthen the claim, combine probing with intervention. Remove or alter the probed direction and measure whether behavior changes.
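A minimal training sketch for such a probe, using synthetic hidden states in which the probed property is linearly decodable by construction (in practice the hidden states would be cached from real model runs):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim, num_classes, n = 32, 2, 200

# Synthetic "cached" hidden states: the label shifts activations
# along a fixed concept direction, so a linear probe can decode it.
concept = torch.randn(hidden_dim)
labels = torch.randint(0, num_classes, (n,))
hidden = torch.randn(n, hidden_dim) + labels[:, None] * concept

probe = nn.Linear(hidden_dim, num_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(hidden), labels)
    loss.backward()
    opt.step()

acc = (probe(hidden).argmax(-1) == labels).float().mean()
print(acc.item())  # high: the property is linearly decodable
```

A real probing study would also use a held-out split and a control task to check that the accuracy is not an artifact of probe capacity.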

### Activation Patching

Activation patching is a causal method. It compares a clean run and a corrupted run.

Suppose a model answers correctly for a clean prompt but incorrectly for a corrupted prompt. We run both prompts, then replace one internal activation in the corrupted run with the corresponding activation from the clean run. If this restores the correct answer, that activation is likely important.

The procedure is:

1. Run the clean input and cache activations.
2. Run the corrupted input.
3. Patch one activation from the clean cache into the corrupted run.
4. Measure whether the output recovers.
5. Repeat across layers, positions, and components.

This gives a causal map of where information is represented.

A simplified score is:

$$
\text{patch effect} =
F_{\text{patched}} - F_{\text{corrupt}},
$$

where $F$ is a scalar measure such as the correct-token logit difference.

Activation patching is expensive but powerful. It can locate the layer and position where a model carries task-relevant information.
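The procedure can be sketched on a toy model with forward hooks. A small `nn.Sequential` stands in for a transformer here; because the patch replaces an entire layer output, it trivially restores the clean computation, whereas real patching experiments localize the effect to single positions or components.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny stand-in model: Linear -> ReLU -> Linear.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

clean = torch.randn(8)
corrupt = torch.randn(8)

# 1. Run the clean input and cache the activation after the first layer.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()

handle = model[0].register_forward_hook(save_hook)
clean_out = model(clean)
handle.remove()

# 2. Run the corrupted input without patching.
corrupt_out = model(corrupt)

# 3-4. Patch the clean activation into the corrupted run and remeasure.
def patch_hook(module, inputs, output):
    return cache["act"]  # returning a value replaces the module output

handle = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt)
handle.remove()

# Patching the full layer output restores the clean computation exactly.
assert torch.allclose(patched_out, clean_out)
```

Step 5 then repeats this loop over layers, positions, and components, recording the patch effect for each.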

### Ablation

Ablation removes or modifies a component and measures the behavioral effect.

For a component $c$, we compare:

$$
F_{\text{original}}(x)
$$

with

$$
F_{\text{ablated }c}(x).
$$

The difference estimates the component’s causal contribution.

Common ablations include:

| Ablation | Description |
|---|---|
| Zero ablation | Replace activation with zero |
| Mean ablation | Replace activation with dataset mean |
| Resample ablation | Replace activation with activation from another example |
| Head ablation | Remove one attention head |
| Neuron ablation | Remove one neuron or feature |
| Direction ablation | Remove projection onto a feature direction |

Mean and resample ablation are often better than zero ablation because zero may be outside the normal activation distribution.
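Direction ablation, for example, removes only the projection onto a feature direction and leaves everything orthogonal to it intact. A sketch on random activations:

```python
import torch

torch.manual_seed(0)
d = 16

h = torch.randn(5, d)    # a batch of activations
v = torch.randn(d)
v = v / v.norm()         # a feature direction (unit norm)

# Direction ablation: h' = h - (h . v) v, removing the component along v.
h_ablated = h - (h @ v)[:, None] * v

# The ablated activations carry no component along v.
assert torch.allclose(h_ablated @ v, torch.zeros(5), atol=1e-5)

# Mean ablation, by contrast, replaces the whole activation with a
# dataset mean, discarding all example-specific information.
h_mean = h.mean(0, keepdim=True).expand_as(h)
```

Because direction ablation stays close to the original activation, it tends to produce fewer out-of-distribution artifacts than zeroing.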

### Sparse Autoencoders

Sparse autoencoders are used to decompose dense activations into more interpretable features.

Given an activation vector $h$, a sparse autoencoder learns:

$$
z = \sigma(W_e h + b_e),
$$

$$
\hat{h} = W_d z + b_d,
$$

where $z$ is encouraged to be sparse.

The training objective is often:

$$
\|h - \hat{h}\|_2^2 + \lambda \|z\|_1.
$$

The hope is that each sparse feature $z_i$ corresponds to a more interpretable concept than a raw neuron.

After training, each feature can be studied by finding examples that activate it strongly. The feature can also be ablated or amplified to test causal effect.

Sparse autoencoders are useful because they address superposition. They try to recover a larger set of sparse features from a smaller dense representation space.
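A minimal sparse autoencoder matching the equations above. The ReLU encoder and the L1 coefficient value are common but not universal choices; real SAE training also involves details such as decoder-weight normalization that are omitted here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_features, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # W_e, b_e
        self.decoder = nn.Linear(d_features, d_model)  # W_d, b_d
        self.l1_coeff = l1_coeff

    def forward(self, h):
        z = torch.relu(self.encoder(h))  # sparse feature activations
        h_hat = self.decoder(z)          # reconstruction of h
        # Objective: reconstruction error plus L1 sparsity penalty.
        loss = ((h - h_hat) ** 2).sum(-1).mean() \
            + self.l1_coeff * z.abs().sum(-1).mean()
        return z, h_hat, loss

# The feature dimension is deliberately larger than the model dimension.
sae = SparseAutoencoder(d_model=32, d_features=256)
h = torch.randn(64, 32)
z, h_hat, loss = sae(h)
print(z.shape, h_hat.shape)
```

Note that `d_features > d_model`: the overcomplete basis is what lets the SAE pull superposed features apart.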

### Logit Lens

The logit lens projects intermediate hidden states through the model’s final unembedding matrix to see what token predictions are already present.

If $h_l$ is the hidden state at layer $l$, the logit lens computes:

$$
z_l = W_U h_l,
$$

where $W_U$ is the unembedding matrix.

This gives a vocabulary-sized score vector at each layer.

The logit lens can show how a language model gradually forms an answer. Early layers may represent broad syntax or topic. Later layers may sharpen toward a specific token.

The method is simple, but it has limitations. Intermediate residual states may not be calibrated for the final unembedding. Tuned variants add normalization or learned affine maps.
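A sketch of the computation, with random tensors standing in for the cached hidden states and the unembedding matrix:

```python
import torch

torch.manual_seed(0)
vocab, d_model, n_layers = 50, 16, 4

# Stand-ins for cached residual-stream states, one per layer, and W_U.
hidden_states = [torch.randn(d_model) for _ in range(n_layers)]
W_U = torch.randn(vocab, d_model)

# Logit lens: decode every intermediate state with the final unembedding.
lens_logits = [W_U @ h for h in hidden_states]

for l, z in enumerate(lens_logits):
    print(f"layer {l}: top token id = {z.argmax().item()}")
```

With real cached states, plotting the rank or logit of the correct token across layers shows where the prediction forms.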

### Steering and Representation Editing

If a feature direction controls a behavior, we can sometimes steer the model by adding or subtracting that direction from an activation.

For hidden state $h$, steering uses:

$$
h' = h + \alpha v,
$$

where $v$ is a feature direction and $\alpha$ controls strength.

Examples include increasing sentiment, reducing toxicity, changing style, or encouraging refusal. Steering is a causal test: if adding the direction reliably changes behavior, the direction likely participates in that behavior.

However, steering can have side effects. A direction may affect multiple features. Large interventions can push activations outside the model’s normal distribution.
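A sketch of the intervention itself, with random vectors standing in for a real activation and a validated feature direction:

```python
import torch

torch.manual_seed(0)
d = 16

h = torch.randn(d)
v = torch.randn(d)
v = v / v.norm()   # a feature direction (unit norm)

alpha = 3.0
h_steered = h + alpha * v

# Steering raises the activation along v by exactly alpha...
assert torch.allclose(h_steered @ v, h @ v + alpha)

# ...while leaving the components of h orthogonal to v unchanged.
residual = (h_steered - (h_steered @ v) * v) - (h - (h @ v) * v)
assert residual.abs().max() < 1e-5
```

In practice the addition is applied inside the model with a forward hook, and `alpha` is swept to trade off effect strength against the out-of-distribution risk noted below.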

### Practical PyTorch Hook Pattern

Mechanistic analysis often requires hooks. A hook lets us inspect or modify activations during a forward or backward pass.

A forward hook can cache activations:

```python
class ActivationCache:
    def __init__(self):
        self.data = {}

    def save(self, name):
        def hook(module, inputs, output):
            # Detach so cached tensors do not keep the autograd graph alive.
            self.data[name] = output.detach()
        return hook
```

Usage:

```python
cache = ActivationCache()

handle = model.layer.register_forward_hook(
    cache.save("layer")
)

with torch.no_grad():
    output = model(x)

activation = cache.data["layer"]

handle.remove()
```

A modifying hook can patch activations:

```python
def patch_activation(replacement):
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's
        # output; the replacement must match the original output's shape.
        return replacement
    return hook
```

Hooks are powerful but must be used carefully. They can change tensor shapes, break gradient flow, or produce invalid activation distributions.

### Limits of Mechanistic Interpretability

Mechanistic interpretability remains an active research area. Current methods work best on small models, specific behaviors, and localized circuits. Large foundation models contain many overlapping mechanisms, and their behavior may involve distributed computation across many layers and features.

Common limits include:

| Limit | Explanation |
|---|---|
| Scale | Large models have too many components to inspect manually |
| Superposition | Features overlap in representation space |
| Polysemanticity | One neuron or feature may respond to several concepts |
| Dataset dependence | Findings may depend on chosen examples |
| Intervention artifacts | Ablations may create unnatural states |
| Partial explanations | A found circuit may explain only part of the behavior |

A mechanistic result should therefore be stated narrowly. It should specify the model, task, dataset, components, and intervention used.

### Summary

Mechanistic interpretability studies the internal computations learned by neural networks. It tries to identify features, directions, neurons, attention heads, MLPs, and circuits that causally produce behavior.

The main tools are probing, activation patching, ablation, sparse autoencoders, logit lens analysis, and representation steering. The strongest evidence comes from causal intervention: changing an internal component and observing a predictable change in output.

The field aims to move from surface-level explanations toward a detailed account of learned algorithms. Its current results are useful but usually local, partial, and model-specific.

