
Model Editing

Model editing modifies a trained model so that it changes a specific behavior while preserving most other behaviors. The goal is to update knowledge, remove undesirable outputs, correct mistakes, or alter policies without retraining the entire model.

For example, suppose a language model answers:

“The capital of Australia is Sydney.”

A model edit attempts to change this fact so that the model answers:

“The capital of Australia is Canberra.”

The edit should generalize to related prompts:

| Prompt | Desired output |
| --- | --- |
| “What is the capital of Australia?” | “Canberra” |
| “Australia’s capital city is” | “Canberra” |
| “The government of Australia is based in” | “Canberra” |

At the same time, unrelated knowledge should remain stable. The model should still answer correctly for unrelated countries, mathematics, code generation, and reasoning tasks.

This balance between local modification and global preservation is the core challenge of model editing.

Why Model Editing Matters

Large models are trained on massive datasets and may contain outdated, incorrect, unsafe, or undesirable behaviors.

Editing is useful for:

| Use case | Example |
| --- | --- |
| Correcting factual errors | Updating outdated knowledge |
| Safety refinement | Removing harmful outputs |
| Policy alignment | Changing refusal behavior |
| Personalization | Adding user-specific preferences |
| Domain adaptation | Injecting organization-specific facts |
| Debugging | Removing spurious behaviors |
| Scientific analysis | Testing causal hypotheses |

Retraining a large model from scratch is expensive. Fine-tuning the entire model may unintentionally change many unrelated behaviors. Editing methods attempt to make smaller and more targeted updates.

The Editing Objective

Let a model with parameters \theta produce output

f_\theta(x).

Suppose we want to enforce a desired behavior:

f_{\theta'}(x_e) = y_e,

where x_e is the edit prompt and y_e is the desired target output.

The updated parameters \theta' should also preserve the original model behavior on unrelated inputs:

f_{\theta'}(x) \approx f_\theta(x) \quad \text{for unrelated } x.

A good edit therefore requires:

| Property | Meaning |
| --- | --- |
| Reliability | The target behavior changes correctly |
| Generalization | Related prompts also change |
| Locality | Unrelated behaviors remain stable |
| Stability | Multiple edits do not interfere destructively |
| Efficiency | Editing is cheaper than retraining |

These properties often conflict. A large parameter change may strongly enforce the edit but damage unrelated capabilities.

Fine-Tuning as Editing

The simplest editing method is fine-tuning.

We create an editing dataset:

\mathcal{D}_{\text{edit}} = \{ (x_i, y_i) \},

and optimize:

\theta' = \arg\min_\theta \sum_i L(f_\theta(x_i), y_i).

In PyTorch:

import torch

def fine_tune_step(model, optimizer, input_ids, labels):
    # One gradient step toward the edit targets.
    # Assumes a Hugging Face-style model that returns a loss when labels are given.
    model.train()

    outputs = model(input_ids=input_ids, labels=labels)
    loss = outputs.loss

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    return loss.item()

Fine-tuning is flexible and easy to implement, but it lacks locality. Updating parameters for one fact may accidentally change other behaviors.

For small edits, lightweight adaptation methods are often preferred.

Low-Rank Adaptation

Low-rank adaptation methods modify only small trainable matrices while freezing most model parameters.

Suppose a weight matrix is

W \in \mathbb{R}^{m \times n}.

A low-rank update uses:

W' = W + BA,

where

B \in \mathbb{R}^{m \times r}, \quad A \in \mathbb{R}^{r \times n},

and r is small.

Only A and B are trained.

This reduces memory and computation cost while limiting how much the model changes. In practice, low-rank editing can inject new behaviors with relatively small interference.

A simplified PyTorch implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()

        # Frozen base weight; only the low-rank factors A and B are trained.
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features),
            requires_grad=False,
        )

        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.randn(out_features, rank) * 0.01)

    def forward(self, x):
        # Apply the low-rank update W' = W + BA.
        delta = self.B @ self.A
        weight = self.weight + delta

        return x @ weight.t()

This example is simplified but illustrates the structure of a low-rank update.

Knowledge Localization

Modern editing methods often assume that factual knowledge is localized in specific model components.

For example, a transformer may store a factual association in a subset of MLP layers. If we can identify where the fact is represented, we may be able to update only those parameters.

Suppose a hidden representation is

h_l = F_l(h_{l-1}).

A factual edit may target only one layer:

h_l' = F_l'(h_{l-1}),

while leaving other layers unchanged.

This motivates methods that first locate relevant layers and then apply targeted parameter updates.
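
As a simple sketch, a located fact can be targeted by training only the MLP of one layer. The layer index and the GPT-2-style parameter names below are hypothetical and would come from a localization analysis of the specific model.

# Hypothetical layer index produced by a localization analysis.
target_layer = 17

# Train only the MLP of the located layer (GPT-2-style parameter naming assumed).
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(f"transformer.h.{target_layer}.mlp.")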

Rank-One Model Editing

Some editing methods approximate the required parameter change with a rank-one update.

Suppose a layer uses:

y = Wh.

We want to change the output for one hidden representation h_e. A rank-one update has the form:

W' = W + uv^\top.

The vectors u and v are chosen so that:

W' h_e

moves toward the desired representation.

The update changes the weight matrix minimally outside the targeted direction.

The intuition is geometric. A rank-one update changes the model strongly in one direction of representation space while leaving other directions largely unchanged.
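
A minimal sketch of such an update, assuming the hidden representation h_e and a desired output vector y_target for the edited layer are already known. Choosing v along h_e makes the correction exact at h_e, while other inputs are affected only in proportion to their overlap with v.

import torch

def rank_one_edit(W, h_e, y_target):
    # Choose v so that v^T h_e = 1; then u carries the full residual.
    v = h_e / (h_e @ h_e)
    u = y_target - W @ h_e

    # The edited layer now maps h_e exactly to y_target:
    # W' h_e = W h_e + u (v^T h_e) = y_target.
    return W + torch.outer(u, v)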

Editing as Memory Modification

A language model can be viewed as storing associations between contexts and continuations.

For example:

\text{“The capital of Australia is”} \rightarrow \text{“Canberra”}.

Editing modifies this association.

Mechanistically, the model may:

  1. Detect the subject “Australia.”
  2. Retrieve a representation associated with the country.
  3. Transform that representation into a city token distribution.
  4. Produce the next token.

An edit changes some part of this retrieval or transformation process.

This motivates causal tracing methods that identify which layers and positions contribute most to factual recall.
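
A rough sketch of one tracing step, assuming a Hugging Face-style model and a layer module whose forward output is a plain hidden-state tensor (many transformer blocks return tuples, in which case the hooks must index into them). The activation from a clean prompt is cached and patched into a run on a corrupted prompt of the same length, and we measure how much probability of the correct answer is restored.

import torch

@torch.no_grad()
def trace_layer(model, layer, clean_ids, corrupted_ids, answer_id):
    cache = {}

    def save(module, inputs, output):
        # Store the layer's activation from the clean run.
        cache["clean"] = output

    def patch(module, inputs, output):
        # Overwrite the corrupted run's activation with the clean one.
        return cache["clean"]

    handle = layer.register_forward_hook(save)
    model(input_ids=clean_ids)
    handle.remove()

    handle = layer.register_forward_hook(patch)
    logits = model(input_ids=corrupted_ids).logits
    handle.remove()

    # Probability assigned to the correct answer token after patching.
    return logits[0, -1].softmax(dim=-1)[answer_id].item()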

Activation Editing

Instead of changing model parameters, we can edit activations during inference.

Suppose a hidden state is

h.

We apply a steering vector:

h' = h + \alpha v,

where v is a learned direction and \alpha controls strength.

This method is temporary and reversible. It changes behavior only during inference.
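
A minimal sketch using a PyTorch forward hook, assuming the hooked module returns a plain hidden-state tensor; the module path in the usage comment is hypothetical.

def add_steering_hook(module, v, alpha=1.0):
    # Shift the module's output by alpha * v on every forward pass.
    def hook(mod, inputs, output):
        return output + alpha * v

    return module.register_forward_hook(hook)

# handle = add_steering_hook(model.transformer.h[12], v, alpha=2.0)
# handle.remove()  # removing the hook restores the original behavior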

Activation editing is useful for:

| Goal | Example |
| --- | --- |
| Toxicity reduction | Push activations away from toxic directions |
| Style control | Increase formal or concise writing |
| Persona control | Encourage a character or role |
| Refusal control | Increase or decrease refusal behavior |
| Sentiment steering | Shift emotional tone |

Activation editing is less stable than parameter editing because it depends on runtime conditions and context.

Representation Engineering

Representation engineering treats hidden states as structured semantic spaces.

Suppose we identify two sets of examples:

| Set | Meaning |
| --- | --- |
| Positive examples | Helpful, truthful outputs |
| Negative examples | Harmful or deceptive outputs |

We compute average hidden states:

\mu_+, \quad \mu_-.

A steering direction is then:

v = \mu_+ - \mu_-.

Adding this direction during inference may encourage the model toward the positive behavior.

This approach is simple and often surprisingly effective. However, the learned direction may entangle multiple behaviors.
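
A sketch of computing the direction, assuming hidden states from a chosen layer have already been collected for the two example sets, one row per example:

import torch

def steering_direction(pos_hidden, neg_hidden):
    # pos_hidden, neg_hidden: tensors of shape (num_examples, hidden_dim).
    mu_pos = pos_hidden.mean(dim=0)
    mu_neg = neg_hidden.mean(dim=0)

    v = mu_pos - mu_neg
    # Normalize so that the scaling factor alpha alone controls strength.
    return v / v.norm()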

Editing Safety Policies

Safety editing attempts to change harmful or unsafe behavior while preserving normal capability.

Examples include:

| Behavior | Desired edit |
| --- | --- |
| Harmful instructions | Increase refusal |
| Hallucinated medical advice | Increase uncertainty |
| Toxic language | Reduce generation probability |
| Privacy leakage | Block memorized outputs |

A naïve safety edit can damage usefulness. Overly strong refusal behavior may suppress benign requests.

This creates a capability-alignment tradeoff:

| Edit strength | Typical outcome |
| --- | --- |
| Weak edit | Unsafe outputs remain |
| Strong edit | Useful behavior may degrade |

Safety editing therefore requires careful evaluation on both harmful and benign tasks.

Catastrophic Interference

Edits can interfere with previous knowledge.

Suppose we edit one fact:

“The CEO of company X is Alice.”

The update may unintentionally affect related facts:

| Unintended change | Example |
| --- | --- |
| Entity confusion | Wrongly changing another company |
| Relation drift | Changing other CEO facts |
| Overgeneralization | Altering unrelated countries or names |
| Language drift | Affecting multilingual outputs |

This problem is called catastrophic interference.

Interference occurs because neural parameters are shared across many behaviors. A parameter may participate in many overlapping circuits.

One defense is locality regularization:

L_{\text{total}} = L_{\text{edit}} + \lambda L_{\text{preserve}},

where L_{\text{preserve}} penalizes deviation on unrelated examples.
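
A sketch of this combined loss, assuming a Hugging Face-style model, a frozen copy of the original model as reference, and a preservation term measured as a KL divergence on unrelated prompts:

import torch
import torch.nn.functional as F

def locality_regularized_loss(model, ref_model, edit_batch, preserve_batch, lam=1.0):
    # Edit term: push the model toward the desired output on the edit prompts.
    edit_loss = model(
        input_ids=edit_batch["input_ids"],
        labels=edit_batch["labels"],
    ).loss

    # Preservation term: stay close to the frozen original model on unrelated prompts.
    with torch.no_grad():
        ref_logits = ref_model(input_ids=preserve_batch["input_ids"]).logits
    logits = model(input_ids=preserve_batch["input_ids"]).logits

    preserve_loss = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.softmax(ref_logits, dim=-1),
        reduction="batchmean",
    )

    return edit_loss + lam * preserve_loss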

Sequential Editing

Single edits are easier than many edits.

If we apply many sequential edits:

\theta \rightarrow \theta_1 \rightarrow \theta_2 \rightarrow \cdots,

the updates may accumulate and degrade the model.

A good editing system should support:

| Requirement | Meaning |
| --- | --- |
| Edit compositionality | Multiple edits coexist |
| Memory retention | Old edits remain stable |
| Conflict resolution | Contradictory edits handled safely |
| Efficient updates | New edits remain cheap |

This is an active research area. Current methods often degrade after many edits.

Evaluation of Model Editing

Editing methods should be evaluated systematically.

Common evaluation dimensions include:

| Metric | Meaning |
| --- | --- |
| Edit success | Does the target output change correctly? |
| Paraphrase generalization | Does the edit transfer to related prompts? |
| Locality | Are unrelated outputs preserved? |
| Fluency | Does generation remain natural? |
| Robustness | Does the edit survive prompt variation? |
| Sequential stability | Do multiple edits coexist? |

A good benchmark should include:

  1. Direct edit prompts.
  2. Paraphrased prompts.
  3. Neighboring unrelated prompts.
  4. Multi-hop reasoning prompts.
  5. Long-context prompts.

Testing only the exact edited sentence is insufficient.
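
A rough first-token check along the first three dimensions, assuming a Hugging Face-style model and tokenizer; each set is a list of (prompt, expected_continuation) pairs, and this is a sketch rather than a full benchmark.

import torch

@torch.no_grad()
def first_token_match(model, tokenizer, prompt, expected):
    # Does the top next-token prediction start the expected answer?
    # For byte-pair tokenizers the expected string usually needs a leading space.
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    predicted = model(input_ids=ids).logits[0, -1].argmax().item()
    expected_id = tokenizer(expected, add_special_tokens=False).input_ids[0]
    return predicted == expected_id

def edit_metrics(model, tokenizer, edit_set, paraphrase_set, locality_set):
    def score(pairs):
        hits = sum(first_token_match(model, tokenizer, p, e) for p, e in pairs)
        return hits / len(pairs)

    return {
        "edit_success": score(edit_set),
        "paraphrase_generalization": score(paraphrase_set),
        "locality": score(locality_set),
    }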

Editing and Hallucination

Model editing interacts closely with hallucination.

A factual edit may reduce one hallucination while creating others. For example, aggressively injecting new knowledge may distort related representations.

A model may also produce conflicting facts depending on phrasing. This indicates that the edit generalized unevenly across contexts.

One practical approach is hybrid systems:

| Component | Role |
| --- | --- |
| Base model | General reasoning |
| Retrieval system | Current factual knowledge |
| Editing layer | Small behavioral corrections |

This reduces pressure on parameter editing alone.

Editing Versus Retrieval

A key design question is whether knowledge should be stored in parameters or retrieved externally.

Parameter editing modifies internal memory. Retrieval-augmented systems instead fetch external information at inference time.

Comparison:

| Method | Advantages | Weaknesses |
| --- | --- | --- |
| Parameter editing | Fast inference, integrated behavior | Hard to update safely |
| Retrieval | Easy updates, external verification | Retrieval failures possible |
| Hybrid systems | Flexible and updatable | More system complexity |

Modern systems increasingly combine both approaches.

Causal Perspective

A strong edit should correspond to a causal modification of the computation.

Suppose a fact is represented through a circuit:

x \rightarrow h_1 \rightarrow h_2 \rightarrow y.

An edit may intervene on:

| Intervention level | Example |
| --- | --- |
| Input level | Prompt engineering |
| Activation level | Steering vectors |
| Weight level | Rank-one updates |
| Architectural level | Adding retrieval modules |

Mechanistic interpretability helps identify where these interventions should occur.

Practical PyTorch Pattern

A simple editing workflow in PyTorch often follows this structure:

def edit_step(model, optimizer, batch):
    model.train()

    outputs = model(
        input_ids=batch["input_ids"],
        labels=batch["labels"],
    )

    loss = outputs.loss

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    return loss.item()

For localized editing:

  1. Freeze most parameters.
  2. Train only adapters or low-rank matrices.
  3. Evaluate locality on held-out prompts.
  4. Compare outputs before and after editing.

Selective parameter training:

# Freeze every parameter in the model...
for name, param in model.named_parameters():
    param.requires_grad = False

# ...then re-enable gradients only for the adapter module.
for param in model.adapter.parameters():
    param.requires_grad = True

This reduces interference and computational cost.

Limits of Model Editing

Model editing remains difficult for large systems.

Current limitations include:

| Limitation | Description |
| --- | --- |
| Entangled representations | One parameter affects many behaviors |
| Poor locality | Edits spread unpredictably |
| Weak compositionality | Many edits interfere |
| Context dependence | Edits fail under rephrasing |
| Safety instability | Alignment behaviors shift unexpectedly |
| Scale challenges | Large models are difficult to analyze causally |

Many current methods work well for small factual edits but struggle with broad conceptual changes.

Summary

Model editing modifies trained models to change specific behaviors while preserving most existing capabilities. Methods include fine-tuning, low-rank adaptation, rank-one updates, activation steering, and representation engineering.

A good edit should be reliable, local, stable, and efficient. The main challenge is interference: neural parameters are shared across many behaviors, so changing one behavior may unintentionally affect others.

Mechanistic interpretability provides tools for locating where knowledge and behavior are represented, making more targeted edits possible. Retrieval systems provide an alternative or complementary strategy by moving knowledge outside the model parameters.