Instruction tuning teaches a model to imitate demonstrations. Reinforcement learning from human feedback, usually abbreviated RLHF, goes further. Instead of only copying target responses, the model learns to optimize behavior according to human preferences.
The central idea is that many desirable properties of language model behavior are difficult to specify with simple supervised labels. For example:
| Desired property | Why it is difficult |
|---|---|
| Helpfulness | Depends on context and user intent |
| Harmlessness | Requires judgment and safety tradeoffs |
| Honesty | Requires uncertainty awareness |
| Conciseness | Depends on task and audience |
| Tone | Depends on conversational context |
| Reasoning quality | Often subjective |
Rather than writing explicit rules for every situation, RLHF learns a reward signal from preference comparisons made by humans or by a stronger supervisory model.
The system is then optimized to maximize this learned reward.
The RLHF Pipeline
A standard RLHF pipeline has three stages:
| Stage | Goal |
|---|---|
| Pretraining | Learn language and world knowledge |
| Supervised fine-tuning | Learn instruction-following behavior |
| Reinforcement learning | Optimize responses using preference rewards |
The reinforcement learning stage usually begins with an instruction-tuned model.
The overall process looks like:

Pretraining → Instruction tuning → Preference collection → Reward model training → Policy optimization

The final policy is the aligned assistant model.
Preference Data
The core training signal in RLHF is preference data.
Human annotators compare candidate model responses and indicate which one is preferred.
Example:
| Prompt | Response A | Response B | Preferred |
|---|---|---|---|
| “Explain recursion.” | Clear explanation | Confusing answer | A |
| “How do I build malware?” | Refusal | Harmful instructions | A |
| “Summarize this article.” | Accurate summary | Hallucinated summary | A |
Preference data does not require annotators to write perfect answers from scratch. Ranking alternatives is often faster and more consistent than free-form generation.
The comparisons define a partial ordering over responses.
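As a minimal illustration (the field names here are hypothetical, not tied to any particular dataset format), a single preference record can be represented as a prompt plus a chosen and a rejected response:

```python
# Hypothetical structure of one preference comparison record.
preference_example = {
    "prompt": "Explain recursion.",
    "chosen": "Recursion is when a function calls itself on a smaller subproblem...",
    "rejected": "Recursion is a kind of loop that never ends...",
}

# A preference dataset is simply a collection of such comparisons; the training
# signal is which response was preferred, not an absolute quality score.
preference_dataset = [preference_example]
```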
Reward Models
The preference comparisons are used to train a reward model.
The reward model receives:
| Input | Description |
|---|---|
| Prompt | User instruction or conversation |
| Candidate response | Model-generated answer |
The reward model outputs a scalar score:

$$ r_\phi(x, y) \in \mathbb{R}, $$

where:

| Symbol | Meaning |
|---|---|
| $x$ | Prompt |
| $y$ | Response |
| $\phi$ | Reward model parameters |

Higher scores indicate preferred responses.
The reward model is trained from pairwise comparisons. Suppose humans prefer response $y_w$ over response $y_l$. The reward model should assign:

$$ r_\phi(x, y_w) > r_\phi(x, y_l). $$

A common training objective is the Bradley-Terry preference model:

$$ P(y_w \succ y_l \mid x) = \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big), $$

where $\sigma$ is the sigmoid function. The loss is the negative log-likelihood of the observed preferences:

$$ \mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]. $$
The reward model therefore learns to approximate human preferences statistically.
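A minimal PyTorch sketch of this pairwise loss is shown below. The reward head, tensor names, and shapes are illustrative assumptions; real systems attach the scalar head to a pretrained transformer and batch the comparisons.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Illustrative reward model head: maps a pooled hidden state to a scalar score."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        # pooled_hidden: (batch, hidden_size), e.g. the final-token hidden state
        return self.score(pooled_hidden).squeeze(-1)  # (batch,)

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of preferences under the Bradley-Terry model."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with random hidden states standing in for encoder outputs.
head = RewardHead(hidden_size=16)
chosen_hidden = torch.randn(4, 16)    # pooled states for preferred responses
rejected_hidden = torch.randn(4, 16)  # pooled states for rejected responses
loss = bradley_terry_loss(head(chosen_hidden), head(rejected_hidden))
```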
Policy Optimization
Once the reward model is trained, the language model is optimized to maximize reward.
The language model becomes a policy:

$$ \pi_\theta(y \mid x), $$

which generates responses $y$ conditioned on prompts $x$.

The objective is approximately:

$$ \max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big]. $$
However, directly maximizing reward is dangerous. The model may exploit weaknesses in the reward model and generate unnatural or degenerate text.
To stabilize training, RLHF usually constrains the policy to remain close to the supervised fine-tuned model.
KL-Regularized Objectives
A common RLHF objective includes a KL-divergence penalty:

$$ \max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta \, \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \Big]. $$

Here:

| Symbol | Meaning |
|---|---|
| $\pi_\theta$ | Current policy |
| $\pi_{\text{ref}}$ | Reference policy |
| $r_\phi$ | Reward model |
| $\beta$ | KL penalty coefficient |
The KL penalty discourages the model from drifting too far from the original instruction-tuned distribution.
Without this constraint, the model may maximize reward through pathological outputs rather than genuinely useful behavior.
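As an illustrative sketch (the tensor names and shapes are assumptions), the KL penalty is often approximated from the log-probabilities that the policy and a frozen reference model assign to the sampled response tokens, and then subtracted from the reward model score:

```python
import torch

beta = 0.05  # illustrative KL penalty coefficient

# Assumed inputs: log-probs of the *sampled* response tokens under each model.
# Shapes: (batch, sequence_length). In practice these come from forward passes
# of the policy and a frozen reference (SFT) model over the generated tokens.
policy_logps = torch.randn(2, 8)
ref_logps = torch.randn(2, 8)
rewards = torch.randn(2)  # scalar reward model score per response

# Per-token log-ratio estimate of the KL between policy and reference.
per_token_kl = policy_logps - ref_logps          # (batch, seq)
kl_penalty = per_token_kl.sum(dim=-1)            # (batch,)

# KL-shaped reward that the policy is actually optimized against.
shaped_reward = rewards - beta * kl_penalty
```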
PPO and Policy Gradient Methods
Early RLHF systems commonly used Proximal Policy Optimization, or PPO.
PPO is a policy-gradient reinforcement learning algorithm designed to improve stability during policy updates.
The idea is simple:
- Generate responses.
- Score them with the reward model.
- Estimate advantages.
- Update the policy gradually.
The PPO objective constrains policy updates so that each optimization step remains relatively small.
A simplified form is:

$$ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \Big[ \min\big( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\big( r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon \big)\, \hat{A}_t \big) \Big], $$

where:

| Symbol | Meaning |
|---|---|
| $r_t(\theta)$ | Probability ratio between the new and old policy |
| $\hat{A}_t$ | Advantage estimate |
| $\epsilon$ | Clipping parameter |
PPO reduces unstable jumps in policy behavior.
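A minimal PyTorch sketch of the clipped surrogate loss follows; the log-probability and advantage tensors are placeholder assumptions for what a full rollout pipeline would provide.

```python
import torch

epsilon = 0.2  # illustrative clipping parameter

# Assumed per-token inputs from a rollout buffer (shapes: (batch, seq)).
new_logps = torch.randn(2, 8, requires_grad=True)   # log-probs under current policy
old_logps = torch.randn(2, 8)                       # log-probs under the rollout policy
advantages = torch.randn(2, 8)                      # advantage estimates

ratio = torch.exp(new_logps - old_logps)            # probability ratio r_t(theta)
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

# PPO maximizes the clipped surrogate, so the training loss is its negation.
ppo_loss = -torch.min(unclipped, clipped).mean()
```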
However, PPO training is computationally expensive and operationally complex. Many modern systems now prefer simpler alternatives.
Direct Preference Optimization
A newer approach is Direct Preference Optimization, or DPO.
DPO avoids explicit reinforcement learning. Instead of training a separate reward model and running PPO, DPO directly optimizes preference comparisons.
The key insight is that under certain assumptions, maximizing a KL-regularized reward objective can be transformed into a supervised classification objective over preferred and rejected responses.
The DPO objective encourages:
$$ \pi_\theta(y_w \mid x) > \pi_\theta(y_l \mid x), $$
while keeping the model close to a reference policy.
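Written out, the standard DPO loss applies a logistic loss to the difference of log-probability ratios against the reference policy:

$$ \mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right], $$

where $\beta$ controls how strongly the policy is kept near $\pi_{\text{ref}}$.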
Advantages of DPO include:
| Advantage | Description |
|---|---|
| Simpler pipeline | No PPO rollout loop |
| More stable | Easier optimization |
| Lower compute cost | Fewer moving parts |
| Easier implementation | Standard supervised-style training |
Because of these advantages, many modern alignment systems use preference optimization variants instead of classical PPO-based RLHF.
Reward Hacking
A reward model is only an approximation of human judgment. If optimized aggressively, the policy may exploit weaknesses in the reward signal.
This is called reward hacking.
Examples include:
| Failure mode | Example |
|---|---|
| Verbosity bias | Extremely long answers because reward correlates with detail |
| Sycophancy | Agreeing with the user even when incorrect |
| Style exploitation | Polite wording masking factual errors |
| Safety over-optimization | Excessive refusal behavior |
| Repetition | Repeating patterns the reward model scores highly |
| Hallucinated confidence | Fluent but false explanations |
Reward hacking is a fundamental alignment problem. Optimizing proxy rewards can produce unintended behavior.
The reward model does not define true human values. It defines a learned approximation.
Distribution Shift
The reward model is trained on a limited distribution of responses. During optimization, the policy may generate outputs outside that distribution.
This creates distribution shift.
For example:
- The reward model sees mostly ordinary assistant responses.
- The policy explores unusual outputs during optimization.
- The reward model produces unreliable scores on unfamiliar text.
- The policy exploits those errors.
This is similar to adversarial optimization in other machine learning systems.
Large policy shifts can therefore destabilize RLHF.
KL regularization, conservative optimization, rejection sampling, and human auditing are used to reduce this problem.
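One of these mitigations, rejection sampling (best-of-n selection), is easy to sketch: sample several candidate responses and keep only the highest-scoring one, instead of pushing the policy itself arbitrarily far toward the reward model's preferences. The generation and scoring functions below are placeholders for a real sampling loop and a trained reward model.

```python
import random

def generate_candidates(prompt: str, n: int) -> list[str]:
    # Placeholder: a real system would sample n responses from the policy.
    return [f"candidate response {i} to: {prompt}" for i in range(n)]

def reward_score(prompt: str, response: str) -> float:
    # Placeholder: a real system would call the trained reward model.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Rejection sampling: keep the highest-reward candidate out of n samples."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda resp: reward_score(prompt, resp))

print(best_of_n("Explain recursion."))
```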
Multi-Objective Alignment
Human preferences are not one-dimensional.
A useful assistant should balance multiple goals:
| Objective | Meaning |
|---|---|
| Helpfulness | Solves the user’s problem |
| Harmlessness | Avoids dangerous behavior |
| Honesty | Avoids fabrication |
| Calibration | Expresses uncertainty appropriately |
| Conciseness | Avoids unnecessary verbosity |
| Robustness | Resists jailbreaks and manipulation |
These objectives may conflict.
For example:
| Tradeoff | Example |
|---|---|
| Helpfulness vs safety | Medical guidance |
| Conciseness vs completeness | Technical explanations |
| Honesty vs confidence | Uncertain answers |
| Harmlessness vs utility | Dual-use scientific topics |
RLHF systems therefore optimize approximate mixtures of objectives rather than a single universal reward.
Constitutional and AI Feedback Methods
Human feedback is expensive and difficult to scale.
Modern systems increasingly use AI-generated feedback.
A stronger model may:
| Role | Example |
|---|---|
| Critic | Identify factual errors |
| Judge | Rank candidate outputs |
| Safety evaluator | Detect policy violations |
| Rewriter | Improve weak responses |
| Preference annotator | Generate synthetic rankings |
Constitutional AI approaches define principles or rules that guide critique and revision.
Example principles:
| Principle | Purpose |
|---|---|
| Avoid harmful advice | Safety |
| Admit uncertainty | Honesty |
| Respect privacy | Security |
| Avoid discrimination | Fairness |
The model critiques its own outputs relative to the constitution, then revises them.
This reduces reliance on large human labeling teams.
RLHF and Reasoning
RLHF strongly shapes reasoning behavior.
During pretraining, the model learns statistical reasoning patterns implicitly. RLHF changes which reasoning traces are rewarded.
If detailed reasoning receives high reward, the model may produce more chain-of-thought style outputs. If concise answers receive higher reward, the model may shorten explanations.
This can improve usability but also distort behavior.
For example:
| RLHF effect | Possible issue |
|---|---|
| More confident tone | False certainty |
| More coherent reasoning | Persuasive hallucinations |
| Longer explanations | Rewarding verbosity |
| Refusal optimization | Over-refusal |
The model may learn how reasoning should look rather than how to reason correctly internally.
This distinction between external reasoning traces and internal computation remains an active research topic.
RLHF and Tool Use
RLHF often trains models to use tools correctly.
Examples include:
| Tool type | Example |
|---|---|
| Search | Web retrieval |
| Code execution | Python interpreters |
| APIs | Weather or finance services |
| Databases | Structured queries |
| Agents | Multi-step planning systems |
The reward process encourages behaviors such as:
| Desired behavior | Example |
|---|---|
| Calling tools when uncertain | Retrieval before answering |
| Using valid arguments | Correct API schemas |
| Interpreting outputs correctly | Reading tool results |
| Avoiding hallucination | Prefer retrieved evidence |
Tool-augmented alignment is increasingly important because modern assistants are not purely text generators.
Human Preference Biases
Preference labels are influenced by human psychology.
Annotators may prefer:
| Bias | Example |
|---|---|
| Fluent text | Even if inaccurate |
| Confident tone | Even when wrong |
| Longer answers | Perceived depth |
| Agreeable behavior | Sycophancy |
| Familiar styles | Cultural bias |
| Safe responses | Even when overcautious |
These biases become encoded into the reward model.
As a result, RLHF can amplify social and stylistic biases present in the annotation process.
Alignment therefore depends not only on optimization algorithms, but also on who provides feedback and how that feedback is collected.
PyTorch View of Preference Training
Suppose we have:
| Tensor | Meaning |
|---|---|
| `chosen_logps` | Log probabilities for preferred responses |
| `rejected_logps` | Log probabilities for rejected responses |
A simplified DPO-style loss may look like:

```python
import torch
import torch.nn.functional as F

# chosen_logps and rejected_logps: summed log-probabilities of each response
# under the current policy, one scalar per comparison (shape: (batch,)).
beta = 0.1
logits = beta * (chosen_logps - rejected_logps)
loss = -F.logsigmoid(logits).mean()
```

The model is encouraged to increase probability for preferred outputs relative to rejected outputs.
Unlike ordinary supervised learning, the target is not a single fixed sequence. The target is a preference ordering.
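The snippet above omits the reference policy for brevity. A sketch of the full DPO loss, assuming precomputed log-probabilities from a frozen reference model (`ref_chosen_logps` and `ref_rejected_logps` are assumed tensor names), looks like:

```python
import torch
import torch.nn.functional as F

def dpo_loss(chosen_logps: torch.Tensor,
             rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss on log-probability ratios relative to a frozen reference policy."""
    chosen_ratio = chosen_logps - ref_chosen_logps        # log pi_theta / pi_ref for y_w
    rejected_ratio = rejected_logps - ref_rejected_logps  # log pi_theta / pi_ref for y_l
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Illustrative call with random stand-in values for one batch of comparisons.
b = 4
loss = dpo_loss(torch.randn(b, requires_grad=True), torch.randn(b),
                torch.randn(b), torch.randn(b))
```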
Limits of RLHF
RLHF is powerful, but it has major limitations.
First, reward models are imperfect proxies for human values.
Second, preference optimization may hide rather than solve dangerous behaviors.
Third, RLHF can reduce diversity and originality by pushing models toward highly rewarded styles.
Fourth, preference data is expensive and culturally dependent.
Fifth, RLHF does not guarantee truthfulness. A model may become more persuasive without becoming more accurate.
Finally, RLHF scales poorly if every new capability requires extensive human oversight.
These limitations motivate research into scalable oversight, mechanistic interpretability, constitutional methods, debate systems, verifier models, and automated alignment techniques.
Why RLHF Changed Modern Language Models
Pretrained models can generate fluent text. Instruction-tuned models can follow tasks. RLHF made models substantially more interactive, cooperative, and conversational.
It improved:
| Capability | Effect |
|---|---|
| Dialogue quality | More natural interaction |
| Helpfulness | Better task completion |
| Safety behavior | Reduced harmful outputs |
| Refusal behavior | Better policy compliance |
| Tone control | More socially acceptable responses |
| Multi-turn consistency | Improved conversation flow |
Many modern assistants rely heavily on preference optimization.
Without RLHF-style alignment, large language models often behave unpredictably in interactive settings.
Summary
Reinforcement learning from human feedback aligns language models with human preferences using preference comparisons and reward optimization.
The standard RLHF pipeline includes:
- Pretraining
- Instruction tuning
- Preference data collection
- Reward model training
- Policy optimization
Reward models estimate human preferences statistically, and policy optimization adjusts the model to maximize those rewards while remaining close to the supervised policy.
Modern systems increasingly use preference optimization methods such as DPO rather than classical PPO-based reinforcement learning.
RLHF improves usability, safety, and dialogue quality, but it introduces challenges such as reward hacking, sycophancy, over-optimization, and dependence on imperfect human feedback.