Reinforcement Learning from Human Feedback

Instruction tuning teaches a model to imitate demonstrations. Reinforcement learning from human feedback, usually abbreviated RLHF, goes further. Instead of only copying target responses, the model learns to optimize behavior according to human preferences.

The central idea is that many desirable properties of language model behavior are difficult to specify with simple supervised labels. For example:

| Desired property | Why it is difficult |
| --- | --- |
| Helpfulness | Depends on context and user intent |
| Harmlessness | Requires judgment and safety tradeoffs |
| Honesty | Requires uncertainty awareness |
| Conciseness | Depends on task and audience |
| Tone | Depends on conversational context |
| Reasoning quality | Often subjective |

Rather than writing explicit rules for every situation, RLHF learns a reward signal from preference comparisons made by humans or by a stronger supervisory model.

The system is then optimized to maximize this learned reward.

The RLHF Pipeline

A standard RLHF pipeline has three stages:

| Stage | Goal |
| --- | --- |
| Pretraining | Learn language and world knowledge |
| Supervised fine-tuning | Learn instruction-following behavior |
| Reinforcement learning | Optimize responses using preference rewards |

The reinforcement learning stage usually begins with an instruction-tuned model.

The overall process looks like:

  1. Pretraining
  2. Instruction tuning
  3. Preference collection
  4. Reward model training
  5. Policy optimization

The final policy is the aligned assistant model.

Preference Data

The core training signal in RLHF is preference data.

Human annotators compare candidate model responses and indicate which one is preferred.

Example:

| Prompt | Response A | Response B | Preferred |
| --- | --- | --- | --- |
| “Explain recursion.” | Clear explanation | Confusing answer | A |
| “How do I build malware?” | Refusal | Harmful instructions | Refusal |
| “Summarize this article.” | Accurate summary | Hallucinated summary | Accurate summary |

Preference data does not require annotators to write perfect answers from scratch. Ranking alternatives is often faster and more consistent than free-form generation.

The comparisons define a partial ordering over responses.
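
Concretely, a single comparison can be stored as a simple record, as in the sketch below; the field names are illustrative rather than a standard schema.

```python
# One preference comparison as a plain record; "chosen" and
# "rejected" are illustrative field names, not a fixed standard.
preference_example = {
    "prompt": "Explain recursion.",
    "chosen": "Recursion is when a function calls itself on a smaller subproblem...",
    "rejected": "Recursion is a loop that never ends.",
}

# A dataset is a list of such records; each one contributes a single
# pairwise constraint to the partial ordering over responses.
dataset = [preference_example]
```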

Reward Models

The preference comparisons are used to train a reward model.

The reward model receives:

| Input | Description |
| --- | --- |
| Prompt | User instruction or conversation |
| Candidate response | Model-generated answer |

The reward model outputs a scalar score:

$$ r_\phi(x, y), $$

where:

| Symbol | Meaning |
| --- | --- |
| $x$ | Prompt |
| $y$ | Response |
| $\phi$ | Reward model parameters |

Higher scores indicate preferred responses.
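
Architecturally, one common choice (an assumption here, since the text does not fix a design) is a small linear head on top of the transformer's final hidden state. A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Maps a backbone's final hidden states to a scalar reward."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); the last
        # token summarizes the full (prompt, response) pair.
        last_token = hidden_states[:, -1, :]
        return self.score(last_token).squeeze(-1)  # (batch,)

# Random activations stand in for a real transformer backbone.
head = RewardHead(hidden_size=768)
print(head(torch.randn(2, 16, 768)).shape)  # torch.Size([2])
```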

The reward model is trained from pairwise comparisons. Suppose humans prefer response $y_w$ over response $y_l$. The reward model should assign:

$$ r_\phi(x, y_w) > r_\phi(x, y_l). $$

A common training objective is the Bradley-Terry preference model:

$$ P(y_w \succ y_l) = \frac{ \exp(r_\phi(x, y_w)) }{ \exp(r_\phi(x, y_w)) + \exp(r_\phi(x, y_l)) }. $$

The loss is:

$$ \mathcal{L} = -\log P(y_w \succ y_l). $$

The reward model therefore learns to approximate human preferences statistically.
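
In code, this loss reduces to a log-sigmoid of the reward margin, because the Bradley-Terry probability equals $\sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$. A minimal sketch with random scores standing in for real reward-model outputs:

```python
import torch
import torch.nn.functional as F

# Scalar rewards for preferred (y_w) and rejected (y_l) responses;
# random values stand in for real reward-model outputs.
reward_chosen = torch.randn(8)
reward_rejected = torch.randn(8)

# -log P(y_w > y_l) = -log sigmoid(r(x, y_w) - r(x, y_l))
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
```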

Policy Optimization

Once the reward model is trained, the language model is optimized to maximize reward.

The language model becomes a policy:

$$ \pi_\theta(y \mid x), $$

which generates responses $y$ conditioned on prompts $x$.

The objective is approximately:

$$ \max_\theta \, \mathbb{E}_{y \sim \pi_\theta} \left[ r_\phi(x, y) \right]. $$

However, directly maximizing reward is dangerous. The model may exploit weaknesses in the reward model and generate unnatural or degenerate text.

To stabilize training, RLHF usually constrains the policy to remain close to the supervised fine-tuned model.

KL-Regularized Objectives

A common RLHF objective includes a KL-divergence penalty:

$$ \max_\theta \, \mathbb{E}_{y \sim \pi_\theta} \left[ r_\phi(x, y) - \beta \, D_{\mathrm{KL}}\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \right]. $$

Here:

| Symbol | Meaning |
| --- | --- |
| $\pi_\theta$ | Current policy |
| $\pi_{\mathrm{ref}}$ | Reference policy |
| $r_\phi$ | Reward model |
| $\beta$ | KL penalty coefficient |

The KL penalty discourages the model from drifting too far from the original instruction-tuned distribution.

Without this constraint, the model may maximize reward through pathological outputs rather than genuinely useful behavior.
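
One common practical variant (an implementation assumption, not the only formulation) estimates the KL term from per-token log-probability differences against the frozen reference model and subtracts it from the reward-model score:

```python
import torch

beta = 0.05  # KL penalty coefficient; value is illustrative

# Per-token log probabilities of one sampled response under the
# current policy and the frozen reference; random stand-ins here.
policy_logps = torch.randn(1, 20)
ref_logps = torch.randn(1, 20)
reward_score = torch.tensor([0.7])  # scalar r_phi(x, y)

# Estimate KL as the summed log-ratio over the response tokens,
# then fold the penalty into the reward the optimizer sees.
kl_estimate = (policy_logps - ref_logps).sum(dim=-1)
shaped_reward = reward_score - beta * kl_estimate
```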

PPO and Policy Gradient Methods

Early RLHF systems commonly used Proximal Policy Optimization, or PPO.

PPO is a policy-gradient reinforcement learning algorithm designed to improve stability during policy updates.

The idea is simple:

  1. Generate responses.
  2. Score them with the reward model.
  3. Estimate advantages.
  4. Update the policy gradually.

The PPO objective constrains policy updates so that each optimization step remains relatively small.

A simplified form is:

$$ L^{\mathrm{PPO}} = \mathbb{E} \left[ \min\left( r_t A_t, \; \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon) \, A_t \right) \right], $$

where:

| Symbol | Meaning |
| --- | --- |
| $r_t$ | Probability ratio |
| $A_t$ | Advantage estimate |
| $\epsilon$ | Clipping parameter |

PPO reduces unstable jumps in policy behavior.
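
The clipped surrogate itself is short to express; a sketch with random ratios and advantages standing in for quantities from a real rollout:

```python
import torch

epsilon = 0.2  # clipping parameter

# r_t = pi_theta(y_t | x) / pi_old(y_t | x) and advantage estimates
# A_t; random stand-ins replace values from an actual rollout.
ratios = torch.exp(0.1 * torch.randn(32))
advantages = torch.randn(32)

# Take the more pessimistic of the unclipped and clipped terms,
# then negate so gradient descent maximizes the PPO objective.
unclipped = ratios * advantages
clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
ppo_loss = -torch.min(unclipped, clipped).mean()
```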

However, PPO training is computationally expensive and operationally complex. Many modern systems now prefer simpler alternatives.

Direct Preference Optimization

A newer approach is Direct Preference Optimization, or DPO.

DPO avoids explicit reinforcement learning. Instead of training a separate reward model and running PPO, DPO directly optimizes preference comparisons.

The key insight is that under certain assumptions, maximizing a KL-regularized reward objective can be transformed into a supervised classification objective over preferred and rejected responses.

The DPO objective encourages:

$$ \pi_\theta(y_w \mid x) > \pi_\theta(y_l \mid x), $$

while keeping the model close to a reference policy.

Advantages of DPO include:

| Advantage | Description |
| --- | --- |
| Simpler pipeline | No PPO rollout loop |
| More stable | Easier optimization |
| Lower compute cost | Fewer moving parts |
| Easier implementation | Standard supervised-style training |

Because of these advantages, many modern alignment systems use preference optimization variants instead of classical PPO-based RLHF.

Reward Hacking

A reward model is only an approximation of human judgment. If optimized aggressively, the policy may exploit weaknesses in the reward signal.

This is called reward hacking.

Examples include:

| Failure mode | Example |
| --- | --- |
| Verbosity bias | Extremely long answers because reward correlates with detail |
| Sycophancy | Agreeing with the user even when incorrect |
| Style exploitation | Polite wording masking factual errors |
| Safety over-optimization | Excessive refusal behavior |
| Repetition | Repeating patterns that the reward model likes |
| Hallucinated confidence | Fluent but false explanations |

Reward hacking is a fundamental alignment problem. Optimizing proxy rewards can produce unintended behavior.

The reward model does not define true human values. It defines a learned approximation.

Distribution Shift

The reward model is trained on a limited distribution of responses. During optimization, the policy may generate outputs outside that distribution.

This creates distribution shift.

For example:

  1. The reward model sees mostly ordinary assistant responses.
  2. The policy explores unusual outputs during optimization.
  3. The reward model produces unreliable scores on unfamiliar text.
  4. The policy exploits those errors.

This is similar to adversarial optimization in other machine learning systems.

Large policy shifts can therefore destabilize RLHF.

KL regularization, conservative optimization, rejection sampling, and human auditing are used to reduce this problem.
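
Rejection sampling, for instance, can be as simple as a best-of-$n$ filter that scores several candidates and keeps the highest-reward one. In the sketch below, `generate` and `reward_fn` are hypothetical stand-ins for a policy sampler and a trained reward model:

```python
import torch

def best_of_n(prompt, generate, reward_fn, n=4):
    """Sample n candidates and keep the one the reward model prefers."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = torch.tensor([reward_fn(prompt, c) for c in candidates])
    return candidates[int(scores.argmax())]

# Toy stand-ins so the sketch runs end to end.
fake_generate = lambda p: f"response-{torch.randint(0, 100, (1,)).item()}"
fake_reward = lambda p, c: torch.randn(()).item()
print(best_of_n("Explain recursion.", fake_generate, fake_reward))
```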

Multi-Objective Alignment

Human preferences are not one-dimensional.

A useful assistant should balance multiple goals:

| Objective | Meaning |
| --- | --- |
| Helpfulness | Solves the user’s problem |
| Harmlessness | Avoids dangerous behavior |
| Honesty | Avoids fabrication |
| Calibration | Expresses uncertainty appropriately |
| Conciseness | Avoids unnecessary verbosity |
| Robustness | Resists jailbreaks and manipulation |

These objectives may conflict.

For example:

| Tradeoff | Example |
| --- | --- |
| Helpfulness vs. safety | Medical guidance |
| Conciseness vs. completeness | Technical explanations |
| Honesty vs. confidence | Uncertain answers |
| Harmlessness vs. utility | Dual-use scientific topics |

RLHF systems therefore optimize approximate mixtures of objectives rather than a single universal reward.
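
One simple way to picture such a mixture (a sketch, not a description of any particular production system) is a weighted sum of per-objective scores:

```python
import torch

# Per-objective scores for one response; values are illustrative.
scores = {
    "helpfulness": torch.tensor(0.8),
    "harmlessness": torch.tensor(0.9),
    "honesty": torch.tensor(0.6),
}

# Fixed mixture weights; in practice these are tuned or learned.
weights = {"helpfulness": 0.5, "harmlessness": 0.3, "honesty": 0.2}

combined_reward = sum(weights[k] * scores[k] for k in scores)
```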

Constitutional and AI Feedback Methods

Human feedback is expensive and difficult to scale.

Modern systems increasingly use AI-generated feedback.

A stronger model may:

| Role | Example |
| --- | --- |
| Critic | Identify factual errors |
| Judge | Rank candidate outputs |
| Safety evaluator | Detect policy violations |
| Rewriter | Improve weak responses |
| Preference annotator | Generate synthetic rankings |

Constitutional AI approaches define principles or rules that guide critique and revision.

Example principles:

| Principle | Purpose |
| --- | --- |
| Avoid harmful advice | Safety |
| Admit uncertainty | Honesty |
| Respect privacy | Security |
| Avoid discrimination | Fairness |

The model critiques its own outputs relative to the constitution, then revises them.

This reduces reliance on large human labeling teams.
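
A minimal sketch of that critique-and-revise loop follows; the `model` argument is a hypothetical text-in, text-out callable, not a specific library API:

```python
def constitutional_revision(prompt, draft, principles, model):
    """Run one critique-and-revise pass per principle."""
    for principle in principles:
        critique = model(
            f"Prompt: {prompt}\nPrinciple: {principle}\n"
            f"Response: {draft}\nDoes the response violate the principle?"
        )
        draft = model(
            f"Principle: {principle}\nCritique: {critique}\n"
            f"Original response: {draft}\nRewrite the response to comply."
        )
    return draft

# Toy stand-in so the sketch runs; a real system calls an LLM here.
toy_model = lambda text: text[-60:]
print(constitutional_revision("Explain X.", "draft answer",
                              ["Admit uncertainty"], toy_model))
```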

RLHF and Reasoning

RLHF strongly affects reasoning behavior.

During pretraining, the model learns statistical reasoning patterns implicitly. RLHF changes which reasoning traces are rewarded.

If detailed reasoning receives high reward, the model may produce more chain-of-thought style outputs. If concise answers receive higher reward, the model may shorten explanations.

This can improve usability but also distort behavior.

For example:

| RLHF effect | Possible issue |
| --- | --- |
| More confident tone | False certainty |
| More coherent reasoning | Persuasive hallucinations |
| Longer explanations | Rewarding verbosity |
| Refusal optimization | Over-refusal |

The model may learn how reasoning should look rather than how to reason correctly internally.

This distinction between external reasoning traces and internal computation remains an active research topic.

RLHF and Tool Use

RLHF often trains models to use tools correctly.

Examples include:

| Tool type | Example |
| --- | --- |
| Search | Web retrieval |
| Code execution | Python interpreters |
| APIs | Weather or finance services |
| Databases | Structured queries |
| Agents | Multi-step planning systems |

The reward process encourages behaviors such as:

| Desired behavior | Example |
| --- | --- |
| Calling tools when uncertain | Retrieval before answering |
| Using valid arguments | Correct API schemas |
| Interpreting outputs correctly | Reading tool results |
| Avoiding hallucination | Preferring retrieved evidence |

Tool-augmented alignment is increasingly important because modern assistants are not purely text generators.

Human Preference Biases

Preference labels are influenced by human psychology.

Annotators may prefer:

| Bias | Example |
| --- | --- |
| Fluent text | Even if inaccurate |
| Confident tone | Even when wrong |
| Longer answers | Perceived depth |
| Agreeable behavior | Sycophancy |
| Familiar styles | Cultural bias |
| Safe responses | Even when overcautious |

These biases become encoded into the reward model.

As a result, RLHF can amplify social and stylistic biases present in the annotation process.

Alignment therefore depends not only on optimization algorithms, but also on who provides feedback and how that feedback is collected.

PyTorch View of Preference Training

Suppose we have:

| Tensor | Meaning |
| --- | --- |
| `chosen_logps` | Log probabilities for preferred responses |
| `rejected_logps` | Log probabilities for rejected responses |

A simplified DPO-style loss may look like:

```python
import torch
import torch.nn.functional as F

beta = 0.1  # temperature on the log-probability margin

# Random stand-ins for summed per-response log probabilities under
# the current policy (a batch of 4 preference pairs).
chosen_logps = torch.randn(4)
rejected_logps = torch.randn(4)

logits = beta * (chosen_logps - rejected_logps)
loss = -F.logsigmoid(logits).mean()  # push chosen above rejected
```

The model is encouraged to increase probability for preferred outputs relative to rejected outputs.

Unlike ordinary supervised learning, the target is not a single fixed sequence. The target is a preference ordering.
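
The full DPO loss additionally compares each response against a frozen reference model, so the implicit reward is a log-probability ratio rather than a raw log probability. A sketch under the same stand-in assumptions:

```python
import torch
import torch.nn.functional as F

beta = 0.1

# Summed per-response log probabilities under the current policy and
# the frozen reference; random stand-ins for a batch of 4 pairs.
policy_chosen, policy_rejected = torch.randn(4), torch.randn(4)
ref_chosen, ref_rejected = torch.randn(4), torch.randn(4)

# Each response's implicit reward is its log-ratio against the
# reference; the loss is Bradley-Terry on the margin of those rewards.
chosen_logratio = policy_chosen - ref_chosen
rejected_logratio = policy_rejected - ref_rejected
dpo_loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```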

Limits of RLHF

RLHF is powerful, but it has major limitations.

First, reward models are imperfect proxies for human values.

Second, preference optimization may hide rather than solve dangerous behaviors.

Third, RLHF can reduce diversity and originality by pushing models toward highly rewarded styles.

Fourth, preference data is expensive and culturally dependent.

Fifth, RLHF does not guarantee truthfulness. A model may become more persuasive without becoming more accurate.

Finally, RLHF scales poorly if every new capability requires extensive human oversight.

These limitations motivate research into scalable oversight, mechanistic interpretability, constitutional methods, debate systems, verifier models, and automated alignment techniques.

Why RLHF Changed Modern Language Models

Pretrained models can generate fluent text. Instruction-tuned models can follow tasks. RLHF made models substantially more interactive, cooperative, and conversational.

It improved:

| Capability | Effect |
| --- | --- |
| Dialogue quality | More natural interaction |
| Helpfulness | Better task completion |
| Safety behavior | Reduced harmful outputs |
| Refusal behavior | Better policy compliance |
| Tone control | More socially acceptable responses |
| Multi-turn consistency | Improved conversation flow |

Many modern assistants rely heavily on preference optimization.

Without RLHF-style alignment, large language models often behave unpredictably in interactive settings.

Summary

Reinforcement learning from human feedback aligns language models with human preferences using preference comparisons and reward optimization.

The standard RLHF pipeline includes:

  1. Pretraining
  2. Instruction tuning
  3. Preference data collection
  4. Reward model training
  5. Policy optimization

Reward models estimate human preferences statistically, and policy optimization adjusts the model to maximize those rewards while remaining close to the supervised policy.

Modern systems increasingly use preference optimization methods such as DPO rather than classical PPO-based reinforcement learning.

RLHF improves usability, safety, and dialogue quality, but it introduces challenges such as reward hacking, sycophancy, over-optimization, and dependence on imperfect human feedback.