
Reinforcement Learning Overview

Reinforcement learning studies how an agent learns to act through interaction with an environment. Unlike supervised learning, the agent does not receive a correct action for every situation. Instead, it receives feedback through rewards.

The basic interaction is:

\text{agent} \longrightarrow \text{action} \longrightarrow \text{environment} \longrightarrow \text{reward and next state}.

The agent’s goal is to choose actions that maximize total reward over time.

Agents, Environments, and Rewards

A reinforcement learning problem contains three main objects.

| Object      | Meaning                            |
|-------------|------------------------------------|
| Agent       | The learner or decision maker      |
| Environment | The world the agent interacts with |
| Reward      | A scalar feedback signal           |

At time step t, the agent observes a state:

s_t.

It chooses an action:

a_t.

The environment returns a reward:

r_t,

and moves to a new state:

s_{t+1}.

This process repeats over many steps.
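
To make this concrete, here is a minimal sketch of the interaction loop with a made-up environment; the ToyEnv class and the random action choice are purely illustrative, not part of any standard library.

import random

class ToyEnv:
    def reset(self):
        self.position = 0
        return self.position                       # initial state s_0

    def step(self, action):
        self.position += 1 if action == 1 else -1  # the action changes the state
        reward = 1.0 if self.position >= 3 else 0.0
        done = self.position >= 3 or self.position <= -3
        return self.position, reward, done         # (s_{t+1}, r_t, done)

env = ToyEnv()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])                 # placeholder policy
    state, reward, done = env.step(action)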

Examples:

| Domain         | State                | Action        | Reward                  |
|----------------|----------------------|---------------|-------------------------|
| Game playing   | Board or screen      | Move          | Win, score, progress    |
| Robotics       | Sensor readings      | Motor command | Task success, stability |
| Recommendation | User context         | Item to show  | Click, purchase         |
| Dialogue       | Conversation history | Response      | Human preference        |
| Control        | System state         | Control input | Efficiency, safety      |

The reward is the only direct training signal. It defines what the agent is trying to optimize.

Policies

A policy defines how the agent selects actions.

A deterministic policy maps each state to one action:

a = \pi(s).

A stochastic policy defines a probability distribution over actions:

\pi(a \mid s).

For example, in a game, a policy may assign probabilities to all legal moves. In a language model, a policy may assign probabilities to possible next tokens.

In deep reinforcement learning, the policy is often represented by a neural network:

\pi_\theta(a \mid s),

where \theta denotes model parameters.

The policy is the agent’s behavior. Training changes the policy so that better actions become more likely.
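
As an illustration, a stochastic policy \pi_\theta(a \mid s) over a discrete action set can be represented in PyTorch roughly as follows; the layer sizes and the two-layer architecture are arbitrary choices for this sketch.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Represents pi_theta(a | s) as a small feed-forward network."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        # A categorical distribution over actions defines the stochastic policy.
        return torch.distributions.Categorical(logits=logits)

policy = PolicyNetwork(state_dim=4, num_actions=2)
dist = policy(torch.zeros(1, 4))   # distribution over actions for one state
action = dist.sample()             # sample an action from pi_theta(a | s)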

Returns

A single reward may not measure the quality of an action. Many actions are useful only because they lead to later rewards.

The return is the total discounted reward from time t:

G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots.

Here \gamma \in [0,1] is the discount factor.

If \gamma is close to 0, the agent focuses on immediate reward. If \gamma is close to 1, the agent cares more about long-term outcomes.

The goal is to maximize expected return:

J(\theta) = \mathbb{E}_{\pi_\theta}[G_t].
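
A small helper that turns a sequence of rewards into discounted returns makes the definition of G_t concrete; this is a generic sketch, not code from a particular library.

def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With gamma = 0.9, a reward of 1 received two steps later is worth 0.81 now.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]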

Value Functions

A value function estimates how good a state or action is.

The state-value function is

V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s].

It measures the expected return when starting from state s and following policy \pi.

The action-value function is

Q^\pi(s,a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a].

It measures the expected return after taking action a in state s, then following policy \pi.

Value functions help the agent evaluate choices. If the agent knows Q(s,a), it can choose actions with high value.
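
One simple way to estimate V^\pi is Monte Carlo averaging of observed returns. The sketch below assumes trajectories are stored as lists of (state, reward) pairs collected while following \pi; the function name and data layout are illustrative.

from collections import defaultdict

def monte_carlo_value_estimate(episodes, gamma):
    """Estimate V^pi(s) by averaging discounted returns observed from each state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trajectory in episodes:
        running = 0.0
        # Accumulate discounted returns backwards through the trajectory.
        for state, reward in reversed(trajectory):
            running = reward + gamma * running
            totals[state] += running
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}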

Markov Decision Processes

Many reinforcement learning problems are modeled as Markov decision processes, or MDPs.

An MDP contains:

| Component        | Meaning                |
|------------------|------------------------|
| \mathcal{S}      | State space            |
| \mathcal{A}      | Action space           |
| P(s' \mid s, a)  | Transition probability |
| R(s, a)          | Reward function        |
| \gamma           | Discount factor        |

The Markov property means that the future depends on the current state and action, not on the full past history:

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t).

This assumption makes the problem mathematically tractable. In practice, many real systems only approximately satisfy it.
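
A very small MDP can be written out explicitly. The two-state example below is invented purely for illustration.

# States: "cold", "hot". Actions: "wait", "heat".
# transitions[state][action] is a list of (probability, next_state) pairs.
transitions = {
    "cold": {"wait": [(1.0, "cold")], "heat": [(0.9, "hot"), (0.1, "cold")]},
    "hot":  {"wait": [(0.8, "hot"), (0.2, "cold")], "heat": [(1.0, "hot")]},
}

# rewards[state][action] is the expected immediate reward R(s, a).
rewards = {
    "cold": {"wait": 0.0, "heat": -1.0},
    "hot":  {"wait": 1.0, "heat": -1.0},
}

gamma = 0.95  # discount factor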

Exploration and Exploitation

Reinforcement learning must balance exploration and exploitation.

Exploration means trying actions to learn about their consequences. Exploitation means choosing actions that currently seem best.

If the agent explores too little, it may never discover better strategies. If it explores too much, it may waste time on poor actions.

A simple exploration strategy is epsilon-greedy action selection. With probability \epsilon, the agent chooses a random action. With probability 1 - \epsilon, it chooses the best known action.

import random

def epsilon_greedy(q_values, epsilon):
    # q_values: 1-D tensor of estimated action values for the current state.
    # With probability epsilon, explore by picking a random action index.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest estimated value.
    return q_values.argmax().item()
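
For example, given a tensor of action values for one state (the numbers here are made up):

import torch

q_values = torch.tensor([0.1, 0.5, -0.2])        # hypothetical Q(s, a) estimates
action = epsilon_greedy(q_values, epsilon=0.1)   # usually index 1, occasionally random

In practice, epsilon is often decayed over training so that the agent explores heavily at the start and exploits more as its value estimates improve.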

Exploration is one reason reinforcement learning is harder than supervised learning. The agent’s data distribution depends on its own behavior.

Model-Free and Model-Based Learning

Reinforcement learning methods are often divided into model-free and model-based methods.

| Method family  | Core idea                                      |
|----------------|------------------------------------------------|
| Model-free RL  | Learn policy or value directly from experience |
| Model-based RL | Learn or use a model of environment dynamics   |

A model-free method may learn Q(s,a) or \pi(a \mid s) without explicitly predicting the next state.

A model-based method estimates transition dynamics:

\hat{P}(s' \mid s, a),

and possibly rewards:

\hat{R}(s, a).

The agent can then plan by simulating possible futures.

Model-based methods can be more sample efficient, but they suffer when the learned model is inaccurate.
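
One simple model-based ingredient is a count-based estimate of the transition probabilities from experience. The sketch below assumes discrete states and actions and is not tied to any specific algorithm.

from collections import defaultdict

def estimate_transitions(observed):
    """Estimate P_hat(s' | s, a) from a list of (state, action, next_state) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in observed:
        counts[(s, a)][s_next] += 1
    model = {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalize counts into a probability distribution over next states.
        model[(s, a)] = {s_next: n / total for s_next, n in next_counts.items()}
    return model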

Policy-Based and Value-Based Methods

Another classification is policy-based versus value-based methods.

Value-based methods learn a value function and choose actions from it. Q-learning is the classic example.

Policy-based methods directly optimize the policy parameters. Policy gradient methods belong to this family.

Actor-critic methods combine both ideas. The actor learns the policy. The critic learns a value function that guides the actor.

| Method          | Learns                    |
|-----------------|---------------------------|
| Q-learning      | Action-value function     |
| Policy gradient | Policy                    |
| Actor-critic    | Policy and value function |

Deep reinforcement learning uses neural networks to represent these functions.
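
As a concrete instance of a value-based method, the tabular Q-learning update can be sketched in a few lines; alpha is the learning rate, and the dictionary-based table is an illustrative choice rather than part of the algorithm itself.

from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # The target uses the best action value in the next state (off-policy update).
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])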

Reinforcement Learning and Deep Learning

Deep learning becomes useful when states, actions, or policies are too complex for tables.

For example, in an Atari game, the state may be raw pixels:

s_t \in \mathbb{R}^{C \times H \times W}.

A neural network can map the image to action values:

Q_\theta(s_t, a).
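
As a sketch, a convolutional network in the style of the original DQN architecture maps a stack of frames to one value per action; the specific layer sizes below follow that common design but are not required.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an image state (C x H x W) to Q_theta(s, a) for every action."""
    def __init__(self, in_channels, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.LazyLinear(512), nn.ReLU(),   # infers the flattened feature size
            nn.Linear(512, num_actions),
        )

    def forward(self, state):
        return self.head(self.features(state))

q_net = QNetwork(in_channels=4, num_actions=6)
q_values = q_net(torch.zeros(1, 4, 84, 84))  # one value per action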

In robotics, a model may process camera images, joint positions, force readings, and task goals.

In language modeling, a policy may be a transformer that selects the next token. Preference-based training can treat generated text as actions and human preference as reward.

Deep RL therefore combines representation learning with sequential decision making.

Reinforcement Learning in PyTorch

A reinforcement learning loop differs from standard supervised training because data is generated by interaction.

A simplified loop is:

import torch

for episode in range(num_episodes):
    state = env.reset()

    done = False
    while not done:
        # Convert the observation into a batched float tensor for the network.
        state_tensor = torch.tensor(state).float().unsqueeze(0)

        # The policy (e.g. epsilon-greedy over Q-values) picks an action.
        action = policy.select_action(state_tensor)

        # The environment returns the consequences of that action.
        next_state, reward, done, info = env.step(action)

        # Store the transition for later training.
        replay_buffer.add(state, action, reward, next_state, done)

        state = next_state

        # Once enough transitions are stored, update the network parameters.
        if replay_buffer.ready():
            batch = replay_buffer.sample()

            loss = compute_rl_loss(batch)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

The training data comes from the agent’s own behavior. This makes RL data non-stationary: as the policy changes, the data distribution changes too.
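
The replay_buffer in the loop above is left abstract. A minimal version can be built on a deque, as in this sketch; the add, ready, and sample method names simply mirror the loop and are not from a specific library.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000, batch_size=32):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def ready(self):
        return len(self.buffer) >= self.batch_size

    def sample(self):
        # Uniform random sampling breaks up the correlation between
        # consecutive transitions from the same trajectory.
        return random.sample(self.buffer, self.batch_size)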

Why Reinforcement Learning Is Difficult

Reinforcement learning is difficult for several reasons.

The reward may be sparse. The agent may need thousands of steps before receiving useful feedback.

Credit assignment is hard. A reward at the end of an episode may depend on many earlier actions.

Data is correlated because nearby transitions come from the same trajectory.

Training can be unstable because the policy affects the data used to train the policy.

Exploration can be unsafe or expensive in real-world systems.

For these reasons, reinforcement learning often requires careful algorithm design, good simulation environments, and strong evaluation protocols.

Summary

Reinforcement learning studies agents that learn from interaction. The agent observes states, takes actions, receives rewards, and tries to maximize long-term return.

The main concepts are policies, rewards, returns, value functions, Markov decision processes, exploration, and exploitation.

Deep reinforcement learning uses neural networks to represent policies, value functions, or environment models. It is important for games, robotics, control, recommendation systems, and preference-based training of language models.