
Reinforcement Learning Overview

Reinforcement learning studies how an agent learns to act through interaction with an environment. Unlike supervised learning, the agent does not receive a correct action for every situation. Instead, it receives feedback through rewards.

The basic interaction is:

\text{agent} \longrightarrow \text{action} \longrightarrow \text{environment} \longrightarrow \text{reward and next state}.

The agent’s goal is to choose actions that maximize total reward over time.

Agents, Environments, and Rewards

A reinforcement learning problem contains three main objects.

| Object      | Meaning                            |
|-------------|------------------------------------|
| Agent       | The learner or decision maker      |
| Environment | The world the agent interacts with |
| Reward      | A scalar feedback signal           |

At time step t, the agent observes a state:

s_t.

It chooses an action:

a_t.

The environment returns a reward:

r_t,

and moves to a new state:

s_{t+1}.

This process repeats over many steps.
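
To make this concrete, here is a minimal sketch of the interaction loop with a made-up environment; the ToyEnv class and the random action choice are purely illustrative, not part of any standard library.

import random

class ToyEnv:
    def reset(self):
        self.position = 0
        return self.position                       # initial state s_0

    def step(self, action):
        self.position += 1 if action == 1 else -1  # the action changes the state
        reward = 1.0 if self.position >= 3 else 0.0
        done = self.position >= 3 or self.position <= -3
        return self.position, reward, done         # (s_{t+1}, r_t, done)

env = ToyEnv()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])                 # placeholder policy
    state, reward, done = env.step(action)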

Examples:

| Domain         | State                | Action        | Reward                  |
|----------------|----------------------|---------------|-------------------------|
| Game playing   | Board or screen      | Move          | Win, score, progress    |
| Robotics       | Sensor readings      | Motor command | Task success, stability |
| Recommendation | User context         | Item to show  | Click, purchase         |
| Dialogue       | Conversation history | Response      | Human preference        |
| Control        | System state         | Control input | Efficiency, safety      |

The reward is the only direct training signal. It defines what the agent is trying to optimize.

Policies

A policy defines how the agent selects actions.

A deterministic policy maps each state to one action:

a = \pi(s).

A stochastic policy defines a probability distribution over actions:

\pi(a \mid s).

For example, in a game, a policy may assign probabilities to all legal moves. In a language model, a policy may assign probabilities to possible next tokens.

In deep reinforcement learning, the policy is often represented by a neural network:

\pi_\theta(a \mid s),

where \theta denotes model parameters.

The policy is the agent’s behavior. Training changes the policy so that better actions become more likely.
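
As an illustration, a stochastic policy \pi_\theta(a \mid s) over a discrete action set can be represented in PyTorch roughly as follows; the layer sizes and the two-layer architecture are arbitrary choices for this sketch.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Represents pi_theta(a | s) as a small feed-forward network."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        # A categorical distribution over actions defines the stochastic policy.
        return torch.distributions.Categorical(logits=logits)

policy = PolicyNetwork(state_dim=4, num_actions=2)
dist = policy(torch.zeros(1, 4))   # distribution over actions for one state
action = dist.sample()             # sample an action from pi_theta(a | s)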

Returns

A single reward may not measure the quality of an action. Many actions are useful only because they lead to later rewards.

The return is the total discounted reward from time t:

G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots.

Here \gamma \in [0,1] is the discount factor.

If \gamma is close to 0, the agent focuses on immediate reward. If \gamma is close to 1, the agent cares more about long-term outcomes.

The goal is to maximize expected return:

J(\theta) = \mathbb{E}_{\pi_\theta}[G_t].
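
A small helper that turns a sequence of rewards into discounted returns makes the definition of G_t concrete; this is a generic sketch, not code from a particular library.

def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With gamma = 0.9, a reward of 1 received two steps later is worth 0.81 now.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]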

Value Functions

A value function estimates how good a state or action is.

The state-value function is

V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s].

It measures the expected return when starting from state s and following policy \pi.

The action-value function is

Q^\pi(s,a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a].

It measures the expected return after taking action a in state s, then following policy \pi.

Value functions help the agent evaluate choices. If the agent knows Q(s,a), it can choose actions with high value.
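
One simple way to estimate V^\pi is Monte Carlo averaging of observed returns. The sketch below assumes trajectories are stored as lists of (state, reward) pairs collected while following \pi; the function name and data layout are illustrative.

from collections import defaultdict

def monte_carlo_value_estimate(episodes, gamma):
    """Estimate V^pi(s) by averaging discounted returns observed from each state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trajectory in episodes:
        running = 0.0
        # Accumulate discounted returns backwards through the trajectory.
        for state, reward in reversed(trajectory):
            running = reward + gamma * running
            totals[state] += running
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}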

Markov Decision Processes

Many reinforcement learning problems are modeled as Markov decision processes, or MDPs.

An MDP contains:

| Component        | Meaning                |
|------------------|------------------------|
| \mathcal{S}      | State space            |
| \mathcal{A}      | Action space           |
| P(s' \mid s, a)  | Transition probability |
| R(s, a)          | Reward function        |
| \gamma           | Discount factor        |

The Markov property means that the future depends on the current state and action, not on the full past history:

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t).

This assumption makes the problem mathematically tractable. In practice, many real systems only approximately satisfy it.
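
A very small MDP can be written out explicitly. The two-state example below is invented purely for illustration.

# States: "cold", "hot". Actions: "wait", "heat".
# transitions[state][action] is a list of (probability, next_state) pairs.
transitions = {
    "cold": {"wait": [(1.0, "cold")], "heat": [(0.9, "hot"), (0.1, "cold")]},
    "hot":  {"wait": [(0.8, "hot"), (0.2, "cold")], "heat": [(1.0, "hot")]},
}

# rewards[state][action] is the expected immediate reward R(s, a).
rewards = {
    "cold": {"wait": 0.0, "heat": -1.0},
    "hot":  {"wait": 1.0, "heat": -1.0},
}

gamma = 0.95  # discount factor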

Exploration and Exploitation

Reinforcement learning must balance exploration and exploitation.

Exploration means trying actions to learn about their consequences. Exploitation means choosing actions that currently seem best.

If the agent explores too little, it may never discover better strategies. If it explores too much, it may waste time on poor actions.

A simple exploration strategy is epsilon-greedy action selection. With probability \epsilon, the agent chooses a random action. With probability 1 - \epsilon, it chooses the best known action.

import random

def epsilon_greedy(q_values, epsilon):
    # q_values: 1-D tensor of estimated action values for the current state.
    # With probability epsilon, explore by picking a random action index.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest estimated value.
    return q_values.argmax().item()
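
For example, given a tensor of action values for one state (the numbers here are made up):

import torch

q_values = torch.tensor([0.1, 0.5, -0.2])        # hypothetical Q(s, a) estimates
action = epsilon_greedy(q_values, epsilon=0.1)   # usually index 1, occasionally random

In practice, epsilon is often decayed over training so that the agent explores heavily at the start and exploits more as its value estimates improve.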

Exploration is one reason reinforcement learning is harder than supervised learning. The agent’s data distribution depends on its own behavior.

Model-Free and Model-Based Learning

Reinforcement learning methods are often divided into model-free and model-based methods.

| Method family  | Core idea                                      |
|----------------|------------------------------------------------|
| Model-free RL  | Learn policy or value directly from experience |
| Model-based RL | Learn or use a model of environment dynamics   |

A model-free method may learn Q(s,a) or \pi(a \mid s) without explicitly predicting the next state.

A model-based method estimates transition dynamics:

\hat{P}(s' \mid s, a),

and possibly rewards:

\hat{R}(s, a).

The agent can then plan by simulating possible futures.

Model-based methods can be more sample efficient, but they suffer when the learned model is inaccurate.
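
One simple model-based ingredient is a count-based estimate of the transition probabilities from experience. The sketch below assumes discrete states and actions and is not tied to any specific algorithm.

from collections import defaultdict

def estimate_transitions(observed):
    """Estimate P_hat(s' | s, a) from a list of (state, action, next_state) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in observed:
        counts[(s, a)][s_next] += 1
    model = {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalize counts into a probability distribution over next states.
        model[(s, a)] = {s_next: n / total for s_next, n in next_counts.items()}
    return model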

Policy-Based and Value-Based Methods

Another classification is policy-based versus value-based methods.

Value-based methods learn a value function and choose actions from it. Q-learning is the classic example.

Policy-based methods directly optimize the policy parameters. Policy gradient methods belong to this family.

Actor-critic methods combine both ideas. The actor learns the policy. The critic learns a value function that guides the actor.

| Method          | Learns                    |
|-----------------|---------------------------|
| Q-learning      | Action-value function     |
| Policy gradient | Policy                    |
| Actor-critic    | Policy and value function |

Deep reinforcement learning uses neural networks to represent these functions.
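
As a concrete instance of a value-based method, the tabular Q-learning update can be sketched in a few lines; alpha is the learning rate, and the dictionary-based table is an illustrative choice rather than part of the algorithm itself.

from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # The target uses the best action value in the next state (off-policy update).
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])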

Reinforcement Learning and Deep Learning

Deep learning becomes useful when states, actions, or policies are too complex for tables.

For example, in an Atari game, the state may be raw pixels:

s_t \in \mathbb{R}^{C \times H \times W}.

A neural network can map the image to action values:

Q_\theta(s_t, a).
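
As a sketch, a convolutional network in the style of the original DQN architecture maps a stack of frames to one value per action; the specific layer sizes below follow that common design but are not required.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an image state (C x H x W) to Q_theta(s, a) for every action."""
    def __init__(self, in_channels, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.LazyLinear(512), nn.ReLU(),   # infers the flattened feature size
            nn.Linear(512, num_actions),
        )

    def forward(self, state):
        return self.head(self.features(state))

q_net = QNetwork(in_channels=4, num_actions=6)
q_values = q_net(torch.zeros(1, 4, 84, 84))  # one value per action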

In robotics, a model may process camera images, joint positions, force readings, and task goals.

In language modeling, a policy may be a transformer that selects the next token. Preference-based training can treat generated text as actions and human preference as reward.

Deep RL therefore combines representation learning with sequential decision making.

Reinforcement Learning in PyTorch

A reinforcement learning loop differs from standard supervised training because data is generated by interaction.

A simplified loop is:

import torch

for episode in range(num_episodes):
    state = env.reset()

    done = False
    while not done:
        # Convert the observation into a batched float tensor for the network.
        state_tensor = torch.tensor(state).float().unsqueeze(0)

        # The policy (e.g. epsilon-greedy over Q-values) picks an action.
        action = policy.select_action(state_tensor)

        # The environment returns the consequences of that action.
        next_state, reward, done, info = env.step(action)

        # Store the transition for later training.
        replay_buffer.add(state, action, reward, next_state, done)

        state = next_state

        # Once enough transitions are stored, update the network parameters.
        if replay_buffer.ready():
            batch = replay_buffer.sample()

            loss = compute_rl_loss(batch)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

The training data comes from the agent’s own behavior. This makes RL data non-stationary: as the policy changes, the data distribution changes too.
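
The replay_buffer in the loop above is left abstract. A minimal version can be built on a deque, as in this sketch; the add, ready, and sample method names simply mirror the loop and are not from a specific library.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000, batch_size=32):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def ready(self):
        return len(self.buffer) >= self.batch_size

    def sample(self):
        # Uniform random sampling breaks up the correlation between
        # consecutive transitions from the same trajectory.
        return random.sample(self.buffer, self.batch_size)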

Why Reinforcement Learning Is Difficult

Reinforcement learning is difficult for several reasons.

The reward may be sparse. The agent may need thousands of steps before receiving useful feedback.

Credit assignment is hard. A reward at the end of an episode may depend on many earlier actions.

Data is correlated because nearby transitions come from the same trajectory.

Training can be unstable because the policy affects the data used to train the policy.

Exploration can be unsafe or expensive in real-world systems.

For these reasons, reinforcement learning often requires careful algorithm design, good simulation environments, and strong evaluation protocols.

Summary

Reinforcement learning studies agents that learn from interaction. The agent observes states, takes actions, receives rewards, and tries to maximize long-term return.

The main concepts are policies, rewards, returns, value functions, Markov decision processes, exploration, and exploitation.

Deep reinforcement learning uses neural networks to represent policies, value functions, or environment models. It is important for games, robotics, control, recommendation systems, and preference-based training of language models.