Reinforcement learning studies how an agent learns to act through interaction with an environment. Unlike supervised learning, the agent does not receive a correct action for every situation. Instead, it receives feedback through rewards.
The basic interaction is a loop: the agent observes a state, selects an action, and the environment responds with a reward and a next state.
The agent’s goal is to choose actions that maximize total reward over time.
Agents, Environments, and Rewards
A reinforcement learning problem contains three main objects.
| Object | Meaning |
|---|---|
| Agent | The learner or decision maker |
| Environment | The world the agent interacts with |
| Reward | A scalar feedback signal |
At time step $t$, the agent observes a state $s_t$. It chooses an action $a_t$. The environment returns a reward $r_{t+1}$ and moves to a new state $s_{t+1}$.
This process repeats over many steps.
Examples:
| Domain | State | Action | Reward |
|---|---|---|---|
| Game playing | Board or screen | Move | Win, score, progress |
| Robotics | Sensor readings | Motor command | Task success, stability |
| Recommendation | User context | Item to show | Click, purchase |
| Dialogue | Conversation history | Response | Human preference |
| Control | System state | Control input | Efficiency, safety |
The reward is the only direct training signal. It defines what the agent is trying to optimize.
Policies
A policy defines how the agent selects actions.
A deterministic policy maps each state to one action: $a = \pi(s)$.

A stochastic policy defines a probability distribution over actions: $\pi(a \mid s)$.
For example, in a game, a policy may assign probabilities to all legal moves. In a language model, a policy may assign probabilities to possible next tokens.
In deep reinforcement learning, the policy is often represented by a neural network $\pi_\theta(a \mid s)$, where $\theta$ denotes model parameters.
The policy is the agent’s behavior. Training changes the policy so that better actions become more likely.
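As a concrete sketch, a stochastic policy over a small discrete action space can be a feed-forward network ending in a softmax. The state dimension, action count, and hidden size here are arbitrary placeholders, not values from any particular environment:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """A stochastic policy pi_theta(a | s) over a discrete action space."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns raw scores into a probability distribution over actions.
        return torch.softmax(self.net(state), dim=-1)

policy = PolicyNetwork(state_dim=4, num_actions=2)
probs = policy(torch.zeros(1, 4))            # shape (1, 2), rows sum to 1
action = torch.multinomial(probs, num_samples=1)  # sample an action index
```

Sampling from the output distribution, rather than always taking the argmax, is what makes the policy stochastic.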
Returns
A single reward may not measure the quality of an action. Many actions are useful only because they lead to later rewards.
The return is the total discounted reward from time $t$:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

Here $\gamma \in [0, 1]$ is the discount factor.

If $\gamma$ is close to 0, the agent focuses on immediate reward. If $\gamma$ is close to 1, the agent cares more about long-term outcomes.

The goal is to maximize the expected return $J(\pi) = \mathbb{E}_\pi[G_0]$.
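The definition of the return translates directly into code. A minimal sketch computing $G_0$ for a finite episode, given its list of rewards:

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite episode."""
    g = 0.0
    # Accumulate backwards so each reward is discounted once per step of delay.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```

The backward accumulation avoids recomputing powers of $\gamma$ and is the standard way episode returns are computed in practice.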
Value Functions
A value function estimates how good a state or action is.
The state-value function is

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$$

It measures the expected return when starting from state $s$ and following policy $\pi$.

The action-value function is

$$Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$$

It measures the expected return after taking action $a$ in state $s$, then following policy $\pi$.

Value functions help the agent evaluate choices. If the agent knows $Q^\pi$, it can choose actions with high value.
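One way to make the definition concrete: $V^\pi(s)$ is an expectation, so it can be estimated by averaging returns over sampled episodes (a Monte Carlo estimate). Here `sample_episode` is a hypothetical function that runs the policy from a given state and returns the observed reward sequence:

```python
def mc_state_value(sample_episode, state, gamma, num_episodes=1000):
    """Monte Carlo estimate of V(s): the average discounted return
    over episodes that start in `state` and follow the policy."""
    total = 0.0
    for _ in range(num_episodes):
        rewards = sample_episode(state)  # rewards observed while following the policy
        g, discount = 0.0, 1.0
        for r in rewards:
            g += discount * r
            discount *= gamma
        total += g
    return total / num_episodes
```

With enough episodes the average converges to $V^\pi(s)$; practical algorithms replace this brute-force averaging with learned function approximators.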
Markov Decision Processes
Many reinforcement learning problems are modeled as Markov decision processes, or MDPs.
An MDP contains:
| Component | Meaning |
|---|---|
| State space | $\mathcal{S}$ |
| Action space | $\mathcal{A}$ |
| Transition probability | $P(s' \mid s, a)$ |
| Reward function | $R(s, a)$ |
| Discount factor | $\gamma$ |
The Markov property means that the future depends only on the current state and action, not on the full past history:

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)$$
This assumption makes the problem mathematically tractable. In practice, many real systems only approximately satisfy it.
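A small tabular MDP can be written down explicitly. The following toy sketch has two states, one action, and hand-picked probabilities (all values here are illustrative, not from any real task); sampling a transition draws $s'$ and $r$ from $P(s', r \mid s, a)$:

```python
import random

# Toy MDP: P[state][action] -> list of (next_state, probability, reward).
# The distribution depends only on (state, action): the Markov property.
P = {
    "s0": {"a": [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)]},
    "s1": {"a": [("s1", 1.0, 0.0)]},
}

def step(state, action, rng=random.random):
    """Sample (next_state, reward) from the transition distribution."""
    u, cumulative = rng(), 0.0
    for next_state, prob, reward in P[state][action]:
        cumulative += prob
        if u <= cumulative:
            break  # this outcome was sampled
    return next_state, reward
```

Injecting `rng` makes the sampling deterministic for testing; real environments hide this table and expose only the sampled transitions.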
Exploration and Exploitation
Reinforcement learning must balance exploration and exploitation.
Exploration means trying actions to learn about their consequences. Exploitation means choosing actions that currently seem best.
If the agent explores too little, it may never discover better strategies. If it explores too much, it may waste time on poor actions.
A simple exploration strategy is epsilon-greedy action selection. With probability $\epsilon$, the agent chooses a random action. With probability $1 - \epsilon$, it chooses the best known action.
```python
import random

import torch

def epsilon_greedy(q_values: torch.Tensor, epsilon: float) -> int:
    # Explore: with probability epsilon, pick a uniformly random action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Exploit: otherwise pick the action with the highest estimated value.
    return q_values.argmax().item()
```

Exploration is one reason reinforcement learning is harder than supervised learning. The agent’s data distribution depends on its own behavior.
Model-Free and Model-Based Learning
Reinforcement learning methods are often divided into model-free and model-based methods.
| Method family | Core idea |
|---|---|
| Model-free RL | Learn policy or value directly from experience |
| Model-based RL | Learn or use a model of environment dynamics |
A model-free method may learn $\pi(a \mid s)$ or $Q(s, a)$ without explicitly predicting the next state.
A model-based method estimates transition dynamics $\hat{P}(s' \mid s, a)$ and possibly rewards $\hat{R}(s, a)$.
The agent can then plan by simulating possible futures.
Model-based methods can be more sample efficient, but they suffer when the learned model is inaccurate.
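As a minimal illustration of planning with a model, a one-step lookahead picks the action whose predicted next state looks best. Here `model`, `reward_model`, and `value` are hypothetical learned functions assumed to exist:

```python
def plan_one_step(model, reward_model, value, state, actions, gamma):
    """Choose the action maximizing predicted reward plus the
    discounted value of the predicted next state."""
    def score(action):
        next_state = model(state, action)  # model's prediction of s'
        return reward_model(state, action) + gamma * value(next_state)
    return max(actions, key=score)
```

With an accurate model this lookahead is cheap; with an inaccurate one, the planner optimizes against the model's errors, which is exactly the failure mode of model-based methods.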
Policy-Based and Value-Based Methods
Another classification is policy-based versus value-based methods.
Value-based methods learn a value function and choose actions from it. Q-learning is the classic example.
Policy-based methods directly optimize the policy parameters. Policy gradient methods belong to this family.
Actor-critic methods combine both ideas. The actor learns the policy. The critic learns a value function that guides the actor.
| Method | Learns |
|---|---|
| Q-learning | Action-value function |
| Policy gradient | Policy |
| Actor-critic | Policy and value function |
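For instance, the core of tabular Q-learning is a single update toward a bootstrapped target. A sketch with `Q` as a nested dict of estimates (the learning rate `alpha` is a tuning choice, not fixed by the method):

```python
def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    """One Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q[next_state].values(), default=0.0)  # max over next actions
    td_target = reward + gamma * best_next
    Q[state][action] += alpha * (td_target - Q[state][action])
```

The update nudges the estimate toward the observed reward plus the discounted value of the best next action, without ever representing the policy explicitly.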
Deep reinforcement learning uses neural networks to represent these functions.
Reinforcement Learning and Deep Learning
Deep learning becomes useful when states, actions, or policies are too complex for tables.
For example, in an Atari game, the state may be raw pixels: a tensor of recent screen frames. A neural network can map the image to action values $Q_\theta(s, a)$, one per possible move.
In robotics, a model may process camera images, joint positions, force readings, and task goals.
In language modeling, a policy may be a transformer that selects the next token. Preference-based training can treat generated text as actions and human preference as reward.
Deep RL therefore combines representation learning with sequential decision making.
Reinforcement Learning in PyTorch
A reinforcement learning loop differs from standard supervised training because data is generated by interaction.
A simplified loop is:
```python
for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        # Convert the environment state into a batched float tensor.
        state_tensor = torch.tensor(state).float().unsqueeze(0)
        action = policy.select_action(state_tensor)
        next_state, reward, done, info = env.step(action)
        # Store the transition for later training.
        replay_buffer.add(state, action, reward, next_state, done)
        state = next_state
        if replay_buffer.ready():
            batch = replay_buffer.sample()
            loss = compute_rl_loss(batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The training data comes from the agent’s own behavior. This makes RL data non-stationary: as the policy changes, the data distribution changes too.
Why Reinforcement Learning Is Difficult
Reinforcement learning is difficult for several reasons.
The reward may be sparse. The agent may need thousands of steps before receiving useful feedback.
Credit assignment is hard. A reward at the end of an episode may depend on many earlier actions.
Data is correlated because nearby transitions come from the same trajectory.
Training can be unstable because the policy affects the data used to train the policy.
Exploration can be unsafe or expensive in real-world systems.
For these reasons, reinforcement learning often requires careful algorithm design, good simulation environments, and strong evaluation protocols.
Summary
Reinforcement learning studies agents that learn from interaction. The agent observes states, takes actions, receives rewards, and tries to maximize long-term return.
The main concepts are policies, rewards, returns, value functions, Markov decision processes, exploration, and exploitation.
Deep reinforcement learning uses neural networks to represent policies, value functions, or environment models. It is important for games, robotics, control, recommendation systems, and preference-based training of language models.