Reinforcement Learning

Reinforcement learning studies learning systems that act in an environment. Unlike supervised learning, the training signal is not a target label for each input. The model chooses actions, the environment responds, and rewards may arrive later.

A reinforcement learning problem is usually described as a Markov decision process. At time $t$, the agent observes a state $s_t$, chooses an action $a_t$, receives a reward $r_t$, and moves to a new state $s_{t+1}$.

$$ s_t \to a_t \to r_t,\; s_{t+1} $$

The objective is to maximize expected cumulative reward:

$$ J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]. $$

Here, $\gamma$ is the discount factor. It controls how much future rewards matter relative to immediate rewards.
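For example (numbers chosen purely for illustration), with $\gamma = 0.9$ and rewards $r_0 = 1$, $r_1 = 0$, $r_2 = 2$, the discounted sum is $1 + 0.9 \cdot 0 + 0.81 \cdot 2 = 2.62$: a reward two steps away counts for only 81% of its face value.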

Automatic differentiation is useful in reinforcement learning, but its role is less direct than in supervised learning. In supervised learning, the loss is usually a differentiable program of model parameters. In reinforcement learning, actions may be sampled, environments may be nondifferentiable, and rewards may depend on long trajectories.

Policies

A policy maps states to actions. A stochastic policy defines a distribution:

$$ \pi_\theta(a \mid s). $$

The policy is usually represented by a neural network with parameters $\theta$. Given a state, the network outputs action probabilities, action logits, or the parameters of a continuous distribution.

For a discrete action space:

$$ \pi_\theta(a \mid s) = \operatorname{softmax}(f_\theta(s))_a. $$

For a continuous action space, the policy might output a mean and variance:

$$ a \sim \mathcal{N}(\mu_\theta(s), \Sigma_\theta(s)). $$
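As a concrete sketch, assuming PyTorch (the class names and layer sizes are illustrative, not from the text), the two kinds of policy head might look like:

import torch
import torch.nn as nn

# Discrete head: state -> logits -> categorical distribution over actions.
class DiscretePolicy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

# Continuous head: state -> mean; a state-independent log-std is one common choice.
class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, s):
        return torch.distributions.Normal(self.mu(s), self.log_std.exp())

Returning a distribution object keeps sampling, log probabilities, and entropy in one place, which the loss code later in this section relies on.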

AD can compute derivatives of the policy network outputs with respect to $\theta$. The harder question is how to connect those derivatives to expected reward.

The Policy Gradient

The policy gradient theorem gives a way to differentiate expected reward without differentiating through the environment transition dynamics.

A common form is:

$$ \nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t \right], $$

where $G_t$ is a return estimate from time $t$, such as:

$$ G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k. $$

This estimator uses the derivative of the log probability of the sampled action. The environment can remain a black box.

AD computes:

$$ \nabla_\theta \log \pi_\theta(a_t \mid s_t). $$

The reward-weighted estimator is then assembled by the RL algorithm.

REINFORCE

The simplest policy gradient method is often called REINFORCE. It samples trajectories, computes returns, and updates the policy in the direction that increases the log probability of actions that led to high returns.

A loss form commonly used for minimization is:

$$ L(\theta) = - \sum_t G_t \log \pi_\theta(a_t \mid s_t). $$

Then ordinary reverse mode AD computes:

$$ \nabla_\theta L(\theta). $$

The negative sign appears because optimizers usually minimize loss, while RL maximizes return.

A minimal implementation looks like:

# Sample one trajectory, then take a single REINFORCE gradient step.
trajectory = run_policy_in_environment(policy)
returns = compute_discounted_returns(trajectory.rewards)

loss = 0.0
for t, (state, action) in enumerate(zip(trajectory.states, trajectory.actions)):
    # log pi_theta(a_t | s_t) of the action actually taken
    logp = policy(state).log_prob(action)
    loss = loss - returns[t] * logp

optimizer.zero_grad()
loss.backward()    # reverse-mode AD through the log-probability computation
optimizer.step()

This is a good example of AD used inside a larger estimator. AD differentiates the log-probability computation. It does not differentiate through the sampled action as a discrete choice.
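The compute_discounted_returns helper is a single backward pass over the reward sequence. A minimal sketch in plain Python (the name matches the pseudocode above; the implementation is one reasonable choice):

def compute_discounted_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, accumulated from the final step backward.
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns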

Baselines and Advantage Functions

Policy gradient estimators can have high variance. A baseline reduces variance without changing the expected gradient.

Instead of weighting by the return $G_t$, use an advantage estimate:

$$ A_t = G_t - b(s_t). $$

The baseline $b(s_t)$ is often a value function:

$$ V_\phi(s_t) \approx \mathbb{E}[G_t \mid s_t]. $$

The policy loss becomes:

$$ L_{\text{policy}} = - \sum_t A_t \log \pi_\theta(a_t \mid s_t). $$

The value function is trained with a regression loss:

$$ L_{\text{value}} = \sum_t \left( V_\phi(s_t) - G_t \right)^2. $$

AD computes gradients for both networks. The policy gradient uses log probabilities. The value loss is ordinary supervised regression on return targets.
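A hedged sketch of the two updates together (PyTorch-style; policy, value_net, and the batch tensors states, actions, and returns are assumed to exist). The detach on the advantage keeps the actor update from pushing gradients into the critic:

values = value_net(states).squeeze(-1)       # V_phi(s_t)
advantages = (returns - values).detach()     # treated as constants in the policy loss

dist = policy(states)
policy_loss = -(advantages * dist.log_prob(actions)).sum()
value_loss = ((values - returns) ** 2).sum() # ordinary regression on return targets

(policy_loss + value_loss).backward()        # AD computes gradients for both networks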

Actor-Critic Methods

Actor-critic methods combine a policy model and a value model.

Component      Role
Actor          Chooses actions
Critic         Estimates value or advantage
Policy loss    Updates actor
Value loss     Updates critic
Entropy term   Encourages exploration

A combined loss may look like:

$$ L = L_{\text{policy}} + c_v L_{\text{value}} - c_e H(\pi_\theta), $$

where $H(\pi_\theta)$ is the policy entropy and $c_v$, $c_e$ are scalar weighting coefficients.

Entropy regularization encourages the policy to avoid collapsing too early to deterministic actions. For a discrete policy:

$$ H(\pi_\theta(\cdot \mid s)) = - \sum_a \pi_\theta(a \mid s) \log \pi_\theta(a \mid s). $$

AD differentiates this entropy term directly through the policy probabilities.
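For a categorical policy, distribution objects expose entropy as a differentiable quantity. A sketch of assembling the combined loss (policy_loss and value_loss as in the earlier sketch; c_v and c_e are hyperparameters):

dist = policy(states)             # categorical distribution per state
entropy = dist.entropy().sum()    # H(pi_theta), differentiable in theta

loss = policy_loss + c_v * value_loss - c_e * entropy
loss.backward()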

Q-Learning and Temporal Difference Losses

Value-based methods learn an action-value function:

$$ Q_\theta(s,a) \approx \mathbb{E} \left[ \sum_{k=t}^{T} \gamma^{k-t} r_k \mid s_t = s, a_t = a \right]. $$

A temporal difference target is:

$$ y = r + \gamma \max_{a'} Q_{\theta^-}(s', a'). $$

The loss is:

$$ L(\theta) = \left( Q_\theta(s,a) - y \right)^2. $$

Here, $\theta^-$ denotes the parameters of a target network. The target is usually treated as constant with respect to the current parameters $\theta$. In an AD system, this means the target must be detached from the gradient graph.

q = q_net(state)[action]            # Q_theta(s, a) for the stored action

with torch.no_grad():               # target is detached from the gradient graph
    target = reward + gamma * target_net(next_state).max()

loss = (q - target) ** 2
loss.backward()

The detach is semantically important. Without it, gradients would flow into the target computation and change the algorithm.
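The target parameters $\theta^-$ are typically refreshed outside the gradient step, either by a periodic hard copy or by a slow moving average. A sketch of both, assuming PyTorch modules q_net and target_net as above (names illustrative):

# Hard update: copy the online parameters every N steps.
target_net.load_state_dict(q_net.state_dict())

# Soft (Polyak) update: drift slowly toward the online parameters.
tau = 0.005
with torch.no_grad():
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.mul_(1 - tau).add_(tau * p)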

Differentiable Environments

Some environments are differentiable. Physics simulators, control systems, and learned world models may expose gradients through state transitions:

$$ s_{t+1} = f_\psi(s_t, a_t). $$

If actions are continuous and the transition function is differentiable, we can backpropagate through time:

$$ \frac{\partial J}{\partial \theta} = \frac{\partial J}{\partial s_T} \frac{\partial s_T}{\partial s_{T-1}} \cdots \frac{\partial a_t}{\partial \theta}. $$

This resembles training recurrent networks. The trajectory becomes a differentiable computation graph.
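A sketch of such a rollout, assuming a differentiable transition f, a differentiable reward_fn, and a deterministic differentiable policy_mean (all hypothetical names):

state = initial_state
total_reward = 0.0
for t in range(horizon):
    action = policy_mean(state)                          # differentiable in theta
    total_reward = total_reward + reward_fn(state, action)
    state = f(state, action)                             # differentiable transition

(-total_reward).backward()   # gradients flow back through every transition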

However, many real environments contain discontinuities, contacts, collisions, discrete events, unknown dynamics, or external systems. In those cases, score-function estimators, value methods, or model-free methods are used.

Reparameterized Continuous Actions

For continuous stochastic policies, a sample may be written as a differentiable transformation of parameter-free noise:

$$ a = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0,I). $$

This is the reparameterization trick. It allows gradients to flow through the sampled action into policy parameters when the downstream computation is differentiable.

This is useful in methods that optimize differentiable objectives involving sampled continuous actions. When the environment remains nondifferentiable, the benefit is limited to differentiable parts of the objective.
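In PyTorch, this is the difference between sample() and rsample() on a distribution object. A minimal sketch, where mu and sigma are assumed to be policy network outputs:

import torch

dist = torch.distributions.Normal(mu, sigma)
a = dist.rsample()             # reparameterized: gradients reach mu and sigma

# Equivalently, written out by hand:
eps = torch.randn_like(sigma)  # parameter-free noise
a = mu + sigma * eps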

Credit Assignment

Reinforcement learning has a credit assignment problem. An action may affect rewards many steps later. The learning algorithm must estimate which earlier actions caused later outcomes.

Backpropagation handles credit assignment through differentiable computation graphs. RL often needs statistical credit assignment through sampled trajectories.

The distinction is important.

In a differentiable recurrent model, an output loss at time $T$ can backpropagate through every hidden state to earlier parameters.

In a nondifferentiable environment, there may be no derivative path from reward back to action. Policy gradients use probability scores to assign credit statistically.

Off-Policy Data and Replay

Many RL algorithms learn from replay buffers. A replay buffer stores transitions:

(state, action, reward, next_state, done)

The model trains on sampled transitions rather than only the latest trajectory.
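A minimal buffer sketch in plain Python (capacity and batch size are illustrative):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling; prioritized schemes weight by TD error instead.
        return random.sample(self.buffer, batch_size)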

For value learning, this works naturally because the loss is a supervised temporal difference loss.

For policy gradient methods, off-policy learning requires corrections or specialized objectives because the data may have been generated by an older policy.

AD only differentiates the loss constructed from the sampled batch. The statistical validity of using that batch belongs to the RL algorithm.

Common AD Pitfalls in RL

Reinforcement learning code often contains explicit graph boundaries. Mistakes at these boundaries change the algorithm.

- Returns and advantages are usually treated as constants in the policy loss. If they are computed by a learned value function, they may need to be detached when updating the actor.
- Target Q-values are usually detached.
- Sampled discrete actions are not differentiated as ordinary tensors.
- Log probabilities must correspond to the action actually sampled.
- Entropy bonuses should use the current policy distribution, not stale probabilities, unless the algorithm specifically calls for stored ones.
- Environment state should not accidentally retain computation graphs across episodes unless differentiable simulation is intended.

These details matter because RL losses are often assembled manually.
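As an illustration of the points about sampled actions and log probabilities (PyTorch-style, names assumed): the log probability in the loss must come from the distribution that produced the executed action.

dist = torch.distributions.Categorical(logits=policy_net(state))
action = dist.sample()          # discrete draw: not differentiated as a tensor
logp = dist.log_prob(action)    # log pi_theta of the action actually taken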

AD Boundary in Reinforcement Learning

A clean RL implementation separates four layers:

policy/value networks:
    differentiable tensor programs

environment:
    transition and reward source

estimator:
    constructs policy, value, or model-based losses

AD engine:
    differentiates those losses with respect to selected parameters

Automatic differentiation does not turn reinforcement learning into ordinary supervised learning. It supplies reliable derivatives for the differentiable parts: policy logits, log probabilities, value predictions, entropy terms, differentiable dynamics, and model losses.

The RL algorithm defines the estimator. The AD system computes the derivative of that estimator.