Reinforcement Learning: The Policy Gradient Theorem and the REINFORCE Algorithm

Policy-Based Methods

Policy parameterization: the idea is to parameterize the policy, for instance with a neural network π_θ, so that it outputs a probability distribution over actions (a stochastic policy).


The network takes a state as input and outputs a distribution over actions.
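To make the mapping from state to action distribution concrete, here is a minimal sketch that uses only the Python standard library. It assumes a linear softmax policy rather than a neural network, and the names `softmax`, `action_probs`, `sample_action`, and the numbers in `theta` and `state` are illustrative choices, not taken from the original article.

```python
import math
import random


def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probs(theta, state):
    """pi_theta(. | state): theta[a] is the weight vector that scores action a."""
    logits = [sum(w * x for w, x in zip(weights, state)) for weights in theta]
    return softmax(logits)


def sample_action(theta, state, rng=random):
    """Draw an action index according to pi_theta(. | state)."""
    probs = action_probs(theta, state)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical setup: 2 actions, 3 state features.
theta = [[0.5, -0.2, 0.1],
         [-0.3, 0.4, 0.0]]
state = [1.0, 2.0, 0.5]
probs = action_probs(theta, state)
action = sample_action(theta, state, random.Random(0))
```

Because actions are sampled from `probs` instead of being chosen by an argmax, changing `theta` shifts the probabilities smoothly, which is what makes gradient-based optimization of the policy possible.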


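Policy Learning vs. Value Learning

Policy gradient methods can learn a stochastic policy, whereas value-based methods, which act greedily with respect to a value function, cannot represent one directly. This has two consequences:

We do not have to implement the exploration/exploitation trade-off by hand. Because the policy outputs a probability distribution over actions, the agent explores the state space without always following the same trajectory.

We also get rid of the perceptual aliasing problem. Perceptual aliasing occurs when two states look (or actually are) the same but require different actions.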
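The aliasing argument can be checked on a toy problem. The sketch below is a hypothetical five-cell corridor (not from the original article) with the goal in the middle cell; cells 1 and 3 produce the same observation, so a policy that sees only observations must act identically in both.

```python
import random

GOAL = 2  # positions are 0..4, the goal sits in the middle


def observe(pos):
    """Positions 1 and 3 look identical ("grey"): this is the aliasing."""
    return {0: "left_wall", 4: "right_wall"}.get(pos, "grey")


def run_episode(policy, start, rng, max_steps=1000):
    """Return the number of steps taken to reach the goal, or None on timeout.

    Assumes start != GOAL.
    """
    pos = start
    for step in range(1, max_steps + 1):
        move = policy(observe(pos), rng)   # -1 = left, +1 = right
        pos = min(4, max(0, pos + move))
        if pos == GOAL:
            return step
    return None


def deterministic_policy(obs, rng):
    # A deterministic policy must commit to ONE action for "grey".
    return {"left_wall": +1, "right_wall": -1, "grey": +1}[obs]


def stochastic_policy(obs, rng):
    if obs == "grey":
        return rng.choice([-1, +1])        # 50/50 in the aliased states
    return +1 if obs == "left_wall" else -1
```

Starting from position 3, the deterministic policy bounces between cells 3 and 4 forever, while the stochastic policy moves left with probability 0.5 each time it visits cell 3 and therefore reaches the goal after a few visits with very high probability.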


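Of course, policy gradient methods also have drawbacks:

They frequently converge to a local maximum instead of the global optimum.
They improve the policy in small steps, so training can take longer (lower sample efficiency).
Their gradient estimates can have high variance. The Actor-Critic unit explains why and how to address it.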

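Bias and variance: bias is the systematic error of an estimator, that is, the gap between its expected value and the true quantity; variance measures how much the estimate fluctuates from one sample to the next. The two usually trade off against each other: Monte Carlo return estimates, as used by REINFORCE, have low bias but high variance.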
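A small numerical experiment illustrates the two notions. This is a toy setup under stated assumptions, not part of the original article: each episode return is the sum of 10 independent Gaussian rewards with mean 1 and standard deviation 1, so the true expected return is known to be 10.

```python
import random
import statistics


def sample_return(rng, n_steps=10):
    """One Monte Carlo return: the sum of n_steps noisy rewards (mean 1, std 1)."""
    return sum(rng.gauss(1.0, 1.0) for _ in range(n_steps))


rng = random.Random(0)
true_value = 10.0  # n_steps * mean reward; known only because the setup is a toy

# 2000 single-episode estimates of the expected return
single = [sample_return(rng) for _ in range(2000)]
# 80 estimates, each averaging 25 episodes
batched = [statistics.mean(single[i:i + 25]) for i in range(0, 2000, 25)]

bias_estimate = statistics.mean(single) - true_value  # close to 0: unbiased
var_single = statistics.variance(single)               # spread of one-episode estimates
var_batched = statistics.variance(batched)             # spread after averaging
```

The single-episode estimates are centered on the true value (low bias) but scatter widely around it (high variance); averaging several episodes leaves the bias unchanged and shrinks the variance roughly in proportion to the batch size.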

Policy Gradient Methods

Objective Function

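For a given parameterized policy π_θ, we want to maximize the expected return over the trajectories τ generated by that policy:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$

This is equivalent to:

$$J(\theta) = \sum_{\tau} P(\tau;\theta)\, R(\tau)$$

where the probability of each trajectory follows from the chain rule of probability together with the Markov property:

$$P(\tau;\theta) = \mu(s_0) \prod_{t=0}^{H-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$

Here μ is the initial-state distribution, P(s_{t+1} | s_t, a_t) is the environment dynamics, and H is the episode length.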
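For a very small problem the objective can be evaluated exactly by enumerating every trajectory. The sketch below assumes a hypothetical MDP with 2 states, 2 actions, and horizon 2, and uses a tabular policy `pi` as a stand-in for π_θ; all numbers are made up for illustration.

```python
from itertools import product

H = 2                                  # episode length (number of actions)
mu = [0.6, 0.4]                        # initial-state distribution
P = {                                  # P[(s, a)][s_next]: environment dynamics
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}  # reward for (s, a)
pi = {0: [0.3, 0.7], 1: [0.6, 0.4]}    # pi[s][a]: the policy


def trajectory_prob(states, actions):
    """P(tau): initial-state probability times policy and dynamics factors."""
    p = mu[states[0]]
    for t in range(H):
        s, a = states[t], actions[t]
        p *= pi[s][a] * P[(s, a)][states[t + 1]]
    return p


def trajectory_return(states, actions):
    """R(tau): the undiscounted sum of rewards along the trajectory."""
    return sum(R[(states[t], actions[t])] for t in range(H))


def objective():
    """Return (J, total probability mass) by summing over all trajectories."""
    J, total = 0.0, 0.0
    for states in product(range(2), repeat=H + 1):
        for actions in product(range(2), repeat=H):
            p = trajectory_prob(states, actions)
            total += p
            J += p * trajectory_return(states, actions)
    return J, total


J, total = objective()
```

The trajectory probabilities sum to 1, which is a quick sanity check on the factorization. Exhaustive enumeration grows exponentially with the horizon, which is why practical methods estimate the expectation from sampled trajectories instead.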

```python
from collections import deque

import numpy as np
import torch


def reinforce(policy, optimizer, n_training_episodes, max_t, gamma, print_every):
    # NOTE: `env` is read from the enclosing (global) scope and is assumed to
    # follow the classic Gym API: reset() -> state, step() -> 4-tuple.
    # Rolling window of the last 100 episode scores, used for progress reports
    scores_deque = deque(maxlen=100)
    scores = []
    # Line 3 of pseudocode
    for i_episode in range(1, n_training_episodes + 1):
        saved_log_probs = []
        rewards = []
        state = env.reset()
        # Line 4 of pseudocode: roll out one episode with the current policy
        for t in range(max_t):
            action, log_prob = policy.act(state)
            saved_log_probs.append(log_prob)
            state, reward, done, _ = env.step(action)
            rewards.append(reward)
            if done:
                break
        scores_deque.append(sum(rewards))
        scores.append(sum(rewards))

        # Line 6 of pseudocode: compute the discounted return G_t for every step.
        # Following Sutton & Barto (2017, 2nd-edition draft, pp. 44-46):
        #   G_t = r_(t+1) + gamma * r_(t+2) + gamma^2 * r_(t+3) + ...
        #       = r_(t+1) + gamma * G_(t+1)
        # Iterating from the last step to the first lets each G_t reuse the
        # already computed G_(t+1), so the whole pass is O(N) (dynamic programming).
        # appendleft() is O(1) on a deque (O(N) on a list) and leaves `returns`
        # in chronological order, from t = 0 to t = n_steps - 1.
        returns = deque(maxlen=max_t)
        n_steps = len(rewards)
        for t in range(n_steps)[::-1]:  # reverse order
            disc_return_t = returns[0] if len(returns) > 0 else 0
            returns.appendleft(gamma * disc_return_t + rewards[t])

        # Standardize the returns to make training more stable.
        # eps is the float32 machine epsilon (about 1.19e-07); adding it to the
        # standard deviation avoids a division by zero.
        eps = np.finfo(np.float32).eps.item()
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + eps)

        # Line 7: loss = -sum_t log pi(a_t | s_t) * G_t
        # (each log_prob is expected to be a 1-element tensor so torch.cat works)
        policy_loss = []
        for log_prob, disc_return in zip(saved_log_probs, returns):
            policy_loss.append(-log_prob * disc_return)
        policy_loss = torch.cat(policy_loss).sum()

        # Line 8: PyTorch optimizers minimize, hence the negated objective above
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

        if i_episode % print_every == 0:
            print('Episode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))

    return scores
```
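The return computation inside the training loop is easier to follow in isolation. The following standard-library sketch reproduces the backward pass and the standardization step on a short reward list; the helper names `discounted_returns` and `standardize` are illustrative and do not appear in the original code.

```python
from collections import deque


def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_(t+1) for every step, in O(N)."""
    returns = deque()
    for r in reversed(rewards):
        next_return = returns[0] if returns else 0.0
        returns.appendleft(r + gamma * next_return)  # O(1) insert at the front
    return list(returns)


def standardize(values, eps=1e-8):
    """Shift to zero mean and scale by the sample standard deviation (n - 1).

    Needs at least two values; torch.std uses the same n - 1 correction by default.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return [(v - mean) / (var ** 0.5 + eps) for v in values]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
# Worked by hand: G_2 = 1, G_1 = 1 + 0.5 * 1 = 1.5, G_0 = 1 + 0.5 * 1.5 = 1.75
normalized = standardize(returns)
```

Early steps receive larger returns than late ones because more future reward lies ahead of them. After standardization the values are centered on zero, so actions followed by below-average returns get a negative weight in the loss.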


