Reinforcement Learning for Quadruped Robots: A Python Implementation and Walkthrough of the PPO Algorithm
A detailed walkthrough of the Python implementation of the PPO algorithm in the rsl_rl repository. It covers the repository layout, a review of PPO's core formulas (probability ratio, GAE, entropy), an analysis of the key code modules (initialization, rollout storage, action sampling, environment feedback, return computation), and the central update training loop. The focus is on policy clipping, value clipping, KL-divergence control, and the adaptive learning-rate mechanism, aiming to help developers understand the underlying logic and optimization objectives of RL-based quadruped control.


Prerequisites: Python syntax, an understanding of object-oriented encapsulation, basic PyTorch usage, and familiarity with Policy Gradient, Actor-Critic, and PPO. This article walks through the Python implementation of the PPO algorithm in the rsl_rl repository. Unitree RL GYM is an open-source reinforcement learning (RL) control example project based on Unitree robots, used to train, test, and deploy quadruped control policies. It supports several Unitree robot models, including Go2, H1, H1_2, and G1.

To get the code:
git clone https://github.com/leggedrobotics/rsl_rl.git
cd rsl_rl
git checkout v1.0.2
First, a quick review of PPO's core formula. The PPO objective function is:
$$L^{clip}(\theta)=\mathbb{E}[\min(r(\theta)A,\mathrm{clip}(r(\theta),1-\epsilon,1+\epsilon)A)]$$
where $r(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio and $A$ is the advantage estimate:
- r ≈ 1: the new and old policies are nearly the same
- r >> 1 or r << 1: the policy has changed too much

Use the `tree` command to inspect the overall project structure. rsl_rl directory layout:

rsl_rl/
├── algorithms/
├── env/
├── modules/
├── runners/
├── storage/
└── utils/
The algorithms/ directory:

algorithms/
├── __init__.py
└── ppo.py
ppo.py implements the PPO (Proximal Policy Optimization) algorithm.

The env/ directory:

env/
├── __init__.py
└── vec_env.py
vec_env.py implements a Vectorized Environment, which supports parallel training across many environments.

The modules/ directory:

modules/
├── actor_critic.py
├── actor_critic_recurrent.py
The runners/ directory:

runners/
└── on_policy_runner.py
on_policy_runner.py samples data with the current policy and runs the training loop.

The storage/ directory:

storage/
└── rollout_storage.py
The utils/ directory:

utils/
└── utils.py
Now for the Python implementation, starting with:

algorithms/
├── __init__.py
└── ppo.py
# SPDX-FileCopyrightText: Copyright (c) 2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Copyright (c) 2021 ETH Zurich, Nikita Rudin
import torch
import torch.nn as nn
import torch.optim as optim

from rsl_rl.modules import ActorCritic
from rsl_rl.storage import RolloutStorage


class PPO:
    actor_critic: ActorCritic

    def __init__(self,
                 actor_critic,
                 num_learning_epochs=1,
                 num_mini_batches=1,
                 clip_param=0.2,
                 gamma=0.998,
                 lam=0.95,
                 value_loss_coef=1.0,
                 entropy_coef=0.0,
                 learning_rate=1e-3,
                 max_grad_norm=1.0,
                 use_clipped_value_loss=True,
                 schedule="fixed",
                 desired_kl=0.01,
                 device='cpu',
                 ):

        self.device = device

        self.desired_kl = desired_kl
        self.schedule = schedule
        self.learning_rate = learning_rate

        # PPO components
        self.actor_critic = actor_critic
        self.actor_critic.to(self.device)
        self.storage = None  # initialized later
        self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=learning_rate)
        self.transition = RolloutStorage.Transition()

        # PPO parameters
        self.clip_param = clip_param
        self.num_learning_epochs = num_learning_epochs
        self.num_mini_batches = num_mini_batches
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.gamma = gamma
        self.lam = lam
        self.max_grad_norm = max_grad_norm
        self.use_clipped_value_loss = use_clipped_value_loss

    def init_storage(self, num_envs, num_transitions_per_env, actor_obs_shape, critic_obs_shape, action_shape):
        self.storage = RolloutStorage(num_envs, num_transitions_per_env, actor_obs_shape, critic_obs_shape, action_shape, self.device)

    def test_mode(self):
        self.actor_critic.test()

    def train_mode(self):
        self.actor_critic.train()

    def act(self, obs, critic_obs):
        if self.actor_critic.is_recurrent:
            self.transition.hidden_states = self.actor_critic.get_hidden_states()
        # Compute the actions and values
        self.transition.actions = self.actor_critic.act(obs).detach()
        self.transition.values = self.actor_critic.evaluate(critic_obs).detach()
        self.transition.actions_log_prob = self.actor_critic.get_actions_log_prob(self.transition.actions).detach()
        self.transition.action_mean = self.actor_critic.action_mean.detach()
        self.transition.action_sigma = self.actor_critic.action_std.detach()
        # need to record obs and critic_obs before env.step()
        self.transition.observations = obs
        self.transition.critic_observations = critic_obs
        return self.transition.actions

    def process_env_step(self, rewards, dones, infos):
        self.transition.rewards = rewards.clone()
        self.transition.dones = dones
        # Bootstrapping on time outs
        if 'time_outs' in infos:
            self.transition.rewards += self.gamma * torch.squeeze(self.transition.values * infos['time_outs'].unsqueeze(1).to(self.device), 1)

        # Record the transition
        self.storage.add_transitions(self.transition)
        self.transition.clear()
        self.actor_critic.reset(dones)

    def compute_returns(self, last_critic_obs):
        last_values = self.actor_critic.evaluate(last_critic_obs).detach()
        self.storage.compute_returns(last_values, self.gamma, self.lam)

    def update(self):
        mean_value_loss = 0
        mean_surrogate_loss = 0
        if self.actor_critic.is_recurrent:
            generator = self.storage.reccurent_mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
        else:
            generator = self.storage.mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
        for obs_batch, critic_obs_batch, actions_batch, target_values_batch, advantages_batch, returns_batch, old_actions_log_prob_batch, \
                old_mu_batch, old_sigma_batch, hid_states_batch, masks_batch in generator:

            self.actor_critic.act(obs_batch, masks=masks_batch, hidden_states=hid_states_batch[0])
            actions_log_prob_batch = self.actor_critic.get_actions_log_prob(actions_batch)
            value_batch = self.actor_critic.evaluate(critic_obs_batch, masks=masks_batch, hidden_states=hid_states_batch[1])
            mu_batch = self.actor_critic.action_mean
            sigma_batch = self.actor_critic.action_std
            entropy_batch = self.actor_critic.entropy

            # KL
            if self.desired_kl is not None and self.schedule == 'adaptive':
                with torch.inference_mode():
                    kl = torch.sum(
                        torch.log(sigma_batch / old_sigma_batch + 1.e-5) + (torch.square(old_sigma_batch) + torch.square(old_mu_batch - mu_batch)) / (2.0 * torch.square(sigma_batch)) - 0.5, axis=-1)
                    kl_mean = torch.mean(kl)

                    if kl_mean > self.desired_kl * 2.0:
                        self.learning_rate = max(1e-5, self.learning_rate / 1.5)
                    elif kl_mean < self.desired_kl / 2.0 and kl_mean > 0.0:
                        self.learning_rate = min(1e-2, self.learning_rate * 1.5)

                    for param_group in self.optimizer.param_groups:
                        param_group['lr'] = self.learning_rate

            # Surrogate loss
            ratio = torch.exp(actions_log_prob_batch - torch.squeeze(old_actions_log_prob_batch))
            surrogate = -torch.squeeze(advantages_batch) * ratio
            surrogate_clipped = -torch.squeeze(advantages_batch) * torch.clamp(ratio, 1.0 - self.clip_param, 1.0 + self.clip_param)
            surrogate_loss = torch.max(surrogate, surrogate_clipped).mean()

            # Value function loss
            if self.use_clipped_value_loss:
                value_clipped = target_values_batch + (value_batch - target_values_batch).clamp(-self.clip_param, self.clip_param)
                value_losses = (value_batch - returns_batch).pow(2)
                value_losses_clipped = (value_clipped - returns_batch).pow(2)
                value_loss = torch.max(value_losses, value_losses_clipped).mean()
            else:
                value_loss = (returns_batch - value_batch).pow(2).mean()

            loss = surrogate_loss + self.value_loss_coef * value_loss - self.entropy_coef * entropy_batch.mean()

            # Gradient step
            self.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(self.actor_critic.parameters(), self.max_grad_norm)
            self.optimizer.step()

            mean_value_loss += value_loss.item()
            mean_surrogate_loss += surrogate_loss.item()

        num_updates = self.num_learning_epochs * self.num_mini_batches
        mean_value_loss /= num_updates
        mean_surrogate_loss /= num_updates
        self.storage.clear()

        return mean_value_loss, mean_surrogate_loss
class PPO:
actor_critic: ActorCritic
def __init__(self, actor_critic, num_learning_epochs=1, num_mini_batches=1, clip_param=0.2, gamma=0.998, lam=0.95, value_loss_coef=1.0, entropy_coef=0.0, learning_rate=1e-3, max_grad_norm=1.0, use_clipped_value_loss=True, schedule="fixed", desired_kl=0.01, device='cpu',):
...
The PPO hyperparameters:
- actor_critic: the Actor-Critic network required by PPO (defined in modules/actor_critic.py, which we will analyze in a later installment)
- num_learning_epochs=1: how many epochs to train on each batch of rollout data
- num_mini_batches=1: how many mini-batches to split the rollout data into (to improve sample efficiency)
- clip_param=0.2: PPO's core $\epsilon$ parameter, used to clip the policy update
- gamma=0.998: the reward discount factor, controlling the weight of long-term rewards
- lam=0.95: the GAE $\lambda$ parameter, used to reduce variance when computing the advantage function
- value_loss_coef=1.0: the value-loss weight; higher values put more emphasis on the value network
- entropy_coef=0.0: the entropy coefficient, which rewards the policy for staying stochastic as an extra exploration bonus
- learning_rate=1e-3: the network learning rate
- max_grad_norm=1.0: gradient clipping; gradient norms above this value are clipped to prevent gradient explosion
- use_clipped_value_loss=True: whether to use value clipping, preventing overly large critic updates
- schedule="fixed": the learning rate stays fixed during training instead of adapting to the KL or training progress
- desired_kl=0.01: the target KL divergence; the desired KL distance between old and new policies is about 0.01, used by the adaptive learning-rate schedule to control the update size
- device='cpu': the device to run on

self.storage = None # initialized later
self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=learning_rate)
self.transition = RolloutStorage.Transition()
- self.storage: placeholder for the rollout buffer (experience storage)
- self.optimizer: the Adam optimizer that updates the Actor-Critic network parameters
- self.transition: a temporary per-step data structure (step buffer)

init_storage():

def init_storage(self, num_envs, num_transitions_per_env, actor_obs_shape, critic_obs_shape, action_shape):
    self.storage = RolloutStorage(
        num_envs, num_transitions_per_env, actor_obs_shape, critic_obs_shape, action_shape, self.device
    )
The RolloutStorage class lives in storage/rollout_storage.py, which we will also analyze later.

def test_mode(self):
    self.actor_critic.test()

def train_mode(self):
    self.actor_critic.train()
These switch the mode of actor_critic, which is defined in modules/actor_critic.py (also covered in a later installment).

act():

def act(self, obs, critic_obs):
    if self.actor_critic.is_recurrent:
        self.transition.hidden_states = self.actor_critic.get_hidden_states()
    # Compute the actions and values
    self.transition.actions = self.actor_critic.act(obs).detach()
    self.transition.values = self.actor_critic.evaluate(critic_obs).detach()
    self.transition.actions_log_prob = self.actor_critic.get_actions_log_prob(self.transition.actions).detach()
    self.transition.action_mean = self.actor_critic.action_mean.detach()
    self.transition.action_sigma = self.actor_critic.action_std.detach()
    # need to record obs and critic_obs before env.step()
    self.transition.observations = obs
    self.transition.critic_observations = critic_obs
    return self.transition.actions
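Before stepping through act() line by line, it helps to see what a diagonal-Gaussian log-probability boils down to. This is a plain-Python sketch under the assumption that the policy is an independent Normal per action dimension (as the action_mean/action_std bookkeeping suggests); the numbers are made up:

```python
import math

def gaussian_log_prob(action, mu, sigma):
    """Log-density of a diagonal Gaussian policy: sum over action dimensions
    of -0.5*((a-mu)/sigma)^2 - log(sigma) - 0.5*log(2*pi)."""
    total = 0.0
    for a, m, s in zip(action, mu, sigma):
        total += -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
    return total

# Made-up 2-D action, policy mean, and policy std
log_prob = gaussian_log_prob([0.1, -0.2], [0.0, 0.0], [0.5, 0.5])
```

In the real code this is done by get_actions_log_prob on batched tensors; the sketch only shows the per-sample arithmetic.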
def act(self, obs, critic_obs):
- obs: the input to the policy network
- critic_obs: the input to the value network

if self.actor_critic.is_recurrent:
    self.transition.hidden_states = self.actor_critic.get_hidden_states()
For a recurrent policy, the hidden state must be recorded here; otherwise the sequence state cannot be restored during training.

self.transition.actions = self.actor_critic.act(obs).detach()
self.transition.values = self.actor_critic.evaluate(critic_obs).detach()
The Actor network computes the action from the policy input; .detach() excludes the result from gradient computation, since this is pure sampling. Likewise, the Critic network computes the value estimate, also detached.

self.transition.actions_log_prob = self.actor_critic.get_actions_log_prob(self.transition.actions).detach()
The action log-probability is stored for later use when computing the probability ratio.

self.transition.action_mean = self.actor_critic.action_mean.detach()
self.transition.action_sigma = self.actor_critic.action_std.detach()
# need to record obs and critic_obs before env.step()
self.transition.observations = obs
self.transition.critic_observations = critic_obs
return self.transition.actions
The observations must be recorded before env.step(); otherwise the state will already have changed.

process_env_step():

def process_env_step(self, rewards, dones, infos):
    self.transition.rewards = rewards.clone()
    self.transition.dones = dones
    # Bootstrapping on time outs
    if 'time_outs' in infos:
        self.transition.rewards += self.gamma * torch.squeeze(self.transition.values * infos['time_outs'].unsqueeze(1).to(self.device), 1)

    # Record the transition
    self.storage.add_transitions(self.transition)
    self.transition.clear()
    self.actor_critic.reset(dones)
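The time-out bootstrapping line is easier to read with concrete numbers. A plain-Python sketch (the real code does the same thing with batched tensors; the values here are made up):

```python
gamma = 0.998

# One entry per environment: step reward, value estimate V(s), and a
# 1.0/0.0 time-out flag (infos['time_outs'] in the real code)
rewards   = [1.0, 1.0, 1.0]
values    = [5.0, 5.0, 5.0]
time_outs = [0.0, 1.0, 0.0]

# r <- r + gamma * V(s), but only where the episode was cut off by a time
# limit rather than a real failure
boot = [r + gamma * v * t for r, v, t in zip(rewards, values, time_outs)]
```

Only the second environment was truncated, so only its reward is augmented by the discounted value estimate.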
self.transition.rewards = rewards.clone()
self.transition.dones = dones
# Bootstrapping on time outs
if 'time_outs' in infos:
    self.transition.rewards += self.gamma * torch.squeeze(self.transition.values * infos['time_outs'].unsqueeze(1).to(self.device), 1)
This is the time-out bootstrap: when an episode is cut off by a time limit rather than a true failure, the value estimate is folded back into the reward, $r \leftarrow r + \gamma V(s)$.

# Record the transition
self.storage.add_transitions(self.transition)
self.transition.clear()
self.actor_critic.reset(dones)
These lines store the transition, clear the step buffer, and reset the RNN state of environments that finished.

compute_returns() computes the returns and the advantage estimates:

def compute_returns(self, last_critic_obs):
    last_values = self.actor_critic.evaluate(last_critic_obs).detach()
    self.storage.compute_returns(last_values, self.gamma, self.lam)
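compute_returns delegates the actual work to RolloutStorage. What that GAE pass does can be sketched in plain Python (a simplified, loop-based version with made-up rewards and values; the real implementation is batched and also handles done flags):

```python
def gae(rewards, values, last_value, gamma=0.998, lam=0.95):
    """Backward GAE pass: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    A_t = delta_t + gamma*lam*A_{t+1}, and R_t = A_t + V(s_t)."""
    T = len(rewards)
    advantages = [0.0] * T
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

# Made-up 3-step rollout with constant reward and value estimates
adv, ret = gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], last_value=0.5)
```

Earlier steps accumulate more discounted TD errors, so the advantages shrink toward the end of the rollout in this example.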
This triggers the GAE computation, which takes a weighted average of multi-step TD errors to obtain a much more stable advantage estimate.
- Return: $R_t = r_t + \gamma R_{t+1}$
- TD error: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
- Advantage (GAE): $A_t = \delta_t + \gamma\lambda\delta_{t+1} + (\gamma\lambda)^2\delta_{t+2}+\dots$

The core training step is completed in the update() function:
def update(self):
    mean_value_loss = 0
    mean_surrogate_loss = 0
    if self.actor_critic.is_recurrent:
        generator = self.storage.reccurent_mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
    else:
        generator = self.storage.mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
    for obs_batch, critic_obs_batch, actions_batch, target_values_batch, advantages_batch, returns_batch, old_actions_log_prob_batch, \
            old_mu_batch, old_sigma_batch, hid_states_batch, masks_batch in generator:

        self.actor_critic.act(obs_batch, masks=masks_batch, hidden_states=hid_states_batch[0])
        actions_log_prob_batch = self.actor_critic.get_actions_log_prob(actions_batch)
        value_batch = self.actor_critic.evaluate(critic_obs_batch, masks=masks_batch, hidden_states=hid_states_batch[1])
        mu_batch = self.actor_critic.action_mean
        sigma_batch = self.actor_critic.action_std
        entropy_batch = self.actor_critic.entropy

        # KL
        if self.desired_kl is not None and self.schedule == 'adaptive':
            with torch.inference_mode():
                kl = torch.sum(
                    torch.log(sigma_batch / old_sigma_batch + 1.e-5) + (torch.square(old_sigma_batch) + torch.square(old_mu_batch - mu_batch)) / (2.0 * torch.square(sigma_batch)) - 0.5, axis=-1)
                kl_mean = torch.mean(kl)

                if kl_mean > self.desired_kl * 2.0:
                    self.learning_rate = max(1e-5, self.learning_rate / 1.5)
                elif kl_mean < self.desired_kl / 2.0 and kl_mean > 0.0:
                    self.learning_rate = min(1e-2, self.learning_rate * 1.5)

                for param_group in self.optimizer.param_groups:
                    param_group['lr'] = self.learning_rate

        # Surrogate loss
        ratio = torch.exp(actions_log_prob_batch - torch.squeeze(old_actions_log_prob_batch))
        surrogate = -torch.squeeze(advantages_batch) * ratio
        surrogate_clipped = -torch.squeeze(advantages_batch) * torch.clamp(ratio, 1.0 - self.clip_param, 1.0 + self.clip_param)
        surrogate_loss = torch.max(surrogate, surrogate_clipped).mean()

        # Value function loss
        if self.use_clipped_value_loss:
            value_clipped = target_values_batch + (value_batch - target_values_batch).clamp(-self.clip_param, self.clip_param)
            value_losses = (value_batch - returns_batch).pow(2)
            value_losses_clipped = (value_clipped - returns_batch).pow(2)
            value_loss = torch.max(value_losses, value_losses_clipped).mean()
        else:
            value_loss = (returns_batch - value_batch).pow(2).mean()

        loss = surrogate_loss + self.value_loss_coef * value_loss - self.entropy_coef * entropy_batch.mean()

        # Gradient step
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(self.actor_critic.parameters(), self.max_grad_norm)
        self.optimizer.step()

        mean_value_loss += value_loss.item()
        mean_surrogate_loss += surrogate_loss.item()

    num_updates = self.num_learning_epochs * self.num_mini_batches
    mean_value_loss /= num_updates
    mean_surrogate_loss /= num_updates
    self.storage.clear()

    return mean_value_loss, mean_surrogate_loss
mean_value_loss = 0
mean_surrogate_loss = 0
if self.actor_critic.is_recurrent:
    generator = self.storage.reccurent_mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
else:
    generator = self.storage.mini_batch_generator(self.num_mini_batches, self.num_learning_epochs)
mean_value_loss and mean_surrogate_loss track the average losses over the whole update, for logging.

The mini-batch iterator yields:

for obs_batch, critic_obs_batch, actions_batch, target_values_batch, advantages_batch, returns_batch, old_actions_log_prob_batch, \
        old_mu_batch, old_sigma_batch, hid_states_batch, masks_batch in generator:

- obs_batch: Actor network input
- critic_obs_batch: Critic network input
- actions_batch: the sampled actions
- target_values_batch: the old value estimates V(s)
- advantages_batch: the GAE advantages
- returns_batch: the target values (returns)
- old_actions_log_prob_batch: old policy log-probabilities $\log \pi_{\theta_{old}}(a|s)$
- old_mu_batch: old policy mean $\mu$
- old_sigma_batch: old policy standard deviation $\sigma$

self.actor_critic.act(obs_batch, masks=masks_batch, hidden_states=hid_states_batch[0])
The act call runs the Actor forward pass, recomputing the current policy's action distribution $\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s))$.

actions_log_prob_batch = self.actor_critic.get_actions_log_prob(actions_batch)

This recomputes the log-probability $\log \pi_\theta(a_t|s_t)$ of the stored actions under the current policy, used shortly to form the probability ratio.

value_batch = self.actor_critic.evaluate(critic_obs_batch)

The Critic recomputes the value estimate.

mu_batch = self.actor_critic.action_mean
sigma_batch = self.actor_critic.action_std
entropy_batch = self.actor_critic.entropy

- mu_batch: the policy mean $\mu$
- sigma_batch: the policy standard deviation $\sigma$
- entropy_batch: the policy entropy

# KL
if self.desired_kl is not None and self.schedule == 'adaptive':
    with torch.inference_mode():
        kl = torch.sum(
            torch.log(sigma_batch / old_sigma_batch + 1.e-5) + (torch.square(old_sigma_batch) + torch.square(old_mu_batch - mu_batch)) / (2.0 * torch.square(sigma_batch)) - 0.5, axis=-1)
        kl_mean = torch.mean(kl)

        if kl_mean > self.desired_kl * 2.0:
            self.learning_rate = max(1e-5, self.learning_rate / 1.5)
        elif kl_mean < self.desired_kl / 2.0 and kl_mean > 0.0:
            self.learning_rate = min(1e-2, self.learning_rate * 1.5)

        for param_group in self.optimizer.param_groups:
            param_group['lr'] = self.learning_rate
if kl_mean > self.desired_kl * 2.0:
    self.learning_rate = max(1e-5, self.learning_rate / 1.5)
elif kl_mean < self.desired_kl / 2.0 and kl_mean > 0.0:
    self.learning_rate = min(1e-2, self.learning_rate * 1.5)

for param_group in self.optimizer.param_groups:
    param_group['lr'] = self.learning_rate
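The adaptive schedule shrinks or grows the learning rate by a factor of 1.5 and keeps it within [1e-5, 1e-2]. The same logic as a standalone sketch (the function name is ours, not rsl_rl's):

```python
def adapt_lr(lr, kl_mean, desired_kl=0.01):
    """Shrink the LR when the measured KL overshoots the target band,
    grow it when the KL undershoots, and clamp to [1e-5, 1e-2]."""
    if kl_mean > desired_kl * 2.0:
        lr = max(1e-5, lr / 1.5)
    elif 0.0 < kl_mean < desired_kl / 2.0:
        lr = min(1e-2, lr * 1.5)
    return lr

lr_after_big_kl   = adapt_lr(1e-3, kl_mean=0.05)   # KL too large -> shrink
lr_after_small_kl = adapt_lr(1e-3, kl_mean=0.001)  # KL too small -> grow
lr_in_band        = adapt_lr(1e-3, kl_mean=0.01)   # inside the band -> unchanged
```

The band [desired_kl/2, desired_kl*2] acts as a dead zone so the learning rate only moves when the update size is clearly off target.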
| KL | Meaning |
|---|---|
| Too large | updates are too aggressive |
| Too small | updates are too conservative |
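The expression summed per action dimension above is the closed-form KL divergence between two univariate Gaussians, $\mathrm{KL}\big(\mathcal{N}(\mu_{old},\sigma_{old})\,\|\,\mathcal{N}(\mu,\sigma)\big) = \log\frac{\sigma}{\sigma_{old}} + \frac{\sigma_{old}^2 + (\mu_{old}-\mu)^2}{2\sigma^2} - \frac{1}{2}$ (the code adds 1e-5 inside the log for numerical safety). A plain-Python check:

```python
import math

def gaussian_kl(mu_old, sigma_old, mu_new, sigma_new):
    """KL(N(mu_old, sigma_old) || N(mu_new, sigma_new)) for one dimension."""
    return (math.log(sigma_new / sigma_old)
            + (sigma_old**2 + (mu_old - mu_new)**2) / (2.0 * sigma_new**2)
            - 0.5)

kl_same  = gaussian_kl(0.0, 1.0, 0.0, 1.0)   # identical Gaussians -> 0
kl_moved = gaussian_kl(0.0, 1.0, 1.0, 1.0)   # mean shifted by one sigma -> 0.5
```

For a diagonal Gaussian policy the total KL is just the sum of these per-dimension terms, which is exactly what torch.sum(..., axis=-1) computes.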
ratio = torch.exp(actions_log_prob_batch - old_actions_log_prob_batch)
This is PPO's core quantity, the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$:
- r ≈ 1: the new and old policies are nearly the same
- r >> 1 or r << 1: the policy has changed too much

surrogate = -torch.squeeze(advantages_batch) * ratio
surrogate_clipped = -torch.squeeze(advantages_batch) * torch.clamp(ratio, 1.0 - self.clip_param, 1.0 + self.clip_param)
surrogate_loss = torch.max(surrogate, surrogate_clipped).mean()
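The sign convention is worth tracing with one made-up sample: the code minimizes the negated objective, so taking torch.max over the negated terms is the same as taking the pessimistic min in the original PPO objective. A plain-Python sketch:

```python
clip_param = 0.2
advantage = 2.0
ratio = 1.5    # the new policy made this action 50% more likely

surrogate = -advantage * ratio                                       # -3.0
clipped_ratio = min(max(ratio, 1.0 - clip_param), 1.0 + clip_param)  # clamp to [0.8, 1.2]
surrogate_clipped = -advantage * clipped_ratio                       # -2.4
loss = max(surrogate, surrogate_clipped)                             # clipping wins
```

Because the ratio 1.5 exceeds 1 + clip_param, the clipped term dominates: the loss stops rewarding further movement in this direction, which is exactly the trust-region effect PPO is after.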
# Value function loss
if self.use_clipped_value_loss:
value_clipped = target_values_batch + (value_batch - target_values_batch).clamp(-self.clip_param, self.clip_param)
value_losses = (value_batch - returns_batch).pow(2)
value_losses_clipped = (value_clipped - returns_batch).pow(2)
value_loss = torch.max(value_losses, value_losses_clipped).mean()
else:
value_loss = (returns_batch - value_batch).pow(2).mean()
loss = surrogate_loss + self.value_loss_coef * value_loss - self.entropy_coef * entropy_batch.mean()
- surrogate_loss: the policy (surrogate) loss
- self.value_loss_coef * value_loss: the value-network loss
- self.entropy_coef * entropy_batch.mean(): the entropy bonus (subtracted, so higher entropy lowers the loss)

PPO offers two ways to compute value_loss; the clipped variant prevents the critic from changing too much in a single update.

# Gradient step
self.optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(self.actor_critic.parameters(), self.max_grad_norm)
self.optimizer.step()
mean_value_loss += value_loss.item()
mean_surrogate_loss += surrogate_loss.item()
Finally, the losses are averaged over all updates and the rollout buffer is cleared:

num_updates = self.num_learning_epochs * self.num_mini_batches
mean_value_loss /= num_updates
mean_surrogate_loss /= num_updates
self.storage.clear()
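The clipped value loss can likewise be traced with made-up numbers (plain Python; the real code does this elementwise on tensors):

```python
clip_param = 0.2
target_value = 1.0   # old V(s) recorded during the rollout
new_value = 2.0      # the critic's new prediction
ret = 1.5            # the return (regression target)

# Clip the new prediction so it stays within clip_param of the old one
value_clipped = target_value + max(-clip_param, min(clip_param, new_value - target_value))

loss_unclipped = (new_value - ret) ** 2      # 0.25
loss_clipped = (value_clipped - ret) ** 2    # ~0.09
value_loss = max(loss_unclipped, loss_clipped)
```

Taking the max keeps the larger (more pessimistic) error, so the critic cannot escape a penalty simply by jumping far from its previous estimate.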
This article walked through the Python implementation of the PPO algorithm in the rsl_rl repository: from hyperparameter initialization, the rollout buffer, action sampling, and environment-feedback handling, to advantage computation and the full policy-update loop. The core mechanisms include probability-ratio clipping, GAE advantage estimation, value-function clipping, gradient clipping to prevent exploding gradients, and an optional adaptive learning rate with KL control. Combining the policy loss, value loss, and entropy bonus yields the complete optimization objective, enabling stable and efficient reinforcement learning control for quadruped robots.