OpenSpiel 进阶教程：用 C++ 与 Python 实现自定义博弈算法

综述由AI生成介绍 OpenSpiel 框架下使用 C++ 和 Python 实现自定义博弈算法的方法。内容涵盖核心架构（信息状态、策略、价值函数），Python 端基于 JAX 的 LOLA 算法实现，以及 C++ 端经典 CFR 算法的实现细节。文章提供了关键代码示例，包括策略网络定义、更新逻辑及后悔值计算，并简述了调试可视化和评估优化的步骤，旨在帮助开发者快速掌握博弈算法开发技巧。

宁静发布于 2026/3/25更新于 2026/5/616 浏览

OpenSpiel 进阶教程：用 C++ 与 Python 实现自定义博弈算法

OpenSpiel 是一个强大的博弈算法研究框架，提供了丰富的环境和算法支持。本文将带你深入了解如何在 OpenSpiel 中使用 C++ 和 Python 实现自定义博弈算法，从基础架构到实际代码示例，助你快速掌握博弈算法开发技巧。

🎮 自定义博弈算法的核心架构

在开始编写代码前，我们需要理解 OpenSpiel 中博弈算法的基本架构。OpenSpiel 将博弈问题抽象为信息状态（Information State） 和策略（Policy） 的交互，算法通过优化策略来最大化预期收益。

核心组件解析

信息状态（InfoState）：包含玩家当前可观察的所有信息，用于决策
策略（Policy）：将信息状态映射为动作概率分布
价值函数（Value Function）：估计特定状态的预期收益
后悔值匹配（Regret Matching）：通过累积后悔值更新策略的经典方法

🐍 Python 实现：基于 JAX 的 LOLA 算法

Python 接口适合快速原型开发，OpenSpiel 提供了 JAX 和 PyTorch 等深度学习框架的集成。以下是基于 JAX 实现 LOLA（Learning with Opponent-Learning Awareness）算法的关键步骤：

1. 定义策略网络

def get_policy_network(num_actions):
    def network(inputs):
        h = hk.Linear(64)(inputs)
        h = jax.nn.relu(h)
        logits = hk.Linear(num_actions)(h)
        return distrax.Categorical(logits=logits)
    return hk.Transformed(network)

2. 实现 LOLA 更新逻辑

LOLA 算法通过考虑对手策略更新来优化自身策略，核心代码如下：

def get_lola_update_fn(agent_id, policy_network, optimizer, pi_lr=0.001, lola_weight=1.0):
    def loss_fn(params, batch):
        # 计算策略梯度损失
        logits = vmap(lambda s: policy_network.apply(params, s).logits)(batch.info_state)
        adv = batch.returns - batch.values
        return vmap(rlax.policy_gradient_loss)(logits, batch.action, adv).mean()

    def update(train_state, batch):
        # 基础策略梯度更新
        loss, grads = jax.value_and_grad(loss_fn)(train_state.policy_params[agent_id], batch)
        # LOLA 修正项计算
        correction = lola_correction(train_state, batch)
        grads = jax.tree_map(lambda g, c: g - lola_weight * c, grads, correction)
        # 应用梯度更新
        updates, opt_state = optimizer(grads, train_state.policy_opt_states[agent_id])
        policy_params = optax.apply_updates(train_state.policy_params[agent_id], updates)
        return TrainState(...), {'loss': loss}
    return update

3. 运行训练循环

# 初始化环境和智能体
env = rl_environment.Environment("kuhn_poker")
agent = OpponentShapingAgent(
    player_id=0,
    opponent_ids=[1],
    info_state_size=env.observation_spec()["info_state"][0],
    num_actions=env.action_spec()["num_actions"],
    policy=get_policy_network(env.action_spec()["num_actions"]),
    correction_type="lola"
)

# 训练循环
for _ in range(1000):
    time_step = env.reset()
    while not time_step.last():
        agent_output = agent.step(time_step)
        time_step = env.step([agent_output.action])

🚀 C++ 实现：经典 CFR 算法

C++ 实现适合追求高性能的场景，OpenSpiel 核心算法如 CFR（Counterfactual Regret Minimization）均采用 C++ 编写。以下是 CFR 算法的关键实现：

1. 信息状态价值存储

struct CFRInfoStateValues {
    std::vector<Action> legal_actions;
    std::vector<double> cumulative_regrets; // 累积后悔值
    std::vector<double> cumulative_policy; // 累积策略
    std::vector<double> current_policy; // 当前策略
};

2. 后悔值匹配更新

// 应用后悔值匹配更新策略
void CFRInfoStateValues::ApplyRegretMatching() {
    double sum_positive_regrets = 0.0;
    for (int aidx = 0; aidx < num_actions(); ++aidx) {
        if (cumulative_regrets[aidx] > 0) {
            sum_positive_regrets += cumulative_regrets[aidx];
        }
    }
    for (int aidx = 0; aidx < num_actions(); ++aidx) {
        current_policy[aidx] = (sum_positive_regrets > 0) ? std::max(cumulative_regrets[aidx], 0.0) / sum_positive_regrets : 1.0 / legal_actions.size();
    }
}

3. 反事实后悔值计算

// 递归计算反事实价值和后悔值
std::vector<double> CFRSolverBase::ComputeCounterFactualRegret(
    const State& state, const absl::optional<int>& alternating_player,
    const std::vector<double>& reach_probabilities) {
    if (state.IsTerminal()) return state.Returns();
    int current_player = state.CurrentPlayer();
    std::string info_state = state.InformationStateString(current_player);
    std::vector<Action> legal_actions = state.LegalActions();
    // 获取当前策略
    std::vector<double> policy = GetPolicy(info_state, legal_actions);
    // 计算子节点价值
    std::vector<double> child_values;
    std::vector<double> state_value(game_->NumPlayers(), 0.0);
    for (int aidx = 0; aidx < legal_actions.size(); ++aidx) {
        auto child = state.Child(legal_actions[aidx]);
        auto child_reach = reach_probabilities;
        child_reach[current_player] *= policy[aidx];
        auto child_val = ComputeCounterFactualRegret(*child, alternating_player, child_reach);
        for (int i = 0; i < game_->NumPlayers(); ++i) {
            state_value[i] += policy[aidx] * child_val[i];
        }
        child_values.push_back(child_val[current_player]);
    }
    // 更新后悔值
    if (!alternating_player || *alternating_player == current_player) {
        double cfr_reach = CounterFactualReachProb(reach_probabilities, current_player);
        auto& is_vals = info_states_[info_state];
        for (int aidx = 0; aidx < legal_actions.size(); ++aidx) {
            is_vals.cumulative_regrets[aidx] += cfr_reach * (child_values[aidx] - state_value[current_player]);
            is_vals.cumulative_policy[aidx] += reach_probabilities[current_player] * policy[aidx];
        }
    }
    return state_value;
}

🔍 算法调试与可视化

OpenSpiel 提供了丰富的工具帮助调试和可视化博弈算法：

博弈树可视化

Kuhn Poker 的博弈树结构展示了信息状态之间的转换关系。

多群体博弈分析

通过矩阵可视化多群体博弈的均衡状态。

📝 实现步骤总结

问题分析：确定博弈类型（零和/非零和、完美/不完美信息）
算法选择：根据问题特性选择 CFR、LOLA 等合适算法
策略实现：
- Python：继承 rl_agent.AbstractAgent 类
- C++：实现 Policy 接口和价值更新逻辑
评估与优化：使用 evaluate_bots 工具评估性能，调整超参数

环境搭建：

git clone <官方仓库地址>
cd open_spiel && ./install.sh

📚 进阶资源

官方文档：查看项目文档中的 algorithms.md
算法示例：参考 open_spiel/examples/目录
测试代码：查阅 open_spiel/algorithms/cfr_test.cc

通过本文的指导，你已经掌握了在 OpenSpiel 中实现自定义博弈算法的核心方法。无论是基于 Python 的快速原型开发，还是 C++ 的高性能实现，OpenSpiel 都提供了灵活而强大的支持。

OpenSpiel 进阶教程：用 C++ 与 Python 实现自定义博弈算法

宁静发布于 2026/3/25更新于 2026/5/616 浏览

def get_policy_network(num_actions): def network(inputs): h = hk.Linear(64)(inputs) h = jax.nn.relu(h) logits = hk.Linear(num_actions)(h) return distrax.Categorical(logits=logits) return hk.Transformed(network)

def get_lola_update_fn(agent_id, policy_network, optimizer, pi_lr=0.001, lola_weight=1.0): def loss_fn(params, batch): # 计算策略梯度损失 logits = vmap(lambda s: policy_network.apply(params, s).logits)(batch.info_state) adv = batch.returns - batch.values return vmap(rlax.policy_gradient_loss)(logits, batch.action, adv).mean() def update(train_state, batch): # 基础策略梯度更新 loss, grads = jax.value_and_grad(loss_fn)(train_state.policy_params[agent_id], batch) # LOLA 修正项计算 correction = lola_correction(train_state, batch) grads = jax.tree_map(lambda g, c: g - lola_weight * c, grads, correction) # 应用梯度更新 updates, opt_state = optimizer(grads, train_state.policy_opt_states[agent_id]) policy_params = optax.apply_updates(train_state.policy_params[agent_id], updates) return TrainState(...), {'loss': loss} return update

# 初始化环境和智能体 env = rl_environment.Environment("kuhn_poker") agent = OpponentShapingAgent( player_id=0, opponent_ids=[1], info_state_size=env.observation_spec()["info_state"][0], num_actions=env.action_spec()["num_actions"], policy=get_policy_network(env.action_spec()["num_actions"]), correction_type="lola" ) # 训练循环 for _ in range(1000): time_step = env.reset() while not time_step.last(): agent_output = agent.step(time_step) time_step = env.step([agent_output.action])

struct CFRInfoStateValues { std::vector<Action> legal_actions; std::vector<double> cumulative_regrets; // 累积后悔值 std::vector<double> cumulative_policy; // 累积策略 std::vector<double> current_policy; // 当前策略 };

// 应用后悔值匹配更新策略 void CFRInfoStateValues::ApplyRegretMatching() { double sum_positive_regrets = 0.0; for (int aidx = 0; aidx < num_actions(); ++aidx) { if (cumulative_regrets[aidx] > 0) { sum_positive_regrets += cumulative_regrets[aidx]; } } for (int aidx = 0; aidx < num_actions(); ++aidx) { current_policy[aidx] = (sum_positive_regrets > 0) ? std::max(cumulative_regrets[aidx], 0.0) / sum_positive_regrets : 1.0 / legal_actions.size(); } }

// 递归计算反事实价值和后悔值 std::vector<double> CFRSolverBase::ComputeCounterFactualRegret( const State& state, const absl::optional<int>& alternating_player, const std::vector<double>& reach_probabilities) { if (state.IsTerminal()) return state.Returns(); int current_player = state.CurrentPlayer(); std::string info_state = state.InformationStateString(current_player); std::vector<Action> legal_actions = state.LegalActions(); // 获取当前策略 std::vector<double> policy = GetPolicy(info_state, legal_actions); // 计算子节点价值 std::vector<double> child_values; std::vector<double> state_value(game_->NumPlayers(), 0.0); for (int aidx = 0; aidx < legal_actions.size(); ++aidx) { auto child = state.Child(legal_actions[aidx]); auto child_reach = reach_probabilities; child_reach[current_player] *= policy[aidx]; auto child_val = ComputeCounterFactualRegret(*child, alternating_player, child_reach); for (int i = 0; i < game_->NumPlayers(); ++i) { state_value[i] += policy[aidx] * child_val[i]; } child_values.push_back(child_val[current_player]); } // 更新后悔值 if (!alternating_player || *alternating_player == current_player) { double cfr_reach = CounterFactualReachProb(reach_probabilities, current_player); auto& is_vals = info_states_[info_state]; for (int aidx = 0; aidx < legal_actions.size(); ++aidx) { is_vals.cumulative_regrets[aidx] += cfr_reach * (child_values[aidx] - state_value[current_player]); is_vals.cumulative_policy[aidx] += reach_probabilities[current_player] * policy[aidx]; } } return state_value; }

OpenSpiel 进阶教程：用 C++ 与 Python 实现自定义博弈算法