Abstract
Entropy regularization is a standard technique for improving exploration in reinforcement learning (RL). In large language models (LLMs), however, it is often ineffective and can even degrade performance. We argue that this failure stems from the cumulative tail risk inherent to LLMs, which arises from their enormous vocabulary sizes and long generation sequences.
In this setting, standard global entropy maximization indiscriminately spreads probability mass over a vast number of invalid tail tokens rather than concentrating it on plausible candidates, thereby disrupting coherent reasoning.
To address this problem, we propose Trust Region Entropy (TRE), a method that encourages the model to explore only within its 'trust region'. Extensive experiments on mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) show that TRE consistently outperforms standard PPO, conventional entropy regularization, and other exploration baselines across all tasks.
Related Work
RL for LLM Alignment
Following the standard Reinforcement Learning from Human Feedback (RLHF) pipeline (Ouyang et al., 2022), models initially trained via supervised fine-tuning are further optimized using algorithms such as Proximal Policy Optimization (PPO) (Schulman et al., 2017) to maximize non-differentiable reward signals. This paradigm has proven effective across various domains, from improving helpfulness and safety (Bai et al., 2022) to enhancing mathematical reasoning capabilities (Guo et al., 2025; Yu et al., 2025).
Entropy Regularization
Entropy regularization is a cornerstone technique in modern RL, encouraging exploration by adding an entropy bonus to the policy objective.
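Concretely, the regularizer augments the policy objective with a bonus term β·H(π). A minimal sketch in plain Python (the coefficient β and the toy distributions are illustrative choices, not taken from any cited work):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy H(p) = -sum p log p (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def regularized_objective(pg_objective, probs, beta=0.01):
    """Policy objective plus an entropy bonus beta * H(pi)."""
    return pg_objective + beta * entropy(probs)

# A peaked distribution has low entropy; a flat one has high entropy,
# so maximizing the bonus pushes the policy toward flatter, more
# exploratory outputs.
peaked = softmax([5.0, 0.0, 0.0, 0.0])
flat = softmax([0.0, 0.0, 0.0, 0.0])
assert entropy(flat) > entropy(peaked)
assert abs(entropy(flat) - math.log(4)) < 1e-9
```

Over a 50k-token LLM vocabulary, the same bonus rewards spreading mass across the entire tail, which is precisely the failure mode discussed in the abstract.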
While highly effective in low-dimensional continuous control, naive entropy maximization proves problematic for LLMs due to massive vocabulary sizes (Cui et al., 2025).
To mitigate this, contemporaneous works have proposed selective constraint mechanisms.
For instance, Wang et al. (2025) propose Forking-Tokens, which restricts optimization to steps with high entropy to preserve exploratory potential.
Similarly, Cui et al. (2025) introduce KL-Cov, which identifies steps with high covariance between advantage estimates and log-probabilities, selectively imposing a strong KL penalty on these critical steps to stabilize training dynamics.
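Both mechanisms share a common pattern: regularize only a selected subset of token positions rather than the whole sequence. The sketch below illustrates that selection pattern in a simplified form; the quantile threshold and function names are our own illustration, not the implementations from either paper:

```python
import math

def token_entropy(probs):
    """Shannon entropy of one per-step token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_entropy_mask(per_step_dists, quantile=0.8):
    """Mark the generation steps whose entropy exceeds a quantile threshold.

    Selective schemes in this spirit apply their constraint (or bonus)
    only at these steps; the 0.8 quantile is an illustrative choice.
    """
    ents = [token_entropy(p) for p in per_step_dists]
    threshold = sorted(ents)[int(quantile * (len(ents) - 1))]
    return [e > threshold for e in ents]

# Three generation steps: two confident, one uncertain (high entropy).
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.90, 0.05, 0.03, 0.02],  # confident
]
mask = high_entropy_mask(dists)
assert mask == [False, True, False]
```

Only the maximally uncertain step is selected, so regularization pressure is concentrated where the model genuinely faces a choice.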
Trust Region
The concept of a Trust Region is foundational to stable optimization in reinforcement learning.
In policy-gradient methods we are, at heart, solving an optimization problem, but we face the following issues:
- If a single update step is too large, the policy distribution changes drastically
- Importance-sampling ratios can explode
- Training becomes unstable or even collapses
These problems are especially acute in LLM RL, where the policy is a softmax over a ~50k-token vocabulary: even a slightly oversized update can derail it.
The core question therefore becomes: how do we ensure that each policy update does not stray too far from the current policy? This is the origin of the trust-region idea.
The evolution from TRPO to PPO is, in essence, a shift from 'theoretically optimal with a hard constraint' to 'practically feasible with an approximate surrogate'.
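PPO's approximation is the clipped surrogate: rather than enforcing an explicit KL constraint, it clips the importance-sampling ratio to [1−ε, 1+ε], removing any gradient incentive to move the policy further. A minimal sketch (ε = 0.2 is a common default; the numeric values are illustrative):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).

    Taking the min with the clipped term removes the incentive to push
    the ratio outside [1-eps, 1+eps], keeping updates near the old policy.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Inside the trust region the surrogate tracks the raw ratio...
assert ppo_clip_objective(1.1, advantage=2.0) == 1.1 * 2.0
# ...but an exploding ratio is capped, so its gradient incentive vanishes.
assert ppo_clip_objective(5.0, advantage=2.0) == 1.2 * 2.0
# With a negative advantage, the min keeps the pessimistic (clipped) value.
assert ppo_clip_objective(0.1, advantage=-2.0) == 0.8 * -2.0
```

This one-line clip is the 'engineering-feasible approximation' that replaced TRPO's explicit constrained optimization.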
TRPO (2015): Trust Region Policy Optimization
TRPO (Schulman et al., 2015) constrains each policy update by enforcing a strict KL-divergence bound on a surrogate objective, which locally approximates the true objective and guarantees monotonic improvement while maintaining stability.
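In its standard form, the constrained problem TRPO solves is

$$\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s,a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta,$$

where $\delta$ bounds the average KL divergence between the old and new policies, defining the trust region within which the surrogate is a reliable approximation.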


