Extrapolation Methods for Handling Inputs Beyond the Context Length in LLM Inference

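This article surveys the main extrapolation methods that let large language models (LLMs) handle inputs longer than their training context at inference time, a problem known as length generalization. The methods covered are: ALiBi, which adds a linear distance penalty to the attention scores so the model can extrapolate directly; Position Interpolation (PI), which rescales positions to fit long sequences but may reduce local resolution; NTK-aware Scaled RoPE and dynamically scaled RoPE, which adjust the base frequency parameter so that high frequencies are extrapolated and low frequencies are interpolated; and YaRN, which combines a flexible base (radix) design with a temperature parameter that reshapes the attention distribution. All of them aim to keep the model robust on very long sequences and to limit metric degradation. In practice, the choice depends on hardware resources, context requirements, and fine-tuning cost.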

Author: 星辰大海 · Published 2025/2/7 · Updated 2026/7/2440 views

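Overview of the main extrapolation methods

Length extrapolation (length generalization) asks how a model pretrained on relatively short sequences can generalize to longer ones at inference time. It remains an open problem for the Transformer architecture. Good long-text extrapolation means that the relevant metrics do not drop sharply when the model is applied to very long sequences, and its behavior stays robust.

The classic approaches to length generalization from the past two years include the following: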

1. ALiBi: direct extrapolation

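ALiBi (Attention with Linear Biases) adds a non-learnable bias to the attention scores after they are computed. The core formula subtracts the product of a distance matrix and a slope coefficient from the attention scores.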
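
Written out for the causal case, the description above corresponds to the following form. The symbols q_i (the query at position i), K (the keys for positions 1 to i), and m (the per-head slope) are notation chosen here for illustration.

```latex
\mathrm{softmax}\!\left( q_i K^{\top} + m \cdot \bigl[ -(i-1),\ \dots,\ -2,\ -1,\ 0 \bigr] \right)
```

The key at distance k from the query receives a penalty of m·k, so the penalty grows linearly with distance and involves no learned parameters.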

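Suppose there are 8 heads. m is a fixed, head-specific slope that is defined in advance rather than learned. For 8 heads the slopes are 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256. A larger slope penalizes distance more strongly, so that head concentrates on nearby tokens; a smaller slope lets the head attend further back. Giving each head a different slope lets the heads cover different ranges.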
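
A minimal sketch of how these slopes and the resulting bias could be computed, in plain Python. The helper names `alibi_slopes` and `alibi_bias` are hypothetical, and the slope formula assumes the number of heads is a power of two.

```python
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for n heads (n assumed to be a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """Causal bias: entry [i][j] is -slope * (i - j) for j <= i, and -inf for future positions."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)          # 1/2, 1/4, ..., 1/256
bias = alibi_bias(4, slopes[0])   # bias matrix for the head with slope 1/2
```

The bias matrix is added to the raw attention scores before the softmax; each head uses its own slope.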

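Because of its linear bias, ALiBi cannot effectively capture long-range dependencies within a single attention layer. Unlike standard self-attention, the bias steers each head toward a local range, so a single layer is better suited to local information. ALiBi can still extrapolate because stacked attention layers extend the receptive field step by step: its ability to perceive distant information depends on network depth, and each additional layer reaches a little further. This ability is limited, however, because the receptive field grows only linearly with the number of layers.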
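
To make the locality concrete, the sketch below measures how much softmax mass lands on the nearest keys when only the ALiBi penalty differs between keys. It assumes all raw attention scores are equal, which is a simplification for illustration, and `local_attention_mass` is a hypothetical helper.

```python
import math


def local_attention_mass(slope: float, seq_len: int, window: int) -> float:
    """Share of softmax mass on the `window` nearest keys, assuming equal raw scores."""
    # With equal raw scores, the weight of a key at distance d is proportional to exp(-slope * d).
    weights = [math.exp(-slope * d) for d in range(seq_len)]
    return sum(weights[:window]) / sum(weights)


steep = local_attention_mass(0.5, seq_len=4096, window=16)        # head with slope 1/2
shallow = local_attention_mass(1 / 256, seq_len=4096, window=16)  # head with slope 1/256
```

Under this assumption the steep head puts almost all of its attention on the 16 nearest tokens, while the shallow head spreads most of its attention beyond them.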

2. Position Interpolation (PI)

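Multiply the positions of the long input seen at inference by the factor Ltrain / Ltest, which scales them back into the training length range. The flow is: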

Training: (1, 2, 3, 4, …, n)
Testing: (1, 2, 3, 4, …, n, …, 2n) -> (0.5, 1, …, n) [achieved through interpolation]
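
The mapping above can be sketched in a few lines of plain Python. `interpolate_positions` is a hypothetical helper, and positions are taken to start at 1 as in the flow above.

```python
from typing import List


def interpolate_positions(test_len: int, train_len: int) -> List[float]:
    """Scale positions 1..test_len by train_len / test_len so they stay within the training range."""
    scale = train_len / test_len
    return [p * scale for p in range(1, test_len + 1)]


# Training length n = 4, test length 2n = 8.
positions = interpolate_positions(test_len=8, train_len=4)
```

The largest position is mapped back to n, but neighboring tokens are now only 0.5 apart instead of 1.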

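Although Position Interpolation (PI) avoids out-of-range positions at long distances, it also compresses the distance between adjacent tokens, which can seriously hurt the model's local resolution and raise perplexity (PPL). Studies nonetheless show that PI still performs well after fine-tuning on ordinary text. Overall, the method amounts to rescaling m in the positional term sin(m · base^{-2i/d}).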
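
As a worked form of this rescaling, the block below writes the RoPE angle before and after interpolation. The symbols θ_i and s are notation introduced here for illustration, with s = Ltrain / Ltest.

```latex
\theta_i = \mathrm{base}^{-2i/d}, \qquad
\sin(m\,\theta_i) \;\longrightarrow\; \sin\!\bigl(s\,m\,\theta_i\bigr), \qquad
s = \frac{L_{\mathrm{train}}}{L_{\mathrm{test}}} \le 1
```

Because s multiplies m and θ_i together, scaling the position m by s gives the same angle as scaling every frequency θ_i by s.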

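# Excerpt in the style of Hugging Face transformers; `_compute_default_rope_parameters`
# is assumed to be defined in the same module as this function.
from typing import Optional, Tuple

import torch
from transformers import PretrainedConfig


def _compute_linear_scaling_rope_parameters(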
    config: Optional[PretrainedConfig] = None,
    device: Optional["torch.device"] = None,
    seq_len: Optional[int] = None,
    **rope_kwargs,
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with linear scaling.
    Credits to the Reddit user /u/kaiokendev
    Args:
        config ([`~transformers.PretrainedConfig`]):
            The model configuration.
        device (`torch.device`):
            The device to use for initialization of the inverse frequencies.
        seq_len (`int`, *optional*):
            The current sequence length. Unused for this type of RoPE.
        rope_kwargs (`Dict`, *optional*):
            BC compatibility with the previous RoPE class instantiation.
    Returns:
        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies.
    """
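    if config is not None and len(rope_kwargs) > 0:
        raise ValueError(
            "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
            f"`_compute_linear_scaling_rope_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
        )
    if len(rope_kwargs) > 0:
        factor = rope_kwargs["factor"]
    elif config is not None:
        factor = config.rope_scaling["factor"]

    # Get the default (unscaled) RoPE inverse frequencies.
    inv_freq, attention_factor = _compute_default_rope_parameters(config, device, seq_len, **rope_kwargs)

    # Apply linear scaling to the frequencies. The rotation angle is the product of
    # position and inverse frequency, so dividing the inverse frequencies by `factor`
    # is equivalent to dividing the position ids by `factor`.
    inv_freq /= factor
    return inv_freq, attention_factor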
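
The function divides the inverse frequencies by `factor` instead of touching the positions. The sketch below checks, in plain Python without torch, that the two choices give the same rotation angles. `rope_inv_freq` and `angles` are hypothetical helpers, and the values of `dim`, `factor`, and `position` are arbitrary.

```python
import math
from typing import List


def rope_inv_freq(dim: int, base: float = 10000.0) -> List[float]:
    """Default RoPE inverse frequencies base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-(2 * i) / dim) for i in range(dim // 2)]


def angles(position: float, inv_freq: List[float]) -> List[float]:
    """Rotation angle of each frequency pair at the given position."""
    return [position * f for f in inv_freq]


dim, factor, position = 8, 4.0, 6000
inv_freq = rope_inv_freq(dim)

scaled_pos = angles(position / factor, inv_freq)                 # PI as described: scale the position
scaled_freq = angles(position, [f / factor for f in inv_freq])   # the code above: scale inv_freq

same = all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(scaled_pos, scaled_freq))
```

With `factor = 4`, position 6000 produces the same angles as position 1500 would in the unscaled model.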


4. YaRN method

import math


def _compute_yarn_parameters(
    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None, **rope_kwargs
) -> Tuple["torch.Tensor", float]:
    """
    Computes the inverse frequencies with NTK scaling. Please refer to the original paper.
    """
    if len(rope_kwargs) > 0:
        raise ValueError(
            f"Unexpected arguments: `**rope_kwargs` should be unset in `_compute_yarn_parameters`, got {rope_kwargs}"
        )

    base = config.rope_theta
    partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
    dim = int(head_dim * partial_rotary_factor)
    max_position_embeddings = config.max_position_embeddings
    factor = config.rope_scaling["factor"]

    # Attention (temperature) factor: use the configured value, else 0.1 * ln(factor) + 1.
    attention_factor = config.rope_scaling.get("attention_factor")
    if attention_factor is None:
        attention_factor = 0.1 * math.log(factor) + 1.0

    # Optional config options; default to 32 and 1.
    beta_fast = config.rope_scaling.get("beta_fast") or 32
    beta_slow = config.rope_scaling.get("beta_slow") or 1

    def find_correction_dim(num_rotations, dim, base, max_position_embeddings):
        # Dimension index whose frequency completes `num_rotations` rotations over the context.
        return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))

    def find_correction_range(low_rot, high_rot, dim, base, max_position_embeddings):
        # Bounds of the dimension range over which the two frequency sets are blended.
        low = math.floor(find_correction_dim(low_rot, dim, base, max_position_embeddings))
        high = math.ceil(find_correction_dim(high_rot, dim, base, max_position_embeddings))
        return max(low, 0), min(high, dim - 1)

    def linear_ramp_factor(min_val, max_val, dim):
        if min_val == max_val:
            max_val += 0.001  # avoid division by zero

        linear_func = (torch.arange(dim, dtype=torch.float32) - min_val) / (max_val - min_val)
        ramp_func = torch.clamp(linear_func, 0, 1)
        return ramp_func

    # Unscaled (extrapolation) and fully scaled (interpolation) inverse frequencies.
    pos_freqs = base ** (torch.arange(0, dim, 2).float().to(device) / dim)
    inv_freq_extrapolation = 1.0 / pos_freqs
    inv_freq_interpolation = 1.0 / (factor * pos_freqs)

    low, high = find_correction_range(beta_fast, beta_slow, dim, base, max_position_embeddings)

    # Per-dimension weight: 1 keeps the original frequency, 0 uses the interpolated one.
    inv_freq_extrapolation_factor = 1 - linear_ramp_factor(low, high, dim // 2).float().to(device)
    inv_freq = (
        inv_freq_interpolation * (1 - inv_freq_extrapolation_factor)
        + inv_freq_extrapolation * inv_freq_extrapolation_factor
    )

    return inv_freq, attention_factor
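
To see what the blend does for concrete numbers, here is a small re-implementation of the same arithmetic in plain Python, without torch. The configuration values (`dim = 128`, `base = 10000`, `max_pos = 4096`, `factor = 4`) are assumptions chosen for illustration, not taken from any particular model.

```python
import math
from typing import List, Tuple


def find_correction_dim(num_rotations: float, dim: int, base: float, max_pos: int) -> float:
    return (dim * math.log(max_pos / (num_rotations * 2 * math.pi))) / (2 * math.log(base))


def find_correction_range(low_rot: float, high_rot: float, dim: int, base: float, max_pos: int) -> Tuple[int, int]:
    low = math.floor(find_correction_dim(low_rot, dim, base, max_pos))
    high = math.ceil(find_correction_dim(high_rot, dim, base, max_pos))
    return max(low, 0), min(high, dim - 1)


def yarn_inv_freq(dim: int, base: float, max_pos: int, factor: float,
                  beta_fast: float = 32, beta_slow: float = 1) -> Tuple[List[float], List[float], Tuple[int, int]]:
    """Return blended inverse frequencies, per-dimension extrapolation weights, and the (low, high) range."""
    low, high = find_correction_range(beta_fast, beta_slow, dim, base, max_pos)
    span = (high - low) if high != low else 0.001  # same zero-division guard as the code above
    inv_freq, weights = [], []
    for i in range(dim // 2):
        ramp = min(max((i - low) / span, 0.0), 1.0)   # 0 below `low`, 1 above `high`
        extrapolation_weight = 1.0 - ramp
        freq = base ** (-(2 * i) / dim)                # unscaled inverse frequency
        inv_freq.append((freq / factor) * ramp + freq * extrapolation_weight)
        weights.append(extrapolation_weight)
    return inv_freq, weights, (low, high)


dim, base, max_pos, factor = 128, 10000.0, 4096, 4.0
inv_freq, weights, (low, high) = yarn_inv_freq(dim, base, max_pos, factor)
attention_factor = 0.1 * math.log(factor) + 1.0
```

Dimensions below `low` keep their original high frequencies (extrapolation), dimensions above `high` are divided by `factor` (interpolation), and the ones in between are mixed linearly.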
