Performance Testing of ops-transformer Operators on the Ascend 910B NPU, Compared with PyTorch | 极客日志


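This article runs performance tests of the ops-transformer operators across a range of scenarios on an Ascend 910B NPU with a CANN environment. The tests vary batch size, sequence length, and number of attention heads, and compare against a native PyTorch attention implementation. The author reports that ops-transformer improves latency, throughput, and device memory usage, most visibly in the more demanding scenarios, and presents it as an efficient option for LLM training and inference.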

zhang · Published 2026/3/24 · Updated 2026/6


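Preface

Working on an Ascend 910B NPU with CANN 8.2 or later, this article walks through the full deployment and performance validation of the ops-transformer operators. It covers environment configuration, dependency installation, multi-scenario performance tests, and a comparison against native PyTorch attention. The aim is to show where ops-transformer helps on latency, throughput, and device memory, and to give a practical reference for using these operators in LLM training and inference and other NLP workloads.

Prerequisites

Development environment setup

Make sure the compute resource is configured as an NPU type, so that AI operators run on the Ascend chip's dedicated compute. The suggested hardware configuration is:

NPU hardware: NPU basic · 1 * NPU 910B
CPU: 64GB
Container image: ubuntu22.04-py3.11-cann8.2.rc1-sglang-main-notebook (or a compatible version)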

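Installing environment dependencies

On Ubuntu 22.04 with Python 3.11 and CANN 8.2, Python, GCC, and CMake are usually preinstalled. To confirm them manually or fill in anything missing, check against the following version requirements:

Python >= 3.7.0
GCC >= 7.3.0
CMake >= 3.16.0
pigz (optional; >= 2.4 recommended, speeds up packaging)
dos2unix
Gawk
googletest (needed only to run unit tests; release-1.11.0 recommended)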
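
The minimums above can be checked mechanically once a version string is in hand. The following is a minimal sketch, assuming versions are dotted numeric strings; `parse_version` and `meets_minimum` are hypothetical helper names introduced here for illustration and are not part of ops-transformer or CANN.

```python
import re

# Minimum versions taken from the requirement list above.
MIN_VERSIONS = {
    "python": "3.7.0",
    "gcc": "7.3.0",
    "cmake": "3.16.0",
    "pigz": "2.4",
}


def parse_version(text):
    """Turn '3.22.1' (or '11.4.0-1ubuntu1') into a tuple of ints such as (3, 22, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no leading version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, required):
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(found), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    print(meets_minimum("3.22.1", MIN_VERSIONS["cmake"]))
    print(meets_minimum("3.9", MIN_VERSIONS["cmake"]))
```

Comparing integer tuples rather than raw strings matters here: as strings, "3.9" sorts after "3.16", which would wrongly accept an old CMake.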

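The project root provides an install_deps.sh script that installs the dependencies in one step. If the script does not support your system, set them up manually in this order:

Install Gawk
Install dos2unix
Install zlib (a dependency of pigz)
Install pigz
Install googletest
Configure environment variables

Verifying the environment dependencies

Run the following script to quickly check that the key toolchain components are in place: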

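import os
import subprocess
import sys


def get_command_version(cmd, version_pattern):
    """Run a command and return the first output line containing version_pattern."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        for line in result.stdout.splitlines():
            if version_pattern in line:
                return line.strip()
        return "unknown version"
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "not installed"


def main():
    # NOTE: the string literals of this function were lost when the page was
    # extracted. The commands, match patterns, field indices, and googletest
    # paths below are reconstructed assumptions, not the author's exact values.

    # Python
    python_version = sys.version.split()[0]
    print(f"Python: {python_version}")

    # GCC: the version is the last field of the first line of `gcc --version`
    gcc_version = get_command_version(["gcc", "--version"], "gcc")
    gcc_ver = gcc_version.split()[-1] if "gcc" in gcc_version else gcc_version
    print(f"GCC: {gcc_ver}")

    # CMake: "cmake version X.Y.Z"
    cmake_version = get_command_version(["cmake", "--version"], "cmake version")
    cmake_ver = cmake_version.split()[2] if len(cmake_version.split()) >= 3 else cmake_version
    print(f"CMake: {cmake_ver}")

    # pigz: "pigz X.Y"
    pigz_version = get_command_version(["pigz", "--version"], "pigz")
    pigz_ver = pigz_version.split()[1] if "pigz" in pigz_version else pigz_version
    print(f"pigz: {pigz_ver}")

    # dos2unix: "dos2unix X.Y.Z (date)"
    dos2unix_version = get_command_version(["dos2unix", "--version"], "dos2unix")
    dos2unix_ver = dos2unix_version.split()[1] if "dos2unix" in dos2unix_version else dos2unix_version
    print(f"dos2unix: {dos2unix_ver}")

    # Gawk: "GNU Awk X.Y.Z, API: ..."
    gawk_version = get_command_version(["gawk", "--version"], "GNU Awk")
    gawk_ver = gawk_version.split()[2].rstrip(",") if len(gawk_version.split()) >= 3 else gawk_version
    print(f"Gawk: {gawk_ver}")

    # googletest: placeholder paths assuming a per-user install prefix
    gtest_inc = os.path.expanduser("~/.local/include/gtest")
    gtest_lib = os.path.expanduser("~/.local/lib/libgtest.a")
    if os.path.exists(gtest_inc) and os.path.exists(gtest_lib):
        print("googletest: installed")
    else:
        print("googletest: not found (only needed for unit tests)")


if __name__ == "__main__":
    main()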


Environment preparation and configuration

Downloading the community edition CANN packages

# Download the CANN Toolkit and the ops-legacy package
wget https://ascend-cann.obs.cn-north-4.myhuaweicloud.com/CANN/community/8.5.0.alpha001/Ascend-cann-toolkit_8.5.0.alpha001_linux-aarch64.run
wget https://ascend-cann.obs.cn-north-4.myhuaweicloud.com/CANN/community/8.5.0.alpha001/cann-910b-ops-legacy_8.5.0.alpha001_linux-aarch64.run

# Download the ops-math package
wget https://ascend-cann.obs.cn-north-4.myhuaweicloud.com/CANN/community/cann-910b-ops-math_8.3.RC1_linux-aarch64.run

Installing and deploying the community edition CANN

# Make all three packages executable
chmod +x Ascend-cann-toolkit_8.5.0.alpha001_linux-aarch64.run \
       cann-910b-ops-legacy_8.5.0.alpha001_linux-aarch64.run \
       cann-910b-ops-math_8.3.RC1_linux-aarch64.run

# 1. Install the CANN Toolkit
./Ascend-cann-toolkit_8.5.0.alpha001_linux-aarch64.run --full --force --install-path=$HOME/.local/Ascend

# 2. Install the ops-legacy package
./cann-910b-ops-legacy_8.5.0.alpha001_linux-aarch64.run --full --install-path=$HOME/.local/Ascend

# 3. Install the ops-math package
./cann-910b-ops-math_8.3.RC1_linux-aarch64.run --full --install-path=$HOME/.local/Ascend

Environment variable configuration

# Reload the shell configuration
source ~/.bashrc

# Actual install paths (adjust to match your installation)
TOOLKIT_ROOT="$HOME/.local/Ascend/8.5.0.alpha001"
MATH_ROOT="$HOME/.local/Ascend/8.3.RC1"

# Core environment variables
export PATH="$TOOLKIT_ROOT/bin:$PATH"
export LD_LIBRARY_PATH="$TOOLKIT_ROOT/lib64:$TOOLKIT_ROOT/opp_legacy/lib64:$MATH_ROOT/ops_math/lib64:$LD_LIBRARY_PATH"
export PYTHONPATH="$TOOLKIT_ROOT/python/site-packages:$PYTHONPATH"
export ASCEND_HOME="$TOOLKIT_ROOT"
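
Because the install paths above must be adjusted by hand, a wrong directory in one of these variables is easy to miss until a library fails to load. The following is a minimal sketch of a sanity check, assuming only that each variable is a standard path list; `missing_path_entries` is a hypothetical helper name, not a CANN tool.

```python
import os


def missing_path_entries(env, names=("PATH", "LD_LIBRARY_PATH", "PYTHONPATH")):
    """Return {variable: [entries that are not existing directories]}."""
    missing = {}
    for name in names:
        # Split on the platform separator (":" on Linux) and skip empty entries.
        entries = [p for p in env.get(name, "").split(os.pathsep) if p]
        bad = [p for p in entries if not os.path.isdir(p)]
        if bad:
            missing[name] = bad
    return missing


if __name__ == "__main__":
    for name, entries in missing_path_entries(os.environ).items():
        for entry in entries:
            print(f"{name}: missing directory {entry}")
```

Running it in the shell where the exports were applied lists every path entry that does not exist, which points directly at a mistyped TOOLKIT_ROOT or MATH_ROOT.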

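Installing the ops-transformer project and building dependencies

# Clone the project source
git clone https://gitcode.com/cann/ops-transformer.git
cd ops-transformer

# Install the dependencies listed in the root requirements.txt
pip3 install -r requirements.txt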

ops-transformer performance test

Preparing the test script

import sys
import time

import torch
import torch_npu  # noqa: F401  Ascend adapter; importing it registers the torch.npu backend

# ==================== Basic configuration ====================
DEVICE = 0
WARMUP_TIMES = 20  # warm-up iterations, to exclude NPU cold-start overhead
TEST_TIMES = 100   # timed iterations; results are averaged for stability
TORCH_VERSION = torch.__version__
NPU_AVAILABLE = torch.npu.is_available()

# ==================== Multi-scenario test configuration ====================
# Format: (batch_size, seq_len, num_heads, head_dim)
TEST_CONFIGS = [
    (4, 256, 4, 64),      # small (basic validation)
    (8, 512, 8, 64),      # medium (common baseline LLM configuration)
    (4, 1024, 8, 64),     # long sequence (scaling with long text)
    (16, 256, 8, 64),     # large batch (high-concurrency throughput)
    (8, 512, 16, 64),     # many attention heads (parallel compute)
    (2, 2048, 8, 64),     # very long sequence (extreme case)
    (32, 128, 4, 64),     # very large batch (extreme concurrency)
]

# ==================== Core test functions ====================
def ascend_flash_attention(query, key, value, mask):
    """Attention through PyTorch's scaled_dot_product_attention, which the
    article uses as the ops-transformer-optimized path on Ascend."""
    return torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=mask, dropout_p=0.0, is_causal=False
    )

def benchmark(config):
    """Benchmark a single scenario."""
    batch, seq_len, heads, head_dim = config

    # Build the NPU input tensors in (batch, heads, seq_len, head_dim) layout
    query = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    key = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    value = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    mask = torch.ones(batch, 1, seq_len, seq_len, dtype=torch.bool).npu()

    # Warm-up phase
    for _ in range(WARMUP_TIMES):
        ascend_flash_attention(query, key, value, mask)
    torch.npu.synchronize()

    # Reset the peak-memory statistics
    torch.npu.reset_peak_memory_stats()

    # Timed phase; synchronize before stopping the clock so queued kernels are counted
    start_time = time.perf_counter()
    for _ in range(TEST_TIMES):
        ascend_flash_attention(query, key, value, mask)
    torch.npu.synchronize()
    total_time = time.perf_counter() - start_time

    # Core metrics
    avg_latency = (total_time / TEST_TIMES) * 1000          # ms per call
    throughput = (TEST_TIMES * batch) / total_time          # samples per second
    peak_memory = torch.npu.max_memory_allocated() / 1024 / 1024  # MB

    return avg_latency, throughput, peak_memory

# ==================== Main program ====================
if __name__ == "__main__":
    print("=" * 90)
    print("📋 ops-transformer full performance test report")
    print("=" * 90)
    print("[Environment]")
    print(f" PyTorch version: {TORCH_VERSION}")
    print(f" NPU available: {'✅' if NPU_AVAILABLE else '❌'}")
    print(f" Test device: NPU-{DEVICE}")
    print(f" Warm-up iterations: {WARMUP_TIMES} | Timed iterations: {TEST_TIMES}")
    print(" Data precision: float32")
    print("=" * 90)

    if not NPU_AVAILABLE:
        print("❌ Error: the NPU environment is not ready")
        sys.exit(1)

    # Make DEVICE the current device so that .npu() allocates on it
    torch.npu.set_device(DEVICE)

    print(f"\n{'Scenario':<30} {'Avg latency (ms)':<18} {'Throughput (samples/s)':<24} {'Peak memory (MB)':<16}")
    print("-" * 90)

    for idx, config in enumerate(TEST_CONFIGS, 1):
        batch, seq_len, heads, head_dim = config
        scene_name = f"Scenario {idx} (B{batch}, S{seq_len}, H{heads}, D{head_dim})"
        try:
            latency, throughput, memory = benchmark(config)
            print(f"{scene_name:<30} {latency:<18.2f} {throughput:<24.0f} {memory:<16.0f}")
        except Exception as e:
            print(f"{scene_name:<30} {'❌ failed':<18} {'-':<24} {'-':<16}")
            print(f"{'':<30} Error: {str(e)[:60]}...")

    print("=" * 90)
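
The two headline metrics in the script come from a single timed loop, so it may help to see the arithmetic in isolation. The following is a minimal sketch using a stand-in CPU workload instead of an NPU kernel; `summarize` and `time_loop` are hypothetical helper names introduced here for illustration.

```python
import time


def summarize(total_seconds, iterations, batch_size):
    """Derive average latency (ms) and throughput (samples/s) from one timed loop."""
    avg_latency_ms = total_seconds / iterations * 1000.0
    throughput = iterations * batch_size / total_seconds
    return avg_latency_ms, throughput


def time_loop(fn, warmup, iterations):
    """Time `iterations` calls of fn after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Worked example: 100 iterations of batch size 8 taking 0.5 s in total.
    latency_ms, samples_per_s = summarize(0.5, 100, 8)
    print(f"{latency_ms:.2f} ms, {samples_per_s:.0f} samples/s")

    # Stand-in CPU workload. On an NPU the timed region must end with a
    # device synchronization, as the benchmark script does.
    elapsed = time_loop(lambda: sum(range(10_000)), warmup=5, iterations=50)
    print(summarize(elapsed, 50, 8))
```

Note that throughput here is simply batch size divided by average latency in seconds, so the two columns are two views of the same measurement rather than independent results.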

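Test scenario design rationale

Scenario	Configuration (batch, seq_len, heads)	Primary test goal
Small	(4, 256, 4)	Baseline performance (quick validation)
Medium	(8, 512, 8)	Common LLM setting (balanced performance)
Long sequence	(4, 1024, 8)	Scalability (long-text processing)
Large batch	(16, 256, 8)	Throughput (high-concurrency setting)
Many attention heads	(8, 512, 16)	Parallelism (complex-model setting)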
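
One way to see why these scenarios stress the hardware differently: an attention implementation that materializes the full score matrix allocates a (batch, heads, seq_len, seq_len) tensor, which grows linearly with batch and heads but quadratically with sequence length. The following is a rough sketch of that estimate for float32; it is a back-of-the-envelope figure for one intermediate tensor, not a measurement, and `score_matrix_mib` is a hypothetical helper name.

```python
# Scenario configurations from the table above: (batch, seq_len, heads).
SCENARIOS = {
    "small": (4, 256, 4),
    "medium": (8, 512, 8),
    "long sequence": (4, 1024, 8),
    "large batch": (16, 256, 8),
    "many heads": (8, 512, 16),
}

BYTES_PER_FLOAT32 = 4


def score_matrix_mib(batch, seq_len, heads):
    """Size in MiB of one float32 (batch, heads, seq_len, seq_len) score tensor."""
    return batch * heads * seq_len * seq_len * BYTES_PER_FLOAT32 / (1024 * 1024)


if __name__ == "__main__":
    for name, cfg in SCENARIOS.items():
        print(f"{name:<14} {cfg!s:<16} {score_matrix_mib(*cfg):>6.0f} MiB")
```

Under this estimate the long-sequence and many-heads scenarios each need 128 MiB for the score tensor alone, versus 4 MiB for the small scenario. Measured peak memory will be higher, since it also includes the inputs and other intermediates.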

Native PyTorch attention vs ops-transformer: attention performance comparison

Preparing the test script

import math
import sys
import time

import torch
import torch_npu  # noqa: F401  Ascend adapter; importing it registers the torch.npu backend

# ==================== Basic configuration ====================
DEVICE = 0
WARMUP_TIMES = 20
TEST_TIMES = 50
TORCH_VERSION = torch.__version__
NPU_AVAILABLE = torch.npu.is_available()

# ==================== Multi-scenario test configuration ====================
# Format: (batch_size, seq_len, num_heads, head_dim)
TEST_CONFIGS = [
    (4, 256, 4, 64),
    (8, 512, 8, 64),
    (4, 1024, 8, 64),
    (16, 256, 8, 64),
    (8, 512, 16, 64),
]

# ==================== The two attention implementations ====================
def ascend_ops_transformer(query, key, value, mask):
    """Attention through PyTorch's scaled_dot_product_attention, which the
    article uses as the ops-transformer-optimized path on Ascend."""
    return torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=mask, dropout_p=0.0, is_causal=False
    )

def torch_vanilla_attention(query, key, value, mask):
    """Native PyTorch attention built from matmul and softmax (baseline)."""
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where the mask is False are hidden from the softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    attn_weights = torch.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, value)

# ==================== Shared benchmark function ====================
def benchmark(attention_func, config, version_name):
    """Benchmark one implementation on one scenario (version_name is a label only)."""
    batch, seq_len, heads, head_dim = config
    query = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    key = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    value = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    mask = torch.ones(batch, 1, seq_len, seq_len, dtype=torch.bool).npu()

    for _ in range(WARMUP_TIMES):
        attention_func(query, key, value, mask)
    torch.npu.synchronize()
    torch.npu.reset_peak_memory_stats()

    start_time = time.perf_counter()
    for _ in range(TEST_TIMES):
        attention_func(query, key, value, mask)
    torch.npu.synchronize()
    total_time = time.perf_counter() - start_time

    avg_latency = (total_time / TEST_TIMES) * 1000          # ms per call
    throughput = (TEST_TIMES * batch) / total_time          # samples per second
    peak_memory = torch.npu.max_memory_allocated() / 1024 / 1024  # MB
    return avg_latency, throughput, peak_memory

# ==================== Main program ====================
if __name__ == "__main__":
    print("=" * 120)
    print("📋 ops-transformer vs native PyTorch attention: performance comparison")
    print("=" * 120)
    print("[Environment]")
    print(f" PyTorch version: {TORCH_VERSION}")
    print(f" NPU available: {'✅' if NPU_AVAILABLE else '❌'}")
    print(f" Test device: NPU-{DEVICE}")
    print(f" Warm-up iterations: {WARMUP_TIMES} | Timed iterations: {TEST_TIMES}")
    print(" Data precision: float32")
    print("=" * 120)

    if not NPU_AVAILABLE:
        print("❌ Error: the NPU environment is not ready")
        sys.exit(1)

    # Make DEVICE the current device so that .npu() allocates on it
    torch.npu.set_device(DEVICE)

    print(f"\n{'Scenario':<30} {'Version':<24} {'Avg latency (ms)':<18} {'Throughput (samples/s)':<24} {'Peak memory (MB)':<18} {'Speedup (latency)':<10}")
    print("-" * 120)

    for idx, config in enumerate(TEST_CONFIGS, 1):
        batch, seq_len, heads, head_dim = config
        scene_name = f"Scenario {idx} (B{batch}, S{seq_len}, H{heads})"

        # 1. Benchmark native PyTorch attention
        try:
            vanilla_latency, vanilla_throughput, vanilla_memory = benchmark(torch_vanilla_attention, config, "native PyTorch attention")
        except Exception as e:
            vanilla_latency = vanilla_throughput = vanilla_memory = "-"
            vanilla_err = str(e)[:40]

        # 2. Benchmark ops-transformer
        try:
            ops_latency, ops_throughput, ops_memory = benchmark(ascend_ops_transformer, config, "ops-transformer")
            opt_multiple = f"{vanilla_latency / ops_latency:.1f}x" if vanilla_latency != "-" else "-"
        except Exception as e:
            ops_latency = ops_throughput = ops_memory = opt_multiple = "-"
            ops_err = str(e)[:40]

        # Print the baseline row
        if vanilla_latency != "-":
            print(f"{scene_name:<30} {'native PyTorch attention':<24} {vanilla_latency:<18.2f} {vanilla_throughput:<24.0f} {vanilla_memory:<18.0f} {'-':<10}")
        else:
            print(f"{scene_name:<30} {'native PyTorch attention':<24} {'❌ failed':<18} {'-':<24} {'-':<18} {'-':<10}")
            print(f"{'':<30} Error: {vanilla_err}...")

        # Print the optimized row
        if ops_latency != "-":
            print(f"{scene_name:<30} {'ops-transformer':<24} {ops_latency:<18.2f} {ops_throughput:<24.0f} {ops_memory:<18.0f} {opt_multiple:<10}")
        else:
            print(f"{scene_name:<30} {'ops-transformer':<24} {'❌ failed':<18} {'-':<24} {'-':<18} {'-':<10}")
            print(f"{'':<30} Error: {ops_err}...")

    print("=" * 120)
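
A speed comparison is only meaningful if both implementations compute the same thing. As an illustration of the formula the baseline implements, softmax(QK^T / sqrt(d)) V with an optional boolean mask, here is a dependency-free reference for a single head on small lists. It is a sketch for checking the logic by hand, not the article's code; `softmax` and `reference_attention` are names introduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention(query, key, value, mask=None):
    """softmax(Q K^T / sqrt(d)) V for one head; inputs are lists of row vectors.

    mask[i][j] == False hides key j from query i, mirroring the masked_fill
    step in torch_vanilla_attention.
    """
    d_k = len(query[0])
    output = []
    for i, q in enumerate(query):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in key]
        if mask is not None:
            scores = [s if keep else -1e9 for s, keep in zip(scores, mask[i])]
        weights = softmax(scores)
        output.append([
            sum(w * v[c] for w, v in zip(weights, value))
            for c in range(len(value[0]))
        ])
    return output


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    all_visible = [[True, True], [True, True]]
    # An all-True mask, as used in the benchmark, must not change the result.
    print(reference_attention(q, k, v) == reference_attention(q, k, v, all_visible))
    # Hiding key 1 from query 0 makes output row 0 a copy of value[0].
    print(reference_attention(q, k, v, [[True, False], [True, True]])[0])
```

The benchmark's mask is all ones, so it adds masking work without changing the output. Before trusting the timings, it would be reasonable to run both implementations on identical device tensors and compare their outputs, for example with torch.allclose.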
