昇腾 910B NPU 平台 ops-transformer 算子全场景性能测试与 PyTorch 注意力对比 | 极客日志

PythonAI算法

昇腾 910B NPU 平台 ops-transformer 算子全场景性能测试与 PyTorch 注意力对比

综述由AI生成基于昇腾 910B NPU 平台，演示了 ops-transformer 算子的部署与全场景性能测试。通过对比原生 PyTorch 注意力机制，验证了 ops-transformer 在低时延、高吞吐量及显存优化上的优势。测试覆盖多批次与序列长度场景，结果显示 ops-transformer 时延降低 2.4-4.7 倍，吞吐量提升同等倍数，显存节省最高达 54%，适合 LLM 训练与推理任务。

暖阳发布于 2026/3/29更新于 2026/5/3134 浏览

昇腾 910B NPU 平台 ops-transformer 算子全场景性能测试与 PyTorch 注意力对比

昇腾 910B NPU 平台 ops-transformer 算子：全场景性能测试与验证及与原生 PyTorch 注意力的深度性能对比

前提条件

开发环境准备：NPU 启动配置

计算类型需选择 NPU 以利用昇腾芯片专用算力。硬件配置建议为 NPU basic · 1 * NPU 910B，CPU 64GB。容器镜像建议使用 ubuntu22.04-py3.11-cann8.2.rc1。

环境依赖安装

Python、GCC、CMake 已预装完成，无需重复安装。版本要求如下：

Python >= 3.7.0
GCC >= 7.3.0
CMake >= 3.16.0

ops-transformer 项目源码编译用到的依赖包括 pigz（可选）、dos2unix、Gawk、googletest（仅执行 UT 时依赖）。上述依赖包可通过项目根目录 install_deps.sh 安装。

方式一：一键自动化安装

使用项目脚本安装依赖。

方式二：手动独立配置

Gawk 安装
dos2unix 安装
zlib 安装（pigz 依赖）
pigz 安装
googletest 安装
环境变量配置

环境依赖项验证

import sys
import subprocess
import os

def get_command_version(cmd, version_pattern):
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        for line in result.stdout.splitlines():
            if version_pattern in line:
                return line.strip()
        return "未知版本"
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "未安装"

def main():
    python_version = sys.version.split()[0]
    print(f"✅ Python 已安装，版本 {python_version}")
    gcc_version = get_command_version([, ], )
    gcc_ver = gcc_version.split()[-]    gcc_version  gcc_version
    ()
    cmake_version = get_command_version([, ], )
    cmake_ver = cmake_version.split()[]  (cmake_version.split())>=  cmake_version
    ()
    pigz_version = get_command_version([, ], )
    pigz_ver = pigz_version.split()[]    pigz_version  pigz_version
    ()
    dos2unix_version = get_command_version([, ], )
    dos2unix_ver = dos2unix_version.split()[]    dos2unix_version  dos2unix_version
    ()
    gawk_version = get_command_version([, ], )
    gawk_ver = gawk_version.split()[]  (gawk_version.split())>=  gawk_version
    ()
    gtest_inc = os.path.expanduser()
    gtest_lib = os.path.expanduser()
     os.path.exists(gtest_inc)  os.path.exists(gtest_lib):
        ()

 __name__ == :
    main()

相关免费在线工具

加密/解密文本
使用加密算法（如AES、TripleDES、Rabbit或RC4）加密和解密文本明文。在线工具，加密/解密文本在线工具，online
RSA密钥对生成器
生成新的随机RSA私钥和公钥pem证书。在线工具，RSA密钥对生成器在线工具，online
Mermaid 预览与可视化编辑
基于 Mermaid.js 实时预览流程图、时序图等图表，支持源码编辑与即时渲染。在线工具，Mermaid 预览与可视化编辑在线工具，online
随机西班牙地址生成器
随机生成西班牙地址（支持马德里、加泰罗尼亚、安达卢西亚、瓦伦西亚筛选），支持数量快捷选择、显示全部与下载。在线工具，随机西班牙地址生成器在线工具，online
Gemini 图片去水印
基于开源反向 Alpha 混合算法去除 Gemini/Nano Banana 图片水印，支持批量处理与下载。在线工具，Gemini 图片去水印在线工具，online
curl 转代码
解析常见 curl 参数并生成 fetch、axios、PHP curl 或 Python requests 示例代码。在线工具，curl 转代码在线工具，online

# 下载 legacy 包
wget https://ascend-cann.obs.cn-north-4.myhuaweicloud.com/CANN/community/8.5.0.alpha001/Ascend-cann-toolkit_8.5.0.alpha001_linux-aarch64.run
wget https://ascend-cann.obs.cn-north-4.myhuaweicloud.com/CANN/community/8.5.0.alpha001/cann-910b-ops-legacy_8.5.0.alpha001_linux-aarch64.run
wget https://ascend-cann.obs.cn-north-4.myhuaweicloud.com/CANN/community/cann-910b-ops-math_8.3.RC1_linux-aarch64.run
chmod +x Ascend-cann-toolkit_8.5.0.alpha001_linux-aarch64.run cann-910b-ops-legacy_8.5.0.alpha001_linux-aarch64.run cann-910b-ops-math_8.3.RC1_linux-aarch64.run

# 1. 安装 CANN Toolkit
./Ascend-cann-toolkit_8.5.0.alpha001_linux-aarch64.run --full --force --install-path=$HOME/.local/Ascend

# 2. 安装 legacy 包
./cann-910b-ops-legacy_8.5.0.alpha001_linux-aarch64.run --full --install-path=$HOME/.local/Ascend

# 3. 安装 ops-math 包
./cann-910b-ops-math_8.3.RC1_linux-aarch64.run --full --install-path=$HOME/.local/Ascend

# 切换到 bash 终端
bash
# 加载配置
. ~/.bashrc
# 定义实际安装路径
TOOLKIT_ROOT="$HOME/.local/Ascend/8.5.0.alpha001"
MATH_ROOT="$HOME/.local/Ascend/8.3.RC1"
export PATH="$TOOLKIT_ROOT/bin:$PATH"
export LD_LIBRARY_PATH="$TOOLKIT_ROOT/lib64:$TOOLKIT_ROOT/opp_legacy/lib64:$MATH_ROOT/ops_math/lib64:$LD_LIBRARY_PATH"
export PYTHONPATH="$TOOLKIT_ROOT/python/site-packages:$PYTHONPATH"
export ASCEND_HOME="$TOOLKIT_ROOT"

# 下载项目源码
git clone <repository_url>
# 安装根目录 requirements.txt 依赖
pip3 install -r requirements.txt

import torch
import time
import sys

# ==================== 基础配置 ====================
DEVICE = 0
WARMUP_TIMES = 20
TEST_TIMES = 100
TORCH_VERSION = torch.__version__
NPU_AVAILABLE = torch.npu.is_available()

# ==================== 多场景测试配置 ====================
TEST_CONFIGS = [
    (4, 256, 4, 64),
    (8, 512, 8, 64),
    (4, 1024, 8, 64),
    (16, 256, 8, 64),
    (8, 512, 16, 64),
    (2, 2048, 8, 64),
    (32, 128, 4, 64),
]

# ==================== 核心测试函数 ====================
def ascend_flash_attention(query, key, value, mask):
    return torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=mask, dropout_p=0.0, is_causal=False
    )

def benchmark(config):
    batch, seq_len, heads, head_dim = config
    query = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    key = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    value = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    mask = torch.ones(batch, 1, seq_len, seq_len, dtype=torch.bool).npu()

    for _ in range(WARMUP_TIMES):
        ascend_flash_attention(query, key, value, mask)
        torch.npu.synchronize()

    torch.npu.reset_peak_memory_stats()
    start_time = time.time()
    for _ in range(TEST_TIMES):
        ascend_flash_attention(query, key, value, mask)
        torch.npu.synchronize()
    total_time = time.time() - start_time

    avg_latency = (total_time / TEST_TIMES) * 1000
    throughput = (TEST_TIMES * batch) / total_time
    peak_memory = torch.npu.max_memory_allocated() / 1024 / 1024
    return avg_latency, throughput, peak_memory

if __name__ == "__main__":
    print("=" * 90)
    print("📋 ops-transformer 完整性能测试报告")
    print("=" * 90)
    print(f"【环境信息】")
    print(f" PyTorch 版本：{TORCH_VERSION}")
    print(f" NPU 设备可用：{'✅' if NPU_AVAILABLE else '❌'}")
    print(f" 测试设备：NPU-{DEVICE}")
    print(f" 预热次数：{WARMUP_TIMES} | 测试次数：{TEST_TIMES}")
    print(f" 数据精度：float32")
    print("=" * 90)
    if not NPU_AVAILABLE:
        print("❌ 错误：NPU 环境未配置就绪")
        sys.exit(1)

    print(f"\n{'测试场景':<30} {'平均时延(ms)':<15} {'吞吐量 (样本/秒)':<20} {'峰值显存 (MB)':<15}")
    print("-" * 90)
    for idx, config in enumerate(TEST_CONFIGS, 1):
        batch, seq_len, heads, head_dim = config
        scene_name = f"场景{idx} (B{batch}, S{seq_len}, H{heads}, D{head_dim})"
        try:
            latency, throughput, memory = benchmark(config)
            print(f"{scene_name:<30} {latency:<15.2f} {throughput:<20.0f} {memory:<15.0f}")
        except Exception as e:
            print(f"{scene_name:<30} {'❌ 测试失败':<15} {'-':<20} {'-':<15}")
    print("=" * 90)

python3 ops_perf_complete.py

场景	配置（batch, seq_len, heads）	核心测试目标
小规模	(4, 256, 4)	基础性能
中规模	(8, 512, 8)	常见 LLM 场景
长序列	(4, 1024, 8)	扩展性
大批次	(16, 256, 8)	吞吐量
多注意力头	(8, 512, 16)	并行能力

import torch
import time
import sys
import math

DEVICE = 0
WARMUP_TIMES = 20
TEST_TIMES = 50
TORCH_VERSION = torch.__version__
NPU_AVAILABLE = torch.npu.is_available()

TEST_CONFIGS = [
    (4, 256, 4, 64),
    (8, 512, 8, 64),
    (4, 1024, 8, 64),
    (16, 256, 8, 64),
    (8, 512, 16, 64),
]

def ascend_ops_transformer(query, key, value, mask):
    return torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=mask, dropout_p=0.0, is_causal=False
    )

def torch_vanilla_attention(query, key, value, mask):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attn_weights = torch.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, value)

def benchmark(attention_func, config, version_name):
    batch, seq_len, heads, head_dim = config
    query = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    key = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    value = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float32).npu()
    mask = torch.ones(batch, 1, seq_len, seq_len, dtype=torch.bool).npu()

    for _ in range(WARMUP_TIMES):
        attention_func(query, key, value, mask)
        torch.npu.synchronize()

    torch.npu.reset_peak_memory_stats()
    start_time = time.time()
    for _ in range(TEST_TIMES):
        attention_func(query, key, value, mask)
        torch.npu.synchronize()
    total_time = time.time() - start_time

    avg_latency = (total_time / TEST_TIMES) * 1000
    throughput = (TEST_TIMES * batch) / total_time
    peak_memory = torch.npu.max_memory_allocated() / 1024 / 1024
    return avg_latency, throughput, peak_memory

if __name__ == "__main__":
    print("=" * 120)
    print("📋 ops-transformer vs 原生 PyTorch 注意力 性能对比测试")
    print("=" * 120)
    print(f"【环境信息】")
    print(f" PyTorch 版本：{TORCH_VERSION}")
    print(f" NPU 设备可用：{'✅' if NPU_AVAILABLE else '❌'}")
    print(f" 测试设备：NPU-{DEVICE}")
    print(f" 预热次数：{WARMUP_TIMES} | 测试次数：{TEST_TIMES}")
    print(f" 数据精度：float32")
    print("=" * 120)
    if not NPU_AVAILABLE:
        print("❌ 错误：NPU 环境未配置就绪")
        sys.exit(1)

    print(f"\n{'测试场景':<30} {'版本':<20} {'平均时延 (ms)':<15} {'吞吐量 (样本/秒)':<20} {'峰值显存 (MB)':<15} {'优化倍数 (时延)':<10}")
    print("-" * 120)
    for idx, config in enumerate(TEST_CONFIGS, 1):
        batch, seq_len, heads, head_dim = config
        scene_name = f"场景{idx} (B{batch}, S{seq_len}, H{heads})"
        try:
            vanilla_latency, vanilla_throughput, vanilla_memory = benchmark(torch_vanilla_attention, config, "原生 PyTorch 注意力")
        except Exception as e:
            vanilla_latency = vanilla_throughput = vanilla_memory = "-"
        try:
            ops_latency, ops_throughput, ops_memory = benchmark(ascend_ops_transformer, config, "ops-transformer 优化")
            opt_multiple = f"{vanilla_latency / ops_latency:.1f}x" if vanilla_latency != "-" else "-"
        except Exception as e:
            ops_latency = ops_throughput = ops_memory = opt_multiple = "-"

        if vanilla_latency != "-":
            print(f"{scene_name:<30} {'原生 PyTorch 注意力':<20} {vanilla_latency:<15.2f} {vanilla_throughput:<20.0f} {vanilla_memory:<15.0f} {'-':<10}")
        if ops_latency != "-":
            print(f"{scene_name:<30} {'ops-transformer 优化':<20} {ops_latency:<15.2f} {ops_throughput:<20.0f} {ops_memory:<15.0f} {opt_multiple:<10}")
    print("=" * 120)

python3 ops_perf_complete.py

昇腾 910B NPU 平台 ops-transformer 算子全场景性能测试与 PyTorch 注意力对比

昇腾 910B NPU 平台 ops-transformer 算子：全场景性能测试与验证及与原生 PyTorch 注意力的深度性能对比

前提条件

开发环境准备：NPU 启动配置

环境依赖安装

方式一：一键自动化安装

方式二：手动独立配置

环境依赖项验证

更多推荐文章

相关免费在线工具

环境准备与配置

下载社区版 CANN 工具包

安装与部署社区版 CANN

环境变量配置

ops-transformer 项目安装与依赖构建

ops-transformer 性能测试

测试脚本准备

运行测试脚本

测试结果与分析

原生 PyTorch 注意力 vs ops-transformer：注意力性能对比测试

测试脚本准备

运行测试脚本

测试结果与分析

总结

更多推荐文章

相关免费在线工具

昇腾 910B NPU 平台 ops-transformer 算子全场景性能测试与 PyTorch 注意力对比

昇腾 910B NPU 平台 ops-transformer 算子：全场景性能测试与验证及与原生 PyTorch 注意力的深度性能对比

前提条件

开发环境准备：NPU 启动配置

环境依赖安装

方式一：一键自动化安装

方式二：手动独立配置

环境依赖项验证

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具

环境准备与配置

下载社区版 CANN 工具包

安装与部署社区版 CANN

环境变量配置

ops-transformer 项目安装与依赖构建

ops-transformer 性能测试

测试脚本准备

运行测试脚本

测试结果与分析

原生 PyTorch 注意力 vs ops-transformer：注意力性能对比测试

测试脚本准备

运行测试脚本

测试结果与分析

总结

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具