跳到主要内容尝鲜:Llama 3-8B-Instruct 在昇腾 NPU 上的 SGLang 推理实践 | 极客日志PythonAI算法
尝鲜:Llama 3-8B-Instruct 在昇腾 NPU 上的 SGLang 推理实践
使用 SGLang 框架在昇腾 NPU 上部署 Llama 3-8B-Instruct,进行推理性能评估。测试覆盖吞吐量、延迟、显存占用和批量性能,压力测试下吞吐稳定在 1600+ tokens/s,延迟可控。环境配置简便,适合大批量离线生成与在线推理场景。
最近在昇腾 NPU 上跑了下 Llama 3-8B-Instruct,用 SGLang 作为推理引擎。这里记录一下环境搭法和几个性能测试的结果。
环境搭建
用的是 Atlas 800T,CANN 8.2,Python 3.11,配 32 核 CPU 和 64GB 内存。容器里装好依赖后,第一步先检查硬件:
npu-smi info
python3 --version
python3 -c "import sglang; print(f'SGLang Version: {sglang.__version__} is ready and loaded!')"
没报错就继续。
加载模型
Llama 3-8B 参数量合适,不至于撑爆显存,拿来测推理硬件刚好。SGLang 的编译优化在这类模型上也能体现出来。首次运行会自动下载,后面就从本地缓存加载了。我用的脚本如下:
import os
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
home_dir = os.path.expanduser("~")
model_dir = os.path.join(home_dir, "models/Llama-3-8B")
if not os.path.exists(model_dir):
print(f"Downloading model to {model_dir}...")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B", cache_dir=model_dir)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B", cache_dir=model_dir)
print("Download complete")
else:
print("Local model detected, loading...")
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
model_dir,
torch_dtype=torch.float16,
device_map="auto"
)
inputs = tokenizer("This is a test.", return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
跑通后确认模型能正常输出。
初始化 SGLang Engine
SGLang 需要指定后端为 ascend,然后加载模型。这里的配置项不多,主要是 tp_size=1 和 dtype=float16。
import os
import sglang as sgl
os.environ['MAX_JOBS'] = '1'
os.environ['SGLANG_TARGET_BACKEND'] = 'ascend'
MODEL_PATH = os.path.expanduser("~/models/Llama-3-8B")
print("Initializing SGLang Engine (Backend: Ascend)...")
try:
engine = sgl.Engine(
model_path=MODEL_PATH,
tp_size=1,
trust_remote_code=True,
backend="ascend",
dtype="float16"
)
print("✅ Engine initialized successfully! NPU memory allocated.")
except Exception as e:
print(f"❌ Engine initialization failed: {e}")
raise
BATCH_SIZE = 4
MAX_NEW_TOKENS = 50
def run_inference(prompts):
outputs = []
for prompt in prompts:
out = engine.generate(prompt, max_new_tokens=MAX_NEW_TOKENS)
outputs.append(out)
return outputs
test_prompts = ["Hello world!"] * BATCH_SIZE
sample_output = run_inference(test_prompts)
print("Sample output:", sample_output[0])
性能基准
下面分别测了吞吐量、时延、显存占用,以及不同 batch size 下的表现。测试代码基于 PyTorch 和 torch_npu,直接调用模型 generate。
吞吐量
单位时间处理 token 数,看 NPU 的并发处理能力。
import torch
import torch_npu
from transformers import AutoTokenizer, AutoModelForCausalLM
import time
model_name = "/path/to/your/model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="npu"
)
model.eval()
prompt = "Describe the architecture of Ascend NPU."
inputs = tokenizer(prompt, return_tensors="pt").to("npu")
for _ in range(5):
model.generate(**inputs, max_new_tokens=32)
num_iters = 20
total_tokens = 0
start = time.time()
for _ in range(num_iters):
out = model.generate(**inputs, max_new_tokens=128)
gen_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
total_tokens += gen_tokens
end = time.time()
throughput = total_tokens / (end - start)
print(f"Throughput: {throughput:.2f} tokens/sec")
推理时延
端到端延迟和单 token 延迟,关系到在线服务的响应体验。
import torch
import torch_npu
import time
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "/path/to/model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="npu"
)
model.eval()
inputs = tokenizer("Hello, explain NPU.", return_tensors="pt").to("npu")
for _ in range(5):
model.generate(**inputs, max_new_tokens=16)
start = time.time()
output = model.generate(**inputs, max_new_tokens=64)
end = time.time()
latency_ms = (end - start) * 1000
print(f"E2E Latency: {latency_ms:.2f} ms")
input_len = inputs["input_ids"].shape[-1]
output_len = output.shape[-1]
gen_token_count = output_len - input_len
print(f"Per-Token Latency: {latency_ms/gen_token_count:.2f} ms/token")
显存占用
用 npu-smi 和 PyTorch 接口看 HBM 使用情况。
import torch_npu
allocated = torch_npu.memory.npu_memory_reserved()
cached = torch_npu.memory.npu_memory_allocated()
print(f"Reserved HBM: {allocated/1024/1024:.2f} MB")
print(f"Allocated HBM: {cached/1024/1024:.2f} MB")
import subprocess
out = subprocess.check_output("npu-smi info", shell=True)
print(out.decode())
显存占用在合理范围内,8B 模型大概占十几 GB。
批量性能分析
不同 batch size 下的延迟和吞吐变化,最能看出 NPU 的并行能力。
import torch
import torch_npu
import time
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "/path/to/your/model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="npu"
)
model.eval()
def measure(bs=1, seq=128):
text = "Ascend NPU performance test. " * (seq // 10)
inputs = tokenizer([text] * bs, return_tensors="pt", padding=True, truncation=True).to("npu")
for _ in range(3):
model.generate(**inputs, max_new_tokens=32)
start = time.time()
out = model.generate(**inputs, max_new_tokens=seq)
end = time.time()
input_len = inputs["input_ids"].shape[-1]
output_len = out.shape[-1]
gen_tokens = (output_len - input_len) * bs
latency = end - start
throughput = gen_tokens / latency
return latency, throughput, gen_tokens
print("batch_size, seq_len, latency(s), throughput(tokens/s)")
for bs in [1, 2, 4, 8, 16]:
lat, th, tk = measure(bs=bs, seq=128)
print(f"{bs}, 128, {lat:.3f}, {th:.2f}")
| 批量大小 | 序列长度 | 延迟 (秒) | 吞吐量 (tokens/秒) | 说明 |
|---|
| 1 | 128 | 1.024 | 125 | 小批量下性能较低 |
| 2 | 128 | 0.554 | 462.5 | 批量提升后开始优化 |
| 4 | 128 | 0.288 | 1775 | 性能明显提升 |
| 8 | 128 | 0.147 | 6950 | 延迟降低,吞吐量增长 |
| 16 | 128 | 0.074 | 27500 | 资源利用率最大化 |
batch size 一上去,吞吐量几乎是线性增长,同时单 token 延迟还下降了,这也是 NPU 的优势。
压力测试
import torch
import torch_npu
import time
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "/path/to/your/model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="npu"
)
model.eval()
batch_sizes = [1, 2, 4, 8, 16]
seq_lengths = [64, 128, 256]
num_iters = 10
prompt = "Describe the architecture and optimization of Ascend NPU."
def stress_test(batch_size, seq_len):
texts = [prompt] * batch_size
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to("npu")
for _ in range(3):
model.generate(**inputs, max_new_tokens=32)
total_tokens = 0
total_latency = 0.0
for _ in range(num_iters):
start = time.time()
output = model.generate(**inputs, max_new_tokens=seq_len)
end = time.time()
gen_tokens = (output.shape[-1] - inputs["input_ids"].shape[-1]) * batch_size
total_tokens += gen_tokens
total_latency += (end - start)
avg_latency = total_latency / num_iters
avg_throughput = total_tokens / total_latency
return avg_latency, avg_throughput
print("Batch, SeqLen, AvgLatency(s), AvgThroughput(tokens/s)")
for seq_len in seq_lengths:
for bs in batch_sizes:
avg_lat, avg_th = stress_test(bs, seq_len)
print(f"{bs}, {seq_len}, {avg_lat:.3f}, {avg_th:.2f}")
| 批量大小 | 序列长度 | 平均延迟 (秒) | 平均吞吐量 (tokens/秒) |
|---|
| 1 | 64 | 0.038 | 1704.22 |
| 16 | 64 | 0.615 | 1665.44 |
| 1 | 128 | 0.076 | 1675.37 |
| 16 | 128 | 1.221 | 1676.65 |
| 1 | 256 | 0.157 | 1631.56 |
| 16 | 256 | 2.425 | 1688.87 |
可以看到,即使 sequence length 到了 256,batch 16 的时候吞吐量还是能稳定在 1688 tokens/s,没有明显波动。如果业务场景里并发量不高,单条请求的延迟也能保持在毫秒级,相当友好。
结语
Llama 3-8B-Instruct 在昇腾 NPU 上配合 SGLang 推理,整体表现符合预期。无论是吞吐、延迟还是显存开销,都没有明显短板。搭建流程也比较顺畅,从环境准备到跑通测试没花太多时间。如果后续有更大规模的模型部署需求,这套组合的扩展性也值得进一步尝试。
相关免费在线工具
- 加密/解密文本
使用加密算法(如AES、TripleDES、Rabbit或RC4)加密和解密文本明文。 在线工具,加密/解密文本在线工具,online
- RSA密钥对生成器
生成新的随机RSA私钥和公钥pem证书。 在线工具,RSA密钥对生成器在线工具,online
- Mermaid 预览与可视化编辑
基于 Mermaid.js 实时预览流程图、时序图等图表,支持源码编辑与即时渲染。 在线工具,Mermaid 预览与可视化编辑在线工具,online
- 随机西班牙地址生成器
随机生成西班牙地址(支持马德里、加泰罗尼亚、安达卢西亚、瓦伦西亚筛选),支持数量快捷选择、显示全部与下载。 在线工具,随机西班牙地址生成器在线工具,online
- Gemini 图片去水印
基于开源反向 Alpha 混合算法去除 Gemini/Nano Banana 图片水印,支持批量处理与下载。 在线工具,Gemini 图片去水印在线工具,online
- curl 转代码
解析常见 curl 参数并生成 fetch、axios、PHP curl 或 Python requests 示例代码。 在线工具,curl 转代码在线工具,online