TensorRT-LLM: A Practical Guide to Accelerating LLM Inference | 极客日志

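TensorRT-LLM is NVIDIA's high-performance inference library for large language models. This guide walks through installing it, building engines, and deploying them with Triton. It covers environment setup, Qwen model adaptation, Inflight Batching optimization, and troubleshooting of common network issues, with the goal of helping developers build low-latency, high-throughput LLM inference services.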

By 指针猎手 | Published 2025/2/6 | Updated 2026/6/4

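1. Introduction

TensorRT-LLM is a high-performance inference library released by NVIDIA and designed specifically for large language models (LLMs). It provides an easy-to-use Python API for defining LLMs and building TensorRT engines that include state-of-the-art optimizations, so that inference runs efficiently on NVIDIA GPUs. Compared with general-purpose inference frameworks, TensorRT-LLM is deeply optimized for the Transformer architecture and supports advanced features such as quantization and continuous batching.

2. Environment Setup and Installation

2.1 Base Dependencies

Deploying in a Docker environment is recommended, since it avoids conflicts with the local environment configuration. The first step is to install NVIDIA TensorRT; the commands below use the TensorRT 8.6.1 local repository package for Ubuntu 22.04 and CUDA 12.0.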

sudo dpkg -i nv-tensorrt-local-repo-ubuntu2204-8.6.1-cuda-12.0_1.0-1_amd64.deb
sudo cp /var/nv-tensorrt-local-repo-ubuntu2204-8.6.1-cuda-12.0/nv-tensorrt-local-42B2FC56-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install tensorrt

2.2 Installing TensorRT-LLM

Install the pre-release version with pip, then confirm that the package imports:

pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
python3 -c "import tensorrt_llm"
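
A bare import only tells you whether the package loads. As a slightly more informative check, the sketch below reports which version pip actually resolved. The helper name `installed_version` is illustrative; the distribution name `tensorrt_llm` is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("tensorrt_llm")
if version is None:
    print("tensorrt_llm is not installed in this environment")
else:
    print(f"tensorrt_llm {version} is installed")
```

Recording the reported version alongside the built engine is a cheap way to notice later that an engine was built in a different environment from the one serving it.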

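Note: TensorRT-LLM and TGI (Text Generation Inference) may require conflicting package versions, so keep the two in isolated environments.

2.3 Triton Inference Server

For production deployment, TensorRT-LLM is usually combined with Triton Inference Server.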

docker pull nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3

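3. Building the TensorRT Engine

Creating a TensorRT engine for an existing model involves three main steps: download the pretrained weights, build a fully optimized model engine, and deploy the engine.

Because some models (such as Qwen2) may not yet be fully supported, the example here uses the Qwen-7B model. If a model is not in the supported list under the repository's examples directory, you need to write the model conversion code yourself.

3.1 Build Command in Detail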

python build.py --hf_model_dir /home/chuan/models/qwen/Qwen-7B-Chat-Int4 \
                --quant_ckpt_path /home/chuan/models/qwen/Qwen-7B-Chat-Int4 \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4_gptq \
                --per_group \
                --world_size 1 \
                --tp_size 1 \
                --output_dir /home/chuan/models/qwen/Qwen-7B-Chat-Int4/trt_engines/int4-gptq/1-gpu

Parameters:

--hf_model_dir: path to the HuggingFace model directory.
--quant_ckpt_path: path to the quantized checkpoint.
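
When the same build is repeated with small variations, it can help to keep the flags in one place and assemble the command line programmatically. The following is a minimal sketch of that idea, not part of TensorRT-LLM: `build_argv` is a hypothetical helper, and the flag names and values are copied from the command above.

```python
import shlex
from typing import Dict, List, Union

FlagValue = Union[str, int, bool]


def build_argv(script: str, flags: Dict[str, FlagValue]) -> List[str]:
    """Turn a flag dictionary into an argv list; True means a bare switch."""
    argv = ["python3", script]
    for name, value in flags.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is False:
            continue  # disabled switch: leave it out entirely
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


MODEL_DIR = "/home/chuan/models/qwen/Qwen-7B-Chat-Int4"

flags: Dict[str, FlagValue] = {
    "hf_model_dir": MODEL_DIR,
    "quant_ckpt_path": MODEL_DIR,
    "dtype": "float16",
    "remove_input_padding": True,
    "use_gpt_attention_plugin": "float16",
    "enable_context_fmha": True,
    "use_gemm_plugin": "float16",
    "use_weight_only": True,
    "weight_only_precision": "int4_gptq",
    "per_group": True,
    "world_size": 1,
    "tp_size": 1,
    "output_dir": f"{MODEL_DIR}/trt_engines/int4-gptq/1-gpu",
}

argv = build_argv("build.py", flags)
print(shlex.join(argv))  # inspect the command before running it
```

To execute the build rather than print it, the list can be passed to `subprocess.run(argv, check=True)` from the directory that contains build.py.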

3.2 Running the Engine

Run a test prompt against the built engine:

python3 ../run.py --input_text "你好，请问你叫什么？" \
                  --max_output_len=50 \
                  --tokenizer_dir /home/chuan/models/qwen/Qwen-7B-Chat-Int4 \
                  --engine_dir=/home/chuan/models/qwen/Qwen-7B-Chat-Int4/trt_engines/int4-gptq/1-gpu

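4. Deploying on Triton

4.1 Writing the Configuration Files

Copy the built engine files into the tensorrt_llm model directory of the tensorrtllm_backend repository:

cp /home/chuan/models/qwen/Qwen-7B-Chat-Int4/trt_engines/int4-gptq/1-gpu/* /home/chuan/github/tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1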

4.1.1 TensorRT-LLM Configuration

python3 tools/fill_template.py --in_place \
      all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
      decoupled_mode:true,engine_dir:/all_models/inflight_batcher_llm/tensorrt_llm/1,\
gpt_model_type:inflight_fused_batching

4.1.2 Preprocessing and Postprocessing

python3 tools/fill_template.py --in_place \
    all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_type:auto,tokenizer_dir:/all_models/inflight_batcher_llm/tensorrt_llm/1
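
The fill_template.py calls above take a comma-separated list of key:value pairs and write the values into placeholders in config.pbtxt. The sketch below illustrates that kind of substitution with Python's `string.Template`. It is an approximation of the idea, not the script's actual code: the helper names and the sample fragment are made up for illustration, and `${key}` placeholders are an assumption about the template format.

```python
from string import Template
from typing import Dict


def parse_substitutions(spec: str) -> Dict[str, str]:
    """Parse 'key:value,key:value' into a dict, splitting on the first colon only."""
    pairs: Dict[str, str] = {}
    for item in spec.split(","):
        key, _, value = item.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs


def fill(template_text: str, spec: str) -> str:
    """Replace ${key} placeholders; placeholders without a value are left as-is."""
    return Template(template_text).safe_substitute(parse_substitutions(spec))


# Hypothetical config fragment, for illustration only.
fragment = (
    'engine_dir = "${engine_dir}"\n'
    'model_type = "${gpt_model_type}"\n'
    'not_provided = "${something_else}"'
)
print(fill(fragment, "engine_dir:/models/1,gpt_model_type:inflight_fused_batching"))
```

Because the list is split on commas, a value that itself contains a comma cannot be expressed in this format. Leaving unknown placeholders untouched makes it easy to spot values that were never filled in.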

python3 build.py --hf_model_dir /home/chuan/models/qwen/Qwen-7B-Chat-Int4 \
                --quant_ckpt_path /home/chuan/models/qwen/Qwen-7B-Chat-Int4 \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4_gptq \
                --per_group \
                --world_size 1 \
                --tp_size 1 \
                --output_dir /home/chuan/models/qwen/Qwen-7B-Chat-Int4/trt_engines/int4-gptq/1-gpu \
                --use_inflight_batching \
                --paged_kv_cache
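
The --paged_kv_cache flag stores the key/value cache in fixed-size blocks instead of one contiguous buffer per sequence. To get a feel for the sizes involved, the sketch below estimates the KV cache cost per token and the number of blocks a sequence occupies. The model shape (32 layers, hidden size 4096) is an assumption for Qwen-7B that should be checked against the model's config.json, and the block size of 64 tokens is an illustrative value rather than a confirmed default.

```python
import math


def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes of K and V stored for one token across all layers (standard multi-head attention)."""
    return 2 * num_layers * hidden_size * bytes_per_elem


def blocks_needed(num_tokens: int, tokens_per_block: int) -> int:
    """Number of fixed-size cache blocks occupied by a sequence of num_tokens."""
    return math.ceil(num_tokens / tokens_per_block)


# Assumed Qwen-7B shape with an FP16 cache (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=32, hidden_size=4096)
print(f"{per_token / 2**20} MiB of KV cache per token")

# Illustrative block size; the last block of a sequence is only partly used.
print(blocks_needed(130, 64), "blocks for a 130-token sequence")
```

Under these assumptions each token costs 0.5 MiB, so the cache rather than the INT4 weights tends to bound how many long sequences fit in memory at once. Allocating in blocks limits the waste to at most one partly filled block per sequence.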

4.2 Starting the Service

docker run -it --gpus all --network host --shm-size=2g \
-v /home/chuan/github/tensorrtllm_backend:/opt/tensorrtllm_backend \
-v /home/chuan/github/TensorRT-LLM:/opt/TensorRT-LLM \
-p 8000:8000 \
triton_trt_llm

pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install sentencepiece protobuf

python3 /opt/tensorrtllm_backend/scripts/launch_triton_server.py --model_repo /opt/tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1
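
Loading the engine can take a while, so requests sent immediately after launch may fail. A small readiness poll is one way to wait for the server. The sketch below uses Triton's HTTP health endpoint `/v2/health/ready`; the function name, timeout, and polling interval are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll the readiness endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet: connection refused, timeout, or non-2xx reply
        time.sleep(interval_s)
    return False


# Example: wait_until_ready("http://localhost:8000")
```

A return value of False means the server never reported ready within the timeout, in which case the launch script's log output is the next place to look.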

4.3 Local Access Test

Send a request to the generate endpoint of the ensemble model:

curl -X POST localhost:8000/v2/models/ensemble/generate \
-H "Content-Type: application/json" \
-d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好，你叫什么？<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 50}'
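
The same request can be sent from Python, which also makes the ChatML prompt easier to assemble than a hand-escaped JSON string. This is a minimal sketch using only the standard library. The function names are illustrative, the URL is a parameter that should match the one used in the curl command, and the shape of the response is not assumed beyond it being JSON.

```python
import json
import urllib.request


def chatml_prompt(user: str, system: str = "You are a helpful assistant.") -> str:
    """Build the ChatML prompt used in the curl request above."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def build_request(url: str, user: str, max_tokens: int = 50) -> urllib.request.Request:
    """Create the POST request with the same JSON fields as the curl command."""
    payload = {"text_input": chatml_prompt(user), "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(url: str, user: str, max_tokens: int = 50) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(url, user, max_tokens), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires a running server): generate(GENERATE_URL, "你好，你叫什么？")
```

Keeping prompt construction separate from the HTTP call means the prompt format can be checked without a running server.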

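5. Common Issues and Optimization

5.1 Special Handling for Qwen Models

The Qwen tokenizer depends on tiktoken, so install it in the serving environment:

pip install tiktoken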

max_batch_size: 2

--use_inflight_batching
--paged_kv_cache

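5.2 Network Proxy Settings

If the Docker client runs behind a proxy, configure it in ~/.docker/config.json:

{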
 "proxies": {
   "default": {
     "httpProxy": "http://172.17.0.1:3128",
     "httpsProxy": "https://172.17.0.1:3129",
     "noProxy": "*.test.example.com,.example.org,127.0.0.0/8"
   }
 }
}
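
An existing ~/.docker/config.json often holds other settings, such as registry credentials, so the proxy block is better merged in than pasted over the file. The sketch below shows one way to do that. The helper names are hypothetical, and the key names are the ones from the JSON above.

```python
import json
from pathlib import Path


def merge_proxy_settings(config: dict, http_proxy: str, https_proxy: str, no_proxy: str) -> dict:
    """Return a copy of a Docker client config with the default proxy block set."""
    merged = dict(config)
    proxies = dict(merged.get("proxies", {}))
    proxies["default"] = {
        "httpProxy": http_proxy,
        "httpsProxy": https_proxy,
        "noProxy": no_proxy,
    }
    merged["proxies"] = proxies
    return merged


def update_config_file(path: Path, http_proxy: str, https_proxy: str, no_proxy: str) -> None:
    """Read the config if it exists, merge the proxy block, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_proxy_settings(existing, http_proxy, https_proxy, no_proxy)
    path.write_text(json.dumps(merged, indent=2) + "\n")


# Example: update_config_file(Path.home() / ".docker" / "config.json", ...)
```

Only the `proxies.default` entry is replaced; every other top-level key in the file is carried over unchanged.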

sudo mkdir -p /etc/systemd/system/docker.service.d
sudo vi /etc/systemd/system/docker.service.d/http-proxy.conf

[Service]
Environment="HTTP_PROXY=http://172.17.0.1:3128"
Environment="HTTPS_PROXY=https://172.17.0.1:3129"
Environment="NO_PROXY=localhost,127.0.0.1,docker-registry.example.com,.corp"

sudo systemctl daemon-reload
sudo systemctl restart docker

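TensorRT-LLM: A Practical Guide to Accelerating LLM Inference

1. Introduction
2. Environment Setup and Installation
2.1 Base Dependencies
2.2 Installing TensorRT-LLM
2.3 Triton Inference Server
3. Building the TensorRT Engine
3.1 Build Command in Detail
3.2 Running the Engine
4. Deploying on Triton
4.1 Writing the Configuration Files
4.1.1 TensorRT-LLM Configuration
4.1.2 Preprocessing and Postprocessing
4.2 Starting the Service
4.3 Local Access Test
5. Common Issues and Optimization
5.1 Special Handling for Qwen Models
5.2 Network Proxy Settings
6. Summary
7. Performance Optimization Strategies
7.1 Quantization
7.2 Continuous Batching (Inflight Batching)
7.3 Paged KV Cache
7.4 Multi-GPU Parallelism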
