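VLLM Inference Acceleration and Deployment with SWIFT
This article describes how to accelerate and deploy large language model inference with the MS-SWIFT framework and the VLLM engine. It covers environment setup, inference and streaming output with Qwen and ChatGLM models, the CLI tools, merging and deploying fine-tuned models, building a Web UI, and multi-LoRA deployment. The focus is on calling the VLLM backend from Python scripts and shell commands and serving models at high throughput through an OpenAI-compatible API.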


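Supported GPUs include A10, 3090, V100, and A100.
# Set the global pip mirror (speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
pip install 'ms-swift[llm]' -U
# vllm releases are tied to specific CUDA versions; choose the version according to the official documentation
pip install vllm -U
pip install openai -U
# Environment alignment (usually unnecessary; if you hit errors, run the commands below. The repository is tested against the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U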
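Because vllm releases are tied to specific CUDA versions, it can help to record which package versions actually got installed before debugging anything else. A minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of ms-swift:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == '__main__':
    report = installed_versions(['ms-swift', 'vllm', 'openai', 'torch'])
    for name, version in report.items():
        print(f'{name}: {version or "not installed"}')
```

Comparing the reported vllm and torch versions against the compatibility notes in the vllm documentation is usually the quickest way to spot a CUDA mismatch.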
# Batch inference with qwen-7b-chat
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_vllm
)

model_type = ModelType.qwen_7b_chat
llm_engine = get_vllm_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
# An interface similar to `transformers.GenerationConfig`
llm_engine.generation_config.max_new_tokens = 256

# Two independent requests are processed as one batch
request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")

# Continue the second conversation by passing its history back in
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
    print(f"history: {resp['history']}")
# Streaming output with qwen-7b-chat
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_stream_vllm
)

model_type = ModelType.qwen_7b_chat
llm_engine = get_vllm_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
llm_engine.generation_config.max_new_tokens = 256

request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
gen = inference_stream_vllm(llm_engine, template, request_list)
query_list = [request['query'] for request in request_list]
print(f"query_list: {query_list}")
# Each iteration yields the current state of every response in the batch
for resp_list in gen:
    response_list = [resp['response'] for resp in resp_list]
    print(f'response_list: {response_list}')

# After the generator is exhausted, resp_list holds the final responses
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
gen = inference_stream_vllm(llm_engine, template, request_list)
query = request_list[0]['query']
print(f"query: {query}")
for resp_list in gen:
    response = resp_list[0]['response']
    print(f'response: {response}')
history = resp_list[0]['history']
print()
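The loops above print the response generated so far on every iteration, so the same prefix appears repeatedly. To print only the newly generated text, track how many characters have already been shown. A minimal sketch, assuming the generator yields cumulative responses (the same pattern as the `print_idx` loop in the LoRA streaming example later in this article); `fake_stream` is a hypothetical stand-in for `inference_stream_vllm` so that the snippet runs without a GPU:

```python
def fake_stream():
    """Hypothetical stand-in for a streaming generator: yields cumulative text."""
    for text in ['Hang', 'Hangzhou', 'Hangzhou is the capital.']:
        yield [{'response': text}]


def iter_deltas(gen, index=0):
    """Yield only the newly generated suffix of the response at `index`."""
    printed = 0
    for resp_list in gen:
        response = resp_list[index]['response']
        yield response[printed:]
        printed = len(response)


if __name__ == '__main__':
    for delta in iter_deltas(fake_stream()):
        print(delta, end='', flush=True)
    print()
```

With a real engine, `fake_stream()` would be replaced by the generator returned from `inference_stream_vllm(llm_engine, template, request_list)`.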
# Batch inference with chatglm3-6b
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_vllm
)

model_type = ModelType.chatglm3_6b
llm_engine = get_vllm_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
llm_engine.generation_config.max_new_tokens = 256

request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")

# Continue the second conversation by passing its history back in
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
    print(f"history: {resp['history']}")
# qwen
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-7b-chat --infer_backend vllm
# yi
CUDA_VISIBLE_DEVICES=0 swift infer --model_type yi-6b-chat --infer_backend vllm
# gptq
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-7b-chat-int4 --infer_backend vllm
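Single-sample inference: a model fine-tuned with LoRA must first go through merge-lora, which produces a complete checkpoint directory. A model fine-tuned with full parameters can use VLLM inference acceleration directly.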
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_vllm
)

ckpt_dir = 'vx-xxx/checkpoint-100-merged'
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)

llm_engine = get_vllm_engine(model_type, model_id_or_path=ckpt_dir)
tokenizer = llm_engine.hf_tokenizer
template = get_template(template_type, tokenizer)

query = '你好'
resp = inference_vllm(llm_engine, template, [{'query': query}])[0]
print(f"response: {resp['response']}")
print(f"history: {resp['history']}")
Using the CLI:
# Merge the LoRA incremental weights, then use vllm for inference acceleration
# If you need quantization, specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
# Evaluate on the dataset
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' \
    --infer_backend vllm \
    --load_dataset_config true
# Manual evaluation
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' \
    --infer_backend vllm
CUDA_VISIBLE_DEVICES=0 swift app-ui --model_type qwen-7b-chat --infer_backend vllm
# Merge the LoRA incremental weights, then build the app-ui with vllm as the backend
# If you need quantization, specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' --infer_backend vllm
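swift uses VLLM as the inference backend and exposes an OpenAI-compatible API. For the server-side deployment arguments, see the related command-line documentation; for the client-side request parameters, see the official OpenAI API documentation.
Server:
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-7b-chat
# Multi-GPU deployment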
RAY_memory_monitor_refresh_ms=0 CUDA_VISIBLE_DEVICES=0,1,2,3 swift deploy --model_type qwen-7b-chat --tensor_parallel_size 4
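In the multi-GPU command, `--tensor_parallel_size` has to fit the devices exposed through `CUDA_VISIBLE_DEVICES`. A small pre-flight check can catch a mismatch before the server starts loading weights. This is a sketch under the assumption that the tensor-parallel size must not exceed the number of visible GPUs; `visible_gpu_ids` and `check_tensor_parallel` are hypothetical helpers, not swift APIs:

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of ids, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get('CUDA_VISIBLE_DEVICES')
    if raw is None:
        # Variable unset: every GPU is visible, so the count is unknown here.
        return None
    return [item.strip() for item in raw.split(',') if item.strip()]


def check_tensor_parallel(tensor_parallel_size, env=None):
    """Raise ValueError if the requested size exceeds the visible GPU count."""
    gpus = visible_gpu_ids(env)
    if gpus is not None and tensor_parallel_size > len(gpus):
        raise ValueError(
            f'tensor_parallel_size={tensor_parallel_size}, but only '
            f'{len(gpus)} GPU(s) are visible: {gpus}')
    return gpus


if __name__ == '__main__':
    print(check_tensor_parallel(4, {'CUDA_VISIBLE_DEVICES': '0,1,2,3'}))
```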
Client:
Test with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-7b-chat",
"messages": [{"role": "user", "content": "晚上睡不着觉怎么办?"}],
"max_tokens": 256,
"temperature": 0
}'
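The same request can be sent from Python without any extra dependencies. A minimal sketch using `urllib` from the standard library; it assumes the server started above is listening on localhost:8000, and `build_chat_request` and `send` are hypothetical helper names:

```python
import json
import urllib.request


def build_chat_request(query, model='qwen-7b-chat',
                       base_url='http://localhost:8000/v1',
                       max_tokens=256, temperature=0):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        'model': model,
        'messages': [{'role': 'user', 'content': query}],
        'max_tokens': max_tokens,
        'temperature': temperature,
    }
    return urllib.request.Request(
        url=f'{base_url}/chat/completions',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode('utf-8'))
    return body['choices'][0]['message']['content']
```

Once the server is running, `print(send(build_chat_request('晚上睡不着觉怎么办?')))` would print the reply.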
Using Swift:
from swift.llm import get_model_list_client, XRequestConfig, inference_client

model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')

query = '浙江的省会在哪里?'
request_config = XRequestConfig(seed=42)
resp = inference_client(model_type, query, request_config=request_config)
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

# Streaming, with the previous turn passed as history
history = [(query, response)]
query = '这有什么好吃的?'
request_config = XRequestConfig(stream=True, seed=42)
stream_resp = inference_client(model_type, query, history, request_config=request_config)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
    print(chunk.choices[0].delta.content, end='', flush=True)
print()
Using OpenAI:
from openai import OpenAI

client = OpenAI(
    api_key='EMPTY',
    base_url='http://localhost:8000/v1',
)
model_type = client.models.list().data[0].id
print(f'model_type: {model_type}')

query = '浙江的省会在哪里?'
messages = [{
    'role': 'user',
    'content': query
}]
resp = client.chat.completions.create(
    model=model_type,
    messages=messages,
    seed=42)
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

# Streaming
messages.append({'role': 'assistant', 'content': response})
query = '这有什么好吃的?'
messages.append({'role': 'user', 'content': query})
stream_resp = client.chat.completions.create(
    model=model_type,
    messages=messages,
    stream=True,
    seed=42)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
    # delta.content can be None on some chunks; print an empty string instead
    print(chunk.choices[0].delta.content or '', end='', flush=True)
print()
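The Swift client carries conversation context as a list of `(query, response)` pairs, while the OpenAI client expects a flat `messages` list. A minimal sketch of converting the first shape into the second; the function name `history_to_messages` is hypothetical:

```python
def history_to_messages(history, query, system=None):
    """Convert [(query, response), ...] plus a new query into OpenAI-style messages."""
    messages = []
    if system is not None:
        messages.append({'role': 'system', 'content': system})
    for past_query, past_response in history:
        messages.append({'role': 'user', 'content': past_query})
        messages.append({'role': 'assistant', 'content': past_response})
    # The new question always goes last, as a user turn.
    messages.append({'role': 'user', 'content': query})
    return messages


if __name__ == '__main__':
    history = [('浙江的省会在哪里?', '浙江的省会是杭州。')]
    for message in history_to_messages(history, '这有什么好吃的?'):
        print(message)
```

The resulting list can be passed directly as the `messages` argument of `client.chat.completions.create`.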
Server (base model qwen-7b, served through the completions endpoint):
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-7b
# Multi-GPU deployment
RAY_memory_monitor_refresh_ms=0 CUDA_VISIBLE_DEVICES=0,1,2,3 swift deploy --model_type qwen-7b --tensor_parallel_size 4
Client:
Test with curl:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-7b",
"prompt": "浙江 -> 杭州\n安徽 -> 合肥\n四川 ->",
"max_tokens": 32,
"temperature": 0.1,
"seed": 42
}'
Using Swift:
from swift.llm import get_model_list_client, XRequestConfig, inference_client

model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')

query = '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'
request_config = XRequestConfig(max_tokens=32, temperature=0.1, seed=42)
resp = inference_client(model_type, query, request_config=request_config)
response = resp.choices[0].text
print(f'query: {query}')
print(f'response: {response}')

# Streaming: reuse the same config with stream enabled
request_config.stream = True
stream_resp = inference_client(model_type, query, request_config=request_config)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
    print(chunk.choices[0].text, end='', flush=True)
print()
Using OpenAI:
from openai import OpenAI

client = OpenAI(
    api_key='EMPTY',
    base_url='http://localhost:8000/v1',
)
model_type = client.models.list().data[0].id
print(f'model_type: {model_type}')

query = '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'
kwargs = {'model': model_type, 'prompt': query, 'seed': 42, 'temperature': 0.1, 'max_tokens': 32}
resp = client.completions.create(**kwargs)
response = resp.choices[0].text
print(f'query: {query}')
print(f'response: {response}')

# Streaming
stream_resp = client.completions.create(stream=True, **kwargs)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
    print(chunk.choices[0].text, end='', flush=True)
print()
Server:
# Merge the LoRA incremental weights and deploy
# If you need quantization, specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged'
The client example code is the same as for the original model.
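Deployment with the pt backend now supports multi-LoRA serving with peft>=0.10.0. To use it, make sure merge_lora is False when deploying, and pass the --lora_modules parameter (see the command-line documentation). For example: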
# Suppose a LoRA model named 卡卡罗特 was trained from llama3-8b-instruct
# Server
swift deploy --ckpt_dir /mnt/ckpt-1000 --infer_backend pt --lora_modules my_tuner=/mnt/my-tuner
# Two tuners are loaded: `default-lora` from `/mnt/ckpt-1000` and `my_tuner` from `/mnt/my-tuner`
# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "my_tuner",
"messages": [{"role": "user", "content": "who are you?"}],
"max_tokens": 256,
"temperature": 0
}'
# resp: 我是卡卡罗特...
# If you specify model='llama3-8b-instruct' instead, the reply is I'm llama3..., i.e. the original model's answer
Note:
If --ckpt_dir points to a LoRA path, that LoRA is loaded as the default-lora tuner; any other tuners must be loaded explicitly through --lora_modules.
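The routing rule can be pictured as a lookup from the request's `model` field to a tuner. The following is a toy sketch of that idea, not swift's actual implementation; `parse_lora_modules` and `resolve_tuner` are hypothetical names, and the `name=path` syntax follows the `--lora_modules` example above:

```python
def parse_lora_modules(specs):
    """Parse ['name=path', ...] into {name: path}."""
    modules = {}
    for spec in specs:
        name, sep, path = spec.partition('=')
        if not sep or not name or not path:
            raise ValueError(f'expected name=path, got {spec!r}')
        modules[name] = path
    return modules


def resolve_tuner(model_field, base_model, ckpt_dir, lora_modules):
    """Return the LoRA path to apply for a request, or None for the base model."""
    # The checkpoint passed via --ckpt_dir is registered as `default-lora`.
    tuners = {'default-lora': ckpt_dir, **lora_modules}
    if model_field == base_model:
        return None
    if model_field in tuners:
        return tuners[model_field]
    available = [base_model, *tuners]
    raise KeyError(f'unknown model {model_field!r}; available: {available}')


if __name__ == '__main__':
    modules = parse_lora_modules(['my_tuner=/mnt/my-tuner'])
    for name in ['default-lora', 'my_tuner', 'llama3-8b-instruct']:
        print(name, '->', resolve_tuner(name, 'llama3-8b-instruct', '/mnt/ckpt-1000', modules))
```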
VLLM with LoRA: for the list of supported models, see the official documentation.
# Experimental environment: 4 * A100
# 4 * 30GB GPU memory
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
swift sft \
--model_type llama2-7b-chat \
--dataset sharegpt-gpt4-mini \
--train_dataset_sample 1000 \
--logging_steps 5 \
--max_length 4096 \
--learning_rate 5e-5 \
--warmup_ratio 0.4 \
--output_dir output \
--lora_target_modules ALL \
--self_cognition_sample 500 \
--model_name 小黄 'Xiao Huang' \
--model_author 魔搭 ModelScope
Convert the LoRA weights from swift format to peft format:
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir output/llama2-7b-chat/vx-xxx/checkpoint-xxx \
--to_peft_format true
Inference:
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/llama2-7b-chat/vx-xxx/checkpoint-xxx-peft \
--infer_backend vllm \
--vllm_enable_lora true
Single-sample inference:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_stream_vllm, LoRARequest, inference_vllm
)

lora_checkpoint = 'output/llama2-7b-chat/vx-xxx/checkpoint-xxx-peft'
lora_request = LoRARequest('default-lora', 1, lora_checkpoint)

model_type = ModelType.llama2_7b_chat
llm_engine = get_vllm_engine(model_type, torch.float16, enable_lora=True,
                             max_loras=1, max_lora_rank=16)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
llm_engine.generation_config.max_new_tokens = 256

# use lora
request_list = [{'query': 'who are you?'}]
query = request_list[0]['query']
resp_list = inference_vllm(llm_engine, template, request_list, lora_request=lora_request)
response = resp_list[0]['response']
print(f'query: {query}')
print(f'response: {response}')

# no lora
gen = inference_stream_vllm(llm_engine, template, request_list)
query = request_list[0]['query']
print(f'query: {query}\nresponse: ', end='')
print_idx = 0
for resp_list in gen:
    response = resp_list[0]['response']
    # Print only the text generated since the previous iteration
    print(response[print_idx:], end='', flush=True)
    print_idx = len(response)
print()
Server:
CUDA_VISIBLE_DEVICES=0 swift deploy \
--ckpt_dir output/llama2-7b-chat/vx-xxx/checkpoint-xxx-peft \
--infer_backend vllm \
--vllm_enable_lora true
Client:
Test with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default-lora",
"messages": [{"role": "user", "content": "who are you?"}],
"max_tokens": 256,
"temperature": 0
}'
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama2-7b-chat",
"messages": [{"role": "user", "content": "who are you?"}],
"max_tokens": 256,
"temperature": 0
}'
Using OpenAI:
from openai import OpenAI

client = OpenAI(
    api_key='EMPTY',
    base_url='http://localhost:8000/v1',
)
model_type_list = [model.id for model in client.models.list().data]
print(f'model_type_list: {model_type_list}')

query = 'who are you?'
messages = [{
    'role': 'user',
    'content': query
}]
# Request the LoRA tuner by name
resp = client.chat.completions.create(
    model='default-lora',
    messages=messages,
    seed=42)
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

# Streaming, against the base model without LoRA
stream_resp = client.chat.completions.create(
    model='llama2-7b-chat',
    messages=messages,
    stream=True,
    seed=42)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
    # delta.content can be None on some chunks; print an empty string instead
    print(chunk.choices[0].delta.content or '', end='', flush=True)
print()
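When streaming through the OpenAI client, a chunk's `delta.content` can be `None` (for example on the final chunk), so it is worth guarding before printing or concatenating. A minimal sketch that collects a stream into the full reply; the chunks are built with `SimpleNamespace` as hypothetical stand-ins so the snippet runs without a server:

```python
from types import SimpleNamespace


def make_chunk(content):
    """Hypothetical stand-in mimicking the shape of a streamed chat chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(stream):
    """Concatenate delta.content across chunks, skipping empty or None deltas."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return ''.join(parts)


if __name__ == '__main__':
    fake = [make_chunk('I am '), make_chunk(None), make_chunk('Xiao Huang.')]
    print(collect_stream(fake))
```

With a live server, the `stream_resp` object returned by `client.chat.completions.create(..., stream=True)` can be passed to `collect_stream` in place of the fake list.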
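This article walked through using the MS-SWIFT framework together with the VLLM engine to accelerate large language model inference and deploy it as a service. Following the steps above, developers can cover the whole workflow: environment setup, basic inference, streaming output, multi-GPU deployment, and serving LoRA fine-tuned models. VLLM supplies the high-throughput inference backend, while SWIFT's unified interface simplifies adapting different models such as Qwen and ChatGLM. In production, choose the tensor parallel size to match the available hardware, and consider LoRA for serving several fine-tuned variants on top of one base model.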
