跳到主要内容基于 SWIFT 的 VLLM 推理加速与部署实战 | 极客日志PythonAI算法
基于 SWIFT 的 VLLM 推理加速与部署实战
综述由AI生成基于 MS-SWIFT 框架和 VLLM 引擎的大语言模型推理加速与部署方案。内容涵盖环境配置、Qwen 与 ChatGLM 模型的推理与流式输出、CLI 工具使用、微调后模型的合并与部署、Web UI 搭建以及多 LoRA 部署策略。重点展示了如何通过 Python 脚本和 Shell 命令调用 VLLM 后端,兼容 OpenAI API 标准,实现高吞吐量的模型服务化。
山野来信20 浏览 基于 SWIFT 的 VLLM 推理加速与部署实战
1. 环境准备
支持 GPU 设备:A10, 3090, V100, A100 等。
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip install 'ms-swift[llm]' -U
pip install vllm -U
pip install openai -U
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
2. 推理加速
2.1 Qwen-7B-Chat
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
ModelType, get_vllm_engine, get_default_template_type,
get_template, inference_vllm
)
model_type = ModelType.qwen_7b_chat
llm_engine = get_vllm_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
llm_engine.generation_config.max_new_tokens = 256
request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
print(f"query: {request['query']}")
print(f"response: {resp['response']}")
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', : history1}]
resp_list = inference_vllm(llm_engine, template, request_list)
request, resp (request_list, resp_list):
()
()
()
'history'
for
in
zip
print
f"query: {request['query']}"
print
f"response: {resp['response']}"
print
f"history: {resp['history']}"
2.2 流式输出
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
ModelType, get_vllm_engine, get_default_template_type,
get_template, inference_stream_vllm
)
model_type = ModelType.qwen_7b_chat
llm_engine = get_vllm_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
llm_engine.generation_config.max_new_tokens = 256
request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
gen = inference_stream_vllm(llm_engine, template, request_list)
query_list = [request['query'] for request in request_list]
print(f"query_list: {query_list}")
for resp_list in gen:
response_list = [resp['response'] for resp in resp_list]
print(f'response_list: {response_list}')
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
gen = inference_stream_vllm(llm_engine, template, request_list)
query = request_list[0]['query']
print(f"query: {query}")
for resp_list in gen:
response = resp_list[0]['response']
print(f'response: {response}')
history = resp_list[0]['history']
print(f'history: {history}')
2.3 ChatGLM3
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
ModelType, get_vllm_engine, get_default_template_type,
get_template, inference_vllm
)
model_type = ModelType.chatglm3_6b
llm_engine = get_vllm_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
llm_engine.generation_config.max_new_tokens = 256
request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
print(f"query: {request['query']}")
print(f"response: {resp['response']}")
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
print(f"query: {request['query']}")
print(f"response: {resp['response']}")
print(f"history: {resp['history']}")
2.4 使用 CLI
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-7b-chat --infer_backend vllm
CUDA_VISIBLE_DEVICES=0 swift infer --model_type yi-6b-chat --infer_backend vllm
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-7b-chat-int4 --infer_backend vllm
2.5 微调后的模型
单样本推理:
使用 LoRA 进行微调的模型你需要先 merge-lora,产生完整的 checkpoint 目录。
使用全参数微调的模型可以无缝使用 VLLM 进行推理加速。
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
ModelType, get_vllm_engine, get_default_template_type,
get_template, inference_vllm
)
ckpt_dir = 'vx-xxx/checkpoint-100-merged'
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
llm_engine = get_vllm_engine(model_type, model_id_or_path=ckpt_dir)
tokenizer = llm_engine.hf_tokenizer
template = get_template(template_type, tokenizer)
query = '你好'
resp = inference_vllm(llm_engine, template, [{'query': query}])[0]
print(f"response: {resp['response']}")
print(f"history: {resp['history']}")
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' \
--infer_backend vllm \
--load_dataset_config true
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' \
--infer_backend vllm
3. Web-UI 加速
3.1 原始模型
CUDA_VISIBLE_DEVICES=0 swift app-ui --model_type qwen-7b-chat --infer_backend vllm
3.2 微调后模型
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' --infer_backend vllm
4. 部署
swift 使用 VLLM 作为推理后端,并兼容 OpenAI 的 API 样式。
服务端的部署命令行参数可以参考相关文档。
客户端的 OpenAI 的 API 参数可以参考 OpenAI 官方文档。
4.1 原始模型
Qwen-7B-Chat
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-7b-chat
RAY_memory_monitor_refresh_ms=0 CUDA_VISIBLE_DEVICES=0,1,2,3 swift deploy --model_type qwen-7b-chat --tensor_parallel_size 4
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-7b-chat",
"messages": [{"role": "user", "content": "晚上睡不着觉怎么办?"}],
"max_tokens": 256,
"temperature": 0
}'
from swift.llm import get_model_list_client, XRequestConfig, inference_client
model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')
query = '浙江的省会在哪里?'
request_config = XRequestConfig(seed=42)
resp = inference_client(model_type, query, request_config=request_config)
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')
history = [(query, response)]
query = '这有什么好吃的?'
request_config = XRequestConfig(stream=True, seed=42)
stream_resp = inference_client(model_type, query, history, request_config=request_config)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
print(chunk.choices[0].delta.content, end='', flush=True)
print()
from openai import OpenAI
client = OpenAI(
api_key='EMPTY',
base_url='http://localhost:8000/v1',
)
model_type = client.models.list().data[0].id
print(f'model_type: {model_type}')
query = '浙江的省会在哪里?'
messages = [{
'role': 'user',
'content': query
}]
resp = client.chat.completions.create(
model=model_type,
messages=messages,
seed=42)
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')
messages.append({'role': 'assistant', 'content': response})
query = '这有什么好吃的?'
messages.append({'role': 'user', 'content': query})
stream_resp = client.chat.completions.create(
model=model_type,
messages=messages,
stream=True,
seed=42)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
print(chunk.choices[0].delta.content, end='', flush=True)
print()
Qwen-7B
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-7b
RAY_memory_monitor_refresh_ms=0 CUDA_VISIBLE_DEVICES=0,1,2,3 swift deploy --model_type qwen-7b --tensor_parallel_size 4
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-7b",
"prompt": "浙江 -> 杭州\n安徽 -> 合肥\n四川 ->",
"max_tokens": 32,
"temperature": 0.1,
"seed": 42
}'
from swift.llm import get_model_list_client, XRequestConfig, inference_client
model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')
query = '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'
request_config = XRequestConfig(max_tokens=32, temperature=0.1, seed=42)
resp = inference_client(model_type, query, request_config=request_config)
response = resp.choices[0].text
print(f'query: {query}')
print(f'response: {response}')
request_config.stream = True
stream_resp = inference_client(model_type, query, request_config=request_config)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
print(chunk.choices[0].text, end='', flush=True)
print()
from openai import OpenAI
client = OpenAI(
api_key='EMPTY',
base_url='http://localhost:8000/v1',
)
model_type = client.models.list().data[0].id
print(f'model_type: {model_type}')
query = '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'
kwargs = {'model': model_type, 'prompt': query, 'seed': 42, 'temperature': 0.1, 'max_tokens': 32}
resp = client.completions.create(**kwargs)
response = resp.choices[0].text
print(f'query: {query}')
print(f'response: {response}')
stream_resp = client.completions.create(stream=True, **kwargs)
response = resp.choices[0].text
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
print(chunk.choices[0].text, end='', flush=True)
print()
4.2 微调后模型
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged'
4.3 多 LoRA 部署
目前 pt 方式部署模型已经支持 peft>=0.10.0 进行多 LoRA 部署,具体方法为:
- 确保部署时
merge_lora 为 False
- 使用
--lora_modules 参数,可以查看命令行文档
- 推理时指定 lora tuner 的名字到模型字段
swift deploy --ckpt_dir /mnt/ckpt-1000 --infer_backend pt --lora_modules my_tuner=/mnt/my-tuner
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "my-tuner",
"messages": [{"role": "user", "content": "who are you?"}],
"max_tokens": 256,
"temperature": 0
}'
注意:--ckpt_dir 参数如果是个 lora 路径,则原来的 default 会被加载到 default-lora 的 tuner 上,其他的 tuner 需要通过 lora_modules 自行加载。
5. VLLM & LoRA
VLLM & LoRA 支持的模型可以查看官方文档。
5.1 准备 LoRA
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
swift sft \
--model_type llama2-7b-chat \
--dataset sharegpt-gpt4-mini \
--train_dataset_sample 1000 \
--logging_steps 5 \
--max_length 4096 \
--learning_rate 5e-5 \
--warmup_ratio 0.4 \
--output_dir output \
--lora_target_modules ALL \
--self_cognition_sample 500 \
--model_name 小黄 'Xiao Huang' \
--model_author 魔搭 ModelScope
将 lora 从 swift 格式转换成 peft 格式:
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir output/llama2-7b-chat/vx-xxx/checkpoint-xxx \
--to_peft_format true
5.2 VLLM 推理加速
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/llama2-7b-chat/vx-xxx/checkpoint-xxx-peft \
--infer_backend vllm \
--vllm_enable_lora true
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
import torch
from swift.llm import (
ModelType, get_vllm_engine, get_default_template_type,
get_template, inference_stream_vllm, LoRARequest, inference_vllm
)
lora_checkpoint = 'output/llama2-7b-chat/vx-xxx/checkpoint-xxx-peft'
lora_request = LoRARequest('default-lora', 1, lora_checkpoint)
model_type = ModelType.llama2_7b_chat
llm_engine = get_vllm_engine(model_type, torch.float16, enable_lora=True,
max_loras=1, max_lora_rank=16)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
llm_engine.generation_config.max_new_tokens = 256
request_list = [{'query': 'who are you?'}]
query = request_list[0]['query']
resp_list = inference_vllm(llm_engine, template, request_list, lora_request=lora_request)
response = resp_list[0]['response']
print(f'query: {query}')
print(f'response: {response}')
gen = inference_stream_vllm(llm_engine, template, request_list)
query = request_list[0]['query']
print(f'query: {query}\nresponse: ', end='')
print_idx = 0
for resp_list in gen:
response = resp_list[0]['response']
print(response[print_idx:], end='', flush=True)
print_idx = len(response)
print()
5.3 部署
CUDA_VISIBLE_DEVICES=0 swift deploy \
--ckpt_dir output/llama2-7b-chat/vx-xxx/checkpoint-xxx-peft \
--infer_backend vllm \
--vllm_enable_lora true
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default-lora",
"messages": [{"role": "user", "content": "who are you?"}],
"max_tokens": 256,
"temperature": 0
}'
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama2-7b-chat",
"messages": [{"role": "user", "content": "who are you?"}],
"max_tokens": 256,
"temperature": 0
}'
from openai import OpenAI
client = OpenAI(
api_key='EMPTY',
base_url='http://localhost:8000/v1',
)
model_type_list = [model.id for model in client.models.list().data]
print(f'model_type_list: {model_type_list}')
query = 'who are you?'
messages = [{
'role': 'user',
'content': query
}]
resp = client.chat.completions.create(
model='default-lora',
messages=messages,
seed=42)
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')
stream_resp = client.chat.completions.create(
model='llama2-7b-chat',
messages=messages,
stream=True,
seed=42)
print(f'query: {query}')
print('response: ', end='')
for chunk in stream_resp:
print(chunk.choices[0].delta.content, end='', flush=True)
print()
总结
本文详细介绍了如何使用 MS-SWIFT 框架结合 VLLM 引擎进行大语言模型的推理加速与生产部署。通过上述步骤,开发者可以实现从环境配置、基础推理、流式输出到多卡部署及 LoRA 微调模型集成的完整工作流。VLLM 的高吞吐量特性显著降低了推理延迟,而 SWIFT 提供的统一接口简化了不同模型(如 Qwen、ChatGLM)的适配过程。在生产环境中,建议根据硬件资源合理配置 Tensor Parallelism,并利用 LoRA 技术实现高效的多租户模型管理。
相关免费在线工具
- 加密/解密文本
使用加密算法(如AES、TripleDES、Rabbit或RC4)加密和解密文本明文。 在线工具,加密/解密文本在线工具,online
- RSA密钥对生成器
生成新的随机RSA私钥和公钥pem证书。 在线工具,RSA密钥对生成器在线工具,online
- Mermaid 预览与可视化编辑
基于 Mermaid.js 实时预览流程图、时序图等图表,支持源码编辑与即时渲染。 在线工具,Mermaid 预览与可视化编辑在线工具,online
- 随机西班牙地址生成器
随机生成西班牙地址(支持马德里、加泰罗尼亚、安达卢西亚、瓦伦西亚筛选),支持数量快捷选择、显示全部与下载。 在线工具,随机西班牙地址生成器在线工具,online
- Gemini 图片去水印
基于开源反向 Alpha 混合算法去除 Gemini/Nano Banana 图片水印,支持批量处理与下载。 在线工具,Gemini 图片去水印在线工具,online
- curl 转代码
解析常见 curl 参数并生成 fetch、axios、PHP curl 或 Python requests 示例代码。 在线工具,curl 转代码在线工具,online