Building a Private Enterprise RAG LLM: Qwen2.5 and vLLM Deployment in Practice


conda create -n vllm_qwen python=3.10
conda activate vllm_qwen
# Upgrade pip
python -m pip install --upgrade pip
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install vllm
pip install "modelscope[framework]"

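Download with ModelScope. Run the following command inside /qwen to download the model there (Git LFS must be installed so that the weight files are fetched):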
```
git clone https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct.git
```
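Download from the Hugging Face mirror for mainland China. Run the following command inside /qwen to download the model there (Git LFS is required here as well):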
```
git clone https://hf-mirror.com/Qwen/Qwen2.5-7B-Instruct
```

qwen/Qwen2.5-7B-Instruct/
|-- LICENSE
|-- README.md
|-- config.json
|-- configuration.json
|-- generation_config.json
|-- merges.txt
|-- model-00001-of-00004.safetensors
|-- model-00002-of-00004.safetensors
|-- model-00003-of-00004.safetensors
|-- model-00004-of-00004.safetensors
|-- model.safetensors.index.json
|-- tokenizer.json
|-- tokenizer_config.json
`-- vocab.json
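
A quick way to confirm that the download matches the listing above is to compare the shard files on disk with the ones named in `model.safetensors.index.json`. The following is a minimal sketch, not part of the original walkthrough: the helper name `incomplete_shards` and the `min_bytes` threshold are illustrative assumptions. The size check is there because a clone made without Git LFS leaves tiny pointer files in place of the real weights.

```python
import json
from pathlib import Path


def incomplete_shards(model_dir, min_bytes=1_000_000):
    """Return shard files named in the index that are missing or suspiciously small.

    min_bytes is an arbitrary heuristic (an assumption of this sketch): real
    weight shards are gigabytes in size, while Git LFS pointer files are tiny.
    """
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    # "weight_map" maps each tensor name to the shard file that stores it.
    expected = set(index["weight_map"].values())
    bad = []
    for name in sorted(expected):
        path = model_dir / name
        if not path.is_file() or path.stat().st_size < min_bytes:
            bad.append(name)
    return bad
```

Calling `incomplete_shards("/qwen/Qwen2.5-7B-Instruct")` should return an empty list when all four shards are present and fully downloaded.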

/qwen/
|-- Qwen2.5-7B-Instruct/
|   |-- ... (模型文件)
`-- vllm_run.py

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Context window and tensor-parallel degree (1 = single GPU).
max_model_len, tp_size = 2048, 1
# Local model directory, relative to /qwen where this script lives.
model_name = "./Qwen2.5-7B-Instruct"
# Prompt: "Hello, tell me who you are."
prompt = [{"role": "user", "content": "你好，讲讲你是谁？"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,  # skip CUDA graph capture to save memory and startup time
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048
)
# Stop at the model's end-of-turn token, read from the tokenizer rather than hard-coded.
stop_token_ids = [tokenizer.eos_token_id]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

# Render the chat messages into the prompt string the model was trained on.
inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
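
To make the `apply_chat_template` step less opaque, here is a rough pure-Python approximation of the ChatML layout that Qwen2.5's template produces for a plain conversation. It is a simplified sketch under stated assumptions: the real template ships in `tokenizer_config.json` and also handles tool calls, and both the function name `render_chatml` and the default system prompt used here are assumptions based on the system message shown elsewhere in this article.

```python
# Assumed default system prompt, inserted when the conversation has none.
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=True):
    """Approximate the ChatML text for a plain (tool-free) conversation."""
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # An open assistant turn tells the model it is its turn to speak.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "你好，讲讲你是谁？"}]))
```

The closing `<|im_end|>` marker is also why the script stops generation on the tokenizer's end-of-turn token.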

# Optionally pin a GPU first, e.g. export CUDA_VISIBLE_DEVICES=3; if unset, GPU 0 is used by default.
python vllm_run.py

INFO 01-11 04:21:12 model_runner.py:1099] Loading model weights took 14.2487 GB
INFO 01-11 04:21:13 worker.py:241] Memory profiling takes 0.69 seconds
INFO 01-11 04:21:13 worker.py:241] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.28GiB
INFO 01-11 04:21:13 worker.py:241] model weights take 14.25GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 1.40GiB; the rest of the memory reserved for KV Cache is 5.51GiB.
INFO 01-11 04:21:13 gpu_executor.py:76] # GPU blocks: 6443, # CPU blocks: 4681
INFO 01-11 04:21:13 gpu_executor.py:80] Maximum concurrency for 2048 tokens per request: 50.34x
INFO 01-11 04:21:17 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 4.89 seconds
Processed prompts: 100%|█████████████████████████████████████████| 1/1 [00:00<00:00,  1.01it/s, est. speed input: 36.52 toks/s, output: 53.76 toks/s]
你好！我是 Qwen，我是由阿里云开发的一种超大规模语言模型。我被设计用来回答问题、提供信息、参与对话，旨在帮助用户获得所需的知识和信息。如果你有任何问题或需要帮助，都可以尝试和我交流。
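
The memory figures in the startup log can be reproduced with a few lines of arithmetic, which is handy when estimating whether a different GPU or `max_model_len` will fit. This is a sketch using the numbers copied from the log; the one value not printed in the log is the KV-cache block size, which is assumed here to be vLLM's default of 16 tokens per block.

```python
# Values copied from the startup log above.
total_gpu_gib = 23.64
gpu_memory_utilization = 0.90
weights_gib = 14.25
non_torch_gib = 0.12
activation_peak_gib = 1.40
num_gpu_blocks = 6443
max_model_len = 2048

# Assumption: vLLM's default KV-cache block size of 16 tokens per block.
block_size = 16

# Memory vLLM may use, and what remains for the KV cache after fixed costs.
budget_gib = total_gpu_gib * gpu_memory_utilization
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib

# Tokens that fit in the KV cache, and how many full-length requests that covers.
kv_cache_tokens = num_gpu_blocks * block_size
max_concurrency = kv_cache_tokens / max_model_len

print(f"memory budget:   {budget_gib:.2f} GiB")
print(f"KV cache:        {kv_cache_gib:.2f} GiB")
print(f"KV cache tokens: {kv_cache_tokens}")
print(f"max concurrency: {max_concurrency:.2f}x")
```

Under that block-size assumption the script reproduces the logged 21.28 GiB budget, 5.51 GiB KV cache, and 50.34x maximum concurrency.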

export CUDA_VISIBLE_DEVICES=3  # select the GPU; GPU 0 is used by default if unset
vllm serve Qwen2.5-7B-Instruct

curl http://localhost:8000/v1/models

{
	"object": "list",
	"data": [{
		"id": "Qwen2.5-7B-Instruct",
		"object": "model",
		"created": 1736570004,
		"owned_by": "vllm",
		"root": "Qwen2.5-7B-Instruct",
		"parent": null,
		"max_model_len": 32768,
		"permission": [{
			"id": "modelperm-62acae496e714754b5d8866fff32f0cb",
			"object": "model_permission",
			"created": 1736570004,
			"allow_create_engine": false,
			"allow_sampling": true,
			"allow_logprobs": true,
			"allow_search_indices": false,
			"allow_view": true,
			"allow_fine_tuning": false,
			"organization": "*",
			"group": null,
			"is_blocking": false
		}]
	}]
}

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "告诉我一些关于大型语言模型的事情。"}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 512
}'
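
The same request can be issued from Python with only the standard library. This is a minimal sketch mirroring the curl call above; the helper names `build_chat_request` and `chat` are illustrative, and the base URL and model name assume the server was started exactly as shown earlier.

```python
import json
import urllib.request


def build_chat_request(user_text, base_url="http://localhost:8000/v1",
                       model="Qwen2.5-7B-Instruct"):
    """Build a POST request carrying the same payload as the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "repetition_penalty": 1.05,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, timeout=60):
    """Send the request to the running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("告诉我一些关于大型语言模型的事情。"))` sends the same question as the curl command.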

{
	"id": "chatcmpl-c774bbba1c5c47579a77dec6ef87d987",
	"object": "chat.completion",
	"created": 1736570396,
	"model": "Qwen2.5-7B-Instruct",
	"choices": [{
		"index": 0,
		"message": {
			"role": "assistant",
			"content": "当然，我很乐意为您介绍一些关于大型语言模型的知识！...",
			"tool_calls": []
		},
		"logprobs": null,
		"finish_reason": "stop",
		"stop_reason": null
	}],
	"usage": {
		"prompt_tokens": 37,
		"total_tokens": 424,
		"completion_tokens": 387,
		"prompt_tokens_details": null
	},
	"prompt_logprobs": null
}
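
When consuming this response in application code, the fields that usually matter are the reply text, the `finish_reason`, and the token usage. The sketch below pulls those out of a response body; the function name `summarize_completion` is illustrative, and the sample is a trimmed copy of the response above.

```python
import json


def summarize_completion(raw):
    """Extract the reply text, finish reason, and token usage from a response body."""
    body = json.loads(raw)
    choice = body["choices"][0]
    usage = body["usage"]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # "length" means generation was cut off by max_tokens rather than ending naturally.
        "truncated": choice["finish_reason"] == "length",
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


# Trimmed copy of the response shown above.
sample = """{
  "choices": [{"index": 0,
               "message": {"role": "assistant", "content": "..."},
               "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 37, "total_tokens": 424, "completion_tokens": 387}
}"""

summary = summarize_completion(sample)
print(summary["finish_reason"], summary["total_tokens"])
```

In the sample, `total_tokens` (424) is the sum of the 37 prompt tokens and 387 completion tokens, and a `finish_reason` of `stop` indicates the answer was not truncated by `max_tokens`.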

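# CUDA runtime base image; the host needs the NVIDIA Container Toolkit to expose GPUs.
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN pip install --no-cache-dir vllm torch transformers
# Copy the build context, which is expected to contain the Qwen2.5-7B-Instruct directory.
COPY . .
# Port of the OpenAI-compatible API server.
EXPOSE 8000
CMD ["vllm", "serve", "Qwen2.5-7B-Instruct", "--host", "0.0.0.0", "--port", "8000"]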

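A Guide to Building a Private Enterprise RAG LLM: Qwen2.5 and vLLM Deployment Example

Background and Architecture

Introduction to Qwen2.5

vLLM Deployment Environment Setup

Hardware and Environment Recommendations

Installing vLLM

Downloading the Model

Code Preparation and Inference

A Simple Example Python File

Building an OpenAI-Compatible API Service

Advanced Deployment and Optimization

Containerized Deployment with Docker

Multi-GPU Parallelism and Load Balancing

Notes on RAG Integration

Summary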
