A Pitfall Guide to Deploying DeepSeek R1 with vLLM 0.7.1 | 极客日志

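Deploying DeepSeek R1 with vLLM 0.7.1 involves a multi-node, multi-GPU environment. This guide focuses on three problems: the Ray cluster custom-resource error, an OpenCV version conflict, and a pynvml type error. The Ray cluster is brought up either by setting an environment variable or by modifying the startup logic, with the weights on a shared storage path; the service is then started with vllm serve and the inference output is verified. Supplementary suggestions cover GPU memory management, network bandwidth optimization, and log monitoring, to improve inference efficiency and stability in production.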

樱花落尽 · Published 2025/2/6 · Updated 2026/6/221 views

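vLLM recently released PP (Pipeline Parallelism) support for DeepSeek R1, which makes this a good moment for some technical groundwork: if a very large MoE model you trained ever goes into production, you will need it. Here are the pitfalls I ran into, hopefully explained in plainer language than the official documentation.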

Distributed Inference and Serving: https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes

Step 0 Prepare weights & Environment

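The weights are so large that a direct download is not recommended even on a fast connection: it will most likely time out or saturate your public IP. Fetch a copy from HF and/or through a proxy first. Today's walkthrough is a multi-node, multi-GPU deployment on 8xH20 (x2), which corresponds to TP size 8 and PP size 2, so you need two such machines. One assumption: the two machines can reach each other over the network (IB is not required) and they share storage (NAS or OSS both work). Once this is ready, move on to the first step.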
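The sizing above (TP size 8, PP size 2, two 8-GPU machines) can be sanity-checked before anything is launched. The following is a minimal sketch, not part of vLLM; the class and function names are made up for illustration, and the only rule it encodes is that the number of GPU workers equals TP size times PP size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper describing a TP x PP layout."""
    tensor_parallel_size: int
    pipeline_parallel_size: int

    @property
    def world_size(self) -> int:
        # One GPU worker per (pipeline stage, tensor shard) pair.
        return self.tensor_parallel_size * self.pipeline_parallel_size


def fits_cluster(layout: ParallelLayout, nodes: int, gpus_per_node: int) -> bool:
    """Return True when the cluster has at least as many GPUs as the layout needs."""
    return layout.world_size <= nodes * gpus_per_node


layout = ParallelLayout(tensor_parallel_size=8, pipeline_parallel_size=2)
print(layout.world_size, fits_cluster(layout, nodes=2, gpus_per_node=8))
```

With TP 8 inside each machine and PP 2 across the two machines, every pipeline stage stays within a single node.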

Step 1 Set up Ray & Cluster

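The official documentation only touches on this part briefly, yet it is where I was stuck the longest. What the documentation asks is that you prepare two nodes and join them into a Ray cluster with the ray start CLI, because vLLM uses that cluster later. There are two pitfalls. The first is that the startup command seems to be slightly off: in my first few attempts I kept hitting a Ray autoscaler error: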

(autoscaler +1m19s) Error: No available node types can fulfill resource request {'node:33.18.26.153': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(autoscaler +1m54s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:33.18.26.153': 0.001}. Add suitable node types to this cluster to resolve this issue.
INFO 02-02 09:39:14 ray_utils.py:212] Waiting for creating a placement group of specs for 150 seconds. specs=[{'node:33.18.26.153': 0.001, 'GPU': 1.0}, ...]. Check `ray status` to see if you have enough resources.

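This looks strange, because the resource vLLM requests from the Ray cluster is a custom resource, 'node:33.18.26.153': 0.001, which can be read as vLLM preferring the driver node. As far as I recall, though, this is something you have to set yourself when starting Ray: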

# Get local IP address and set on every node before Ray start
VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
export VLLM_HOST_IP
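To see whether the custom resource named in the error is actually registered, one option is to look for the node:<ip> key in the cluster's resource dictionary. The sketch below is an illustration only: the function name is made up, and it works on a plain dict so it can run without a cluster. On a live cluster the dict would come from ray.cluster_resources().

```python
def missing_node_resources(cluster_resources: dict, node_ips: list) -> list:
    """Return the IPs whose 'node:<ip>' custom resource is not in the cluster."""
    return [ip for ip in node_ips if f"node:{ip}" not in cluster_resources]


# Example dict shaped like a Ray resource listing; the values are illustrative.
resources = {"GPU": 16.0, "CPU": 192.0, "node:33.18.26.153": 1.0}
print(missing_node_resources(resources, ["33.18.26.153", "33.18.26.154"]))
```

An IP that shows up in the returned list is one for which a placement request like the one in the log above cannot be satisfied.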

# Alternative to the environment variable: register the node:<ip> custom
# resource explicitly when starting Ray on each node.
import json
import shlex
import socket
import subprocess
import time

import ray


def execute(cmd, check=False, retry=1):
    """Run a shell command and return (succeeded, stdout or stderr)."""
    # With check=True a non-zero exit raises CalledProcessError, so the
    # retry branch below only applies when check=False.
    ret = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=check)
    state = ret.returncode == 0
    msg = ret.stdout if state else ret.stderr
    if not state and retry > 1:
        print(f"execute {cmd} got error {msg}, retry...")
        time.sleep(1)
        return execute(cmd, check, retry - 1)
    return state, msg


def get_actual_ip():
    """Get the actual IP address of the current machine."""
    try:
        # A UDP connect() sends no packets; it only selects the outbound interface.
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(('8.8.8.8', 80))
        ip = s.getsockname()[0]
        s.close()
        return ip
    except Exception:
        # Fallback to hostname-based IP resolution
        return socket.gethostbyname(socket.gethostname())


def start_ray_cluster():
    # get_free_ports, get_master_addr, get_rank and get_addr are the author's
    # own helpers; their definitions are not included in this article.
    free_ports = get_free_ports()
    port = free_ports[0]
    node_manager_port = free_ports[1]
    master_addr = get_master_addr()
    rank = get_rank()
    node_ip = get_actual_ip()

    # Define the custom resource based on the node IP. Build the JSON with
    # json.dumps and shell-quote it; writing the braces directly inside an
    # f-string would make Python parse them as a replacement field.
    resources = json.dumps({f"node:{node_ip}": 1})
    resource_spec = f"--resources={shlex.quote(resources)}"

    if rank == 0:
        cmd = f"ray start --head --port={port} --node-ip-address={master_addr} --node-manager-port {node_manager_port} --node-name={master_addr} {resource_spec}"
    else:
        cmd = f"ray start --address={master_addr}:{port} --node-manager-port {node_manager_port} --node-name={get_addr()} {resource_spec}"

    if ray.is_initialized():
        print("Ray is already initialized, skipping node level init.")
    else:
        stop_cmd = "ray stop"
        execute(stop_cmd, check=True)
        print(f"Executing Ray start command: {cmd}")
        execute(cmd, check=True)
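start_ray_cluster depends on four helpers that the article does not show. The following is one possible sketch of them, not the author's implementation. It assumes torchrun-style RANK and MASTER_ADDR environment variables, and it reads the ports from two hypothetical variables (RAY_HEAD_PORT, RAY_NODE_MANAGER_PORT) so that every node agrees on the head port. The default 6380 is an arbitrary choice for illustration.

```python
import os
import socket


def get_rank() -> int:
    """Node rank; assumes a launcher that exports RANK (0 is the head node)."""
    return int(os.environ.get("RANK", "0"))


def get_master_addr() -> str:
    """Head node address; assumes a launcher that exports MASTER_ADDR."""
    return os.environ.get("MASTER_ADDR", "127.0.0.1")


def get_free_ports() -> list:
    """Return [head_port, node_manager_port].

    The values come from hypothetical environment variables rather than from
    probing the local machine, because workers must use the same head port
    as the head node.
    """
    head_port = int(os.environ.get("RAY_HEAD_PORT", "6379"))
    node_manager_port = int(os.environ.get("RAY_NODE_MANAGER_PORT", "6380"))
    return [head_port, node_manager_port]


def get_addr() -> str:
    """Best-effort address of the current node, used as the Ray node name."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return "127.0.0.1"
```

If the launcher already provides these values in another form, the helpers would simply wrap that source instead.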

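Step 2 Other small bugs

An OpenCV version conflict shows up as:

AttributeError: module 'cv2.dnn' has no attribute 'DictValue'

Pinning opencv-python-headless fixes it:

pip install opencv-python-headless==4.5.4.58

A pynvml type error shows up as:

TypeError: a bytes-like object is required, not 'str'

Upgrading pynvml fixes it:

pip install pynvml -U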
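Since both problems above come down to package versions, it can help to print what is installed on each node before starting the service. This is a small sketch using only the standard library; the function names and the list of distributions checked are illustrative.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_report(dist_names):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in dist_names}


for name, version in version_report(
    ["vllm", "ray", "opencv-python-headless", "pynvml"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

Running it on both machines makes version drift between the nodes easy to spot.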

Step 3 Run the model

vllm serve /your/path/to_checkpoint_deepseek-r1/ --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --host 0.0.0.0
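Loading a model of this size takes a while, so a client that connects too early just gets a connection error. Below is a minimal sketch of a readiness poll, assuming the server listens on port 8000 and answers on /v1/models, the same endpoint the client code uses to list models. The function name and the timeout values are illustrative.

```python
import time
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll url until it answers HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeouts and HTTP errors all land here
            # (URLError is a subclass of OSError); keep polling.
            pass
        time.sleep(interval_s)
    return False


# Example call (commented out so importing this file does not block):
# wait_until_ready("http://localhost:8000/v1/models")
```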

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)

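# reasoning_content may be missing or None unless the server was started with
# reasoning output enabled, so read it defensively.
reasoning_content = getattr(response.choices[0].message, "reasoning_content", None)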
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)
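When reasoning_content is not available, the reasoning usually sits inside the regular content, wrapped in think tags. The following is a small sketch of a client-side split, assuming the model closes its reasoning with </think>, as DeepSeek R1 does. The function name is made up for illustration.

```python
def split_reasoning(text: str):
    """Split raw model output into (reasoning, answer).

    Returns (None, text) when no closing </think> tag is present.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        return None, text.strip()
    # The opening tag may be absent if the chat template already emitted it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


raw = "<think>9.8 is 9.80, and 9.80 > 9.11</think>\n9.8 is greater."
print(split_reasoning(raw))
```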

INFO 02-02 14:18:52 metrics.py:453] Avg prompt throughput: 1.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:07 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
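For basic log monitoring, the throughput figures can be pulled out of these metrics lines with a regular expression. The sketch below assumes the line format shown above and would need adjusting if the format changes between vLLM versions. The function name and the dictionary keys are illustrative.

```python
import re

_METRICS_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<generation>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs"
)


def parse_metrics_line(line: str):
    """Extract throughput and running-request count, or None if no match."""
    m = _METRICS_RE.search(line)
    if m is None:
        return None
    return {
        "prompt_tokens_per_s": float(m["prompt"]),
        "generation_tokens_per_s": float(m["generation"]),
        "running_requests": int(m["running"]),
    }


line = (
    "INFO 02-02 14:19:07 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, "
    "Avg generation throughput: 20.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs"
)
print(parse_metrics_line(line))
```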
