Deploying Qwen Models with ROCm on AMD AI MAX 395 under Ubuntu
On Ubuntu with AMD AI MAX 395 hardware, this post installs the ROCm 7.0 driver, configures a Docker environment, and locally deploys the Qwen3-32B chat model, an embedding model, and a reranker service. It covers driver installation, offline image import, model download and startup, and building a custom service image, with a focus on ROCm compatibility settings and a workaround for vLLM's missing rerank support, giving a complete reference for running large models locally.
I. Installing the ROCm 7.0 Driver
First make sure the system recognizes the GPU. The official installation guide is the AMD ROCm installation documentation; for other versions, see the AMD repository.
sudo apt update && sudo apt install wget -y
sudo apt autoremove amdgpu-dkms
sudo rm /etc/apt/sources.list.d/amdgpu.list
sudo rm -rf /var/cache/apt/*
sudo apt clean
sudo apt update
wget https://repo.radeon.com/amdgpu-install/7.0.3/ubuntu/jammy/amdgpu-install_7.0.3.70003-1_all.deb
sudo apt install ./amdgpu-install_7.0.3.70003-1_all.deb
sudo apt install python3-setuptools python3-wheel
sudo usermod -aG render,video $LOGNAME
sudo apt install rocm
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo apt install amdgpu-dkms
sudo usermod -aG render $USER
sudo usermod -aG video $USER
reboot
rocminfo | grep gfx
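If you prefer a scripted check over eyeballing the grep output, the gfx targets can be pulled out of the rocminfo text. A small Python sketch (the sample string is illustrative, not real rocminfo output):

```python
import re

def gfx_targets(rocminfo_text: str) -> list[str]:
    """Extract unique gfx ISA names (e.g. gfx1151) from rocminfo output."""
    return sorted(set(re.findall(r"\bgfx\d+\b", rocminfo_text)))

# Illustrative sample; real output comes from `rocminfo`.
sample = """
  Name:                    gfx1151
  Name:                    amdgcn-amd-amdhsa--gfx1151
"""
print(gfx_targets(sample))  # ['gfx1151']
```

An empty list here means the driver did not register the GPU and the install should be revisited before moving on.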
II. Preparing the Docker Environment (vLLM)
1. Install and Configure Docker
sudo apt update -y
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common lrzsz -y
sudo curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
sudo apt update -y
sudo apt-get install docker-ce -y
sudo tee /etc/docker/daemon.json <<-'EOF'
{
  "registry-mirrors": [
    "https://docker.1panel.live",
    "https://hub.rat.dev"
  ]
}
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
2. Pull the vLLM Image
If the host has no internet access, save the image to a USB drive on a connected machine first, then import it locally.
2.1 Save the Image
docker pull rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210
docker save -o vllm_rocm7.tar rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210
2.2 Load the Image
cp /media/user/MyUSB/vllm_rocm7.tar ~/
docker load -i vllm_rocm7.tar
docker images
III. Deploying the Qwen Models
1. Qwen3-32B Chat Model
1.1 Download the Model
mkdir -p /home/user/models/Qwen3-32B-AWQ
export MODEL_DIR=/home/user/models/Qwen3-32B-AWQ
pip3 install modelscope
python3 <<'EOF'
import os
from modelscope.hub.snapshot_download import snapshot_download

model_id = 'Qwen/Qwen3-32B-AWQ'
snapshot_download(
    model_id=model_id,
    cache_dir=os.environ.get('MODEL_DIR'),
    revision='master'
)
EOF
1.2 Start the Model
Note: the GPU in the AMD Ryzen AI Max 395+ is codenamed gfx1151, but some ROCm builds may not fully support it yet, so the target is forced via an environment variable.
docker run -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v /home/user/models/Qwen3-32B-AWQ/Qwen/Qwen3-32B-AWQ:/model \
-e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210 \
vllm serve /model \
--quantization awq \
--dtype float16 \
--served-model-name Qwen3-32B-AWQ \
--trust-remote-code \
--max-model-len 8192
1.3 Verify the Model
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{ "model": "Qwen3-32B-AWQ", "prompt": "你是谁?", "max_tokens": 2000, "temperature": 0.7 }'
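The same request can also be made from Python with just the standard library. The helpers below are hypothetical names mirroring the curl payload; ask() requires the server from step 1.2 to be running:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/completions"  # endpoint from the curl example

def build_request(model: str, prompt: str, max_tokens: int = 2000,
                  temperature: float = 0.7) -> bytes:
    """Build the same JSON body as the curl example above."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }, ensure_ascii=False).encode("utf-8")

def first_completion(response: dict) -> str:
    """Pull the generated text out of an OpenAI-style completions response."""
    return response["choices"][0]["text"]

def ask(prompt: str) -> str:
    """Round-trip against the running vLLM server."""
    req = urllib.request.Request(
        API_URL,
        data=build_request("Qwen3-32B-AWQ", prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return first_completion(json.load(resp))
```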
2. Qwen3-Embedding Vector Model
2.1 Download the Model
python3 <<'EOF'
from modelscope import snapshot_download
snapshot_download('Qwen/Qwen3-Embedding-8B', cache_dir='/home/user/models')
EOF
2.2 Start the Model
docker run -d \
--name vllm-embedding \
--restart=always \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v /home/user/models/Qwen3-Embedding-8B:/model \
-e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210 \
vllm serve /model \
--port 8001 \
--task embed \
--dtype float16 \
--max-model-len 8192 \
--gpu-memory-utilization 0.4 \
--trust-remote-code \
--served-model-name qwen-embedding
2.3 Verify the Model
docker ps
docker logs vllm-embedding
curl http://localhost:8001/v1/embeddings \
-H "Content-Type: application/json" \
-d '{ "model": "qwen-embedding", "input": "你好,测试一下向量化服务" }'
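Once the endpoint responds, a useful smoke test is that embeddings of related sentences score a higher cosine similarity than embeddings of unrelated ones. A standard-library sketch of the comparison (the toy vectors stand in for real model output found at response["data"][0]["embedding"]):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only.
print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```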
3. Qwen3-Reranker Reranking Model
Since vLLM does not currently support the rerank task natively, we need to build a custom service.
3.1 Download the Model
python3 <<'EOF'
from modelscope import snapshot_download
snapshot_download('Qwen/Qwen3-Reranker-8B', cache_dir='/home/user/models')
EOF
3.2 Startup Script and uv-based Dependency Management
mkdir -p /home/user/qwen_project
cd /home/user/qwen_project
nano rerank_service.py
import torch
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/model"
PORT = 8002

app = FastAPI()

print(f"Loading model from {MODEL_PATH} ...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, padding_side='left', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, trust_remote_code=True, device_map="auto",
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).eval()

# The reranker scores each pair by the probability of answering "yes" vs "no".
token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")

# Chat-template scaffolding wrapped around each (query, document) pair.
prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n\n\n"
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
max_length = 8192
print("Model loaded successfully!")


def format_instruction(instruction, query, doc):
    if instruction is None or instruction == "":
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    return "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(
        instruction=instruction, query=query, doc=doc)


def process_inputs(pairs):
    # Tokenize the pair texts, leaving room for the fixed prefix/suffix tokens.
    inputs = tokenizer(pairs, padding=False, truncation='longest_first',
                       return_attention_mask=False,
                       max_length=max_length - len(prefix_tokens) - len(suffix_tokens))
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs


@torch.no_grad()
def compute_scores(inputs):
    # Logits at the final position; a softmax over the "no"/"yes" token pair
    # gives P(yes), which serves as the relevance score.
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores


class RerankRequest(BaseModel):
    model: str = "qwen-reranker"
    query: str
    documents: List[str]
    top_n: Optional[int] = None
    instruction: Optional[str] = None


@app.post("/v1/rerank")
async def rerank(request: RerankRequest):
    try:
        query = request.query
        documents = request.documents
        instruction = request.instruction
        if not documents:
            return {"results": []}
        pairs = [format_instruction(instruction, query, doc) for doc in documents]
        inputs = process_inputs(pairs)
        scores = compute_scores(inputs)
        results = []
        for i, score in enumerate(scores):
            results.append({"index": i, "relevance_score": float(score), "document": documents[i]})
        results.sort(key=lambda x: x["relevance_score"], reverse=True)
        if request.top_n:
            results = results[:request.top_n]
        return {"model": request.model, "results": results,
                "usage": {"total_tokens": inputs.input_ids.numel()}}
    except Exception as e:
        print(f"Error: {e}")
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=PORT)
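Before wiring the service up, it is worth seeing what compute_scores actually calculates: a two-way softmax over the logits of the "yes" and "no" tokens, i.e. P(yes) = exp(l_yes) / (exp(l_yes) + exp(l_no)). A torch-free sketch of that arithmetic:

```python
import math

def yes_probability(logit_no: float, logit_yes: float) -> float:
    """Two-way softmax, matching the log_softmax + exp in compute_scores."""
    m = max(logit_no, logit_yes)  # subtract the max for numerical stability
    e_no = math.exp(logit_no - m)
    e_yes = math.exp(logit_yes - m)
    return e_yes / (e_no + e_yes)

print(yes_probability(0.0, 0.0))    # equal logits -> 0.5
print(yes_probability(-2.0, 2.0))   # "yes" strongly favored -> score near 1
```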
Create pyproject.toml in the same directory:
nano pyproject.toml
[project]
name = "qwen3-reranker-service"
version = "0.1.0"
description = "Rerank service using Qwen3 and ROCm"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "transformers>=4.51.0",
    "fastapi",
    "uvicorn",
    "modelscope",
    "accelerate",
    "pydantic"
]
Note: torch is deliberately left out of the dependencies; the service uses the ROCm build of PyTorch already shipped with the image.
3.3 Build the Image
docker run -it --name builder --network=host -v /home/user/qwen_project:/tmp_build rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210 bash
pip install uv -i https://pypi.tuna.tsinghua.edu.cn/simple
mkdir -p /app
cp /tmp_build/rerank_service.py /app/
cp /tmp_build/pyproject.toml /app/
cd /app
uv pip install --system -r pyproject.toml -i https://pypi.tuna.tsinghua.edu.cn/simple
pip list | grep transformers
exit
docker commit builder qwen-rerank:v1
docker rm builder
docker save -o qwen-rerank-v1.tar qwen-rerank:v1
3.4 Start and Verify
docker run -d \
--name final_reranker \
--restart=always \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v /home/user/models/Qwen3-Reranker-8B:/model \
-e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
qwen-rerank:v1 \
python3 /app/rerank_service.py
curl http://localhost:8002/v1/rerank \
-H "Content-Type: application/json" \
-d '{ "model": "qwen-reranker", "query": "中国的首都在哪里?", "documents": [ "重力是万有引力。", "中国的首都是北京。", "香蕉很好吃。" ] }'