Deploying the Qwen1.5 Large Language Model on Windows
This article walks through the full process of deploying the Qwen1.5 large language model on Windows: GPU driver configuration, Anaconda installation, virtual environment creation, dependency installation, and model download. It covers loading the model with PyTorch to run a local demo, then building a RESTful API with streaming output on top of FastAPI and Uvicorn. It closes with troubleshooting suggestions to help developers get a local deployment and API service up quickly.

Anaconda is a Python distribution for scientific computing. It supports Linux, macOS, and Windows, and bundles many popular packages for scientific computing and data analysis. Conda is an open-source package and environment manager: it can install different versions of packages and their dependencies on the same machine and switch between environments.
Because a Conda environment can pin its own Python version, it is advisable to uninstall any globally installed Python first to avoid conflicts.
Visit the Anaconda website and click the installer download link. Sort releases by date to find the latest one and choose the Windows 64-bit installer.
Installation is mostly a matter of clicking Next. If the C: drive is short on space, select All Users and change the install location to another drive.
Return to the Path environment-variable dialog where the old Python entries were removed and add the Anaconda3 install directory.
After configuring the environment variables, restart the computer so they take effect.
Press Win + R, type cmd, and press Enter to open a command prompt, then run:
conda --version
You should see conda's version number.
Next, type python to enter the Python interactive shell:
python
You should likewise see the Python version number. Exit with Ctrl+Z followed by Enter, or by typing exit().
The official package source is hosted overseas, so downloads are slow and unreliable; switching to a domestic mirror is recommended.
First have Conda generate the .condarc file:
conda config --set show_channel_urls yes
Then open .condarc and change its contents to the following:
channels:
- defaults
show_channel_urls: true
default_channels:
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
deepmodeling: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/
Run conda clean -i to clear the index cache and ensure the mirror's index is used. Then create a new environment:
conda create --name llm python=3.11.5
conda info --envs
The newly created llm environment appears in the list.
Switch into the new environment:
conda activate llm
Taking Qwen1.5-0.5B-Chat as an example, use the download options ModelScope provides and choose the Git method:
git clone https://www.modelscope.cn/qwen/Qwen1.5-0.5B-Chat.git
Download the model into a directory of your choice, e.g. D:\大模型\Qwen1.5-0.5B-Chat.
First check the CUDA version. With the NVIDIA driver already installed, right-click the desktop, open 'NVIDIA Control Panel', click 'System Information' in the bottom-left corner, and switch to the 'Components' tab.
Visit the PyTorch website to find the matching install command. Here we pick PyTorch 2.1.0 with CUDA 12.1, close to the version actually on the machine:
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
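As a rule of thumb, the CUDA version a PyTorch build targets should not exceed what the installed driver supports. This hypothetical helper (not part of any library) sketches that comparison:

```python
def cuda_compatible(driver_cuda: str, wheel_cuda: str) -> bool:
    """True if the driver's supported CUDA version covers the build's toolkit version."""
    def parse(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))
    return parse(driver_cuda) >= parse(wheel_cuda)


# A driver reporting CUDA 12.2 can run the cu121 PyTorch build:
print(cuda_compatible("12.2", "12.1"))  # True
print(cuda_compatible("11.8", "12.1"))  # False
```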
Next, install the transformers library:
conda install conda-forge::transformers
Download and install VSCode from its official website.
The autoDocstring - Python Docstring Generator and autopep8 extensions are recommended; they improve code quality and efficiency when writing Python in VSCode.
Open the model's parent folder (e.g. D:\大模型) in VSCode and select the llm environment as the Python interpreter, so code runs in that environment. Following the official documentation, create a qwen.py file in the model's parent directory and write the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

# Now you do not need to add "trust_remote_code=True"
model = AutoModelForCausalLM.from_pretrained(
    "Qwen1.5-0.5B-Chat",  # change this to your local model path
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen1.5-0.5B-Chat")  # change this to your local model path

# Instead of using model.chat(), we directly use model.generate()
# But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
# Ask the question in Chinese
prompt = "给我简单介绍一下大型语言模型。"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Directly use generate() and tokenizer.decode() to get the output.
# Use `max_new_tokens` to control the maximum output length.
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Print the assistant's reply
print(response)
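For reference, apply_chat_template renders the messages in the ChatML convention that Qwen1.5 was trained on. A rough standalone sketch of what the rendered prompt looks like (the authoritative template ships with the tokenizer; this is only an approximation):

```python
def build_chatml_prompt(messages):
    # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers;
    # a trailing assistant header asks the model to continue from there.
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
             for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "给我简单介绍一下大型语言模型。"},
]
print(build_chatml_prompt(messages))
```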
Note: adjust the model paths in the code to point at your download location.
Run the script:
python qwen.py
If a dependency turns out to be missing, install it:
conda install conda-forge::accelerate
Run it again and the model's reply appears.
For streaming output, change the code as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread

device = "cuda"  # the device to load the model onto

# Now you do not need to add "trust_remote_code=True"
model = AutoModelForCausalLM.from_pretrained(
    "Qwen1.5-0.5B-Chat",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen1.5-0.5B-Chat")

# Instead of using model.chat(), we directly use model.generate()
# But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
# Ask the question in Chinese
prompt = "给我简单介绍一下大型语言模型。"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Run generate() in a background thread and consume tokens from the streamer.
# Use `max_new_tokens` to control the maximum output length.
streamer = TextIteratorStreamer(
    tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(model_inputs, streamer=streamer, max_new_tokens=512)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

generated_text = ""
for new_text in streamer:
    generated_text += new_text
    print(new_text, end="", flush=True)  # print each new fragment as it arrives
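Under the hood, TextIteratorStreamer is a producer/consumer hand-off between the generation thread and the main thread. A minimal stdlib-only sketch of the same pattern, with fake_generate standing in for model.generate:

```python
import queue
import threading

_END = object()  # sentinel marking end of generation


def fake_generate(out_queue: queue.Queue) -> None:
    # Stand-in for model.generate(): pushes text fragments as they are "decoded".
    for token in ["Large ", "language ", "models ", "predict text."]:
        out_queue.put(token)
    out_queue.put(_END)


q: queue.Queue = queue.Queue()
threading.Thread(target=fake_generate, args=(q,)).start()

generated_text = ""
while (chunk := q.get()) is not _END:
    generated_text += chunk
    print(chunk, end="", flush=True)
```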
FastAPI and Uvicorn provide the API layer. Install them:
conda install fastapi uvicorn
Create a web.py file and write the following code:
import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from argparse import ArgumentParser

app = FastAPI()

# Enable CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=['*'],
    allow_headers=['*'],
)


@app.get("/")
async def index():
    return {"message": "Hello World"}


def _get_args():
    parser = ArgumentParser()
    parser.add_argument('--server-port',
                        type=int,
                        default=8000,
                        help='Demo server port.')
    parser.add_argument('--server-name',
                        type=str,
                        default='127.0.0.1',
                        help='Demo server name. Default: 127.0.0.1, which is only visible from the local computer.'
                        ' If you want other computers to access your server, use 0.0.0.0 instead.',
                        )
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = _get_args()
    uvicorn.run(app, host=args.server_name, port=args.server_port, workers=1)
Run web.py:
python web.py
Request the endpoint and you can see it returns Hello World.
Now fold in the earlier Qwen code:
from contextlib import asynccontextmanager
import torch
import uvicorn
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from argparse import ArgumentParser
from typing import List, Literal, Optional, Union
from pydantic import BaseModel, Field


@asynccontextmanager
async def lifespan(app: FastAPI):  # collects GPU memory
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


app = FastAPI(lifespan=lifespan)

# Enable CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=['*'],
    allow_headers=['*'],
)


class ChatMessage(BaseModel):
    role: Literal['user', 'assistant', 'system']
    content: Optional[str]


class DeltaMessage(BaseModel):
    role: Optional[Literal['user', 'assistant', 'system']] = None
    content: Optional[str] = None


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    stream: Optional[bool] = False


class ChatCompletionResponseChoice(BaseModel):
    index: int
    message: ChatMessage
    finish_reason: Literal['stop', 'length']


class ChatCompletionResponseStreamChoice(BaseModel):
    index: int
    delta: DeltaMessage
    finish_reason: Optional[Literal['stop', 'length']]


class ChatCompletionResponse(BaseModel):
    model: str
    object: Literal['chat.completion', 'chat.completion.chunk']
    choices: List[Union[ChatCompletionResponseChoice,
                        ChatCompletionResponseStreamChoice]]
    created: Optional[int] = Field(default_factory=lambda: int(time.time()))


@app.get("/")
async def index():
    return {"message": "Hello World"}


@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
    global model, tokenizer
    # Basic request validation
    if request.messages[-1].role != "user":
        raise HTTPException(status_code=400, detail="Invalid request")
    text = tokenizer.apply_chat_template(
        request.messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True)[0]
    choice_data = ChatCompletionResponseChoice(
        index=0,
        message=ChatMessage(role="assistant", content=response),
        finish_reason="stop"
    )
    return ChatCompletionResponse(model=request.model, choices=[choice_data], object="chat.completion")


def _get_args():
    parser = ArgumentParser()
    parser.add_argument(
        '-c',
        '--checkpoint-path',
        type=str,
        default='Qwen1.5-0.5B-Chat',
        help='Checkpoint name or path, default to %(default)r',
    )
    parser.add_argument('--server-port',
                        type=int,
                        default=8000,
                        help='Demo server port.')
    parser.add_argument('--server-name',
                        type=str,
                        default='127.0.0.1',
                        help='Demo server name. Default: 127.0.0.1, which is only visible from the local computer.'
                        ' If you want other computers to access your server, use 0.0.0.0 instead.',
                        )
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = _get_args()
    # Now you do not need to add "trust_remote_code=True"
    model = AutoModelForCausalLM.from_pretrained(
        args.checkpoint_path,
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(args.checkpoint_path)
    uvicorn.run(app, host=args.server_name, port=args.server_port, workers=1)
Call the endpoint and you can see it now returns the model's reply.
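For reference, the request body follows the OpenAI chat-completions shape. A sketch of a valid non-streaming payload (the model field is just a label echoed back in the response; the server uses whatever --checkpoint-path loaded):

```python
import json

payload = {
    "model": "Qwen1.5-0.5B-Chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "给我简单介绍一下大型语言模型。"},
    ],
    "stream": False,
}
# POST this as the JSON body to http://127.0.0.1:8000/v1/chat/completions
body = json.dumps(payload, ensure_ascii=False)
print(body)
```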
The sse_starlette library is needed to support streaming responses:
pip install sse_starlette
After installing it, modify the code again so that the stream parameter decides whether the response is streamed:
from contextlib import asynccontextmanager
from threading import Thread
import torch
import uvicorn
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, BatchEncoding
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from argparse import ArgumentParser
from typing import List, Literal, Optional, Union
from pydantic import BaseModel, Field
from sse_starlette.sse import EventSourceResponse


@asynccontextmanager
async def lifespan(app: FastAPI):  # collects GPU memory
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


app = FastAPI(lifespan=lifespan)

# Enable CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=['*'],
    allow_headers=['*'],
)


class ChatMessage(BaseModel):
    role: Literal['user', 'assistant', 'system']
    content: Optional[str]


class DeltaMessage(BaseModel):
    role: Optional[Literal['user', 'assistant', 'system']] = None
    content: Optional[str] = None


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    stream: Optional[bool] = False


class ChatCompletionResponseChoice(BaseModel):
    index: int
    message: ChatMessage
    finish_reason: Literal['stop', 'length']


class ChatCompletionResponseStreamChoice(BaseModel):
    index: int
    delta: DeltaMessage
    finish_reason: Optional[Literal['stop', 'length']]


class ChatCompletionResponse(BaseModel):
    model: str
    object: Literal['chat.completion', 'chat.completion.chunk']
    choices: List[Union[ChatCompletionResponseChoice,
                        ChatCompletionResponseStreamChoice]]
    created: Optional[int] = Field(default_factory=lambda: int(time.time()))


@app.get("/")
async def index():
    return {"message": "Hello World"}


@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
    global model, tokenizer
    # Basic request validation
    if request.messages[-1].role != "user":
        raise HTTPException(status_code=400, detail="Invalid request")
    text = tokenizer.apply_chat_template(
        request.messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    if request.stream:
        generate = predict(model_inputs, request.model)
        return EventSourceResponse(generate, media_type="text/event-stream")
    # Directly use generate() and tokenizer.decode() to get the output.
    # Use `max_new_tokens` to control the maximum output length.
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True)[0]
    choice_data = ChatCompletionResponseChoice(
        index=0,
        message=ChatMessage(role="assistant", content=response),
        finish_reason="stop"
    )
    return ChatCompletionResponse(model=request.model, choices=[choice_data], object="chat.completion")


async def predict(model_inputs: BatchEncoding, model_id: str):
    global model, tokenizer
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(
        model_inputs, streamer=streamer, max_new_tokens=512)
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    # First chunk announces the assistant role
    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(role="assistant"),
        finish_reason=None
    )
    chunk = ChatCompletionResponse(model=model_id, choices=[
        choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    thread.start()
    for new_text in streamer:
        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=DeltaMessage(content=new_text),
            finish_reason=None
        )
        chunk = ChatCompletionResponse(model=model_id, choices=[
            choice_data], object="chat.completion.chunk")
        yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    # Final chunk marks the end of generation
    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(),
        finish_reason="stop"
    )
    chunk = ChatCompletionResponse(model=model_id, choices=[
        choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


def _get_args():
    parser = ArgumentParser()
    parser.add_argument(
        '-c',
        '--checkpoint-path',
        type=str,
        default='Qwen1.5-0.5B-Chat',
        help='Checkpoint name or path, default to %(default)r',
    )
    parser.add_argument('--server-port',
                        type=int,
                        default=8000,
                        help='Demo server port.')
    parser.add_argument('--server-name',
                        type=str,
                        default='127.0.0.1',
                        help='Demo server name. Default: 127.0.0.1, which is only visible from the local computer.'
                        ' If you want other computers to access your server, use 0.0.0.0 instead.',
                        )
    args = parser.parse_args()
    return args


if __name__ == '__main__':
    args = _get_args()
    # Now you do not need to add "trust_remote_code=True"
    model = AutoModelForCausalLM.from_pretrained(
        args.checkpoint_path,
        torch_dtype="auto",
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(args.checkpoint_path)
    uvicorn.run(app, host=args.server_name, port=args.server_port, workers=1)
Call the endpoint again and streaming responses now work as well.
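On the client side, each SSE event's data field carries one chat.completion.chunk JSON, and the stream ends with [DONE]. A hypothetical stdlib-only sketch of reassembling the reply from those data strings:

```python
import json


def collect_stream(events):
    """Concatenate delta.content from chunk events until the [DONE] sentinel."""
    text = ""
    for data in events:
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        text += delta.get("content", "")
    return text


# Event payloads shaped like the server's chunks (exclude_unset drops empty fields)
events = [
    '{"model": "Qwen1.5-0.5B-Chat", "choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    '{"model": "Qwen1.5-0.5B-Chat", "choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    '{"model": "Qwen1.5-0.5B-Chat", "choices": [{"index": 0, "delta": {"content": " world"}}]}',
    '{"model": "Qwen1.5-0.5B-Chat", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    "[DONE]",
]
print(collect_stream(events))  # Hello world
```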
A few troubleshooting tips: if GPU memory runs out, reduce the batch size or load a quantized model (e.g. INT4); run conda list to inspect the packages in the current environment; if the port is occupied, pass --server-port to choose another. With the steps above complete, Qwen1.5 runs locally on Windows and is available as an API service.
