Building llama.cpp on Windows and Deploying a Qwen Model Locally: A Hands-On Guide


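When large models are put into practice, lightweight local deployment is popular with developers and AI enthusiasts because it offers low latency, strong privacy, and independence from cloud compute. This article focuses on 64-bit Windows 10/11. It walks through building llama.cpp step by step (in both CPU and GPU modes), shows how to download the Qwen-7B-Chat model in GGUF format from ModelScope, and ends with starting the model locally and standing up an API service.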

1. Prepare the build environment

First, open PowerShell or CMD as administrator, clone the llama.cpp source, and create a build directory:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
mkdir build
cd build
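Before running the steps above, it can help to confirm that the command-line tools they rely on are actually on PATH. The following is only a minimal sketch: the helper name missing_tools and the tool list (git and cmake for the build, pip for the model download later) are assumptions made for illustration, not something the original workflow prescribes.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed tool list: git and cmake for steps 1-2, pip for step 3.
    absent = missing_tools(["git", "cmake", "pip"])
    if absent:
        print("Not found on PATH:", ", ".join(absent))
    else:
        print("All required tools were found.")
```

If anything is reported missing, install it (or reopen the terminal so PATH is refreshed) before cloning.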

2. Configure and build

CPU-only mode

If you do not need GPU acceleration, do a basic build with the Visual Studio generator:

cmake .. -G "Visual Studio 17 2022" -A x64 -DLLAMA_CURL=OFF
cmake --build . --config Release

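GPU-accelerated mode

If the CUDA Toolkit is installed, enable GPU support to speed up inference. Recent llama.cpp versions name the switch GGML_CUDA; the older LLAMA_CUDA name is deprecated:

cmake .. -G "Visual Studio 17 2022" -A x64 -DLLAMA_CURL=OFF -DGGML_CUDA=ON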
cmake --build . --config Release

Note: Visual Studio 2022 (VS 17) is recommended; older versions may lack the C++ standard library support the build needs.
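The two build modes differ only in one configure switch, which is easy to see when the commands are assembled programmatically. The sketch below is illustrative: the function names are hypothetical, and it assumes the CUDA switch is GGML_CUDA (the name used by recent llama.cpp releases; older checkouts used LLAMA_CUDA).

```python
def cmake_configure_args(use_cuda=False, generator="Visual Studio 17 2022"):
    """Build the argument list for the configure step, run from the build directory."""
    args = ["cmake", "..", "-G", generator, "-A", "x64", "-DLLAMA_CURL=OFF"]
    if use_cuda:
        # Assumption: recent llama.cpp; older versions expected -DLLAMA_CUDA=ON.
        args.append("-DGGML_CUDA=ON")
    return args


def cmake_build_args(config="Release"):
    """Build the argument list for the compile step."""
    return ["cmake", "--build", ".", "--config", config]


if __name__ == "__main__":
    # Print both command lines; pass them to subprocess.run(..., check=True) to execute.
    print(cmake_configure_args(use_cuda=True))
    print(cmake_build_args())
```

Passing the lists to subprocess.run avoids shell quoting problems with the generator name, which contains spaces.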

3. Get the model file

Download a GGUF-format Qwen model from ModelScope (the 7B model is used as the example). Make sure the modelscope library is installed:

pip install modelscope
modelscope download --model Xorbits/Qwen-7B-Chat-GGUF

The downloaded model is usually saved under the \modelscope\hub\models\Xorbits directory; adjust the -m argument in the commands that follow to match the actual path.
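Because the exact download location varies between machines, it may be simpler to search for the file than to guess the path. The following sketch lists every .gguf file under a directory you pass on the command line; the helper name find_gguf_files is hypothetical, and the directory is deliberately left for you to supply.

```python
import sys
from pathlib import Path


def find_gguf_files(root):
    """Return all .gguf files under `root`, largest first."""
    root = Path(root)
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*.gguf") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)


if __name__ == "__main__":
    # Usage: python find_gguf.py <directory to search>
    search_root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_gguf_files(search_root):
        size_gib = path.stat().st_size / 1024 ** 3
        print(f"{size_gib:6.2f} GiB  {path}")
```

Any of the printed paths can be passed to llama-server.exe with -m.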

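4. Start the API service

Change into the bin\Release directory produced by the build and run llama-server.exe to start the HTTP service. Port 11433 is used throughout so that the server's OpenAI-compatible endpoints match the Python script later on.

CPU launch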

llama-server.exe -m qwen.gguf --host 127.0.0.1 --port 11433 -c 4096

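GPU-accelerated launch

llama-server.exe -m qwen-7b-chat.Q4_0.gguf --host 127.0.0.1 --port 11433 -c 4096 --n-gpu-layers 99

With --port 11433 the service listens on http://localhost:11433; without that flag llama-server uses its default port, 8080, so the GPU command needs it too. A layer count higher than the model's own (99 here) offloads all layers to the GPU, which avoids relying on how a given llama.cpp version interprets -1.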
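Loading a 7B model can take a while, so a client that connects too early will fail. A small readiness check, sketched below, polls the server's /health endpoint, which answers with HTTP 200 once the model is loaded. The function names and the retry defaults are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def is_server_ready(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503 while loading.
        return False


def wait_until_ready(base_url, attempts=30, delay=1.0):
    """Poll the health endpoint up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if is_server_ready(base_url):
            return True
        time.sleep(delay)
    return False
```

For example, calling wait_until_ready("http://localhost:11433") right after starting llama-server.exe returns True as soon as the model has finished loading, or False if the server never comes up within the retry budget.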

5. Test and call the API

Basic non-streaming call

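Use curl to quickly check that the service is reachable. The command is written for CMD, which needs the JSON body in escaped double quotes rather than single quotes; naming curl.exe explicitly also avoids the curl alias that Windows PowerShell maps to Invoke-WebRequest:

curl.exe http://localhost:11433/completion -H "Content-Type: application/json" -d "{\"prompt\": \"Hello, please introduce Tongyi Qianwen\", \"temperature\": 0.7, \"n_predict\": 512}"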
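The same check can be done from Python without worrying about shell quoting. This is a minimal sketch using only the standard library; the function names are hypothetical, n_predict is the native name of the length limit on the /completion endpoint, and the generated text is read from the "content" field of the response.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:11433",
                             temperature=0.7, n_predict=512):
    """Build a POST request for llama-server's /completion endpoint."""
    payload = {"prompt": prompt, "temperature": temperature, "n_predict": n_predict}
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=60, **kwargs):
    """Send the request and return the generated text from the "content" field."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["content"]
```

With the server running, print(complete("Hello, please introduce Tongyi Qianwen")) should print the model's reply. Encoding the body as UTF-8 in Python also sidesteps console code page issues with non-ASCII prompts.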

Python client integration

Tool function calling (Function Calling)

import json
import re
from datetime import datetime

import requests

# ====================== 1. Define the available tools ======================
def get_current_time():
    """Return the current local time."""
    current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return f"The current time is: {current_time}"

def calculate_add(a: float, b: float):
    """Return the sum of two numbers."""
    return f"{a} + {b} = {a + b}"

# Tool registry: maps each tool name to its function and description
tool_registry = {
    "get_current_time": {
        "function": get_current_time,
        "description": "Get the current local time; takes no parameters",
        "parameters": {},
    },
    "calculate_add": {
        "function": calculate_add,
        "description": "Add two numbers; requires parameters a and b",
        "parameters": {
            "a": {"type": "float", "required": True, "description": "First addend"},
            "b": {"type": "float", "required": True, "description": "Second addend"},
        },
    },
}

# ====================== 2. Conversation history and basic configuration ======================
chat_history = [{"role": "system", "content": """You are a helpful assistant. Remember the earlier conversation and answer based on context. You can call the following tools:
1. get_current_time: get the current local time
2. calculate_add: add two numbers (parameters a and b)
To call a tool, reply with only a JSON object of the form {"name": "<tool name>", "parameters": {...}}."""}]

url = "http://localhost:11433/chat/completions"
headers = {"Content-Type": "application/json"}

def clean_pad_content(content):
    """Strip the [PAD...] junk tokens the model sometimes returns."""
    return re.sub(r'\[PAD\d+\]', '', content).strip()

def parse_tool_call(content):
    """Extract a tool-call instruction from the model output, or return None."""
    json_match = re.search(r'\{[\s\S]*\}', content)
    if not json_match:
        return None
    try:
        tool_call = json.loads(json_match.group())
    except json.JSONDecodeError:
        return None
    if "name" in tool_call and "parameters" in tool_call:
        return tool_call
    return None

def execute_tool(tool_call):
    """Run the requested tool and return its result as text."""
    tool_name = tool_call["name"]
    parameters = tool_call.get("parameters") or {}
    if tool_name not in tool_registry:
        return f"Error: no tool named {tool_name}"
    tool_info = tool_registry[tool_name]
    tool_func = tool_info["function"]
    tool_params = tool_info["parameters"]
    try:
        # Convert parameter types inside the try so bad values are reported, not raised
        for param_name, param_info in tool_params.items():
            if param_name in parameters:
                param_type = param_info.get("type", "str")
                if param_type == "float":
                    parameters[param_name] = float(parameters[param_name])
                elif param_type == "int":
                    parameters[param_name] = int(parameters[param_name])
        result = tool_func(**parameters)
        return f"Tool call succeeded ({tool_name}): {result}"
    except Exception as e:
        return f"Error: {tool_name} failed - {e}"

# ====================== 3. Core chat function ======================
def chat_with_model(prompt):
    chat_history.append({"role": "user", "content": prompt})
    data = {
        "model": "qwen.gguf",
        "messages": chat_history,
        "temperature": 0.7,
        "max_tokens": 512,
        "stream": False,
        "stop": ["[PAD"],
    }
    try:
        response = requests.post(url, headers=headers, data=json.dumps(data), timeout=60)
        response.raise_for_status()
        result = response.json()
        if "choices" in result and len(result["choices"]) > 0:
            raw_answer = result["choices"][0]["message"]["content"]
            clean_answer = clean_pad_content(raw_answer)
        else:
            return "Unexpected response format"

        # Check whether the answer contains a tool-call instruction
        tool_call = parse_tool_call(clean_answer)
        if tool_call:
            print(f"📢 Tool call detected: {json.dumps(tool_call, ensure_ascii=False)}")
            tool_result = execute_tool(tool_call)
            print(f"🔧 Tool result: {tool_result}")
            # Record the model's tool request, then feed the result back as a user turn
            # so the roles keep alternating before the second request
            chat_history.append({"role": "assistant", "content": clean_answer})
            chat_history.append({"role": "user", "content": f"Tool result: {tool_result}\nAnswer the original question using this result."})
            # data["messages"] is chat_history itself, so the second request sees the new turns
            second_response = requests.post(url, headers=headers, data=json.dumps(data), timeout=60)
            second_response.raise_for_status()
            second_result = second_response.json()
            final_answer = clean_pad_content(second_result["choices"][0]["message"]["content"])
            chat_history.append({"role": "assistant", "content": final_answer})
            return final_answer
        else:
            chat_history.append({"role": "assistant", "content": clean_answer})
            return clean_answer
    except requests.exceptions.ConnectionError:
        return "Connection failed: check that the local service is running on port 11433"
    except requests.exceptions.Timeout:
        return "Request timed out: the model responded too slowly"
    except Exception as e:
        return f"Call failed: {e}"

# ====================== 4. Test entry point ======================
if __name__ == "__main__":
    print("Starting a multi-turn chat (type 'exit' to quit):")
    print("📌 Example prompts for testing tool calls:")
    print("  1. What time is it now?")
    print("  2. What is 123 + 456?")
    print("  3. My name is Li Si. What is my name?\n")
    while True:
        user_input = input("You: ")
        if user_input.strip() == "exit":
            break
        if not user_input.strip():
            print("Assistant: Please enter something valid!\n")
            continue
        answer = chat_with_model(user_input)
        print(f"Assistant: {answer}\n")
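The script keeps every turn in chat_history, while the server was started with a 4096-token context (-c 4096), so a long session will eventually exceed the window. One simple mitigation is to cap the history before each request. The sketch below is an assumption-laden illustration, not part of the original script: trim_history is a hypothetical helper, and it counts messages rather than tokens, which is only a rough stand-in for real token accounting.

```python
def trim_history(history, max_messages=12):
    """Keep system messages plus the most recent `max_messages` other messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    # Drop leading non-user messages so the kept window starts with a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


if __name__ == "__main__":
    demo = [{"role": "system", "content": "You are a helpful assistant."}]
    for i in range(1, 4):
        demo.append({"role": "user", "content": f"question {i}"})
        demo.append({"role": "assistant", "content": f"answer {i}"})
    for message in trim_history(demo, max_messages=3):
        print(message["role"], "-", message["content"])
```

In the script, this could be applied with chat_history[:] = trim_history(chat_history) at the start of chat_with_model; the slice assignment keeps the same list object, so the request payload still refers to it.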
