Building llama.cpp on Windows and Deploying Qwen Models Locally | 极客日志


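AI-generated summary: This article covers the full process of building the llama.cpp tools on Windows and deploying a Qwen large language model. It includes cloning the source with Git, configuring CPU or GPU-accelerated builds with CMake, downloading the GGUF-format Qwen-7B-Chat model from ModelScope, and starting a local API service. It also provides several Python examples covering basic text generation, multi-turn conversation, context memory, and function tool calling, to help developers integrate local AI into their applications quickly.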

星云 · published 2026/4/6 · updated 2026/5/2230 views

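Overview

When putting large models into practice, lightweight local deployment is in demand among developers and AI enthusiasts because it offers low latency, strong privacy, and no dependence on cloud compute. This article targets 64-bit Windows 10/11. It walks through building the llama.cpp tools (CPU and GPU modes are both supported; GPU acceleration requires NVIDIA CUDA), shows how to download the GGUF-format Qwen-7B-Chat model, and finishes with starting the model locally and setting up an API service.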

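1. Clone the code

Open PowerShell or CMD with administrator privileges and run the following commands to clone the code and create a build directory inside the repository:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
mkdir build
cd build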

2. Basic build

CPU only

cmake .. -G "Visual Studio 18 2026" -A x64 -DLLAMA_CURL=OFF
cmake --build . --config Release

GPU-accelerated build (CUDA Toolkit installed)

Add -DGGML_CUDA=ON to enable GPU support (older llama.cpp versions used -DLLAMA_CUDA=ON):

cmake .. -G "Visual Studio 18 2026" -A x64 -DGGML_CUDA=ON
cmake --build . --config Release
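
After either build finishes, it can help to confirm that the executables used in the later steps actually exist. The sketch below assumes the Visual Studio generator places them under build\bin\Release, which is the usual layout but may differ between versions; the helper name find_missing_binaries is hypothetical.

```python
from pathlib import Path

# Executables this article runs in the "start the service" step.
EXPECTED_BINARIES = ("llama-cli.exe", "llama-server.exe")


def find_missing_binaries(bin_dir, expected=EXPECTED_BINARIES):
    """Return the names from `expected` that are not files inside bin_dir."""
    bin_dir = Path(bin_dir)
    return [name for name in expected if not (bin_dir / name).is_file()]


# Assumed output directory of a Release build, relative to the repository root.
release_dir = Path("build") / "bin" / "Release"
missing = find_missing_binaries(release_dir)
if missing:
    print(f"Missing in {release_dir}: {', '.join(missing)}")
else:
    print(f"All expected binaries found in {release_dir}")
```

Run it from the repository root. An empty list means both tools were built; otherwise, check the build log or adjust release_dir.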

3. Download the model

Download a GGUF-format Qwen model (the 7B model is used as the example):

pip install modelscope
modelscope download --model Xorbits/Qwen-7B-Chat-GGUF

After the download, the files are usually saved under \modelscope\hub\models\Xorbits.
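
Because the exact download location varies by machine, one way to find the model file is to search a directory tree for .gguf files. This is a minimal sketch; the find_gguf_files helper is illustrative and not part of ModelScope, and the search root is whatever directory you pass in.

```python
from pathlib import Path


def find_gguf_files(root):
    """Return every .gguf file under root (searched recursively), sorted by path."""
    return sorted(p for p in Path(root).rglob("*.gguf") if p.is_file())


def describe(path):
    """Format one model file as 'path (size in GiB)'."""
    size_gib = path.stat().st_size / 1024 ** 3
    return f"{path} ({size_gib:.2f} GiB)"
```

For example, `for p in find_gguf_files(cache_dir): print(describe(p))` lists the candidates, where cache_dir is your ModelScope cache directory. The chosen path can then be passed to the -m option when starting the model.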

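4. Start the service

Run the model and start the API service (callable over HTTP). Keep the port the same on the server and in every client; this example uses 11433.

CPU version

Note that llama-cli starts an interactive command-line chat rather than an HTTP service. For API access on CPU, use the llama-server command shown under the GPU version without --n-gpu-layers.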

chcp 65001
llama-cli.exe -m qwen.gguf -i -c 4096

GPU-accelerated version

llama-server.exe -m qwen-7b-chat.Q4_0.gguf -c 4096 --n-gpu-layers -1 --host 127.0.0.1 --port 11433
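
Loading a 7B model can take a while, so client scripts may want to wait until the server is ready before sending requests. The sketch below polls the server's /health endpoint, which llama-server is expected to answer with HTTP 200 once the model is loaded; the function name wait_until_ready and the timeout values are assumptions for illustration.

```python
import time
import urllib.request


def wait_until_ready(base_url, timeout_s=120.0, interval_s=1.0):
    """Poll GET {base_url}/health until it returns HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and non-2xx replies
            # (urllib raises HTTPError, an OSError subclass, for those).
            pass
        time.sleep(interval_s)
    return False
```

For example, `wait_until_ready("http://127.0.0.1:11433")` returns True as soon as the service answers, or False after two minutes.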

5. Test the call

Once started, the service listens on http://localhost:11433. You can test it with curl.

curl http://localhost:11433/completion -H "Content-Type: application/json" -d '{ "prompt": "Hello, please introduce Tongyi Qianwen", "temperature": 0.7, "max_tokens": 512 }'


6. Tool tests

Basic non-streaming call (completion endpoint)

import requests
import json

url = "http://localhost:11433/completion"
headers = {"Content-Type": "application/json"}
data = {
    "model": "qwen.gguf",
    "prompt": "Hello, please introduce Tongyi Qianwen (Qwen) in about 100 words",
    "temperature": 0.7,
    "max_tokens": 512,
    "ctx_size": 4096,  # the context size is set when the server starts (-c 4096)
    "stop": ["<|im_end|>"]  # Qwen's end-of-turn marker
}
try:
    response = requests.post(url, headers=headers, data=json.dumps(data), timeout=60)
    response.raise_for_status()
    result = response.json()
    print("Generated result:")
    print(result["content"])
except Exception as e:
    print(f"Request failed: {e}")
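
The call above waits for the whole answer. As a complement, the sketch below shows one way to consume a streamed response. It assumes that sending "stream": true makes the server reply with server-sent-event lines of the form `data: {...}`, each carrying a "content" piece and a "stop" flag; treat those field names as assumptions to verify against your server version. The helper name iter_stream_content is hypothetical.

```python
import json


def iter_stream_content(lines):
    """Yield text pieces from SSE-style lines of the form 'data: {...}'."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        piece = chunk.get("content", "")
        if piece:
            yield piece
        if chunk.get("stop"):
            break  # the server marked this chunk as the last one


# Usage sketch with requests (needs a running server):
# with requests.post(url, json={"prompt": "Hello", "stream": True}, stream=True) as r:
#     for piece in iter_stream_content(r.iter_lines()):
#         print(piece, end="", flush=True)
```

Keeping the parser separate from the HTTP call makes it easy to test with canned lines before pointing it at a live server.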

Multi-turn conversation example (based on chat/completions)

import requests
import json

# The full history is sent with every request so the model sees earlier turns.
chat_history = []
url = "http://localhost:11433/chat/completions"
headers = {"Content-Type": "application/json"}

def chat_with_model(prompt):
    chat_history.append({"role": "user", "content": prompt})
    data = {
        "model": "qwen.gguf",
        "messages": chat_history,
        "temperature": 0.7,
        "max_tokens": 512
    }
    try:
        response = requests.post(url, headers=headers, data=json.dumps(data), timeout=60)
        response.raise_for_status()
        result = response.json()
        answer = result["choices"][0]["message"]["content"]
        chat_history.append({"role": "assistant", "content": answer})
        return answer
    except Exception as e:
        # Drop the unanswered user turn so a failed call does not pollute the history.
        chat_history.pop()
        return f"Request failed: {e}"

print("Starting multi-turn chat (type 'exit' to quit):")
while True:
    user_input = input("You: ")
    if user_input == "exit":
        break
    answer = chat_with_model(user_input)
    print(f"Assistant: {answer}\n")

Conversation memory test

import requests
import json
import re

# The system message asks the model to rely on earlier turns.
chat_history = [{"role": "system", "content": "You are a helpful assistant. You must remember the earlier conversation and answer the user's questions based on context."}]
url = "http://localhost:11433/chat/completions"
headers = {"Content-Type": "application/json"}

def clean_pad_content(content):
    # Strip padding tokens such as [PAD123] that can leak into the output.
    return re.sub(r'\[PAD\d+\]', '', content).strip()

def chat_with_model(prompt):
    chat_history.append({"role": "user", "content": prompt})
    data = {
        "model": "qwen.gguf",
        "messages": chat_history,
        "temperature": 0.7,
        "max_tokens": 512,
        "stream": False,
        "stop": ["[PAD]"]
    }
    try:
        response = requests.post(url, headers=headers, data=json.dumps(data), timeout=60)
        response.raise_for_status()
        result = response.json()
        if "choices" in result and len(result["choices"]) > 0:
            choice = result["choices"][0]
            if "message" in choice and "content" in choice["message"]:
                raw_answer = choice["message"]["content"]
                answer = clean_pad_content(raw_answer)
                chat_history.append({"role": "assistant", "content": answer})
                return answer
        return "Unexpected response format"
    except requests.exceptions.ConnectionError:
        return "Connection failed: check that the local service is running on port 11433"
    except requests.exceptions.Timeout:
        return "Request timed out: the model responded too slowly"
    except Exception as e:
        return f"Request failed: {str(e)}"

print("Starting multi-turn chat (type 'exit' to quit):")
while True:
    user_input = input("You: ")
    if user_input.strip() == "exit":
        break
    if not user_input.strip():
        print("Assistant: please enter some text!\n")
        continue
    answer = chat_with_model(user_input)
    print(f"Assistant: {answer}\n")
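
Because the whole history is resent on every turn, a long conversation will eventually exceed the context window set with -c 4096. One possible mitigation is to trim old turns before each request. The sketch below uses character counts as a rough stand-in for tokens; the trim_history name and the 6000-character budget are illustrative assumptions, not values from llama.cpp.

```python
def trim_history(messages, max_chars=6000):
    """Keep system messages plus the newest turns that fit within max_chars."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)
    kept = []
    for message in reversed(others):  # walk from newest to oldest
        cost = len(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()
    # Drop a leading assistant reply whose user question was cut off.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return system + kept
```

In the script above, this would mean sending `trim_history(chat_history)` as the "messages" value. Facts mentioned only in trimmed turns are then forgotten, so the budget is a trade-off between memory and context size.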

Function tool calling test

import requests
import json
import re
from datetime import datetime

# Tools the model is allowed to call
def get_current_time():
    current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return f"The current time is: {current_time}"

def calculate_add(a: float, b: float):
    return f"{a} + {b} = {a + b}"

# Maps each tool name to its function, description, and parameter schema
tool_registry = {
    "get_current_time": {
        "function": get_current_time,
        "description": "Get the current local time; takes no parameters",
        "parameters": {}
    },
    "calculate_add": {
        "function": calculate_add,
        "description": "Add two numbers; needs two parameters: a (number) and b (number)",
        "parameters": {
            "a": {"type": "float", "required": True, "description": "First addend"},
            "b": {"type": "float", "required": True, "description": "Second addend"}
        }
    }
}

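# The JSON example is written in a single-quoted literal so its double quotes do not end the string.
chat_history = [{
    "role": "system",
    "content": (
        "You are a helpful assistant. You must remember the earlier conversation and answer the user's questions based on context. "
        "You can call the following tools to help you answer: "
        "1. get_current_time: get the current local time, no parameters. "
        "2. calculate_add: add two numbers, needs parameters a and b (both numbers). "
        "If you need to call a tool, reply strictly in the following JSON format (return only the JSON, nothing else): "
        '{"name": "tool name", "parameters": {"parameter name": parameter value}}. '
        "If no tool is needed, answer the user directly and do not return JSON."
    )
}]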

url = "http://localhost:11433/chat/completions"
headers = {"Content-Type": "application/json"}

def clean_pad_content(content):
    # Strip padding tokens such as [PAD123] that can leak into the output.
    return re.sub(r'\[PAD\d+\]', '', content).strip()

def parse_tool_call(content):
    # Look for a JSON object with "name" and "parameters"; return None otherwise.
    try:
        json_match = re.search(r'\{[\s\S]*\}', content)
        if not json_match:
            return None
        tool_call = json.loads(json_match.group())
        if "name" in tool_call and "parameters" in tool_call:
            return tool_call
        return None
    except Exception:
        return None

def execute_tool(tool_call):
    tool_name = tool_call["name"]
    parameters = tool_call.get("parameters", {})
    if tool_name not in tool_registry:
        return f"Error: no tool named {tool_name}; available tools: {list(tool_registry.keys())}"
    tool_info = tool_registry[tool_name]
    tool_func = tool_info["function"]
    tool_params = tool_info["parameters"]
    # Check that every required parameter was supplied.
    missing_params = []
    for param_name, param_info in tool_params.items():
        if param_info.get("required") and param_name not in parameters:
            missing_params.append(param_name)
    if missing_params:
        return f"Error: call to {tool_name} is missing required parameters: {', '.join(missing_params)}"
    # Convert supplied values to the declared types.
    try:
        for param_name, param_info in tool_params.items():
            if param_name in parameters:
                param_type = param_info.get("type", "str")
                if param_type == "float":
                    parameters[param_name] = float(parameters[param_name])
                elif param_type == "int":
                    parameters[param_name] = int(parameters[param_name])
    except (ValueError, TypeError) as e:
        return f"Error: parameter type conversion failed - {str(e)}"
    try:
        result = tool_func(**parameters)
        return f"Tool call succeeded ({tool_name}): {result}"
    except Exception as e:
        return f"Error: running {tool_name} failed - {str(e)}"

def chat_with_model(prompt):
    chat_history.append({"role": "user", "content": prompt})
    data = {
        "model": "qwen.gguf",
        "messages": chat_history,
        "temperature": 0.7,
        "max_tokens": 512,
        "stream": False,
        "stop": ["[PAD]"]
    }
    try:
        response = requests.post(url, headers=headers, data=json.dumps(data), timeout=60)
        response.raise_for_status()
        result = response.json()
        if "choices" in result and len(result["choices"]) > 0 and "message" in result["choices"][0]:
            raw_answer = result["choices"][0]["message"]["content"]
            clean_answer = clean_pad_content(raw_answer)
        else:
            return "Unexpected response format"

        tool_call = parse_tool_call(clean_answer)
        if not tool_call:
            chat_history.append({"role": "assistant", "content": clean_answer})
            return clean_answer

        print(f"Tool call detected: {json.dumps(tool_call, ensure_ascii=False)}")
        tool_result = execute_tool(tool_call)
        print(f"Tool result: {tool_result}")
        # Record the model's tool request, then hand the result back as a new user turn
        # so the follow-up request ends on a turn the model is expected to answer.
        chat_history.append({"role": "assistant", "content": clean_answer})
        chat_history.append({"role": "user", "content": f"Tool result: {tool_result}. Use it to answer the previous question in plain text."})
        # data["messages"] is the same list object, so it already contains the new turns.
        second_response = requests.post(url, headers=headers, data=json.dumps(data), timeout=60)
        second_response.raise_for_status()
        second_result = second_response.json()
        if "choices" in second_result and len(second_result["choices"]) > 0 and "message" in second_result["choices"][0]:
            final_answer = clean_pad_content(second_result["choices"][0]["message"]["content"])
            chat_history.append({"role": "assistant", "content": final_answer})
            return final_answer
        return "Unexpected response format"
    except requests.exceptions.ConnectionError:
        return "Connection failed: check that the local service is running on port 11433"
    except requests.exceptions.Timeout:
        return "Request timed out: the model responded too slowly"
    except Exception as e:
        return f"Request failed: {str(e)}"

if __name__ == "__main__":
    print("Starting multi-turn chat (type 'exit' to quit):")
    print("Sample prompts for testing tool calls:")
    print("1. What time is it now? (calls the time tool)")
    print("2. What is 123+456? (calls the addition tool)")
    print("3. My name is Li Si. What is my name? (tests context memory)\n")
    while True:
        user_input = input("You: ")
        if user_input.strip() == "exit":
            break
        if not user_input.strip():
            print("Assistant: please enter some text!\n")
            continue
        answer = chat_with_model(user_input)
        print(f"Assistant: {answer}\n")
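
The regular expression in parse_tool_call is greedy: it captures everything from the first "{" to the last "}", so a reply that contains extra braces after the JSON fails to parse. A possible alternative, sketched below, scans with the standard library's json.JSONDecoder.raw_decode and returns the first object that has both keys. The function name extract_tool_call is hypothetical.

```python
import json

_decoder = json.JSONDecoder()


def extract_tool_call(text):
    """Return the first JSON object in text that has 'name' and 'parameters'."""
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _end = _decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "name" in obj and "parameters" in obj:
            return obj
        start = text.find("{", start + 1)  # try the next opening brace
    return None
```

It could replace parse_tool_call in the script above without other changes, since it returns the same shape: a dict on success and None otherwise.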
