Deploying the Qwen3-14B Reasoning Distill GGUF Model with llama.cpp

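AI-generated summary: This article describes how to deploy the Qwen3-14B Reasoning Distill GGUF model with llama.cpp. It compares Ollama with llama.cpp, then walks through building, configuring, and running llama.cpp in the background. It covers VRAM requirements, context length, turning thinking mode on and off, and a tool-calling test, and closes with advice on stopping and managing the service.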

邪神洛基, published 2026/4/5, updated 2026/5/2328 views

Model: Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF

VRAM: 21~25 GB; max-model-len: 40960; concurrency: 4

Deployment server: DGX-Spark-GB10, 120 GB; generation speed: 13 tokens/s

There are three ways to deploy a GGUF model

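Item	Ollama	llama.cpp	LM Studio/OpenWebUI
Ease of use	⭐ Easiest	⭐⭐⭐ Requires compiling	⭐ Graphical interface
Inference performance	🔶 Medium	🥇 Best	🔶 Medium
GPU control	Limited	Full control	Limited
API service	Works out of the box	Must be started manually	Built in
Best suited for	Quick deployment/production	Performance tuning/research	Local experimentation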

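Method 1: Ollama

Prerequisite: Ollama is already installed.

Step 1: Download the model from Hugging Face or ModelScope

# git-lfs must be installed so that the .gguf files are actually downloaded
git clone https://huggingface.co/TeichAI/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF

Step 2: Edit the Modelfile so that it uses the Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf model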

FROM ./Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0

Step 3: Create the Ollama model

ollama create qwen3-claude-distill -f Modelfile

Step 4: Test

Note: the model's thinking template has some problems and needs to be modified.

The Ollama API listens on port 11434 by default, so it can be queried directly with curl:

curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-claude-distill",
    "messages": [
      {"role": "user", "content": "Hello, please introduce yourself"}
    ],
    "stream": false
  }'

Method 2: llama.cpp

Step 1: Download llama.cpp

git clone https://github.com/ggerganov/llama.cpp

Step 2: Build with GPU support

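cd ./llama.cpp
# Options:
#   -DGGML_CUDA=ON              enable the GPU (CUDA) backend
#   -DLLAMA_BUILD_SERVER=ON     force llama-server to be built
#   -DCMAKE_BUILD_TYPE=Release  optimized build, better performance
cmake -B build \
  -DGGML_CUDA=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 8
# The binaries end up in ./build/bin/
# If the build fails, delete the build directory and start again:
#   rm -rf build
./build/bin/llama-server --help # verify that the build succeeded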

Step 3: Serve the model (using the GGUF file downloaded earlier)

Minimal command:

./build/bin/llama-server \
  -m /home/admin/models/huggingface/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF/Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf \
  -ngl 999 \
  -c 40960 \
  --host 0.0.0.0 \
  --port 8908

Running in the background:

nohup ./build/bin/llama-server \
  -m /home/admin/models/huggingface/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF/Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf \
  -ngl 999 \
  --batch-size 1024 \
  --threads 16 \
  --parallel 4 \
  --jinja \
  --reasoning-format deepseek \
  --reasoning-budget -1 \
  -c 40960 \
  --host 0.0.0.0 \
  --port 8908 \
  >> /home/admin/models/logs/llama_Qwen3-14B_Distill.log 2>&1 &
tail -f ~/models/logs/llama_Qwen3-14B_Distill.log # follow the log

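Parameters:

-ngl / --n-gpu-layers: how many transformer layers are offloaded to the GPU
  0    everything on the CPU
  20   the first 20 layers on the GPU
  999  as many layers as possible on the GPU
-c 40960: short for --ctx-size, the context length (maximum number of tokens)
--host 0.0.0.0: the listen address; 0.0.0.0 makes the server reachable from the local network
--port 8908: the HTTP listen port
--threads 16: number of CPU threads. If the machine has only 16 cores, setting this higher makes threads compete with each other and performance drops
--batch-size 1024: maximum number of tokens processed in one step
--parallel 4: how many requests can be processed at the same time (number of concurrent sessions)
--reasoning-format deepseek: the format in which the model's thinking is returned
--reasoning-budget N: controls thinking mode
  Value  Meaning
  -1     unlimited thinking (default, thinking enabled)
  0      thinking mode disabled
  >0     limits the number of thinking tokens (supported by some models only)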
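Important note about the 40K context

For Qwen3-14B q8_0, as rough estimates:

Model weights ≈ 15~16 GB
A 40K KV cache may take 10 GB or more
Total VRAM may exceed 28 GB. On a GPU with only 24 GB this can run out of memory.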
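The KV cache figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes that Qwen3-14B has 40 layers, 8 KV heads and a head dimension of 128, and that the cache is stored in 16-bit precision (2 bytes per value). None of these numbers is stated in this article, so check them against the model's metadata. The function name kv_cache_mib is hypothetical.

```shell
# Rough KV cache size in MiB:
#   2 (K and V) x layers x kv_heads x head_dim x bytes_per_value x context_tokens
# Ignores compute buffers and other runtime overhead.
kv_cache_mib() {
  local layers="$1" kv_heads="$2" head_dim="$3" bytes="$4" ctx="$5"
  echo $(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1024 / 1024 ))
}

# ASSUMED Qwen3-14B values: 40 layers, 8 KV heads, head_dim 128, 16-bit cache.
kv_cache_mib 40 8 128 2 40960   # prints 6400
```

Under these assumptions the cache is about 6.25 GiB, so weights plus cache come to roughly 22~23 GB before overhead, which is in line with the 21~25 GB reported at the top of this article. The 10 GB and 28 GB figures are then better read as a conservative upper bound.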

Step 4: Test

Chat endpoint

http://localhost:8908/v1/chat/completions
http://<server-IP>:8908/v1/chat/completions
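Loading a model of this size takes a while, and requests sent before loading finishes may be rejected. llama-server exposes a /health endpoint that can be polled before sending the first chat request. A minimal sketch, assuming the port 8908 used above; the function name wait_until_ready and the retry values are illustrative:

```shell
# Poll a URL until it answers with a success status, or give up.
# Usage: wait_until_ready URL [ATTEMPTS] [DELAY_SECONDS]
wait_until_ready() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503
    if curl -fsS --max-time 2 -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}

# Example (commented out so the snippet can be loaded without a running server):
# wait_until_ready http://localhost:8908/health && echo "server is ready"
```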

Thinking mode is on by default

curl http://192.168.0.254:8908/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-14B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me about Singapore"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

Turning thinking mode off

curl http://192.168.0.254:8908/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-14B",
    "messages": [
      {"role": "system", "content": "You are an assistant that only answers the user's question"},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "max_tokens": 200,
    "chat_template_kwargs": { "enable_thinking": false }
  }'

Tool calling

curl http://192.168.0.254:8908/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-14B",
    "messages": [
      {"role": "system", "content": "You are an assistant that only answers the user's question"},
      {"role": "user", "content": "What time is it in Singapore right now?"}
    ],
    "temperature": 0.7,
    "max_tokens": 200,
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_time",
          "description": "Get the current time in the given city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": { "type": "string", "description": "City name" }
            },
            "required": ["city"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "chat_template_kwargs": { "enable_thinking": false }
  }'

🔥 Stopping a service started with nohup

Suppose it was started like this:

nohup ./build/bin/llama-server ... > llama.log 2>&1 &

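✅ Method 1 (recommended)

ps aux | grep llama-server # shows the PID
kill -9 12345 # kill the process; replace 12345 with the PID shown

✅ Method 2 (fastest)

pkill llama-server # kills every llama-server process

✅ Method 3 (kill by port)

If you know the port is 8908:

lsof -i:8908
kill <PID>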
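kill -9 ends the process immediately, without giving it a chance to shut down cleanly. A gentler pattern is to send the default SIGTERM first and escalate only if the process is still alive after a timeout. A minimal sketch; the function name stop_gracefully and the 10-second default are illustrative choices, not part of llama.cpp:

```shell
# Send SIGTERM, wait up to TIMEOUT seconds, then fall back to SIGKILL.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
stop_gracefully() {
  local pid="$1" timeout="${2:-10}" waited=0
  # If the signal cannot be delivered, the process is already gone
  # (or is not ours to signal); nothing more to do.
  kill "$pid" 2>/dev/null || return 0
  while kill -0 "$pid" 2>/dev/null; do   # kill -0 only checks that the PID exists
    if [ "$waited" -ge "$timeout" ]; then
      kill -9 "$pid" 2>/dev/null
      break
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}

# Example: stop the process listening on port 8908.
# stop_gracefully "$(lsof -t -iTCP:8908 -sTCP:LISTEN)"
```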

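🏆 Recommended practice (production): use systemctl to manage the service instead of nohup

nohup	systemd
Managed by hand	Automatic restart
No state management	Can start at boot
No health checks	Status monitoring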
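A systemd unit for the same server might look like the sketch below. The unit name llama-server.service, the user admin, and the install location /home/admin/llama.cpp are assumptions (this article only gives the binary as a relative path); the model path and flags are copied from the commands earlier in the article. The script only writes the file to the current directory so that it can be reviewed before installing.

```shell
# Write a candidate unit file to the current directory for review.
# ASSUMPTIONS: user "admin" and llama.cpp checked out at /home/admin/llama.cpp.
cat > llama-server.service <<'EOF'
[Unit]
Description=llama.cpp server (Qwen3-14B Reasoning Distill, q8_0)
After=network-online.target
Wants=network-online.target

[Service]
User=admin
WorkingDirectory=/home/admin/llama.cpp
ExecStart=/home/admin/llama.cpp/build/bin/llama-server \
  -m /home/admin/models/huggingface/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF/Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf \
  -ngl 999 -c 40960 --parallel 4 --jinja \
  --host 0.0.0.0 --port 8908
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

# To install (requires root):
#   sudo cp llama-server.service /etc/systemd/system/
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now llama-server   # start now and at every boot
#   systemctl status llama-server              # current state and recent log lines
#   journalctl -u llama-server -f              # follow the log
```

Restart=on-failure covers the automatic restart, enable covers start at boot, and systemctl status covers the status monitoring listed in the table.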
