Deploying the Qwen3-14B-Claude-4.5-Opus-Distill-GGUF Model with llama.cpp | 极客日志


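This article deploys the Qwen3-14B-Claude-4.5-Opus-Distill-GGUF model with llama.cpp. It compares Ollama with llama.cpp; the latter delivers higher performance and full control over the GPU. Serving a 40K context takes 21-25 GB of VRAM. The walkthrough covers building, starting the server, testing the API, and tool calling, including thinking-mode configuration, and ends with background execution and process management for high-performance local inference.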

By 清心. Published 2026/4/8, updated 2026/7/20.

Model: Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF

Model name used in API requests: "model": "Qwen3-14B"

VRAM: 21-25 GB

max-model-len: 40960

Concurrency: 4

Deployment server: DGX-Spark-GB10 120GB

Generation speed: 13 tokens/s
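
The 21-25 GB figure can be sanity-checked with some rough arithmetic. The sketch below estimates the f16 KV cache for a 40960-token context. The layer count, KV-head count and head dimension are assumptions taken from Qwen3-14B's published architecture, not from this article, so check them against the model card. The q8_0 weights (about 8.5 bits per weight for a model of roughly 14.8B parameters) add something like 15-16 GB on top, which together with compute buffers lands in the stated range.

```shell
# Rough KV-cache size estimate. Architecture values are assumptions.
layers=40         # transformer layers (assumed)
kv_heads=8        # grouped-query KV heads (assumed)
head_dim=128      # per-head dimension (assumed)
bytes_per_elem=2  # f16 cache entries
ctx=40960         # context length, as passed with -c 40960

# The leading factor 2 accounts for the separate K and V tensors.
kv_bytes=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx ))
kv_mib=$(( kv_bytes / 1024 / 1024 ))
echo "KV cache: ${kv_mib} MiB"
```

With these assumptions the cache comes to 6400 MiB (6.25 GiB) for the full context.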

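There are three ways to deploy a model in GGUF format

Aspect	Ollama	llama.cpp	LM Studio/OpenWebUI
Ease of setup	⭐ Easiest	⭐⭐⭐ Requires compiling	⭐ Graphical interface
Inference performance	🔶 Medium	🥇 Best	🔶 Medium
GPU control	Limited	Full control	Limited
API service	Works out of the box	Must be started manually	Built in
Best suited for	Quick deployment/production	Performance tuning/research	Local experimentation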

Option 1: Ollama

Prerequisite: ollama is already installed.

Step 1: Download the model from Hugging Face

https://huggingface.co/TeichAI/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF/tree/main

Step 2: Write the Modelfile, pointing it at the Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf file

FROM ./Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0

Step 3: Create the Ollama model

ollama create qwen3-claude-distill -f Modelfile

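Step 4: Test

Note: the model's thinking template has a problem. It emits the escaped sequence '\u003cthink\u003e\n' (that is, '<think>' followed by a newline), so the template needs to be adjusted.

# Access via the Ollama API
# The default port is 11434
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-claude-distill",
    "messages": [
      {"role": "user", "content": "Hello, please introduce yourself"}
    ],
    "stream": false
  }'

The sample response includes a reasoning_content field.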

Option 2: llama.cpp


git clone https://github.com/ggerganov/llama.cpp

cd ./llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 8
# The binaries are written to ./build/bin/
# Option notes:
# -DLLAMA_BUILD_SERVER=ON forces llama-server to be built
# -DGGML_CUDA=ON enables the CUDA (GPU) backend
# -DCMAKE_BUILD_TYPE=Release produces an optimized, faster build
# Verify that the build succeeded
./build/bin/llama-server --help
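
The CUDA build above fails part-way through if a tool such as nvcc is not on the PATH, so it can save time to check first. A minimal sketch follows; the helper name missing_tools is made up for illustration.

```shell
# Print the name of every required command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage, before running cmake:
#   missing=$(missing_tools git cmake nvcc)
#   [ -z "$missing" ] || echo "Install first: $missing"
```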

# Minimal command
./build/bin/llama-server \
  -m /home/admin/models/huggingface/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF/Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf \
  -ngl 999 \
  -c 40960 \
  --host 0.0.0.0 \
  --port 8908

nohup ./build/bin/llama-server \
  -m /home/admin/models/huggingface/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF/Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf \
  -ngl 999 \
  --batch-size 1024 \
  --threads 16 \
  --parallel 4 \
  --jinja \
  --reasoning-format deepseek \
  --reasoning-budget -1 \
  -c 40960 \
  --host 0.0.0.0 \
  --port 8908 \
  >> /home/admin/models/logs/llama_Qwen3-14B_Distill.log 2>&1 &
# Follow the log
tail -f ~/models/logs/llama_Qwen3-14B_Distill.log
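
Loading a model of this size takes a while, so requests sent right after the nohup command may fail. llama-server exposes a /health endpoint that answers with HTTP 200 once the model is ready. The sketch below polls it; the function name and the default timeout are illustrative choices, and the timeout is only approximate because each curl attempt may itself take up to two seconds.

```shell
# Poll <base_url>/health until it returns HTTP 200 or the timeout
# (in seconds, default 120) runs out. Returns 0 when ready, 1 otherwise.
wait_for_server() {
  base_url=$1
  timeout=${2:-120}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -f makes curl fail on HTTP errors such as 503 while the model loads.
    if curl -sf --max-time 2 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$(( elapsed + 1 ))
  done
  return 1
}
```

For example: wait_for_server http://localhost:8908 300 && echo "server is ready"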

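-ngl 999: short for --n-gpu-layers, the number of transformer layers offloaded to the GPU
0 everything on the CPU
20 the first 20 layers on the GPU
999 as many layers as possible on the GPU
-c 40960: short for --ctx-size, the context length (maximum number of tokens)
--host 0.0.0.0: the listen address; 0.0.0.0 makes the server reachable from other machines on the LAN
--port 8908: the HTTP listen port
--threads 16: number of CPU threads
--batch-size 1024: maximum number of tokens processed in one batch step
--parallel 4: number of requests handled at the same time (concurrent sessions)
--jinja: use the Jinja chat template embedded in the model (required for tool calling)
--reasoning-format deepseek: reasoning output format; the model's thinking is returned in a separate reasoning_content field
--reasoning-budget N: controls thinking mode
Value	Meaning
-1	Unlimited thinking (default, enabled)
0	Thinking disabled
>0	Thinking limited to N tokens (supported by some models only)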

http://localhost:8908/v1/chat/completions
http://<server-IP>:8908/v1/chat/completions

# Request (thinking mode is on by default)
curl http://192.168.0.254:8908/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-14B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Give me an introduction to Singapore"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

# Request with thinking mode turned off
curl http://192.168.0.254:8908/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-14B",
    "messages": [
      {"role": "system", "content": "You are an assistant that only answers the user'"'"'s question"},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "max_tokens": 200,
    "chat_template_kwargs": { "enable_thinking": false }
  }'
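
The only difference between the two requests is the chat_template_kwargs.enable_thinking flag, so a small helper can build the body and let the caller flip it. This is a sketch: chat_payload is a made-up name, and it escapes only backslashes and double quotes, so the prompt is assumed to contain no newlines or other control characters.

```shell
# Build a chat-completions request body.
#   $1 = user prompt
#   $2 = "true" or "false" for chat_template_kwargs.enable_thinking
chat_payload() {
  # Escape backslashes first, then double quotes, for use inside a JSON string.
  prompt=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model":"Qwen3-14B","messages":[{"role":"user","content":"%s"}],"chat_template_kwargs":{"enable_thinking":%s}}\n' \
    "$prompt" "$2"
}
```

It can then be passed straight to curl, for example: curl http://192.168.0.254:8908/v1/chat/completions -H "Content-Type: application/json" -d "$(chat_payload 'Hello' false)"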

# Tool-calling request (thinking mode off)
curl http://192.168.0.254:8908/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-14B",
    "messages": [
      {"role": "system", "content": "You are an assistant that only answers the user'"'"'s question"},
      {"role": "user", "content": "What time is it in Singapore right now?"}
    ],
    "temperature": 0.7,
    "max_tokens": 200,
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_time",
          "description": "Get the current time in the given city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "chat_template_kwargs": { "enable_thinking": false }
  }'
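
With tool calling the server does not run anything itself. It only returns a tool_calls entry naming get_current_time and its arguments, and the client has to execute the function. Below is a hypothetical local implementation; the mapping from city to time zone is invented for illustration and covers just two cities.

```shell
# Hypothetical client-side implementation of the get_current_time tool.
# Maps a city name to an IANA time zone and prints the local time there.
get_current_time() {
  case "$1" in
    Singapore|新加坡) tz="Asia/Singapore" ;;
    Beijing|北京)     tz="Asia/Shanghai" ;;
    *)                tz="UTC" ;;  # fallback for cities not listed above
  esac
  TZ="$tz" date '+%Y-%m-%d %H:%M:%S %Z'
}
```

The printed string is then sent back to the model in a follow-up request, as a message with "role": "tool" and the tool_call_id taken from the response, so the model can phrase the final answer.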

Stopping a service started with nohup

# The server was started in the background like this:
nohup ./build/bin/llama-server ... > llama.log 2>&1 &

Method 1 (recommended)
1. Find the process:
ps aux | grep llama-server
Example output: admin 12345 ...
Here 12345 is the PID.
2. Kill the process:
kill -9 12345

Method 2 (fastest)
pkill llama-server

Method 3 (kill by port)
lsof -i:8908
kill <PID>
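
kill -9 gives the server no chance to shut down cleanly. A gentler variant of the methods above sends the default SIGTERM first and escalates only if the process is still alive after a grace period. This is a sketch; stop_pid is a made-up helper name and the 10-second default is arbitrary.

```shell
# Ask a process to exit with SIGTERM, wait up to $2 seconds (default 10),
# and fall back to SIGKILL only if it is still running.
stop_pid() {
  pid=$1
  grace=${2:-10}
  kill "$pid" 2>/dev/null || return 0  # nothing to do: already gone
  waited=0
  while kill -0 "$pid" 2>/dev/null; do  # kill -0 only tests for existence
    if [ "$waited" -ge "$grace" ]; then
      kill -9 "$pid" 2>/dev/null
      break
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}
```

For example: stop_pid 12345, using the PID found with ps or lsof.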

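Recommended practice for production: use systemctl

nohup	systemd
Managed by hand	Restarts automatically
No state management	Can start at boot
No health checks	Status monitoring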
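
A minimal unit file for this setup might look like the sketch below, which writes llama-server.service into the current directory for review. The install location /home/admin/llama.cpp, the admin user and the restart policy are assumptions; the server flags are the ones used earlier in this article.

```shell
# Write a unit file into the current directory. Paths and user are assumptions.
cat > llama-server.service <<'EOF'
[Unit]
Description=llama.cpp server (Qwen3-14B distill)
After=network.target

[Service]
User=admin
WorkingDirectory=/home/admin/llama.cpp
ExecStart=/home/admin/llama.cpp/build/bin/llama-server \
  -m /home/admin/models/huggingface/Qwen3-14B-Claude-4.5-Opus-High-Reasoning-Distill-GGUF/Qwen3-14B-Claude-4.5-Opus-Distill.q8_0.gguf \
  -ngl 999 -c 40960 --parallel 4 --jinja \
  --host 0.0.0.0 --port 8908
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
```

After checking the file, copy it to /etc/systemd/system/, run sudo systemctl daemon-reload, and start it with sudo systemctl enable --now llama-server. Logs are then available through journalctl -u llama-server -f instead of a nohup log file.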
