A Guide to the New llama.cpp and Local Deployment of Llama Models | 极客日志


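How to deploy Llama large language models locally with llama.cpp. The guide covers environment setup (building with CUDA support), model format conversion (pth/hf to gguf), quantization, command-line interaction, and starting the API service, and finally uses Open WebUI to provide a ChatGPT-like chat interface. It is intended as a reference for developers who want to run large models offline.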

菩提 · Published 2026/4/6 · Updated 2026/7/2568 views

Introduction

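Large models have been developing at a rapid pace. This time we look at the llama.cpp project, which mainly addresses performance during inference. It relies on two main optimizations:

llama.cpp uses ggml, a machine-learning tensor library written in C
llama.cpp provides tools for model quantization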

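The advantage of this project is that it can run LLaMA models without a GPU. llama.cpp is a distinct ecosystem with its own design philosophy: it aims to be lightweight and multi-platform, with minimal external dependencies and broad, flexible hardware support:

Pure C/C++ implementation with no external dependencies
Broad hardware support:
- AVX, AVX2, and AVX512 support on x86_64 CPUs
- Apple Silicon (CPU and GPU) via Metal and Accelerate
- NVIDIA GPUs (via CUDA), AMD GPUs (via hipBLAS), Intel GPUs (via SYCL), Ascend NPUs (via CANN), and Moore Threads GPUs (via MUSA)
- A Vulkan backend for GPUs
Multiple quantization schemes to speed up inference and reduce memory use
Hybrid CPU+GPU inference, to accelerate models larger than the total VRAM capacity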

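llama.cpp provides quantization tools for large models that can convert model parameters from 32-bit floating point to 16-bit floating point, or even to 8-bit or 4-bit integers. In addition, llama.cpp ships a serving component that exposes the model directly through an API.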
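To make the idea of converting floats to 8-bit integers concrete, here is a minimal sketch of symmetric block quantization in plain Python. It is only an illustration, not llama.cpp's actual code: the block length of 32, the 2-byte scale, and the function names are all assumptions chosen for the example.

```python
BLOCK_SIZE = 32  # assumed block length for this illustration


def quantize_block(values):
    """Map one block of floats to (scale, list of int8-range codes)."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from a quantized block."""
    return [scale * c for c in codes]


def quantize(values):
    """Quantize a flat list of floats block by block."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


# Synthetic "weights" in [-1, 1], just to exercise the round trip.
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
blocks = quantize(weights)
restored = [x for scale, codes in blocks for x in dequantize_block(scale, codes)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Assumed storage: 32 one-byte codes plus one 2-byte scale per block.
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE
print(f"max round-trip error: {max_error:.5f}, bits per weight: {bits_per_weight}")
```

The round-trip error is bounded by half a quantization step per block, which is why lower bit widths save memory at the cost of some accuracy.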

1. llama.cpp environment setup

Clone the repository and change into its directory:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

To build with GPU support, make sure the CUDA toolkit is installed; this applies to systems that have an NVIDIA GPU.

If CUDA is set up correctly, nvidia-smi and nvcc --version both run without errors.
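The same check can be scripted before starting a long build. The sketch below only looks for the two tools on PATH and, if present, runs nvcc --version; the function names are made up for this example.

```python
import shutil
import subprocess


def check_cuda_tools(tools=("nvidia-smi", "nvcc")):
    """Return {tool: path or None} for each tool looked up on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def nvcc_version():
    """Return the output of `nvcc --version`, or None if nvcc is unavailable."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    for tool, path in check_cuda_tools().items():
        print(f"{tool}: {path or 'not found'}")
```

A missing entry here suggests fixing the CUDA installation or PATH before running cmake with -DGGML_CUDA=ON.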

mkdir build
# Install the build tools first
sudo apt-get install make cmake gcc g++ locate
# Configure with CUDA enabled, then compile
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j4
cd build
make install

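In the current version (as of November 10, 2024), the command-line tools have been renamed to llama-quantize, llama-cli, and llama-server respectively (formerly quantize, main, and server).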

ln -s your/path/to/llama.cpp/build/bin/llama-quantize llama-quantize
ln -s your/path/to/llama.cpp/build/bin/llama-server llama-server
ln -s your/path/to/llama.cpp/build/bin/llama-cli llama-cli

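2. LLaMA model conversion

Starting from the pth weights, this section walks step by step through getting the model into a form that llama.cpp can use.

2.1 Handling the original pth model

First install a recent Python; 3.10 is used here.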

pip install protobuf==3.20.0
pip install transformers # latest version
pip install sentencepiece # (0.1.97 tested)
pip install peft # (0.2.0 tested)
pip install git+https://github.com/huggingface/transformers
pip install sentencepiece
pip install peft


2.1.1 Magnet link download

sudo apt update
sudo apt install transmission-cli
transmission-cli "magnet:?xt=urn:btih:ZXXDAUWYLRUXXBHUYEMS6Q5CE5WA3LVA&dn=LLaMA"

├── llama-7b
│   ├── consolidated.00.pth
│   ├── params.json
│   └── checklist.chk
└── tokenizer.model

2.1.2 Downloading with pyllama

pip3 install transformers pyllama -U

python3 -m llama.download --model_size 7B

On an Apple Silicon Mac, the download command may fail with an architecture mismatch in the itree extension:

ImportError: dlopen(/Library/Python/3.9/site-packages/_itree.cpython-39-darwin.so, 0x0002): tried: '/Library/Python/3.9/site-packages/_itree.cpython-39-darwin.so'(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Library/Python/3.9/site-packages/_itree.cpython-39-darwin.so'(no such file), '/Library/Python/3.9/site-packages/_itree.cpython-39-darwin.so'(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))

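The commands below build itree from source and reinstall pyllama from its repository before the download is retried:

brew install cmake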
pip3 install https://github.com/juncongmoo/itree/archive/refs/tags/v0.0.18.tar.gz

pip3 uninstall pyllama
git clone https://github.com/juncongmoo/pyllama
pip3 install -e pyllama

python3 -m llama.download --model_size 7B
❤️ Resume download is supported. You can ctrl-c and rerun the program to resume the downloading
Downloading tokenizer...
✅ pyllama_data/tokenizer.model
✅ pyllama_data/tokenizer_checklist.chk
tokenizer.model: OK
Downloading 7B
downloading file to pyllama_data/7B/consolidated.00.pth ...please wait for a few minutes ...
✅ pyllama_data/7B/consolidated.00.pth
✅ pyllama_data/7B/params.json
✅ pyllama_data/7B/checklist.chk
Checking checksums for the 7B model
consolidated.00.pth: OK
params.json: OK
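The "OK" lines above come from checking each file against checklist.chk. If you obtained the files another way, a similar check can be done by hand. This sketch assumes the checklist uses the md5sum layout (one "hash  filename" pair per line); that layout and the function names are assumptions, so inspect your own checklist.chk first.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checklist(checklist_path):
    """Return {filename: True/False} for every entry in an md5sum-style checklist."""
    checklist_path = Path(checklist_path)
    results = {}
    for line in checklist_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading "*"; strip it if present.
        target = checklist_path.parent / name.lstrip("*").strip()
        results[target.name] = target.exists() and md5_of(target) == expected
    return results
```

For the layout shown above, it would be called as verify_checklist("pyllama_data/7B/checklist.chk").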

2.1.3 Script download

#!/bin/bash
# Stop the script cleanly when asked to
function stop_script(){
    echo "Stopping the script."
    exit 0
}
# Register the handler for Ctrl-C (SIGINT)
trap stop_script SIGINT
while true; do
    # Run the download with a timeout of 2000 seconds
    timeout 2000 python -m llama.download --model_size "$1" --folder model
    echo "restart download"
    # Wait for 1 second before starting the next iteration
    sleep 1
    # Stop if any key is pressed within a 1-second timeout
    read -t 1 -n 1 -s key
    if [[ $key ]]; then
        stop_script
    fi
done
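The same restart-on-timeout idea can be sketched in Python. Unlike the shell loop, this version stops as soon as the command exits successfully and gives up after a fixed number of attempts; both of those behaviours, and the function name, are choices made for this example rather than part of the original script.

```python
import subprocess
import sys
import time


def run_with_retries(command, timeout_s=2000, max_attempts=5, pause_s=1.0):
    """Run `command` until it exits with code 0; return the number of attempts used.

    Raises RuntimeError if every attempt times out or fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(command, timeout=timeout_s)
            if result.returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            pass  # subprocess.run kills the child process before raising
        print(f"attempt {attempt} did not finish, restarting", file=sys.stderr)
        time.sleep(pause_s)
    raise RuntimeError(f"command failed after {max_attempts} attempts")
```

It could be invoked as run_with_retries([sys.executable, "-m", "llama.download", "--model_size", "7B", "--folder", "model"]), which relies on the download being resumable, as the pyllama output reports.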

bash llama_download.sh 7B

pyllama_data
|-- 7B
|   |-- checklist.chk
|   |-- consolidated.00.pth
|   `-- params.json
|-- tokenizer.model
`-- tokenizer_checklist.chk

2.2 Converting the original weights to hf format

git clone https://huggingface.co/luodian/llama-7b-hf ./models/Llama-7b-chat-hf

2.2.1 hf format conversion

git clone https://github.com/huggingface/transformers.git
cd transformers
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir /workspace/pth_model/7B \
--model_size 7B \
--output_dir /workspace/hf_data

config.json
generation_config.json
pytorch_model-00001-of-00002.bin
pytorch_model-00002-of-00002.bin
pytorch_model.bin.index.json
special_tokens_map.json
tokenizer_config.json
tokenizer.json
tokenizer.model
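Before moving on, it can be worth confirming that every weight shard named in pytorch_model.bin.index.json actually exists in the output directory. This is a small sketch that assumes the index file has the usual "weight_map" object mapping tensor names to shard file names; the function name is made up for illustration.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """Return shard file names referenced by the index but absent on disk."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "pytorch_model.bin.index.json").read_text())
    shards = sorted(set(index["weight_map"].values()))
    return [name for name in shards if not (model_dir / name).exists()]
```

An empty list from missing_shards("/workspace/hf_data") would indicate that the conversion wrote all of the shards it indexed.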

2.2.2 Merging LoRA

git clone https://github.com/ymcui/Chinese-LLaMA-Alpaca.git

python scripts/merge_llama_with_chinese_lora.py \
--base_model /workspace/hf_data \
--lora_model /workspace/chinese_llama_lora_7b \
--output_dir /workspace/lora_pth_data

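2.3 Converting hf to a gguf model

# Replace the paths below with your own, and make sure the output location /workspace/chinese_gguf/llama-7b.gguf can be created. --outtype sets the precision of the output; quantizing at this step is optional, since llama-quantize (below) can do it afterwards.
python convert_hf_to_gguf.py ../hf_data --outfile /workspace/chinese_gguf/llama-7b.gguf --outtype q8_0
# For a pth model, use this instead:
python3 examples/convert_legacy_llama.py /workspace/lora_pth_data/ --outfile /workspace/chinese_gguf/chinese.gguf

# llama-quantize offers quantization at a range of precisions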
#./llama-quantize
#usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] model-f32.gguf [model-quant.gguf] type [nthreads]
# --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
# --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
# Allowed quantization types:
# 2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
# 3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
# 8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
# 9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
# 10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
# 12 or Q3_K : alias for Q3_K_M
# 11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
# 12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
# 13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
# 15 or Q4_K : alias for Q4_K_M
# 14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
# 15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
# 17 or Q5_K : alias for Q5_K_M
# 16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
# 17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
# 18 or Q6_K : 5.15G, -0.0008 ppl @ LLaMA-v1-7B
# 7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
# 1 or F16 : 13.00G @ 7B
# 0 or F32 : 26.00G @ 7B
# 2. Convert the precision with llama-quantize
# Run llama-quantize --help to see the supported precisions and further usage
llama-quantize /workspace/chinese_gguf/chinese.gguf /workspace/chinese_gguf/chinese_q4_0.gguf Q4_0
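The sizes in the llama-quantize table can be turned into a rough bits-per-weight figure. The sketch below copies a few sizes from that table and scales the 16 bits of F16 by the file-size ratio; it assumes file size is dominated by the weights, so treat the results as approximations.

```python
# Sizes in "G" copied from the llama-quantize table above (LLaMA-v1-7B).
SIZES_G = {"F16": 13.00, "Q8_0": 6.70, "Q5_K_M": 4.45, "Q4_K_M": 3.80, "Q4_0": 3.56}


def approx_bits_per_weight(size_g, f16_size_g=SIZES_G["F16"]):
    """Scale the 16 bits of F16 by the file-size ratio."""
    return 16.0 * size_g / f16_size_g


for name, size in SIZES_G.items():
    print(f"{name:7s} {size:5.2f}G  ~{approx_bits_per_weight(size):.2f} bits/weight  "
          f"{SIZES_G['F16'] / size:.2f}x smaller than F16")
```

Read together with the perplexity column, this gives a quick way to compare how much memory each type saves against how much quality it gives up.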

ls -lh /workspace/chinese_gguf

3. Running GGUF models with llama.cpp

git clone https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF ./models/LLaMA-2-7B-32K-Instruct_GGUF

3.1 Interactive mode

llama-cli -m chinese_q4_0.gguf -p "you are a helpful assistant" -cnv -ngl 24
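The -ngl option sets how many model layers are offloaded to the GPU. As a very rough way to pick a starting value, the sketch below divides the model file size evenly across the layers and counts how many fit in the available VRAM. This is a hypothetical heuristic, not something llama.cpp computes: it ignores the KV cache and scratch buffers (hence the assumed 1 GiB reserve), and the layer count must be supplied by you.

```python
import math


def estimate_gpu_layers(model_size_gib, n_layers, free_vram_gib, reserve_gib=1.0):
    """Rough count of layers that fit in VRAM, assuming equal-sized layers."""
    per_layer = model_size_gib / n_layers
    usable = max(0.0, free_vram_gib - reserve_gib)
    return min(n_layers, math.floor(usable / per_layer))


# Example: a 3.56 GiB model with 32 layers on a GPU with 4 GiB free.
print(estimate_gpu_layers(3.56, 32, 4.0))
```

In practice the value is usually tuned by trial: lower it if loading runs out of memory, raise it while VRAM remains free.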

3.2 Model API service

./llama-server -m /mnt/workspace/my-llama-13b-q4_0.gguf -ngl 28

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices: Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size =1002.00 MiB
llm_load_tensors: CUDA0 buffer size =14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx =512
llama_new_context_with_model: n_batch =512
llama_new_context_with_model: n_ubatch =512
llama_new_context_with_model: flash_attn =0
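The "offloaded 33/33 layers to GPU" line is the quickest confirmation that GPU offloading took effect. If the startup log is captured to a file, that line can be extracted with a small parser such as the sketch below. The pattern matches the wording of the log shown above; other llama.cpp versions may phrase it differently.

```python
import re

OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def parse_offload(log_text):
    """Return (offloaded, total) from a llama.cpp load log, or None if absent."""
    match = OFFLOAD_RE.search(log_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


sample = "llm_load_tensors: offloaded 33/33 layers to GPU"
print(parse_offload(sample))
```

A first number smaller than the second means part of the model stayed on the CPU.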

curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "What color is the sun?","n_predict": 512}'
{"content":".....","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,......}}
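The same request can be made from Python using only the standard library. This sketch mirrors the curl call above (same URL, header, and JSON fields) and reads the "content" field seen in the response; the function names are made up for the example, and it assumes the server is listening on localhost:8080.

```python
import json
import urllib.request


def build_completion_request(prompt, n_predict=512, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text from the "content" field."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as response:
        return json.loads(response.read())["content"]
```

With llama-server running, complete("What color is the sun?") would return the generated text as a string.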

pip install openai

import openai

client = openai.OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="sk-no-key-required",
)
completion = client.chat.completions.create(
    model="qwen",  # model name can be chosen arbitrarily
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "tell me something about michael jordan"},
    ],
)
print(completion.choices[0].message.content)
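For a chat-style interface it is often nicer to print the reply as it is generated. The sketch below assumes the server accepts stream=True on the OpenAI-compatible chat endpoint and that client is an openai.OpenAI instance configured as above; stream_reply is a hypothetical helper name.

```python
def stream_reply(client, user_text, model="qwen"):
    """Print a chat reply chunk by chunk and return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_text}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks may carry no choices
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
            print(piece, end="", flush=True)
    print()
    return "".join(parts)
```

It would be called as stream_reply(client, "tell me something about michael jordan").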

3.3 Model API service (third-party; not needed if you installed llama.cpp yourself)

pip install llama-cpp-python
pip install llama-cpp-python -i https://mirrors.aliyun.com/pypi/simple/

pip install sse_starlette starlette_context pydantic_settings

python -m llama_cpp.server --model models/Llama3-q8.gguf

4. Building a ChatGPT-like chat application

$ docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
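The docker command above starts Open WebUI but does not tell it where the model API lives. One possible way to connect the two is to pass the OpenAI-compatible endpoint through environment variables. The sketch below builds such a command in Python; it assumes llama-server runs on the host at port 8080 and is reachable from the container (for example, started with --host 0.0.0.0), and the URL, key, and function name are assumptions for illustration.

```python
import shlex


def open_webui_command(api_base="http://host.docker.internal:8080/v1",
                       api_key="sk-no-key-required", host_port=3000):
    """Return the docker argv for Open WebUI with an OpenAI-compatible backend."""
    return [
        "docker", "run", "-d",
        "-p", f"{host_port}:8080",
        "--add-host=host.docker.internal:host-gateway",
        "-e", f"OPENAI_API_BASE_URL={api_base}",
        "-e", f"OPENAI_API_KEY={api_key}",
        "-v", "open-webui:/app/backend/data",
        "--name", "open-webui",
        "--restart", "always",
        "ghcr.io/open-webui/open-webui:main",
    ]


# Print the command so it can be reviewed before running it.
print(shlex.join(open_webui_command()))
```

Once the container is up, the interface should be available at http://localhost:3000, following the port mapping in the command.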

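Introduction
1. llama.cpp environment setup
2. LLaMA model conversion
2.1 Handling the original pth model
2.1.1 Magnet link download
2.1.2 Downloading with pyllama
2.1.3 Script download
2.2 Converting the original weights to hf format
2.2.1 hf format conversion
2.2.2 Merging LoRA
2.3 Converting hf to a gguf model
3. Running GGUF models with llama.cpp
3.1 Interactive mode
3.2 Model API service
3.3 Model API service (third-party; not needed if you installed llama.cpp yourself)
4. Building a ChatGPT-like chat application
5. Reference links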
