Getting Started with the llama.cpp Framework and Hands-On C/C++ Coding

1. Introduction to the llama.cpp Framework

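llama.cpp is a lightweight inference engine created by Georgi Gerganov and implemented in C/C++. Its primary focus is running LLM inference efficiently on local hardware such as personal computers and Raspberry Pi boards.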

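Supported model families currently include LLaMA, Qwen, Gemma, and LLaVA, among others. The framework runs on CPUs, GPUs, and embedded devices, and it works well on consumer hardware and resource-constrained edge devices.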

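Because the code is written mostly in C/C++, llama.cpp is cross-platform (Windows, Linux, macOS), and it can also be called from Python, for example through the third-party llama-cpp-python library. Georgi Gerganov is a well-known developer in the open-source AI community who had previously built whisper.cpp and the ggml tensor library. After Meta released the LLaMA models in early 2023, he ported them to ggml so that large models could run inference locally on a CPU, and that port became llama.cpp.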

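Use cases:

Edge AI: deploy LLaMA or GPT-like models on devices with limited compute, such as smartphones and IoT devices.
Model research: researchers can iterate quickly on quantized models without resource-intensive hardware.
Offline inference: deploy the service in a local, offline environment to protect data privacy and trade secrets.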

2. About GGUF Model Files

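llama.cpp cannot directly load the original model files produced by PyTorch or TensorFlow. It stores weights and configuration mainly in the GGUF (GGML Unified Format) format. Early versions used the GGML format, which was later replaced by GGUF.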

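Earlier file formats (such as .pb, .pt, and .pth) target unquantized models and have limitations such as no quantization support and high memory use. GGML allows weights to be stored in low-precision formats (INT8, INT4), which reduces file size and memory footprint. GGUF builds on this: it carries more metadata, is designed to be easier to extend, bundles weights, tokenizer, and metadata in a single file, loads quickly, and is cross-platform.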
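To make the idea of low-precision weight storage concrete, here is a minimal sketch of block-wise 8-bit quantization. It is an illustration only, not ggml's implementation: the names `BlockQ8`, `quantize_q8`, and `dequantize_q8` are hypothetical, and the scale is kept as a 32-bit float for simplicity, whereas ggml's own 8-bit block format stores a 16-bit scale.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout for illustration: 32 weights share one scale factor.
constexpr std::size_t kBlockSize = 32;

struct BlockQ8 {
    float  scale;           // dequantized value = scale * q[i]
    int8_t q[kBlockSize];   // quantized weights in [-127, 127]
};

// Quantizes weights block by block; the input length must be a multiple of kBlockSize.
std::vector<BlockQ8> quantize_q8(const std::vector<float> & w) {
    assert(w.size() % kBlockSize == 0);
    std::vector<BlockQ8> out(w.size() / kBlockSize);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float * src = w.data() + b * kBlockSize;

        // The largest magnitude in the block maps to +/-127.
        float amax = 0.0f;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            amax = std::max(amax, std::fabs(src[i]));
        }
        const float scale = amax / 127.0f;
        const float inv   = scale != 0.0f ? 1.0f / scale : 0.0f;

        out[b].scale = scale;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out[b].q[i] = static_cast<int8_t>(std::lround(src[i] * inv));
        }
    }
    return out;
}

// Reconstructs approximate float weights from the quantized blocks.
std::vector<float> dequantize_q8(const std::vector<BlockQ8> & blocks) {
    std::vector<float> w;
    w.reserve(blocks.size() * kBlockSize);
    for (const BlockQ8 & blk : blocks) {
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            w.push_back(blk.scale * static_cast<float>(blk.q[i]));
        }
    }
    return w;
}
```

With this layout, a block of 32 weights takes 36 bytes on typical platforms instead of the 128 bytes needed for 32-bit floats, and the round-trip error per weight is at most about half of the block's scale.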

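Key features:

Broad hardware compatibility: runs on consumer hardware such as CPUs, with no need for an expensive GPU.
Low memory footprint: quantization significantly reduces memory use.
Configurable quantization: lets you tune the trade-off between performance and accuracy.
Scales to large models: supports large-scale LLMs.
Easy to deploy: suitable for cloud servers, edge devices, and mobile devices.
Rich metadata: stores the layer structure, configuration parameters, quantization level, and more.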
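As an illustration of the metadata that a GGUF file carries, the sketch below reads only the fixed-size file header. It assumes the header layout used by GGUF version 2 and later (4-byte magic "GGUF", 32-bit version, 64-bit tensor count, 64-bit metadata key/value count, all little-endian). The names `GgufHeader`, `read_le`, and `read_gguf_header` are hypothetical helpers, not part of the llama.cpp API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Fixed-size fields at the start of a GGUF file (assumed v2+ layout).
struct GgufHeader {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
};

// Reads a little-endian unsigned integer byte by byte,
// so the result does not depend on the host's endianness.
template <typename T>
bool read_le(std::istream & in, T & value) {
    unsigned char bytes[sizeof(T)];
    if (!in.read(reinterpret_cast<char *>(bytes), sizeof(T))) {
        return false;
    }
    value = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        value |= static_cast<T>(bytes[i]) << (8 * i);
    }
    return true;
}

// Returns true and fills `out` if the stream starts with a GGUF header.
bool read_gguf_header(std::istream & in, GgufHeader & out) {
    char magic[4];
    if (!in.read(magic, 4) || std::memcmp(magic, "GGUF", 4) != 0) {
        return false;  // too short, or not a GGUF file
    }
    return read_le(in, out.version)
        && read_le(in, out.tensor_count)
        && read_le(in, out.metadata_kv_count);
}

// Convenience overload for data that is already in memory.
bool read_gguf_header(const std::string & bytes, GgufHeader & out) {
    std::istringstream in(bytes);
    return read_gguf_header(in, out);
}
```

To inspect a real file, open it with `std::ifstream` in binary mode and pass the stream to `read_gguf_header`. The metadata key/value pairs and the tensor descriptions follow the header and are not parsed by this sketch.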

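Obtaining and converting models:

Developers can use a conversion script from the llama.cpp repository to convert an original PyTorch model to GGUF. In recent versions of the repository the script is convert_hf_to_gguf.py (older versions shipped convert.py), and --outtype selects the output precision, such as f16 or q8_0, rather than the file format:

python convert_hf_to_gguf.py path/to/model --outtype f16 --outfile model.gguf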

// Example program using the llama.cpp C++ API: loads a GGUF model, tokenizes a
// prompt, and generates tokens for n_parallel sequences until n_predict is reached.
#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    // Hard-coded parameters instead of command-line parsing
    common_params params;
    params.model.path = "/home/dev/LLM/llama.cpp/models/Qwen3-0.6B-Q8_0.gguf";
    params.prompt     = "What is LLM?";
    params.n_predict  = 128;

    common_init();

    // Number of parallel sequences and total sequence length (prompt included)
    int n_parallel = params.n_parallel;
    int n_predict  = params.n_predict;

    // Initialize the backend
    llama_backend_init();
    llama_numa_init(params.numa);

    // Load the model
    llama_model_params model_params = common_model_params_to_llama(params);
    llama_model * model = llama_model_load_from_file(params.model.path.c_str(), model_params);
    if (model == NULL) {
        LOG_ERR("%s: error: unable to load model\n", __func__);
        return 1;
    }

    // Tokenize the prompt
    const llama_vocab * vocab = llama_model_get_vocab(model);
    std::vector<llama_token> tokens_list;
    tokens_list = common_tokenize(vocab, params.prompt, true);

    // The prompt is shared; each sequence adds its own generated tokens
    const int n_kv_req = tokens_list.size() + (n_predict - tokens_list.size())*n_parallel;

    // Create the context
    llama_context_params ctx_params = common_context_params_to_llama(params);
    ctx_params.n_ctx   = n_kv_req;
    ctx_params.n_batch = std::max(n_predict, n_parallel);
    llama_context * ctx = llama_init_from_model(model, ctx_params);

    // Build the sampler chain: top-k -> top-p -> temperature -> random draw
    auto sparams = llama_sampler_chain_default_params();
    sparams.no_perf = false;
    llama_sampler * smpl = llama_sampler_chain_init(sparams);
    llama_sampler_chain_add(smpl, llama_sampler_init_top_k(params.sampling.top_k));
    llama_sampler_chain_add(smpl, llama_sampler_init_top_p(params.sampling.top_p, params.sampling.min_keep));
    llama_sampler_chain_add(smpl, llama_sampler_init_temp(params.sampling.temp));
    llama_sampler_chain_add(smpl, llama_sampler_init_dist(params.sampling.seed));

    if (ctx == NULL) {
        LOG_ERR("%s: error: failed to create the llama_context\n", __func__);
        return 1;
    }

    const int n_ctx = llama_n_ctx(ctx);
    LOG_INF("\n%s: n_predict = %d, \nn_ctx = %d, \nn_batch = %u, \nn_parallel = %d, \nn_kv_req = %d\n",
            __func__, n_predict, n_ctx, ctx_params.n_batch, n_parallel, n_kv_req);

    // Make sure the KV cache is big enough to hold all sequences
    if (n_kv_req > n_ctx) {
        LOG_ERR("the required KV cache size is not big enough\n");
        return 1;
    }

    // Print the prompt token by token
    LOG("\n");
    for (auto id : tokens_list) {
        LOG("%s", common_token_to_piece(ctx, id).c_str());
    }

    // Batch used to submit tokens for decoding
    llama_batch batch = llama_batch_init(std::max(tokens_list.size(), (size_t) n_parallel), 0, n_parallel);

    std::vector<llama_seq_id> seq_ids(n_parallel, 0);
    for (int32_t i = 0; i < n_parallel; ++i) {
        seq_ids[i] = i;
    }

    // Evaluate the prompt once and assign it to every sequence
    for (size_t i = 0; i < tokens_list.size(); ++i) {
        common_batch_add(batch, tokens_list[i], i, seq_ids, false);
    }
    GGML_ASSERT(batch.n_tokens == (int) tokens_list.size());

    // Encoder-decoder models: run the encoder, then start from the decoder start token
    if (llama_model_has_encoder(model)) {
        if (llama_encode(ctx, batch)) {
            LOG_ERR("%s : failed to eval\n", __func__);
            return 1;
        }

        llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
        if (decoder_start_token_id == LLAMA_TOKEN_NULL) {
            decoder_start_token_id = llama_vocab_bos(vocab);
        }

        common_batch_clear(batch);
        common_batch_add(batch, decoder_start_token_id, 0, seq_ids, false);
    }

    // Only the last token of the prompt needs to output logits
    batch.logits[batch.n_tokens - 1] = true;

    if (llama_decode(ctx, batch) != 0) {
        LOG_ERR("%s: llama_decode() failed\n", __func__);
        return 1;
    }

    if (n_parallel > 1) {
        LOG("\n\n%s: generating %d sequences ...\n", __func__, n_parallel);
    }

    // Generated text of each sequence
    std::vector<std::string> streams(n_parallel);

    // Index in the batch of each sequence's last token; -1 means the sequence has finished
    std::vector<int32_t> i_batch(n_parallel, batch.n_tokens - 1);

    int n_cur    = batch.n_tokens;
    int n_decode = 0;

    const auto t_main_start = ggml_time_us();

    // Main generation loop
    while (n_cur <= n_predict) {
        common_batch_clear(batch);

        // Sample the next token for every sequence that is still running
        for (int32_t i = 0; i < n_parallel; ++i) {
            if (i_batch[i] < 0) continue;

            const llama_token new_token_id = llama_sampler_sample(smpl, ctx, i_batch[i]);

            // Stop this sequence on end-of-generation or when the length limit is reached
            if (llama_vocab_is_eog(vocab, new_token_id) || n_cur == n_predict) {
                i_batch[i] = -1;
                LOG("\n");
                if (n_parallel > 1) {
                    LOG_INF("%s: stream %d finished at n_cur = %d", __func__, i, n_cur);
                }
                continue;
            }

            // With a single sequence, print tokens as they are generated
            if (n_parallel == 1) {
                LOG("%s", common_token_to_piece(ctx, new_token_id).c_str());
            }

            streams[i] += common_token_to_piece(ctx, new_token_id);

            i_batch[i] = batch.n_tokens;

            // Queue the new token for the next decode step
            common_batch_add(batch, new_token_id, n_cur, { i }, true);

            n_decode += 1;
        }

        // All sequences have finished
        if (batch.n_tokens == 0) break;

        n_cur += 1;

        if (llama_decode(ctx, batch)) {
            LOG_ERR("%s : failed to eval, return code %d\n", __func__, 1);
            return 1;
        }
    }

    if (n_parallel > 1) {
        LOG("\n");
        for (int32_t i = 0; i < n_parallel; ++i) {
            LOG("sequence %d:\n\n%s%s\n\n", i, params.prompt.c_str(), streams[i].c_str());
        }
    }

    const auto t_main_end = ggml_time_us();

    LOG_INF("%s: decoded %d tokens in %.2f s, speed: %.2f t/s\n",
            __func__, n_decode, (t_main_end - t_main_start) / 1000000.0f,
            n_decode / ((t_main_end - t_main_start) / 1000000.0f));

    LOG("\n");
    llama_perf_sampler_print(smpl);
    llama_perf_context_print(ctx);

    fprintf(stderr, "\n");

    // Release resources
    llama_batch_free(batch);
    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();

    return 0;
}
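The listing above sizes the context from `n_kv_req`: the prompt is evaluated once and shared by all sequences, and each parallel sequence then adds its own generated tokens up to a total length of `n_predict`. The standalone sketch below reproduces that arithmetic with signed integers so it can be checked in isolation. The helper name `required_kv_cells` is hypothetical and not part of llama.cpp, and rejecting a prompt longer than `n_predict` is an added assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Number of KV cache cells needed when n_parallel sequences share one prompt.
// n_predict is the total sequence length, prompt included (as in the listing).
// Returns -1 for inputs the formula is not meant for.
int64_t required_kv_cells(int64_t n_prompt, int64_t n_predict, int64_t n_parallel) {
    if (n_prompt < 0 || n_predict < n_prompt || n_parallel < 1) {
        return -1;
    }
    // Shared prompt + per-sequence generated tokens
    return n_prompt + (n_predict - n_prompt) * n_parallel;
}
```

For example, a 5-token prompt with `n_predict = 128` needs 128 cells for one sequence and 5 + 123 * 4 = 497 cells for four parallel sequences.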
