Installing llama.cpp with Termux on an Android Device and Starting the WebUI

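This article describes how to install the llama-cpp package via apt in the Termux environment on an Android device, download a quantized Qwen3.5 model, and test it interactively on the command line with llama-cli. It then shows how to start the WebUI service built into llama-server and chat through a browser on a local port. Everything runs on the CPU backend, which suits lightweight LLM deployment on mobile devices.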

小熊软糖 · published 2026/4/6 · updated 2026/7/1952 views

llama.cpp does not publish official aarch64 binaries, so you would have to compile it yourself. Fortunately, Termux already provides a prebuilt package.

Installing the llama-cpp package in Termux

~ $ apt update
Get:1 https://mirrors.tuna.tsinghua.edu.cn/termux/apt/termux-main stable InRelease [14.0 kB]
...
Reading package lists... Done

~ $ apt install llama-cpp
The following additional packages will be installed:
libandroid-spawn
Suggested packages: llama-cpp-backend-vulkan llama-cpp-backend-opencl
The following NEW packages will be installed:
libandroid-spawn llama-cpp
...
Setting up llama-cpp (0.0.0-b8184-0) ...

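If the package cannot be found, run apt update first to refresh the package index. To keep things simple, llama-cpp-backend-vulkan is not installed for now, and llama-cpp runs on the CPU.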
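After installation, it can be useful to confirm that both tools used later in this article are actually on the PATH. The following is a minimal sketch in Python, which is an assumption here: Python is not part of the original walkthrough and would need to be installed in Termux separately. `find_tools` is a hypothetical helper name.

```python
import shutil

def find_tools(names=("llama-cli", "llama-server")):
    """Map each tool name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}

# Report where each binary was found, if at all.
for name, path in find_tools().items():
    print(f"{name}: {path or 'not found'}")
```

If either tool is reported as not found, rerunning apt update and apt install llama-cpp is the first thing to try.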

Downloading the Qwen3.5-0.8B-UD-Q4_K_XL.gguf model

~ $ mkdir model
~ $ cd model
~/model $ curl -LO https://hf-mirror.com/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-UD-Q4_K_XL.gguf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  532M  100  532M    0     0  4147k      0  0:02:11  0:02:11 --:--:-- 5141k

This model is Q4-quantized: it takes about half the space of the original while retaining roughly the same capability.

Loading the model and chatting with the interactive llama-cli tool

~/model $ llama-cli -m Qwen3.5-0.8B-UD-Q4_K_XL.gguf --ctx-size 16384 -cnv
load_backend: loaded CPU backend from /data/data/com.termux/files/usr/bin/../lib/libggml-cpu.so
Loading model...
build : b0-unknown
model : Qwen3.5-0.8B-UD-Q4_K_XL.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

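Solve the equation x^2-x-20=0 by factoring

The equation we need to solve is: x^2 - x - 20 = 0

1. Analyze the structure of the equation

This is a quadratic equation in one variable. Note that the constant term is -20 and the coefficient of the linear term is 1. We need to find two numbers whose product is -20 and whose sum is 1. These two numbers are clearly 4 and -5. So the left-hand side of the equation can be rewritten as: (x - 4)(x + 5) = x^2 + x - 20

2. Verify and solve

By the Zero Product Property, if the product of two factors is 0, then one of the factors must be 0. x - 4 = 0 or x + 5 = 0, which gives: x1 = 4, x2 = -5

3. Conclusion

The solutions of the equation are: x = 4 or x = -5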

[ Prompt: 45.1 t/s | Generation: 6.6 t/s ]

/exit
Exiting...
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
Host | 1222 = 522 + 211 + 489 |

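Because the model is so small, its reasoning is weak. After some rambling it produces an answer that looks plausible but is wrong: it takes the linear coefficient to be 1 instead of -1, so its factorization (x - 4)(x + 5) expands to x^2 + x - 20 rather than x^2 - x - 20. The correct factorization is (x - 5)(x + 4) = 0, giving x = 5 or x = -4.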
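An answer like the one in the transcript is easy to check by substituting the claimed roots back into the left-hand side of x^2 - x - 20 = 0. The following is a minimal Python sketch of that check. Python is not part of the original walkthrough, and `lhs` is a hypothetical helper name.

```python
def lhs(x):
    """Left-hand side of the equation x^2 - x - 20 = 0."""
    return x * x - x - 20

model_roots = [4, -5]     # roots given by the model in the transcript
factored_roots = [5, -4]  # roots of (x - 5)(x + 4) = 0

for label, roots in [("model", model_roots), ("factored", factored_roots)]:
    for r in roots:
        # A value is a root only if the left-hand side evaluates to 0.
        print(f"{label}: lhs({r}) = {lhs(r)}")
```

The model's values give -8 and 10, so neither is a root, while 5 and -4 both give 0.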

Using the WebUI built into llama-server

~/model $ ls -l
total 546220
-rw------- 1 u0_a270 u0_a270 558772480 Mar 8 09:40 Qwen3.5-0.8B-UD-Q4_K_XL.gguf

~/model $ llama-server -m ./Qwen3.5-0.8B-UD-Q4_K_XL.gguf --jinja -c 0 --host 127.0.0.1 --port 8033
load_backend: loaded CPU backend from /data/data/com.termux/files/usr/bin/../lib/libggml-cpu.so
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 0 (unknown) with Clang 21.0.0 for Android aarch64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8
init: using 7 threads for HTTP server
start: binding port with default address family
srv load_model: loading model './Qwen3.5-0.8B-UD-Q4_K_XL.gguf'
common_init_result: fitting params to device memory
llama_params_fit_impl: no devices with dedicated memory found
llama_params_fit: successfully fit params to free device memory
srv init: init: chat template, example_format: '<|im_start|>system You are a helpful assistant<|im_end|>'
srv load_model: prompt cache is enabled, size limit: 8192 MiB
main: model loaded
main: server is listening on http://127.0.0.1:8033
main: starting the main loop...
srv update_slots: all slots are idle

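The log reports 8 CPU threads in total (n_threads = 8), of which 7 are assigned to the HTTP server. After printing its parameters, the server waits for a browser to connect at http://127.0.0.1:8033.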

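Enter a question in the browser. Output is somewhat slower than on the command line: about 3 t/s, compared with 6.6 t/s in llama-cli.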

The server prints the following:

srv log_server_r: done request: GET / 127.0.0.1 200
slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 262144
slot print_timing: id 3 | task 0 | prompt eval time = 1447.31 ms / 23 tokens (62.93 ms per token, 15.89 tokens per second)
eval time = 171453.86 ms / 569 tokens (301.32 ms per token, 3.32 tokens per second)
total time = 172901.17 ms / 592 tokens
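The tokens-per-second figures in the timing lines follow directly from the elapsed time and the token count. The sketch below recomputes them from the two timing lines of the log above. It is an illustration only: the regular expression and the `tokens_per_second` helper are assumptions made for this example, not part of llama.cpp.

```python
import re

# The two timing lines, copied from the server log above.
LOG = """\
slot print_timing: id 3 | task 0 | prompt eval time = 1447.31 ms / 23 tokens (62.93 ms per token, 15.89 tokens per second)
eval time = 171453.86 ms / 569 tokens (301.32 ms per token, 3.32 tokens per second)
"""

# Captures the stage name, elapsed milliseconds, and token count.
PATTERN = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")

def tokens_per_second(log_text):
    """Return {stage: tokens per second} for each timing line found."""
    rates = {}
    for stage, ms, tokens in PATTERN.findall(log_text):
        rates[stage] = int(tokens) / (float(ms) / 1000.0)
    return rates

rates = tokens_per_second(LOG)
for stage, rate in rates.items():
    print(f"{stage}: {rate:.2f} t/s")
# prompt eval: 15.89 t/s
# eval: 3.32 t/s
```

The eval rate is the generation speed, which matches the roughly 3 t/s observed in the browser.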

At this point, the WebUI service is running normally.
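Besides the WebUI, llama-server serves an OpenAI-compatible HTTP API on the same port, so the running server can also be queried from a script. The following is a minimal sketch using only the Python standard library. The host and port come from the llama-server command above. The helper names `build_chat_request` and `ask`, the `max_tokens` value, and the long timeout (chosen because generation on this device runs at only a few tokens per second) are assumptions for illustration.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8033"  # host and port from the llama-server command

def build_chat_request(prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # illustrative limit, not from the article
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, timeout=600):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, calling `ask("Solve x^2 - x - 20 = 0 by factoring.")` should return the model's reply as a string. Nothing is sent until `ask` is called, so the request can be built and inspected offline.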
