Installing llama.cpp on an Android device with Termux and launching the web UI

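llama.cpp does not publish an official aarch64 binary, so it would normally have to be compiled from source. Fortunately, Termux already provides a prebuilt package.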

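Following the approach in the article "Accelerating LLM inference with Vulkan on an Android phone", the steps are as follows.
1. Install the llama-cpp package in Termux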

~ $ apt install llama-cpp
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package llama-cpp
~ $ apt update
Get:1 https://mirrors.tuna.tsinghua.edu.cn/termux/apt/termux-main stable InRelease [14.0 kB]
Get:2 https://mirrors.tuna.tsinghua.edu.cn/termux/apt/termux-main stable/main aarch64 Packages [542 kB]
Fetched 556 kB in 1s (425 kB/s)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
83 packages can be upgraded. Run 'apt list --upgradable' to see them.
~ $ apt install llama-cpp
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libandroid-spawn
Suggested packages:
  llama-cpp-backend-vulkan llama-cpp-backend-opencl
The following NEW packages will be installed:
  libandroid-spawn llama-cpp
0 upgraded, 2 newly installed, 0 to remove and 83 not upgraded.
Need to get 9927 kB of archives.
After this operation, 99.2 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Get:1 https://mirrors.tuna.tsinghua.edu.cn/termux/apt/termux-main stable/main aarch64 libandroid-spawn aarch64 0.3 [15.2 kB]
Get:2 https://mirrors.tuna.tsinghua.edu.cn/termux/apt/termux-main stable/main aarch64 llama-cpp aarch64 0.0.0-b8184-0 [9911 kB]
Fetched 9927 kB in 2s (4059 kB/s)
Selecting previously unselected package libandroid-spawn.
(Reading database ... 6651 files and directories currently installed.)
Preparing to unpack .../libandroid-spawn_0.3_aarch64.deb ...
Unpacking libandroid-spawn (0.3) ...
Selecting previously unselected package llama-cpp.
Preparing to unpack .../llama-cpp_0.0.0-b8184-0_aarch64.deb ...
Unpacking llama-cpp (0.0.0-b8184-0) ...
Setting up libandroid-spawn (0.3) ...
Setting up llama-cpp (0.0.0-b8184-0) ...

If the package cannot be found, run apt update first to refresh the package index. To keep things simple, skip llama-cpp-backend-vulkan for now and run llama-cpp on the CPU.
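The install step can be wrapped so that it is safe to re-run. The following is only a sketch: `is_missing` and `ensure_llama_cpp` are hypothetical helper names, and it assumes that `llama-cli` being on PATH means the llama-cpp package is already installed.

```shell
#!/bin/sh
# Sketch: install llama-cpp only when llama-cli is not already on PATH.
# is_missing and ensure_llama_cpp are hypothetical helper names.

# Succeeds (exit status 0) when the named command cannot be found on PATH.
is_missing() {
    ! command -v "$1" >/dev/null 2>&1
}

# Refresh the package index first, then install without prompting.
ensure_llama_cpp() {
    if is_missing llama-cli; then
        apt update && apt install -y llama-cpp
    else
        echo "llama-cli is already installed"
    fi
}
```

Calling `ensure_llama_cpp` once in a Termux session should then cover both the "Unable to locate package" case and the normal install.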
2. Download the Qwen3.5-0.8B-UD-Q4_K_XL.gguf model

~ $ mkdir model
~ $ cd model
~/model $ wget -c https://hf-mirror.com/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-UD-Q4_K_XL.gguf
The program wget is not installed. Install it by executing:
 pkg install wget
~/model $ curl -LO https://hf-mirror.com/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-UD-Q4_K_XL.gguf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1391    0  1391    0     0   1771      0 --:--:-- --:--:-- --:--:--  1771
100  532M  100  532M    0     0  4147k      0  0:02:11  0:02:11 --:--:-- 5141k

This model is Q4-quantized: it takes roughly half the space of the original, with about the same capability.
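A 532M download on a phone can easily be interrupted, so a resumable variant of the curl command may be useful. This is a sketch under assumptions: `fetch_resumable` and `size_matches` are hypothetical helper names, and the expected size of 558772480 bytes is taken from the `ls -l` listing shown in step 4.

```shell
#!/bin/sh
# Sketch: resumable download plus a size check.
# fetch_resumable and size_matches are hypothetical helper names.

# Download a URL into the current directory, resuming a partial file.
# -L follows redirects, -C - continues where a previous run stopped,
# -O keeps the remote file name.
fetch_resumable() {
    curl -L -C - -O "$1"
}

# Succeed when the file's byte count equals the expected size.
size_matches() {
    actual=$(wc -c < "$1" | tr -d ' ')
    [ "$actual" = "$2" ]
}
```

For example, `fetch_resumable https://hf-mirror.com/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-UD-Q4_K_XL.gguf && size_matches Qwen3.5-0.8B-UD-Q4_K_XL.gguf 558772480` would succeed only if the whole file arrived. A size check does not detect corrupted content; it only catches truncated downloads.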

3. Load the model with the llama-cli interactive tool and chat

~/model $ lama-cli -m Qwen3.5-0.8B-UD-Q4_K_XL.gguf --ctx-size 16384 -cnv
No command lama-cli found, did you mean:
 Command alass-cli in package alass
 Command ani-cli in package ani-cli
~/model $ llama-cli -m Qwen3.5-0.8B-UD-Q4_K_XL.gguf --ctx-size 16384 -cnv
load_backend: loaded CPU backend from /data/data/com.termux/files/usr/bin/../lib/libggml-cpu.so

Loading model...

[llama.cpp ASCII-art banner]

build      : b0-unknown
model      : Qwen3.5-0.8B-UD-Q4_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file

Solve the equation x^2-x-20=0 by factoring

We need to solve the equation:
$$x^2 - x - 20 = 0$$

1. Analyze the structure of the equation

This is a quadratic equation in one variable, which can be seen as the inverse of completing the square (or of the difference-of-squares formula).
Looking at the equation $x^2 - 2x + 9 - x = 0$, this does not look like the standard form, but we can try completing the square:
$$x^2 - 20 + x = 0$$
That is not right either, because the coefficients are not unified. Let us try applying the difference-of-squares formula $a^2 - b^2 = (a-b)(a+b)$ directly.

We need to write the equation in the form $(x - \dots)(x + \dots) = 0$.
Usually this kind of equation (where the linear term and the constant term are opposites) can be transformed by completing the square into:
$$(x - \frac{1}{2})^2 - (\frac{1}{2})^2 - 20 = 0$$
This would turn $x$ into a quartic equation, which is clearly not what we want.

Let us go back to factoring directly
The constant term is $-20$ and the coefficient of the linear term is $1$.
We need to find two numbers whose product is $-20$ and whose sum is $1$.
These two numbers are clearly $4$ and $-5$.

So we can rewrite the left-hand side of the equation as:
$$(x - 4)(x + 5) = x^2 - 4x + 5x - 20 = x^2 + x - 20$$

2. Verify and solve

Let us double-check that the transformation above is correct:
$$(x - 4)(x + 5) = x^2 + 5x - 4x - 20 = x^2 + x - 20$$
This matches the original equation exactly.

Therefore, the original equation can be factored as:
$$x^2 + x - 20 = 0$$

By the Zero Product Property, if the product of two factors is 0, then one of the factors must be 0.

So:
$$x - 4 = 0 \quad \text{or} \quad x + 5 = 0$$

Solving gives:
$$x_1 = 4$$
$$x_2 = -5$$

3. Conclusion

The solutions of the equation are:
$$x = 4 \text{ or } x = -5$$

[ Prompt: 45.1 t/s | Generation: 6.6 t/s ]

/exit
Exiting...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free   self   model   context   compute   unaccounted |
llama_memory_breakdown_print: |   - Host               |                1222 =   522 +     211 +     489               |

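Because the model is so small, its reasoning is weak. After rambling for a while it factored x^2 + x - 20 rather than the original x^2 - x - 20, so the roots it reports, 4 and -5, do not satisfy the equation that was asked (substituting x = 4 gives 16 - 4 - 20 = -8).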
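For reference, the original equation x^2 - x - 20 = 0 can be factored by hand as follows. This is a manual derivation added for comparison, not model output: the two numbers needed have product -20 and sum -1.

```latex
\begin{align*}
x^2 - x - 20 &= 0 \\
(x - 5)(x + 4) &= 0 \qquad \text{since } (-5)\cdot 4 = -20 \text{ and } -5 + 4 = -1 \\
x = 5 \quad &\text{or} \quad x = -4
\end{align*}
```

Substituting back confirms both roots: $25 - 5 - 20 = 0$ and $16 + 4 - 20 = 0$.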
4. Use llama-server's built-in web UI

~/model $ ls -l
total 546220
-rw------- 1 u0_a270 u0_a270 558772480 Mar 8 09:40 Qwen3.5-0.8B-UD-Q4_K_XL.gguf
~/model $ llama-server -m ./Qwen3.5-0.8B-UD-Q4_K_XL.gguf --jinja -c 0 --host 127.0.0.1 --port 8033
load_backend: loaded CPU backend from /data/data/com.termux/files/usr/bin/../lib/libggml-cpu.so
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 0 (unknown) with Clang 21.0.0 for Android aarch64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | REPACK = 1 |
Running without SSL
init: using 7 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model './Qwen3.5-0.8B-UD-Q4_K_XL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: no devices with dedicated memory found
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.85 seconds
llama_model_loader: loaded meta data with 46 key-value pairs and 320 tensors from ./Qwen3.5-0.8B-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
...
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: CPU_Mapped model buffer size = 522.43 MiB
...............................................................
llama_context: CPU output buffer size = 3.79 MiB
llama_kv_cache: CPU KV buffer size = 3072.00 MiB
llama_kv_cache: size = 3072.00 MiB (262144 cells, 6 layers, 4/1 seqs), K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_memory_recurrent: CPU RS buffer size = 77.06 MiB
llama_memory_recurrent: size = 77.06 MiB ( 4 cells, 24 layers, 4 seqs), R (f32): 5.06 MiB, S (f32): 72.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: CPU compute buffer size = 786.02 MiB
sched_reserve: graph nodes = 3123 (with bs=512), 1737 (with bs=1)
sched_reserve: graph splits = 1
sched_reserve: reserve took 37.35 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv load_model: initializing slots, n_slots = 4
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context
slot load_model: id 0 | task -1 | new slot, n_ctx = 262144
slot load_model: id 1 | task -1 | new slot, n_ctx = 262144
slot load_model: id 2 | task -1 | new slot, n_ctx = 262144
slot load_model: id 3 | task -1 | new slot, n_ctx = 262144
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>

</think>

'
srv init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://127.0.0.1:8033
main: starting the main loop...
srv update_slots: all slots are idle
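The 3072 MiB KV buffer in this log follows from the context size. With `-c 0` the context length is taken from the model, which the log reports as n_ctx = 262144. The sketch below reproduces the figure with integer arithmetic. `kv_cache_mib` is a hypothetical helper name, and the per-layer width of 512 elements is not printed in the log: it is inferred here from the reported 1536 MiB for K in f16, so treat it as an assumption.

```shell
#!/bin/sh
# Sketch: estimate the KV cache size in MiB.
# Usage: kv_cache_mib CELLS LAYERS WIDTH [BYTES_PER_ELEMENT]
# WIDTH (elements per layer per cell) is an inferred value, not a logged one.
kv_cache_mib() {
    cells=$1
    layers=$2
    width=$3
    bytes=${4:-2}   # f16 uses 2 bytes per element
    # The factor 2 covers the K half and the V half of the cache.
    echo $(( cells * layers * width * bytes * 2 / 1048576 ))
}
```

Under these assumptions `kv_cache_mib 262144 6 512` prints 3072, matching the log, while `kv_cache_mib 16384 6 512` prints 192. That is consistent with the much smaller context figure in the llama-cli run, which used `--ctx-size 16384`, and suggests that passing a smaller `-c` value to llama-server would reduce memory use considerably on a phone.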

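The system detects 8 CPU threads. Inference runs with n_threads = 8, and the HTTP server uses 7 threads. After printing a long list of parameters, the server waits for a browser to connect at http://127.0.0.1:8033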
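Besides the browser, the server can be queried from a second Termux session. The sketch below posts to `/v1/chat/completions`, the path that appears in the server's own request log. `chat_payload` and `ask` are hypothetical helper names, and the payload builder assumes the prompt contains no double quotes or backslashes, since it does no JSON escaping.

```shell
#!/bin/sh
# Sketch: query the local llama-server from the command line.
# chat_payload and ask are hypothetical helper names.

# Print a minimal JSON request body for a single user message.
# Assumes the prompt has no double quotes or backslashes (no escaping is done).
chat_payload() {
    printf '{"messages":[{"role":"user","content":"%s"}]}\n' "$1"
}

# Send the prompt to the server started above and print the raw JSON reply.
ask() {
    curl -s http://127.0.0.1:8033/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d "$(chat_payload "$1")"
}
```

For example, `ask "hello"` should print a JSON response once the server reports that it is listening. The host and port match the `--host` and `--port` values passed to llama-server.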

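Enter a question in the browser. Output is somewhat slower than on the command line: about 3 t/s, compared with 6.6 t/s in llama-cli.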


The server prints the following:

srv log_server_r: done request: GET / 127.0.0.1 200
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 23
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot init_sampler: id 3 | task 0 | init sampler, took 0.01 ms, tokens: text = 23, total = 23
slot update_slots: id 3 | task 0 | prompt processing done, n_tokens = 23, batch.n_tokens = 23
slot print_timing: id 3 | task 0 |
prompt eval time =   1447.31 ms /  23 tokens ( 62.93 ms per token, 15.89 tokens per second)
       eval time = 171453.86 ms / 569 tokens (301.32 ms per token,  3.32 tokens per second)
      total time = 172901.17 ms / 592 tokens
slot release: id 3 | task 0 | stop processing: n_tokens = 591, truncated = 0
srv update_slots: all slots are idle
^Csrv operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free   self   model   context   compute   unaccounted |
llama_memory_breakdown_print: |   - Host               |                4457 =   522 +    3149 +     786               |
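The throughput figures in the timing lines are simply tokens divided by elapsed seconds. A small sketch makes that easy to check or to reuse on other runs. `tokens_per_second` is a hypothetical helper name, and it assumes awk is available, as it normally is in Termux.

```shell
#!/bin/sh
# Sketch: compute tokens per second from a token count and a time in ms.
# Usage: tokens_per_second TOKENS MILLISECONDS
tokens_per_second() {
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", n / (ms / 1000) }'
}
```

With the values from the log, `tokens_per_second 569 171453.86` gives 3.32 for generation and `tokens_per_second 23 1447.31` gives 15.89 for prompt processing, which agrees with what llama-server printed.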
