基于 MAI-UI-8B 实现 Android UI 自动化：从元素定位到多步导航 | 极客日志

PythonAI算法

基于 MAI-UI-8B 实现 Android UI 自动化：从元素定位到多步导航

基于 MAI-UI-8B 多模态大模型实现 Android UI 自动化测试。通过 Docker 部署 vLLM 推理服务，封装 grounding_tool.py 和 navigation_tool.py 工具，实现图像元素定位及多步导航功能。解决显存不足、vLLM 兼容性、坐标解析等常见问题，提供完整代码示例与验证方法，适用于自动化测试工程师及 AI 应用开发者。

樱花落尽发布于 2026/3/15更新于 2026/5/2726 浏览

基于 MAI-UI-8B 实现 Android UI 自动化：从元素定位到多步导航

背景介绍

传统 UIAutomator 和 Appium 在处理复杂动态布局时存在局限，MAI-UI 项目基于 Qwen3VL-8B 多模态大模型实现了视觉化 UI 操作。

技术栈清单

组件	版本	说明
模型	MAI-UI-8B / MAI-UI-2B	阿里开源的移动端 UI 理解模型
推理引擎	vLLM >= 0.11.0	高性能推理框架
基础模型	Qwen3VL-8B	通义千问视觉语言模型
依赖库	transformers >= 4.57.0, Pillow, OpenAI SDK	-
容器镜像	qwenllm/qwenvl:qwen3vl-cu128	官方优化镜像（CUDA 12.8）
硬件需求	RTX 5090 32GB（或 8GB+ 显卡 + `--max-model-len 8192`）	-

项目核心原理

MAI-UI 的工作流程：

输入截图（PIL Image）+ 自然语言指令（如 "click settings"）
模型输出结构化响应：<thinking>思考过程</thinking><answer>{"action":"click","coordinate":[x,y]}</answer>
解析 JSON，提取规范化坐标（0-1 范围），转换为绝对像素坐标
在原图绘制标记点并保存

文字流程图：

用户指令 → 图像编码 (Base64) → vLLM 推理 → XML 解析 → 坐标归一化 → 绘制标记图

核心创新点： 不依赖 XML 树/控件 ID，纯视觉理解定位元素，对动态布局、游戏界面也能精准识别。

实战步骤

环境准备

拉取 Docker 镜像

# 使用官方优化的 Qwen3VL 镜像
docker pull nvidia/cuda:12.8-runtime-ubuntu22.04

安装以下环境：

Python 3.12 + PyTorch 2.8.0+cu128 + vLLM 0.11.0 + Transformers 5.0.0.dev0 + 其他深度学习库

下载模型文件

cd /data1/VLMs
# 下载 MAI-UI-8B（约 17GB）
# https://huggingface.co/Tongyi-MAI/MAI-UI-8B
# 或下载 MAI-UI-2B（约 5GB，显存不足时推荐）

相关免费在线工具

加密/解密文本
使用加密算法（如AES、TripleDES、Rabbit或RC4）加密和解密文本明文。在线工具，加密/解密文本在线工具，online
RSA密钥对生成器
生成新的随机RSA私钥和公钥pem证书。在线工具，RSA密钥对生成器在线工具，online
Mermaid 预览与可视化编辑
基于 Mermaid.js 实时预览流程图、时序图等图表，支持源码编辑与即时渲染。在线工具，Mermaid 预览与可视化编辑在线工具，online
随机西班牙地址生成器
随机生成西班牙地址（支持马德里、加泰罗尼亚、安达卢西亚、瓦伦西亚筛选），支持数量快捷选择、显示全部与下载。在线工具，随机西班牙地址生成器在线工具，online
Gemini 图片去水印
基于开源反向 Alpha 混合算法去除 Gemini/Nano Banana 图片水印，支持批量处理与下载。在线工具，Gemini 图片去水印在线工具，online
curl 转代码
解析常见 curl 参数并生成 fetch、axios、PHP curl 或 Python requests 示例代码。在线工具，curl 转代码在线工具，online

import os
os.environ["HF_TOKEN"]="hf_xxxxx" # 替换为你的 token
os.environ["HF_ENDPOINT"]="https://hf-mirror.com" # 国内镜像

docker run -d \
--name MAI-UI-cu128 \
--gpus '"device=4"' \
-p 40340:8000 \
-v /data1/VLMs/MAI-UI:/root \
qwenllm/qwenvl:qwen3vl-cu128 \
tail -f /dev/null

# 保持容器运行
# 进入容器启动 vLLM 服务
docker exec -it MAI-UI-cu128 bash
cd /root && python -m vllm.entrypoints.openai.api_server \
--model /root/MAI-UI-8B \
--served-model-name MAI-UI-8B \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--trust-remote-code

curl http://localhost:40340/v1/models
# 返回 {"data":[{"id":"MAI-UI-8B",...}]} 表示成功

#!/usr/bin/env python3
import sys
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont

# 添加源码路径
current_dir = Path(__file__).parent
if (current_dir / "src").exists():
    sys.path.insert(0, str(current_dir / "src"))

from mai_grounding_agent import MAIGroundingAgent
from utils import extract_click_coordinates

class UIGroundingTool:
    def __init__(self, llm_base_url="http://localhost:40340/v1", model_name="MAI-UI-8B"):
        """初始化定位工具"""
        self.agent = MAIGroundingAgent(
            llm_base_url=llm_base_url,
            model_name=model_name,
            runtime_conf={"temperature": 0.0, # 贪婪解码，确保结果稳定
                          "top_k": -1,
                          "top_p": 1.0,
                          "max_tokens": 2048},
        )

    def process(self, image_path: str, instruction: str, output_path: str = None) -> dict:
        """处理单张图像的元素定位
        Args:
            image_path: 输入图像路径
            instruction: 定位指令（如 "click the email icon"）
            output_path: 输出标记图路径（可选）
        Returns:
            包含坐标、预测结果的字典
        """
        image = Image.open(image_path)
        print(f"✓ 图像已加载：{image.size}")

        # 调用模型预测
        prediction, action = self.agent.predict(instruction, image)
        click_coords = extract_click_coordinates(action)
        if not click_coords:
            return {"success": False, "coordinates": None}

        # 转换为绝对坐标
        abs_coords = (int(click_coords[0] * image.width), int(click_coords[1] * image.height))

        # 绘制标记点
        marked_image = self._draw_marker(image, abs_coords)
        if output_path is None:
            output_path = str(Path(image_path).parent / f"{Path(image_path).stem}_marked.png")
        marked_image.save(output_path)

        return {"success": True, "coordinates": {"normalized": click_coords, "absolute": abs_coords}, "output_path": output_path}

    @staticmethod
    def _draw_marker(image: Image.Image, coords: tuple, radius: int = 15) -> Image.Image:
        """在图像上绘制圆形标记点"""
        marked = image.copy()
        draw = ImageDraw.Draw(marked)
        x, y = coords
        # 绘制圆形和十字
        draw.ellipse([x - radius, y - radius, x + radius, y + radius], outline="red", width=3)
        draw.line([(x - radius - 5, y), (x + radius + 5, y)], fill="red", width=2)
        draw.line([(x, y - radius - 5), (x, y + radius + 5)], fill="red", width=2)
        # 标注坐标
        draw.text((x + radius + 10, y - radius - 10), f"({x},{y})", fill="red")
        return marked

class UINavigationTool:
    def __init__(self, llm_base_url="http://localhost:40340/v1", model_name="MAI-UI-8B"):
        self.agent = MAIUINavigationAgent(
            llm_base_url=llm_base_url,
            model_name=model_name,
            runtime_conf={"history_n": 3, # 保留最近 3 步历史
                          "temperature": 0.0,
                          "max_tokens": 2048},
        )

    def process_sequence(self, image_paths: list, instruction: str, output_dir: str = None):
        """处理多步导航任务
        Args:
            image_paths: 按时间顺序排列的截图路径列表
            instruction: 导航指令（如 "open settings and turn on wifi"）
        """
        images = [Image.open(p) for p in image_paths]
        all_results = []
        for i, image in enumerate(images, 1):
            obs = {"screenshot": image}
            prediction, action = self.agent.predict(instruction, obs)
            # 提取坐标并绘制
            click_coords = extract_click_coordinates(action)
            if click_coords:
                abs_coords = (int(click_coords[0] * image.width), int(click_coords[1] * image.height))
                marked = self._draw_marker(image, abs_coords)
                marked.save(f"{output_dir}/step_{i:02d}_marked.png")
                all_results.append({"step": i, "coordinates": abs_coords})
        return {"success": True, "total_steps": len(images), "steps": all_results}

from grounding_tool import UIGroundingTool

tool = UIGroundingTool()
result = tool.process(
    image_path="/data1/VLMs/MAI-UI/MAI-UI/resources/example_img/figure1.png",
    instruction="click the email icon",
    output_path="grounding_result.png"
)
print(result)
# 输出:
# {
#  "success": True,
#  "coordinates": {
#   "normalized": (0.146, 0.534),
#   "absolute": (157, 1280)
#  },
#  "output_path": "grounding_result.png"
# }

from navigation_tool import UINavigationTool

tool = UINavigationTool()
result = tool.process_sequence(
    image_paths=["resources/example_img/figure1.png", # 主屏幕
                 "resources/example_img/figure2.png"], # 设置页面
    instruction="open the settings and turn on the wifi",
    output_dir="navigation_results/"
)
print(result)
# 输出:
# {
#  "success": True,
#  "total_steps": 2,
#  "steps": [
#   {"step": 1, "coordinates": (932, 1991)}, # 点击设置齿轮
#   {"step": 2, "coordinates": (406, 879)} # 点击网络设置
#  ]
# }

def parse_grounding_response(text: str) -> Dict[str, Any]:
    """解析模型输出的 XML 结构"""
    result = {"thinking": None, "coordinate": None}
    # 提取思考过程
    think_match = re.search(r"<grounding_think>(.*?)</grounding_think>", text, re.DOTALL)
    if think_match:
        result["thinking"] = think_match.group(1).strip()
    # 提取坐标 JSON
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if answer_match:
        answer_json = json.loads(answer_match.group(1).strip())
        coordinates = answer_json["coordinate"]
        # 关键：归一化坐标
        point_x = coordinates[0] / 999 # SCALE_FACTOR = 999
        point_y = coordinates[1] / 999
        result["coordinate"] = [point_x, point_y]
    return result

def _build_messages(self, instruction: str, image: Image.Image) -> list:
    """构建 OpenAI API 格式的消息列表"""
    # 图像转 Base64
    encoded_string = pil_to_base64(image)
    messages = [{"role": "system", "content": [{"type": "text", "text": self.system_prompt}]},
                {"role": "user", "content": [{"type": "text", "text": instruction + "\n"},
                                              {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}}]}]
    return messages

@property
def history_responses(self) -> List[str]:
    """生成历史响应列表"""
    history_responses = []
    for step in self.traj_memory.steps:
        # 将归一化坐标还原为 [0, 999] 范围
        action_json = copy.deepcopy(step.structured_action["action_json"])
        if "coordinate" in action_json:
            point_x, point_y = action_json["coordinate"]
            action_json["coordinate"] = [int(point_x * 999), # 还原为模型输出格式
                                         int(point_y * 999)]
        # 构造标准响应格式
        tool_call = {"name": "mobile_use", "arguments": action_json}
        response = f"<thinking>\n{step.thought}\n</thinking>\n<tool_call>\n{json.dumps(tool_call)}\n</tool_call>"
        history_responses.append(response)
    return history_responses

<grounding_think> Thought: The instruction "click the email icon" directs me to the app icon labeled "Mail" showing an envelope symbol in the second row, first column. </grounding_think>
<answer> {"coordinate":[146,533]} </answer>

步骤	截图	模型思考	动作坐标
1	主屏幕	'点击设置齿轮图标'	(932, 1991)
2	设置页面	'进入网络与互联网设置'	(406, 879)

navigation_results/
├── step_01_marked.png # 齿轮图标标记
└── step_02_marked.png # 网络设置标记

adb shell input tap 932 1991 # 第一步
adb shell input tap 406 879 # 第二步

ValueError: To serve at least one request with max_model_len (262144), 36.00 GiB KV cache is needed, but only 5.77 GiB available.

# 方案 1：降低 max_model_len（推荐）
python -m vllm.entrypoints.openai.api_server \
--model /root/MAI-UI-8B \
--max-model-len 8192 \
--trust-remote-code
# 降至 8K，显存需求降到 6GB

# 方案 2：使用 MAI-UI-2B（2B 参数版本，显存需求 3GB）
python -m vllm.entrypoints.openai.api_server \
--model /root/MAI-UI-2B \
--trust-remote-code

# 方案 3：提高 GPU 利用率
--gpu-memory-utilization 0.9 # 默认 0.8，提升到 0.9

ValueError: Qwen3VLForConditionalGeneration has no vLLM implementation

# 错误做法：使用通用镜像
docker pull test-agent-model:v1.4 # ✗

# 正确做法：使用官方 Qwen3VL 镜像
docker pull qwenllm/qwenvl:qwen3vl-cu128 # ✓

ImportError: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS...

# 在容器内重新安装匹配的 flash_attn
pip install flash-attn --no-build-isolation

We had to rate limit your IP (162.159.108.123)

import os
os.environ["HF_TOKEN"]="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"
os.environ["HF_ENDPOINT"]="https://hf-mirror.com"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="Tongyi-MAI/MAI-UI-8B",
    local_dir="./MAI-UI-8B",
    token=os.environ["HF_TOKEN"] # 添加认证
)

docker: Error response from daemon: no command specified

# 错误做法
docker run -d --name MAI-UI qwenllm/qwenvl:qwen3vl-cu128 # ✗

# 正确做法 1：使用 tail 保持容器运行
docker run -d --name MAI-UI qwenllm/qwenvl:qwen3vl-cu128 tail -f /dev/null

# 正确做法 2：直接启动 vLLM（推荐）
docker run -d --name MAI-UI \
qwenllm/qwenvl:qwen3vl-cu128 \
python -m vllm.entrypoints.openai.api_server --model /root/MAI-UI-8B ...

result = tool.process(...)
result["coordinates"] # None

prediction, action = agent.predict(...)
print(prediction) # 应包含 <answer>{"coordinate":[x,y]}</answer>

import re
answer_match = re.search(r"<answer>(.*?)</answer>", prediction, re.DOTALL)
if not answer_match:
    print("正则匹配失败！检查 XML 标签")

import json
json.loads(answer_match.group(1)) # 可能抛出 JSONDecodeError

import subprocess

def execute_action(action):
    if action["action"] == "click":
        x, y = action["coordinate"]
        subprocess.run(f"adb shell input tap {int(x)} {int(y)}", shell=True)
    elif action["action"] == "type":
        text = action["text"]
        subprocess.run(f"adb shell input text '{text}'", shell=True)

def robust_predict(tool, image_path, instruction, max_retries=3):
    for i in range(max_retries):
        result = tool.process(image_path, instruction)
        if result["success"]:
            return result
        time.sleep(1)
    return None

基于 MAI-UI-8B 实现 Android UI 自动化：从元素定位到多步导航

基于 MAI-UI-8B 实现 Android UI 自动化：从元素定位到多步导航

背景介绍

技术栈清单

项目核心原理

实战步骤

环境准备

拉取 Docker 镜像

下载模型文件

更多推荐文章

相关免费在线工具

启动推理容器

验证服务

核心工具封装

Grounding 工具（元素定位）

Navigation 工具（多步导航）

功能测试

测试元素定位

测试多步导航

核心代码解析

模型响应解析（parse_grounding_response）

消息构建（_build_messages）

历史上下文维护（Navigation 专属）

效果验证

Grounding 定位效果

Navigation 多步导航效果

踩坑记录与解决方案

显存不足导致服务启动失败

vLLM 不支持 Qwen3VLForConditionalGeneration

flash_attn 版本不兼容

HuggingFace 模型下载 IP 限流

Docker 容器无默认启动命令

坐标解析失败返回 None

总结与扩展方向

核心收获

扩展方向

参考资料

更多推荐文章

相关免费在线工具

基于 MAI-UI-8B 实现 Android UI 自动化：从元素定位到多步导航

基于 MAI-UI-8B 实现 Android UI 自动化：从元素定位到多步导航

背景介绍

技术栈清单

项目核心原理

实战步骤

环境准备

拉取 Docker 镜像

下载模型文件

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具

启动推理容器

验证服务

核心工具封装

Grounding 工具（元素定位）

Navigation 工具（多步导航）

功能测试

测试元素定位

测试多步导航

核心代码解析

模型响应解析（parse_grounding_response）

消息构建（_build_messages）

历史上下文维护（Navigation 专属）

效果验证

Grounding 定位效果

Navigation 多步导航效果

踩坑记录与解决方案

显存不足导致服务启动失败

vLLM 不支持 Qwen3VLForConditionalGeneration

flash_attn 版本不兼容

HuggingFace 模型下载 IP 限流

Docker 容器无默认启动命令

坐标解析失败返回 None

总结与扩展方向

核心收获

扩展方向

参考资料

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具