RoboBrain2.0 Embodied Brain Model Reproduction: Unified Perception, Reasoning, and Planning | 极客日志


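RoboBrain 2.0 is an embodied brain model that supports unified perception, reasoning, and planning, released in 3B, 7B, and 32B versions. This article covers environment setup (Conda, dependencies, Torch) and inference examples across several scenarios: image-text question answering (with and without thinking mode), object detection, affordance prediction, trajectory prediction, pointing prediction, and navigation.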

Author: 城市逃兵 | Published 2026/4/5 | Updated 2026/7/2454 views

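RoboBrain 2.0 is an embodied brain model for robots with unified perception, reasoning, and planning capabilities, built to handle complex embodied tasks in physical environments. It comes in several versions: lightweight 3B and 7B models and a full-size 32B model, each consisting of a vision encoder and a language model.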

Code: https://github.com/FlagOpen/RoboBrain2.0

Paper: RoboBrain 2.0 Technical Report

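Quick overview of the model

RoboBrain 2.0 supports interactive reasoning with long-horizon planning and closed-loop feedback, spatial perception for precisely predicting points and bounding boxes from complex instructions, temporal perception for estimating future trajectories, and scene reasoning through real-time construction and updating of structured memory.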


Model architecture: [figure]

1. Create the Conda environment

First create a Conda environment named robobrain2 with Python 3.10, then activate it:

conda create -n robobrain2 python=3.10
conda activate robobrain2

Clone the RoboBrain2.0 code locally:

git clone https://github.com/FlagOpen/RoboBrain2.0.git
cd RoboBrain2.0
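The inference examples in this article all start with `from inference import UnifiedInference`, which only resolves when Python can see the cloned repository. A minimal sketch for scripts that live elsewhere, assuming `inference.py` sits at the top level of the clone (the helper name `add_repo_to_path` is hypothetical, not part of the repository):

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir: str) -> Path:
    """Put the cloned repo on sys.path so `import inference` works from any directory."""
    repo = Path(repo_dir).expanduser().resolve()
    # Assumption: inference.py is located directly in the repository root.
    if not (repo / "inference.py").is_file():
        raise FileNotFoundError(f"inference.py not found in {repo}")
    if str(repo) not in sys.path:
        sys.path.insert(0, str(repo))
    return repo
```

Calling `add_repo_to_path("./RoboBrain2.0")` before the import has the same effect as running the script from inside the repository directory.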

2. Install the dependencies

Edit requirements.txt so that it contains the following:

# pip install -r requirements.txt
# Deep learning frameworks & training/inference acceleration
pytorch-lightning==1.9.5
transformers==4.50.0
tokenizers==0.21.0
huggingface-hub==0.27.1
safetensors==0.5.2
accelerate==1.3.0
deepspeed==0.15.0
peft==0.14.0
trl==0.9.6
flash-attn==2.5.9.post1
xformers==0.0.28.post3
triton==3.1.0
vllm==0.7.3
tensor-parallel==1.2.4
fairscale==0.4.13
diffusers==0.29.2
bitsandbytes==0.43.3
gguf==0.10.0
# Scientific computing & numerics / images / signals
numpy==1.26.4
scipy==1.15.1
scikit-learn==1.6.1
scikit-image==0.20.0
pandas==2.2.3
matplotlib==3.7.5
seaborn==0.13.2
Pillow==11.1.0
opencv-python==4.7.0.72
opencv-python-headless==4.11.0.86
av==14.4.0
imageio-ffmpeg==0.5.1
PyWavelets==1.4.1
numba==0.60.0
einops==0.8.0
einx==0.3.0
ml_dtypes==0.5.3
cupy-cuda12x==13.4.1
# Data & features / vectors / text
datasets==3.6.0
evaluate==0.4.2
sentence-transformers==3.4.1
FlagEmbedding==1.3.4
openai==1.60.0
tiktoken==0.7.0
sentencepiece==0.2.0
regex==2024.11.6
ftfy==6.2.0
chattts==0.2.1
qwen-vl-utils==0.0.8
# Distributed / parallel / cluster
ray==2.40.0
dask==2023.4.1
torch-ort==1.17.0
msccl==2.3.0
# Web services & API frameworks
fastapi==0.115.6
uvicorn==0.34.0
starlette==0.41.3
gradio==5.12.0
gradio_client==1.5.4
httpx==0.27.2
requests==2.32.3
pydantic==2.10.5
pydantic-settings==2.7.1
typer==0.15.1
# Async / concurrency / networking
aiohttp==3.11.11
anyio==4.8.0
websockets==14.4
tornado==6.4.1
async-timeout==4.0.3
# Config / logging / progress / serialization
omegaconf==2.3.0
tqdm
PyYAML
orjson==3.10.14
msgpack==1.1.0
lz4==4.4.4
xxhash==3.5.0
# Other common utilities
typing_extensions==4.12.2
packaging==24.2
filelock==3.16.1
psutil==7.0.0


Then install the dependencies:

pip install -r requirements.txt
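With this many pinned packages, it can help to confirm that the installed versions match the pins. The following is a minimal sketch using only the standard library; the function names `parse_pins` and `find_mismatches` are hypothetical, and the parser only understands plain `name==version` lines (unpinned entries such as `tqdm` are skipped).

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def parse_pins(text: str) -> Dict[str, str]:
    """Return {package: version} for every `name==version` line."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map package -> (pinned, installed) where they differ; installed is None if missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

A typical call would be `find_mismatches(parse_pins(open("requirements.txt").read()))`; an empty dictionary means every pinned package is installed at the pinned version.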

3. Install torch

Install the CUDA 12.1 builds of torch, torchvision, and torchaudio:

pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu121
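Before loading a multi-gigabyte model, a quick check that the installed torch build can see the GPU may save time. This is a small sketch, not part of the original walkthrough; `describe_torch` is a hypothetical helper name, and it reports a missing install instead of raising.

```python
import importlib.util


def describe_torch() -> dict:
    """Report whether torch is importable and, if so, its version and CUDA status."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch  # imported lazily so a missing install is reported, not raised

    return {
        "installed": True,
        "version": torch.__version__,                 # 2.5.0 builds are expected here
        "cuda_build": torch.version.cuda,             # CUDA version of the wheel, None for CPU builds
        "cuda_available": torch.cuda.is_available(),  # True only if a usable GPU is visible
    }


print(describe_torch())
```

If `cuda_available` is False on a GPU machine, the usual suspects are a CPU-only wheel or a driver that is older than the CUDA version the wheel was built for.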

4. Model inference

The first time a model is loaded, its weights are downloaded and the console prints progress similar to the following:

Trying to resume download...
model-00001-of-00004.safetensors: 100%|██████████| 4.97G/4.97G [01:30<00:00, 19.4MB/s]
model-00002-of-00004.safetensors: 100%|██████████| 4.99G/4.99G [00:40<00:00, 24.6MB/s]
model-00003-of-00004.safetensors: 100%|██████████| 4.93G/4.93G [01:57<00:00, 19.8MB/s]
Fetching 4 files: 100%|██████████| 4/4 [01:58<00:00, 29.70s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 3.59it/s]
generation_config.json: 100%|██████████| 214/214 [00:00<00:00, 1.05MB/s]
Some parameters are on the meta device because they were offloaded to the cpu.
preprocessor_config.json: 100%|██████████| 350/350 [00:00<00:00, 1.70MB/s]
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
tokenizer_config.json: 5.70kB [00:00, 16.0MB/s]
vocab.json: 2.78MB [00:37, 75.0kB/s]
merges.txt: 1.67MB [00:00, 3.62MB/s]
tokenizer.json: 7.03MB [00:00, 8.12MB/s]
chat_template.json: 1.05kB [00:00, 3.80MB/s]

Example 1: Image-text QA with RoboBrain2.0-7B, thinking mode off

from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.0-7B")
prompt = "What is shown in this image?"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"
# task="general" runs plain image-text QA; enable_thinking=False skips the reasoning trace
pred = model.inference(prompt, image, task="general", enable_thinking=False, do_sample=True)
print(f"Prediction:\n{pred}")

Example 2: Image-text QA with RoboBrain2.0-7B, thinking mode on

from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.0-7B")
prompt = "What is shown in this image?"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"
# Same request as Example 1, but enable_thinking=True also returns the reasoning trace
pred = model.inference(prompt, image, task="general", enable_thinking=True, do_sample=True)
print(f"Prediction:\n{pred}")
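The prediction logs later in this article show a dictionary with an 'answer' key and a 'thinking' key. A small sketch for separating the two before printing or storing them follows; `split_prediction` is a hypothetical helper, and it assumes 'thinking' may be missing or empty when thinking mode is off.

```python
from typing import Tuple


def split_prediction(pred: dict) -> Tuple[str, str]:
    """Return (answer, thinking) as stripped strings; thinking is '' when absent."""
    answer = str(pred.get("answer", "")).strip()
    # Assumption: with thinking disabled the key may be missing, None, or empty.
    thinking = str(pred.get("thinking") or "").strip()
    return answer, thinking


# Illustrative dictionary, not real model output
answer, thinking = split_prediction({"answer": " example answer ", "thinking": None})
print(answer)
print(thinking or "(no reasoning trace)")
```

Keeping the reasoning trace separate makes it easy to log it for debugging while passing only the answer to downstream code.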

Example 3: Image-text QA with RoboBrain2.0-3B

from inference import UnifiedInference

# Only the model ID changes for the lightweight 3B model
model = UnifiedInference("BAAI/RoboBrain2.0-3B")
prompt = "What is shown in this image?"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"
pred = model.inference(prompt, image, task="general", do_sample=True)
print(f"Prediction:\n{pred}")

Example 4: Visual grounding (object detection)

from inference import UnifiedInference
import datetime
import os

import cv2

model = UnifiedInference("BAAI/RoboBrain2.0-7B")
prompt = "the person wearing a red hat"
image_path = "./assets/demo/grounding.jpg"

# 1. Run inference; plot=True automatically writes an image with the box drawn to result/
pred = model.inference(
    prompt, image_path, task="grounding", plot=True, enable_thinking=True, do_sample=True
)
print(f"Prediction:\n{pred}")

# 2. Save only the visualization, under a directory and file name of your choice
save_dir = "vis_results"
os.makedirs(save_dir, exist_ok=True)
vis_img = cv2.imread("result/grounding_with_grounding_annotated.jpg")
if vis_img is None:  # cv2.imread returns None instead of raising when the file is missing
    raise FileNotFoundError("annotated image not found under result/")
out_path = os.path.join(save_dir, f"redhat_det_{datetime.datetime.now().strftime('%H%M%S')}.jpg")
cv2.imwrite(out_path, vis_img)
print("Visualization saved to:", os.path.abspath(out_path))

Example 5: Affordance prediction (embodied cognition)

from inference import UnifiedInference
import datetime
import os

import cv2

model = UnifiedInference("BAAI/RoboBrain2.0-7B")
prompt = "如何抓住杯子"  # "How to grasp the cup"
image = "./assets/demo/affordance.jpg"

# Run inference; plot=True automatically saves the visualization to
# result/affordance_with_affordance_annotated.jpg
pred = model.inference(
    prompt, image, task="affordance", plot=True, enable_thinking=True, do_sample=True
)
print(f"Prediction:\n{pred}")

# ----------- Save only the visualization -----------
save_dir = "vis_results"
os.makedirs(save_dir, exist_ok=True)
vis_img = cv2.imread("result/affordance_with_affordance_annotated.jpg")
if vis_img is None:  # cv2.imread returns None instead of raising when the file is missing
    raise FileNotFoundError("annotated image not found under result/")
out_path = os.path.join(save_dir, f"affordance_{datetime.datetime.now().strftime('%H%M%S')}.jpg")
cv2.imwrite(out_path, vis_img)
print("Visualization saved to:", os.path.abspath(out_path))
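The grounding and affordance examples repeat the same "copy the annotated image somewhere else" step. One possible way to factor it out is sketched below; `save_annotated` is a hypothetical helper, and unlike the original code it copies the file byte for byte with `shutil` instead of re-encoding it through OpenCV. It also puts the date in the file name, since a time-only stamp can collide across days.

```python
import shutil
from datetime import datetime
from pathlib import Path


def save_annotated(src: str, save_dir: str, prefix: str) -> Path:
    """Copy an annotated result image into save_dir under a timestamped name."""
    src_path = Path(src)
    if not src_path.is_file():
        raise FileNotFoundError(f"annotated image not found: {src_path}")
    out_dir = Path(save_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # date + time avoids cross-day collisions
    out_path = out_dir / f"{prefix}_{stamp}{src_path.suffix}"
    shutil.copy2(src_path, out_path)
    return out_path.resolve()
```

For the affordance example this would be called as `save_annotated("result/affordance_with_affordance_annotated.jpg", "vis_results", "affordance")`.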

Example 6: Trajectory prediction (embodied)

from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.0-7B")
prompt = "伸手去拿红色的瓶子"  # "Reach for the red bottle"
image = "./assets/demo/trajectory.jpg"

# Run inference; plot=True automatically saves the visualization to
# result/trajectory_with_trajectory_annotated.jpg
pred = model.inference(
    prompt, image, task="trajectory", plot=True, enable_thinking=True, do_sample=True
)
print(f"Prediction:\n{pred}")

==================== INPUT ====================
You are a robot using the joint control. The task is "伸手去拿红色的瓶子". Please predict up to 10 key trajectory points to complete the task. Your answer should be formatted as a list of tuples, i.e. [[x1, y1], [x2, y2], ...], where each tuple contains the x and y coordinates of a point.

Thinking enabled.
Running inference ...
Plotting enabled. Drawing results on the image ...
Extracted trajectory points: [[(145, 123), (259, 168), (327, 136)]]
Annotated image saved to: result/trajectory_with_trajectory_annotated.jpg
Prediction:
{'answer': '[(145, 123), (259, 168), (327, 136)]', 'thinking': "From the visual input, the target object, a red bottle, is clearly identified resting upright on a black tray positioned towards the right side of the scene. My current end-effector position is to the left of the scene, near several other objects like cups and plates that do not obstruct the path directly. Notably, there's ample space between my initial position and the red bottle, allowing for an unobstructed path.\n\nMy joint control system enables me to generate smooth trajectories. I will plan a sequence of movements starting from my current location, moving towards the red bottle. Given the open field, I can calculate intermediate points ensuring a direct approach without collision with nearby objects. Up to 10 key points are suggested but fewer may suffice given the straightforward nature of this task.\n\nThe task involves reaching and grasping the red bottle. The trajectory must be efficient and direct, originating from the initial position on the left and terminating precisely at the bottle. Each segment of the motion should ensure clearance from surrounding objects, minimizing potential interaction. This requires careful placement of key waypoints to guide the motion smoothly.\n\nVerification confirms the path's logical progression. The sequence of points is checked to ensure they maintain a safe distance from obstacles while advancing toward the red bottle. The final point must accurately target the bottle's handle or body for a successful grasp.\n\nTherefore, based on this comprehensive visual analysis and planning, the key trajectory points to reach the red bottle are determined as [(145, 123), (259, 168), (327, 136)]. These points form a viable path from the initial position to the target while respecting the environment."}
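In the log above the trajectory arrives as a string in the 'answer' field, so downstream code has to parse it before using the waypoints. The following is a minimal sketch using the answer string from that log; `parse_points` and `path_length` are hypothetical helpers, and the length is measured in whatever coordinate units the model returned.

```python
import ast
import math
from typing import List, Tuple


def parse_points(answer: str) -> List[Tuple[int, int]]:
    """Parse '[(x1, y1), ...]' or '[[x1, y1], ...]' into a list of integer pairs."""
    raw = ast.literal_eval(answer)  # safe literal parsing, no code execution
    return [(int(x), int(y)) for x, y in raw]


def path_length(points: List[Tuple[int, int]]) -> float:
    """Sum of straight-line distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


answer = "[(145, 123), (259, 168), (327, 136)]"  # copied from the log above
points = parse_points(answer)
print(points)
print(round(path_length(points), 2))
```

`ast.literal_eval` is used instead of `eval` because the string comes from model output and should never be executed as code.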

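Example 7: Pointing prediction (embodied cognition)

from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.0-7B")
# "In the gap between the two cups, find several positions where a cup can be placed"
prompt = "在两个杯子之间的空隙中，找出几个可以放置杯子的位置"
image = "./assets/demo/pointing.jpg"

# plot=True draws the predicted points on the image and saves it under result/
pred = model.inference(
    prompt, image, task="pointing", plot=True, enable_thinking=True, do_sample=True
)
print(f"Prediction:\n{pred}")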

==================== INPUT ====================
在两个杯子之间的空隙中，找出几个可以放置杯子的位置. Your answer should be formatted as a list of tuples, i.e. [(x1, y1), (x2, y2), ...], where each tuple contains the x and y coordinates of a point satisfying the conditions above. The coordinates should indicate the normalized pixel locations of the points in the image.

Thinking enabled.
Running inference ...
Plotting enabled. Drawing results on the image ...
Extracted points:[(334, 320), (305, 316), (369, 315), (389, 317), (314, 313), (350, 314), (378, 313), (325, 317)]
Annotated image saved to: result/pointing_with_pointing_annotated.jpg
Prediction:
{'answer': '[(334, 320), (305, 316), (369, 315), (389, 317), (314, 313), (350, 314), (378, 313), (325, 317)]', 'thinking': "From the visual input, two cups are clearly placed on a horizontal surface, with one blue cup on the left and another green cup on the right. The empty space between them is crucial for identifying potential placement points. This area lies flat on the surface and offers an unobstructed path, making it an ideal candidate for placing additional cups.\n\nMy advanced visual processing capability allows me to segment this gap visually, ensuring that any identified points will lie entirely within this region, avoiding edges or other objects nearby. It is essential to confirm that there are no physical obstructions like shadows or reflections, which might affect perceived space availability.\n\nThe task requires determining several spots within this vacant area. I start by verifying its dimensions, confirming the gap's width and height. Next, I select multiple points, spread evenly across the free space, to ensure diversity and coverage across the entire visible region, simulating different spots to place an additional cup effectively.\n\nEach identified point undergoes verification to ensure it falls strictly within the boundaries of the open space between the cups. Distinctness among points is also confirmed, avoiding overlap and maintaining separation.\n\nTherefore, through direct visual analysis, combined with my cognitive capabilities, the identified points within the vacant space are [(334, 320), (305, 316), (369, 315), (389, 317), (314, 313), (350, 314), (378, 313), (325, 317)]. These points fulfill all requirements as they are located centrally within the available space, ensuring effective placement."}
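The pointing task returns several candidate points, while a robot ultimately needs one target. One simple way to choose, sketched here as an illustration and not as anything the model or repository prescribes, is to take the candidate closest to the centroid of all candidates. `pick_central_point` is a hypothetical helper, and it works in whatever coordinate space the answer uses.

```python
import ast
from statistics import mean
from typing import Tuple


def pick_central_point(answer: str) -> Tuple[int, int]:
    """Return the candidate point nearest to the centroid of all candidates."""
    points = [(int(x), int(y)) for x, y in ast.literal_eval(answer)]
    if not points:
        raise ValueError("no points in answer")
    cx = mean(x for x, _ in points)
    cy = mean(y for _, y in points)
    # Squared distance is enough for ranking, so no square root is needed
    return min(points, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)


answer = (
    "[(334, 320), (305, 316), (369, 315), (389, 317), "
    "(314, 313), (350, 314), (378, 313), (325, 317)]"
)  # copied from the log above
print(pick_central_point(answer))
```

The centroid rule favors the middle of the free gap, which leaves the most clearance on both sides; other selection rules (for example, the point farthest from detected obstacles) would need information the answer string does not contain.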

Example 8: Embodied navigation

from inference import UnifiedInference

model = UnifiedInference("BAAI/RoboBrain2.0-7B")
prompt = "来到沙发这里，我需要休息了"  # "Come over to the sofa, I need to rest"
image = "./assets/demo/navigation.jpg"

# The navigation request is issued through the pointing task
pred = model.inference(
    prompt, image, task="pointing", plot=True, enable_thinking=True, do_sample=True
)
print(f"Prediction:\n{pred}")


3. Install torch

4. Model inference

Example 1: Image-text QA with RoboBrain2.0-7B, thinking mode off

==================== INPUT ====================
What is shown in this image?

Example 2: Image-text QA with RoboBrain2.0-7B, thinking mode on

==================== INPUT ====================
What is shown in this image?

Example 3: Image-text QA with RoboBrain2.0-3B

==================== INPUT ====================
What is shown in this image?

Example 4: Visual grounding (object detection)

==================== INPUT ====================
Please provide the bounding box coordinate of the region this sentence describes: the person wearing a red hat.

==================== INPUT ====================
Please provide the bounding box coordinate of the region this sentence describes: 找到一个香蕉.

Example 5: Affordance prediction (embodied cognition)

==================== INPUT ====================
You are a robot using the joint control. The task is "如何抓住杯子". Please predict a possible affordance area of the end effector.

Example 6: Trajectory prediction (embodied)

Example 7: Pointing prediction (embodied cognition)

Example 8: Embodied navigation
