StreamVLN Embodied Navigation: Reproduction and Model Inference Guide

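This guide walks through reproducing the StreamVLN model: setting up the Conda environment, installing the Habitat simulator, and preparing the dependencies and datasets. It describes how to download the two sets of model weights (benchmark reproduction and real-world deployment) and gives the evaluation commands for multi-GPU and single-GPU inference. It also shows the logic of a modified evaluation script and the training command, and is aimed at research on and deployment of vision-and-language navigation.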

AiEngineer, published 2026/4/6, updated 2026/7

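StreamVLN Embodied Navigation Reproduction

StreamVLN works as an online, multi-turn dialogue: it takes a continuous video stream as input and outputs a sequence of actions. It combines the language instruction, the visual observations, and spatial pose information to drive the model to generate navigation actions (move forward, turn left, turn right, stop).

Paper: StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

Code: https://github.com/OpenRobotLab/StreamVLN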

1. Create the Conda environment

Create a Conda environment named streamvln with Python 3.9, then activate it.

conda create -n streamvln python=3.9
conda activate streamvln
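Once the environment is active, a quick check can confirm that the interpreter is the 3.9 one requested above. The following is a minimal sketch; the helper name python_matches is hypothetical and is not part of StreamVLN.

```python
import sys


def python_matches(required, actual=None):
    """Return True if the (major, minor) version equals `required`.

    `actual` defaults to the running interpreter's version; it is a
    parameter only so the check can be exercised with other values.
    """
    if actual is None:
        actual = sys.version_info[:2]
    return tuple(actual[:2]) == tuple(required)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_matches((3, 9)):
        print(f"Python {current}: OK")
    else:
        print(f"Python {current}: expected 3.9, check that streamvln is activated")
```

Running it inside and outside the streamvln environment is a simple way to see whether activation took effect.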

2. Install the Habitat simulator

Install habitat-sim first:

conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat

Then install habitat-lab:

git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab
pip install -e habitat-baselines
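After these installs, it can help to confirm that the three packages are visible to the interpreter before moving on. The sketch below only checks that the modules can be located; it does not import them or verify their versions. The module names habitat_sim, habitat, and habitat_baselines are the ones the evaluation script imports; the helper missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # habitat_sim comes from the conda package; habitat and habitat_baselines
    # come from the two editable pip installs.
    missing = missing_modules(["habitat_sim", "habitat", "habitat_baselines"])
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("All Habitat modules found")
```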

3. Install third-party dependencies

Get the StreamVLN code:

git clone https://github.com/OpenRobotLab/StreamVLN.git
cd StreamVLN

Install the remaining dependencies:

pip install -r requirements.txt

Patch: as of the 2025/7/23 patch, protobuf==3.20.1 must also be installed.

pip install protobuf==3.20.1
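Because requirements.txt is installed first, the pinned protobuf version may be worth confirming afterwards. This is a minimal sketch using the standard-library importlib.metadata; the helpers installed_version and pin_status are hypothetical names introduced for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def pin_status(dist_name, pinned):
    """Describe whether the installed version matches the pinned one."""
    found = installed_version(dist_name)
    if found is None:
        return f"{dist_name}: not installed (expected {pinned})"
    if found != pinned:
        return f"{dist_name}: {found} installed, expected {pinned}"
    return f"{dist_name}: {found} OK"


if __name__ == "__main__":
    print(pin_status("protobuf", "3.20.1"))
```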

4. Prepare the datasets

Three types of data are needed. Create a data folder to hold them.

1) Matterport3D (MP3D) Scenes

Quick download link: https://cloud.tsinghua.edu.cn/f/03e0ca1430a344efa72b/?dl=1

Each scene folder contains one .glb file.
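The one-.glb-per-folder layout can be checked after extraction. The sketch below is an illustration only: the helper scenes_without_single_glb is hypothetical, and the path data/scene_datasets/mp3d is an assumed location for the scenes that should be adjusted to wherever the archive was unpacked.

```python
from pathlib import Path


def scenes_without_single_glb(root):
    """Return the names of scene folders under `root` that do not hold exactly one .glb file."""
    bad = []
    for scene_dir in sorted(Path(root).iterdir()):
        if not scene_dir.is_dir():
            continue  # ignore stray files next to the scene folders
        glb_files = list(scene_dir.glob("*.glb"))
        if len(glb_files) != 1:
            bad.append(scene_dir.name)
    return bad


if __name__ == "__main__":
    # Assumed location; not stated in the text above.
    root = Path("data/scene_datasets/mp3d")
    if root.is_dir():
        bad = scenes_without_single_glb(root)
        print("Scene folders to re-check:", bad if bad else "none")
    else:
        print(f"{root} does not exist; adjust the path")
```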

2) VLN-CE Episodes

Download the VLN-CE episodes, then rename them:

r2r (rename -> )

Modified evaluation script

# System libraries; put the repository root on the import path
import sys
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

# Regular expressions, progress bars, PyTorch, and other utilities
import re
import tqdm
import torch
import copy
import json
import random
import argparse
import itertools
import quaternion
import transformers
import numpy as np

# Type hints, config utilities, image processing
from typing import Any
from omegaconf import OmegaConf
from PIL import Image, ImageFile, ImageDraw, ImageFont
from collections import OrderedDict
from torch.nn.utils.rnn import pad_sequence

# Depth image filtering
from depth_camera_filtering import filter_depth
from transformers.image_utils import to_numpy_array

# Habitat environment
import habitat
from habitat import logger, Env
from habitat_extensions import measures
from habitat.config.default import get_agent_config
from habitat_baselines.config.default import get_config as get_habitat_config
from habitat.config.default_structured_configs import (
    CollisionsMeasurementConfig,
    FogOfWarConfig,
    TopDownMapMeasurementConfig,
)
from habitat.utils.visualizations import maps
from habitat.utils.visualizations.utils import images_to_video, observations_to_image

# Project model and helper functions
from model.stream_video_vln import StreamVLNForCausalLM
from utils.utils import dict_to_cuda
from utils.dist import *
from utils.utils import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX, DEFAULT_MEMORY_TOKEN, MEMORY_TOKEN_INDEX


class VLNEvaluator:
    """Vision-and-language navigation (VLN) evaluator: measures a model's navigation performance in Habitat."""

    def __init__(
        self,
        config_path: str,
        split: str = "val_seen",
        env_num: int = 8,
        output_path: str = None,
        model: Any = None,
        tokenizer: Any = None,
        epoch: int = 0,
        args: argparse.Namespace = None,
    ):
        self.args = args
        self.device = torch.device('cuda')
        self.split = split
        self.env_num = env_num
        self.save_video = args.save_video
        self.output_path = output_path
        self.epoch = epoch
        self.config_path = config_path
        self.config = get_habitat_config(config_path)
        self.agent_config = get_agent_config(self.config.habitat.simulator)
        self.sim_sensors_config = self.config.habitat.simulator.agents.main_agent.sim_sensors

        # Select the split and add the top-down map and collision measurements
        with habitat.config.read_write(self.config):
            self.config.habitat.dataset.split = self.split
            self.config.habitat.task.measurements.update({
                "top_down_map": TopDownMapMeasurementConfig(
                    map_padding=3,
                    map_resolution=1024,
                    draw_source=True,
                    draw_border=True,
                    draw_shortest_path=True,
                    draw_view_points=True,
                    draw_goal_positions=True,
                    draw_goal_aabbs=True,
                    fog_of_war=FogOfWarConfig(draw=True, visibility_dist=5.0, fov=90),
                ),
                "collisions": CollisionsMeasurementConfig(),
            })

        print(f"config type = {type(self.config)}")
        print(OmegaConf.to_yaml(self.config))

        # Camera parameters taken from the simulator sensor config
        self._camera_height = self.sim_sensors_config.rgb_sensor.position[1]
        self._min_depth = self.sim_sensors_config.depth_sensor.min_depth
        self._max_depth = self.sim_sensors_config.depth_sensor.max_depth
        camera_fov_rad = np.deg2rad(self.sim_sensors_config.depth_sensor.hfov)
        self._camera_fov = camera_fov_rad
        # Pinhole focal length in pixels: width / (2 * tan(hfov / 2))
        self._fx = self._fy = self.sim_sensors_config.depth_sensor.width / (2 * np.tan(camera_fov_rad / 2))

        self.image_processor = model.get_vision_tower().image_processor
        self.model = model
        self.tokenizer = tokenizer

        prompt = f"<video>\nYou are an autonomous navigation assistant. Your task is to <instruction>. Devise an action sequence to follow the instruction using the four actions: TURN LEFT (←) or TURN RIGHT (→) by 15 degrees, MOVE FORWARD (↑) by 25 centimeters, or STOP."
        self.conversation = [{"from": "human", "value": prompt}, {"from": "gpt", "value": "answer"}]
        self.actions2idx = OrderedDict({'STOP': [0], "↑": [1], "←": [2], "→": [3]})
        self.conjunctions = [
            'you can see ',
            'in front of you is ',
            'there is ',
            'you can spot ',
            'you are toward the ',
            'ahead of you is ',
            'in your sight is ',
        ]
        self.num_frames = args.num_frames
        self.num_future_steps = args.num_future_steps
        self.num_history = args.num_history

    # ... (some helper functions are omitted here for brevity; keep the full code in actual use)
    # Omitted: preprocess_depth_image, get_intrinsic_matrix, preprocess_instrinsic,
    # get_axis_align_matrix, xyz_yaw_to_tf_matrix, config_env, eval_action, parse_actions,
    # preprocess_qwen, pad_tensors, etc.

    def eval_action(self, idx) -> None:
        # Core evaluation logic (body omitted here)
        pass


def eval():
    global local_rank
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=0, type=int, help="local process rank")
    parser.add_argument("--model_path", type=str, help="path to the model")
    parser.add_argument("--habitat_config_path", type=str, default='config/vln_r2r.yaml', help="path to the Habitat config file")
    parser.add_argument("--eval_split", type=str, default='val_unseen', help="dataset split to evaluate")
    parser.add_argument("--output_path", type=str, default='./results/val_unseen/streamvln', help="output path for results")
    parser.add_argument("--num_future_steps", type=int, default=4, help="number of future steps")
    parser.add_argument("--num_frames", type=int, default=32, help="number of frames per batch")
    parser.add_argument("--save_video", default=True, help="whether to save navigation videos")
    parser.add_argument("--num_history", type=int, default=8, help="number of history frames")
    parser.add_argument("--model_max_length", type=int, default=4096, help="maximum model sequence length")
    parser.add_argument('--world_size', default=1, type=int, help='number of distributed processes')
    parser.add_argument('--rank', default=0, type=int, help='process rank')
    parser.add_argument('--gpu', default=0, type=int, help='GPU device ID')
    parser.add_argument('--port', default='1111', help='port for distributed communication')
    parser.add_argument('--dist_url', default='env://', help='URL for distributed communication')
    parser.add_argument('--device', default='cuda', help='device type')
    args = parser.parse_args()

    init_distributed_mode(args)
    local_rank = args.local_rank

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        args.model_path, model_max_length=args.model_max_length, padding_side="right"
    )
    config = transformers.AutoConfig.from_pretrained(args.model_path)
    model = StreamVLNForCausalLM.from_pretrained(
        args.model_path,
        attn_implementation="eager",
        torch_dtype=torch.bfloat16,
        config=config,
        low_cpu_mem_usage=False,
    )
    model.model.num_history = args.num_history
    model.requires_grad_(False)
    model.to(local_rank)
    evaluate(model, tokenizer, args)


def evaluate(model, tokenizer, args):
    model.eval()
    world_size = get_world_size()
    model.reset(world_size)
    evaluator = VLNEvaluator(
        config_path=args.habitat_config_path,
        split=args.eval_split,
        env_num=world_size,
        output_path=args.output_path,
        model=model,
        tokenizer=tokenizer,
        epoch=0,
        args=args,
    )
    sucs, spls, oss, ones, ep_num = evaluator.eval_action(get_rank())

    # Distributed aggregation logic... (omitted; it gathers the per-rank results
    # into sucs_all, spls_all, oss_all, and ones_all)

    result_all = {
        "mean success rate": (sum(sucs_all) / len(sucs_all)).item(),
        "mean SPL": (sum(spls_all) / len(spls_all)).item(),
        "mean oracle success rate": (sum(oss_all) / len(oss_all)).item(),
        "mean distance to goal": (sum(ones_all) / len(ones_all)).item(),
        "total episodes": len(sucs_all),
    }
    print(result_all)
    if get_rank() == 0:
        with open(os.path.join(args.output_path, 'result.json'), 'a') as f:
            f.write(json.dumps(result_all))


if __name__ == "__main__":
    eval()
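The script lists parse_actions among the omitted helpers. To illustrate how the actions2idx table could turn the model's text output into action indices, here is a minimal standalone sketch. It is an assumption about how such a helper might work, not the repository's implementation; only the symbol-to-index table is taken from the script.

```python
import itertools
import re
from collections import OrderedDict

# Same symbol-to-index table as in the evaluator's __init__.
ACTIONS2IDX = OrderedDict({"STOP": [0], "↑": [1], "←": [2], "→": [3]})


def parse_actions(output, actions2idx=ACTIONS2IDX):
    """Extract action indices from generated text, in order of appearance.

    Hypothetical helper: scans `output` for any known action symbol and
    flattens the matching index lists into a single list.
    """
    pattern = re.compile("|".join(re.escape(symbol) for symbol in actions2idx))
    matches = pattern.findall(output)
    return list(itertools.chain.from_iterable(actions2idx[m] for m in matches))


if __name__ == "__main__":
    print(parse_actions("↑↑←STOP"))  # [1, 1, 2, 0]
```

Text that contains no action symbol yields an empty list, so a caller would need its own fallback for that case.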
