Reproducing StreamVLN Embodied Navigation: Streaming Vision-and-Language Navigation in Practice | 极客日志


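StreamVLN takes a continuous video stream as input through an online multi-turn dialogue and outputs action sequences, combining language instructions, visual observations, and spatial pose information to drive the model's navigation actions. This guide records the full reproduction workflow, from environment setup and dataset preparation to model inference and evaluation. It covers Conda configuration, Habitat simulator installation, dependency handling, and weight deployment, and gives the concrete commands for multi-GPU and single-GPU inference along with log analysis. It is aimed at developers who want to get an embodied navigation task running.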

CryptoLab · Published 2026/4/5 · Updated 2026/7/2536 views

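StreamVLN Embodied Navigation Reproduction Guide

StreamVLN works as an online, multi-turn dialogue: it takes a continuous video stream as input and outputs a sequence of actions. It combines the language instruction, visual observations, and spatial pose information to drive the model to generate navigation actions (move forward, turn left, turn right, stop).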
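To make the action vocabulary concrete, here is a minimal sketch of how a decoded model reply could be turned into discrete action indices. The arrow symbols, the `ACTION_TO_INDEX` table, and both helper functions are illustrative assumptions, not the repository's implementation. Only the four action types and the fallback to index 0 when nothing can be parsed are taken from this guide.

```python
import re
from typing import List

# Hypothetical mapping from textual action tokens to discrete action indices.
# The symbols and the index order are assumptions made for illustration.
ACTION_TO_INDEX = {"STOP": 0, "↑": 1, "←": 2, "→": 3}

# Match any known token; longer tokens come first so the alternation is unambiguous.
_ACTION_PATTERN = re.compile(
    "|".join(re.escape(tok) for tok in sorted(ACTION_TO_INDEX, key=len, reverse=True))
)


def parse_actions(text: str) -> List[int]:
    """Return the action indices found in `text`, in order of appearance."""
    return [ACTION_TO_INDEX[m.group(0)] for m in _ACTION_PATTERN.finditer(text)]


def next_actions(text: str) -> List[int]:
    """Parse a reply and fall back to a single index-0 action if nothing is found."""
    actions = parse_actions(text)
    return actions if actions else [ACTION_TO_INDEX["STOP"]]
```

Keeping the fallback in a separate function leaves `parse_actions` a pure parser, so an empty reply can be told apart from an explicit stop when debugging.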

Project page: StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Code repository: OpenRobotLab/StreamVLN

1. Environment Setup

First create a Conda environment named streamvln. Python 3.9 is recommended.

conda create -n streamvln python=3.9
conda activate streamvln

Installing the Habitat simulator

Install habitat-sim first:

conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat

Then install habitat-lab:

git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab
pip install -e habitat-baselines

2. Dependencies and Patch

Clone the StreamVLN code and install its dependencies:

git clone https://github.com/OpenRobotLab/StreamVLN.git
cd StreamVLN
pip install -r requirements.txt

Note: following the 2025/7/23 patch, the protobuf version must be pinned:

pip install protobuf==3.20.1
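After pinning, it can help to confirm which version actually ended up in the environment. The sketch below uses the standard-library `importlib.metadata`. The helper names `installed_version` and `check_pins` are hypothetical, and the pin table only lists the protobuf pin from this guide. Packages installed through conda may not always be visible under the same distribution name, so a "missing" result is a hint, not proof.

```python
from importlib import metadata
from typing import Dict, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins: Dict[str, str]) -> Dict[str, str]:
    """Compare installed versions against the wanted pins and report a status per package."""
    report = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        if found is None:
            report[name] = "missing"
        elif found == wanted:
            report[name] = "ok"
        else:
            report[name] = f"mismatch (found {found}, wanted {wanted})"
    return report


if __name__ == "__main__":
    # Only the pin mentioned in this guide; extend the dict as needed.
    for name, status in check_pins({"protobuf": "3.20.1"}).items():
        print(f"{name}: {status}")
```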

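3. Dataset Preparation

Three types of data are needed. Create a data folder in the repository root to hold them.

1) Matterport3D (MP3D) Scenes

Each scene folder contains one .glb file. The dataset is large, so download it from the official source or a reliable mirror and place the scenes under data/scene_datasets/mp3d/.

2) VLN-CE Episodes

Download the VLN-CE episodes, rename the folders as listed, and extract them into data/datasets/: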

r2r (rename R2R_VLNCE_v1/ -> r2r/)
rxr (rename RxR_VLNCE_v0/ -> rxr/)
envdrop (rename R2R_VLNCE_v1-3_preprocessed/envdrop/ -> envdrop/)

3) Collected Trajectory Data

The collected trajectory data goes under data/trajectory_data/. The data folder should end up with the following structure:

data/
├── datasets/
│   ├── r2r/
│   │   ├── train/
│   │   ├── val_seen/
│   │   └── val_unseen/
│   ├── rxr/
│   │   ├── train/
│   │   ├── val_seen/
│   │   └── val_unseen/
│   └── envdrop/
│       └── ...
├── scene_datasets/
│   └── mp3d/
│       └── ...
└── trajectory_data/
    ├── R2R/
    ├── RxR/
    └── EnvDrop/
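A missing or misnamed folder usually only shows up once the simulator starts loading episodes, so a quick check beforehand can save time. The sketch below tests for the directories shown in the tree above. `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the contents of the folders marked `...` in the tree are not checked.

```python
from pathlib import Path
from typing import List

# Directories relative to the data root, copied from the tree in this guide.
EXPECTED_DIRS = [
    "datasets/r2r/train",
    "datasets/r2r/val_seen",
    "datasets/r2r/val_unseen",
    "datasets/rxr/train",
    "datasets/rxr/val_seen",
    "datasets/rxr/val_unseen",
    "datasets/envdrop",
    "scene_datasets/mp3d",
    "trajectory_data/R2R",
    "trajectory_data/RxR",
    "trajectory_data/EnvDrop",
]


def missing_dirs(data_root: str) -> List[str]:
    """Return the expected sub-directories that do not exist under `data_root`."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("data")
    if missing:
        print("Missing directories:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("Data layout looks complete.")
```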

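4. Model Weights Download

The evaluation commands in the next section expect the checkpoint at data/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln.

5. Model Evaluation and Inference

Multi-GPU evaluation is launched through a script with the following contents:

export MAGNUM_LOG=quiet HABITAT_SIM_LOG=quiet
MASTER_PORT=$((RANDOM % 101 + 20000))
CHECKPOINT="data/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln"
echo "CHECKPOINT: ${CHECKPOINT}"
torchrun --nproc_per_node=4 --master_port=$MASTER_PORT streamvln/streamvln_eval.py --model_path $CHECKPOINT

Run it with:

sh scripts/streamvln_eval_multi_gpu.sh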
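The evaluator code later in this guide walks `episodes[idx::self.env_num]`, so each process handles every `env_num`-th episode of a scene, starting from its own index. The sketch below illustrates that strided split with plain lists. `shard` is a hypothetical helper, and reading `idx` as the process rank and `env_num` as the number of processes is an assumption.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], idx: int, num_shards: int) -> List[T]:
    """Return every `num_shards`-th item of `items`, starting at position `idx`."""
    if not 0 <= idx < num_shards:
        raise ValueError("idx must satisfy 0 <= idx < num_shards")
    return list(items[idx::num_shards])


if __name__ == "__main__":
    episodes = [f"ep{i}" for i in range(10)]
    for rank in range(4):
        print(rank, shard(episodes, rank, 4))
```

The shards are disjoint and together cover every episode, but they can differ in length by one when the episode count is not a multiple of the number of processes.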

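Core Code Logic

# Excerpt of the core evaluation function
import sys
import os
import copy
import json
import tqdm
import torch
import numpy as np
from typing import Any
from omegaconf import OmegaConf
from PIL import Image, ImageFile
from collections import OrderedDict
from depth_camera_filtering import filter_depth  # assumed source of filter_depth
from habitat import logger, Env
from habitat.config.default import get_agent_config  # assumed source of get_agent_config
from habitat_baselines.config.default import get_config as get_habitat_config
from model.stream_video_vln import StreamVLNForCausalLM
from utils.utils import dict_to_cuda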

class VLNEvaluator:
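    def __init__(
        self,
        config_path: str,
        split: str = "val_seen",
        env_num: int = 8,
        output_path: str = None,
        model: Any = None,
        tokenizer: Any = None,
        epoch: int = 0,    # used below, so it must be a parameter
        args: Any = None,  # parsed command-line arguments (e.g. args.save_video)
    ):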
        self.args = args
        self.device = torch.device('cuda')
        self.split = split
        self.env_num = env_num
        self.save_video = args.save_video
        self.output_path = output_path
        self.epoch = epoch
        self.config_path = config_path
        self.config = get_habitat_config(config_path)
        self.agent_config = get_agent_config(self.config.habitat.simulator)
        self.sim_sensors_config = self.config.habitat.simulator.agents.main_agent.sim_sensors
        # ... part of the configuration loading code is omitted ...

    def eval_action(self, idx) -> None:
        env = self.config_env()
        scene_episode_dict = {}
        for episode in env.episodes:
            if episode.scene_id not in scene_episode_dict:
                scene_episode_dict[episode.scene_id] = []
            scene_episode_dict[episode.scene_id].append(episode)
        
        intrinsic_matrix = self.get_intrinsic_matrix(self.config.habitat.simulator.agents.main_agent.sim_sensors.rgb_sensor)
        sucs, spls, oss, ones = [], [], [], []
        done_res = []
        
        # Iterate over the scenes
        for scene in sorted(scene_episode_dict.keys()):
            episodes = scene_episode_dict[scene]
            scene_id = scene.split('/')[-2]
            print(f"Current scene ID = {scene_id}")
            process_bar = tqdm.tqdm(range(len(episodes[idx::self.env_num])), desc=f"Scene {scene_id}")
            
            for episode in episodes[idx::self.env_num]:
                episode_instruction = episode.instruction.instruction_text
                episode_id = episode.episode_id
                
                if [scene_id, episode_id, episode_instruction] in done_res:
                    continue
                
                self.model.reset_for_env(idx)
                env.current_episode = episode
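                observations = env.reset()
                # Agent height at the start of the episode; later heights are measured relative to it
                initial_height = env.sim.get_agent_state().position[1]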
                
                vis_frames = []
                step_id = 0
                rgb_list, depth_list, pose_list, intrinsic_list, time_ids = [], [], [], [], []
                action_seq = []
                past_key_values = None
                output_ids = None
                
                while not env.episode_over:
                    self.model.eval()
                    time_ids.append(step_id)
                    
                    rgb = observations["rgb"]
                    depth = observations["depth"]
                    x, y = observations["gps"]
                    camera_yaw = observations["compass"][0]
                    
                    # Filter the depth image
                    depth = filter_depth(depth.reshape(depth.shape[:2]), blur_type=None)
                    depth = depth * (self._max_depth - self._min_depth) + self._min_depth
                    depth = depth * 1000
                    
                    agent_state = env.sim.get_agent_state()
                    height = agent_state.position[1] - initial_height
                    camera_position = np.array([x, -y, self._camera_height + height])
                    robot_xy = camera_position[:2]
                    tf_camera_to_episodic = self.xyz_yaw_to_tf_matrix(camera_position, camera_yaw)
                    
                    # Preprocess the image
                    image = Image.fromarray(rgb).convert('RGB')
                    image_size = image.size
                    image = self.image_processor.preprocess(images=image, return_tensors='pt')['pixel_values'][0]
                    
                    depth_image, resize_shape = self.preprocess_depth_image(Image.fromarray(depth.astype(np.uint16), mode='I;16'), do_depth_scale=True)
                    intrinsic = self.preprocess_instrinsic(intrinsic_matrix, image_size, resize_shape)
                    intrinsic = torch.from_numpy(intrinsic).float()
                    
                    rgb_list.append(image)
                    depth_list.append(torch.from_numpy(depth_image).float())
                    pose_list.append(torch.from_numpy(tf_camera_to_episodic) @ self.get_axis_align_matrix())
                    intrinsic_list.append(intrinsic)
                    
                    # If the action queue is empty, ask the model for a new action sequence
                    if len(action_seq) == 0:
                        if output_ids is None:
                            sources = copy.deepcopy(self.conversation)
                            sources[0]["value"] = sources[0]["value"].replace('<instruction>.', episode.instruction.instruction_text)
                            add_system = True
                        else:
                            sources = [{"from": "human", "value": ""}, {"from": "gpt", "value": ""}]
                            add_system = False
                        
                        input_ids, conversations = self.preprocess_qwen([sources], self.tokenizer, True, add_system=add_system)
                        if output_ids is not None:
                            input_ids = torch.cat([output_ids, input_ids.to(output_ids.device)], dim=1)
                        
                        images = rgb_list[-1:]
                        depths = depth_list[-1:]
                        poses = pose_list[-1:]
                        intrinsics = intrinsic_list[-1:]
                        
                        if step_id != 0 and step_id % self.num_frames == 0:
                            history_ids = slice(0, time_ids[0], (time_ids[0] // self.num_history))
                            images = rgb_list[history_ids] + images
                            depths = depth_list[history_ids] + depths
                            poses = pose_list[history_ids] + poses
                            intrinsics = intrinsic_list[history_ids] + intrinsics
                        
                        input_dict = {
                            'images': torch.stack(images).unsqueeze(0),
                            'depths': torch.stack(depths).unsqueeze(0),
                            'poses': torch.stack(poses).unsqueeze(0),
                            'intrinsics': torch.stack(intrinsics).unsqueeze(0),
                            'inputs': input_ids,
                            'env_id': idx,
                            'time_ids': [time_ids],
                            'task_type': [0]
                        }
                        input_dict = dict_to_cuda(input_dict, self.device)
                        
                        for key, value in input_dict.items():
                            if key in ['images', 'depths', 'poses', 'intrinsics']:
                                input_dict[key] = input_dict[key].to(torch.bfloat16)
                        
                        outputs = self.model.generate(
                            **input_dict, do_sample=False, num_beams=1, max_new_tokens=10000,
                            use_cache=True, return_dict_in_generate=True, past_key_values=past_key_values
                        )
                        output_ids = outputs.sequences
                        past_key_values = outputs.past_key_values
                        llm_outputs = self.tokenizer.batch_decode(output_ids, skip_special_tokens=False)[0].strip()
                        action_seq = self.parse_actions(llm_outputs)
                        if len(action_seq) == 0:
                            action_seq = [0]
                        
                    # Execute the next queued action on every step, not only after a model call
                    action = action_seq.pop(0)
                    observations = env.step(action)
                    step_id += 1

                    # Start a fresh context window every num_frames steps
                    if step_id % self.num_frames == 0:
                        self.model.reset_for_env(idx)
                        output_ids = None
                        past_key_values = None
                        time_ids = []

                # The episode is over: record its metrics once
                process_bar.update(1)
                metrics = env.get_metrics()
                sucs.append(metrics['success'])
                spls.append(metrics['spl'])
                oss.append(metrics['oracle_success'])
                ones.append(metrics['distance_to_goal'])
                print(f"Scene-episode {scene_id}_{episode_id} result: success={metrics['success']}, SPL={metrics['spl']}")

                result = {
                    "scene_id": scene_id, "episode_id": episode_id, "success": metrics["success"],
                    "spl": metrics["spl"], "os": metrics['oracle_success'], "ne": metrics["distance_to_goal"],
                    "steps": step_id, "episode_instruction": episode_instruction
                }
                with open(os.path.join(self.output_path, 'result.json'), 'a') as f:
                    f.write(json.dumps(result) + "\n")
        
        return (torch.tensor(sucs).to(self.device), torch.tensor(spls).to(self.device), torch.tensor(oss).to(self.device), torch.tensor(ones).to(self.device), torch.tensor(len(sucs)).to(self.device))
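At the start of each new window, the code above prepends a subsample of earlier frames selected with `slice(0, time_ids[0], time_ids[0] // self.num_history)`. The sketch below reproduces that index arithmetic on its own, to show which past frames are picked. `history_indices` is a hypothetical helper and the values in the usage lines are example settings, not documented defaults. The `max(..., 1)` guard is an addition: the bare slice raises a `ValueError` when the step evaluates to 0.

```python
from typing import List


def history_indices(window_start: int, num_history: int) -> List[int]:
    """Indices of past frames chosen with a stride of window_start // num_history."""
    if window_start <= 0:
        return []
    # Guard added in this sketch: a stride of 0 is not a valid slice step.
    step = max(window_start // num_history, 1)
    return list(range(0, window_start, step))


if __name__ == "__main__":
    # Example values only.
    print(history_indices(32, 8))
    print(history_indices(64, 8))
```

Because the stride uses integer division, the number of selected frames equals `num_history` only when the window start is a multiple of it, and can be slightly larger otherwise.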

Single-GPU Evaluation

To evaluate on a single GPU, call the evaluation script directly:

python streamvln/streamvln_eval.py --model_path "data/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln" --num_frames 8
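The evaluation code appends one JSON object per finished episode to result.json in the output directory, with the keys `success`, `spl`, `os`, and `ne` among others. A minimal sketch for averaging those per-episode values into overall numbers follows. `summarize_results` is a hypothetical helper, and the path in the usage line is a placeholder for wherever the output directory was set.

```python
import json
from typing import Dict


def summarize_results(path: str) -> Dict[str, float]:
    """Average success, SPL, oracle success and navigation error over a JSON-lines file."""
    totals = {"success": 0.0, "spl": 0.0, "os": 0.0, "ne": 0.0}
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            for key in totals:
                totals[key] += float(record[key])
            count += 1
    if count == 0:
        raise ValueError(f"no results found in {path}")
    summary = {key: value / count for key, value in totals.items()}
    summary["episodes"] = count
    return summary


if __name__ == "__main__":
    # Placeholder path: point this at the result.json of an actual run.
    print(summarize_results("result.json"))
```

The file is opened in append mode by the evaluator, so results from repeated runs accumulate in it. Remove or rename the old file before a fresh run if the averages should cover a single run only.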

6. Model Training

Training is submitted as a Slurm job:

sbatch scripts/streamvln_train_slurm.sh
