StreamVLN Embodied Navigation: Reproduction and Model Inference Guide
This post covers the reproduction workflow for the StreamVLN embodied navigation model: creating a Conda environment, installing the Habitat simulator and its dependencies, and preparing the Matterport3D scenes, VLN-CE episodes, and trajectory data. It provides download links for the model weights used for benchmark reproduction and for real-world deployment, walks through the evaluation/inference commands for multi-GPU and single-GPU setups along with the expected GPU memory usage and output format, and finally gives the Slurm command for distributed training. The goal is to help developers quickly set up and run a streaming vision-and-language navigation system.

StreamVLN works in an online, multi-turn dialogue fashion: it takes a continuous video stream as input and outputs action sequences. By combining the language instruction, visual observations, and spatial pose information, the model generates navigation actions (move forward, turn left, turn right, stop).
Paper: StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Code: https://github.com/OpenRobotLab/StreamVLN
This post describes how to reproduce StreamVLN and run model inference.
Example results:

First, create a Conda environment named streamvln with Python 3.9, then activate it:
conda create -n streamvln python=3.9
conda activate streamvln
Install habitat-sim first:
conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat
Then install habitat-lab:
git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab # install habitat_lab
pip install -e habitat-baselines # install habitat_baselines
Get the StreamVLN code:
git clone https://github.com/OpenRobotLab/StreamVLN.git
cd StreamVLN
Install the remaining dependencies:
pip install -r requirements.txt
Patch (2025/7/23): protobuf==3.20.1 is required:
pip install protobuf==3.20.1
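As a quick sanity check (a minimal sketch, assuming the installation above succeeded), verify that the simulator stack imports cleanly and that CUDA is visible:

# Quick sanity check of the installation (illustrative only).
import habitat_sim   # expected: 0.2.4
import habitat       # habitat-lab 0.2.4
import torch

print("habitat_sim:", habitat_sim.__version__)
print("CUDA available:", torch.cuda.is_available())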
Three types of data need to be prepared; create a data folder to hold them:
1) Matterport3D (MP3D) Scenes
Quick download link: https://cloud.tsinghua.edu.cn/f/03e0ca1430a344efa72b/?dl=1

Each scene folder contains a .glb file:

If you want the complete MP3D dataset, the official batch download is recommended (optional).
2) VLN-CE Episodes
Download the VLN-CE episodes and rename the files as needed.
Finally, extract them into the data/datasets/ directory.
3) Collected Trajectory Data
The authors provide pre-collected observation-action trajectory data for training; these trajectories were collected in Matterport3D environments using the R2R and RxR training episodes.
Download link: https://huggingface.co/datasets/cywan/StreamVLN-Trajectory-Data/blob/main/README.md
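The trajectory data can also be pulled programmatically (a minimal sketch assuming huggingface_hub is installed; the target directory simply matches the layout shown below):

# Download the trajectory dataset from the Hugging Face Hub (illustrative).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cywan/StreamVLN-Trajectory-Data",
    repo_type="dataset",
    local_dir="data/trajectory_data",
)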

After downloading the three datasets above, the folder structure should look like this:
data/
├── datasets/
│   ├── r2r/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   └── val_seen.json.gz
│   │   └── val_unseen/
│   │       └── val_unseen.json.gz
│   ├── rxr/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   ├── val_seen_guide.json.gz
│   │   │   └── ...
│   │   └── val_unseen/
│   │       ├── val_unseen_guide.json.gz
│   │       └── ...
│   └── envdrop/
│       ├── envdrop.json.gz
│       └── ...
├── scene_datasets/
│   └── mp3d/
│       ├── 17DRP5sb8fy/
│       ├── 1LXtFkjw3qL/
│       └── ...
└── trajectory_data/
    ├── R2R/
    │   ├── images/
    │   └── annotations.json
    ├── RxR/
    │   ├── images/
    │   └── annotations.json
    └── EnvDrop/
        ├── images/
        └── annotations.json
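Before running anything heavy, a small script can confirm the layout is in place (illustrative; the paths simply mirror the tree above):

# Check that the expected data layout exists (illustrative).
import os

expected = [
    "data/datasets/r2r/val_unseen/val_unseen.json.gz",
    "data/datasets/rxr/val_unseen/val_unseen_guide.json.gz",
    "data/scene_datasets/mp3d",
    "data/trajectory_data/R2R/annotations.json",
]
for path in expected:
    print(("OK      " if os.path.exists(path) else "MISSING ") + path)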
Two sets of model weights are provided:
Model weights 1: benchmark reproduction (simulation)
Use these weights to reproduce the VLN-CE benchmark results:
https://huggingface.co/mengwei0427/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln

Place the downloaded weights under the data directory, for example:
data/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln
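One way to fetch the checkpoint is with the Hugging Face Hub client (a minimal sketch assuming huggingface_hub is installed; the local_dir below just mirrors the example path):

# Download the benchmark checkpoint into data/ (illustrative).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mengwei0427/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln",
    local_dir="data/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln",
)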
Model weights 2: real-world deployment
Download link: https://huggingface.co/mengwei0427/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln_real_world
Two modifications were made for this variant:

To test applicability to real scenes, StreamVLN was deployed on a Unitree Go2 quadruped robot in a real-world environment.
Modify StreamVLN-main/scripts/streamvln_eval_multi_gpu.sh and set the model weight path:
CHECKPOINT="data/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln"
1) Multi-GPU evaluation / inference
Edit streamvln_eval_multi_gpu.sh as follows:
export MAGNUM_LOG=quiet HABITAT_SIM_LOG=quiet
MASTER_PORT=$((RANDOM % 101 + 20000))
CHECKPOINT="data/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln"
echo "CHECKPOINT: ${CHECKPOINT}"
torchrun --nproc_per_node=4 --master_port=$MASTER_PORT streamvln/streamvln_eval.py --model_path $CHECKPOINT
Change --nproc_per_node=4 to match the number of GPUs you have.
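If you are unsure how many GPUs are visible (and therefore what value to pass to --nproc_per_node), a quick check from Python is:

# Print the number of CUDA devices visible to PyTorch.
import torch
print(torch.cuda.device_count())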
Run:
sh scripts/streamvln_eval_multi_gpu.sh
Console output:
CHECKPOINT: data/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln
[2025-07-23 19:02:56,669] torch.distributed.run: [WARNING]
[2025-07-23 19:02:56,669] torch.distributed.run: [WARNING] ******************************
[2025-07-23 19:02:56,669] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-07-23 19:02:56,669] torch.distributed.run: [WARNING] ******************************
| distributed init (rank 0): env://, gpu 0
| distributed init (rank 1): env://, gpu 1
| distributed init (rank 2): env://, gpu 2
| distributed init (rank 3): env://, gpu 3
Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.
...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5.62it/s]
2025-07-23 19:03:13,045 Initializing dataset R2RVLN-v1
...
GPU memory usage (nvidia-smi):
| 0 N/A N/A 1507162 C+G .../envs/streamvln/bin/python3.9 29085MiB |
| 0 N/A N/A 1507163 G .../envs/streamvln/bin/python3.9 825MiB |
| 0 N/A N/A 1507164 G .../envs/streamvln/bin/python3.9 825MiB |
| 0 N/A N/A 1507165 G .../envs/streamvln/bin/python3.9 825MiB |
| 1 N/A N/A 1507163 C .../envs/streamvln/bin/python3.9 36670MiB |
| 2 N/A N/A 1507164 C .../envs/streamvln/bin/python3.9 36982MiB |
| 3 N/A N/A 1507165 C .../envs/streamvln/bin/python3.9 32280MiB |
+-----------------------------------------------------------------------------------------+
Output:
[19:13:20.799677] 32 You are an autonomous navigation assistant. Your task is to Go straight through the doorway.
Go to the left and then left again till you see the star burst pattern on the floor Go through the bedroom doorway and stop when you get to the bed.
Wait there. Devise an action sequence to follow the instruction using the four actions:
TURN LEFT (←) or TURN RIGHT (→) by 15 degrees, MOVE FORWARD (↑) by 25 centimeters, or STOP. These are your historical observations .[19:13:21.780076] <|im_start|>assistant
↑↑↑→<|im_end|>
[19:13:21.780184] Parsed action sequence: [1, 1, 1, 3]
[19:13:21.780284] <|im_start|>assistant
↑↑↑↑<|im_end|>
[19:13:21.780384] Parsed action sequence: [1, 1, 1, 1]
Scene 2azQ1b91cZZ: 22%|███████████████████████▌ | 14/63 [07:46<27:21, 33.49s/it] ...
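For reference, the arrow tokens in the model output map to discrete action ids (STOP=0, MOVE FORWARD=1, TURN LEFT=2, TURN RIGHT=3, consistent with the parsed sequences above). Below is a minimal sketch of such a parser, not the repository's exact implementation:

# Map generated action tokens to discrete action ids (illustrative sketch).
import re

ACTIONS2IDX = {"STOP": 0, "↑": 1, "←": 2, "→": 3}

def parse_actions(output):
    pattern = "|".join(re.escape(tok) for tok in ACTIONS2IDX)
    return [ACTIONS2IDX[tok] for tok in re.findall(pattern, output)]

print(parse_actions("↑↑↑→"))  # -> [1, 1, 1, 3]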
Official result: visualization from the original code:

Modified version (streamvln/streamvln_eval.py):
# System-related imports
import sys
import os
# Add the parent directory to sys.path so custom modules can be imported
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
# Regex, progress bars, PyTorch, and other utility libraries
import re
import tqdm
import torch
import copy
import json
import random
import argparse
import itertools
import quaternion
import transformers
import numpy as np
# Type hints, config utilities, image processing, etc.
from typing import Any
from omegaconf import OmegaConf
from PIL import Image, ImageFile, ImageDraw, ImageFont
from collections import OrderedDict
from torch.nn.utils.rnn import pad_sequence
# Depth image filtering
from depth_camera_filtering import filter_depth
from transformers.image_utils import to_numpy_array
# Habitat libraries (navigation simulation)
import habitat
from habitat import logger, Env
from habitat_extensions import measures
from habitat.config.default import get_agent_config
from habitat_baselines.config.default import get_config as get_habitat_config
from habitat.config.default_structured_configs import (
CollisionsMeasurementConfig,
FogOfWarConfig,
TopDownMapMeasurementConfig,
)
from habitat.utils.visualizations import maps
from habitat.utils.visualizations.utils import images_to_video, observations_to_image
from model.stream_video_vln import StreamVLNForCausalLM
from utils.utils import dict_to_cuda
from utils.dist import *
from utils.utils import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX, DEFAULT_MEMORY_TOKEN, MEMORY_TOKEN_INDEX
class VLNEvaluator:
    def __init__(self, config_path, split, env_num, output_path, model, tokenizer, epoch, args):
.args = args
.device = torch.device()
.split = split
.env_num = env_num
.save_video = args.save_video
.output_path = output_path
.epoch = epoch
.config_path = config_path
.config = get_habitat_config(config_path)
.agent_config = get_agent_config(.config.habitat.simulator)
.sim_sensors_config = .config.habitat.simulator.agents.main_agent.sim_sensors
habitat.config.read_write(.config):
.config.habitat.dataset.split = .split
.config.habitat.task.measurements.update(
{
: TopDownMapMeasurementConfig(
map_padding=,
map_resolution=,
draw_source=,
draw_border=,
draw_shortest_path=,
draw_view_points=,
draw_goal_positions=,
draw_goal_aabbs=,
fog_of_war=FogOfWarConfig(
draw=,
visibility_dist=,
fov=,
),
),
: CollisionsMeasurementConfig(),
}
)
()
(OmegaConf.to_yaml(.config))
._camera_height = .sim_sensors_config.rgb_sensor.position[]
._min_depth = .sim_sensors_config.depth_sensor.min_depth
._max_depth = .sim_sensors_config.depth_sensor.max_depth
camera_fov_rad = np.deg2rad(.sim_sensors_config.depth_sensor.hfov)
._camera_fov = camera_fov_rad
._fx = ._fy = .sim_sensors_config.depth_sensor.width / ( * np.tan(camera_fov_rad / ))
.image_processor = model.get_vision_tower().image_processor
.model = model
.tokenizer = tokenizer
prompt =
.conversation = [{: , : prompt}, {: , : }]
.actions2idx = OrderedDict({
: [],
: [],
: [],
: []
})
.conjunctions = [
,
,
,
,
,
,
]
.num_frames = args.num_frames
.num_future_steps = args.num_future_steps
.num_history = args.num_history
():
target_height = .image_processor.crop_size[]
target_width = .image_processor.crop_size[]
resized_depth_image = depth_image.resize((target_width, target_height), Image.NEAREST)
img = to_numpy_array(resized_depth_image)
do_depth_scale:
img = img / depth_scale
img, (target_width, target_height)
() -> np.ndarray:
width = sensor_cfg.width
height = sensor_cfg.height
fov = sensor_cfg.hfov
fx = (width / ) / np.tan(np.deg2rad(fov / ))
fy = fx
cx = (width - ) /
cy = (height - ) /
intrinsic_matrix = np.array([
[fx, , cx, ],
[, fy, cy, ],
[, , , ],
[, , , ]
])
intrinsic_matrix
():
intrinsic = copy.deepcopy(intrinsic)
(intrinsic.shape) == :
intrinsic = intrinsic[, :, :]
intrinsic[:, ] /= ori_size[] / target_size[]
intrinsic[:, ] /= ori_size[] / target_size[]
intrinsic[:, , ] -= (target_size[] - target_size[]) /
intrinsic.shape[] == :
intrinsic = intrinsic.squeeze()
intrinsic
():
ma = torch.tensor([[, , , ], [-, , , ], [, -, , ], [, , , ]]).double()
ma
() -> np.ndarray:
x, y, z = xyz
transformation_matrix = np.array(
[
[np.cos(yaw), -np.sin(yaw), , x],
[np.sin(yaw), np.cos(yaw), , y],
[, , , z],
[, , , ]
]
)
transformation_matrix
() -> Env:
env = Env(config=.config)
env
() -> :
env = .config_env()
scene_episode_dict = {}
episode env.episodes:
episode.scene_id scene_episode_dict:
scene_episode_dict[episode.scene_id] = []
scene_episode_dict[episode.scene_id].append(episode)
intrinsic_matrix = .get_intrinsic_matrix(.config.habitat.simulator.agents.main_agent.sim_sensors.rgb_sensor)
sucs, spls, oss, ones = [], [], [], []
done_res = []
os.path.exists(os.path.join(.output_path, )):
(os.path.join(.output_path, ),) f:
line f.readlines():
res = json.loads(line)
done_res.append([res[], res[], res[]])
get_rank() == :
scene (scene_episode_dict.keys()):
episodes = scene_episode_dict[scene]
scene_id = scene.split()[-]
()
process_bar = tqdm.tqdm(((episodes[idx::.env_num])), desc=)
episode episodes[idx::.env_num]:
episode_instruction = episode.instruction.instruction_text .config_path episode.object_category
(, episode_instruction)
episode_id = episode.episode_id
[scene_id, episode_id, episode_instruction] done_res:
.model.reset_for_env(idx)
env.current_episode = episode
observations = env.reset()
os.makedirs(os.path.join(.output_path, ), exist_ok=)
Image.fromarray(observations[]).save(os.path.join(.output_path, , ))
vis_frames = []
step_id =
.save_video:
os.makedirs(os.path.join(.output_path, , ), exist_ok=)
initial_height = env.sim.get_agent_state().position[]
rgb_list = []
depth_list = []
depth_images_list = []
pose_list = []
intrinsic_list = []
time_ids = []
action_seq = []
past_key_values =
output_ids =
env.episode_over:
.model.()
time_ids.append(step_id)
rgb = observations[]
depth = observations[]
x, y = observations[]
camera_yaw = observations[][]
depth = filter_depth(depth.reshape(depth.shape[:]), blur_type=)
depth = depth * (._max_depth - ._min_depth) + ._min_depth
depth = depth *
agent_state = env.sim.get_agent_state()
height = agent_state.position[] - initial_height
camera_position = np.array([x, -y, ._camera_height + height])
robot_xy = camera_position[:]
tf_camera_to_episodic = .xyz_yaw_to_tf_matrix(camera_position, camera_yaw)
rotation = agent_state.rotation
translation = agent_state.position
rotation_matrix = quaternion.as_rotation_matrix(rotation)
transformation_matrix = np.eye()
transformation_matrix[:, :] = rotation_matrix
transformation_matrix[:, ] = translation
image = Image.fromarray(rgb).convert()
image_size = image.size
image = .image_processor.preprocess(images=image, return_tensors=)[][]
depth_image, resize_shape = .preprocess_depth_image(Image.fromarray(depth.astype(np.uint16), mode=), do_depth_scale=)
intrinsic = .preprocess_instrinsic(intrinsic_matrix, image_size, resize_shape)
intrinsic = torch.from_numpy(intrinsic).()
rgb_list.append(image)
depth_list.append(torch.from_numpy(depth_image).())
pose_list.append(torch.from_numpy(tf_camera_to_episodic) @ .get_axis_align_matrix())
intrinsic_list.append(intrinsic)
episode_instruction = episode.instruction.instruction_text .config_path episode.object_category
info = env.get_metrics()
info[] :
frame = observations_to_image({:observations[]}, info)
frame_pil = Image.fromarray(frame)
draw = ImageDraw.Draw(frame_pil)
img_width, img_height = frame_pil.size
task_text =
metrics = env.get_metrics()
result_text = (
)
full_text =
base_font_size = (img_height * )
margin = (img_height * )
line_spacing = (base_font_size * )
text_color = (, , )
bg_color = (, , , )
:
font = ImageFont.truetype(, base_font_size)
:
font = ImageFont.load_default()
max_line_width = (img_width * )
():
lines = []
current_line =
char text:
char == :
lines.append(current_line)
current_line =
test_line = current_line + char
bbox = draw.textbbox((, ), test_line, font=font)
(bbox[] - bbox[]) > max_width:
lines.append(current_line)
current_line = char
:
current_line = test_line
current_line:
lines.append(current_line)
lines
wrapped_lines = wrap_text(full_text, font, max_line_width)
line_height = base_font_size + line_spacing
total_text_height = ((wrapped_lines) * line_height) - line_height
max_line_width_actual =
line wrapped_lines:
bbox = draw.textbbox((, ), line, font=font)
line_width = bbox[] - bbox[]
line_width > max_line_width_actual:
max_line_width_actual = line_width
bg_x1 = margin
bg_y1 = margin
bg_x2 = bg_x1 + max_line_width_actual + * margin
bg_y2 = bg_y1 + total_text_height + * margin
bg_y2 > img_height - margin:
bg_y2 = img_height - margin
bg_x2 > img_width * + margin:
bg_x2 = (img_width * ) + margin
draw.rectangle([bg_x1, bg_y1, bg_x2, bg_y2], fill=bg_color)
text_x = bg_x1 + margin
text_y = bg_y1 + margin
line wrapped_lines:
text_y + base_font_size > bg_y2 - margin:
draw.text((text_x, text_y), line, font=font, fill=text_color)
text_y += line_height
frame = np.array(frame_pil)
vis_frames.append(frame)
(action_seq) == :
output_ids :
sources = copy.deepcopy(.conversation)
sources[][] = sources[][].replace(, )
step_id != :
sources[][] +=
sources[][] = sources[][].replace(DEFAULT_VIDEO_TOKEN+, )
sources[][] = sources[][].replace(, episode.instruction.instruction_text)
add_system =
(step_id, sources[][])
:
sources = [{: , : }, {: , : }]
add_system =
input_ids, conversations = .preprocess_qwen([sources], .tokenizer, , add_system=add_system)
output_ids :
input_ids = torch.cat([output_ids, input_ids.to(output_ids.device)], dim=)
images = rgb_list[-:]
depths = depth_list[-:]
poses = pose_list[-:]
intrinsics = intrinsic_list[-:]
step_id != step_id % .num_frames == :
.num_history :
history_ids = (, time_ids[], .num_future_steps)
:
history_ids = (, time_ids[], (time_ids[] // .num_history))
images = rgb_list[history_ids] + images
depths = depth_list[history_ids] + depths
poses = pose_list[history_ids] + poses
intrinsics = intrinsic_list[history_ids] + intrinsics
input_dict = {
: torch.stack(images).unsqueeze(),
: torch.stack(depths).unsqueeze(),
: torch.stack(poses).unsqueeze(),
: torch.stack(intrinsics).unsqueeze(),
: input_ids,
: idx,
: [time_ids],
: []
}
input_dict = dict_to_cuda(input_dict, .device)
key, value input_dict.items():
key [, , , ]:
input_dict[key] = input_dict[key].to(torch.bfloat16)
outputs = .model.generate(
**input_dict,
do_sample=,
num_beams=,
max_new_tokens=,
use_cache=,
return_dict_in_generate=,
past_key_values=past_key_values
)
output_ids = outputs.sequences
past_key_values = outputs.past_key_values
llm_outputs = .tokenizer.batch_decode(output_ids, skip_special_tokens=)[].strip()
(llm_outputs, flush=)
action_seq = .parse_actions(llm_outputs)
(, action_seq, flush=)
(action_seq) == :
action_seq = []
action = action_seq.pop()
observations = env.step(action)
step_id +=
step_id % .num_frames == :
.model.reset_for_env(idx)
output_ids =
past_key_values =
time_ids = []
process_bar.update()
metrics = env.get_metrics()
.save_video:
images_to_video(
vis_frames,
os.path.join(.output_path, ),
,
fps=,
quality=
)
vis_frames.clear()
sucs.append(metrics[])
spls.append(metrics[])
oss.append(metrics[])
ones.append(metrics[])
()
result = {
: scene_id,
: episode_id,
: metrics[],
: metrics[],
: metrics[],
: metrics[],
: step_id,
: episode_instruction
}
(os.path.join(.output_path, ), ) f:
f.write(json.dumps(result) + )
env.close()
(torch.tensor(sucs).to(.device), torch.tensor(spls).to(.device), torch.tensor(oss).to(.device), torch.tensor(ones).to(.device), torch.tensor((sucs)).to(.device))
():
action_patterns = .join(re.escape(action) action .actions2idx)
regex = re.(action_patterns)
matches = regex.findall(output)
actions = [.actions2idx[] matches]
actions = itertools.chain.from_iterable(actions)
(actions)
():
roles = {: , : }
tokenizer = copy.deepcopy(tokenizer)
has_image:
tokenizer.add_tokens([], special_tokens=)
tokenizer.add_tokens([], special_tokens=)
image_token_index = tokenizer.convert_tokens_to_ids()
memory_token_index = tokenizer.convert_tokens_to_ids()
im_start, im_end = tokenizer.additional_special_tokens_ids
unmask_tokens_idx = [, im_start, im_end]
nl_tokens = tokenizer().input_ids
chat_template =
tokenizer.chat_template = chat_template
conversations = []
input_ids = []
i, source (sources):
prompt = random.choice(.conjunctions) + DEFAULT_IMAGE_TOKEN
(source[][]) != :
source[][] +=
:
source[][] =
roles[source[][]] != roles[]:
source = source[:]
input_id = []
add_system:
input_id += tokenizer.apply_chat_template([{ : , : system_message}])
conv source:
:
role = conv[]
content = conv[]
:
role = conv[]
content = conv[]
role = roles.get(role, role)
conv = [{ : role, : content}]
conversations.append(content)
encode_id = tokenizer.apply_chat_template(conv)
input_id += encode_id
idx, encode_id (input_id):
encode_id == image_token_index:
input_id[idx] = IMAGE_TOKEN_INDEX
encode_id == memory_token_index:
input_id[idx] = MEMORY_TOKEN_INDEX
input_ids.append(input_id)
input_ids = torch.tensor(input_ids, dtype=torch.long)
input_ids, conversations
():
lens :
lens = [t.size() t tensors]
(lens) == lens[] == max_len:
tensors
max_len :
max_len = (lens)
bs = (tensors)
hid = tensors[].shape[:]
dtype = tensors[].dtype
output = torch.zeros(bs, max_len, *hid, dtype=dtype).to(tensors[].device)
pad:
output.data.fill_(pad)
i, (t, l) ((tensors, lens)):
output.data[i, :l, ...] = t.data
output
():
local_rank
parser = argparse.ArgumentParser()
parser.add_argument(, default=, =, =)
parser.add_argument(, =, =)
parser.add_argument(, =, default=, =)
parser.add_argument(, =, default=, =)
parser.add_argument(, =, default=, =)
parser.add_argument(, =, default=, =)
parser.add_argument(, =, default=, =)
parser.add_argument(, default=, =)
parser.add_argument(, =, default=, =)
parser.add_argument(, =, default=, =)
parser.add_argument(, default=, =, =)
parser.add_argument(, default=, =, =)
parser.add_argument(, default=, =, =)
parser.add_argument(, default=, =)
parser.add_argument(, default=, =)
parser.add_argument(, default=, =)
args = parser.parse_args()
init_distributed_mode(args)
local_rank = args.local_rank
tokenizer = transformers.AutoTokenizer.from_pretrained(
args.model_path,
model_max_length=args.model_max_length,
padding_side=
)
config = transformers.AutoConfig.from_pretrained(args.model_path)
model = StreamVLNForCausalLM.from_pretrained(
args.model_path,
attn_implementation=,
torch_dtype=torch.bfloat16,
config=config,
low_cpu_mem_usage=,
)
model.model.num_history = args.num_history
model.requires_grad_()
model.to(local_rank)
evaluate(model, tokenizer, args)
():
model.()
world_size = get_world_size()
model.reset(world_size)
evaluator = VLNEvaluator(
config_path=args.habitat_config_path,
split=args.eval_split,
env_num=world_size,
output_path=args.output_path,
model=model,
tokenizer=tokenizer,
epoch=,
args=args
)
sucs, spls, oss, ones, ep_num = evaluator.eval_action(get_rank())
ep_num_all = [torch.zeros_like(ep_num) _ (world_size)]
dist.all_gather(ep_num_all, ep_num)
sucs_all = [torch.zeros(ep_num_all[i], dtype=sucs.dtype).to(sucs.device) i (world_size)]
spls_all = [torch.zeros(ep_num_all[i], dtype=spls.dtype).to(spls.device) i (world_size)]
oss_all = [torch.zeros(ep_num_all[i], dtype=oss.dtype).to(oss.device) i (world_size)]
ones_all = [torch.zeros(ep_num_all[i], dtype=ones.dtype).to(ones.device) i (world_size)]
dist.barrier()
dist.all_gather(sucs_all, sucs)
dist.all_gather(spls_all, spls)
dist.all_gather(oss_all, oss)
dist.all_gather(ones_all, ones)
dist.barrier()
sucs_all = torch.cat(sucs_all, dim=)
spls_all = torch.cat(spls_all, dim=)
oss_all = torch.cat(oss_all, dim=)
ones_all = torch.cat(ones_all, dim=)
result_all = {
: ((sucs_all)/(sucs_all)).item(),
: ((spls_all)/(spls_all)).item(),
: ((oss_all)/(oss_all)).item(),
: ((ones_all)/(ones_all)).item(),
: (sucs_all)
}
(result_all)
get_rank() == :
(os.path.join(args.output_path, ), ) f:
f.write(json.dumps(result_all))
__name__ == :
()
Visualization (shows the task, success status, and distance to the goal in real time):

2) Single-GPU evaluation / inference
Run the following command (set num_frames to a smaller value; the default of 32 requires a lot of GPU memory):
python streamvln/streamvln_eval.py --model_path "data/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln" --num_frames 8
Console output:
(streamvln) lgp@lgp-MS-7E07:~/2025_project/StreamVLN-main$
Not using distributed mode
Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.
[19:45:30.773156] The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.
config.json: 576B [00:00, 60.4kB/s]
model.safetensors: 32%|██████████████████ | 1.14G/3.51G [05:15<16:04, 2.47MB/s]
model.safetensors: 100%|████████████████████████████████████████████████████████| 3.51G/3.51G [08:42<00:00, 6.73MB/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9.20it/s]
............
For multi-node, multi-GPU training with a distributed setup, run:
sbatch scripts/streamvln_train_slurm.sh
