Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi

Michigan State University · ESAT-PSI, KU Leuven

SkillNav Teaser


SkillNav decomposes complex navigation instructions into atomic skills and flexibly recomposes them to address diverse instruction styles and visual scenarios.

Abstract

Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. We then introduce a novel zero-shot Vision-Language Model (VLM)-based router that dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav achieves new state-of-the-art performance on the R2R benchmark and generalizes strongly to the GSA-R2R benchmark, which includes novel instruction styles and unseen environments.
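
To make the mixture-of-skills loop concrete, here is a minimal Python sketch of how a per-step router could dispatch to skill-specialized policies. The SkillAgent and vlm_route interfaces, the skill names, and the STOP convention are illustrative assumptions for exposition; this is not the released SkillNav implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical atomic-skill names, loosely following the examples in the
# abstract (Vertical Movement, Area and Region Identification, Stop and Pause).
SKILLS = ["vertical_movement", "area_region_identification", "stop_and_pause"]


@dataclass
class SkillAgent:
    """A policy specialized for one atomic skill (placeholder interface)."""
    skill: str
    policy: Callable[[str, dict], str]  # (sub_goal, observation) -> action

    def act(self, sub_goal: str, observation: dict) -> str:
        return self.policy(sub_goal, observation)


def vlm_route(sub_goals: List[str], observation: dict,
              history: List[str]) -> Tuple[str, str]:
    """Stand-in for the zero-shot VLM router: choose the current sub-goal and
    the skill best suited to execute it. A real router would prompt a
    vision-language model with the panorama, the sub-goal list, and the action
    history; this stub simply walks through the sub-goals in order."""
    step = min(len(history), len(sub_goals) - 1)  # assumes at least one sub-goal
    sub_goal = sub_goals[step]
    skill = "stop_and_pause" if step == len(sub_goals) - 1 else "area_region_identification"
    return sub_goal, skill


def navigate(sub_goals: List[str], agents: Dict[str, SkillAgent],
             env, max_steps: int = 15) -> List[str]:
    """Route every time step to exactly one specialized agent until it stops."""
    history: List[str] = []
    observation = env.reset()
    for _ in range(max_steps):
        sub_goal, skill = vlm_route(sub_goals, observation, history)
        action = agents[skill].act(sub_goal, observation)
        history.append(action)
        if action == "STOP":
            break
        observation = env.step(action)
    return history
```

In SkillNav the routing decision is made zero-shot by a VLM that aligns the current sub-goal with the visual observation and the action history; the stub above merely advances through the sub-goals in order to keep the example self-contained.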

Architecture of SkillNav: LLM Reordering, VLM Router, and Skill-Based Agents


SkillNav architecture. A temporal reordering module converts instructions into structured action goals. A VLM-based router localizes the current sub-goal and selects a specialized skill agent.
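
As a rough illustration of the two modules in the caption, the sketch below first asks a language model to rewrite a free-form instruction as temporally ordered sub-goals and then asks which sub-goal is currently active. The call_llm stub, function names, and prompt wording are invented for illustration and are not the prompts or interfaces used in the paper.

```python
import json
from typing import List


def call_llm(prompt: str) -> str:
    """Hypothetical stub for an instruction-tuned LLM/VLM call. Swap in any
    model API; the prompts below are illustrative, not the paper's prompts."""
    raise NotImplementedError


def reorder_instruction(instruction: str) -> List[str]:
    """Temporal-reordering sketch: rewrite a free-form instruction as a
    temporally ordered list of short, atomic action goals."""
    prompt = (
        "Rewrite the navigation instruction as a JSON list of short action "
        "goals in the order they should be executed, one atomic action per "
        f"item.\nInstruction: {instruction}\nJSON list:"
    )
    return json.loads(call_llm(prompt))


def locate_sub_goal(sub_goals: List[str], view_caption: str,
                    history: List[str]) -> int:
    """Router sketch: given a caption of the current view and the actions
    taken so far, ask the model which sub-goal should be executed next."""
    prompt = (
        "Sub-goals: " + json.dumps(sub_goals) + "\n"
        "Current view: " + view_caption + "\n"
        "Actions so far: " + json.dumps(history) + "\n"
        "Reply with only the index of the sub-goal to execute next:"
    )
    return int(call_llm(prompt).strip())
```

A navigation loop like the one sketched after the abstract would call reorder_instruction once per episode and locate_sub_goal at every time step before choosing a skill agent.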

Performance Comparison

| Method | Val-Unseen NE↓ | Val-Unseen OSR↑ | Val-Unseen SR↑ | Val-Unseen SPL↑ | Test-Unseen NE↓ | Test-Unseen OSR↑ | Test-Unseen SR↑ | Test-Unseen SPL↑ | R-Basic SR↑ | R-Basic SPL↑ | N-Basic SR↑ | N-Basic SPL↑ | N-Scene SR↑ | N-Scene SPL↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLM-based VLN | | | | | | | | | | | | | | |
| MapGPT (GPT-4V) [1] | 5.63 | 58 | 44 | 35 | -- | -- | -- | -- | 34 | 30 | 25 | 23 | 25 | 23 |
| NavCoT (LLaMA2) [2] | 6.26 | 42 | 34 | 29 | -- | -- | -- | -- | 37 | 35 | 29 | 26 | 29 | 26 |
| NavGPT-2 (FlanT5-5B) [3] | 3.13 | 81 | 72 | 61 | 3.33 | 80 | 72 | 60 | 58 | 45 | 48 | 35 | 57 | 43 |
| NaviLLM (Vicuna-7B) [4] | 3.51 | -- | 67 | 59 | 3.71 | -- | 68 | 60 | -- | -- | -- | -- | -- | -- |
| Supervised VLN | | | | | | | | | | | | | | |
| HAMT [5] | 2.29 | -- | 66 | 61 | 3.93 | 72 | 65 | 60 | 48 | 44 | 42 | 38 | 34 | 30 |
| DUET [6] | 3.31 | 81 | 72 | 60 | 3.65 | 76 | 69 | 59 | 58 | 47 | 48 | 37 | 40 | 30 |
| BEVBERT [7] | 2.81 | 84 | 75 | 64 | 3.13 | 81 | 73 | 62 | 58 | 45 | 46 | 35 | 39 | 27 |
| GR-DUET [8] | -- | -- | -- | -- | -- | -- | -- | -- | 69 | 64 | 57 | 52 | 48 | 43 |
| ScaleVLN † [9] | 2.34 | 87 | 79 | 70 | 2.73 | 84 | 77 | 68 | 78 | 67 | 69 | 57 | 55 | 43 |
| SRDF † [10] | 1.83 | 89 | 84 | 78 | 1.88 | 88 | 84 | 77 | 71 | 63 | 59 | 49 | 52 | 43 |
| Our Mixture of Skill-based VLN | | | | | | | | | | | | | | |
| SkillNav † (ours) [11] | 1.97 | 89 | 83 | 77 | 2.53 | 83 | 78 | 70 | 79 | 69 | 72 | 61 | 57 | 48 |

Val-Unseen and Test-Unseen are R2R splits; R-Basic, N-Basic, and N-Scene abbreviate the GSA-R2R splits Test-R-Basic, Test-N-Basic, and Test-N-Scene. NE is navigation error in meters (lower is better); OSR, SR, and SPL are oracle success rate, success rate, and success weighted by path length, reported in % (higher is better).

† indicates methods trained with large-scale data augmentation. SRDF performs best on R2R due to extensive pretraining on data that mimics R2R-style instructions; however, it struggles to generalize to the more challenging GSA-R2R benchmark.

BibTeX

@misc{ma2025breakingbuildingupmixture,
  title={Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents},
  author={Tianyi Ma and Yue Zhang and Zehao Wang and Parisa Kordjamshidi},
  year={2025},
  eprint={2508.07642},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.07642}
}