Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi

Michigan State University · ESAT-PSI, KU Leuven

SkillNav Teaser


SkillNav decomposes complex navigation instructions into atomic skills and flexibly recomposes them to address diverse instruction styles and visual scenarios.

Abstract

Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. We then introduce a novel zero-shot Vision-Language Model (VLM)-based router that dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav achieves new state-of-the-art performance on the R2R benchmark and generalizes strongly to the GSA-R2R benchmark, which includes novel instruction styles and unseen environments.
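
To make the mixture-of-skills loop concrete, here is a minimal Python sketch of how a per-step router could dispatch to skill-specialized policies. The SkillAgent and vlm_route interfaces, the skill names, and the STOP convention are illustrative assumptions for exposition; this is not the released SkillNav implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical atomic-skill names, loosely following the examples in the
# abstract (Vertical Movement, Area and Region Identification, Stop and Pause).
SKILLS = ["vertical_movement", "area_region_identification", "stop_and_pause"]


@dataclass
class SkillAgent:
    """A policy specialized for one atomic skill (placeholder interface)."""
    skill: str
    policy: Callable[[str, dict], str]  # (sub_goal, observation) -> action

    def act(self, sub_goal: str, observation: dict) -> str:
        return self.policy(sub_goal, observation)


def vlm_route(sub_goals: List[str], observation: dict,
              history: List[str]) -> Tuple[str, str]:
    """Stand-in for the zero-shot VLM router: choose the current sub-goal and
    the skill best suited to execute it. A real router would prompt a
    vision-language model with the panorama, the sub-goal list, and the action
    history; this stub simply walks through the sub-goals in order."""
    step = min(len(history), len(sub_goals) - 1)  # assumes at least one sub-goal
    sub_goal = sub_goals[step]
    skill = "stop_and_pause" if step == len(sub_goals) - 1 else "area_region_identification"
    return sub_goal, skill


def navigate(sub_goals: List[str], agents: Dict[str, SkillAgent],
             env, max_steps: int = 15) -> List[str]:
    """Route every time step to exactly one specialized agent until it stops."""
    history: List[str] = []
    observation = env.reset()
    for _ in range(max_steps):
        sub_goal, skill = vlm_route(sub_goals, observation, history)
        action = agents[skill].act(sub_goal, observation)
        history.append(action)
        if action == "STOP":
            break
        observation = env.step(action)
    return history
```

In SkillNav the routing decision is made zero-shot by a VLM that aligns the current sub-goal with the visual observation and the action history; the stub above merely advances through the sub-goals in order to keep the example self-contained.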

Architecture of SkillNav: LLM Reordering, VLM Router, and Skill-Based Agents


SkillNav architecture. A temporal reordering module converts instructions into structured action goals. A VLM-based router localizes the current sub-goal and selects a specialized skill agent.
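
As a rough illustration of the two modules in the caption, the sketch below first asks a language model to rewrite a free-form instruction as temporally ordered sub-goals and then asks which sub-goal is currently active. The call_llm stub, function names, and prompt wording are invented for illustration and are not the prompts or interfaces used in the paper.

```python
import json
from typing import List


def call_llm(prompt: str) -> str:
    """Hypothetical stub for an instruction-tuned LLM/VLM call. Swap in any
    model API; the prompts below are illustrative, not the paper's prompts."""
    raise NotImplementedError


def reorder_instruction(instruction: str) -> List[str]:
    """Temporal-reordering sketch: rewrite a free-form instruction as a
    temporally ordered list of short, atomic action goals."""
    prompt = (
        "Rewrite the navigation instruction as a JSON list of short action "
        "goals in the order they should be executed, one atomic action per "
        f"item.\nInstruction: {instruction}\nJSON list:"
    )
    return json.loads(call_llm(prompt))


def locate_sub_goal(sub_goals: List[str], view_caption: str,
                    history: List[str]) -> int:
    """Router sketch: given a caption of the current view and the actions
    taken so far, ask the model which sub-goal should be executed next."""
    prompt = (
        "Sub-goals: " + json.dumps(sub_goals) + "\n"
        "Current view: " + view_caption + "\n"
        "Actions so far: " + json.dumps(history) + "\n"
        "Reply with only the index of the sub-goal to execute next:"
    )
    return int(call_llm(prompt).strip())
```

A navigation loop like the one sketched after the abstract would call reorder_instruction once per episode and locate_sub_goal at every time step before choosing a skill agent.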

Performance Comparison

| Method | Val-Unseen NE↓ | Val-Unseen OSR↑ | Val-Unseen SR↑ | Val-Unseen SPL↑ | Test-Unseen NE↓ | Test-Unseen OSR↑ | Test-Unseen SR↑ | Test-Unseen SPL↑ | R-Basic SR↑ | R-Basic SPL↑ | N-Basic SR↑ | N-Basic SPL↑ | N-Scene SR↑ | N-Scene SPL↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLM-based VLN | | | | | | | | | | | | | | |
| MapGPT (GPT-4V) [1] | 5.63 | 58 | 44 | 35 | -- | -- | -- | -- | 34 | 30 | 25 | 23 | 25 | 23 |
| NavCoT (LLaMA2) [2] | 6.26 | 42 | 34 | 29 | -- | -- | -- | -- | 37 | 35 | 29 | 26 | 29 | 26 |
| NavGPT-2 (FlanT5-5B) [3] | 3.13 | 81 | 72 | 61 | 3.33 | 80 | 72 | 60 | 58 | 45 | 48 | 35 | 57 | 43 |
| NaviLLM (Vicuna-7B) [4] | 3.51 | -- | 67 | 59 | 3.71 | -- | 68 | 60 | -- | -- | -- | -- | -- | -- |
| Supervised VLN | | | | | | | | | | | | | | |
| HAMT [5] | 2.29 | -- | 66 | 61 | 3.93 | 72 | 65 | 60 | 48 | 44 | 42 | 38 | 34 | 30 |
| DUET [6] | 3.31 | 81 | 72 | 60 | 3.65 | 76 | 69 | 59 | 58 | 47 | 48 | 37 | 40 | 30 |
| BEVBERT [7] | 2.81 | 84 | 75 | 64 | 3.13 | 81 | 73 | 62 | 58 | 45 | 46 | 35 | 39 | 27 |
| GR-DUET [8] | -- | -- | -- | -- | -- | -- | -- | -- | 69 | 64 | 57 | 52 | 48 | 43 |
| ScaleVLN † [9] | 2.34 | 87 | 79 | 70 | 2.73 | 84 | 77 | 68 | 78 | 67 | 69 | 57 | 55 | 43 |
| SRDF † [10] | 1.83 | 89 | 84 | 78 | 1.88 | 88 | 84 | 77 | 71 | 63 | 59 | 49 | 52 | 43 |
| Our Mixture of Skill-based VLN | | | | | | | | | | | | | | |
| SkillNav † (ours) [11] | 1.97 | 89 | 83 | 77 | 2.53 | 83 | 78 | 70 | 79 | 69 | 72 | 61 | 57 | 48 |

Val-Unseen and Test-Unseen are R2R splits; R-Basic, N-Basic, and N-Scene abbreviate the GSA-R2R splits Test-R-Basic, Test-N-Basic, and Test-N-Scene. NE is navigation error in meters (lower is better); OSR, SR, and SPL are oracle success rate, success rate, and success weighted by path length, reported in % (higher is better).

† indicates methods trained with large-scale data augmentation. SRDF performs best on R2R due to extensive pretraining on data that mimics R2R-style instructions; however, it struggles to generalize to the more challenging GSA-R2R benchmark.

BibTeX

@misc{ma2025breakingbuildingupmixture,
  title={Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents},
  author={Tianyi Ma and Yue Zhang and Zehao Wang and Parisa Kordjamshidi},
  year={2025},
  eprint={2508.07642},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.07642}
}