Breaking Down and Building Up:
Mixture of Skill-Based Vision-and-Language Navigation Agents

Tianyi Ma¹, Yue Zhang¹, Zehao Wang², Parisa Kordjamshidi¹

¹Michigan State University · ²ESAT-PSI, KU Leuven

[Figure: Skill decomposition motivation]
SkillNav decomposes complex navigation instructions into a set of atomic skills — Stop & Pause, Directional Adjustment, Area & Region Identification, Landmark Detection, Vertical Movement — and recomposes them at inference via a training-free VLM-based router.

Abstract

Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and previous actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization on GSA-R2R, a benchmark with novel instruction styles and unseen environments.

Architecture

[Figure: SkillNav architecture]
SkillNav architecture. An LLM-guided Temporal Reordering module converts a free-form instruction into a structured sequence of sub-goals. At each step, a training-free VLM Action Router (Subgoal Localizer + Skill Router) inspects the current observation, action history, and remaining sub-goals to dispatch one of five skill-specialized agents (SP, DA, AR, LD, VM). Each specialist is a Transformer VLN policy fine-tuned on synthetic skill-specific data.
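The control flow in the caption above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `reorder_instruction`, `localize_subgoal`, `route_skill`, `navigate`, and the lambda stand-ins for the five specialists are all hypothetical names, and the LLM/VLM calls are replaced by trivial keyword stubs so that the snippet runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

SKILLS = ("SP", "DA", "AR", "LD", "VM")  # the five skill labels named in the caption


@dataclass
class NavState:
    observation: str                                   # stand-in for the current panorama
    history: List[str] = field(default_factory=list)   # actions taken so far


def reorder_instruction(instruction: str) -> List[str]:
    """Stub for LLM-guided Temporal Reordering (hypothetical): split on 'then'."""
    return [part.strip() for part in instruction.split("then") if part.strip()]


def localize_subgoal(subgoals: List[str], state: NavState) -> int:
    """Stub Subgoal Localizer: assume exactly one sub-goal is completed per step."""
    return min(len(state.history), len(subgoals) - 1)


def route_skill(subgoal: str, state: NavState) -> str:
    """Stub Skill Router: keyword rules stand in for the VLM's decision."""
    text = subgoal.lower()
    if "stop" in text or "wait" in text:
        return "SP"
    if "stairs" in text or "elevator" in text:
        return "VM"
    if "turn" in text or "left" in text or "right" in text:
        return "DA"
    if "room" in text or "kitchen" in text or "hallway" in text:
        return "AR"
    return "LD"


def navigate(
    instruction: str,
    agents: Dict[str, Callable[[str, NavState], str]],
    state: NavState,
    max_steps: int = 10,
) -> List[Tuple[str, str]]:
    subgoals = reorder_instruction(instruction)
    trace = []
    for _ in range(max_steps):
        idx = localize_subgoal(subgoals, state)        # which sub-goal is active now
        skill = route_skill(subgoals[idx], state)      # which specialist should act
        action = agents[skill](subgoals[idx], state)   # the specialist picks the action
        state.history.append(action)
        trace.append((skill, action))
        if action == "STOP":
            break
    return trace


# Dummy specialists: SP stops, every other skill emits a placeholder move.
agents = {s: (lambda sg, st, s=s: "STOP" if s == "SP" else f"{s}:move") for s in SKILLS}
trace = navigate(
    "turn left at the sofa then go up the stairs then stop by the door",
    agents,
    NavState(observation="pano_0"),
)
print(trace)
```

In this toy run the router dispatches DA, then VM, then SP, one specialist per sub-goal. In the real system the localizer and router are VLM calls conditioned on the panorama and action history, so a single sub-goal may span several steps.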

Skill Collection & Synthetic Data

[Figure: Dataset path distribution]
Path-length and step-count distributions of the synthesized skill-specific trajectories, compared against the original R2R distribution.

Main Results — R2R & GSA-R2R

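Val-Unseen and Test-Unseen are R2R splits; Test-R-Basic, Test-N-Basic, and Test-N-Scene are GSA-R2R splits.

| Methods | # | Val-Unseen NE↓ | Val-Unseen OSR↑ | Val-Unseen SR↑ | Val-Unseen SPL↑ | Test-Unseen NE↓ | Test-Unseen OSR↑ | Test-Unseen SR↑ | Test-Unseen SPL↑ | Test-R-Basic SR↑ | Test-R-Basic SPL↑ | Test-N-Basic SR↑ | Test-N-Basic SPL↑ | Test-N-Scene SR↑ | Test-N-Scene SPL↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM-based VLN | | | | | | | | | | | | | | | |
| MapGPT (GPT-4v) | 1 | 5.63 | 58 | 44 | 35 | -- | -- | -- | -- | 34 | 30 | 25 | 23 | 25 | 23 |
| NavCoT (LLaMA2) | 2 | 6.26 | 42 | 34 | 29 | -- | -- | -- | -- | 37 | 35 | 29 | 26 | 29 | 26 |
| NavGPT-2 (FlanT5-5B) | 3 | 3.13 | 81 | 72 | 61 | 3.33 | 80 | 72 | 60 | 58 | 45 | 48 | 35 | 57 | 43 |
| NaviLLM (Vicuna-7B) | 4 | 3.51 | -- | 67 | 59 | 3.71 | -- | 68 | 60 | -- | -- | -- | -- | -- | -- |
| DiscussNav (GPT-4) | 5 | 5.32 | 61 | 43 | 40 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Supervised VLN | | | | | | | | | | | | | | | |
| HAMT | 6 | 2.29 | -- | 66 | 61 | 3.93 | 72 | 65 | 60 | 48 | 44 | 42 | 38 | 34 | 30 |
| DUET | 7 | 3.31 | 81 | 72 | 60 | 3.65 | 76 | 69 | 59 | 58 | 47 | 48 | 37 | 40 | 30 |
| BEVBERT | 8 | 2.81 | 84 | 75 | 64 | 3.13 | 81 | 73 | 62 | 58 | 45 | 46 | 35 | 39 | 27 |
| GR-DUET | 9 | -- | -- | -- | -- | -- | -- | -- | -- | 69 | 64 | 57 | 52 | 48 | 43 |
| SAME† | 10 | 2.73 | -- | 76 | 66 | 3.03 | -- | 74 | 64 | -- | -- | -- | -- | -- | -- |
| ScaleVLN† | 11 | 2.39 | 88 | 79 | 70 | 2.73 | 84 | 77 | 68 | 78 | 67 | 69 | 57 | 55 | 43 |
| SRDF† | 12 | 1.83 | 89 | 84 | 78 | 1.88 | 88 | 84 | 77 | 71 | 63 | 59 | 49 | 52 | 43 |
| Mixture of Skill-based VLN (Ours) | | | | | | | | | | | | | | | |
| SkillNav (ScaleVLN-Aug)† | 13 | 1.97 | 89 | 83 | 77 | 2.53 | 83 | 78 | 70 | 79 | 69 | 72 | 61 | 57 | 48 |
| Δ vs. ScaleVLN | | −0.42 | +1.77 | +3.36 | +6.54 | −0.20 | −1.65 | +0.88 | +1.80 | +0.71 | +2.18 | +2.45 | +4.18 | +2.16 | +5.26 |
| SkillNav (SRDF-Aug)† | 14 | 1.79 | 89 | 84 | 78 | 1.76 | 87 | 84 | 77 | 71 | 64 | 61 | 50 | 54 | 45 |
| Δ vs. SRDF | | −0.04 | −0.26 | +0.20 | +0.22 | −0.12 | −0.82 | −0.28 | +0.09 | +0.56 | +1.02 | +2.83 | +0.87 | +2.78 | +2.02 |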

† = trained with large-scale data augmentation. For NE, a negative Δ means lower error (better). The Δ rows appear to be computed from unrounded scores, so they need not equal the differences of the rounded entries. SkillNav (ScaleVLN-Aug) achieves the best SR and SPL on every GSA-R2R split; SkillNav (SRDF-Aug) further lowers NE on both R2R splits while remaining competitive on GSA-R2R.
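The column abbreviations are not spelled out on this page. Assuming they follow the usual VLN conventions (NE = navigation error, OSR = oracle success rate, SR = success rate, SPL = success weighted by path length), the metrics can be sketched as below. The 3 m success radius and the `Episode` record are assumptions for illustration, and the two episodes are made-up toy values, not results from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

SUCCESS_RADIUS_M = 3.0  # assumption: the success threshold commonly used for R2R


@dataclass
class Episode:
    final_dist: float    # distance from the stop position to the goal (m)
    min_dist: float      # closest distance to the goal along the path (m)
    path_len: float      # length of the path the agent actually took (m)
    shortest_len: float  # shortest-path length from start to goal (m), assumed > 0


def evaluate(episodes: List[Episode]) -> Dict[str, float]:
    n = len(episodes)
    success = [e.final_dist < SUCCESS_RADIUS_M for e in episodes]
    ne = sum(e.final_dist for e in episodes) / n
    sr = sum(success) / n
    # Oracle success: the agent passed within the radius at any point.
    osr = sum(e.min_dist < SUCCESS_RADIUS_M for e in episodes) / n
    # SPL: success weighted by (shortest length / max(taken length, shortest length)).
    spl = sum(
        s * e.shortest_len / max(e.path_len, e.shortest_len)
        for s, e in zip(success, episodes)
    ) / n
    return {"NE": ne, "OSR": 100 * osr, "SR": 100 * sr, "SPL": 100 * spl}


# Toy episodes: one efficient success, one failure that passed near the goal.
episodes = [
    Episode(final_dist=1.0, min_dist=0.5, path_len=10.0, shortest_len=10.0),
    Episode(final_dist=5.0, min_dist=2.0, path_len=20.0, shortest_len=10.0),
]
metrics = evaluate(episodes)
print(metrics)
```

SPL can never exceed SR, which is why every SPL column in the table sits at or below its SR neighbor.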

Per-Skill Evaluation on NavNuances

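| Group | Method | DC SR | VM SR | VM OSR | VM SPL | LR SR | RR SR | RR OSR |
|---|---|---|---|---|---|---|---|---|
| VLN Agents | ScaleVLN | 68.39 | 81.76 | 88.82 | 76.34 | 28.32 | 82.91 | 95.27 |
| VLN Agents | SRDF | 59.93 | 82.94 | 91.18 | 80.98 | 26.28 | 77.09 | 94.55 |
| VLN Agents | Mixed Skills | 66.84 | 84.11 | 87.65 | 79.22 | 48.90 | 81.82 | 90.91 |
| Skill-based Agents (Ours) | Directional Adjustment (DA) | 70.81 | 81.76 | 91.18 | 76.28 | 31.39 | 81.82 | 94.91 |
| Skill-based Agents (Ours) | Vertical Movement (VM) | 70.68 | 87.65 | 89.41 | 83.83 | 30.22 | 82.18 | 96.00 |
| Skill-based Agents (Ours) | Landmark Detection (LD) | 70.29 | 82.35 | 85.29 | 78.94 | 31.53 | 83.64 | 97.09 |
| Skill-based Agents (Ours) | Area & Region Ident. (AR) | 67.53 | 84.12 | 88.82 | 80.49 | 29.20 | 85.09 | 96.36 |
| Skill-based Agents (Ours) | Stop & Pause (SP) | 68.91 | 84.71 | 87.06 | 80.67 | 29.78 | 83.64 | 97.09 |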

Each skill agent excels in its own skill domain. DC = Direction Change, VM = Vertical Movement, LR = Landmark Recognition, RR = Region Recognition. Following NavNuances, metric sets differ across splits.
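The specialization claim can be checked directly against the SR columns reported above. The short sketch below picks, for each NavNuances category, the skill agent with the highest SR. The values are copied from the table; pairing DC with DA, LR with LD, and RR with AR is a reading of the category names rather than something the page states, and SP has no dedicated category here.

```python
# SR per NavNuances category for the five skill agents (copied from the table above).
sr = {
    "DA": {"DC": 70.81, "VM": 81.76, "LR": 31.39, "RR": 81.82},
    "VM": {"DC": 70.68, "VM": 87.65, "LR": 30.22, "RR": 82.18},
    "LD": {"DC": 70.29, "VM": 82.35, "LR": 31.53, "RR": 83.64},
    "AR": {"DC": 67.53, "VM": 84.12, "LR": 29.20, "RR": 85.09},
    "SP": {"DC": 68.91, "VM": 84.71, "LR": 29.78, "RR": 83.64},
}

# For each category, find the skill agent with the highest SR.
best = {
    cat: max(sr, key=lambda agent: sr[agent][cat])
    for cat in ("DC", "VM", "LR", "RR")
}
print(best)  # {'DC': 'DA', 'VM': 'VM', 'LR': 'LD', 'RR': 'AR'}
```

The comparison covers only the five skill agents; among the baseline rows, Mixed Skills reports a higher LR SR (48.90) than any single specialist.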

Action Router — Ablations

Temporal Reordering

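| Reorder | Router | Test-R-Basic SR | Test-R-Basic SPL | Test-N-Basic SR | Test-N-Basic SPL | Test-N-Scene SR | Test-N-Scene SPL |
|---|---|---|---|---|---|---|---|
| ✗ | Qwen | 78.42 | 67.80 | 71.01 | 59.62 | 55.46 | 45.43 |
| ✓ | Qwen | 78.83 | 68.88 | 71.58 | 61.34 | 56.66 | 47.96 |
| ✗ | GLM | 77.46 | 66.27 | 70.70 | 58.63 | 55.62 | 42.64 |
| ✓ | GLM | 78.60 | 67.93 | 71.13 | 59.73 | 56.80 | 46.51 |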

Disabling Temporal Reordering lowers SPL on every split, by roughly 1.1 to 3.9 points in the rows above, with the largest drops on Test-N-Scene. Explicit decomposition acts as a structural scaffold for generalization rather than an optional add-on.
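The size of the reordering effect can be recomputed from the SPL columns of the ablation table. The sketch below assumes that the second row of each router pair is the run with reordering enabled, which matches the Qwen and GLM rows reported in the router-choice table.

```python
SPLITS = ("Test-R-Basic", "Test-N-Basic", "Test-N-Scene")

# SPL values copied from the table; the bool marks whether reordering is enabled
# (assumption: the second row of each router pair is the reordering-enabled run).
spl = {
    ("Qwen", False): (67.80, 59.62, 45.43),
    ("Qwen", True): (68.88, 61.34, 47.96),
    ("GLM", False): (66.27, 58.63, 42.64),
    ("GLM", True): (67.93, 59.73, 46.51),
}

# SPL gained by enabling reordering, per router and split.
gains = {
    router: [
        round(with_r - without_r, 2)
        for with_r, without_r in zip(spl[(router, True)], spl[(router, False)])
    ]
    for router in ("Qwen", "GLM")
}

for router, values in gains.items():
    print(router, dict(zip(SPLITS, values)))
# Qwen {'Test-R-Basic': 1.08, 'Test-N-Basic': 1.72, 'Test-N-Scene': 2.53}
# GLM {'Test-R-Basic': 1.66, 'Test-N-Basic': 1.1, 'Test-N-Scene': 3.87}
```

Under that pairing, the gain is positive on every split for both routers and is largest on Test-N-Scene.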

VLM Router Choice

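| Router | Test-R-Basic SR | Test-R-Basic SPL | Test-N-Basic SR | Test-N-Basic SPL | Test-N-Scene SR | Test-N-Scene SPL |
|---|---|---|---|---|---|---|
| Random | 78.39 | 67.46 | 70.93 | 59.71 | 54.61 | 43.17 |
| GLM-4.1V-9B | 78.60 | 67.93 | 71.13 | 59.73 | 56.80 | 46.51 |
| Qwen2.5-VL-7B | 78.83 | 68.88 | 71.58 | 61.34 | 56.66 | 47.96 |
| GPT-4o | 79.41 | 69.18 | 72.75 | 62.48 | 58.16 | 48.96 |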

With temporal reordering enabled, performance is robust across router VLMs; GPT-4o adds another ~1 SPL on the hardest Test-N-Scene split.

Skill Usage & Runtime

Expert call distribution (GSA-R2R Test-N-Scene)

| πSP | πDA | πAR | πLD | πVM |
|---|---|---|---|---|
| 34.42% | 23.61% | 18.75% | 14.23% | 8.99% |

Control-focused skills (SP + DA) dominate (~58%); semantic skills (AR, LD) act as sparse anchors; VM is the rarest, matching topological sparsity of stairs/elevators in MP3D.
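As a quick arithmetic check of the grouping above, the reported call shares can be summed per group. The group names `control`, `semantic`, and `vertical` are labels chosen here for illustration; the percentages are the ones reported for GSA-R2R Test-N-Scene.

```python
# Expert call shares in percent (copied from the distribution above).
calls = {"SP": 34.42, "DA": 23.61, "AR": 18.75, "LD": 14.23, "VM": 8.99}

# Illustrative grouping that follows the wording of the paragraph above.
groups = {
    "control": ("SP", "DA"),
    "semantic": ("AR", "LD"),
    "vertical": ("VM",),
}

share = {
    name: round(sum(calls[skill] for skill in members), 2)
    for name, members in groups.items()
}
total = round(sum(calls.values()), 2)

print(share)  # {'control': 58.03, 'semantic': 32.98, 'vertical': 8.99}
print(total)  # 100.0
```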

Runtime

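| Method | Split | Runtime (s) | Inferences/s |
|---|---|---|---|
| ScaleVLN | Test-R-Basic | 513.8 | 28.03 |
| ScaleVLN | Test-N-Basic | 342.7 | 26.26 |
| MapGPT | Test-R-Basic | ~597,000 | 0.02 |
| MapGPT | Test-N-Basic | ~373,000 | 0.02 |
| SkillNav (Qwen) | Test-R-Basic | ~27,000 | 0.54 |
| SkillNav (Qwen) | Test-N-Basic | ~18,360 | 0.49 |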

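SkillNav sits between fast supervised baselines and LLM-only agents: by the figures above it is roughly 20 to 25× faster than MapGPT (depending on whether runtime or throughput is compared) and roughly 50× slower than ScaleVLN, while gaining substantial generalization on GSA-R2R.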
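The relative speeds can be recomputed from the runtime column. The sketch below uses the reported values, several of which are approximate ("~"), so the ratios are only indicative; the helper `ratio` is a name introduced here.

```python
# Wall-clock runtime in seconds (copied from the runtime table; "~" values are approximate).
runtime_s = {
    "ScaleVLN": {"Test-R-Basic": 513.8, "Test-N-Basic": 342.7},
    "MapGPT": {"Test-R-Basic": 597_000, "Test-N-Basic": 373_000},
    "SkillNav (Qwen)": {"Test-R-Basic": 27_000, "Test-N-Basic": 18_360},
}


def ratio(slower: str, faster: str, split: str) -> float:
    """How many times longer `slower` takes than `faster` on the given split."""
    return runtime_s[slower][split] / runtime_s[faster][split]


for split in ("Test-R-Basic", "Test-N-Basic"):
    vs_mapgpt = ratio("MapGPT", "SkillNav (Qwen)", split)
    vs_scalevln = ratio("SkillNav (Qwen)", "ScaleVLN", split)
    print(f"{split}: {vs_mapgpt:.1f}x faster than MapGPT, "
          f"{vs_scalevln:.1f}x slower than ScaleVLN")
```

By runtime the speedup over MapGPT comes out at about 20 to 22×. The throughput column gives a somewhat higher ratio (0.54 / 0.02 = 27 and 0.49 / 0.02 ≈ 25), most likely because MapGPT's 0.02 inferences/s is heavily rounded.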

Qualitative Example

[Figure: SkillNav qualitative example]
A trajectory from GSA-R2R Test-N-Scene. The instruction is decomposed into ordered sub-goals; the VLM router dispatches a different skill specialist at each step based on the current panorama and remaining sub-goal, producing the correct turn-and-stop behavior where monolithic baselines fail.

RxR-English Generalization

[Figure: RxR-English generalization]
Zero-shot transfer to RxR-English (no fine-tuning). SkillNav improves both backbones on SR, SPL, and nDTW.
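nDTW is not defined on this page. Assuming the standard normalized Dynamic Time Warping metric, nDTW(R, Q) = exp(−DTW(R, Q) / (|R| · d_th)) for a reference path R and a predicted path Q, a minimal sketch looks like this. Two simplifications are assumptions made here: Euclidean distance between 2D points stands in for geodesic distance on the navigation graph, and the threshold d_th is set to 3 m.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def dtw(reference: Sequence[Point], query: Sequence[Point]) -> float:
    """Classic O(n*m) dynamic-programming DTW cost between two point sequences."""
    n, m = len(reference), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(reference: Sequence[Point], query: Sequence[Point], threshold: float = 3.0) -> float:
    """Normalized DTW in (0, 1]; 1.0 means the query follows the reference exactly."""
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
shifted = [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]  # same shape, offset by 3 m

print(ndtw(reference, reference))  # 1.0
print(ndtw(reference, shifted))    # exp(-1), about 0.368
```

Unlike SR and SPL, which only look at the endpoint and path length, nDTW rewards following the whole reference path, which is why it is commonly reported on RxR.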

BibTeX

@misc{ma2025breakingbuildingupmixture,
  title  = {Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents},
  author = {Tianyi Ma and Yue Zhang and Zehao Wang and Parisa Kordjamshidi},
  year   = {2025},
  eprint = {2508.07642},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI},
  url    = {https://arxiv.org/abs/2508.07642}
}