SkillNav: Mixture of Skill-Based Vision-and-Language Navigation Agents

Abstract

Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and previous actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization on GSA-R2R, a benchmark with novel instruction styles and unseen environments.

Architecture

Skill Collection & Synthetic Data

Skill instruction distribution on R2R — Distribution of skill-tagged instructions in our synthetic R2R training corpus, grouped by atomic skill.

Dataset path distribution — Path-length and step-count distributions of the synthesized skill-specific trajectories, compared against the original R2R distribution.

Main Results — R2R & GSA-R2R

Methods	#	R2R								GSA-R2R
		Val-Unseen				Test-Unseen				Test-R-Basic		Test-N-Basic		Test-N-Scene
		NE↓	OSR↑	SR↑	SPL↑	NE↓	OSR↑	SR↑	SPL↑	SR↑	SPL↑	SR↑	SPL↑	SR↑	SPL↑
LLM-based VLN
MapGPT (GPT-4v)	1	5.63	58	44	35	--	--	--	--	34	30	25	23	25	23
NavCoT (LLaMA2)	2	6.26	42	34	29	--	--	--	--	37	35	29	26	29	26
NavGPT-2 (FlanT5-5B)	3	3.13	81	72	61	3.33	80	72	60	58	45	48	35	57	43
NaviLLM (Vicuna-7B)	4	3.51	--	67	59	3.71	--	68	60	--	--	--	--	--	--
DiscussNav (GPT-4)	5	5.32	61	43	40	--	--	--	--	--	--	--	--	--	--
Supervised VLN
HAMT	6	2.29	--	66	61	3.93	72	65	60	48	44	42	38	34	30
DUET	7	3.31	81	72	60	3.65	76	69	59	58	47	48	37	40	30
BEVBERT	8	2.81	84	75	64	3.13	81	73	62	58	45	46	35	39	27
GR-DUET	9	--	--	--	--	--	--	--	--	69	64	57	52	48	43
SAME†	10	2.73	--	76	66	3.03	--	74	64	--	--	--	--	--	--
ScaleVLN†	11	2.39	88	79	70	2.73	84	77	68	78	67	69	57	55	43
SRDF†	12	1.83	89	84	78	1.88	88	84	77	71	63	59	49	52	43
Mixture of Skill-based VLN (Ours)
SkillNav (ScaleVLN-Aug)†	13	1.97	89	83	77	2.53	83	78	70	79	69	72	61	57	48
Δ vs. ScaleVLN		−0.42	+1.77	+3.36	+6.54	−0.20	−1.65	+0.88	+1.80	+0.71	+2.18	+2.45	+4.18	+2.16	+5.26
SkillNav (SRDF-Aug)†	14	1.79	89	84	78	1.76	87	84	77	71	64	61	50	54	45
Δ vs. SRDF		−0.04	−0.26	+0.20	+0.22	−0.12	−0.82	−0.28	+0.09	+0.56	+1.02	+2.83	+0.87	+2.78	+2.02

† large-scale data augmentation. Bold = best, underline = second-best per column. For NE, − means lower (better). SkillNav (ScaleVLN-Aug) sets the new SOTA on every GSA-R2R split; SkillNav (SRDF-Aug) further pushes the R2R bound while remaining competitive on GSA-R2R.

Per-Skill Evaluation on NavNuances

	Method	DC	VM			LR	RR
	Method	SR	SR	OSR	SPL	SR	SR	OSR
VLN Agents	ScaleVLN	68.39	81.76	88.82	76.34	28.32	82.91	95.27
	SRDF	59.93	82.94	91.18	80.98	26.28	77.09	94.55
	Mixed Skills	66.84	84.11	87.65	79.22	48.90	81.82	90.91
Skill-based Agents (Ours)	Directional Adjustment (DA)	70.81	81.76	91.18	76.28	31.39	81.82	94.91
	Vertical Movement (VM)	70.68	87.65	89.41	83.83	30.22	82.18	96.00
	Landmark Detection (LD)	70.29	82.35	85.29	78.94	31.53	83.64	97.09
	Area & Region Ident. (AR)	67.53	84.12	88.82	80.49	29.20	85.09	96.36
	Stop & Pause (SP)	68.91	84.71	87.06	80.67	29.78	83.64	97.09

Each skill agent excels in its own skill domain. DC = Direction Change, VM = Vertical Movement, LR = Landmark Recognition, RR = Region Recognition. Following NavNuances, metric sets differ across splits.

Action Router — Ablations

Temporal Reordering

Reorder	Router	Test-R-Basic		Test-N-Basic		Test-N-Scene
Reorder	Router	SR	SPL	SR	SPL	SR	SPL
✗	Qwen	78.42	67.80	71.01	59.62	55.46	45.43
✓	Qwen	78.83	68.88	71.58	61.34	56.66	47.96
✗	GLM	77.46	66.27	70.70	58.63	55.62	42.64
✓	GLM	78.60	67.93	71.13	59.73	56.80	46.51

Disabling Temporal Reordering hurts every split by ~1–2.5 SPL points — explicit decomposition is a structural scaffold for generalization, not optional.

VLM Router Choice

Router	Test-R-Basic		Test-N-Basic		Test-N-Scene
Router	SR	SPL	SR	SPL	SR	SPL
Random	78.39	67.46	70.93	59.71	54.61	43.17
GLM-4.1V-9B	78.60	67.93	71.13	59.73	56.80	46.51
Qwen2.5-VL-7B	78.83	68.88	71.58	61.34	56.66	47.96
GPT-4o	79.41	69.18	72.75	62.48	58.16	48.96

With temporal reordering enabled, performance is robust across router VLMs; GPT-4o adds another ~1 SPL on the hardest Test-N-Scene split.

Skill Usage & Runtime

Expert call distribution (GSA-R2R Test-N-Scene)

π_SP	π_DA	π_AR	π_LD	π_VM
34.42%	23.61%	18.75%	14.23%	8.99%

Control-focused skills (SP + DA) dominate (~58%); semantic skills (AR, LD) act as sparse anchors; VM is the rarest, matching topological sparsity of stairs/elevators in MP3D.

Runtime

Method	Split	Runtime (s)	Inferences/s
ScaleVLN	Test-R-Basic	513.8	28.03
ScaleVLN	Test-N-Basic	342.7	26.26
MapGPT	Test-R-Basic	~597,000	0.02
MapGPT	Test-N-Basic	~373,000	0.02
SkillNav (Qwen)	Test-R-Basic	~27,000	0.54
SkillNav (Qwen)	Test-N-Basic	~18,360	0.49

SkillNav sits between fast supervised baselines and LLM-only agents: ~25× faster than MapGPT while gaining substantial generalization on GSA-R2R.

Qualitative Example

RxR-English Generalization

BibTeX

@misc{ma2025breakingbuildingupmixture,
  title  = {Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents},
  author = {Tianyi Ma and Yue Zhang and Zehao Wang and Parisa Kordjamshidi},
  year   = {2025},
  eprint = {2508.07642},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI},
  url    = {https://arxiv.org/abs/2508.07642}
}

Breaking Down and Building Up:Mixture of Skill-Based Vision-and-Language Navigation Agents