Vision-and-Language Navigation (VLN) is challenging because agents must interpret natural
language instructions and navigate complex 3D environments. While recent progress has been driven by
large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen
scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose
SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN
agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical
Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. We then
introduce a novel zero-shot Vision-Language Model (VLM)-based router, which dynamically selects the most
suitable agent at each time step by aligning sub-goals with visual observations and historical actions.
SkillNav achieves new state-of-the-art performance on the R2R benchmark and demonstrates strong
generalization to the GSA-R2R benchmark, which includes novel instruction styles and unseen environments.
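
To make the skill-routing idea concrete, the following is a minimal sketch of a per-step dispatch loop, not the SkillNav implementation. All names (`StepContext`, `route`, `step`, the agent functions, and the keyword table) are hypothetical and introduced only for illustration. In particular, the router here is a simple keyword matcher standing in for the zero-shot VLM-based router, and the visual observation is reduced to a text caption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class StepContext:
    """Inputs available to the router at one time step (all hypothetical)."""
    sub_goal: str                       # current sub-goal from the instruction
    observation: str                    # text caption standing in for the visual input
    history: List[str] = field(default_factory=list)  # actions taken so far


# A skill agent maps the current context to an action label.
SkillAgent = Callable[[StepContext], str]


def vertical_movement_agent(ctx: StepContext) -> str:
    return "go_upstairs" if "up" in ctx.sub_goal.lower() else "go_downstairs"


def area_region_agent(ctx: StepContext) -> str:
    return "enter_region"


def stop_pause_agent(ctx: StepContext) -> str:
    return "stop"


# Skill names follow the examples in the text; the agents are toy placeholders.
SKILL_AGENTS: Dict[str, SkillAgent] = {
    "Vertical Movement": vertical_movement_agent,
    "Area and Region Identification": area_region_agent,
    "Stop and Pause": stop_pause_agent,
}

# ASSUMPTION: keyword lists replace the VLM's judgment for this sketch only.
SKILL_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "Vertical Movement": ("stairs", "upstairs", "downstairs", "floor"),
    "Area and Region Identification": ("kitchen", "bedroom", "hallway", "room"),
    "Stop and Pause": ("stop", "wait", "pause"),
}


def route(ctx: StepContext) -> str:
    """Pick the skill whose keywords best match the sub-goal and observation."""
    text = f"{ctx.sub_goal} {ctx.observation}".lower()
    scores = {
        skill: sum(keyword in text for keyword in keywords)
        for skill, keywords in SKILL_KEYWORDS.items()
    }
    # Ties resolve to the first skill in dictionary order.
    return max(scores, key=scores.get)


def step(ctx: StepContext) -> Tuple[str, str]:
    """Route to one skill agent, then let that agent choose the action."""
    skill = route(ctx)
    action = SKILL_AGENTS[skill](ctx)
    return skill, action
```

Calling `step` once per time step, with the sub-goal, observation, and action history updated in between, mirrors the select-then-act structure described above. A real router would replace `route` with a VLM query over the same three inputs.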