AgentVLN: Towards Agentic Vision-and-Language Navigation

Zihao Xin1, Wentong Li1,†✉, Yixuan Jiang1, Ziyuan Huang1,
Bin Wang2, Piji Li1, Jianke Zhu3, Jie Qin1, Sheng-Jun Huang1,✉
1Nanjing University of Aeronautics and Astronautics
2Shandong University
3Zhejiang University

† Project Lead    ✉ Corresponding Authors

Abstract

Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D–3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space representation mapping that projects perception-layer 3D topological waypoints into the image plane, yielding pixel-aligned visual prompts for the VLM. Building on this bridge, we integrate a context-aware self-correction and active exploration strategy to recover from occlusions and suppress error accumulation over long trajectories. To further address the spatial ambiguity of instructions in unstructured environments, we propose a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) scheme, endowing the agent with the metacognitive ability to actively seek geometric depth information. Finally, we construct AgentVLN-Instruct, a large-scale instruction-tuning dataset with dynamic stage routing conditioned on target visibility. Extensive experiments show that AgentVLN consistently outperforms prior state-of-the-art (SOTA) methods on long-horizon VLN benchmarks, offering a practical paradigm for lightweight deployment of next-generation embodied navigation models.
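The cross-space representation mapping described above can be illustrated with a standard pinhole projection: 3D waypoints expressed in the camera frame are mapped to pixel coordinates, which can then be drawn onto the RGB frame as visual prompts for the VLM. This is a minimal sketch under assumed, illustrative camera intrinsics, not the paper's actual implementation.

```python
# Sketch: project 3D topological waypoints (camera frame, metres, z forward)
# onto the image plane with a pinhole model. The intrinsic matrix K below is
# an illustrative example for a 640x480 camera, not the authors' calibration.
import numpy as np

def project_waypoints(points_cam: np.ndarray, K: np.ndarray) -> np.ndarray:
    """points_cam: (N, 3) array of [x, y, z] in the camera frame.
    Returns (N, 2) pixel coordinates [u, v]."""
    z = points_cam[:, 2:3]                 # depth of each waypoint
    rays = points_cam / z                  # normalised image-plane coords
    uv_h = (K @ rays.T).T                  # homogeneous pixel coordinates
    return uv_h[:, :2]

K = np.array([[320.0,   0.0, 320.0],      # fx,  0, cx
              [  0.0, 320.0, 240.0],      #  0, fy, cy
              [  0.0,   0.0,   1.0]])
pts = np.array([[0.0, 0.0, 2.0],          # waypoint straight ahead
                [1.0, 0.0, 2.0]])         # waypoint 1 m to the right
print(project_waypoints(pts, K))
# -> [[320. 240.]
#     [480. 240.]]
```

The resulting pixel coordinates are what makes the prompts "pixel-aligned": the VLM reasons over marked image locations rather than raw 3D coordinates.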


Overview of the AgentVLN framework. AgentVLN employs a VLM-as-Brain paradigm, decomposing long-horizon navigation into modular skill executions. Additionally, a context-driven fine-grained strategy and QD-PCoT mitigate localization errors and scale ambiguities, ensuring precise 3D target grounding.

Habitat Demos


VLM-as-Brain Paradigm

Framework Comparison

Unlike prevailing end-to-end approaches that require massive video datasets for pre-training to implicitly encode spatial geometry into their policies, AgentVLN adopts a VLM-as-Brain embodied paradigm. This paradigm explicitly decouples high-level cognitive reasoning from low-level skill execution.
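The decoupling above can be sketched as a skill registry: the VLM ("brain") emits only a skill name plus arguments, and execution is delegated to the plug-and-play library. The skill names and planner interface below are hypothetical illustrations, not the authors' actual API.

```python
# Sketch of the VLM-as-Brain decoupling: reasoning picks a skill, the
# library executes it. All skill names here are illustrative assumptions.
from typing import Callable, Dict

SKILLS: Dict[str, Callable[..., str]] = {}

def skill(name: str):
    """Register a low-level skill in the plug-and-play library."""
    def wrap(fn):
        SKILLS[name] = fn
        return fn
    return wrap

@skill("goto_waypoint")
def goto_waypoint(wp_id: int) -> str:
    # A real skill would invoke a local planner/controller here.
    return f"navigating to waypoint {wp_id}"

@skill("query_depth")
def query_depth(u: int, v: int) -> str:
    # Stand-in for an active depth query (cf. QD-PCoT in the abstract).
    return f"depth queried at pixel ({u}, {v})"

def brain_step(vlm_decision: dict) -> str:
    """The VLM emits {'skill': name, 'args': kwargs}; dispatching through
    the registry keeps semantic reasoning decoupled from execution."""
    return SKILLS[vlm_decision["skill"]](**vlm_decision["args"])

print(brain_step({"skill": "goto_waypoint", "args": {"wp_id": 3}}))
# -> navigating to waypoint 3
```

Because skills are registered rather than baked into the policy, new capabilities can be added without retraining the reasoning model.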

Real-World Deployment


Navigation results in real-world indoor and outdoor environments. Whether navigating complex, confined indoor spaces or outdoor scenes with challenging illumination, the proposed model accurately interprets natural-language instructions and rapidly plans and executes precise navigation trajectories.


Experimental Results

Comparison results on the R2R-CE Val-Unseen split.


Comparison results on the RxR-CE Val-Unseen split.


BibTeX

@misc{xin2026agentvlnagenticvisionandlanguagenavigation,
      title={AgentVLN: Towards Agentic Vision-and-Language Navigation}, 
      author={Zihao Xin and Wentong Li and Yixuan Jiang and Ziyuan Huang and Bin Wang and Piji Li and Jianke Zhu and Jie Qin and Sheng-Jun Huang},
      year={2026},
      eprint={2603.17670},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.17670}, 
}