AgentVLN: Towards Agentic Vision-and-Language Navigation

Zihao Xin1, Wentong Li1,†✉, Yixuan Jiang1, Ziyuan Huang1,
Bin Wang2, Piji Li1, Jianke Zhu3, Jie Qin1, Sheng-Jun Huang1,✉
1Nanjing University of Aeronautics and Astronautics
2Shandong University
3Zhejiang University

† Project Lead    ✉ Corresponding Authors

Abstract

Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D–3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space representation mapping that projects perception-layer 3D topological waypoints into the image plane, yielding pixel-aligned visual prompts for the VLM. Building on this bridge, we integrate a context-aware self-correction and active exploration strategy to recover from occlusions and suppress error accumulation over long trajectories. To further address the spatial ambiguity of instructions in unstructured environments, we propose a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) scheme, endowing the agent with the metacognitive ability to actively seek geometric depth information. Finally, we construct AgentVLN-Instruct, a large-scale instruction-tuning dataset with dynamic stage routing conditioned on target visibility. Extensive experiments show that AgentVLN consistently outperforms prior state-of-the-art (SOTA) methods on long-horizon VLN benchmarks, offering a practical paradigm for lightweight deployment of next-generation embodied navigation models.
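The cross-space representation mapping described above can be illustrated with a standard pinhole projection: 3D waypoints expressed in the camera frame are mapped to pixel coordinates, which can then be drawn onto the RGB frame as visual prompts for the VLM. This is a minimal sketch under assumed, illustrative camera intrinsics, not the paper's actual implementation.

```python
# Sketch: project 3D topological waypoints (camera frame, metres, z forward)
# onto the image plane with a pinhole model. The intrinsic matrix K below is
# an illustrative example for a 640x480 camera, not the authors' calibration.
import numpy as np

def project_waypoints(points_cam: np.ndarray, K: np.ndarray) -> np.ndarray:
    """points_cam: (N, 3) array of [x, y, z] in the camera frame.
    Returns (N, 2) pixel coordinates [u, v]."""
    z = points_cam[:, 2:3]                 # depth of each waypoint
    rays = points_cam / z                  # normalised image-plane coords
    uv_h = (K @ rays.T).T                  # homogeneous pixel coordinates
    return uv_h[:, :2]

K = np.array([[320.0,   0.0, 320.0],      # fx,  0, cx
              [  0.0, 320.0, 240.0],      #  0, fy, cy
              [  0.0,   0.0,   1.0]])
pts = np.array([[0.0, 0.0, 2.0],          # waypoint straight ahead
                [1.0, 0.0, 2.0]])         # waypoint 1 m to the right
print(project_waypoints(pts, K))
# -> [[320. 240.]
#     [480. 240.]]
```

The resulting pixel coordinates are what makes the prompts "pixel-aligned": the VLM reasons over marked image locations rather than raw 3D coordinates.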


Overview of the AgentVLN framework. AgentVLN employs a VLM-as-Brain paradigm, decomposing long-horizon navigation into modular skill executions. Additionally, a context-driven fine-grained strategy and QD-PCoT mitigate localization errors and scale ambiguities, ensuring precise 3D target grounding.

Habitat Demos


VLM-as-Brain Paradigm

Framework Comparison

Unlike prevailing end-to-end approaches that require massive video datasets for pre-training to implicitly encode spatial geometry into their policies, AgentVLN adopts a VLM-as-Brain embodied paradigm. This paradigm explicitly decouples high-level cognitive reasoning from low-level skill execution.
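The decoupling above can be sketched as a skill registry: the VLM ("brain") emits only a skill name plus arguments, and execution is delegated to the plug-and-play library. The skill names and planner interface below are hypothetical illustrations, not the authors' actual API.

```python
# Sketch of the VLM-as-Brain decoupling: reasoning picks a skill, the
# library executes it. All skill names here are illustrative assumptions.
from typing import Callable, Dict

SKILLS: Dict[str, Callable[..., str]] = {}

def skill(name: str):
    """Register a low-level skill in the plug-and-play library."""
    def wrap(fn):
        SKILLS[name] = fn
        return fn
    return wrap

@skill("goto_waypoint")
def goto_waypoint(wp_id: int) -> str:
    # A real skill would invoke a local planner/controller here.
    return f"navigating to waypoint {wp_id}"

@skill("query_depth")
def query_depth(u: int, v: int) -> str:
    # Stand-in for an active depth query (cf. QD-PCoT in the abstract).
    return f"depth queried at pixel ({u}, {v})"

def brain_step(vlm_decision: dict) -> str:
    """The VLM emits {'skill': name, 'args': kwargs}; dispatching through
    the registry keeps semantic reasoning decoupled from execution."""
    return SKILLS[vlm_decision["skill"]](**vlm_decision["args"])

print(brain_step({"skill": "goto_waypoint", "args": {"wp_id": 3}}))
# -> navigating to waypoint 3
```

Because skills are registered rather than baked into the policy, new capabilities can be added without retraining the reasoning model.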

Real-World Deployment


Navigation results in real-world indoor and outdoor environments. Whether navigating complex, confined indoor spaces or outdoor scenes with challenging illumination, the proposed model accurately interprets natural-language instructions and rapidly plans and executes precise navigation trajectories.


Experimental Results

Comparison results on the R2R-CE Val-Unseen split.


Comparison results on the RxR-CE Val-Unseen split.


BibTeX

@misc{xin2026agentvlnagenticvisionandlanguagenavigation,
      title={AgentVLN: Towards Agentic Vision-and-Language Navigation}, 
      author={Zihao Xin and Wentong Li and Yixuan Jiang and Ziyuan Huang and Bin Wang and Piji Li and Jianke Zhu and Jie Qin and Sheng-Jun Huang},
      year={2026},
      eprint={2603.17670},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.17670}, 
}