Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language
instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs)
offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial
perception, 2D–3D representation mismatch, and monocular scale ambiguity. In this paper, we propose
AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing
platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce
a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a
plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space
representation mapping that projects perception-layer 3D topological waypoints into the image plane,
yielding pixel-aligned visual prompts for the VLM. Building on this bridge, we integrate a context-aware
self-correction and active exploration strategy to recover from occlusions and suppress error accumulation
over long trajectories. To further address the spatial ambiguity of instructions in unstructured
environments, we propose a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) scheme that endows the agent
with the metacognitive ability to actively seek geometric depth information. Finally, we construct
AgentVLN-Instruct, a large-scale instruction-tuning dataset with dynamic stage routing conditioned on
target visibility. Extensive experiments show that AgentVLN consistently outperforms prior
state-of-the-art (SOTA) methods on long-horizon VLN benchmarks, offering a practical paradigm for
lightweight deployment of next-generation embodied navigation models.
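The cross-space representation mapping described above projects 3D topological waypoints into the image plane to produce pixel-aligned visual prompts. A minimal sketch of that projection step, assuming a standard pinhole camera model with illustrative intrinsics and waypoints (none of these values are from the paper):

```python
import numpy as np

# Illustrative camera intrinsics: focal lengths fx = fy = 500,
# principal point (cx, cy) = (320, 240). Hypothetical values.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project_waypoints(points_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Map Nx3 camera-frame waypoints to Nx2 pixel coordinates
    via pinhole projection: u = fx*x/z + cx, v = fy*y/z + cy."""
    uvw = points_3d @ K.T            # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth z

# Two hypothetical waypoints in the camera frame (x right, y down, z forward):
waypoints = np.array([[ 0.5, 0.0, 2.0],   # slightly right, 2 m ahead
                      [-1.0, 0.2, 4.0]])  # left of center, 4 m ahead
pixels = project_waypoints(waypoints, K)
# pixels[0] → [445., 240.], pixels[1] → [195., 265.]
```

The resulting pixel coordinates could then be rendered as markers on the observation image, giving the VLM spatially grounded prompts rather than raw 3D coordinates.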