DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation

Zihao Xin1,*, Wentong Li1,*,†, Yixuan Jiang1, Bin Wang2, Runmin Cong2, Jie Qin1,✉, Sheng-Jun Huang1,✉

1Nanjing University of Aeronautics and Astronautics    2Shandong University

*Co-first Authors    †Project Lead    ✉Corresponding Authors

Abstract

Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions and navigate complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming the compounding-error problem. To address these issues, we propose DecoVLN, an effective framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce an adaptive memory refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three key criteria: semantic relevance to the instruction, visual diversity from the already-selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a corrective fine-tuning strategy at the level of state-action pairs. By leveraging the geodesic distance between states to precisely quantify deviation from the expert trajectory, the agent selectively collects high-quality state-action pairs within a trusted region while filtering out polluted, low-relevance data. This improves both the efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of DecoVLN, and we have deployed it in real-world environments. Code and models will be released publicly.
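The frame-selection idea from the abstract can be sketched as a greedy loop over a candidate pool. The sketch below is a minimal illustration, not the paper's implementation: the function name `refine_memory`, the linear weighting of the three criteria, and the unit-norm feature assumption are all our own; the paper only specifies that relevance, diversity, and coverage are jointly balanced by a unified scoring function.

```python
import numpy as np

def refine_memory(pool_feats, instr_feat, timestamps, k,
                  w_rel=1.0, w_div=1.0, w_cov=1.0):
    """Greedily select k frames from a candidate pool by maximizing a
    combined score: semantic relevance to the instruction, visual
    diversity from frames already selected, and temporal coverage of
    the trajectory. Features are assumed L2-normalized."""
    n = len(pool_feats)
    selected = []
    for _ in range(min(k, n)):
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            rel = float(pool_feats[i] @ instr_feat)  # relevance (cosine)
            if selected:
                # diversity: distance from the most similar selected frame
                div = 1.0 - max(float(pool_feats[i] @ pool_feats[j])
                                for j in selected)
                # coverage: gap to the temporally nearest selected frame
                cov = min(abs(float(timestamps[i] - timestamps[j]))
                          for j in selected)
            else:
                div, cov = 1.0, 1.0
            score = w_rel * rel + w_div * div + w_cov * cov
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)
```

In practice the three terms would be normalized to comparable ranges before weighting; the sketch leaves them raw for brevity.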

[Pipeline figure]

The framework of DecoVLN. DecoVLN decouples the agent's observation and reasoning processes. The agent perceives the environment continuously while in motion and, guided by the Adaptive Memory Refinement (AMR) mechanism, filters and stores high-information-density state representations in a memory bank. During the generation phase, the large language model outputs an action chunk comprising multiple consecutive actions, conditioned on the input instruction, the current frame, and the memory bank. Subsequently, we construct an error-correction strategy based on state-action pairs: the model autonomously explores according to the instruction and collects state-action pairs within a trusted region for error-correction fine-tuning. This process not only enhances data utilization efficiency but also equips the model with introspective, self-correcting capabilities.
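The trusted-region collection step described above can be illustrated with a small filter. This is a hedged sketch under our own assumptions: the helper name `collect_trusted_pairs`, the callable `geodesic_dist`, and the deviation rule (minimum geodesic distance to any expert state, compared against a threshold) are illustrative stand-ins for whatever criterion the paper actually uses.

```python
def collect_trusted_pairs(rollout, expert_states, geodesic_dist, threshold):
    """Keep only state-action pairs whose state lies within a geodesic
    threshold of the expert trajectory (the 'trusted region'). Pairs
    whose state has drifted beyond the threshold are treated as
    polluted data and dropped before corrective fine-tuning."""
    trusted = []
    for state, action in rollout:
        # deviation = geodesic distance to the nearest expert state
        deviation = min(geodesic_dist(state, e) for e in expert_states)
        if deviation <= threshold:
            trusted.append((state, action))
    return trusted
```

On a navigation mesh, `geodesic_dist` would be a shortest-path query rather than a Euclidean one, which is what makes the deviation measure meaningful around walls and obstacles.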

Real World Demos


Decoupling Observation and Reasoning

Comparison with Uniform Sampling

The current paradigm requires storing all historical observation sequences, sampling from them during inference, and repeatedly transferring the selected frames between RAM and VRAM.


Our DecoVLN, in contrast, introduces an adaptive memory-refinement mechanism during the observation phase. This design selectively preserves high-value semantic information in a VRAM-resident memory bank, which is directly consumed by the VLN model during inference.

Experimental Results

Comparison with SOTA methods

[Results table]

* indicates training with additional large-scale datasets. Our method achieves the best results under fair settings, without using global priors or multi-sensor inputs.


BibTeX

@misc{xin2026decovln,
      title={DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation}, 
      author={Zihao Xin and Wentong Li and Yixuan Jiang and Bin Wang and Runmin Cong and Jie Qin and Sheng-Jun Huang},
      year={2026},
      eprint={2603.13133},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.13133}, 
}