Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions and navigate
complex 3D environments. However, existing approaches face two major challenges: constructing an effective
long-term memory bank and mitigating compounding errors. To address these challenges, we propose
DecoVLN, an effective framework designed for robust streaming perception and closed-loop control in
long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and
introduce an adaptive refinement mechanism that selects frames from a historical candidate pool by
iteratively optimizing a unified scoring function. This function jointly balances three key criteria:
semantic relevance to the instruction, visual diversity from the selected memory, and temporal coverage of
the historical trajectory. Second, to alleviate compounding errors, we introduce a state-action pair-level
corrective fine-tuning strategy. Using the geodesic distance between states to precisely quantify
deviation from the expert trajectory, the agent selectively collects high-quality state-action pairs
within the trusted region while filtering out polluted, low-relevance data. This improves both the
efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of our
DecoVLN, and we have deployed it in real-world environments. Code and models will be released publicly.
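The two mechanisms above can be illustrated with a minimal sketch. The abstract does not specify the exact scoring function or selection procedure, so everything here is an assumption for illustration: the function names (`select_memory_frames`, `filter_pairs`), the weights `alpha`/`beta`/`gamma`, the greedy maximizer, and the concrete choices of Euclidean feature distance for diversity and timestamp gap for coverage are all hypothetical stand-ins for the paper's actual formulation.

```python
import numpy as np

def select_memory_frames(features, relevance, timestamps, k,
                         alpha=1.0, beta=1.0, gamma=1.0):
    """Greedily pick k frames from the historical candidate pool,
    maximizing a unified score that balances (1) semantic relevance to
    the instruction, (2) visual diversity from already-selected memory,
    and (3) temporal coverage of the trajectory. Illustrative only."""
    n = len(relevance)
    selected = []
    for _ in range(k):
        best_i, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            if selected:
                # Diversity: feature distance to the nearest selected frame.
                diversity = min(np.linalg.norm(features[i] - features[j])
                                for j in selected)
                # Coverage: time gap to the nearest selected frame.
                coverage = min(abs(timestamps[i] - timestamps[j])
                               for j in selected)
            else:
                diversity = coverage = 1.0  # no memory yet
            score = alpha * relevance[i] + beta * diversity + gamma * coverage
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected

def filter_pairs(pairs, geodesic_dists, threshold):
    """Keep state-action pairs whose geodesic deviation from the expert
    trajectory stays within the trusted region; drop polluted data."""
    return [p for p, d in zip(pairs, geodesic_dists) if d <= threshold]
```

In this sketch, each greedy step re-scores every remaining candidate against the frames already in memory, so a frame that is relevant but visually and temporally redundant loses to one that fills a gap; the geodesic-distance filter is a simple thresholding stand-in for the paper's trusted-region criterion.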