Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) because of a fundamental mismatch: VLMs are pretrained primarily on static, disembodied vision-language tasks, whereas navigation is dynamic, embodied, and spatially structured. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing the model to infer complex visual-topological relationships implicitly or limiting its ability to act globally.
To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structure into the VLM backbone. For topological edges, Spatial Topology Aware Residual Attention (STAR-Att) integrates pairwise distance information directly into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. For topological nodes, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the topological graph embedded in this way, the model performs global action reasoning over all observed nodes, allowing robust path correction.
On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and an SPL of 47.18 in unseen environments, outperforming prior large-model-based work by 3.39 points in SR and 9.08 points in SPL. This demonstrates that, for embodied spatial reasoning, targeted architectural enhancements on smaller open-source VLMs can be more effective than brute-force model scaling.
The core thesis of this work is that the gap between disembodied VLMs and embodied navigation can be most effectively closed not by scaling model parameters, but by injecting the right structural priors directly into the architecture. We contribute: (i) STAR-Att, which injects topological edge information into the backbone's self-attention; (ii) the Interleaved Navigation Prompt for node-level visual-text alignment; (iii) a global action space that enables backtracking and path correction; and (iv) state-of-the-art results among large-model-based methods on R2R.
TagaVLM consists of four tightly coupled components.
(1) An online topological map records the visual observations, node types, and pairwise distances as the agent explores.
(2) The Interleaved Navigation Prompt (INP) inserts each node's visual tokens at the corresponding `<image>` placeholder in the textual prompt, so that visual features sit contextually adjacent to their node descriptions.
(3) STAR-Att replaces every self-attention layer in the LLM backbone, adding a per-head learnable bias derived from the topological distance matrix; closer nodes attend more strongly, even when their visual features differ.
(4) A global action space over all observed candidate nodes enables the agent to select any reachable target—including backtracking—at every step.
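For concreteness, the sketch below illustrates one way the online topological map of component (1) and the global action space of component (4) could be represented. The class names, fields, and use of networkx are our own assumptions for illustration, not the released implementation.

```python
# Hypothetical sketch (not TagaVLM's code): an online topological map that
# records observations, node types, and pairwise distances, plus a global
# action space over all observed candidate nodes.
from dataclasses import dataclass

import networkx as nx


@dataclass
class NodeRecord:
    node_id: int
    node_type: str       # e.g. "visited" or "candidate"
    observation: object  # visual features / tokens for this node


class TopoMap:
    def __init__(self):
        self.graph = nx.Graph()
        self.nodes: dict[int, NodeRecord] = {}

    def add_node(self, node_id, node_type, observation):
        """Record a newly observed node as the agent explores."""
        self.nodes[node_id] = NodeRecord(node_id, node_type, observation)
        self.graph.add_node(node_id)

    def add_edge(self, u, v, distance):
        """Connect two nodes with their traversal distance."""
        self.graph.add_edge(u, v, weight=distance)

    def distance_matrix(self):
        """Pairwise shortest-path distances between all recorded nodes."""
        ids = sorted(self.nodes)
        lengths = dict(nx.all_pairs_dijkstra_path_length(self.graph))
        return [[lengths[i].get(j, float("inf")) for j in ids] for i in ids]

    def global_actions(self):
        """Global action space: every candidate node observed so far,
        including nodes behind the agent, which is what enables backtracking."""
        return [n for n in self.nodes.values() if n.node_type == "candidate"]
```

In such a setup, the prompt at each step would enumerate `global_actions()` (plus a stop action), while `distance_matrix()` is the spatial signal that STAR-Att consumes below.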
Architecture of TagaVLM. The observation encoder maps RGB images to visual tokens. These are interleaved with textual node descriptions to form the INP. The LLM backbone, augmented with STAR-Att, fuses semantic and spatial information to produce a global action decision.
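As a hedged illustration of how the INP could be assembled, the snippet below splices each node's visual tokens into the text wherever an `<image>` placeholder occurs, assuming a HuggingFace-style tokenizer and the LLM's input embedding layer; the function and argument names are ours, not the paper's.

```python
# Illustrative sketch of interleaved prompt construction (not the actual
# TagaVLM code): visual tokens are spliced in wherever "<image>" appears,
# so each node's features sit right next to its textual description.
import torch


def build_interleaved_inputs(prompt, node_visual_tokens, tokenizer, embed_fn):
    """prompt: text containing one "<image>" placeholder per node.
    node_visual_tokens: list of [num_tokens, hidden] tensors, one per node.
    embed_fn: maps token ids to text embeddings (e.g. the LLM's embedding layer)."""
    segments = prompt.split("<image>")
    assert len(segments) == len(node_visual_tokens) + 1
    pieces = []
    for i, text in enumerate(segments):
        ids = tokenizer(text, return_tensors="pt", add_special_tokens=False).input_ids
        pieces.append(embed_fn(ids[0]))           # text embeddings for this segment
        if i < len(node_visual_tokens):
            pieces.append(node_visual_tokens[i])  # node i's visual tokens
    return torch.cat(pieces, dim=0)               # [seq_len, hidden] interleaved sequence
```

In practice one would also record which node each token position belongs to while interleaving, since that token-to-node mapping is what lets the node-level distance matrix be expanded to token level in STAR-Att.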
Standard self-attention treats all token pairs equally, regardless of their spatial relationship in the environment. STAR-Att addresses this by expanding the node-level pairwise distance matrix to the token level and adding it as a learnable, per-head residual bias to the attention scores. Tokens belonging to spatially closer nodes receive higher attention—even when their visual features are dissimilar—while the residual design preserves the pretrained semantic knowledge of the original attention. This provides a flexible inductive prior rather than a rigid constraint, allowing each attention head to independently calibrate the strength of spatial reasoning.
Visualization of the STAR-Att mechanism: topological distance information is injected as a residual attention bias, enabling spatially-aware reasoning across the navigation graph.
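To make the mechanism concrete, here is a minimal PyTorch sketch of the residual distance bias, under our own assumptions about tensor layout, the token-to-node mapping, and the sign convention (closer nodes receive a less negative bias); it is not the released implementation.

```python
# Hypothetical sketch of STAR-Att's residual bias (not the released code).
# A node-level distance matrix is expanded to token level and added to the
# pre-softmax attention scores with a learnable per-head scale.
import torch
import torch.nn as nn


class StarAttBias(nn.Module):
    def __init__(self, num_heads):
        super().__init__()
        # One learnable gain per head; initialized to zero so the model starts
        # as vanilla attention and each head calibrates its own spatial prior.
        self.head_scale = nn.Parameter(torch.zeros(num_heads))

    def forward(self, attn_scores, node_dist, token_to_node):
        """attn_scores: [batch, heads, seq, seq] pre-softmax logits.
        node_dist: [batch, nodes, nodes] finite topological distances.
        token_to_node: [batch, seq] long tensor, node index per token (-1 = plain text)."""
        b, h, s, _ = attn_scores.shape
        idx = token_to_node.clamp(min=0)
        # Expand node-level distances to token level by indexing rows, then columns.
        token_dist = node_dist.gather(1, idx[:, :, None].expand(b, s, node_dist.size(-1)))
        token_dist = token_dist.gather(2, idx[:, None, :].expand(b, s, s))
        # Closer nodes -> larger (less negative) bias; pure-text pairs get no bias.
        valid = (token_to_node[:, :, None] >= 0) & (token_to_node[:, None, :] >= 0)
        bias = -token_dist * valid
        return attn_scores + self.head_scale.view(1, h, 1, 1) * bias[:, None]
```

In a full model, a module like this would be called inside every self-attention layer of the backbone, just before the softmax over attention scores.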
A key advantage of TagaVLM is its ability to recover from navigation errors. In the example below, the agent initially moves in an incorrect direction due to the absence of visible landmarks. At Step 2, it leverages its spatial-topological understanding to recognize the mismatch with the instruction, performs a global action to backtrack to Node 1, and proceeds to the correct candidate node. The remaining steps successfully follow the instruction landmarks (black chairs → turn right → refrigerator) until the agent reaches the target and issues a stop decision.
Navigation case study. The agent detects an early incorrect decision, backtracks via global action reasoning, and successfully completes the navigation trajectory to the target destination.
We compare with both cross-modal-backbone methods and large-model-based methods on the R2R benchmark. TagaVLM surpasses all prior large-model-based approaches across every metric on both seen and unseen environments. Notably, our 0.5B model already exceeds most large-model methods—including those built on proprietary GPT-4/GPT-4V—demonstrating that the right architectural priors can compensate for orders-of-magnitude differences in model scale.
| Method | Backbone | TL (Val Seen) | NE↓ (Val Seen) | OSR↑ (Val Seen) | SR↑ (Val Seen) | SPL↑ (Val Seen) | TL (Val Unseen) | NE↓ (Val Unseen) | OSR↑ (Val Unseen) | SR↑ (Val Unseen) | SPL↑ (Val Unseen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HAMT | Cross-Modal Trans. | 11.15 | 2.52 | – | 76 | 72 | 11.46 | 2.29 | – | 66 | 61 |
| DUET | Cross-Modal Trans. | 12.32 | 2.28 | 86 | 79 | 73 | 13.94 | 3.31 | 81 | 72 | 60 |
| BEVBert | Cross-Modal Trans. | 13.56 | 1.67 | 88 | 81 | 74 | 14.55 | 2.81 | 84 | 75 | 64 |
| ScaleVLN | Cross-Modal Trans. | 13.24 | 2.12 | 87 | 81 | 75 | 14.09 | 2.09 | 88 | 81 | 70 |
| NavGPT | GPT-4* | – | – | – | – | – | 11.45 | 6.46 | 42 | 34 | 29 |
| LangNav | LLaMA2 (7B) | – | 7.4 | 40 | 32 | 28 | – | 7.1 | 45 | 34 | 29 |
| DiscussNav | GPT-4* | – | – | – | – | – | 9.69 | 5.32 | 43 | 36.40 | 40 |
| NavCoT | LLaMA2 (7B) | 10.08 | 6.46 | 48.38 | 41.33 | 38.43 | 9.95 | 6.26 | 48.11 | 40.23 | 36.64 |
| MapGPT | GPT-4V* | – | – | – | – | – | – | 5.62 | 57.9 | 47.7 | 38.1 |
| TagaVLM (ours) | Qwen2 (0.5B) | 10.08 | 5.23 | 60.03 | 53.48 | 50.4 | 9.8 | 5.57 | 55.09 | 45.72 | 41.91 |
| TagaVLM (ours) | Qwen2 (7B) | 10.16 | 4.71 | 64.15 | 55.53 | 53.05 | 9.7 | 4.97 | 60.2 | 51.09 | 47.18 |
* Proprietary models accessed via a black-box API. Cross-modal-backbone methods are listed for reference only; our comparison target is the large-model-based category.
We systematically ablate each component on TagaVLM-0.5B (val unseen). Key findings:
| | STAR-Att | INP | Global Action | Aug. Data | NE↓ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|---|---|
| (a) | ✗ | ✗ | ✗ | ✗ | 9.05 | 27.37 | 17.28 | 13.01 |
| (b) | ✓ | ✗ | ✗ | ✗ | 7.74 | 35.67 | 26.14 | 20.81 |
| (c) | ✓ | ✓ | ✗ | ✗ | 6.49 | 47.47 | 38.40 | 35.61 |
| (e) | ✓ | ✓ | ✓ | ✗ | 6.06 | 52.41 | 42.06 | 37.73 |
| (f) | ✓ | ✓ | ✓ | ✓ | 5.57 | 55.09 | 45.72 | 41.91 |
STAR-Att alone yields +8.86% SR over the vanilla fine-tuned baseline (a→b), confirming that explicit spatial priors are far more effective than relying on the model to learn topological relationships implicitly. Adding INP further improves SR by 12.26% (b→c), as the interleaved layout provides the contextual scaffolding that STAR-Att needs to apply spatial biases to the correct tokens. Global action reasoning contributes another +3.66% SR (c→e) through its backtracking capability. Together with augmented training data, the full system achieves a 28.44% absolute SR improvement over the baseline (a→f).
| | STAR-Att | Text-Based Map | SR↑ | SPL↑ |
|---|---|---|---|---|
| (a) | ✗ | ✗ | 39.76 | 35.67 |
| (b) | ✗ | ✓ | 40.70 | 36.92 |
| (c) | ✓ | ✗ | 42.06 | 37.73 |
Text-based topological descriptions (following MapGPT) improve SR by only 0.94% (a→b), whereas STAR-Att yields a 2.30% gain (a→c), confirming that architectural injection of spatial priors is significantly more effective than textual prompting.
Real-time navigation of TagaVLM in the Matterport3D simulator, demonstrating end-to-end instruction following, topological awareness, and path correction.
@inproceedings{liu2026tagavlm,
author = {Liu, Jiaxing and Zhang, Zexi and Li, Xiaoyan and Wang, Boyue and Hu, Yongli and Yin, Baocai},
title = {TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation},
booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
year = {2026},
}