TagaVLM: Topology-Aware Global Action Reasoning
for Vision-Language Navigation

ICRA 2026
Jiaxing Liu1,*, Zexi Zhang1,2,*, Xiaoyan Li1,†, Boyue Wang1, Yongli Hu1, Baocai Yin1
1Beijing University of Technology, 2Imperial College London
* Equal contribution  ·  † Corresponding author
Motivation of TagaVLM

Motivation. Previous methods (c) employ a two-stage pipeline that converts visual observations to text, losing crucial visual information. Our TagaVLM (b) is an end-to-end paradigm that preserves VLM pretraining knowledge while directly incorporating online topological map information, enabling global action decisions (a) with backtracking ability.

Abstract

Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to a fundamental architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which clash with the dynamic, embodied, and spatially structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities.

To bridge this gap, we propose TagaVLM (Topology-Aware Global Action Reasoning), an end-to-end framework that explicitly injects topological structure into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) integrates pairwise node distances directly into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the topological graph thus embedded, the model performs global action reasoning, allowing for robust path correction.

On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted architectural enhancements on smaller open-source VLMs can be more effective than brute-force model scaling.

Contributions

The core thesis of this work is that the gap between disembodied VLMs and embodied navigation can be most effectively closed not by scaling model parameters, but by injecting the right structural priors directly into the architecture. We contribute:

  1. An end-to-end topology-aware VLN framework. TagaVLM is the first to architecturally embed topological graph structures into a VLM backbone for vision-language navigation, bridging the disembodied–embodied gap without sacrificing pretrained knowledge.
  2. Two synergistic mechanisms for graph injection. The Interleaved Navigation Prompt (INP) structures the input sequence to mirror the graph's node layout, strengthening node-level visual-text alignment. The Spatial Topology Aware Residual Attention (STAR-Att) injects topological edge relationships directly into the self-attention layers as a learnable residual bias, enabling the model to reason over spatial structure while preserving its expressive capacity.
  3. Evidence that inductive bias matters more than scale. TagaVLM-0.5B, with architecturally injected topological priors, achieves competitive results against proprietary models orders of magnitude larger, while the 7B version sets a new state of the art—demonstrating that proper structural design is a compelling alternative to brute-force scaling for embodied spatial reasoning.

Method Overview

TagaVLM consists of four tightly coupled components. (1) An online topological map records the visual observations, node types, and pairwise distances as the agent explores. (2) The Interleaved Navigation Prompt inserts each node's visual tokens at the corresponding <image> placeholder in the textual prompt, so that visual features are contextually adjacent to their node descriptions. (3) STAR-Att replaces every self-attention layer in the LLM backbone, adding a per-head learnable bias derived from the topological distance matrix; closer nodes attend more strongly, even when their visual features differ. (4) A global action space over all observed candidate nodes enables the agent to select any reachable target—including backtracking—at every step.
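To make component (1) concrete, here is a minimal sketch of an online topological map; the class name, field names, and the BFS hop-distance computation are illustrative assumptions, not the authors' implementation:

```python
# Illustrative sketch (not the authors' code) of the online topological map:
# nodes carry observations and a type (visited / candidate), edges record
# traversability, and BFS hop distances fill the pairwise distance matrix.
from collections import deque

class TopoMap:
    def __init__(self):
        self.nodes = {}  # node_id -> {"obs": ..., "type": "visited" | "candidate"}
        self.adj = {}    # node_id -> set of neighbour node_ids

    def add_node(self, node_id, obs, node_type="candidate"):
        self.nodes[node_id] = {"obs": obs, "type": node_type}
        self.adj.setdefault(node_id, set())

    def add_edge(self, a, b):
        # Edges are undirected: the agent can traverse them in either direction.
        self.adj[a].add(b)
        self.adj[b].add(a)

    def distances_from(self, start):
        """Hop distance from `start` to every reachable node (BFS).
        One call per node yields the full pairwise distance matrix."""
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in self.adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist
```

The resulting pairwise distances are exactly the spatial signal that STAR-Att later injects into attention.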

TagaVLM Framework

Architecture of TagaVLM. The observation encoder maps RGB images to visual tokens. These are interleaved with textual node descriptions to form the INP. The LLM backbone, augmented with STAR-Att, fuses semantic and spatial information to produce a global action decision.
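The interleaving step of the INP can be sketched as follows; the function name and token placeholders are hypothetical, and serve only to show how each node's visual tokens end up adjacent to that node's description:

```python
# Hypothetical sketch of the Interleaved Navigation Prompt: each node's
# visual tokens are spliced in at its <image> slot so that vision and the
# matching textual node description sit next to each other in the sequence.
def build_inp(instruction, nodes):
    """nodes: list of (description, visual_tokens) pairs, one per map node."""
    parts = [instruction]
    for i, (desc, vis_tokens) in enumerate(nodes):
        parts.append(f"Node {i}:")
        parts.extend(vis_tokens)  # this node's visual tokens (the <image> slot)
        parts.append(desc)        # this node's textual description
    return parts
```

Because visual tokens are contextually adjacent to their descriptions, node-level visual-text alignment comes almost for free from the sequence layout.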

STAR-Att: Spatial Topology Aware Residual Attention

Standard self-attention treats all token pairs equally, regardless of their spatial relationship in the environment. STAR-Att addresses this by expanding the node-level pairwise distance matrix to the token level and adding it as a learnable, per-head residual bias to the attention scores. Tokens belonging to spatially closer nodes receive higher attention—even when their visual features are dissimilar—while the residual design preserves the pretrained semantic knowledge of the original attention. This provides a flexible inductive prior rather than a rigid constraint, allowing each attention head to independently calibrate the strength of spatial reasoning.
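The mechanism can be sketched in a few lines of numpy; this assumes the residual bias is a per-head scalar `alpha` times the negated token-level distance matrix, which is a simplification of whatever parameterisation the paper actually uses:

```python
# Minimal numpy sketch of the STAR-Att idea (a simplified assumption, not
# the paper's exact parameterisation): topological node distances are
# expanded to token level and subtracted from the attention scores, so
# tokens of spatially closer nodes attend to each other more strongly.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def star_attention(Q, K, V, node_of_token, node_dist, alpha=1.0):
    """Q, K, V: (T, d) token matrices.
    node_of_token: (T,) node index of each token.
    node_dist: (N, N) pairwise node distances from the topological map.
    alpha: learnable per-head scale of the spatial residual bias."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # standard attention
    D = node_dist[np.ix_(node_of_token, node_of_token)]  # node -> token level
    attn = softmax(scores - alpha * D)                   # residual spatial bias
    return attn @ V, attn
```

With identical visual features, attention now falls off with topological distance; with `alpha` near zero a head recovers vanilla attention, which is the "flexible prior, not rigid constraint" property described above.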

Visualization of the STAR-Att mechanism: topological distance information is injected as a residual attention bias, enabling spatially-aware reasoning across the navigation graph.

Qualitative Analysis: Path Correction via Global Action Reasoning

A key advantage of TagaVLM is its ability to recover from navigation errors. In the example below, the agent initially moves in an incorrect direction due to the absence of visible landmarks. At Step 2, it leverages its spatial-topological understanding to recognize the mismatch with the instruction, performs a global action to backtrack to Node 1, and proceeds to the correct candidate node. The remaining steps successfully follow the instruction landmarks (black chairs → turn right → refrigerator) until the agent reaches the target and issues a stop decision.
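The backtracking behaviour follows directly from scoring every observed candidate node rather than only the locally adjacent ones; a toy sketch (illustrative only, with hypothetical names) of that selection step:

```python
# Toy sketch of global action selection (an illustration, not the authors'
# implementation): the agent scores every observed candidate node in the
# topological map, so the best-scoring node may lie behind it, in which
# case executing the action amounts to backtracking through visited nodes.
def select_global_action(scores, node_types):
    """scores: node_id -> model score; node_types: node_id -> 'visited' | 'candidate'.
    Returns the highest-scoring candidate node anywhere in the graph."""
    candidates = {n: s for n, s in scores.items() if node_types[n] == "candidate"}
    return max(candidates, key=candidates.get)
```

In the case study above, Step 2 corresponds to this argmax landing on a candidate adjacent to Node 1, behind the agent's current position.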

Navigation case study. The agent detects an early incorrect decision, backtracks via global action reasoning, and successfully completes the navigation trajectory to the target destination.

Quantitative Results on R2R

We compare with both cross-modal-backbone methods and large-model-based methods on the R2R benchmark. TagaVLM surpasses all prior large-model-based approaches across every metric on both seen and unseen environments. Notably, our 0.5B model already exceeds most large-model methods—including those built on proprietary GPT-4/GPT-4V—demonstrating that the right architectural priors can compensate for orders-of-magnitude differences in model scale.

Method          | Backbone           | Val Seen: TL / NE↓ / OSR↑ / SR↑ / SPL↑ | Val Unseen: TL / NE↓ / OSR↑ / SR↑ / SPL↑
HAMT            | Cross-Modal Trans. | 11.15 / 2.52 / – / 76 / 72             | 11.46 / 2.29 / – / 66 / 61
DUET            | Cross-Modal Trans. | 12.32 / 2.28 / 86 / 79 / 73            | 13.94 / 3.31 / 81 / 72 / 60
BEVBert         | Cross-Modal Trans. | 13.56 / 1.67 / 88 / 81 / 74            | 14.55 / 2.81 / 84 / 75 / 64
ScaleVLN        | Cross-Modal Trans. | 13.24 / 2.12 / 87 / 81 / 75            | 14.09 / 2.09 / 88 / 81 / 70
NavGPT          | GPT-4*             | –                                      | 11.45 / 6.46 / 42 / 34 / 29
LangNav         | LLaMA2 (7B)        | – / 7.4 / 40 / 32 / 28                 | – / 7.1 / 45 / 34 / 29
DiscussNav      | GPT-4*             | –                                      | 9.69 / 5.32 / 43 / 36.40 / 40
NavCoT          | LLaMA2 (7B)        | 10.08 / 6.46 / 48.38 / 41.33 / 38.43   | 9.95 / 6.26 / 48.11 / 40.23 / 36.64
MapGPT          | GPT-4V*            | –                                      | – / 5.62 / 57.9 / 47.7 / 38.1
TagaVLM (ours)  | Qwen2 (0.5B)       | 10.08 / 5.23 / 60.03 / 53.48 / 50.4    | 9.8 / 5.57 / 55.09 / 45.72 / 41.91
TagaVLM (ours)  | Qwen2 (7B)         | 10.16 / 4.71 / 64.15 / 55.53 / 53.05   | 9.7 / 4.97 / 60.2 / 51.09 / 47.18

* Proprietary models accessed via black-box API. Cross-modal methods (gray) are shown for reference; our comparison target is the large-model category.
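For reference, the reported metrics follow the standard VLN definitions (Anderson et al.), which are conventions of the benchmark rather than anything specific to this paper. NE is navigation error (distance from the stop location to the goal), SR is the fraction of episodes stopping within the success threshold, OSR is the oracle variant of SR, and SPL is success weighted by inverse path length:

```latex
% Standard success-weighted-by-path-length metric (Anderson et al.);
% S_i = success indicator, \ell_i = shortest-path length, p_i = taken path length.
\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\ \ell_i)}
```

Since \(S_i \le 1\) and \(\ell_i / \max(p_i, \ell_i) \le 1\), SPL is always bounded above by SR, which is why the SR–SPL gap measures path efficiency.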

Success Rate (Unseen): 51.09% (+3.39% vs. MapGPT)
SPL (Unseen): 47.18 (+9.08 vs. MapGPT)
Smallest effective model: 0.5B (outperforms GPT-4-based methods)

Ablation Highlights

We systematically ablate each component on TagaVLM-0.5B (val unseen). Key findings:

    | STAR-Att | INP | Global Action | Aug. Data | NE↓  | OSR↑  | SR↑   | SPL↑
(a) | –        | –   | –             | –         | 9.05 | 27.37 | 17.28 | 13.01
(b) | ✓        | –   | –             | –         | 7.74 | 35.67 | 26.14 | 20.81
(c) | ✓        | ✓   | –             | –         | 6.49 | 47.47 | 38.40 | 35.61
(e) | ✓        | ✓   | ✓             | –         | 6.06 | 52.41 | 42.06 | 37.73
(f) | ✓        | ✓   | ✓             | ✓         | 5.57 | 55.09 | 45.72 | 41.91

STAR-Att alone yields +8.86% SR over the vanilla fine-tuned baseline (a→b), confirming that explicit spatial priors are far more effective than relying on the model to learn topological relationships implicitly. Adding INP further improves SR by 12.26% (b→c), as the interleaved layout provides the contextual scaffolding that STAR-Att needs to apply spatial biases to the correct tokens. Global action reasoning contributes another +3.66% SR (c→e) through its backtracking capability. Together, the full system achieves 28.44% absolute SR improvement over the baseline.

STAR-Att vs. Text-Based Topological Map

    | STAR-Att | Text-Based Map | SR↑   | SPL↑
(a) | –        | –              | 39.76 | 35.67
(b) | –        | ✓              | 40.70 | 36.92
(c) | ✓        | –              | 42.06 | 37.73

Text-based topological descriptions (following MapGPT) improve SR by only +0.94%, while STAR-Att provides +2.30%—confirming that architectural injection of spatial priors is significantly more effective than textual prompting.

Demo Video

Real-time navigation of TagaVLM in the Matterport3D simulator, demonstrating end-to-end instruction following, topological awareness, and path correction.

BibTeX

@inproceedings{liu2026tagavlm,
  author    = {Liu, Jiaxing and Zhang, Zexi and Li, Xiaoyan and Wang, Boyue and Hu, Yongli and Yin, Baocai},
  title     = {TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
}