
Localization Grounded Navigation Policy with Metric-aware Visual Geometry

1Tsinghua University.
2Shanghai AI Laboratory
*equal contribution
This is an improved version of our prior work NavDP: aka NavDP--

Abstract

Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error propagation. We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while metric-aware geometry memory enhances planning consistency and obstacle avoidance, leading to more than a 27.3% improvement over oracle-localization baselines and strong generalization across embodiments and environments.

Localization Grounded Planner 

(a) Traditional modular planners decompose tasks into modules, introducing cascading errors.
(b) Existing end-to-end frameworks directly map observations to control signals but still rely on explicit localization modules.
(c) LoGoPlanner integrates implicit state estimation and metric-aware geometry perception into the policy for fully end-to-end planning.

LoGoPlanner 

Our proposed LoGoPlanner achieves stronger robustness by jointly incorporating ego-state information and multi-frame geometric reconstruction. This design ensures greater consistency in trajectory generation while providing richer spatial perception, which in turn enhances obstacle avoidance and overall navigation performance.

LoGoPlanner Fully E2E Planner

LoGoPlanner injects scale priors into the image patches encoded by a ViT and finetunes the video geometry model for metric-scale prediction. We adopt a query-based design in which the ego-state representation and the environment geometry are implicitly aggregated through task-specific queries. A diffusion policy head is detached to generate feasible and collision-free trajectories.
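The query-based aggregation described above can be illustrated with a minimal sketch, assuming single-head cross-attention with identity key/value projections. The token values, the two query vectors, and the names `cross_attend`, `state_query`, and `geometry_query` are hypothetical and not taken from the LoGoPlanner implementation; a real model would use learned projections and many more tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """Pool a list of token vectors with one query (scaled dot-product attention)."""
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim) for tok in tokens
    ]
    weights = softmax(scores)
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return pooled, weights


# Hypothetical 4-d patch tokens standing in for backbone features.
patch_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Hypothetical task-specific queries: one for ego state, one for scene geometry.
state_query = [4.0, 0.0, 0.0, 0.0]
geometry_query = [0.0, 0.0, 4.0, 0.0]

state_feat, state_w = cross_attend(state_query, patch_tokens)
geom_feat, geom_w = cross_attend(geometry_query, patch_tokens)

# The concatenated features stand in for the conditioning passed to a policy head.
policy_condition = state_feat + geom_feat
```

Each query attends most strongly to the token it aligns with, so the two pooled vectors summarize different aspects of the same token set. That is the general idea behind task-specific queries, shown here without any learned weights.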

Metric-aware Visual Geometry Learning 

Visualization of reconstruction results:
The first row shows the scene point cloud of the ground truth, and the second row shows the predicted scene point cloud.
The metric-scale point cloud is predicted in a coordinate frame whose origin is the chassis at the last frame.
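To compare a prediction expressed in the last-frame chassis frame against ground truth stored in a world frame, the ground-truth points have to be moved into the same frame. The sketch below shows one way to do that, assuming planar motion (yaw only) and an x-forward, y-left chassis convention. Both are assumptions for illustration, and `world_to_chassis` is a hypothetical helper.

```python
import math


def world_to_chassis(points_world, chassis_xyz, chassis_yaw):
    """Express world-frame points in the chassis frame: p_c = R(yaw)^T (p_w - t)."""
    cx, cy, cz = chassis_xyz
    c, s = math.cos(chassis_yaw), math.sin(chassis_yaw)
    points_chassis = []
    for x, y, z in points_world:
        dx, dy, dz = x - cx, y - cy, z - cz
        # Apply the transpose of the yaw rotation to undo the chassis heading.
        points_chassis.append((c * dx + s * dy, -s * dx + c * dy, dz))
    return points_chassis


# Hypothetical last-frame pose: chassis at (2, 1, 0) m, heading along world +y.
last_pose_xyz = (2.0, 1.0, 0.0)
last_pose_yaw = math.pi / 2

world_points = [
    (2.0, 3.0, 0.5),  # 2 m straight ahead of the chassis, 0.5 m above it
    (2.0, 1.0, 0.0),  # the chassis origin itself
]
local_points = world_to_chassis(world_points, last_pose_xyz, last_pose_yaw)
```

Here the first point lands at roughly (2, 0, 0.5) in the chassis frame and the second at the origin. Distances remain in meters under this rigid transform.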

Localization Grounded Navigation Policy 

Visualization of planning and localization results:
The first row shows the planning trajectories: black trajectories are the ground truth, yellow and purple trajectories are predictions, and green stars mark the goals.
The second row shows the localization results, where blue dots represent the ground truth localization, red dots represent the predicted localization, green arrows represent the ground truth goal points and gray arrows represent predicted goal points.
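The link between localization error and goal error in the second row can be sketched as follows. This assumes the goal is given in the start frame and is re-expressed in the robot's current frame using a planar pose (x, y, yaw). The poses, the goal, and the helper names are hypothetical values chosen for illustration.

```python
import math


def goal_in_current_frame(goal_xy, pose):
    """Re-express a start-frame goal in the robot frame given pose (x, y, yaw)."""
    x, y, yaw = pose
    dx, dy = goal_xy[0] - x, goal_xy[1] - y
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * dx + s * dy, -s * dx + c * dy)


def position_error(pred_pose, gt_pose):
    """Euclidean distance between predicted and ground-truth positions."""
    return math.hypot(pred_pose[0] - gt_pose[0], pred_pose[1] - gt_pose[1])


goal = (5.0, 0.0)            # hypothetical goal in the start frame, in meters
gt_pose = (3.0, 0.0, 0.0)    # hypothetical ground-truth pose
pred_pose = (3.0, 0.3, 0.0)  # hypothetical predicted pose, 0.3 m lateral offset

gt_goal = goal_in_current_frame(goal, gt_pose)
pred_goal = goal_in_current_frame(goal, pred_pose)
err = position_error(pred_pose, gt_pose)
```

With these numbers a 0.3 m lateral localization error shifts the perceived goal by the same 0.3 m in the robot frame. This illustrates why the plots show predicted and ground-truth goal points alongside the localization estimates.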

Visualization of ESDF results:
We also predict the ESDF from the predicted scene point clouds by projecting the in-sight pixel embeddings onto the BEV plane and constructing voxel features.
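For reference, an ESDF (Euclidean signed distance field) can also be computed geometrically from a point cloud. The sketch below is not the learned prediction described above. It is a brute-force reference computation on a small BEV occupancy grid, assuming the sign convention "positive in free space, negative inside obstacles". The grid size, resolution, and function names are hypothetical.

```python
import math


def occupancy_from_points(points_xy, size, resolution):
    """Mark BEV cells that contain at least one projected point."""
    grid = [[False] * size for _ in range(size)]
    for x, y in points_xy:
        col = int(math.floor(x / resolution))
        row = int(math.floor(y / resolution))
        if 0 <= row < size and 0 <= col < size:
            grid[row][col] = True
    return grid


def esdf(grid, resolution):
    """Brute-force signed distance in meters, measured between cell centers."""
    rows, cols = len(grid), len(grid[0])
    occupied = [(r, c) for r in range(rows) for c in range(cols) if grid[r][c]]
    free = [(r, c) for r in range(rows) for c in range(cols) if not grid[r][c]]
    field = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Free cells measure to the nearest obstacle, occupied cells to free space.
            targets = free if grid[r][c] else occupied
            if targets:
                dist = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
                dist *= resolution
            else:
                dist = math.inf
            field[r][c] = -dist if grid[r][c] else dist
    return field


# Hypothetical 5x5 grid with 0.5 m cells and a single obstacle point.
grid = occupancy_from_points([(1.2, 1.2)], size=5, resolution=0.5)
field = esdf(grid, resolution=0.5)
```

The brute-force search is quadratic in the number of cells, so it only suits small grids; it is written this way to keep the definition explicit.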

LoGoPlanner Benchmark

The InternScenes benchmark includes 20 home scenes and 20 commercial scenes. Home scenes are characterized by narrow passages and cluttered semantic layouts, while commercial scenes cover representative categories such as hospitals, supermarkets, restaurants, schools, libraries, and offices. In each scene, 100 start-goal pairs are randomly sampled in unoccupied space at distances of 4-10 meters, and initial orientations are determined through path planning to avoid collisions.
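The start-goal sampling described above can be approximated with simple rejection sampling. The sketch below assumes a 2D occupancy grid and interprets the 4-10 meter range as straight-line distance, which the text does not specify. It also omits choosing the initial orientation by path planning. The map, resolution, and `sample_pairs` helper are hypothetical.

```python
import math
import random


def sample_pairs(free_cells, resolution, num_pairs,
                 min_dist=4.0, max_dist=10.0, seed=0, max_tries=100000):
    """Rejection-sample (start, goal) cell pairs whose distance is within range."""
    rng = random.Random(seed)
    pairs = []
    tries = 0
    while len(pairs) < num_pairs and tries < max_tries:
        tries += 1
        start, goal = rng.sample(free_cells, 2)
        # Straight-line distance between cell centers, converted to meters.
        dist = math.dist(start, goal) * resolution
        if min_dist <= dist <= max_dist:
            pairs.append((start, goal))
    return pairs


# Hypothetical 10 m x 10 m room at 0.25 m per cell with a wall across the middle.
size, resolution = 40, 0.25
wall = {(20, c) for c in range(5, 35)}
free_cells = [
    (r, c) for r in range(size) for c in range(size) if (r, c) not in wall
]

pairs = sample_pairs(free_cells, resolution, num_pairs=100)
```

A fixed seed makes the sampled episodes reproducible, and `max_tries` keeps the loop from running forever on maps where few pairs satisfy the distance constraint.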

LoGoPlanner Dataset 

Visualization of trajectory endpoints in the LoGoPlanner dataset.

BibTeX

@misc{logoplanner,
        title = {LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry},
        author = {Jiaqi Peng and Wenzhe Cai and Yuqiang Yang and Tai Wang and Yuan Shen and Jiangmiao Pang},
        year = {2025},
        howpublished = {arXiv},
}