LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry

Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error propagation. We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while metric-aware geometry memory enhances planning consistency and obstacle avoidance, leading to more than a 27.3% improvement over oracle-localization baselines and strong generalization across embodiments and environments.

(a) Traditional modular planners decompose tasks into modules, introducing cascading errors.
(b) Existing end-to-end frameworks directly map observations to control signals but still rely on explicit localization modules.
(c) LoGoPlanner integrates implicit state estimation and metric-aware geometry perception into the policy for fully end-to-end planning.

Our proposed LoGoPlanner achieves stronger robustness by jointly incorporating ego-state information and multi-frame geometric reconstruction. This design ensures greater consistency in trajectory generation while providing richer spatial perception, which in turn enhances obstacle avoidance and overall navigation performance.

LoGoPlanner injects scale priors into the image patches encoded by the ViT and finetunes the video geometry model to predict at metric scale. We adopt a query-based design in which the ego-state representation and the environment geometry are implicitly aggregated through task-specific queries. A diffusion policy head is attached to generate feasible, collision-free trajectories.
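The query-based aggregation described above can be illustrated with a minimal sketch of attention pooling, in which each task-specific query weights the same set of patch tokens differently. This is not the LoGoPlanner implementation: the function name `attention_pool`, the toy 2-D tokens, and the two query vectors are hypothetical, and the learned key/value projections of a real transformer are omitted.

```python
import math

def attention_pool(query, tokens):
    """Pool token vectors into one vector with a single query.

    Scaled dot-product attention without learned projections
    (an assumption made to keep the sketch short):
    weights = softmax(q . t / sqrt(d)), output = sum_i w_i * t_i.
    """
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    # Subtract the max score for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return pooled, weights

# Hypothetical toy inputs: three patch tokens and two task-specific queries.
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state_query = [4.0, 0.0]      # stands in for an ego-state query
geometry_query = [0.0, 4.0]   # stands in for a scene-geometry query

state_feat, state_w = attention_pool(state_query, patch_tokens)
geom_feat, geom_w = attention_pool(geometry_query, patch_tokens)
```

Both queries read the same tokens, but each concentrates its weights on the tokens aligned with it, which is the sense in which the aggregation is task-specific.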

Visualization of reconstruction results:
The first row shows the scene point cloud of the ground truth, and the second row shows the predicted scene point cloud.
The point cloud at the metric scale is predicted with the chassis of the last frame as the coordinate origin.
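To make the coordinate convention concrete, the sketch below re-expresses world-frame points in the frame of the chassis at the last time step. It assumes a planar chassis pose `(x, y, yaw)` with x pointing forward, which is a simplification introduced here; the function name `world_to_chassis` is hypothetical and the model itself predicts points in this frame directly rather than transforming them.

```python
import math

def world_to_chassis(points_world, chassis_pose):
    """Express world-frame points in the chassis frame.

    chassis_pose is (x, y, yaw) of the chassis in the world frame, in meters
    and radians. Each point p maps to R(yaw)^T * (p - t); z is left unchanged
    because the pose is assumed planar.
    """
    x0, y0, yaw = chassis_pose
    c, s = math.cos(yaw), math.sin(yaw)
    points_chassis = []
    for x, y, z in points_world:
        dx, dy = x - x0, y - y0
        points_chassis.append((c * dx + s * dy, -s * dx + c * dy, z))
    return points_chassis

# A point one meter in front of a chassis that faces the world +y direction.
example = world_to_chassis([(1.0, 3.0, 0.5)], (1.0, 2.0, math.pi / 2))
```

With the last-frame chassis as origin, distances in the resulting point cloud read directly as metric offsets from the robot.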

Visualization of planning and localization results:
The first row shows the planning trajectories: black trajectories are ground truth, yellow and purple trajectories are predictions, and green stars mark the goals.
The second row shows the localization results, where blue dots represent the ground truth localization, red dots represent the predicted localization, green arrows represent the ground truth goal points and gray arrows represent predicted goal points.

Visualization of ESDF results:
We also predict the ESDF from the predicted scene point clouds by projecting the in-sight pixel embeddings onto the BEV plane and constructing voxel features.
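For readers unfamiliar with the target quantity, the sketch below computes a 2-D Euclidean signed distance field on a BEV grid from a point cloud. It is a classical geometric computation, not the learned prediction described above: the functions `bev_occupancy` and `esdf_2d`, the brute-force nearest-cell search, and the sign convention (positive in free space, negative inside obstacles) are all assumptions chosen for illustration.

```python
import math

def bev_occupancy(points, x_range, y_range, resolution):
    """Rasterize (x, y, z) points into a boolean BEV grid; z is ignored."""
    nx = int(round((x_range[1] - x_range[0]) / resolution))
    ny = int(round((y_range[1] - y_range[0]) / resolution))
    grid = [[False] * ny for _ in range(nx)]
    for x, y, _z in points:
        i = int(math.floor((x - x_range[0]) / resolution))
        j = int(math.floor((y - y_range[0]) / resolution))
        if 0 <= i < nx and 0 <= j < ny:
            grid[i][j] = True
    return grid

def esdf_2d(grid, resolution):
    """Brute-force signed distance in meters between cell centers.

    Free cells get +distance to the nearest occupied cell; occupied cells get
    -distance to the nearest free cell. Quadratic cost, so small grids only.
    """
    nx, ny = len(grid), len(grid[0])
    cells = [(i, j) for i in range(nx) for j in range(ny)]
    occupied = [c for c in cells if grid[c[0]][c[1]]]
    free = [c for c in cells if not grid[c[0]][c[1]]]
    esdf = [[0.0] * ny for _ in range(nx)]
    for i, j in cells:
        sign = -1.0 if grid[i][j] else 1.0
        others = free if grid[i][j] else occupied
        if not others:
            esdf[i][j] = sign * math.inf
            continue
        nearest = min(math.hypot(i - a, j - b) for a, b in others)
        esdf[i][j] = sign * nearest * resolution
    return esdf

# One obstacle point in the middle of a 2.5 m x 2.5 m area at 0.5 m resolution.
grid = bev_occupancy([(1.2, 1.2, 0.3)], (0.0, 2.5), (0.0, 2.5), 0.5)
field = esdf_2d(grid, 0.5)
```

A planner can threshold such a field against the robot radius to reject waypoints that pass too close to obstacles.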

InternScenes Benchmark includes 20 home and 20 commercial scenes. Home scenes are characterized by narrow passages and cluttered semantic layouts, while commercial scenes cover representative categories such as hospitals, supermarkets, restaurants, schools, libraries, and offices. In each scene, 100 start-goal pairs are randomly sampled in unoccupied spaces with distances of 4-10 meters, and initial orientations are determined through path planning to avoid collisions.
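The sampling protocol can be sketched as rejection sampling over free cells of an occupancy grid. This is an illustrative reconstruction rather than the benchmark's actual tooling: the function `sample_start_goal_pairs`, the grid representation, and the use of straight-line (Euclidean) distance for the 4-10 meter constraint are assumptions, and the path-planning step that sets initial orientations is not covered.

```python
import math
import random

def sample_start_goal_pairs(free_cells, resolution, n_pairs,
                            d_min=4.0, d_max=10.0, seed=0, max_tries=100000):
    """Rejection-sample (start, goal) positions in meters from free grid cells.

    A pair is kept when its Euclidean distance lies in [d_min, d_max].
    Returns fewer than n_pairs if max_tries is exhausted first.
    """
    rng = random.Random(seed)
    pairs = []
    tries = 0
    while len(pairs) < n_pairs and tries < max_tries:
        tries += 1
        (i0, j0), (i1, j1) = rng.sample(free_cells, 2)
        start = (i0 * resolution, j0 * resolution)
        goal = (i1 * resolution, j1 * resolution)
        distance = math.hypot(goal[0] - start[0], goal[1] - start[1])
        if d_min <= distance <= d_max:
            pairs.append((start, goal))
    return pairs

# Hypothetical empty 15 m x 15 m scene at 0.5 m resolution.
free_cells = [(i, j) for i in range(30) for j in range(30)]
pairs = sample_start_goal_pairs(free_cells, 0.5, n_pairs=100)
```

Fixing the seed makes the sampled pairs reproducible, so every method is evaluated on the same episodes.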

Visualization of trajectory endpoints in the LoGoPlanner dataset.

BibTeX

@misc{logoplanner,
        title = {LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry},
        author = {Jiaqi Peng and Wenzhe Cai and Yuqiang Yang and Tai Wang and Yuan Shen and Jiangmiao Pang},
        year = {2025},
        booktitle={arXiv},
    }

Goal Reaching without Odometry: speed x1

Cross Embodiments: speed x1

Dynamic Avoidance: speed x1

Safe passage in confined spaces: speed x1