publications | Jiaqi Peng

2025

LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry

Jiaqi Peng, Wenzhe Cai, Yuqiang Yang, and 3 more authors

arXiv, 2025

Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error propagation. We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while its metric-aware geometry memory enhances planning consistency and obstacle avoidance, yielding an improvement of more than 27.3% over oracle-localization baselines and strong generalization across embodiments and environments.
Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation

Meng Wei, Chenyang Wan, Jiaqi Peng, and 8 more authors

arXiv, 2025

While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, “grounds slowly” by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, “moves fast” by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
InternVLA-N1: An Open Dual-System Vision-Language Navigation Foundation Model with Learned Latent Plans

Shanghai AI Laboratory Intern Robotics

arXiv, 2025

We introduce InternVLA-N1, the first open dual-system vision-language navigation foundation model. Unlike previous navigation foundation models that can only take short-term actions from a limited discrete space, InternVLA-N1 decouples the task into pixel-goal planning with System 2 and agile execution with System 1. A two-stage curriculum training paradigm is devised for this framework: first, the two systems are pretrained with explicit pixel goals as supervision or condition. Subsequently, we freeze System 2 and finetune the newly added latent plans with System 1 in an asynchronous end-to-end manner. Using latent plans as the intermediate representation removes the ambiguity of pixel-goal planning and opens new possibilities for extending pretraining with video prediction. To enable scalable training, we develop an efficient navigation data generation pipeline and introduce InternData-N1, the largest navigation dataset to date. InternData-N1 comprises over 50 million egocentric images collected from more than 3,000 scenes, amounting to 4,839 kilometers of robot navigation experience. We evaluate InternVLA-N1 across 6 challenging navigation benchmarks, where it consistently achieves state-of-the-art performance, with improvements ranging from 3% to 28%. In particular, it demonstrates synergistic integration of long-horizon planning (>150m) and real-time decision-making (>30Hz) capabilities and generalizes zero-shot across diverse embodiments (wheeled, quadruped, humanoid) and in-the-wild environments. All code, models, and datasets are publicly available.
NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance

Wenzhe Cai, Jiaqi Peng, Yuqiang Yang, and 6 more authors

arXiv, 2025

We present a sim-to-real navigation diffusion policy that can achieve cross-embodiment generalization in dynamic, cluttered and diverse real-world scenarios.
Towards Latency-Aware 3D Streaming Perception for Autonomous Driving

Jiaqi Peng, Tai Wang, Jiangmiao Pang, and 1 more author

2025 IEEE International Conference on Robotics and Automation (ICRA), 2025

Although existing 3D perception algorithms have demonstrated significant improvements in performance, their deployment on edge devices continues to encounter critical challenges due to substantial runtime latency. We propose a new benchmark tailored for online evaluation that takes runtime latency into account. Based on the benchmark, we build a Latency-Aware 3D Streaming Perception (LASP) framework that addresses the latency issue through two primary components: 1) latency-aware history integration, which extends query propagation into a continuous process, ensuring the integration of historical features regardless of varying latency; 2) latency-aware predictive detection, a module that compensates the detection results with the predicted trajectory and the posterior accessed latency. By incorporating the latency-aware mechanism, our method generalizes across various latency levels, achieving online performance close to 80% of its offline performance on the Jetson AGX Orin without any acceleration techniques.

2022

Optimizing the FairMOT Tracking Algorithm with Spatio-Temporal Consistency (结合时空一致性的 FairMOT 跟踪算法优化)

彭嘉淇, 王涛, 陈柯安, and 1 more author

Journal of Image and Graphics (中国图象图形学报), 2022

Objective: Video-based multiple object tracking is an essential computer vision task, with applications such as autonomous driving and intelligent video surveillance. Most multiple object tracking methods first obtain object detection results and then use an association strategy to link the detected bounding boxes into object trajectories. Object detection has advanced rapidly in recent years, but several inconsistency problems in multiple object tracking remain unresolved and reduce tracking accuracy. These inconsistencies fall into three types. 1) Inconsistency between the centers of the object bounding boxes and the object identity features. Many multiple object tracking methods extract object re-identification (ReID) features at the bounding box centers and use these features to associate objects. However, because of occlusion, ReID features extracted at the center cannot always reflect the appearance of an object accurately, and the best ReID feature extraction position is offset from the bounding box center. The current feature extraction strategy therefore leads to a spatial consistency problem. 2) Inconsistency of the object center response between consecutive frames. Because of occlusion in videos, some objects are detected and tracked in only some of the frames, so they are lost across consecutive frames and the object center response heatmaps of two consecutive frames become inconsistent. 3) Inconsistency of the similarity assessment between the training process and the testing process. Most methods treat the association step as a classification problem and train the model with a cross-entropy loss, whereas in the testing process the inter-object relations are ignored and objects are associated using the cosine similarity between the features of each pair of objects. To improve tracking accuracy, we propose a multiple object tracking method based on consistency optimization.