SurroundNEXO: Ego-Centric Metric Bridging for Spatially Consistent Geometry in Autonomous Driving

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

SurroundNEXO is a metric depth framework for low-overlap multi-camera rigs. It targets 3D understanding in autonomous driving by grounding cross-view reasoning in ego-centric geometry rather than in dense visual correspondences. Designed for vehicle-mounted surround-view cameras with limited visual overlap, it works in three stages. First, Ego-Ray Positional Encoding assigns each image token a viewing direction in the ego frame, so directions are comparable across all cameras. Second, sparse LiDAR measurements serve as metric anchors that propagate absolute scale cues. Third, feature interaction expands progressively from view-local modeling to decomposed spatio-temporal reasoning and then to global integration. The result is metric-scale depth prediction with improved spatial consistency. Across the nuScenes, Waymo, and DDAD benchmarks, SurroundNEXO reduces single-view error by 33.2%, improves cross-view consistency by 10.5%, and improves metric reconstruction quality by 25.6% relative to state-of-the-art methods. It also remains robust under sparse depth prompts and shows strong zero-shot generalization.
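To make the "sparse LiDAR as metric anchors" idea concrete, here is a minimal sketch of how a sparse metric depth map could be built for one camera by projecting ego-frame LiDAR points through a pinhole model. This is standard projection geometry, not the authors' implementation. The function name `sparse_depth_from_lidar`, its parameters, and the pinhole and ego-frame conventions are assumptions made for illustration only.

```python
import numpy as np

def sparse_depth_from_lidar(points_ego, K, R_cam_to_ego, t_cam_in_ego, height, width):
    """Project ego-frame LiDAR points into one camera (hypothetical helper).

    Returns an (height, width) array holding metric depth at pixels hit by a
    LiDAR point and 0 elsewhere. Assumes an undistorted pinhole camera.
    """
    # Ego -> camera: p_cam = R^T (p_ego - t), written for row vectors.
    pts_cam = (points_ego - t_cam_in_ego) @ R_cam_to_ego
    z = pts_cam[:, 2]

    # Keep only points in front of the camera.
    front = z > 1e-6
    pts_cam, z = pts_cam[front], z[front]

    # Perspective projection to pixel coordinates.
    uvw = pts_cam @ K.T
    cols = np.floor(uvw[:, 0] / z).astype(int)
    rows = np.floor(uvw[:, 1] / z).astype(int)

    # Drop points that land outside the image.
    inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    rows, cols, z = rows[inside], cols[inside], z[inside]

    # When several points hit the same pixel, keep the nearest one.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (rows, cols), z)
    depth[np.isinf(depth)] = 0.0
    return depth
```

The resulting map is non-zero only at a small fraction of pixels. How SurroundNEXO propagates those anchors to the remaining pixels is not described in this summary, so the sketch stops at building the sparse map.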

Key takeaway

For computer vision engineers building autonomous driving perception systems, SurroundNEXO offers an approach to accurate multi-camera metric depth prediction. If you are struggling with spatial consistency across low-overlap camera views, consider its combination of ego-centric geometry and sparse LiDAR anchoring. The reported gains over state-of-the-art methods are 25.6% in metric reconstruction quality and 10.5% in cross-view consistency, which points toward more reliable 3D understanding and planning.

Key insights

SurroundNEXO bridges low-overlap multi-camera views for metric depth prediction using ego-centric geometry and sparse LiDAR anchoring.

Principles

Method

Assign image tokens ego-frame viewing directions via Ego-Ray Positional Encoding, use sparse LiDAR as metric anchors, then expand feature interaction from view-local to spatio-temporal and global.
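The first step above can be sketched as follows: compute one unit viewing direction per image token in the shared ego frame, then lift it to a higher-dimensional feature. The back-projection and rotation are standard pinhole geometry. The sinusoidal lifting in `fourier_encode` is a common positional-encoding choice swapped in here as an assumption, since this summary does not say how the paper encodes the rays. All names, the patch size, and the frequency count are hypothetical.

```python
import numpy as np

def ego_ray_directions(K, R_cam_to_ego, height, width, patch=16):
    """Unit viewing direction in the ego frame for each token (patch centre)."""
    # Pixel coordinates of patch centres.
    us = np.arange(patch / 2, width, patch)
    vs = np.arange(patch / 2, height, patch)
    u, v = np.meshgrid(us, vs)                          # each (H_t, W_t)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # homogeneous pixels

    # Back-project through the intrinsics, then rotate into the ego frame.
    rays_cam = pix @ np.linalg.inv(K).T
    rays_ego = rays_cam @ R_cam_to_ego.T
    return rays_ego / np.linalg.norm(rays_ego, axis=-1, keepdims=True)

def fourier_encode(dirs, num_freqs=4):
    """Sinusoidal lifting of 3D directions (assumed form, not from the source)."""
    freqs = 2.0 ** np.arange(num_freqs)                 # (F,)
    scaled = dirs[..., None] * freqs                    # (..., 3, F)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*dirs.shape[:-1], -1)            # (..., 3 * 2F)
```

Because every camera's rays are expressed in the same ego frame, tokens from two cameras that look toward the same part of the scene receive similar encodings even when the images share no pixels, which is the property the method relies on for low-overlap rigs.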

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.