RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
Summary
RADIO-ViPE (Reduce All Domains Into One — Video Pose Engine) is an online semantic SLAM system designed for geometry-aware, open-vocabulary grounding in dynamic environments. Unlike conventional methods, it operates directly on raw monocular RGB video streams, eliminating the need for camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal vision and language embeddings from agglomerative foundation models like RADIO with geometric scene information during initialization, optimization, and factor graph connections. Its optimization framework uses adaptive robust kernels to handle both actively moving objects and agent-displaced scene elements. Experiments show RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark and competitive performance against offline open-vocabulary methods that assume calibrated data and static scenes, bridging a critical gap for real-world robotic deployments.
Key takeaway
For computer vision engineers building autonomous robots or AR/VR applications, RADIO-ViPE offers online, open-vocabulary semantic SLAM in dynamic, unconstrained environments. It runs on raw monocular RGB video without prior calibration, depth sensors, or pose initialization, which simplifies deployment and makes it a strong candidate for systems that need flexible, language-driven scene interpretation.
Key insights
RADIO-ViPE enables robust, calibration-free, open-vocabulary semantic SLAM in dynamic environments using monocular RGB video.
Principles
- Tightly couple multi-modal embeddings with geometric constraints.
- Employ adaptive robust kernels for dynamic scene handling.
- Utilize temporal consistency to classify scene elements.
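The second and third principles can be illustrated with a small sketch. The summary does not say which kernel RADIO-ViPE uses, so everything here is an assumption chosen for illustration: a Cauchy kernel, a scale adapted from the median absolute residual, an exponential moving average over frames, and the hypothetical names `cauchy_weight`, `adaptive_scale`, `update_weights`, and `classify`.

```python
from statistics import median

def cauchy_weight(residual, scale):
    """Cauchy kernel weight: close to 1 for small residuals, toward 0 for large ones."""
    return 1.0 / (1.0 + (residual / scale) ** 2)

def adaptive_scale(residuals, tuning=2.0, floor=1e-6):
    """Adapt the kernel scale to the current frame (assumed rule: scaled median |r|)."""
    med_abs = median([abs(r) for r in residuals])
    return max(tuning * 1.4826 * med_abs, floor)

def update_weights(prev_weights, residuals, momentum=0.8):
    """Blend this frame's kernel weights with the previous ones (moving average)."""
    scale = adaptive_scale(residuals)
    current = [cauchy_weight(r, scale) for r in residuals]
    return [momentum * p + (1.0 - momentum) * c
            for p, c in zip(prev_weights, current)]

def classify(weights, threshold=0.5):
    """Label an element dynamic once its smoothed weight falls below the threshold."""
    return ["dynamic" if w < threshold else "static" for w in weights]

# Five scene elements; the last one keeps producing a large residual.
frame_residuals = [0.1, 0.2, 0.15, 0.12, 5.0]
weights = [1.0] * len(frame_residuals)
for _ in range(5):
    weights = update_weights(weights, frame_residuals)
labels = classify(weights)
```

Because the weights are smoothed over time, a single frame with a large residual does not flag an element; it has to stay inconsistent over several consecutive frames before its label changes.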
Method
RADIO-ViPE fuses dense visual features and depth estimates within a sliding-window factor graph, which it optimizes via joint bundle adjustment with a temporally consistent adaptive robust kernel.
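To make "tightly coupled" concrete, the sketch below scores one correspondence with a single cost that mixes a geometric term and a feature term, so both influence the same optimization. This is an assumed formulation, not the paper's: the pinhole model, the cosine distance, the `feat_weight` balance factor, and all function names are hypothetical. The focal length is passed in as an ordinary variable because the summary says intrinsics are not supplied up front.

```python
import math

def project(point_cam, focal, cx, cy):
    """Pinhole projection of a 3D point (camera frame) to pixel coordinates."""
    x, y, z = point_cam
    return (focal * x / z + cx, focal * y / z + cy)

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)

def joint_cost(point_cam, observed_px, feat_src, feat_dst,
               focal, cx, cy, feat_weight=1.0):
    """Squared reprojection error plus a weighted squared feature mismatch."""
    u, v = project(point_cam, focal, cx, cy)
    r_geo = math.hypot(u - observed_px[0], v - observed_px[1])
    r_feat = cosine_distance(feat_src, feat_dst)
    return r_geo ** 2 + feat_weight * r_feat ** 2

# Same geometry, matching versus mismatching features.
matching = joint_cost((0.0, 0.0, 2.0), (323.0, 244.0),
                      [1.0, 0.0], [1.0, 0.0], 500.0, 320.0, 240.0)
mismatching = joint_cost((0.0, 0.0, 2.0), (323.0, 244.0),
                         [1.0, 0.0], [0.0, 1.0], 500.0, 320.0, 240.0)
```

In a full system a cost like this would be summed over all correspondences in the window and passed through the robust kernel; here it only shows how a feature mismatch raises the cost of a correspondence whose geometry is unchanged.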
In practice
- Integrate vision, language, and geometry for 3D scene understanding.
- Process uncalibrated monocular RGB video for SLAM.
- Achieve online open-vocabulary grounding for robotics.
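Open-vocabulary grounding is commonly done by comparing a text embedding with the features stored on map points. The sketch below shows that generic pattern under stated assumptions: the embeddings are toy 3-vectors rather than real RADIO outputs, and `ground_query` and its `min_similarity` threshold are hypothetical names, not part of RADIO-ViPE.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def ground_query(text_embedding, point_features, min_similarity=0.5):
    """Return indices of map points that match the query, best match first."""
    scored = [(cosine_similarity(text_embedding, feat), idx)
              for idx, feat in enumerate(point_features)]
    hits = [(score, idx) for score, idx in scored if score >= min_similarity]
    hits.sort(key=lambda pair: -pair[0])
    return [idx for _, idx in hits]

# Toy map with three points and a query aligned with the first axis.
map_features = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [1.0, 0.0, 0.0]
matches = ground_query(query, map_features)
```

The threshold trades recall for precision: lowering it returns more points for a query, including weaker matches.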
Topics
- Open-Vocabulary Semantic SLAM
- Multi-Modal Fusion
- Dynamic Environment Handling
- Calibration-Free Systems
- Bundle Adjustment
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.