RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

2026-04-30 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

RADIO-ViPE (RADIO: Reduce All Domains Into One; ViPE: Video Pose Engine) is an online semantic SLAM system for geometry-aware, open-vocabulary grounding in dynamic environments. It operates directly on raw monocular RGB video and requires no camera intrinsics, depth sensor, or pose initialization. The system tightly couples vision and language embeddings from agglomerative foundation models such as RADIO with geometric scene information at three stages: initialization, optimization, and the construction of factor-graph connections. Its optimization uses adaptive robust kernels to handle both actively moving objects and scene elements displaced by the agent. In the reported experiments, RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark and is competitive with offline open-vocabulary methods that assume calibrated data and static scenes, which narrows the gap to real-world robotic deployment.

Key takeaway

For computer vision engineers building autonomous robots or AR/VR applications, RADIO-ViPE offers online, open-vocabulary semantic SLAM in dynamic, unconstrained environments. Because it runs on raw monocular RGB video without prior calibration, depth sensors, or pose initialization, it simplifies deployment and suits systems that need flexible, language-driven scene interpretation.

Key insights

RADIO-ViPE enables robust, calibration-free, open-vocabulary semantic SLAM in dynamic environments using monocular RGB video.

Principles

Tightly couple multi-modal embeddings with geometric constraints.
Employ adaptive robust kernels for dynamic scene handling.
Utilize temporal consistency to classify scene elements.
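
The third principle can be illustrated with a minimal sketch. It assumes each scene element carries a per-frame inlier weight in [0, 1] (for instance, a robust-kernel weight), smooths that weight over time with an exponential moving average, and thresholds the result. The function names, `alpha`, and `threshold` are hypothetical; this summary does not specify how RADIO-ViPE implements temporal consistency.

```python
def smooth_weights(weights_per_frame: list[float], alpha: float = 0.3) -> list[float]:
    """Exponential moving average over one element's per-frame inlier weights."""
    state = weights_per_frame[0]  # initialize the filter with the first frame
    smoothed = []
    for w in weights_per_frame:
        state = alpha * w + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed


def classify(weights_per_frame: list[float],
             alpha: float = 0.3,
             threshold: float = 0.5) -> str:
    """Label an element from its smoothed weight at the latest frame."""
    latest = smooth_weights(weights_per_frame, alpha)[-1]
    return "static" if latest >= threshold else "dynamic"


# Toy histories (assumed values): one outlier frame vs. a sustained drop.
wall = [0.90, 0.95, 0.10, 0.90, 0.92]
person = [0.90, 0.20, 0.10, 0.15, 0.10]
print(classify(wall), classify(person))
```

Smoothing keeps a single noisy frame from flipping a label, while a sustained drop in weight still changes the classification after a few frames.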

Method

RADIO-ViPE fuses dense visual features and depth estimates within a sliding-window factor graph, optimized via joint bundle adjustment with a temporally consistent adaptive robust kernel.
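
To make the role of a robust kernel concrete, here is a minimal sketch of iteratively reweighted least squares (IRLS) on a scalar toy problem. The Cauchy kernel and the MAD-based scale are stand-ins chosen for illustration, since this summary does not name the kernel RADIO-ViPE uses; all function names and data values are hypothetical.

```python
import statistics


def adaptive_scale(residuals: list[float], floor: float = 1e-6) -> float:
    """Robust scale from the median absolute deviation (MAD).

    The factor 1.4826 makes the MAD comparable to a Gaussian sigma.
    """
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    return max(1.4826 * mad, floor)


def cauchy_weight(residual: float, scale: float) -> float:
    """IRLS weight of the Cauchy kernel: w(r) = 1 / (1 + (r / c)^2)."""
    return 1.0 / (1.0 + (residual / scale) ** 2)


def robust_location(observations: list[float], iterations: int = 10):
    """Estimate a scalar by IRLS; returns (estimate, last weights used)."""
    estimate = statistics.fmean(observations)  # plain least-squares start
    weights = [1.0] * len(observations)
    for _ in range(iterations):
        residuals = [z - estimate for z in observations]
        scale = adaptive_scale(residuals)  # re-estimated every iteration
        weights = [cauchy_weight(r, scale) for r in residuals]
        estimate = (sum(w * z for w, z in zip(weights, observations))
                    / sum(weights))
    return estimate, weights


# Toy data (assumed): seven measurements of a static point near 1.0,
# followed by three from a moving object near 5.0.
observations = [0.98, 1.01, 1.00, 0.99, 1.02, 1.00, 1.01, 5.0, 5.2, 4.9]
estimate, weights = robust_location(observations)
print(round(estimate, 2), [round(w, 3) for w in weights])
```

In a full bundle adjustment the same weights would multiply reprojection or feature residuals inside the factor graph; the scalar problem only shows how measurements from moving content lose influence as the scale adapts.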

In practice

Integrate vision, language, and geometry for 3D scene understanding.
Process uncalibrated monocular RGB video for SLAM.
Perform online open-vocabulary grounding for robotics.
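
Open-vocabulary grounding of this kind is commonly scored by cosine similarity between a text-query embedding and the embeddings attached to 3D map points. The sketch below assumes both live in a shared embedding space and uses toy 3-dimensional vectors; the function names and all values are hypothetical, and the summary does not state how RADIO-ViPE scores queries.

```python
import math

Point = tuple[float, float, float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ground_query(query: list[float],
                 map_points: list[tuple[Point, list[float]]]):
    """Return the 3D position whose embedding best matches the query,
    together with the similarity score of every map point."""
    scores = [cosine_similarity(query, emb) for _, emb in map_points]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return map_points[best][0], scores


# Toy map (assumed values): each entry is (xyz position, embedding).
map_points = [
    ((0.0, 0.0, 1.0), [0.90, 0.10, 0.00]),
    ((1.0, 0.5, 2.0), [0.10, 0.95, 0.05]),
    ((2.0, 0.0, 1.5), [0.00, 0.20, 0.90]),
]
query = [0.0, 1.0, 0.1]  # stands in for an encoded text prompt
position, scores = ground_query(query, map_points)
print(position)
```

Because the query is an embedding rather than a class index, any phrase the text encoder can represent can be looked up without retraining a fixed label set.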

Topics

Open-Vocabulary Semantic SLAM
Multi-Modal Fusion
Dynamic Environment Handling
Calibration-Free Systems
Bundle Adjustment

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.