Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models
Summary
A new benchmark and framework, CoopSR and SP-CoR, address the challenge of multi-robot cooperative dynamic spatial reasoning using Multimodal Large Language Models (MLLMs). CoopSR introduces EgoTeam, the first dataset for this task, comprising 114,227 QA pairs across 19 question types, four difficulty tiers, and three team sizes in simulated environments (Habitat and iGibson), alongside a real-world test set of approximately 2,326 QAs from two quadruped robots. The proposed SP-CoR framework enhances fine-grained cooperative spatial reasoning by integrating dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation. This allows SP-CoR to benefit from privileged robot-pose supervision during training while only requiring egocentric videos at test time. SP-CoR consistently outperforms 22 MLLM baselines, achieving a +3.87% improvement on Habitat and +7.12% on iGibson, demonstrating stronger generalization to unseen team sizes and real-world scenarios.
Key takeaway
For research scientists developing multi-robot systems, this work highlights the critical need for specialized datasets and physics-informed MLLM architectures. You should consider integrating dynamics-aware sampling and physics-guided view fusion into your models to enhance cooperative spatial reasoning, especially when aiming for robust generalization across varying team sizes and real-world deployments. The EgoTeam dataset offers a valuable resource for benchmarking and training your next-generation multi-robot perception systems.
Key insights
Multi-robot egocentric spatial reasoning benefits from physics-informed MLLMs and specialized cooperative datasets.
Principles
- Integrate dynamics-aware sampling for multi-robot video.
- Fuse views using spectral and physics guidance.
- Distill prompts with physics alignment.
Method
SP-CoR combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation to enable MLLMs to perform cooperative spatial reasoning from egocentric videos.
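To make the idea of dynamics-aware frame sampling concrete, here is a minimal sketch in plain Python. It is not the SP-CoR implementation, whose details this summary does not give. It assumes a simple stand-in: score each frame by its mean absolute pixel change from the previous frame, then pick frames at equal steps of cumulative motion so that fast-changing segments receive more of the frame budget. The function names (`motion_scores`, `sample_by_motion`, `sample_team`) and the flattened-pixel-list frame format are hypothetical choices for illustration.

```python
from typing import Dict, List, Sequence


def motion_scores(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute difference between consecutive frames.

    Each frame is a flattened list of pixel intensities (an assumption
    made to keep the sketch dependency-free). The first frame scores 0.
    """
    if not frames:
        return []
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        scores.append(diff / len(cur))
    return scores


def sample_by_motion(scores: Sequence[float], budget: int) -> List[int]:
    """Pick `budget` distinct frame indices at equal steps of cumulative motion."""
    n = len(scores)
    if budget >= n:
        return list(range(n))
    eps = 1e-6  # keeps static clips from collapsing onto a single frame
    cum: List[float] = []
    total = 0.0
    for s in scores:
        total += s + eps
        cum.append(total)
    picked: List[int] = []
    j = 0
    for k in range(budget):
        # Midpoint of the k-th equal slice of total motion.
        target = total * (k + 0.5) / budget
        while cum[j] < target:
            j += 1
        if picked:
            j = max(j, picked[-1] + 1)  # keep indices strictly increasing
        j = min(j, n - budget + k)      # leave room for the remaining picks
        picked.append(j)
    return picked


def sample_team(
    videos: Dict[str, Sequence[Sequence[float]]], budget_per_robot: int
) -> Dict[str, List[int]]:
    """Apply motion-weighted sampling independently to each robot's video."""
    return {
        robot_id: sample_by_motion(motion_scores(frames), budget_per_robot)
        for robot_id, frames in videos.items()
    }
```

With a clip that is static for its first half and moving for its second, `sample_by_motion` spends the whole budget on the moving half, which is the behavior a dynamics-aware sampler is meant to have compared with uniform striding.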
In practice
- Use the EgoTeam dataset for multi-robot QA.
- Apply SP-CoR for improved cooperative reasoning.
- Test generalization with unseen team sizes.
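One way to check generalization to unseen team sizes is to hold out every sample of one team size from training and report accuracy separately for each group. The sketch below assumes each QA record is a dictionary with `team_size`, `question_type`, `prediction`, and `answer` fields; these field names and the toy team sizes are hypothetical, not the EgoTeam schema.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Mapping, Sequence, Tuple

Record = Mapping[str, object]


def split_by_team_size(
    records: Sequence[Record], held_out_size: int
) -> Tuple[List[Record], List[Record]]:
    """Separate records into seen team sizes and the held-out team size."""
    seen = [r for r in records if r["team_size"] != held_out_size]
    unseen = [r for r in records if r["team_size"] == held_out_size]
    return seen, unseen


def accuracy_by(records: Sequence[Record], key: str) -> Dict[Hashable, float]:
    """Exact-match accuracy grouped by the value of `key`."""
    correct: Dict[Hashable, int] = defaultdict(int)
    total: Dict[Hashable, int] = defaultdict(int)
    for r in records:
        group = r[key]
        total[group] += 1
        pred = str(r["prediction"]).strip().lower()
        gold = str(r["answer"]).strip().lower()
        correct[group] += int(pred == gold)
    return {group: correct[group] / total[group] for group in total}


# Toy records with made-up values, only to show the call pattern.
records = [
    {"team_size": 2, "question_type": "distance", "prediction": "A", "answer": "A"},
    {"team_size": 2, "question_type": "direction", "prediction": "B", "answer": "C"},
    {"team_size": 3, "question_type": "distance", "prediction": "left", "answer": "Left"},
    {"team_size": 4, "question_type": "distance", "prediction": "D", "answer": "A"},
]
seen, unseen = split_by_team_size(records, held_out_size=4)
per_size = accuracy_by(records, "team_size")
```

Grouping by `question_type` or by a difficulty field with the same `accuracy_by` call gives the matching breakdown across question types and tiers.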
Topics
- Multi-Robot Cooperative Reasoning
- Egocentric Spatial Reasoning
- Multimodal Large Language Models
- CoopSR Benchmark
- EgoTeam Dataset
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.