OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
Summary
OmniPro is presented as the first comprehensive benchmark for evaluating omni-proactive streaming video understanding in large language models. It addresses the limitations of existing evaluations by jointly assessing omni-modal perception, proactive responding, and diverse video understanding tasks. The benchmark contains 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, and it covers 6 basic video understanding capabilities. Notably, 84% of its samples require audio signals (speech or non-speech), and every sample carries modality-isolation labels that support detailed multimodal analysis. OmniPro also uses a dual-mode evaluation protocol: "Probe mode" tests content understanding, while "Online mode" tests whether a model autonomously times its proactive responses on streaming input. Initial evaluations of 11 models show that audio consistently improves performance, but models differ in how well they use it, long-horizon robustness is limited, and non-speech audio perception is a significant weakness.
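To illustrate how modality-isolation labels can support per-modality analysis, here is a minimal sketch that groups per-sample correctness by label and reports accuracy for each group. The label names (`"visual"`, `"speech"`, `"non_speech"`) and the `(label, correct)` result format are assumptions made for illustration; they are not taken from the OmniPro release.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_modality(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return accuracy per modality label.

    `results` is an iterable of (label, correct) pairs. The label
    vocabulary is hypothetical and stands in for whatever
    modality-isolation labels the benchmark actually ships.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for label, correct in results:
        totals[label] += 1
        hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}


# Toy results, invented purely to show the call shape.
toy_results = [
    ("speech", True),
    ("speech", False),
    ("non_speech", False),
    ("visual", True),
]
breakdown = accuracy_by_modality(toy_results)
```

A breakdown of this kind is one way to surface a gap such as weak non-speech audio perception, since the weak group shows up as a separate number rather than being averaged away.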
Key takeaway
For research scientists developing omni-modal large language models, OmniPro provides a robust new benchmark to assess proactive streaming video understanding. You should use its dual-mode evaluation to identify specific weaknesses in audio utilization and long-horizon robustness. Prioritize improving non-speech audio perception and model stability over extended timeframes to advance model capabilities effectively.
Key insights
OmniPro is a new benchmark for evaluating omni-modal LLMs in proactive streaming video understanding.
Principles
- Audio input consistently improved performance across the 11 evaluated models, though how well models use it varies.
- Long-horizon robustness is limited in the evaluated models.
- Non-speech audio perception is a significant weakness.
Method
OmniPro uses a dual-mode evaluation protocol. Probe mode queries the model before and after triggers to test content understanding, while Online mode assesses whether the model autonomously times its responses on streaming input.
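The two modes can be sketched as a pair of scoring functions. This is an illustrative sketch only, not OmniPro's actual scoring code. It assumes that a probe passes when the model answers incorrectly before the trigger and correctly after it, and that an online response counts when it is correct and arrives within a tolerance window after the trigger. The `Sample` fields, the `margin` and `window` parameters, and the exact-match comparison are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    trigger_time: float  # seconds into the stream (hypothetical field)
    answer: str          # ground-truth answer (hypothetical field)


def probe_score(ask: Callable[[float], str], sample: Sample, margin: float = 1.0) -> bool:
    """Probe mode (assumed criterion): query shortly before and shortly
    after the trigger; pass only if the model is not yet correct before
    the trigger and is correct after it."""
    before = ask(sample.trigger_time - margin)
    after = ask(sample.trigger_time + margin)
    return before != sample.answer and after == sample.answer


def online_score(
    response_time: Optional[float],
    response: Optional[str],
    sample: Sample,
    window: float = 5.0,
) -> bool:
    """Online mode (assumed criterion): the model chooses when to speak;
    pass only if it responds correctly within `window` seconds after
    the trigger. A missing response counts as a failure."""
    if response_time is None or response is None:
        return False
    on_time = sample.trigger_time <= response_time <= sample.trigger_time + window
    return on_time and response == sample.answer


# Toy model that only "knows" the answer once the trigger has passed.
sample = Sample(trigger_time=10.0, answer="door closes")
toy_ask = lambda t: "door closes" if t >= sample.trigger_time else "unknown"

probe_ok = probe_score(toy_ask, sample)
online_ok = online_score(12.0, "door closes", sample)
online_early = online_score(8.0, "door closes", sample)
online_silent = online_score(None, None, sample)
```

Keeping the two scores separate mirrors the protocol's intent: a model can understand the content (pass the probe) yet still fail to respond at the right moment on its own.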
In practice
- Prioritize non-speech audio perception improvements.
- Focus on long-horizon robustness in streaming models.
Topics
- Omni-Proactive Video Understanding
- Omni-Modal Large Language Models
- Video Understanding Benchmarks
- Audio-Visual Perception
- Long-Horizon Robustness
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.