OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Advanced, quick

Summary

OmniPro is presented as the first comprehensive benchmark for evaluating omni-proactive streaming video understanding in omni-modal large language models. It addresses the limitations of existing evaluations by jointly assessing omni-modal perception, proactive responding, and diverse video understanding tasks. The benchmark contains 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of the samples require audio signals (speech or non-speech), and every sample carries modality-isolation labels to support fine-grained multimodal analysis. OmniPro also defines a dual-mode evaluation protocol: Probe mode tests content understanding, while Online mode tests whether a model autonomously times its proactive responses on streaming input. Initial evaluations of 11 models show three things: audio consistently improves performance, although how well it is used varies; robustness over long horizons is limited; and non-speech audio perception is a significant weakness.
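To make the idea of modality-isolation labels concrete, here is a minimal Python sketch of how a sample record might be represented and how the share of audio-dependent samples could be computed. The summary does not describe OmniPro's actual data schema, so every name here (`Sample`, `Modality`, `required_modalities`, `needs_audio`, `audio_fraction`) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Sequence


class Modality(Enum):
    """Hypothetical modality labels; the real benchmark may name them differently."""
    VISUAL = "visual"
    SPEECH = "speech"
    NON_SPEECH_AUDIO = "non_speech_audio"


AUDIO_MODALITIES = frozenset({Modality.SPEECH, Modality.NON_SPEECH_AUDIO})


@dataclass(frozen=True)
class Sample:
    """Assumed record layout for one benchmark sample (not the official schema)."""
    sample_id: str
    sub_task: str
    cognitive_level: int
    required_modalities: FrozenSet[Modality]


def needs_audio(sample: Sample) -> bool:
    """True if the sample cannot be solved without speech or non-speech audio."""
    return bool(sample.required_modalities & AUDIO_MODALITIES)


def audio_fraction(samples: Sequence[Sample]) -> float:
    """Fraction of samples whose labels include at least one audio modality."""
    if not samples:
        return 0.0
    return sum(needs_audio(s) for s in samples) / len(samples)
```

With labels of this kind, results can be sliced by modality, for example by comparing accuracy on speech-dependent samples against samples that depend on non-speech audio.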

Key takeaway

For research scientists developing omni-modal large language models, OmniPro provides a robust new benchmark to assess proactive streaming video understanding. You should use its dual-mode evaluation to identify specific weaknesses in audio utilization and long-horizon robustness. Prioritize improving non-speech audio perception and model stability over extended timeframes to advance model capabilities effectively.

Key insights

OmniPro is a new benchmark for evaluating omni-modal LLMs in proactive streaming video understanding.

Principles

Method

OmniPro uses a dual-mode evaluation protocol. Probe mode queries the model before and after each trigger to test content understanding. Online mode feeds the model streaming input and assesses whether it autonomously responds at the right time.
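The two modes can be illustrated with a small Python sketch. This is not OmniPro's scoring code: the summary does not specify the metrics, so the `Trigger` record, the `ProbeFn` model interface, the probe margin, the timing tolerance, and the exact-match comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Trigger:
    """Hypothetical annotation: when a response becomes warranted, and what it is."""
    time_s: float
    answer: str


# Assumed model interface: (seconds of stream seen so far, question) -> answer text.
ProbeFn = Callable[[float, str], str]


def _matches(predicted: str, expected: str) -> bool:
    """Simplistic exact match after normalisation (an assumption, not the paper's metric)."""
    return predicted.strip().lower() == expected.strip().lower()


def probe_score(model: ProbeFn, question: str, trigger: Trigger,
                margin_s: float = 1.0) -> Dict[str, bool]:
    """Probe mode sketch: query shortly before and shortly after the trigger."""
    before = model(trigger.time_s - margin_s, question)
    after = model(trigger.time_s + margin_s, question)
    return {
        "before_match": _matches(before, trigger.answer),
        "after_match": _matches(after, trigger.answer),
    }


def online_score(responses: Sequence[Tuple[float, str]], trigger: Trigger,
                 tolerance_s: float = 2.0) -> str:
    """Online mode sketch: classify the model's first unprompted response.

    `responses` holds (timestamp in seconds, text) pairs emitted by the model.
    """
    if not responses:
        return "missed"
    first_time, first_text = min(responses, key=lambda r: r[0])
    if first_time < trigger.time_s:
        return "early"
    if first_time > trigger.time_s + tolerance_s:
        return "late"
    return "correct" if _matches(first_text, trigger.answer) else "wrong"
```

Keeping the two functions separate mirrors the protocol's intent: Probe mode isolates what the model understands at a given moment, while Online mode additionally penalises responding too early, too late, or not at all.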

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.