OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

2026-05-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

OmniPro is introduced as the first comprehensive benchmark for evaluating omni-proactive streaming video understanding in omni-modal large language models. It addresses limitations of existing evaluations by jointly assessing omni-modal perception, proactive responding, and diverse video understanding tasks. The benchmark contains 2,700 human-verified samples across 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. 84% of the samples require audio signals (speech or non-speech), and each sample carries modality-isolation labels for fine-grained multimodal analysis. OmniPro also uses a dual-mode evaluation protocol: "Probe mode" tests content understanding, while "Online mode" tests whether a model autonomously times its proactive responses on streaming input. Initial evaluations of 11 models indicate that audio consistently improves performance, although models differ in how well they use it; that long-horizon robustness is limited; and that non-speech audio perception is a significant weakness.

Key takeaway

For research scientists developing omni-modal large language models, OmniPro provides a new benchmark for assessing proactive streaming video understanding. Use its dual-mode evaluation to identify specific weaknesses in audio utilization and long-horizon robustness. Prioritize non-speech audio perception and model stability over extended timeframes, the two areas where the evaluated models fall short.

Key insights

OmniPro is a new benchmark for evaluating omni-modal LLMs in proactive streaming video understanding.

Principles

Audio signals consistently improved performance across the 11 evaluated models, though how well models use audio varies.
Long-horizon robustness remains limited.
Non-speech audio perception is a significant weakness.
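
A minimal sketch of how the modality-isolation labels could support this kind of breakdown, assuming each evaluated sample yields one record with a label and two correctness flags (with and without the audio track). The record fields and the label names "speech" and "non_speech" are hypothetical, chosen for illustration; they are not taken from the benchmark's actual schema, and the records in the usage note are toy values, not reported results.

```python
from collections import defaultdict


def audio_gain_by_label(records):
    """Per-label accuracy with and without audio, plus the difference.

    Each record is a dict with the (hypothetical) keys:
      'label'                 - modality-isolation label of the sample
      'correct_with_audio'    - bool, model was right when given audio + video
      'correct_without_audio' - bool, model was right when given video only
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [count, hits with audio, hits without]
    for r in records:
        t = totals[r["label"]]
        t[0] += 1
        t[1] += int(r["correct_with_audio"])
        t[2] += int(r["correct_without_audio"])
    return {
        label: {
            "acc_with_audio": w / n,
            "acc_without_audio": wo / n,
            "gain": (w - wo) / n,
        }
        for label, (n, w, wo) in totals.items()
    }
```

Calling the function on a handful of toy records, for example two labeled "speech" and two labeled "non_speech", returns one entry per label; a small or zero "gain" on one label would suggest the model is not using that kind of audio.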

Method

OmniPro uses a dual-mode evaluation protocol: Probe mode queries the model before and after a trigger to test content understanding, and Online mode assesses whether the model autonomously times its responses on streaming input.
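
The two modes can be pictured with the sketch below. It is an illustration under stated assumptions, not the benchmark's implementation: the Sample fields, the two model interfaces, the probe offset, the polling step, the tolerance window, and both scoring rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    trigger_time: float  # seconds into the stream when the answer becomes available
    answer: str          # expected answer once the trigger has occurred
    duration: float      # total stream length in seconds


# Hypothetical model interfaces (assumptions, not the paper's API):
#   ProbeFn(sample, t)  -> the model's answer given the stream up to time t
#   OnlineFn(sample, t) -> None to stay silent at time t, or a response string
ProbeFn = Callable[[Sample, float], str]
OnlineFn = Callable[[Sample, float], Optional[str]]


def probe_score(sample: Sample, model: ProbeFn, offset: float = 1.0) -> dict:
    """Probe mode: query once before and once after the trigger (assumed rule)."""
    before = model(sample, sample.trigger_time - offset)
    after = model(sample, sample.trigger_time + offset)
    return {
        "premature": before == sample.answer,     # answered before evidence existed
        "after_correct": after == sample.answer,  # answered correctly afterwards
    }


def online_score(sample: Sample, model: OnlineFn,
                 step: float = 1.0, tolerance: float = 2.0) -> dict:
    """Online mode: the model decides when to speak; score its first response."""
    n_steps = int(sample.duration / step) + 1
    for i in range(n_steps):
        t = i * step
        response = model(sample, t)
        if response is not None:
            timely = sample.trigger_time <= t <= sample.trigger_time + tolerance
            return {"responded_at": t, "timely": timely,
                    "correct": response == sample.answer}
    # The model never responded during the stream.
    return {"responded_at": None, "timely": False, "correct": False}
```

The split matters because a model can pass the probe check, giving the right answer when asked, yet fail online by responding too early, too late, or not at all.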

In practice

Prioritize non-speech audio perception improvements.
Focus on long-horizon robustness in streaming models.

Topics

Omni-Proactive Video Understanding
Omni-Modal Large Language Models
Video Understanding Benchmarks
Audio-Visual Perception
Long-Horizon Robustness

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.