Native Active Perception as Reasoning for Omni-Modal Understanding
Summary
OmniAgent is presented as the first native omni-modal agent for long video understanding, targeting the computational inefficiency of passive "watch-it-all" models. It recasts video understanding as an iterative Observation-Thought-Action cycle, formulated as a Partially Observable Markov Decision Process (POMDP). The agent selectively distills audio-visual cues into a persistent textual memory, which decouples reasoning complexity from raw video duration. Training combines Agentic Supervised Fine-Tuning, which bootstraps active perception through best-of-N trajectory synthesis with dual-stage quality control, and Agentic Reinforcement Learning with TAURA, which uses turn-level entropy for credit assignment. OmniAgent also shows positive test-time scaling: performance improves with more reasoning turns. It achieves state-of-the-art results among open-source models across ten benchmarks, including VideoMME and LVBench; on LVBench, the 7B agent outperforms the roughly 10x larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
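As a rough illustration of best-of-N trajectory synthesis with two quality gates, the sketch below samples nothing itself; it only filters and ranks candidate trajectories that are already given. The summary does not say what the two quality-control stages are, so both gates here (an answer-correctness check, then a process-quality threshold) are assumptions, and `best_of_n`, `quality_score`, and `min_quality` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional

Trajectory = Dict[str, object]  # hypothetical record: {"answer": str, "turns": int}

def best_of_n(
    candidates: List[Trajectory],
    gold_answer: str,
    quality_score: Callable[[Trajectory], float],
    min_quality: float = 0.5,
) -> Optional[Trajectory]:
    """Pick one training trajectory out of N candidates, or None if all fail."""
    # Gate 1 (assumed): keep only trajectories whose final answer is correct.
    correct = [t for t in candidates if t["answer"] == gold_answer]
    # Gate 2 (assumed): keep trajectories whose process-quality score is high enough.
    scored = [(quality_score(t), t) for t in correct]
    passing = [(s, t) for s, t in scored if s >= min_quality]
    if not passing:
        return None
    # Among the survivors, keep the highest-scoring trajectory.
    return max(passing, key=lambda pair: pair[0])[1]

# Toy usage: shorter trajectories get a higher (made-up) quality score.
candidates = [
    {"answer": "red", "turns": 6},
    {"answer": "blue", "turns": 2},
    {"answer": "red", "turns": 3},
]
shorter_is_better = lambda t: 1.0 / t["turns"]
best = best_of_n(candidates, "red", shorter_is_better, min_quality=0.2)
```

In this toy run, the incorrect trajectory is dropped by the first gate and the six-turn trajectory by the second, so only the three-turn one remains.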
Key takeaway
For Machine Learning Engineers building long video understanding systems, OmniAgent marks a shift away from the passive "watch-it-all" paradigm. Explore active perception frameworks that decouple reasoning complexity from video duration: this approach lets a smaller model, such as OmniAgent's 7B agent, outperform significantly larger ones. Consider implementing an iterative Observation-Thought-Action cycle, since accuracy improves as the agent is allowed more reasoning turns while it avoids processing the entire video.
Key insights
OmniAgent employs active, iterative perception and reasoning to efficiently understand long videos, decoupling computational cost from video duration.
Principles
- Formulate understanding as a POMDP-based Observation-Thought-Action (O-T-A) cycle.
- Decouple reasoning complexity from duration.
- Active perception improves with more turns.
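The principles above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not OmniAgent's implementation: `policy` and `perceive` are hypothetical stand-ins for the model and for its audio-visual tools, and the action format (`"inspect"` a time range or `"answer"`) is invented for the example. The point it shows is that the agent reasons over a short textual memory rather than over the raw video.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class AgentState:
    question: str
    memory: List[str] = field(default_factory=list)    # persistent textual memory
    thoughts: List[str] = field(default_factory=list)  # reasoning trace, one per turn

# Hypothetical action format: ("inspect", (start_s, end_s)) or ("answer", text).
Action = Tuple[str, object]

def run_episode(
    question: str,
    policy: Callable[[AgentState], Tuple[str, Action]],
    perceive: Callable[[int, int], str],
    max_turns: int = 8,
) -> Tuple[Optional[str], List[str]]:
    state = AgentState(question)
    for _ in range(max_turns):
        # Thought + Action: the policy sees only the question and the text memory.
        thought, (kind, arg) = policy(state)
        state.thoughts.append(thought)
        if kind == "answer":
            return str(arg), state.memory
        # Observation: perceive only the requested segment and store it as text.
        start, end = arg
        state.memory.append(f"[{start}-{end}s] {perceive(start, end)}")
    return None, state.memory  # turn budget exhausted without an answer

# Toy stand-ins for the model and the perception tool.
def toy_perceive(start: int, end: int) -> str:
    notes = {(0, 60): "speaker walks on stage", (600, 660): "speaker shows a red chart"}
    return notes.get((start, end), "nothing notable")

def toy_policy(state: AgentState) -> Tuple[str, Action]:
    if len(state.memory) == 0:
        return "Check the opening.", ("inspect", (0, 60))
    if len(state.memory) == 1:
        return "The chart must come later.", ("inspect", (600, 660))
    return "The memory mentions a red chart.", ("answer", "red")

answer, memory = run_episode("What color is the chart?", toy_policy, toy_perceive)
```

Here the cost per turn depends on the length of the memory and the size of the inspected segment, not on the total video duration, and `max_turns` is the knob that a test-time scaling experiment would vary.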
Method
OmniAgent operationalizes active perception via Agentic Supervised Fine-Tuning (best-of-N trajectory synthesis) and Agentic Reinforcement Learning with TAURA, using turn-level entropy for credit assignment.
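To make "turn-level entropy for credit assignment" concrete, here is one possible reading, written as a sketch. The summary does not give TAURA's formula, so everything below is an assumption: each turn's weight is its mean token entropy, normalized so the weights average to 1, and a single trajectory-level advantage is scaled by that weight. Whether TAURA actually upweights high-entropy turns, or uses entropy in some other way, is not stated in the source. All function names are hypothetical.

```python
import math
from typing import List, Sequence

def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def turn_entropies(turns: List[List[Sequence[float]]]) -> List[float]:
    """Mean token entropy per turn; each turn is a list of token distributions."""
    return [sum(token_entropy(d) for d in turn) / len(turn) for turn in turns]

def turn_level_advantages(
    turns: List[List[Sequence[float]]], trajectory_advantage: float
) -> List[float]:
    """Spread one trajectory-level advantage over turns (assumed weighting)."""
    ents = turn_entropies(turns)
    n, total = len(ents), sum(ents)
    if total == 0.0:
        weights = [1.0] * n  # fall back to uniform credit
    else:
        weights = [n * e / total for e in ents]  # weights average to 1
    return [w * trajectory_advantage for w in weights]

# Toy trajectory: turn 0 uncertain, turn 1 fully confident, turn 2 uncertain.
toy_turns = [
    [[0.5, 0.5]],
    [[1.0, 0.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
advantages = turn_level_advantages(toy_turns, trajectory_advantage=1.0)
```

With this weighting, the confident middle turn receives no credit and the two uncertain turns split it equally; a real implementation would compute the distributions from the policy's logits rather than from hand-written lists.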
In practice
- Process long videos without sampling and encoding every frame uniformly.
- Reach state-of-the-art results among open-source models with a smaller (7B) model.
- Improve understanding with more reasoning turns.
Topics
- Active Perception
- Omni-Modal Understanding
- Video Understanding
- Reinforcement Learning
- POMDP
- Model Efficiency
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.