TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living
Summary
TimeProVe is a new cost-efficient hybrid framework for Long Video Question Answering (LVQA) that targets temporal reasoning in activities of daily living (ADL). The challenge is finding sparse, query-relevant evidence in untrimmed, hours-long videos: existing methods either incur prohibitive compute from dense Vision-Language Model (VLM) processing or lose temporal precision by reasoning over captions. TimeProVe first uses lightweight modules to generate action-grounded answer-evidence hypotheses, then invokes the expensive VLM only for targeted verification, which reduces computational overhead. Its core, the Action-based Candidate Evidence (ACE) module, uses lightweight LLM reasoning to convert localized actions into query-conditioned candidate answers and supporting evidence windows. On the new OpenTSUBench (OTB) benchmark, TimeProVe surpasses the strongest baseline by 7.3%, cuts VLM calls by 75%, and reduces inference cost by 93%. It also performs competitively on Charades-STA and achieves state-of-the-art results when enhanced with grounding VLMs.
Key takeaway
For computer vision engineers building Long Video Question Answering (LVQA) systems, TimeProVe offers a way to cut both VLM calls and inference cost. If your current approach struggles with the expense of dense VLM processing or misses fine-grained temporal evidence, consider a "propose, then verify" hybrid design. On OpenTSUBench, it is reported to reduce VLM calls by 75% and inference cost by 93% while improving accuracy over the strongest baseline on activities-of-daily-living (ADL) videos.
Key insights
TimeProVe reasons over long videos efficiently by proposing hypotheses with lightweight LLM reasoning and verifying them with targeted VLM calls.
Principles
- Propose hypotheses with lightweight LLMs.
- Verify hypotheses using targeted VLM calls.
- Action-grounded evidence enhances relevance.
Method
TimeProVe uses an Action-based Candidate Evidence (ACE) module to convert localized actions into query-conditioned answer-evidence hypotheses via lightweight LLM reasoning, followed by targeted VLM verification.
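The propose-then-verify flow described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `Action`, `Hypothesis`, `propose`, and `answer_query` are hypothetical names, the word-overlap scoring merely stands in for lightweight LLM reasoning, and `stub_verify` stands in for a real VLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """A localized action: label plus start/end time in seconds."""
    label: str
    start: float
    end: float


@dataclass
class Hypothesis:
    """A candidate answer with the evidence window that supports it."""
    answer: str
    window: Tuple[float, float]
    score: float  # proposer confidence, higher is better


def propose(query: str, actions: List[Action]) -> List[Hypothesis]:
    """Cheap proposal step (hypothetical): rank actions by word overlap
    with the query. A real system would use an LLM here."""
    query_words = set(query.lower().replace("?", "").split())
    hypotheses = []
    for action in actions:
        overlap = len(query_words & set(action.label.lower().split()))
        if overlap > 0:
            hypotheses.append(
                Hypothesis(action.label, (action.start, action.end), float(overlap))
            )
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


def answer_query(
    query: str,
    actions: List[Action],
    verify: Callable[[str, Hypothesis], bool],
    max_vlm_calls: int = 2,
) -> Optional[Hypothesis]:
    """Verify only the top-ranked hypotheses, stopping at the first one
    the (expensive) verifier accepts."""
    for hypothesis in propose(query, actions)[:max_vlm_calls]:
        if verify(query, hypothesis):
            return hypothesis
    return None


# Example usage with a stub verifier that records each "VLM call".
vlm_calls: List[Tuple[float, float]] = []


def stub_verify(query: str, hypothesis: Hypothesis) -> bool:
    vlm_calls.append(hypothesis.window)  # one expensive call per window
    return hypothesis.answer == "drink water"


actions = [
    Action("pour water", 10.0, 15.0),
    Action("drink water", 120.0, 126.0),
    Action("open fridge", 300.0, 304.0),
]
result = answer_query("When does the person drink water?", actions, stub_verify)
print(result.answer, result.window, len(vlm_calls))
```

The design point is that the verifier sees only a few short windows rather than the whole video, so the number of expensive calls is bounded by `max_vlm_calls` regardless of video length.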
In practice
- Reduce inference cost by 93%.
- Cut VLM calls by 75% for LVQA.
- Enhance temporal reasoning in ADL videos.
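The percentages above are relative reductions against a baseline. The short sketch below shows that arithmetic; `relative_reduction` is a hypothetical helper, and the call counts and cost units are made-up values chosen only to reproduce 75% and 93%, not measurements from the paper.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Return the fractional reduction of `new` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# Illustrative numbers only (assumed, not from the paper).
baseline_calls, targeted_calls = 40, 10
baseline_cost, targeted_cost = 100.0, 7.0

call_reduction = relative_reduction(baseline_calls, targeted_calls)
cost_reduction = relative_reduction(baseline_cost, targeted_cost)
print(f"VLM calls cut by {call_reduction:.0%}, cost cut by {cost_reduction:.0%}")
```

Tracking both quantities separately is useful because call count and total cost need not shrink by the same factor.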
Topics
- Long Video Question Answering
- Temporal Reasoning
- Vision-Language Models
- LLM Reasoning
- Activities of Daily Living
- TimeProVe
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.