TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

TimeProVe is a cost-efficient hybrid framework for Long Video Question Answering (LVQA) that targets temporal reasoning over activities of daily living (ADL). The core difficulty is finding sparse, query-relevant evidence in untrimmed, hours-long videos: dense VLM processing is computationally prohibitive, while caption-based reasoning lacks temporal precision. TimeProVe first uses lightweight modules to generate action-grounded answer-evidence hypotheses, then invokes an expensive Vision-Language Model (VLM) only to verify those hypotheses, which substantially reduces computational overhead. At its core, the Action-based Candidate Evidence (ACE) module uses lightweight LLM reasoning to convert localized actions into query-conditioned candidate answers and supporting evidence windows. On the new OpenTSUBench (OTB) benchmark, TimeProVe surpasses the strongest baseline by 7.3%, cuts VLM calls by 75%, and reduces inference cost by 93%. It also performs competitively on Charades-STA and achieves state-of-the-art results when enhanced with grounding VLMs.

Key takeaway

For computer vision engineers building Long Video Question Answering (LVQA) systems, TimeProVe offers a way to cut inference cost and VLM calls substantially. If your current approach struggles with the expense of dense VLM processing or misses fine-grained temporal evidence, consider a "propose, then verify" hybrid: generate candidate answers and evidence windows with lightweight modules, then call the VLM only to verify them. On OpenTSUBench, the reported results are 75% fewer VLM calls and 93% lower inference cost while surpassing the strongest baseline by 7.3% on temporal reasoning in activities of daily living (ADL).

Key insights

TimeProVe reasons efficiently over long videos by proposing hypotheses with lightweight LLMs and verifying them with targeted VLM calls.

Principles

Method

TimeProVe uses an Action-based Candidate Evidence (ACE) module to convert localized actions into query-conditioned answer-evidence hypotheses via lightweight LLM reasoning, followed by targeted VLM verification.
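
The propose-then-verify control flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Action`, `Hypothesis`, `propose`, `verify`, and `fake_vlm` are hypothetical, simple word overlap stands in for the lightweight LLM reasoning of the ACE module, and a plain callable stands in for the expensive VLM. The sketch only shows how ranking candidate evidence windows first keeps the number of VLM calls small.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Action:
    """A localized action with its time span in seconds (hypothetical type)."""
    label: str
    start: float
    end: float


@dataclass(frozen=True)
class Hypothesis:
    """A candidate answer plus the evidence window that supports it."""
    answer: str
    window: Tuple[float, float]
    score: float


def propose(query: str, actions: Sequence[Action], top_k: int = 2) -> List[Hypothesis]:
    """Cheap proposal step. ASSUMPTION: word overlap replaces LLM reasoning."""
    query_words = set(query.lower().split())
    hypotheses = []
    for action in actions:
        overlap = len(query_words & set(action.label.lower().split()))
        if overlap > 0:
            hypotheses.append(
                Hypothesis(action.label, (action.start, action.end), float(overlap))
            )
    # Highest-scoring hypotheses are verified first.
    hypotheses.sort(key=lambda h: h.score, reverse=True)
    return hypotheses[:top_k]


def verify(
    hypotheses: Sequence[Hypothesis],
    vlm_check: Callable[[Hypothesis], bool],
) -> Tuple[Optional[Hypothesis], int]:
    """Expensive step: call the VLM once per hypothesis until one is accepted."""
    calls = 0
    for hypothesis in hypotheses:
        calls += 1
        if vlm_check(hypothesis):
            return hypothesis, calls
    return None, calls


# Toy data: four localized actions from a long video (made up for illustration).
actions = [
    Action("open fridge", 120, 128),
    Action("pour water into cup", 900, 915),
    Action("drink water from cup", 930, 940),
    Action("wash cup", 3000, 3020),
]
query = "when does the person drink water"


def fake_vlm(hypothesis: Hypothesis) -> bool:
    """Stand-in verifier; a real system would run a VLM on the window's frames."""
    return "drink" in hypothesis.answer


proposals = propose(query, actions)
best, vlm_calls = verify(proposals, fake_vlm)
print(best, vlm_calls)
```

In this toy run the verifier is consulted for one short window instead of every segment of the video, which is the source of the savings the framework aims for. The actual scoring, window construction, and verification prompts are not specified here and would differ in a real system.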

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.