Rethinking RAG in Long Videos: What to Retrieve and How to Use It?
Summary
V-RAGBench, a new benchmark, and CARVE, a new method, address limitations of Retrieval-Augmented Generation (RAG) on long, egocentric videos. V-RAGBench provides 2,100 high-quality <query, evidence chunk, answer> triplets derived from 216 Ego4D and EgoLife videos ranging from 1 to 9 hours in length, so that retrieval and generation can be evaluated separately and faithfully. CARVE, a chunk-aware reranking framework, runs parallel retrievers across four modality-granularity configurations (visual or textual modality, frame or clip granularity) and uses chunk-adaptive reranking to identify the best configuration for each video chunk. CARVE reaches a Recall@5 of 0.603 and an nDCG@5 of 0.433, significantly outperforming eight recent VideoRAG baselines. It also improves the generation pass rate, reaching 0.357 with Qwen3-VL-8B, and runs at a competitive 4.6 s per query.
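For readers less familiar with the two retrieval metrics quoted above, the sketch below shows the standard binary-relevance forms of Recall@k and nDCG@k. The summary does not state the exact formulas the benchmark uses, so treat this as an assumption; the function names and chunk ids are illustrative only, and ranked ids are assumed to be unique.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in ranked[:k] if chunk_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: a hit at rank r contributes 1 / log2(r + 1)."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position is 0-based, so rank = position + 1
        for position, chunk_id in enumerate(ranked[:k])
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query with a single evidence chunk, retrieved at rank 2.
ranked_chunks = ["c7", "c2", "c9", "c4", "c1"]
evidence = {"c2"}
recall = recall_at_k(ranked_chunks, evidence, k=5)  # 1.0
ndcg = ndcg_at_k(ranked_chunks, evidence, k=5)      # 1 / log2(3), about 0.631
```

With one evidence chunk per query, Recall@5 only records whether the chunk was found, while nDCG@5 also rewards placing it closer to the top.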
Key takeaway
AI scientists developing VideoRAG systems should consider making configuration decisions at the chunk level. CARVE shows that choosing modality and temporal granularity per video chunk improves both retrieval accuracy and generation quality over query-level approaches, which makes it better suited to the complexity of long, egocentric video data.
Key insights
Optimal video RAG requires modality and granularity decisions made per chunk, not per query.
Principles
- Video RAG benchmarks need unique, visually grounded evidence.
- No single modality-granularity configuration is universally optimal.
- Chunk-level decisions improve both retrieval and generation.
Method
CARVE uses parallel retrievers for four configurations, then a multi-modal cross-encoder for chunk-adaptive reranking, propagating the winning configuration to the generator.
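The reranking step described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Candidate` type, the `chunk_adaptive_rerank` function, and the `toy_scorer` (a token-overlap stand-in for the multi-modal cross-encoder) are all hypothetical, and representations are plain strings so that the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Config = Tuple[str, str]  # (modality, granularity), e.g. ("visual", "clip")


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    config: Config
    representation: str  # stand-in for frames, a clip, or text of the chunk


Scorer = Callable[[str, Candidate], float]


def chunk_adaptive_rerank(
    query: str, candidates: List[Candidate], scorer: Scorer, top_k: int = 5
) -> List[Tuple[Candidate, float]]:
    """Keep the best-scoring configuration per chunk, then rank the chunks."""
    best: Dict[str, Tuple[Candidate, float]] = {}
    for cand in candidates:
        score = scorer(query, cand)
        current = best.get(cand.chunk_id)
        if current is None or score > current[1]:
            best[cand.chunk_id] = (cand, score)
    ranked = sorted(best.values(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def toy_scorer(query: str, cand: Candidate) -> float:
    """Assumed placeholder for a cross-encoder: fraction of query tokens matched."""
    query_tokens = set(query.lower().split())
    rep_tokens = set(cand.representation.lower().split())
    return len(query_tokens & rep_tokens) / len(query_tokens) if query_tokens else 0.0


pool = [
    Candidate("c1", ("visual", "frame"), "person holding red mug"),
    Candidate("c1", ("textual", "clip"), "she says the keys are on the table"),
    Candidate("c2", ("visual", "clip"), "person places keys on kitchen table"),
    Candidate("c2", ("textual", "frame"), "background music"),
]
results = chunk_adaptive_rerank("where are the keys", pool, toy_scorer)
for cand, score in results:
    print(cand.chunk_id, cand.config, round(score, 2))
```

Each returned candidate carries its winning configuration, so a downstream generator can be given that chunk in the form that scored best rather than in one fixed form for the whole query.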
In practice
- Run parallel retrievers, one for each combination of modality (visual, textual) and granularity (frame, clip).
- Use a cross-encoder for chunk-level reranking.
- Pass chunk-specific representations to the generator.
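The first step in the list above, fanning a query out to four retrievers and pooling their results per chunk, might look like the sketch below. It assumes each retriever is a callable taking `(query, k)` and returning chunk ids; that interface, the `retrieve_all` helper, and the static toy retrievers are hypothetical. The summary also does not say whether the reranker scores every chunk under all four configurations or only under those that retrieved it, so this sketch simply records which configurations surfaced each chunk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Config = Tuple[str, str]  # (modality, granularity)
Retriever = Callable[[str, int], List[str]]  # (query, k) -> ranked chunk ids


def retrieve_all(
    query: str, retrievers: Dict[Config, Retriever], k: int = 10
) -> Dict[str, List[Config]]:
    """Run every retriever concurrently and group the hits by chunk id."""
    with ThreadPoolExecutor(max_workers=max(1, len(retrievers))) as pool:
        futures = {
            config: pool.submit(retrieve, query, k)
            for config, retrieve in retrievers.items()
        }
        hits = {config: future.result() for config, future in futures.items()}

    configs_by_chunk: Dict[str, List[Config]] = {}
    for config, chunk_ids in hits.items():
        for chunk_id in chunk_ids:
            configs_by_chunk.setdefault(chunk_id, []).append(config)
    return configs_by_chunk


def make_static_retriever(chunk_ids: List[str]) -> Retriever:
    """Toy retriever that ignores the query and returns a fixed ranking."""
    return lambda query, k: chunk_ids[:k]


retrievers = {
    ("visual", "frame"): make_static_retriever(["c1", "c3"]),
    ("visual", "clip"): make_static_retriever(["c2"]),
    ("textual", "frame"): make_static_retriever(["c3"]),
    ("textual", "clip"): make_static_retriever(["c1", "c2"]),
}
candidates = retrieve_all("where are the keys", retrievers, k=2)
```

Threads are used here on the assumption that real retrievers spend most of their time waiting on an index or a model server; the pooled dictionary is the natural input to a chunk-level reranker.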
Topics
- VideoRAG
- Retrieval-Augmented Generation
- Egocentric Video
- Multimodal Retrieval
- Temporal Granularity
- V-RAGBench
- CARVE
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.