Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

A new benchmark, V-RAGBench, and a method, CARVE, address critical limitations in Retrieval-Augmented Generation (RAG) for long, egocentric videos. V-RAGBench provides 2,100 high-quality <query, evidence chunk, answer> triplets derived from 216 Ego4D and EgoLife videos ranging from 1 to 9 hours in length, enabling faithful, decoupled evaluation of retrieval and generation. CARVE, a chunk-aware reranking framework, runs parallel retrievers across four modality-granularity configurations (visual or textual modality, at frame or clip granularity) and uses chunk-adaptive reranking to identify the best configuration for each video chunk. CARVE achieved a Recall@5 of 0.603 and an nDCG@5 of 0.433, significantly outperforming eight recent VideoRAG baselines. It also improved generation pass rates, reaching 0.357 with Qwen3-VL-8B, and runs at a competitive 4.6s per query.
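
For readers less familiar with the two retrieval metrics quoted above, the following is a minimal sketch of how Recall@k and nDCG@k are commonly computed. It assumes binary relevance (a chunk either is or is not the annotated evidence) and a ranked list without duplicate chunk ids; the benchmark's exact definitions may differ, and the function names are illustrative only.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for cid in ranked_ids[:k] if cid in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Binary-relevance nDCG: a hit counts for more the nearer it is to rank 1."""
    # Discounted gain of the actual ranking (rank is 0-based, hence the +2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, cid in enumerate(ranked_ids[:k])
        if cid in relevant_ids
    )
    # Best achievable gain: all relevant chunks packed at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Under these assumptions, a query whose single evidence chunk is returned at rank 2 scores a Recall@5 of 1.0 but an nDCG@5 of about 0.63. This is why the two numbers can diverge: recall only asks whether the evidence was found, while nDCG also penalizes finding it late.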

Key takeaway

AI scientists developing VideoRAG systems should consider making configuration decisions at the chunk level. CARVE demonstrates that tailoring modality and temporal granularity to each video chunk significantly boosts both retrieval accuracy and generation quality, surpassing query-level approaches. Because no single configuration is best everywhere, a per-chunk choice suits long, egocentric video better than one setting fixed for the whole query.

Key insights

Optimal video RAG requires modality and granularity decisions made per chunk, not per query.

Principles

Video RAG benchmarks need unique, visually grounded evidence.
No single modality-granularity configuration is universally optimal.
Chunk-level decisions improve both retrieval and generation.

Method

CARVE uses parallel retrievers for four configurations, then a multi-modal cross-encoder for chunk-adaptive reranking, propagating the winning configuration to the generator.
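
The pipeline described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Retriever` and `Reranker` callables, the configuration labels, and the choice to score every pooled candidate under all four configurations are hypothetical, and the retrievers are called one after another here for simplicity rather than in parallel.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The four modality-granularity configurations (labels are illustrative).
CONFIGS: Tuple[str, ...] = ("visual-frame", "visual-clip", "text-frame", "text-clip")

# Hypothetical interfaces, not the paper's API:
#   Retriever: (query, top_n) -> candidate chunk ids
#   Reranker:  (query, chunk_id, config) -> relevance score (stand-in for a cross-encoder)
Retriever = Callable[[str, int], List[str]]
Reranker = Callable[[str, str, str], float]


@dataclass(frozen=True)
class RankedChunk:
    chunk_id: str
    config: str   # configuration that scored highest for this chunk
    score: float


def chunk_adaptive_rerank(
    query: str,
    retrievers: Dict[str, Retriever],
    reranker: Reranker,
    top_n: int = 20,
    k: int = 5,
) -> List[RankedChunk]:
    # 1) Pool the candidates returned by each configuration's retriever.
    candidates = set()
    for config in CONFIGS:
        candidates.update(retrievers[config](query, top_n))

    # 2) Score each candidate under every configuration (an assumption)
    #    and keep the best-scoring configuration per chunk.
    ranked: List[RankedChunk] = []
    for chunk_id in sorted(candidates):
        scores = {c: reranker(query, chunk_id, c) for c in CONFIGS}
        best = max(scores, key=scores.get)
        ranked.append(RankedChunk(chunk_id, best, scores[best]))

    # 3) Rank chunks by their best score; the winning config travels with each chunk.
    ranked.sort(key=lambda r: r.score, reverse=True)
    return ranked[:k]
```

The point of the sketch is the shape of the output: each returned chunk carries its own configuration, so two chunks retrieved for the same query may reach the generator in different modalities or granularities.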

In practice

Run parallel retrievers over the four configurations: visual or textual modality, at frame or clip granularity.
Use a cross-encoder for chunk-level reranking.
Pass each retrieved chunk to the generator in the representation of its winning configuration.
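
The last step in the list above can be illustrated with a small sketch of how generator inputs might be assembled once each chunk has a chosen configuration. Everything here is hypothetical: the `store` layout (chunk id, then configuration label, then a representation such as frame paths, a clip path, or caption text), the label format, and the output dictionary are assumptions for illustration, not the paper's interface.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str
    config: str  # e.g. "visual-clip" or "text-frame", as chosen by the reranker


def build_generator_inputs(
    chunks: List[RetrievedChunk],
    store: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Fetch, for each chunk, only the representation its chosen config names."""
    items: List[Dict[str, Any]] = []
    for chunk in chunks:
        modality, granularity = chunk.config.split("-")
        items.append(
            {
                "chunk_id": chunk.chunk_id,
                "modality": modality,        # "visual" or "text"
                "granularity": granularity,  # "frame" or "clip"
                "content": store[chunk.chunk_id][chunk.config],
            }
        )
    return items
```

A downstream prompt builder could then interleave visual and textual items in rank order, rather than forcing every chunk into one modality chosen for the whole query.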

Topics

VideoRAG
Retrieval-Augmented Generation
Egocentric Video
Multimodal Retrieval
Temporal Granularity
V-RAGBench
CARVE

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.