Reason-Then-Retrieve for CoVR-R with Structured Edit Prompts and Dense-Sparse Fusion
Summary
This work presents a zero-shot reason-then-retrieve system for CoVR-R (reason-aware composed video retrieval), a task that requires inferring a target video from a reference video and an edit instruction. The pipeline is built around Qwen3.5-27B. For each gallery video, it generates a retrieval-oriented structured description and derives a dense embedding by pooling the hidden states of the generated tokens. For each query, it first reasons about the edit using the reference video and the instruction, then generates a description of the target video; the hidden states of that description form the query embedding. A TF-IDF branch over the generated texts complements dense retrieval, and the two rankings are fused with split-specific weights. On validation, the system achieved 80.81 R@1, 94.86 R@5, 97.11 R@10, and 98.59 R@50. On the blind test split, it reached 89.73 R@1, 95.79 R@5, 96.63 R@10, and 97.98 R@50. Prompt refinement improved results substantially, which points to the importance of preserving action chains, state transitions, and hand-object interactions in the generated descriptions.
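The R@K figures above are recall-at-K percentages: the share of queries whose correct target video appears among the top K retrieved gallery items. The sketch below shows one way to compute that metric, assuming a single ground-truth target per query. The function name `recall_at_k` and the toy video ids are illustrative and do not come from the original work.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Percentage of queries whose target id is within the top-k ranked ids.

    ranked_ids: one ranked list of gallery ids per query (best first).
    target_ids: one ground-truth gallery id per query (assumed single target).
    """
    if not target_ids or len(ranked_ids) != len(target_ids):
        raise ValueError("need one non-empty ranking per target")
    hits = sum(
        1 for ranking, target in zip(ranked_ids, target_ids)
        if target in ranking[:k]
    )
    return 100.0 * hits / len(target_ids)


# Toy example: four queries over a three-video gallery (hypothetical ids).
rankings = [
    ["v3", "v1", "v2"],
    ["v2", "v3", "v1"],
    ["v1", "v2", "v3"],
    ["v2", "v1", "v3"],
]
targets = ["v3", "v3", "v3", "v1"]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, targets, k):.2f}")
```

In this toy setup, one of four targets is ranked first, three of four fall within the top two, and all fall within the top three.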
Key takeaway
Machine learning engineers building video retrieval systems, especially for complex tasks like CoVR-R, should prioritize structured prompt design so that generated video descriptions stay faithful to the video content. Fuse dense and sparse retrieval, since the combination pairs semantic matching with exact term matching and improved performance in this system. Preserve action order and final states in the representations, because losing them is a common failure mode.
Key insights
Structured prompts and dense-sparse fusion are crucial for effective reason-aware composed video retrieval.
Principles
- Representation fidelity is key for complex video retrieval.
- Dense and sparse retrieval are complementary.
- Structured prompts improve description quality.
Method
The pipeline involves structured text generation and dense embedding for gallery videos, query-side edit reasoning, target-video description generation, and fusion of dense and TF-IDF sparse retrieval scores.
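The final fusion step can be illustrated with a small standard-library sketch. It is not the authors' implementation: the weighted-sum rule, the smoothed IDF variant, the whitespace tokenizer, the weight `alpha`, and the hand-made 3-dimensional embeddings are all assumptions chosen for illustration. In the described system the dense vectors would come from pooled model hidden states, and the fusion weight would be set per split.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fit_idf(texts):
    """Smoothed IDF over whitespace tokens (an assumed variant)."""
    n_docs = len(texts)
    doc_freq = Counter(tok for t in texts for tok in set(t.lower().split()))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the gallery are dropped."""
    counts = Counter(text.lower().split())
    return {tok: tf * idf[tok] for tok, tf in counts.items() if tok in idf}


def sparse_cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fused_ranking(query_emb, query_text, gallery_embs, gallery_texts, alpha):
    """Rank gallery indices by alpha * dense + (1 - alpha) * sparse (assumed rule)."""
    idf = fit_idf(gallery_texts)
    query_vec = tfidf(query_text, idf)
    scores = []
    for emb, text in zip(gallery_embs, gallery_texts):
        dense = cosine(query_emb, emb)
        sparse = sparse_cosine(query_vec, tfidf(text, idf))
        scores.append(alpha * dense + (1.0 - alpha) * sparse)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical generated descriptions; items 0 and 2 differ only in action order.
gallery_texts = [
    "person opens the jar then pours water into cup",
    "person closes the jar and puts it on shelf",
    "person pours water into cup then opens the jar",
]
# Hand-made toy embeddings standing in for pooled hidden states.
gallery_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]

query_text = "person opens the jar then pours water into cup"
query_emb = [0.9, 0.1, 0.0]

order, scores = fused_ranking(query_emb, query_text,
                              gallery_embs, gallery_texts, alpha=0.6)
print(order)  # [0, 2, 1]
```

Items 0 and 2 contain the same bag of words, so the TF-IDF branch scores them the same and cannot tell the two action orders apart. In this toy setup the dense score breaks the tie, while the sparse score pushes the unrelated item 1 further down. This is the kind of complementarity a fused ranking relies on.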
In practice
- Use Qwen3.5-27B for zero-shot video description.
- Implement token-weighted hidden-state pooling.
- Combine dense and TF-IDF retrieval for robustness.
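The pooling step in the list above can be sketched as a weighted average of per-token hidden states followed by L2 normalization. The summary does not say how the token weights are chosen, so the weights here are simply passed in by the caller. The function name `weighted_pool`, the final normalization, and the toy 2-dimensional hidden states are assumptions for illustration only.

```python
import math


def weighted_pool(hidden_states, weights):
    """Weighted mean of token hidden states, then L2-normalized.

    hidden_states: list of T vectors (one per generated token), each of size D.
    weights: list of T non-negative token weights (scheme left to the caller).
    """
    if not hidden_states or len(hidden_states) != len(weights):
        raise ValueError("need one weight per token and at least one token")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    dim = len(hidden_states[0])
    pooled = [0.0] * dim
    for vec, w in zip(hidden_states, weights):
        for i, x in enumerate(vec):
            pooled[i] += w * x
    pooled = [x / total for x in pooled]

    # Normalize so that dot products between embeddings equal cosine similarity.
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm else pooled


# Toy hidden states for three generated tokens (hypothetical values).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform = weighted_pool(states, [1.0, 1.0, 1.0])   # plain mean pooling
weighted = weighted_pool(states, [1.0, 1.0, 2.0])  # last token counts double
print(uniform, weighted)
```

With uniform weights this reduces to ordinary mean pooling, so a weighting scheme can be introduced or removed without changing the rest of a retrieval pipeline.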
Topics
- Composed Video Retrieval
- Reason-Then-Retrieve
- Qwen3.5-27B
- Dense-Sparse Fusion
- Structured Prompts
- Video Understanding
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.