Reason-Then-Retrieve for CoVR-R with Structured Edit Prompts and Dense-Sparse Fusion

Source: cs.CV updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, medium

Summary

The authors present a zero-shot reason-then-retrieve system for CoVR-R (reason-aware composed video retrieval), a task that requires inferring a target video from a reference video and an edit instruction. The pipeline is built around Qwen3.5-27B. For each gallery video, it generates a retrieval-oriented structured description and derives a dense embedding by pooling the hidden states of the generated tokens. For each query, it first reasons about the edit implied by the reference video and the instruction, then generates a description of the target video; the hidden states of that description form the query embedding. A TF-IDF branch over the generated texts complements dense retrieval, and the two rankings are fused with split-specific weights. On the validation split, the system achieved 80.81 R@1, 94.86 R@5, 97.11 R@10, and 98.59 R@50. On the blind test split, it reached 89.73 R@1, 95.79 R@5, 96.63 R@10, and 97.98 R@50. Prompt refinement is reported to improve results significantly, which points to the importance of preserving action chains, state transitions, and hand-object interactions in the generated descriptions.
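The dense branch described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not state which pooling operator is used, so mean pooling is an assumption, and the names `mean_pool`, `cosine`, and `rank_gallery` are hypothetical. Hidden states are plain Python lists here; in a real pipeline they would come from the language model.

```python
import math

def mean_pool(hidden_states, generated_mask):
    """Average the hidden-state vectors at generated-token positions.

    Assumption: mean pooling. The source only says the hidden states of
    generated tokens are "pooled".
    """
    selected = [h for h, m in zip(hidden_states, generated_mask) if m]
    if not selected:
        raise ValueError("no generated tokens to pool")
    dim = len(selected[0])
    return [sum(vec[i] for vec in selected) / len(selected) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_gallery(query_embedding, gallery_embeddings):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query_embedding, g) for g in gallery_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy example: three token positions, of which the last two were generated.
states = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
mask = [0, 1, 1]
embedding = mean_pool(states, mask)
```

The same pooling would be applied on both sides: to the generated description of each gallery video, and to the generated target-video description for each query, so that the two embeddings live in the same space.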

Key takeaway

If you are a Machine Learning Engineer building video retrieval systems, especially for complex tasks like CoVR-R, prioritize structured prompt design so that generated video descriptions stay faithful to the video content. Implement dense-sparse retrieval fusion: according to the source, this approach significantly boosts performance by combining semantic understanding with exact term matching. Preserve action order and final states in your representations to avoid common failure modes.
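As a rough illustration of what a structured prompt for this setting might look like, the sketch below builds a gallery-description prompt and a two-stage query prompt. The exact prompts used in the paper are not given in this summary, so every field name, function name, and line of prompt wording here is hypothetical; only the aspects to preserve (action chains, state transitions, hand-object interactions, final states) come from the source.

```python
# Hypothetical field names: the source names these aspects but not the prompt text.
DESCRIPTION_FIELDS = (
    "action_chain",
    "state_transitions",
    "hand_object_interactions",
    "final_state",
)

def build_description_prompt(fields=DESCRIPTION_FIELDS):
    """Prompt asking for a retrieval-oriented structured video description."""
    lines = [
        "Describe the video for retrieval.",
        "Fill in every field and keep actions in temporal order.",
    ]
    lines += [f"{name}:" for name in fields]
    return "\n".join(lines)

def build_edit_reasoning_prompt(edit_instruction):
    """Stage 1 (query side): reason about what the edit changes and what it keeps."""
    return "\n".join([
        "You are given a reference video and an edit instruction.",
        f"Edit instruction: {edit_instruction}",
        "Explain which actions, states, and objects change, and which stay the same.",
    ])

def build_target_description_prompt(edit_reasoning, fields=DESCRIPTION_FIELDS):
    """Stage 2 (query side): describe the target video in the gallery format."""
    lines = [
        "Using the reasoning below, describe the target video.",
        f"Reasoning: {edit_reasoning}",
        "Fill in every field and keep actions in temporal order.",
    ]
    lines += [f"{name}:" for name in fields]
    return "\n".join(lines)

gallery_prompt = build_description_prompt()
query_prompt = build_edit_reasoning_prompt("pour the water into a glass instead of a cup")
```

Using the same field list for gallery descriptions and target descriptions is a design choice made for this sketch: it keeps the two texts comparable for both the dense and the TF-IDF branch.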

Key insights

Structured prompts and dense-sparse fusion are crucial for effective reason-aware composed video retrieval.

Method

The pipeline involves structured text generation and dense embedding for gallery videos, query-side edit reasoning, target-video description generation, and fusion of dense and TF-IDF sparse retrieval scores.
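The fusion step can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the paper's method: the summary does not give the fusion formula or the split-specific weights, so the sketch assumes a weighted sum of min-max normalized scores, and `dense_weight=0.7` is a placeholder value. The TF-IDF here is a simple whitespace-tokenized variant with smoothed IDF; all function names are hypothetical.

```python
import math
from collections import Counter

def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tfidf_scores(query_text, gallery_texts):
    """Score each gallery text against the query with a simple TF-IDF model."""
    tokenized = [t.lower().split() for t in gallery_texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}

    def vectorize(tokens):
        # Terms unseen in the gallery are dropped.
        return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

    query_vec = vectorize(query_text.lower().split())
    return [sparse_cosine(query_vec, vectorize(toks)) for toks in tokenized]

def min_max(scores):
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(dense_scores, sparse_scores, dense_weight=0.7):
    """Weighted sum of normalized scores (assumed fusion rule, placeholder weight)."""
    d, s = min_max(dense_scores), min_max(sparse_scores)
    return [dense_weight * x + (1.0 - dense_weight) * y for x, y in zip(d, s)]

def rank(scores):
    """Indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

gallery = ["pour water into cup", "cut apple with knife"]
sparse = tfidf_scores("pour water", gallery)
fused_ranking = rank(fuse([0.9, 0.8, 0.1], [0.0, 1.0, 0.2], dense_weight=0.5))
```

In this toy case the dense branch alone would rank item 0 first, while the fused score promotes item 1 because of its strong term overlap. In practice the weight would be tuned per split, as the summary indicates.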

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.