Reason-Then-Retrieve for CoVR-R with Structured Edit Prompts and Dense-Sparse Fusion

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

A zero-shot reason-then-retrieve system targets CoVR-R (reason-aware composed video retrieval), a task that requires inferring a target video from a reference video and an edit instruction. The pipeline is built around Qwen3.5-27B. For each gallery video, it generates a retrieval-oriented structured description and derives a dense embedding by pooling the hidden states of the generated tokens. For each query, it first reasons about the edit implied by the reference video and the instruction, then generates a target-video description whose hidden states form the query embedding. A TF-IDF branch over the generated texts complements dense retrieval, and the two rankings are fused with split-specific weights. On validation, the system achieved 80.81 R@1, 94.86 R@5, 97.11 R@10, and 98.59 R@50. On the blind test split, it reached 89.73 R@1, 95.79 R@5, 96.63 R@10, and 97.98 R@50. Prompt refinement significantly improved results, which points to the importance of preserving action chains, state transitions, and hand-object interactions in the generated descriptions.
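The R@K figures above are recall-at-K scores. As a minimal sketch of how such a metric is commonly computed, assuming each query has exactly one ground-truth target video (an assumption, not stated in the summary), the function below counts the share of queries whose target appears among the top K ranked gallery items. The function name and the toy data are illustrative only.

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries whose target is among the top-k ranked items.

    rankings: one list of gallery indices per query, best match first.
    targets:  the single ground-truth gallery index for each query
              (assumed: exactly one target per query).
    """
    if len(rankings) != len(targets) or not targets:
        raise ValueError("rankings and targets must be non-empty and aligned")
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)


# Toy example: four queries over a three-video gallery.
toy_rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [2, 1, 0]]
toy_targets = [2, 2, 2, 0]
for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy_rankings, toy_targets, k):.2f}")
```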

Key takeaway

Machine learning engineers building video retrieval systems for complex tasks such as CoVR-R should prioritize structured prompt design so that generated video descriptions stay faithful to the video. Fuse dense and sparse retrieval: the combination boosts performance by leveraging both semantic understanding and exact term matching. Preserve action order and final states in the representations to avoid common failure modes.

Key insights

Structured prompts and dense-sparse fusion are crucial for effective reason-aware composed video retrieval.

Principles

Representation fidelity is key for complex video retrieval.
Dense and sparse retrieval are complementary.
Structured prompts improve description quality.
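To make the structured-prompt principle concrete, here is a hypothetical prompt builder. The field names follow the elements this summary highlights (action chains, state transitions, hand-object interactions, final states), but the wording, the field order, and the function name are assumptions for illustration and are not the prompt used by the authors.

```python
def build_description_prompt(extra_fields=()):
    """Assemble a hypothetical structured video-description prompt.

    Each field is a (name, hint) pair; the numbered layout asks the model
    to fill in every field instead of writing one free-form caption.
    """
    fields = [
        ("Objects", "list the main objects and the actor"),
        ("Action chain", "list the actions in the order they occur"),
        ("State transitions", "describe each object's state before and after"),
        ("Hand-object interactions", "state which hand touches which object and how"),
        ("Final state", "describe the scene at the end of the video"),
    ]
    fields.extend(extra_fields)
    lines = ["Describe the video for retrieval. Fill in every field:"]
    lines += [f"{i}. {name}: {hint}." for i, (name, hint) in enumerate(fields, start=1)]
    return "\n".join(lines)


print(build_description_prompt())
```

Keeping the fields explicit makes it harder for the model to drop the action order or the end state, which are the details the summary flags as decisive.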

Method

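The pipeline has four stages: (1) generate a structured description and a dense embedding for each gallery video; (2) reason over the reference video and the edit instruction on the query side; (3) generate a target-video description whose hidden states form the query embedding; and (4) fuse dense and TF-IDF sparse retrieval scores into the final ranking.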
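The fusion step can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: it assumes cosine similarity for both branches, min-max normalization of each score list, and a single weight `alpha` on the dense branch. The summary only says that split-specific weights are used, so the value 0.7, the normalization, and all function names are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(token_docs):
    """Smoothed TF-IDF vectors (term -> weight) plus the IDF table."""
    n = len(token_docs)
    df = Counter(term for doc in token_docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in token_docs]
    return vecs, idf


def cosine_sparse(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cosine_dense(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def minmax(scores):
    """Rescale scores to [0, 1] so the two branches are comparable."""
    lo, hi = min(scores), max(scores)
    return [0.0] * len(scores) if hi == lo else [(s - lo) / (hi - lo) for s in scores]


def fuse_and_rank(query_emb, query_text, gallery_embs, gallery_texts, alpha=0.7):
    """Return gallery indices sorted by alpha * dense + (1 - alpha) * sparse."""
    gallery_vecs, idf = tfidf_vectors([t.lower().split() for t in gallery_texts])
    # Query terms unseen in the gallery carry no IDF and are dropped.
    q_counts = Counter(query_text.lower().split())
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    dense = minmax([cosine_dense(query_emb, g) for g in gallery_embs])
    sparse = minmax([cosine_sparse(q_vec, g) for g in gallery_vecs])
    fused = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

In this sketch the sparse branch acts as a tie-breaker when two gallery videos have near-identical embeddings but their descriptions differ in a key term such as "opens" versus "closes".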

In practice

Use Qwen3.5-27B for zero-shot video description.
Implement token-weighted hidden-state pooling.
Combine dense and TF-IDF retrieval for robustness.
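The pooling step above can be illustrated with a small standard-library sketch. It assumes a weighted mean over per-token hidden-state vectors followed by L2 normalization; the summary does not say how the token weights are chosen, so the weights here are plain inputs and the function name is hypothetical.

```python
import math


def weighted_pool(hidden_states, weights):
    """Weighted mean of token hidden states, then L2-normalize.

    hidden_states: one vector (list of floats) per generated token.
    weights:       one non-negative weight per token (assumed given;
                   the weighting scheme itself is not specified here).
    """
    if not hidden_states or len(hidden_states) != len(weights):
        raise ValueError("need one weight per token and at least one token")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for h, w in zip(hidden_states, weights)) / total
        for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm else pooled


# Two tokens with 2-d hidden states; the first token counts three times as much.
embedding = weighted_pool([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
print(embedding)
```

Normalizing the pooled vector means a dot product between query and gallery embeddings equals their cosine similarity.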

Topics

Composed Video Retrieval
Reason-Then-Retrieve
Qwen3.5-27B
Dense-Sparse Fusion
Structured Prompts
Video Understanding

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.