Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets
Summary
The paper introduces ZeroSight, a benchmark for Zero-Shot Composed Image Retrieval (ZS-CIR) that addresses two limitations of existing datasets. Current ZS-CIR benchmarks often pair reference and target images inconsistently, and they fail to ensure a true zero-shot scenario because their data frequently overlaps with the pre-training sets of models like CLIP. ZeroSight builds its dataset from 12,048 diverse videos published after March 31, 2022, which keeps reference and target images visually and semantically consistent and guarantees that the data is unseen by CLIP. The benchmark includes 197,313 candidate images and 54,740 queries, each with an average of 5.16 positive and 10.89 negative target images. The paper also proposes SC4CIR, a training-free MLLM-driven method that uses symmetric consistency checks to identify hard negative targets, improving average mAP by 5.90% and PNR-mAP by 12.86%. Experiments with 27 methods show that existing datasets inflate retrieval performance: on ZeroSight, average PNR-mAP is 22.93% lower than standard mAP.
Key takeaway
AI scientists and computer vision engineers who develop or evaluate Zero-Shot Composed Image Retrieval (ZS-CIR) systems should critically assess benchmark validity. Traditional datasets often inflate performance because their data overlaps with the pre-training sets of models like CLIP. Adopt the ZeroSight benchmark, with its consistent video-sourced data and its PNR-mAP metric, for a truly zero-shot evaluation. Also consider integrating the training-free SC4CIR method to improve hard negative identification and overall retrieval accuracy.
Key insights
Genuine zero-shot image retrieval benchmarks require novel, consistent data and robust evaluation to counter the inflated scores that arise when benchmark data overlaps with model pre-training sets.
Principles
- Dataset consistency is crucial for accurate ZS-CIR evaluation.
- Avoid data overlap with VLM pre-training sets for a true zero-shot evaluation.
- Symmetric consistency checks improve hard negative identification.
Method
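SC4CIR re-ranks retrieval results by combining the forward similarity score S_1 with two MLLM-driven reverse consistency checks: S_2, which checks that the target image I^t and the text T lead back to the reference image I^r (I^t, T \to I^r), and S_3, which checks that I^t and I^r lead back to T (I^t, I^r \to T).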
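The re-ranking step can be illustrated with a minimal sketch. The summary does not say how S_1, S_2, and S_3 are combined, so the weighted sum below is an assumption, and the function name `rerank_with_consistency` and its `weights` parameter are hypothetical. The scores are taken as precomputed arrays, one entry per shortlisted candidate; how the MLLM produces S_2 and S_3 is outside this sketch.

```python
import numpy as np


def rerank_with_consistency(s1, s2, s3, weights=(1.0, 1.0, 1.0)):
    """Re-rank candidates by fusing three per-candidate scores.

    s1: forward similarity scores (query -> candidate).
    s2, s3: reverse consistency scores, assumed to be precomputed by an MLLM.
    weights: hypothetical fusion weights; the weighted sum is an assumption,
             not the fusion rule confirmed by the paper.
    Returns (order, fused): candidate indices sorted best-first, and fused scores.
    """
    s1, s2, s3 = (np.asarray(s, dtype=float) for s in (s1, s2, s3))
    if not (s1.shape == s2.shape == s3.shape):
        raise ValueError("s1, s2, and s3 must have the same shape")

    w1, w2, w3 = weights
    fused = w1 * s1 + w2 * s2 + w3 * s3

    # Sort descending; a stable sort keeps the original order on ties.
    order = np.argsort(-fused, kind="stable")
    return order, fused


if __name__ == "__main__":
    # Candidate 0 wins on forward similarity alone but fails both reverse checks,
    # which is the hard-negative pattern the consistency checks are meant to catch.
    order, fused = rerank_with_consistency(
        s1=[0.9, 0.8, 0.7],
        s2=[0.1, 0.9, 0.5],
        s3=[0.2, 0.8, 0.6],
    )
    print("re-ranked candidate indices:", order.tolist())
```

In this toy input, the candidate with the highest forward score drops to the bottom after fusion because both reverse checks score it low.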
In practice
- Utilize video frames for consistent reference-target image pairs.
- Filter video data by publication date to ensure a true zero-shot setting.
- Implement MLLM-driven reverse checks for hard negative re-ranking.
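As a rough illustration of the date-filtering practice above, the sketch below keeps only videos published strictly after the March 31, 2022 cutoff stated in the summary. The `VideoRecord` type, its fields, and the `filter_unseen` helper are hypothetical names for illustration; the paper's actual collection pipeline is not described here.

```python
from dataclasses import dataclass
from datetime import date

# Cutoff taken from the summary: videos must be published after March 31, 2022.
CUTOFF = date(2022, 3, 31)


@dataclass(frozen=True)
class VideoRecord:
    """Hypothetical metadata record for one source video."""
    video_id: str
    published: date


def filter_unseen(videos, cutoff=CUTOFF):
    """Keep only videos published strictly after the cutoff date."""
    return [v for v in videos if v.published > cutoff]


if __name__ == "__main__":
    videos = [
        VideoRecord("a", date(2022, 3, 31)),  # on the cutoff day: excluded
        VideoRecord("b", date(2022, 4, 1)),   # after the cutoff: kept
        VideoRecord("c", date(2021, 1, 1)),   # before the cutoff: excluded
    ]
    kept = filter_unseen(videos)
    print([v.video_id for v in kept])
```

The comparison is strict, so a video published on the cutoff day itself is dropped, matching the "published after" wording.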
Topics
- Zero-Shot Composed Image Retrieval
- Image Retrieval Benchmarking
- MLLMs
- CLIP Models
- Video-Sourced Datasets
- Hard Negative Mining
Best for: AI Scientist, Computer Vision Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.