Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

ZeroSight is a benchmark for Zero-Shot Composed Image Retrieval (ZS-CIR) that addresses two limitations of existing datasets: reference-target image pairs are often inconsistent, and the data frequently overlaps with the pre-training sets of models like CLIP, so the evaluation is not truly zero-shot. ZeroSight builds its dataset from 12,048 diverse videos published after March 31, 2022, which keeps reference and target images visually and semantically consistent and guarantees the data is unseen by CLIP. The benchmark includes 197,313 candidate images and 54,740 queries, each with an average of 5.16 positive and 10.89 negative target images. The paper also proposes SC4CIR, a training-free MLLM-driven method that uses symmetric consistency checks to identify hard negative targets, improving average mAP by 5.90% and PNR-mAP by 12.86%. Experimental results from 27 methods demonstrate that existing datasets inflate retrieval performance: ZeroSight's PNR-mAP is on average 22.93% lower than standard mAP.

Key takeaway

AI scientists and computer vision engineers who develop or evaluate Zero-Shot Composed Image Retrieval (ZS-CIR) systems should critically assess benchmark validity. Traditional datasets often inflate performance because their data overlaps with the pre-training sets of models like CLIP. Adopt the ZeroSight benchmark, with its consistent video-sourced data and PNR-mAP metric, for a truly zero-shot evaluation. Also consider integrating the training-free SC4CIR method, which the paper reports improves average mAP by 5.90% and PNR-mAP by 12.86% through better hard negative identification.

Key insights

Genuine zero-shot image retrieval benchmarks require novel, consistent data and robust evaluation to counter inflated performance from pre-trained models.

Principles

Dataset consistency is crucial for accurate ZS-CIR evaluation.
Avoid data overlap with VLM pre-training sets for a true zero-shot evaluation.
Symmetric consistency checks improve hard negative identification.

Method

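SC4CIR re-ranks retrieval results by combining the forward similarity score $S_1$ with two MLLM-driven reverse consistency checks. Writing $I^r$ for the reference image, $I^t$ for a candidate target image, and $T$ for the modification text, $S_2$ checks whether the target image and the text lead back to the reference image, $(I^t, T) \to I^r$, and $S_3$ checks whether the two images lead back to the text, $(I^t, I^r) \to T$.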
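The re-ranking idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the summary does not say how S_1, S_2, and S_3 are combined, so the weighted average below is an assumption, as are the names `Candidate`, `fused_score`, and `rerank` and the convention that all three scores lie in [0, 1]. The scores are taken as precomputed inputs; producing S_2 and S_3 with an MLLM is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One retrieved image with its three (precomputed) consistency scores."""
    image_id: str
    s1: float  # forward similarity of the query to this candidate
    s2: float  # reverse check: does (target, text) lead back to the reference?
    s3: float  # reverse check: does (target, reference) lead back to the text?


def fused_score(c: Candidate, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted average of the three scores (an assumed combination rule)."""
    w1, w2, w3 = weights
    total = w1 + w2 + w3
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w1 * c.s1 + w2 * c.s2 + w3 * c.s3) / total


def rerank(candidates, weights=(1.0, 1.0, 1.0)):
    """Return candidates sorted by fused score, best first."""
    return sorted(candidates, key=lambda c: fused_score(c, weights), reverse=True)
```

Under this assumed rule, a hard negative that scores well on forward similarity but fails both reverse checks drops below a candidate that is consistent in all three directions.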

In practice

Use video frames to obtain consistent reference-target image pairs.
Filter video data by publication date to ensure a true zero-shot setting.
Implement MLLM-driven reverse checks to re-rank hard negatives.
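The date filter in the list above could look roughly like the following. It is a sketch under stated assumptions: the metadata layout (dictionaries with `video_id` and `published` keys) and the function name `filter_unseen` are hypothetical, and only the March 31, 2022 cutoff comes from the summary.

```python
from datetime import date

# Cutoff stated in the summary: keep only videos published after this date.
CUTOFF = date(2022, 3, 31)


def filter_unseen(videos, cutoff=CUTOFF):
    """Keep videos whose publication date is strictly after the cutoff.

    `videos` is assumed to be an iterable of dicts with a `published`
    field holding a datetime.date; this layout is hypothetical.
    """
    return [v for v in videos if v["published"] > cutoff]


# Illustrative metadata only; these are not entries from ZeroSight.
sample_videos = [
    {"video_id": "a", "published": date(2021, 12, 1)},
    {"video_id": "b", "published": date(2022, 3, 31)},
    {"video_id": "c", "published": date(2023, 1, 15)},
]
unseen = filter_unseen(sample_videos)
```

The comparison is strict, so a video published exactly on the cutoff date is excluded, matching the "published after" wording.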

Topics

Zero-Shot Composed Image Retrieval
Image Retrieval Benchmarking
MLLMs
CLIP Models
Video-Sourced Datasets
Hard Negative Mining

Code references

sotayang/ZeroSight

Best for: AI Scientist, Computer Vision Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.