Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

2026-06-05 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

ZeroSight is a benchmark for Zero-Shot Composed Image Retrieval (ZS-CIR). It targets two weaknesses of existing datasets: reference-target image pairs that are often irrelevant to each other, and evaluation images that may overlap with the public image data used to pre-train models such as CLIP, which undermines the zero-shot claim. ZeroSight provides a dataset of consistent reference-target pairs sourced from videos, a data construction pipeline, and accompanying evaluation methods. Consistency comes from extracting frames from a single video and generating captions with LLM-assisted techniques. To keep the setting genuinely zero-shot, ZeroSight uses only video data published after March 31, 2022, so the data cannot have been part of CLIP's pre-training. The authors also propose SC4CIR, a training-free MLLM-driven method that uses 3 symmetric consistency checks to identify hard negative targets, which they report significantly improves performance. Experiments across 27 methods indicate that existing ZS-CIR datasets and metrics inflate retrieval performance. The benchmark and models are available on GitHub.

Key takeaway

Computer vision engineers developing or benchmarking Zero-Shot Composed Image Retrieval (ZS-CIR) methods should consider evaluating on the ZeroSight benchmark. Because it uses only video data published after March 31, 2022, it keeps the evaluation genuinely zero-shot and avoids the inflated scores that arise when test images overlap with the pre-training data of models like CLIP. The training-free SC4CIR method is also worth integrating to identify hard negative targets, which the authors report significantly improves performance. Together these give a more accurate picture of a model's true capabilities.

Key insights

ZeroSight provides a genuinely zero-shot CIR benchmark and a training-free MLLM-driven method, and shows that existing datasets and metrics inflate reported performance.

Principles

ZS-CIR benchmarks require truly unseen data.
Consistent reference-target pairs are crucial.
Symmetric consistency improves hard negative identification.

Method

SC4CIR is a training-free MLLM-driven method that identifies hard negative targets through 3 symmetric consistency checks. It can be combined with various existing CIR methods.
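
This summary does not describe the three checks themselves, so the following is only a rough sketch of how a check-based post-processing step could wrap an existing retriever's ranking. The function name `demote_hard_negatives`, the check signature, the `top_k` parameter, and the choice to demote (rather than drop) failing candidates are all assumptions for illustration, not the authors' implementation. In practice each check would be an MLLM call.

```python
from typing import Callable, Sequence

# Hypothetical check signature: (reference_id, modification_text, candidate_id)
# returns True when the candidate is judged consistent with the query.
Check = Callable[[str, str, str], bool]


def demote_hard_negatives(
    reference: str,
    modification: str,
    ranked: Sequence[str],
    checks: Sequence[Check],
    top_k: int = 10,
) -> list[str]:
    """Re-order the top_k candidates so those failing any check move down.

    Relative order is preserved within the passing and failing groups, and
    candidates beyond top_k are left untouched.
    """
    head, tail = list(ranked[:top_k]), list(ranked[top_k:])
    passed, failed = [], []
    for candidate in head:
        ok = all(check(reference, modification, candidate) for check in checks)
        (passed if ok else failed).append(candidate)
    return passed + failed + tail


# Toy stand-ins for the three checks; real checks would query an MLLM.
consistent = {"img_2", "img_3"}
always_true: Check = lambda ref, mod, cand: True
in_consistent_set: Check = lambda ref, mod, cand: cand in consistent

ranked = ["img_1", "img_2", "img_3", "img_4"]
reranked = demote_hard_negatives(
    "ref", "make it red", ranked,
    [always_true, always_true, in_consistent_set],
    top_k=3,
)
```

Because the step only consumes a ranked list, it stays independent of whichever CIR method produced that list, which is one plausible reading of how a training-free method can be combined with many retrievers.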

In practice

Use ZeroSight for ZS-CIR evaluation.
Implement SC4CIR for hard negative identification.
Verify dataset novelty against model pre-training.
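
As a minimal sketch of the novelty check, the snippet below filters video records by publication date using the March 31, 2022 cutoff stated in the summary. The record layout (an `id` and a `published` field) is a hypothetical schema, not the format used by ZeroSight.

```python
from datetime import date

# Cutoff from the summary: keep only videos published after March 31, 2022.
CUTOFF = date(2022, 3, 31)


def is_after_cutoff(published: date, cutoff: date = CUTOFF) -> bool:
    """Return True when the publication date is strictly after the cutoff."""
    return published > cutoff


def filter_videos(videos: list[dict], cutoff: date = CUTOFF) -> list[dict]:
    """Keep records whose (hypothetical) 'published' field is after the cutoff."""
    return [v for v in videos if is_after_cutoff(v["published"], cutoff)]


videos = [
    {"id": "a", "published": date(2022, 3, 31)},   # on the cutoff: excluded
    {"id": "b", "published": date(2022, 4, 1)},    # after the cutoff: kept
    {"id": "c", "published": date(2021, 12, 15)},  # before the cutoff: excluded
]
kept = filter_videos(videos)
```

A date filter only addresses the publication date of the video. For a different backbone, the cutoff would need to be set against that model's own pre-training data.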

Topics

Zero-Shot Image Retrieval
Composed Image Retrieval
Benchmarking
Multimodal LLMs
Computer Vision
Video Datasets

Code references

sotayang/ZeroSight

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.