Thinking Before Retrieving: Robust Zero-Shot Composed Image Retrieval via Strategic Planning and Self-Criticism
Summary
Composed image retrieval (CIR) identifies a target image by combining a reference image with a textual modification, and it is especially challenging in training-free, zero-shot settings. Existing approaches construct the retrieval-oriented text query in a single generation pass, which often introduces semantic distortions and omissions: preserving reference attributes interferes with integrating the textual requirements, and retrieval precision degrades. To address this, PEC-CIR is a training-free framework that structures query construction as a multi-stage reasoning pipeline. It uses a Planner-Executor-Critic architecture in which the Planner extracts explicit constraints, the Executor generates multiple candidate target descriptions, and the Critic evaluates these candidates for constraint compliance. By reframing query construction as a staged inference process, PEC-CIR reduces generative error propagation and improves retrieval stability.
Key takeaway
AI engineers building zero-shot composed image retrieval systems should consider a multi-stage reasoning pipeline for query construction. The Planner-Executor-Critic architecture exemplifies this approach: it explicitly evaluates candidate queries against extracted constraints before retrieval, which reduces the propagation of generative errors. Such a staged inference process can improve retrieval precision and stability relative to single-pass generation strategies.
Key insights
Multi-stage reasoning with self-criticism improves zero-shot composed image retrieval by reducing the propagation of generative errors into the retrieval query.
Principles
- Single-pass query generation in composed image retrieval risks semantic distortions and omissions.
- Explicit constraint extraction improves query construction accuracy.
- Staged inference with candidate evaluation boosts retrieval stability.
Method
The PEC-CIR framework uses a Planner to extract explicit constraints, an Executor to generate multiple candidate descriptions, and a Critic to evaluate these candidates based on constraint compliance.
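The three roles can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the PEC-CIR implementation: the names `Constraints`, `critic_score`, and `build_query` are hypothetical, the planner and executor are stand-in stubs where a real system would call a language or vision-language model, and the keyword-matching critic is a toy substitute for whatever compliance check the framework actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Constraints:
    """Hypothetical container for the Planner's output."""
    keep: Sequence[str]    # attributes to preserve from the reference image
    change: Sequence[str]  # attributes required by the modification text


def critic_score(candidate: str, constraints: Constraints) -> float:
    """Toy critic: fraction of constraint terms that appear in the candidate."""
    required = list(constraints.keep) + list(constraints.change)
    if not required:
        return 1.0
    text = candidate.lower()
    hits = sum(1 for term in required if term.lower() in text)
    return hits / len(required)


def build_query(
    reference_caption: str,
    modification: str,
    planner: Callable[[str, str], Constraints],
    executor: Callable[[str, str, Constraints], List[str]],
) -> str:
    """Plan constraints, generate candidates, return the best-scoring one."""
    constraints = planner(reference_caption, modification)
    candidates = executor(reference_caption, modification, constraints)
    if not candidates:
        raise ValueError("executor returned no candidates")
    return max(candidates, key=lambda c: critic_score(c, constraints))


# Stand-in stubs (assumptions); a real system would query a model here.
def toy_planner(caption: str, modification: str) -> Constraints:
    return Constraints(keep=["dog"], change=["beach"])


def toy_executor(caption: str, modification: str, c: Constraints) -> List[str]:
    return ["a cat on a beach", "a dog on a beach", "a dog in a park"]


best = build_query(
    "a dog in a park", "move it to the beach", toy_planner, toy_executor
)
print(best)
```

The selected description would then be passed to the retrieval model as the text query. Keeping the planner and executor as injected callables makes it easy to swap model backends without touching the selection logic.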
In practice
- Implement a multi-stage pipeline for complex query generation.
- Incorporate a self-criticism module to validate generated queries.
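One way to wire a validation step into an existing query generator is a bounded retry gate. The sketch below is an assumption for illustration rather than something the summarized paper describes: `first_valid_query`, its parameters, and the substring-based validity check are all hypothetical, and a production validator would likely be model-based.

```python
from typing import Callable, Optional


def first_valid_query(
    generate: Callable[[int], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
    fallback: Optional[str] = None,
) -> Optional[str]:
    """Return the first generated query that passes validation.

    `generate` receives the attempt index so callers can vary the prompt
    or sampling settings per attempt. If no attempt passes, `fallback`
    is returned instead.
    """
    for attempt in range(max_attempts):
        query = generate(attempt)
        if is_valid(query):
            return query
    return fallback


# Toy usage: canned drafts stand in for model outputs (assumption).
drafts = ["a red car", "a blue car at night", "a blue car"]
required = ("blue", "car")

result = first_valid_query(
    generate=lambda i: drafts[i],
    is_valid=lambda q: all(term in q for term in required),
    fallback="blue car",
)
print(result)
```

Bounding the number of attempts keeps latency predictable, and an explicit fallback avoids sending an unvalidated query to retrieval when every attempt fails.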
Topics
- Composed Image Retrieval
- Zero-Shot Learning
- Vision-Language Models
- Query Construction
- Multi-stage Reasoning
- Self-Criticism
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.