STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
Summary
STiTch (Semantic Transition and Transportation in Collaboration) is a framework for training-free zero-shot composed image retrieval (CIR). It addresses two limitations of existing LLM-based approaches. First, the semantic gap between the reference image and the text modification lets unexpected features from the reference image leak into the inferred captions. Second, point-to-point alignment cannot capture diverse compositions during retrieval. STiTch refines LLM-inferred captions by applying a transition vector in the embedding space, which moves them closer to the target image and keeps the focus on the core modification intent while filtering noise. It also recasts retrieval as a set-to-set alignment that treats captions and images as discrete distributions, and uses a bidirectional transportation distance to obtain fine-grained cross-modal alignments and compute retrieval scores. The authors report extensive experiments showing that STiTch is effective and generalizes across various CIR tasks.
Key takeaway
Machine Learning Engineers developing zero-shot composed image retrieval systems should consider STiTch's approach to overcome limitations in LLM-based methods. Refine LLM-generated captions with transition vectors in the embedding space to filter noise and emphasize the modification intent. Also reformulate retrieval as a set-to-set alignment scored with a bidirectional transportation distance, which captures diverse compositions more effectively and may improve accuracy and generalizability in unseen multimodal scenarios.
Key insights
Training-free zero-shot CIR improves when LLM-generated captions are refined in embedding space and retrieval uses set-to-set alignment to capture diverse compositions.
Principles
- Semantic gaps between the reference image and the text modification introduce noise into LLM-generated captions.
- Point-to-point alignment cannot capture diverse compositions.
- Refining captions in embedding space keeps the focus on the core modification intent.
Method
STiTch refines LLM-inferred captions via an embedding space transition vector and models retrieval as a set-to-set alignment using discrete distributions and a bidirectional transportation distance for fine-grained cross-modal scoring.
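The caption-refinement step can be illustrated with a minimal NumPy sketch. The summary does not say how STiTch builds its transition vector, so everything specific here is an assumption made for illustration: the function name `refine_captions`, the step size `alpha`, and the choice to take the transition vector as the normalized difference between a hypothetical "target-side" anchor embedding and a hypothetical "source-side" anchor embedding.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def refine_captions(caption_emb, source_emb, target_emb, alpha=0.5):
    """Shift caption embeddings along a transition vector (illustrative only).

    caption_emb: (n, d) unit-norm embeddings of LLM-inferred captions.
    source_emb:  (d,) hypothetical anchor for the unmodified semantics.
    target_emb:  (d,) hypothetical anchor for the intended, modified semantics.
    alpha:       hypothetical step size along the transition direction.
    """
    # Assumed construction: the transition vector is the unit direction
    # pointing from the source anchor to the target anchor.
    transition = l2_normalize(target_emb - source_emb)
    # Move every caption the same distance along that direction, then
    # project back onto the unit sphere so cosine similarity stays meaningful.
    return l2_normalize(caption_emb + alpha * transition)
```

With `alpha > 0`, each refined caption has a cosine similarity to the transition direction that is at least as high as before, which is one simple way to read "moving captions closer to the target" in embedding space.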
In practice
- Refine LLM outputs with embedding space adjustments.
- Use set-to-set alignment for complex multimodal retrieval.
- Apply bidirectional transportation distance for fine-grained scoring.
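The set-to-set scoring idea in the list above can be sketched as follows. This is not the paper's formulation, which the summary does not spell out. It is a simple stand-in that treats both sets as uniform discrete distributions, builds a softmax-based transport plan in each direction, and averages the two expected costs. The function names, the cosine cost, and the `temperature` parameter are all assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_transport_distance(caps, feats, temperature=0.1):
    """Illustrative bidirectional transport distance between two sets.

    caps:  (n, d) unit-norm caption embeddings (one set per query).
    feats: (m, d) unit-norm image-side embeddings (one set per candidate).
    Both sets carry uniform mass; the cost is 1 - cosine similarity.
    """
    sim = caps @ feats.T                 # (n, m) cosine similarities
    cost = 1.0 - sim
    # Forward: each caption spreads its mass over the image-side elements.
    fwd_plan = softmax(sim / temperature, axis=1)
    # Backward: each image-side element spreads its mass over the captions.
    bwd_plan = softmax(sim / temperature, axis=0)
    fwd = (fwd_plan * cost).sum(axis=1).mean()
    bwd = (bwd_plan * cost).sum(axis=0).mean()
    return 0.5 * (fwd + bwd)

def rank_candidates(caps, candidates, temperature=0.1):
    """Rank candidate feature sets by ascending distance to the caption set."""
    dists = np.array([
        bidirectional_transport_distance(caps, f, temperature)
        for f in candidates
    ])
    return np.argsort(dists), dists
```

Because every caption is compared against every image-side element in both directions, a candidate can score well by matching several captions partially, which a single point-to-point cosine similarity cannot express. A lower distance means a better match, so retrieval scores can be taken as the negated distances.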
Topics
- Composed Image Retrieval
- Zero-Shot Learning
- Large Language Models
- Semantic Embedding
- Multimodal Retrieval
- Set-to-Set Alignment
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.