STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

2026-05-20 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing · Depth: Expert, medium

Summary

STiTch (Semantic Transition and Transportation in Collaboration) is a framework for training-free zero-shot composed image retrieval (CIR). It addresses two limitations of existing LLM-based approaches. First, the semantic gap between the reference image and the text modification causes LLM-inferred captions to carry over unexpected features from the reference image. Second, point-to-point alignment cannot capture the diverse compositions that may satisfy a query. STiTch refines LLM-inferred captions by applying a transition vector in the embedding space, which moves them closer to the target image and focuses them on the core modification intent while filtering noise. It also recasts retrieval as set-to-set alignment, treating captions and images as discrete distributions, and uses a bidirectional transportation distance to obtain fine-grained cross-modal alignments and compute retrieval scores. According to the source, extensive experiments confirm STiTch's generalizability and effectiveness across various CIR tasks.

Key takeaway

For machine learning engineers building zero-shot composed image retrieval systems, STiTch suggests two changes to LLM-based pipelines. First, refine LLM-generated captions with an embedding-space transition vector to filter noise and sharpen the modification intent. Second, reformulate retrieval as set-to-set alignment scored with a bidirectional transportation distance, which captures diverse compositions more effectively and may improve accuracy and generalizability in unseen multimodal scenarios.

Key insights

Training-free zero-shot CIR improves when LLM-generated captions are refined in the embedding space and retrieval uses set-to-set alignment to capture diverse compositions.

Principles

Semantic gaps between the reference image and the modification text introduce noise into LLM-generated captions.
Point-to-point alignment cannot capture diverse compositions.
Refining captions in the embedding space isolates the core modification intent.

Method

STiTch refines LLM-inferred captions via an embedding space transition vector and models retrieval as a set-to-set alignment using discrete distributions and a bidirectional transportation distance for fine-grained cross-modal scoring.
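
The caption-refinement step can be illustrated with a minimal sketch. The summary does not give the actual formula, so everything here is an assumption made for illustration: the function name apply_transition, the step size alpha, and the choice of transition direction (the difference between the LLM-inferred caption embedding and an embedding of the reference content). It shows one plausible way to shift captions away from leftover reference features; it is not the paper's method.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def apply_transition(caption_emb, reference_emb, alpha=0.5):
    """Shift caption embeddings along a transition direction.

    caption_emb:   (K, d) embeddings of LLM-inferred target captions.
    reference_emb: (d,) or (1, d) embedding of the reference content.
    alpha:         hypothetical step size; 0 leaves the captions unchanged.

    ASSUMPTION: the transition vector is caption minus reference, i.e. the
    direction of the requested modification. The source does not confirm this.
    """
    caption_emb = l2_normalize(np.asarray(caption_emb, dtype=float))
    reference_emb = l2_normalize(np.asarray(reference_emb, dtype=float))
    transition = caption_emb - reference_emb   # broadcasts over the K captions
    return l2_normalize(caption_emb + alpha * transition)
```

With alpha greater than zero, each refined caption has a lower cosine similarity to the reference embedding than the original caption did, which is the intended "filter out reference noise" effect under this assumption.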

In practice

Refine LLM outputs with embedding space adjustments.
Use set-to-set alignment for complex multimodal retrieval.
Apply bidirectional transportation distance for fine-grained scoring.
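
The set-to-set scoring idea can be sketched as follows. The summary does not define the bidirectional transportation distance, so this example swaps in a simple softmax-based transport in both directions: each caption spreads its mass over image features, each image feature spreads its mass over captions, and the two expected costs are averaged. The function names, the temperature tau, the uniform masses, and the cost 1 - cosine similarity are all assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_transport_cost(captions, image_feats, tau=0.1):
    """Average of caption-to-image and image-to-caption transport costs.

    captions:    (K, d) set of caption embeddings (uniform mass 1/K each).
    image_feats: (N, d) set of image feature embeddings (uniform mass 1/N each).
    ASSUMPTION: softmax transport plans, not the paper's exact distance.
    """
    C = l2_normalize(np.asarray(captions, dtype=float))
    V = l2_normalize(np.asarray(image_feats, dtype=float))
    sim = C @ V.T                     # (K, N) cosine similarities
    cost = 1.0 - sim                  # pairwise transport cost
    fwd_plan = softmax(sim / tau, axis=1) / C.shape[0]  # captions -> image
    bwd_plan = softmax(sim / tau, axis=0) / V.shape[0]  # image -> captions
    return 0.5 * float((fwd_plan * cost).sum() + (bwd_plan * cost).sum())

def rank_candidates(captions, candidates, tau=0.1):
    """Return candidate indices sorted by ascending cost, plus the costs."""
    costs = np.array(
        [bidirectional_transport_cost(captions, feats, tau) for feats in candidates]
    )
    return np.argsort(costs), costs
```

A lower cost means a better match, so the first index returned by rank_candidates is the top retrieval result. An entropic optimal transport solver such as Sinkhorn could replace the softmax plans if exact marginal constraints are needed.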

Topics

Composed Image Retrieval
Zero-Shot Learning
Large Language Models
Semantic Embedding
Multimodal Retrieval
Set-to-Set Alignment

Code references

mohammad2012191/ViC

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.