TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
Summary
Composed Image Retrieval (CIR) lets users find a target image by supplying a reference image together with text describing the desired modification. Existing CIR setups struggle with complex, multi-modification texts because of two problems, "Insufficient Entity Coverage" and "Clause-Entity Misalignment," which limit practical application. To address these, the authors propose TEMA, the Text-oriented Entity Mapping Architecture, which they describe as the first CIR framework designed for multi-modification queries while still supporting simple modifications. They also introduce two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. Experiments on four benchmark datasets are reported to show superior performance for TEMA in both the original and the multi-modification settings, with a favorable balance between retrieval accuracy and computational efficiency. The code and datasets are publicly available.
Key takeaway
For research scientists building image retrieval systems, TEMA offers an approach to complex, multi-modification queries. Consider evaluating TEMA's architecture and the M-FashionIQ and M-CIRR datasets if your CIR models need to move beyond single, simple text modifications toward more nuanced user requests.
Key insights
TEMA is a CIR framework built for complex multi-modification queries, released alongside two new datasets, M-FashionIQ and M-CIRR.
Principles
- CIR becomes more practical when it can handle complex, multi-modification texts.
- Entity coverage and alignment are critical for practical CIR.
Method
TEMA, the Text-oriented Entity Mapping Architecture, processes multi-modification queries by anchoring the image and following text instructions, accommodating both simple and complex modifications.
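To make the retrieval setting concrete, here is a minimal sketch of a generic composed-query baseline. It is not TEMA's architecture, which this summary does not detail. It assumes the reference image and the modification text have already been encoded into vectors of the same dimension; the function names (`compose_query`, `rank_candidates`), the `text_weight` parameter, and the random embeddings are all hypothetical and used only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def compose_query(ref_image_emb, text_emb, text_weight=0.5):
    """Fuse a reference-image embedding with a modification-text embedding.

    Assumption: a plain weighted sum, a common late-fusion baseline.
    This is NOT the entity-mapping mechanism used by TEMA.
    """
    fused = (1.0 - text_weight) * l2_normalize(ref_image_emb) \
        + text_weight * l2_normalize(text_emb)
    return l2_normalize(fused)


def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted by descending cosine similarity."""
    scores = l2_normalize(candidate_embs) @ query_emb
    return np.argsort(-scores), scores


# Toy data: random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
dim = 8
ref_image_emb = rng.normal(size=dim)
text_emb = rng.normal(size=dim)
candidates = rng.normal(size=(5, dim))

# Plant the "target image" at index 3 so the expected ranking is known.
candidates[3] = compose_query(ref_image_emb, text_emb)

query = compose_query(ref_image_emb, text_emb)
order, scores = rank_candidates(query, candidates)
print("best match index:", order[0])
```

In a real system the random vectors would be replaced by the outputs of image and text encoders, and a multi-modification text would require more than a single pooled text vector, which is the gap the summary says TEMA targets.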
In practice
- Use M-FashionIQ for fashion-related CIR tasks.
- Utilize M-CIRR for general multi-modification CIR.
Topics
- Composed Image Retrieval
- TEMA Architecture
- Multi-Modification Datasets
- M-FashionIQ
- M-CIRR
Best for: Research Scientist, AI Scientist, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.