TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

Composed Image Retrieval (CIR) lets users find a target image by supplying a reference image together with modification text. Existing CIR methods struggle with complex, multi-modification texts because of two problems the authors call "Insufficient Entity Coverage" and "Clause-Entity Misalignment," which limits practical application. To address these, the researchers developed TEMA, the Text-oriented Entity Mapping Architecture, which they present as the first CIR framework designed for multi-modification queries while still supporting simple modifications. They also created two new instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. Experiments across four benchmark datasets show TEMA outperforming prior methods in both the original and the multi-modification settings, with a favorable balance between retrieval accuracy and computational efficiency. The code and datasets are publicly available.

Key takeaway

For research scientists developing advanced image retrieval systems, TEMA offers a robust way to handle complex, multi-modification queries. Consider integrating TEMA's architecture and using the M-FashionIQ and M-CIRR datasets to improve the real-world applicability and accuracy of your CIR models, so that they handle nuanced, multi-part user requests rather than only simple text modifications.

Key insights

TEMA is a novel CIR framework for complex multi-modification queries, introduced alongside two new datasets, M-FashionIQ and M-CIRR.

Principles

Method

TEMA, the Text-oriented Entity Mapping Architecture, processes multi-modification queries by anchoring the image and following text instructions, accommodating both simple and complex modifications.
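To make the multi-modification setting concrete, the toy sketch below shows a generic composed-retrieval pipeline: start from the reference image vector, apply one offset per modification clause, and rank a gallery by cosine similarity. This is not TEMA's architecture, which this summary does not detail. The additive composition, the hand-written four-dimensional vectors, and the names `compose_query` and `rank_gallery` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compose_query(image_vec, clause_vecs):
    """Hypothetical composition: keep the image vector as the starting point
    and add one offset vector per modification clause."""
    query = list(image_vec)
    for clause in clause_vecs:
        query = [q + c for q, c in zip(query, clause)]
    return query


def rank_gallery(query_vec, gallery):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in gallery.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy feature axes (assumed): [red, blue, long_sleeve, short_sleeve]
reference = [1.0, 0.0, 1.0, 0.0]  # a red, long-sleeved shirt
clauses = [
    [-1.0, 1.0, 0.0, 0.0],  # clause 1: "make it blue"
    [0.0, 0.0, -1.0, 1.0],  # clause 2: "with short sleeves"
]
gallery = {
    "red_long": [1.0, 0.0, 1.0, 0.0],
    "blue_long": [0.0, 1.0, 1.0, 0.0],
    "blue_short": [0.0, 1.0, 0.0, 1.0],
}

query = compose_query(reference, clauses)
ranking = rank_gallery(query, gallery)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy setup, dropping the second clause moves `blue_long` to the top of the ranking, which is the kind of failure a multi-modification query exposes when a model covers only part of the instruction.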

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.