COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations

2026-06-03 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

COMBINER is a Composed Image Retrieval (CIR) network that targets a failure mode of existing methods: they often overlook images that look alike but differ in attributes, which can undermine both multimodal feature fusion and similarity modeling. Its core idea is a unified representation of cross-modal features built on attribute prototypes. The network has three modules: an Adaptive Semantic Disentanglement module that disentangles attribute features, a Unified Prototype-based Composition module that constructs cross-modal unified prototypes (CUP) and supports feature composition, and a Dual Relations Modeling module that mines pairwise and neighbor relations based on attribute similarity. The authors present COMBINER as the first study to specifically address visually similar but attribute-unrelated samples, using an attribute prototype-based similarity metric for more accurate semantic understanding. Experiments on three benchmark datasets are reported to confirm its effectiveness, and the implementation is available at https://github.com/Lee-zixu/COMBINER.

Key takeaway

For Machine Learning Engineers building Composed Image Retrieval systems, COMBINER offers a way to distinguish visually similar images whose attributes differ. Consider adding an attribute prototype-based similarity metric and a semantic disentanglement module to your multimodal retrieval model. Doing so can resolve ambiguities that methods relying on overall visual similarity tend to miss, which should yield more relevant search results.

Key insights

COMBINER improves Composed Image Retrieval by using attribute prototypes to tell apart images that look alike but differ in attributes.

Principles

Attribute prototypes unify cross-modal features.
Disentangling attribute semantics improves fusion.
Attribute-based neighbor relations refine similarity.

Method

COMBINER combines three modules: Adaptive Semantic Disentanglement to separate attribute features, Unified Prototype-based Composition to form Cross-modal Unified Prototypes (CUP), and Dual Relations Modeling to mine attribute-based pairwise and neighbor relations.
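
To make the idea of an attribute prototype-based similarity metric concrete, here is a minimal sketch in plain Python. It is not the COMBINER implementation: the prototype vectors, the 6-dimensional features, the softmax temperature, and the helper names (cosine, attribute_profile) are all assumptions chosen for illustration. Each feature is softly assigned to a small set of attribute prototypes, and two images are then compared through those assignments rather than through their raw features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute_profile(feature, prototypes, temperature=0.05):
    """Soft assignment of a feature over attribute prototypes (softmax of cosines)."""
    scores = [cosine(feature, p) / temperature for p in prototypes]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 6-d features: 4 shared "appearance" dims + 2 attribute dims.
PROTOTYPES = [
    [0, 0, 0, 0, 1, 0],  # assumed prototype for attribute A
    [0, 0, 0, 0, 0, 1],  # assumed prototype for attribute B
]
image_a = [1, 1, 1, 1, 0.5, 0.0]  # looks like image_b, carries attribute A
image_b = [1, 1, 1, 1, 0.0, 0.5]  # looks like image_a, carries attribute B

profile_a = attribute_profile(image_a, PROTOTYPES)
profile_b = attribute_profile(image_b, PROTOTYPES)

visual_sim = cosine(image_a, image_b)         # similarity of raw features
attribute_sim = cosine(profile_a, profile_b)  # similarity of attribute profiles
print(f"visual={visual_sim:.2f} attribute={attribute_sim:.2f}")
```

In this toy setup the raw cosine similarity is high (about 0.94) because the two features share most of their dimensions, while the similarity of the attribute profiles is close to zero. That gap is the kind of visually similar but attribute-unrelated case the summary describes.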

In practice

Apply attribute prototypes for CIR.
Disentangle attributes in multimodal inputs.
Model attribute-based neighbor relations.
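
The last step above, modeling attribute-based neighbor relations, can be sketched as a nearest-neighbor search over attribute profiles. This is an illustrative assumption, not the paper's Dual Relations Modeling module: the item names, the three-attribute profiles, and the attribute_neighbors helper are hypothetical, and the profiles are taken as precomputed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute_neighbors(profiles, k=1):
    """For each item, return the names of its k most attribute-similar items."""
    neighbors = {}
    for name, profile in profiles.items():
        scored = [
            (cosine(profile, other), other_name)
            for other_name, other in profiles.items()
            if other_name != name
        ]
        scored.sort(reverse=True)  # highest attribute similarity first
        neighbors[name] = [other_name for _, other_name in scored[:k]]
    return neighbors

# Hypothetical attribute profiles over (red, blue, striped).
profiles = {
    "red_dress": [0.8, 0.1, 0.1],
    "red_shirt": [0.7, 0.1, 0.2],
    "blue_dress": [0.1, 0.8, 0.1],
    "blue_shirt": [0.1, 0.7, 0.2],
}

neighbors = attribute_neighbors(profiles, k=1)
for name, nearest in neighbors.items():
    print(name, "->", nearest)
```

Here the two dresses are not neighbors of each other, even though they might look alike, because their color attributes differ. In a training setup, neighbor sets like these could plausibly serve as extra supervision alongside the pairwise query-target relation; how COMBINER actually uses them is not specified in this summary.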

Topics

Composed Image Retrieval
Multimodal Learning
Attribute Prototypes
Semantic Disentanglement
Image Similarity Modeling
Deep Learning

Code references

Lee-zixu/COMBINER

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.