Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

2026-06-17 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

CLARA is a clarification framework for resolving ambiguous user intent in composed image retrieval (CIR), where a query combines a reference image with a text modification. Existing CIR methods often struggle when a query could describe several targets; they rely on text questions and on conformal prediction that is valid for a single turn only. CLARA instead shows the user a small panel of visual prototypes, each representing an alternative reading of the query, and lets the user pick the intended one. The selection is a direct visual signal from the user rather than a model-predicted answer. To keep conformal guarantees valid across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio derived from user selections. Prototypes are constrained to represent the candidate set and are snapped to real corpus images. Experiments show that CLARA matches leading single-turn retrieval performance, maintains nominal coverage across rounds, and finds targets in fewer rounds than text-question baselines, particularly when the ambiguity concerns viewpoint or fine-grained attributes.

Key takeaway

Computer vision engineers building composed image retrieval systems should consider visual disambiguation frameworks such as CLARA. If your current system clarifies ambiguous queries with text questions, be aware of the limits of that approach, especially when candidates differ only in fine-grained visual details. According to the reported experiments, letting users select among visual prototypes finds the target in fewer interaction rounds than text-question baselines while keeping coverage at the nominal level across rounds.

Key insights

CLARA uses visual prototypes and reweighted conformal prediction to resolve CIR ambiguity more effectively than text questions.

Principles

Visual disambiguation resolves viewpoint and fine-grained attribute ambiguity in fewer rounds than text questions.
Conformal guarantees can be maintained across turns via reweighting.
Direct user visual selection avoids model prediction errors.

Method

CLARA presents visual prototypes for user selection, reweights conformal calibration with likelihood ratios from those selections to maintain multi-round coverage, and snaps prototypes to real corpus images.
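The reweighting step can be illustrated with a generic weighted conformal threshold. This is a minimal sketch, not CLARA's implementation: it uses the standard weighted split-conformal quantile, and it assumes the likelihood-ratio weights have already been computed elsewhere. The function names, the nonconformity scores, and the example weights are all hypothetical.

```python
import numpy as np

def weighted_conformal_threshold(cal_scores, cal_weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.

    cal_weights and test_weight stand in for likelihood ratios (assumed to be
    given). The test point carries its own mass, placed at +infinity.
    """
    scores = np.asarray(cal_scores, dtype=float)
    weights = np.asarray(cal_weights, dtype=float)
    order = np.argsort(scores)
    scores, weights = scores[order], weights[order]
    # Normalise over calibration mass plus the test point's mass.
    cum = np.cumsum(weights) / (weights.sum() + test_weight)
    # First sorted score whose cumulative mass reaches 1 - alpha.
    idx = np.searchsorted(cum, 1.0 - alpha, side="left")
    return np.inf if idx >= len(scores) else scores[idx]

def prediction_set(candidate_scores, threshold):
    """Indices of candidates whose nonconformity score is within the threshold."""
    return np.flatnonzero(np.asarray(candidate_scores, dtype=float) <= threshold)

# Hypothetical round: a user selection makes high-score calibration examples
# more representative, so their weight goes up and the threshold widens.
scores = [0.1, 0.2, 0.3, 0.4]
uniform = weighted_conformal_threshold(scores, [1, 1, 1, 1], 1.0, alpha=0.5)
reweighted = weighted_conformal_threshold(scores, [1, 1, 1, 7], 1.0, alpha=0.5)
kept = prediction_set([0.05, 0.35, 0.9], reweighted)
```

With uniform weights this reduces to ordinary split conformal prediction. When the weights are too concentrated for the requested level, the threshold is infinite and every candidate stays in the set.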

In practice

Implement visual prototype panels for ambiguous queries.
Apply likelihood ratio reweighting for multi-turn guarantees.
Prioritize visual over textual clarification for fine details.
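For the prototype panel, one simple way to snap generated prototypes to real images is a nearest-neighbour lookup in embedding space, restricted to the current candidate set. The sketch below assumes cosine similarity over precomputed embeddings; the summary does not specify CLARA's embedding model or snapping rule, and `snap_to_corpus` and its parameters are hypothetical names.

```python
import numpy as np

def snap_to_corpus(prototypes, corpus_embeddings, candidate_ids=None):
    """Map each prototype embedding to the index of its nearest corpus image.

    Similarity is cosine (an assumption). If candidate_ids is given, only
    those corpus rows are eligible, so the panel stays inside the candidate set.
    """
    protos = np.asarray(prototypes, dtype=float)
    corpus = np.asarray(corpus_embeddings, dtype=float)
    ids = np.arange(len(corpus)) if candidate_ids is None else np.asarray(candidate_ids)
    pool = corpus[ids]
    # L2-normalise so the dot product equals cosine similarity.
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    pool = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    nearest = (protos @ pool.T).argmax(axis=1)
    return ids[nearest]

# Toy 2-D embeddings: three corpus images, two generated prototypes.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
protos = [[0.9, 0.1], [0.1, 0.8]]
panel_all = snap_to_corpus(protos, corpus)
panel_restricted = snap_to_corpus(protos, corpus, candidate_ids=[1, 2])
```

Two prototypes can snap to the same image; a production panel would likely deduplicate or fall back to the next-nearest candidate.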

Topics

Composed Image Retrieval
Visual Disambiguation
Conformal Prediction
Generative Models
Human-Computer Interaction
Image Search

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.