Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings
Summary
SAGA is a framework that uses frozen multimodal large language models (MLLMs) to generate attribute-aware training signals for vision encoders. Instead of reducing supervision to a scalar distance, it has the MLLM articulate specific visual attributes across image pairs. Training combines three signals. Group Relative Policy Optimization (GRPO) assigns reward when the MLLM makes correct predictions from the encoder's tokens, which pushes the encoder to encode these attributes. An attention-distillation loss anchors the encoder's embedding to the tokens the MLLM attends to. A standard metric-learning loss shapes the embedding geometry. The MLLM stays frozen during training and is discarded at inference, so deployment cost matches the baselines. In zero-shot image retrieval, SAGA improves Recall@1 by 3 to 6 points over baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves.
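For readers unfamiliar with the headline metric, the following is a minimal sketch of how Recall@1 is commonly computed in image retrieval: each embedding is used as a query against all the others, and a hit is counted when its nearest neighbor shares its class label. The function names, the toy embeddings, and the use of cosine similarity are illustrative assumptions, not details taken from the original article.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_1(embeddings, labels):
    """Fraction of queries whose nearest neighbor (self excluded) has the same label."""
    hits = 0
    for i, query in enumerate(embeddings):
        best_j, best_sim = None, -math.inf
        for j, candidate in enumerate(embeddings):
            if j == i:
                continue  # a query never retrieves itself
            sim = cosine(query, candidate)
            if sim > best_sim:
                best_j, best_sim = j, sim
        hits += labels[best_j] == labels[i]
    return hits / len(embeddings)


# Toy data (hypothetical): the last item sits closest to class "a" but is labeled "b".
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.6]]
labels = ["a", "a", "b", "b", "b"]
score = recall_at_1(embeddings, labels)
print(f"Recall@1 = {score:.2f}")
```

On this scale, a 3 to 6 point improvement means 3 to 6 more correct top-1 retrievals per 100 queries.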
Key takeaway
For ML engineers developing vision encoders for retrieval, SAGA offers a way to supervise beyond a single scalar distance. Attribute-resolved supervision from a frozen MLLM yields a reported gain of 3 to 6 points in Recall@1 on zero-shot retrieval. Because the MLLM is discarded after training, this language-grounded approach can improve embedding quality and retrieval accuracy without incurring MLLM inference costs.
Key insights
SAGA uses frozen MLLMs to provide attribute-resolved supervision for vision encoders, moving beyond scalar distances.
Principles
- MLLMs can articulate visual attributes for training signals.
- Attribute-resolved supervision improves vision encoder training.
- Frozen MLLMs can be discarded post-training.
Method
SAGA uses Group Relative Policy Optimization (GRPO), with reward assigned when the frozen MLLM makes correct predictions from the vision encoder's tokens. Because the MLLM's weights are not updated, this signal pushes the encoder to expose the specific attributes. Attention-distillation and metric-learning losses are added alongside.
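The "group relative" part of GRPO can be sketched briefly: the rewards of a group of sampled outputs are normalized against that group's own mean and standard deviation, so no separate value network is needed. The snippet below shows only this normalization step. The binary reward (1.0 for a correct attribute prediction, 0.0 otherwise), the group of four samples, and the use of the population standard deviation are assumptions for illustration; the summary does not say how SAGA defines its groups or rewards.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation.

    eps keeps the division finite when every reward in the group is identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption, variants exist
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: 1.0 where the MLLM's prediction was correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that beat the group average receive a positive advantage and the rest a negative one. A group with identical rewards yields all-zero advantages and therefore no learning signal from this term.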
In practice
- Enhance zero-shot image retrieval performance.
- Train vision encoders with attribute-aware supervision.
- Utilize frozen MLLMs for efficient training signals.
Topics
- Vision Encoders
- Multimodal LLMs
- Image Retrieval
- Semantic Attributes
- Zero-shot Learning
- Group Relative Policy Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.