Learning to Select Visual In-Context Demonstrations
Summary
The paper introduces Learning to Select Demonstrations (LSD), a reinforcement learning (RL) framework for optimizing in-context learning (ICL) in Multimodal Large Language Models (MLLMs) on visual regression tasks. Traditional k-Nearest Neighbor (kNN) retrieval picks the demonstrations most visually similar to the query. LSD instead treats demonstration selection as a sequential decision-making problem and trains a Dueling DQN agent built on a query-centric Transformer Decoder. The agent learns a policy that balances visual relevance against diversity to maximize downstream MLLM performance. The evaluation covers five visual regression benchmarks (UTKFace, AVA, SCUT-FBP5500, KonIQ-10k, KADID-10k) and MLLMs including Gemma 3 4B-it, Qwen 2.5 7B, and Phi-3.5-vision. On objective, factual regression tasks (e.g., age prediction, image quality assessment), LSD significantly outperforms kNN by selecting diverse "boundary" examples. On subjective preference tasks (e.g., aesthetic rating), kNN remains the better choice. The framework uses FAISS for efficient action selection over large datasets and generalizes across MLLMs.
Key takeaway
AI engineers and research scientists working on visual in-context learning should consider the LSD framework for objective regression tasks. Because it selects diverse, boundary-defining demonstrations rather than only visually similar ones, it can significantly improve MLLM accuracy on factual predictions such as age or image quality. For subjective tasks such as aesthetic rating, stick with kNN-based similarity retrieval, since LSD's diversity might introduce unnecessary variance.
Key insights
Learned, diversity-aware demonstration selection significantly improves MLLM performance on objective visual regression tasks.
Principles
- Optimal demonstration selection is task-dependent.
- Diversity is crucial for objective regression tasks.
- Similarity is effective for subjective preference tasks.
Method
LSD trains a Dueling DQN with a query-centric Transformer Decoder to sequentially select demonstrations, optimizing for MLLM performance via a differential reward signal and using FAISS for large action spaces.
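The two ingredients named above can be illustrated with a small sketch. The dueling aggregation shown is the standard one from the Dueling DQN literature. The reward shape is an assumption: this summary says only "differential reward signal", so the sketch assumes the reward at each step is the reduction in MLLM prediction error caused by adding one more demonstration. The function names and the toy error values are hypothetical, and the Transformer Decoder and FAISS parts are omitted.

```python
from typing import List, Sequence


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def differential_rewards(errors: Sequence[float]) -> List[float]:
    """ASSUMED reward shape: r_t = err_{t-1} - err_t.

    errors[0] is the MLLM's error with no demonstrations and errors[t] is its
    error after the t-th demonstration has been added to the prompt.
    """
    return [errors[t - 1] - errors[t] for t in range(1, len(errors))]


# Toy values, not results from the paper.
q = dueling_q_values(state_value=1.0, advantages=[0.5, -0.5, 0.0])
rewards = differential_rewards([8.0, 6.0, 6.5, 4.0])

print(q)        # one Q-value per candidate demonstration
print(rewards)  # positive when a demonstration reduced the error
```

Under this assumed definition the per-step rewards telescope, so the episode return equals the total error reduction from the zero-shot prompt to the final prompt. That is one reason a difference-based reward is a natural fit for sequential selection.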
In practice
- Use LSD for objective visual regression tasks.
- Employ kNN for subjective visual preference tasks.
- Prioritize diverse examples for factual predictions.
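To make the contrast between similarity-only and diversity-aware selection concrete, here is a minimal sketch on toy 2-D embeddings. The diversity-aware selector is a hand-written greedy heuristic in the style of maximal marginal relevance. It is a stand-in for illustration, not LSD's learned policy. The names `knn_select`, `mmr_select`, and `trade_off`, and the embedding values, are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_select(query: Vector, pool: Sequence[Vector], k: int) -> List[int]:
    """Baseline: indices of the k pool items most similar to the query."""
    ranked = sorted(range(len(pool)), key=lambda i: cosine(query, pool[i]), reverse=True)
    return ranked[:k]


def mmr_select(query: Vector, pool: Sequence[Vector], k: int, trade_off: float = 0.5) -> List[int]:
    """Greedy heuristic (NOT the learned LSD policy): at each step pick the item
    that is relevant to the query but not redundant with earlier picks."""
    selected: List[int] = []
    candidates = list(range(len(pool)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, pool[i])
            redundancy = max((cosine(pool[i], pool[j]) for j in selected), default=0.0)
            return trade_off * relevance - (1.0 - trade_off) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings: three near-duplicates of the query direction plus two outliers.
query = [1.0, 0.0]
pool = [[1.0, 0.1], [1.0, 0.12], [1.0, 0.08], [1.0, 1.0], [1.0, -1.0]]

knn_picks = knn_select(query, pool, k=3)
mmr_picks = mmr_select(query, pool, k=3)
print(knn_picks, mmr_picks)
```

On this toy pool, kNN returns only the three near-duplicates, while the heuristic swaps one of them for an outlier. Lowering `trade_off` pushes the selection further toward spread-out examples.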
Topics
- Visual In-Context Learning
- Demonstration Selection
- Reinforcement Learning
- Dueling DQN
- Transformer Decoder
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.