Learning to Select Visual In-Context Demonstrations

· Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

The paper introduces Learning to Select Demonstrations (LSD), a reinforcement learning (RL) framework for choosing in-context learning (ICL) demonstrations for Multimodal Large Language Models (MLLMs) on visual regression tasks. Unlike k-Nearest Neighbor (kNN) retrieval, which selects demonstrations by visual similarity alone, LSD treats demonstration selection as a sequential decision-making problem and trains a Dueling DQN agent with a query-centric Transformer Decoder. The agent learns a policy that balances visual relevance against diversity in order to maximize downstream MLLM performance. LSD is evaluated on five visual regression benchmarks (UTKFace, AVA, SCUT-FBP5500, KonIQ-10k, KADID-10k) with MLLMs including Gemma 3 4B-it, Qwen 2.5 7B, and Phi-3.5-vision. On objective, factual regression tasks (e.g., age prediction, image quality assessment), it significantly outperforms kNN by selecting diverse "boundary" examples; on subjective preference tasks (e.g., aesthetic rating), kNN remains the better choice. The framework uses FAISS for efficient action selection over large datasets and demonstrates cross-MLLM generalization.

Key takeaway

AI engineers and research scientists working on visual in-context learning should consider the LSD framework for objective regression tasks. Because it selects diverse, boundary-defining demonstrations rather than only visually similar ones, it can significantly improve MLLM accuracy on factual predictions such as age or image quality. For subjective tasks such as aesthetic rating, kNN-based similarity retrieval remains the better choice, since LSD's diversity may introduce unnecessary variance.
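
For reference, the kNN baseline mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes demonstrations are ranked by cosine similarity between image embeddings, and the function names and toy 2-D vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_select(query_emb, pool_embs, k):
    """Return indices of the k pool items most similar to the query."""
    ranked = sorted(
        range(len(pool_embs)),
        key=lambda i: cosine(query_emb, pool_embs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-D embeddings (hypothetical values, for illustration only).
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.05]
selected = knn_select(query, pool, k=2)
```

Here the two selected items are the near-duplicates of the query. That redundancy is what a diversity-aware selector is meant to avoid on objective tasks, and what the takeaway suggests is acceptable for subjective ones.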

Key insights

Learned, diversity-aware demonstration selection significantly improves MLLM performance on objective visual regression tasks.

Method

LSD trains a Dueling DQN with a query-centric Transformer Decoder to sequentially select demonstrations, optimizing for MLLM performance via a differential reward signal and using FAISS for large action spaces.
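
The selection loop described above can be sketched roughly as follows. Several parts are assumptions rather than details from the source: the reward is taken to be the drop in the MLLM's absolute error after a demonstration is added, the learned network is replaced by a hypothetical `toy_score_fn`, and candidates are scored exhaustively instead of being retrieved with FAISS. Only the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean(A) is the standard Dueling DQN formula.

```python
def dueling_q(value, advantages):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def differential_reward(error_before, error_after):
    """Assumed reward: reduction in absolute error after adding one demonstration."""
    return error_before - error_after

def select_demonstrations(candidates, score_fn, k):
    """Greedy rollout: at each step, add the unselected candidate with the highest Q."""
    chosen = []
    for _ in range(k):
        remaining = [c for c in candidates if c not in chosen]
        value, advantages = score_fn(chosen, remaining)
        q_values = dueling_q(value, advantages)
        best = max(range(len(remaining)), key=q_values.__getitem__)
        chosen.append(remaining[best])
    return chosen

def toy_score_fn(chosen, remaining):
    """Hypothetical stand-in for the learned network.

    Each candidate is (relevance, label). The advantage rewards relevance plus
    distance from the labels already chosen, mimicking a diversity preference.
    """
    advantages = []
    for relevance, label in remaining:
        diversity = min((abs(label - l) for _, l in chosen), default=0.0)
        advantages.append(relevance + 0.1 * diversity)
    return 0.0, advantages

# Toy candidates: (relevance to the query, ground-truth label such as age).
candidates = [(0.9, 30), (0.8, 31), (0.5, 60)]
picked = select_demonstrations(candidates, toy_score_fn, k=2)
```

With these toy values the first pick is the most relevant candidate, and the second is the one with a distant label rather than the near-duplicate, which is the kind of "boundary" example the summary describes.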

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.