Would you still call this Dax? Novel Visual References in VLMs and Humans
Summary
The paper introduces the Novel Visual References Dataset (NVRD), which comprises 19,176 images across 90 visual concepts, each with up to 20 perturbed versions. The dataset covers known, composed, and fully novel entities, and is used to study how vision-language models (VLMs) and humans map novel visual references to language, especially when the new mapping contradicts prior knowledge. The researchers evaluated three open-source VLMs (Qwen-2 VL 7B, Idefics-3 8B, Molmo-2 8B) and two closed-source models (GPT-4o Mini, Gemini-2.5 Flash) and collected 2,400 human judgments for comparison. The findings indicate that models struggle to acquire novel concepts in-context when those concepts conflict with pre-training knowledge. Models and humans show correlated sensitivity to visual perturbations, but models overgeneralize significantly, applying learned labels to stimuli that humans would reject. NVRD is offered as a benchmark for research on visual concept learning.
Key takeaway
If you develop vision-language models, expect them to overgeneralize novel visual concepts, especially when those concepts contradict pre-training knowledge. Test concept acquisition and generalization on diverse, genuinely novel, and systematically perturbed visual data such as NVRD. Doing so helps you identify and mitigate overgeneralization, which makes model behavior more reliable in real-world applications that involve new or unfamiliar objects.
Key insights
VLMs overgeneralize novel visual concepts, especially when those concepts conflict with prior knowledge, even though their sensitivity to perturbations correlates with that of humans.
Principles
- Shape bias is central to human and machine concept generalization.
- Prior knowledge can hinder in-context acquisition of novel concepts.
- Models overgeneralize novel labels more than humans.
Method
NVRD provides novel, open-ended visual stimuli, each with up to 20 compounding perturbation levels across 11 axes. Models are evaluated via in-context learning, with their responses scored from token probabilities.
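As a rough illustration of scoring a response from token probabilities, the sketch below turns two answer-token logits into a single acceptance probability. The summary does not say which tokens or which normalization the authors use, so the yes/no answer set and the function name `yes_probability` are assumptions made only for this example.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Probability of "yes" renormalized over the two answers {yes, no}.

    Assumption: the model is asked a yes/no question such as
    "Is this a dax?" and we can read the logits of both answer tokens.
    """
    # Subtract the larger logit first so exp() cannot overflow.
    largest = max(logit_yes, logit_no)
    exp_yes = math.exp(logit_yes - largest)
    exp_no = math.exp(logit_no - largest)
    return exp_yes / (exp_yes + exp_no)


if __name__ == "__main__":
    # Hypothetical logits, not taken from the paper.
    print(yes_probability(2.0, 0.0))
```

Renormalizing over only the two answer tokens keeps the score comparable across stimuli even when the model places probability mass on unrelated tokens.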
In practice
- Test VLM generalization with genuinely novel, perturbed stimuli.
- Prioritize shape-based changes for robust concept evaluation.
- Compare VLM outputs against human judgments to detect overgeneralization.
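The comparison against human judgments can be sketched as follows. This is a minimal example under stated assumptions, not the paper's analysis: it assumes one acceptance score in [0, 1] per perturbation level for both the model and the human raters, uses a rank correlation to check shared sensitivity to perturbations, and counts how often the model still accepts a stimulus that humans reject. The 0.5 threshold, the function names, and the numbers are all hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def overgeneralization_rate(model_accept, human_accept, threshold=0.5):
    """Fraction of human-rejected stimuli that the model still accepts."""
    rejected = [m for m, h in zip(model_accept, human_accept) if h < threshold]
    if not rejected:
        return 0.0
    return sum(m >= threshold for m in rejected) / len(rejected)


# Made-up acceptance scores for five increasing perturbation levels.
human_scores = [0.95, 0.80, 0.55, 0.30, 0.10]
model_scores = [0.99, 0.95, 0.90, 0.75, 0.40]

if __name__ == "__main__":
    print("rank correlation:", spearman(model_scores, human_scores))
    print("overgeneralization rate:",
          overgeneralization_rate(model_scores, human_scores))
```

A high rank correlation together with a high overgeneralization rate would match the pattern described in the summary: the model tracks the same perturbations as humans but keeps applying the label past the point where humans stop.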
Topics
- Vision-Language Models
- Concept Learning
- Dataset Generation
- Visual Generalization
- In-Context Learning
- Human-AI Comparison
- Visual Perturbations
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.