Would you still call this Dax? Novel Visual References in VLMs and Humans

Source: cs.CL updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper introduces the Novel Visual References Dataset (NVRD): 19,176 images covering 90 visual concepts, each with up to 20 perturbed versions. The dataset spans known, composed, and fully novel entities, and is used to study how vision-language models (VLMs) and humans map novel visual references to language, especially when the new label contradicts prior knowledge. The authors evaluated three open-source VLMs (Qwen-2 VL 7B, Idefics-3 8B, Molmo-2 8B) and two closed-source models (GPT-4o Mini, Gemini-2.5 Flash), alongside 2,400 human judgments. The models struggle to acquire novel concepts in-context when those concepts conflict with pre-training knowledge. Model and human sensitivity to visual perturbations is correlated, but models significantly overgeneralize, applying learned labels to stimuli that humans would reject. NVRD is offered as a benchmark for research on visual concept learning.

Key takeaway

If you develop vision-language models, expect them to overgeneralize novel visual concepts, especially when those concepts contradict pre-training knowledge. Test concept acquisition and generalization on diverse, genuinely novel, and systematically perturbed visual data such as NVRD. Doing so exposes overgeneralization before deployment and supports more reliable behavior on new or unfamiliar objects.

Key insights

VLMs overgeneralize novel visual concepts, especially when the new concept conflicts with prior knowledge, even though their sensitivity to perturbations correlates with that of humans.
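The insight above combines two separate measurements: whether model and human acceptance move together as perturbation increases, and whether the model accepts the label more often overall. The sketch below illustrates that distinction with a rank correlation and a mean acceptance gap. The function names, the choice of Spearman correlation, and the example numbers are illustrative assumptions, not the paper's reported metrics or data.

```python
import math

def ranks(values):
    """Return 1-based ranks, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def overgeneralization_gap(model_accept, human_accept):
    """Mean of (model acceptance - human acceptance) over stimuli.
    A positive value means the model accepts the label more often."""
    diffs = [m - h for m, h in zip(model_accept, human_accept)]
    return sum(diffs) / len(diffs)

# Hypothetical acceptance rates at four increasing perturbation levels.
human = [0.95, 0.80, 0.40, 0.10]
model = [0.99, 0.95, 0.85, 0.60]

rho = spearman(model, human)
gap = overgeneralization_gap(model, human)
```

With these made-up numbers the two curves are perfectly rank-correlated while the model's acceptance stays well above the human rate, which is the pattern the insight describes: similar sensitivity, but a much looser boundary for the learned label.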

Method

NVRD builds novel, open-ended visual stimuli with up to 20 compounding perturbation levels across 11 axes. Models are evaluated via in-context learning, with their responses scored from token probabilities.
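One way to turn token probabilities into a graded acceptance score is to compare the log-probabilities a model assigns to an affirmative and a negative answer token after being shown in-context examples of the novel label. The sketch below assumes a yes/no prompt format and a two-token renormalization; the summary does not specify the paper's exact prompt or scoring rule, so the function names and all values here are hypothetical.

```python
import math

def acceptance_probability(logprob_yes: float, logprob_no: float) -> float:
    """Renormalize two answer-token log-probabilities into P(yes).

    Subtracting the maximum before exponentiating keeps the
    computation numerically stable for very negative log-probs.
    """
    m = max(logprob_yes, logprob_no)
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)

def acceptance_curve(logprobs_by_level):
    """Map each perturbation level to P(yes).

    `logprobs_by_level` is a dict {level: (logprob_yes, logprob_no)};
    the levels and values are assumed inputs obtained from a model API
    that exposes token log-probabilities.
    """
    return {
        level: acceptance_probability(lp_yes, lp_no)
        for level, (lp_yes, lp_no) in sorted(logprobs_by_level.items())
    }

# Hypothetical log-probs for "Is this a dax?" at three perturbation levels.
example = {
    0: (math.log(0.90), math.log(0.05)),
    10: (math.log(0.60), math.log(0.20)),
    20: (math.log(0.30), math.log(0.30)),
}
curve = acceptance_curve(example)
```

A curve that stays high as the perturbation level grows would indicate that the model keeps applying the label to increasingly dissimilar stimuli.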

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.