Would you still call this Dax? Novel Visual References in VLMs and Humans

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper introduces the Novel Visual References Dataset (NVRD), comprising 19,176 images across 90 visual concepts, each with up to 20 perturbed versions. The dataset covers known, composed, and fully novel entities, and is used to study how vision-language models (VLMs) and humans map novel visual references to language, especially when the reference contradicts prior knowledge. The researchers evaluated three open-source VLMs (Qwen-2 VL 7B, Idefics-3 8B, Molmo-2 8B) and two closed-source models (GPT-4o Mini, Gemini-2.5 Flash), alongside 2,400 human judgments. The findings indicate that models struggle to acquire novel concepts in-context when those concepts conflict with pre-training knowledge. Models and humans show correlated sensitivity to visual perturbations, but models significantly overgeneralize, applying learned labels to stimuli that humans would reject. NVRD is offered as a benchmark for visual concept learning research.

Key takeaway

If you develop vision-language models, expect them to overgeneralize novel visual concepts, especially when those concepts contradict pre-training knowledge. Test concept acquisition and generalization rigorously on diverse, genuinely novel, and systematically perturbed visual data, such as NVRD. Doing so helps identify and mitigate overgeneralization risks, making model behavior more reliable in real-world applications that involve new or unfamiliar objects.

Key insights

VLMs overgeneralize novel visual concepts, especially when those concepts conflict with prior knowledge, even though their sensitivity to perturbations correlates with that of humans.

Principles

Shape bias is central to human and machine concept generalization.
Prior knowledge can hinder in-context acquisition of novel concepts.
Models overgeneralize novel labels more than humans.

Method

NVRD provides novel, open-ended visual stimuli with up to 20 compounding perturbation levels across 11 axes. Models are evaluated via in-context learning, with responses scored from token probabilities.
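
The summary does not say how token probabilities are turned into scores, so the following is only a minimal sketch of one common approach, not the paper's procedure. It assumes the model is asked a yes/no question ("Is this a dax?") and that the logits of the "yes" and "no" answer tokens are available. The function names and the input layout are hypothetical.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Renormalize over the two answer tokens (a two-way softmax).

    Subtracting the larger logit first keeps exp() from overflowing.
    """
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def acceptance_curve(logits_by_level: dict[int, tuple[float, float]]) -> dict[int, float]:
    """Map each perturbation level to P(yes) for the learned label.

    `logits_by_level` is a hypothetical layout: level -> (logit_yes, logit_no).
    """
    return {
        level: yes_probability(logit_yes, logit_no)
        for level, (logit_yes, logit_no) in sorted(logits_by_level.items())
    }
```

Plotting such a curve against perturbation level gives a per-concept picture of how quickly a model stops applying a newly introduced label as the stimulus drifts away from the original.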

In practice

Test VLM generalization with genuinely novel, perturbed stimuli.
Prioritize shape-based changes for robust concept evaluation.
Compare VLM outputs against human judgments for overgeneralization.
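
As a rough illustration of the last point, the sketch below compares model and human acceptance rates across perturbation levels. It reports two numbers: a Pearson correlation (do both react to the same perturbations?) and a mean positive gap (how much more often does the model accept the label than humans do?). Both metrics and all function names are assumptions chosen for illustration. The summary does not state which statistics the paper uses.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def overgeneralization_gap(model_accept: list[float], human_accept: list[float]) -> float:
    """Mean amount by which the model's acceptance rate exceeds the human rate.

    Levels where humans accept more often than the model contribute zero.
    """
    if len(model_accept) != len(human_accept) or not model_accept:
        raise ValueError("need two non-empty sequences of equal length")
    excess = [max(0.0, m - h) for m, h in zip(model_accept, human_accept)]
    return sum(excess) / len(excess)


# Made-up acceptance rates per perturbation level, for illustration only.
model_rates = [0.9, 0.8, 0.7, 0.6]
human_rates = [0.9, 0.6, 0.3, 0.1]
r = pearson(model_rates, human_rates)
gap = overgeneralization_gap(model_rates, human_rates)
```

A high correlation together with a large gap would match the pattern described above: the model tracks the same perturbations as humans but keeps applying the label where humans reject it.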

Topics

Vision-Language Models
Concept Learning
Dataset Generation
Visual Generalization
In-Context Learning
Human-AI Comparison
Visual Perturbations

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.