Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Summary
A recent study investigates the persistent lack of reasoning capabilities in Vision-Language Models (VLMs), attributing it to a reporting bias in their training data. The bias arises because people describing visual content tend to omit tacit information needed to supervise certain types of reasoning: a caption is more likely to read "at the game today!" than "37 people standing behind a field". Researchers analyzed the training data for OpenCLIP, LLaVA-1.5, and Molmo, finding that reporting bias leads to underrepresentation of four key reasoning skills: spatial, temporal, negation, and counting. Benchmarks confirmed that VLMs perform poorly on these suppressed reasoning types and, crucially, that scaling data size, model size, or language diversity does not by itself foster these skills. However, incorporating annotations collected specifically to capture tacit information proved effective in improving VLM performance.
Key takeaway
For research scientists developing Vision-Language Models, relying solely on increased data or model scale will not automatically yield advanced reasoning capabilities. You should prioritize intentional training data curation methods that explicitly capture tacit information, particularly for spatial, temporal, negation, and counting skills, to overcome inherent reporting biases and improve VLM reasoning.
Key insights
Reporting bias in VLM training data hinders reasoning, and scaling alone does not resolve it.
Principles
- Human communication omits tacit visual information.
- Reporting bias suppresses specific reasoning skills.
Method
The study examined the training data behind OpenCLIP, LLaVA-1.5, and Molmo through the lens of theories from pragmatics, identified which reasoning skills are underrepresented, and then validated the findings with curated benchmarks and targeted annotations.
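As a rough illustration of what auditing captions for underrepresented skills could look like, the sketch below counts how many captions contain at least one surface cue for each of the four skills. It is not the study's method: the cue lists in `SKILL_CUES` and the function name `skill_coverage` are hypothetical, chosen only for this example, and simple keyword matching will miss or over-count many cases.

```python
import re
from collections import Counter

# Hypothetical cue lexicons: illustrative assumptions, not the study's criteria.
SKILL_CUES = {
    "spatial": [r"\bleft of\b", r"\bright of\b", r"\bbehind\b",
                r"\bin front of\b", r"\babove\b", r"\bbelow\b"],
    "temporal": [r"\bbefore\b", r"\bafter\b", r"\bwhile\b", r"\bthen\b"],
    "negation": [r"\bno\b", r"\bnot\b", r"\bwithout\b", r"n't\b"],
    "counting": [r"\b\d+\b",
                 r"\b(?:two|three|four|five|six|seven|eight|nine|ten)\b"],
}

# One compiled alternation per skill, so each caption is scanned once per skill.
COMPILED = {
    skill: re.compile("|".join(patterns))
    for skill, patterns in SKILL_CUES.items()
}


def skill_coverage(captions):
    """Return, per skill, the fraction of captions with at least one cue."""
    counts = Counter()
    for caption in captions:
        text = caption.lower()
        for skill, pattern in COMPILED.items():
            if pattern.search(text):
                counts[skill] += 1
    total = len(captions)
    return {
        skill: (counts[skill] / total if total else 0.0)
        for skill in SKILL_CUES
    }


if __name__ == "__main__":
    sample = [
        "at the game today!",
        "37 people standing behind a field",
        "a dog without a leash",
        "she waves before leaving",
    ]
    for skill, fraction in skill_coverage(sample).items():
        print(f"{skill}: {fraction:.2f}")
```

On a real corpus, low fractions for a skill would only be a first hint of underrepresentation; a careful audit would need better linguistic criteria than keyword matching.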
In practice
- Curate training data for tacit information.
- Focus on spatial, temporal, negation, counting.
Topics
- Vision-Language Models
- Reporting Bias
- Data Curation
- Reasoning Capabilities
- Pragmatics
Best for: Research Scientist, AI Researcher, AI Scientist, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.