Egocentric Bias in Vision-Language Models

Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A new diagnostic benchmark, FlipSet, evaluates Level-2 Visual Perspective Taking (L2 VPT) in 103 Vision-Language Models (VLMs) and reveals a systematic "egocentric bias." The benchmark requires models to simulate a 180-degree rotation of 2D character strings as seen from another agent's viewpoint, which isolates the spatial transformation from 3D scene complexity. Results show that 91.3% of VLMs perform below the 25% chance level, with an average accuracy of 8.96%. Of all errors, 75.88% are egocentric: the model reproduces the string as seen from the camera's viewpoint instead of the other agent's. Control experiments on 24 models further expose a compositional deficit. In isolation, models achieve high Theory-of-Mind (ToM) accuracy (90.4%) and slightly above-chance mental rotation (26.1%), but accuracy falls to 10.3% on L2 VPT, where the two abilities must be combined. This indicates that VLMs lack mechanisms to bind social awareness to spatial operations.

Key takeaway

For computer vision engineers developing multimodal AI, this research highlights a critical limitation: current VLMs fail at Level-2 visual perspective taking because they cannot integrate social awareness with spatial reasoning. Prioritize architectural innovations that support model-based spatial reasoning, along with targeted training on perspective-invariant representations, rather than relying solely on scaling existing pattern-matching mechanisms, to move toward more human-like situated social intelligence.

Key insights

VLMs exhibit a profound egocentric bias, failing to integrate social awareness with spatial transformation for Level-2 visual perspective taking.

Method

FlipSet uses 2D character string rotation tasks and control conditions to isolate Theory of Mind, Mental Rotation, and L2 VPT, evaluating 103 VLMs under zero-shot conditions with diagnostic error analysis.
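The diagnostic error analysis described above can be illustrated with a minimal sketch. It is not FlipSet's actual code: the character map `ROTATE_180`, the function names, and the three error labels are assumptions made for illustration, and the real benchmark's character set and scoring rules may differ. The idea is only that a response matching the 180-degree-rotated string counts as correct, while a response that repeats the string as the camera sees it counts as an egocentric error.

```python
from collections import Counter

# Hypothetical map: what each character looks like after a 180-degree
# rotation in the image plane. Illustrative only, not FlipSet's character set.
ROTATE_180 = {
    "0": "0", "6": "9", "8": "8", "9": "6",
    "b": "q", "q": "b", "d": "p", "p": "d",
    "n": "u", "u": "n", "o": "o", "s": "s", "x": "x", "z": "z",
}


def rotate_180(text: str) -> str:
    """Rotate a string by 180 degrees: reverse the order, then flip each glyph."""
    return "".join(ROTATE_180[ch] for ch in reversed(text))


def classify_response(camera_view: str, response: str) -> str:
    """Label a response as 'correct', 'egocentric', or 'other'."""
    target = rotate_180(camera_view)
    if response == target:
        # Checked first: a rotation-symmetric string (e.g. "69") has the same
        # target and camera view, so it cannot diagnose egocentric errors.
        return "correct"
    if response == camera_view:
        return "egocentric"  # the model repeated what the camera sees
    return "other"


def egocentric_error_share(pairs) -> float:
    """Fraction of errors (not of all responses) that are egocentric."""
    counts = Counter(classify_response(view, resp) for view, resp in pairs)
    errors = counts["egocentric"] + counts["other"]
    return counts["egocentric"] / errors if errors else 0.0
```

For example, with the camera view `"bud"`, the other agent would read `"pnq"`; a response of `"bud"` is labelled egocentric, and a response of `"dub"` (reversed but not flipped) is labelled other. The share returned by `egocentric_error_share` is computed over errors only, which is how the 75.88% figure in the summary is phrased.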

Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.