Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Life Sciences & Biology, Research Methodology & Innovation · Depth: Expert, extended

Summary

This study investigates the susceptibility of 12 open-weight vision-language models (VLMs), ranging from 256M to 10B parameters across 6 architecture families, to sycophantic manipulation. Researchers measured "brain alignment" by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest. Sycophancy was assessed using 76,800 two-turn gaslighting prompts across 5 categories and 10 difficulty levels. The key finding is that alignment specifically in the early visual cortex (V1–V3) is a reliable negative predictor of sycophancy ($r=-0.441$, BCa 95% CI $[-0.740,-0.031]$), with the strongest effect against existence denial attacks ($r=-0.597$, $p=0.040$). This relationship was not observed in higher-order category-selective regions, suggesting that faithful low-level visual encoding helps VLMs resist adversarial linguistic pressure. The code and dataset are publicly available on GitHub and Hugging Face.
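To make the sycophancy measurement concrete, here is a minimal sketch of how a per-category rate could be tallied from two-turn trials. The summary does not give the paper's scoring rule, so the definition used here (an initially correct answer abandoned after the user's pushback counts as sycophantic) is an assumption. The `Trial` record layout and the `attribute_swap` category name are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    """One two-turn trial (hypothetical record layout, not the paper's schema)."""
    category: str         # e.g. "existence_denial"
    correct_turn1: bool   # model answered correctly before any pressure
    correct_turn2: bool   # model was still correct after the user's pushback


def sycophancy_rate_by_category(trials):
    """Fraction of initially correct answers that were abandoned after pushback.

    Assumed definition: trials where the first answer was already wrong are
    skipped, because there is no correct answer to give up.
    """
    flipped = defaultdict(int)
    eligible = defaultdict(int)
    for t in trials:
        if not t.correct_turn1:
            continue
        eligible[t.category] += 1
        if not t.correct_turn2:
            flipped[t.category] += 1
    return {c: flipped[c] / eligible[c] for c in eligible}


# Toy data for illustration only; not results from the study.
trials = [
    Trial("existence_denial", True, False),
    Trial("existence_denial", True, True),
    Trial("existence_denial", False, False),
    Trial("attribute_swap", True, True),   # made-up category name
    Trial("attribute_swap", True, True),
]
rates = sycophancy_rate_by_category(trials)
print(rates)
```

Conditioning on a correct first answer keeps the rate from being inflated by questions the model could not answer in the first place. Other definitions are possible.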

Key takeaway

For computer vision engineers developing or deploying VLMs: stronger alignment with the human early visual cortex (V1–V3) correlates with greater resistance to sycophantic manipulation. Consider adding brain-alignment metrics, particularly those focused on low-level visual processing, to your VLM evaluation and training pipelines as a signal of robustness against adversarial linguistic attacks such as existence denial. The evidence is correlational and drawn from 12 models, so treat the metric as a diagnostic rather than a guaranteed defense.

Key insights

Early visual cortex alignment in VLMs negatively predicts sycophancy, especially against existence denial attacks.

Principles

Method

The study uses a three-stage pipeline: extract VLM vision encoder features to predict fMRI responses, evaluate sycophancy with gaslighting prompts, and correlate brain alignment with sycophancy rates.
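The third stage, correlating per-model brain alignment with per-model sycophancy, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The 12 score pairs are synthetic, the function names are hypothetical, and the interval is a plain percentile bootstrap, not the BCa interval the paper reports.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def percentile_bootstrap_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for r, resampling (x, y) pairs with replacement.

    Simpler than BCa: no bias correction or acceleration term is applied.
    """
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # degenerate resample: correlation is undefined
        stats.append(pearson_r(xs, ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic per-model scores (12 models); NOT values from the study.
v1v3_alignment = [0.08, 0.11, 0.13, 0.15, 0.18, 0.20,
                  0.22, 0.25, 0.27, 0.30, 0.33, 0.35]
sycophancy_rate = [0.62, 0.48, 0.66, 0.41, 0.55, 0.37,
                   0.58, 0.33, 0.45, 0.52, 0.30, 0.39]

r = pearson_r(v1v3_alignment, sycophancy_rate)
ci_low, ci_high = percentile_bootstrap_ci(v1v3_alignment, sycophancy_rate)
print(f"r = {r:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

With only 12 models the interval is wide, which is why an interval estimate is more informative than the point estimate alone. If SciPy is available, `scipy.stats.bootstrap` with `paired=True` and `method="BCa"` would be the closer match to the interval type named in the summary.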

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.