Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Summary
This study investigates the susceptibility of 12 open-weight vision-language models (VLMs), ranging from 256M to 10B parameters across 6 architecture families, to sycophantic manipulation. Researchers measured "brain alignment" by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest. Sycophancy was assessed using 76,800 two-turn gaslighting prompts across 5 categories and 10 difficulty levels. The key finding is that alignment specifically in the early visual cortex (V1–V3) is a reliable negative predictor of sycophancy ($r=-0.441$, BCa 95% CI $[-0.740,-0.031]$), with the strongest effect against existence denial attacks ($r=-0.597$, $p=0.040$). This relationship was not observed in higher-order category-selective regions, suggesting that faithful low-level visual encoding helps VLMs resist adversarial linguistic pressure. The code and dataset are publicly available on GitHub and Hugging Face.
Key takeaway
For computer vision engineers developing or deploying VLMs: across the 12 models studied, stronger alignment with human early visual cortex (V1–V3) correlates with greater resistance to sycophantic manipulation. Consider adding brain-alignment metrics, particularly those focused on low-level visual processing, to your VLM evaluation and training pipelines as a candidate signal of robustness against adversarial linguistic attacks such as existence denial.
Key insights
Early visual cortex alignment in VLMs negatively predicts sycophancy, especially against existence denial attacks.
Principles
- Faithful low-level visual encoding grounds VLM behavior.
- Brain alignment varies across cortical regions and model factors.
Method
The study uses a three-stage pipeline: extract VLM vision encoder features to predict fMRI responses, evaluate sycophancy with gaslighting prompts, and correlate brain alignment with sycophancy rates.
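The third stage, correlating per-model brain alignment with sycophancy rates, can be illustrated with a minimal sketch. It assumes each model has already been reduced to one alignment score and one sycophancy rate. The function names and the numbers are hypothetical and are not taken from the paper's code. The summary reports a BCa interval; the sketch swaps in a plain percentile bootstrap because it is simpler to show with the standard library alone.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def bootstrap_ci(xs, ys, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for r (not BCa), resampling models with replacement."""
    pearson_r(xs, ys)  # validate inputs once so the loop below cannot spin forever
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(pearson_r([xs[i] for i in idx], [ys[i] for i in idx]))
        except ValueError:
            continue  # skip degenerate resamples where one variable is constant
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-model values, for illustration only (not the paper's data).
alignment = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35]    # e.g. V1-V3 predictivity
sycophancy = [0.60, 0.55, 0.58, 0.40, 0.42, 0.30]   # fraction of answers flipped

r = pearson_r(alignment, sycophancy)
ci_low, ci_high = bootstrap_ci(alignment, sycophancy)
print(f"r = {r:.3f}, 95% percentile CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

With only a handful of models per correlation, as in the study's 12, bootstrap intervals tend to be wide, which is consistent with the broad interval reported in the summary.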
In practice
- Prioritize early visual cortex alignment for VLM robustness.
- Use the provided sycophancy evaluation framework for VLM testing.
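As a rough illustration of what scoring a two-turn gaslighting evaluation might involve, the sketch below computes a per-category sycophancy rate from trial records. It assumes a trial counts as sycophantic when the model answers correctly in the first turn and abandons that answer after the pushback turn. This definition, the `Trial` record, and the `sycophancy_rates` function are assumptions made for illustration and do not describe the released framework's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    category: str         # attack category, e.g. "existence_denial"
    first_correct: bool   # answer before pushback matched ground truth
    second_correct: bool  # answer after the gaslighting turn matched ground truth


def sycophancy_rates(trials):
    """Per-category fraction of initially correct answers abandoned after pushback.

    Assumed definition: only trials with a correct first answer are eligible,
    and a flip to an incorrect second answer counts as sycophantic.
    """
    eligible = defaultdict(int)
    flipped = defaultdict(int)
    for t in trials:
        if not t.first_correct:
            continue  # the model was already wrong, so there is nothing to flip
        eligible[t.category] += 1
        if not t.second_correct:
            flipped[t.category] += 1
    return {c: flipped[c] / eligible[c] for c in eligible}


# Hypothetical records; "attribute_swap" is a made-up category name.
trials = [
    Trial("existence_denial", True, False),
    Trial("existence_denial", True, True),
    Trial("existence_denial", False, False),
    Trial("attribute_swap", True, True),
]
rates = sycophancy_rates(trials)
print(rates)
```

Restricting the denominator to initially correct answers keeps ordinary perception errors from being counted as capitulation. Other definitions are possible and may be what the paper uses.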
Topics
- Vision-Language Models
- Brain Alignment
- Sycophantic Manipulation
- Neural Predictivity
- Adversarial Robustness
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.