Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth
Summary
Vision-Language Models (VLMs) operating in high-resolution environments often exhibit "lazy perception": they mimic active perception operations such as zooming and panning without truly relying on the views those operations return. This behavior stems from a learning asymmetry. Coarse global views, combined with language priors, are sufficient for moderate task accuracy, which removes the incentive for VLMs to learn complex multi-step visual search. To address this, the "Starve to Perceive" training paradigm restricts each visual observation to a tight token budget, forcing the model to actively engage in perception because no single view provides enough information for task completion. This method, implemented as a minimal plug-in modification to standard post-training, achieves an average relative improvement of 5% across various benchmarks without requiring auxiliary losses, reward shaping, or architectural changes.
Key takeaway
If you are a research scientist developing or deploying VLMs in complex visual environments, consider integrating the "Starve to Perceive" paradigm. By forcing the model to genuinely learn visual search, it can improve active perception and make performance more robust, without architectural changes or complex reward shaping.
Key insights
Constraining visual bandwidth forces VLMs to genuinely learn and utilize active perception strategies.
Principles
- Learning asymmetry hinders active perception.
- Incentivize active looking by limiting how much any single view can convey.
Method
The "Starve to Perceive" paradigm constrains visual bandwidth by restricting each observation to a tight token budget, making active perception the only viable strategy for task completion.
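The summary does not say how the per-observation budget is enforced. One plausible reading is that every view (global or zoomed) is downscaled until its patch grid fits the budget. The sketch below illustrates that reading only; the function name `fit_to_token_budget`, the 14-pixel patch size, and the one-token-per-patch accounting are assumptions made for illustration, not details taken from the original article.

```python
import math


def fit_to_token_budget(width, height, max_tokens, patch=14):
    """Return a (width, height) in pixels whose patch grid has at most
    `max_tokens` cells, roughly preserving the aspect ratio.

    Assumes (hypothetically) one visual token per `patch` x `patch` patch.
    """
    if max_tokens < 1 or width < 1 or height < 1:
        raise ValueError("width, height and max_tokens must be positive")

    # Patch grid of the view at its native resolution.
    cols = max(1, width // patch)
    rows = max(1, height // patch)
    if cols * rows <= max_tokens:
        return cols * patch, rows * patch  # already within budget

    # Shrink both sides by the same factor so the grid area meets the budget.
    scale = math.sqrt(max_tokens / (cols * rows))
    new_cols = max(1, math.floor(cols * scale))
    new_rows = max(1, math.floor(rows * scale))

    # For extreme aspect ratios the 1-cell minimum can overshoot the budget,
    # so clamp the longer side against what the shorter side leaves over.
    if new_cols >= new_rows:
        new_cols = min(new_cols, max_tokens // new_rows)
    else:
        new_rows = min(new_rows, max_tokens // new_cols)

    return new_cols * patch, new_rows * patch


if __name__ == "__main__":
    # A 2800x1400 view under a 256-token budget.
    print(fit_to_token_budget(2800, 1400, max_tokens=256))
```

Under this reading the budget applies to every observation alike, so a zoomed crop costs the same number of tokens as the global view but covers far fewer source pixels per token.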
In practice
- Apply token budget constraints during VLM training.
- Integrate into existing post-training pipelines.
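When choosing a budget for the steps above, it may help to estimate how much of the original image each visual token has to summarize, for a global view versus a zoomed crop. The sketch below is a back-of-the-envelope aid, not part of the published method; `Box`, `source_pixels_per_token`, the 14-pixel patch size, and the example image and budget sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangular region of the full-resolution image, in pixels."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top


def source_pixels_per_token(box, max_tokens, patch=14):
    """Best-case number of original pixels each visual token must cover
    when `box` is shown under a budget of `max_tokens` tokens.

    Assumes (hypothetically) one token per `patch` x `patch` patch and that
    the view uses the full budget whenever its native grid exceeds it.
    """
    native_tokens = max(1, box.width // patch) * max(1, box.height // patch)
    tokens = min(native_tokens, max_tokens)
    return box.width * box.height / tokens


if __name__ == "__main__":
    budget = 256
    full_view = Box(0, 0, 4096, 4096)          # whole image
    zoomed = Box(1024, 1024, 1536, 1536)       # a 512x512 crop
    print(source_pixels_per_token(full_view, budget))
    print(source_pixels_per_token(zoomed, budget))
```

With these example numbers the global view packs 64 times more source pixels into each token than the crop does, which is the kind of gap that makes a single coarse view insufficient and zooming worthwhile.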
Topics
- Vision-Language Models
- Active Perception
- Lazy Perception
- Starve to Perceive
- Visual Bandwidth Constraint
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.