Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Vision-Language Models (VLMs) operating in high-resolution environments often exhibit "lazy perception": they go through the motions of active perception, issuing operations such as zooming and panning, without actually relying on the views those operations return. The behavior stems from a learning asymmetry. A coarse global view combined with language priors is enough to reach moderate task accuracy, so the model has little incentive to learn complex multi-step visual search. The "Starve to Perceive" training paradigm addresses this by restricting each visual observation to a tight token budget, so that no single view carries enough information to complete the task and the model must actively gather more. Implemented as a minimal plug-in modification to standard post-training, the method achieves an average relative improvement of 5% across various benchmarks without auxiliary losses, reward shaping, or architectural changes.

Key takeaway

Research scientists developing or deploying VLMs in complex visual environments should consider the "Starve to Perceive" paradigm. By forcing the model to genuinely learn visual search, it improves active perception (an average relative gain of 5% across benchmarks) without architectural changes, auxiliary losses, or reward shaping.

Key insights

Constraining visual bandwidth forces VLMs to genuinely learn and utilize active perception strategies.

Method

The "Starve to Perceive" paradigm constrains visual bandwidth by restricting each observation to a tight token budget, making active perception the only viable strategy for task completion.
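A minimal sketch of what a per-observation token budget might look like, assuming a ViT-style encoder that emits one visual token per square patch. The function names, the 14-pixel patch size, and the budget of 256 tokens are illustrative assumptions and are not taken from the paper.

```python
import math


def count_tokens(width, height, patch=14):
    """Visual tokens for an image, assuming one token per patch x patch tile."""
    return max(1, width // patch) * max(1, height // patch)


def fit_to_token_budget(width, height, budget, patch=14):
    """Return (new_width, new_height) whose patch grid has at most `budget` tokens.

    Hypothetical helper: images already within budget are only snapped to the
    patch grid; larger ones are shrunk with a roughly preserved aspect ratio.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    cols = max(1, width // patch)
    rows = max(1, height // patch)
    if cols * rows > budget:
        # Scale both grid dimensions by the same factor.
        scale = math.sqrt(budget / (cols * rows))
        cols = min(budget, max(1, math.floor(cols * scale)))
        # Cap rows so the product can never exceed the budget.
        rows = max(1, min(math.floor(rows * scale), budget // cols))
    return cols * patch, rows * patch


BUDGET = 256  # assumed per-view token budget

# Global view of a 4096 x 4096 image: heavily downsampled to fit the budget.
global_view = fit_to_token_budget(4096, 4096, BUDGET)

# Zoomed view: a 512 x 512 crop of the same image under the same budget,
# so each token covers far fewer source pixels than in the global view.
zoom_view = fit_to_token_budget(512, 512, BUDGET)

print(global_view, count_tokens(*global_view))
print(zoom_view, count_tokens(*zoom_view))
```

Under these assumptions the global view and the zoomed crop both cost at most 256 tokens, but the crop spends them on a much smaller region. That is the sense in which a tight budget makes zooming informative rather than optional.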

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.