Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

2025-04-05 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

Foveated Reasoner (FoveateR) is an autoregressive vision-language framework that reduces the computational cost of high-resolution images in Vision-Language Models (VLMs) by integrating foveation and reasoning into a single decoding trajectory. Inspired by human visual foveation, FoveateR starts from a low-resolution view of the image and acquires high-resolution evidence from specific regions only when needed, injecting it into the ongoing generation without resetting the model's hidden state. This avoids the repeated decoding passes and interrupted reasoning state of multi-pass methods, as well as the token overhead and format brittleness of text-grounded methods. Training has two stages: cold-start supervised fine-tuning to bootstrap foveation behavior, followed by reinforcement learning with a Group Relative Policy Optimization (GRPO) objective that jointly improves evidence acquisition and task accuracy, with foveated region regularization discouraging the trivial "see-everything" solution. Experiments show that FoveateR, built on Qwen2.5-VL (3B and 7B), achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks, including Visual CoT and V* Bench.

Key takeaway

For Computer Vision Engineers and Research Scientists developing efficient VLMs, FoveateR offers an alternative to multi-pass and text-grounded visual focusing. Consider a stateful, single-pass foveation mechanism driven by non-linguistic actions to reduce compute overhead and keep reasoning continuous. Two-stage training that ends with RL can help models learn adaptive, evidence-efficient foveation policies, improving accuracy under strict visual-token budgets.

Key insights

FoveateR unifies stateful, action-based visual foveation and textual reasoning within a single VLM decoding trajectory.

Principles

Foveation should be stateful and action-based, not text-grounded or multi-pass.
Adaptive foveation improves accuracy under tight visual-token budgets.
Reinforcement learning can optimize foveation policies without human-annotated trajectories.
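
To make the token-budget argument concrete, here is a small arithmetic sketch. It assumes each visual token covers a 28x28-pixel tile; that figure and the image sizes are illustrative assumptions, not values stated in the source.

```python
import math

# Assumption for illustration: one visual token per 28x28-pixel tile.
PIXELS_PER_TOKEN_SIDE = 28


def visual_tokens(width: int, height: int, side: int = PIXELS_PER_TOKEN_SIDE) -> int:
    """Approximate visual-token count for an image of the given pixel size."""
    return math.ceil(width / side) * math.ceil(height / side)


# Hypothetical image sizes, chosen only to show the scale of the difference.
full_res = visual_tokens(1792, 1344)  # encode the whole image at high resolution
low_res = visual_tokens(448, 336)     # coarse first view
crop = visual_tokens(224, 224)        # one high-resolution region fetched on demand
foveated = low_res + crop

print(full_res, foveated)
```

Under these assumptions the full-resolution encoding costs 3072 visual tokens, while the coarse view plus one crop costs 256. The saving shrinks as more or larger regions are foveated, which is why the policy has to be selective.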

Method

FoveateR uses two-stage training: cold-start supervised fine-tuning (SFT) to establish initial foveation behavior, then RL with GRPO and foveated region regularization to optimize accuracy and evidence efficiency.
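
The RL stage can be illustrated with a minimal sketch of group-relative advantages computed over a shaped reward. The reward form (correctness minus a penalty proportional to the foveated area), the `penalty` weight, and the rollout values are all assumptions for illustration; the source does not give the exact regularizer, and the clipped policy-gradient objective itself is omitted.

```python
from statistics import mean, pstdev


def shaped_reward(correct: bool, foveated_area_frac: float, penalty: float = 0.5) -> float:
    """Task reward minus a penalty proportional to the fraction of the image foveated.

    The linear penalty and its weight are hypothetical, not the paper's formula.
    """
    return (1.0 if correct else 0.0) - penalty * foveated_area_frac


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question:
# (answer correct?, fraction of the image foveated at high resolution)
rollouts = [(True, 0.05), (True, 1.00), (False, 0.05), (False, 0.00)]
rewards = [shaped_reward(c, a) for c, a in rollouts]
advantages = group_relative_advantages(rewards)
```

In this toy group, the rollout that answers correctly from a small region receives the largest advantage, while the correct rollout that foveates the whole image is still rewarded but less so. That is the intended pressure away from the "see-everything" solution.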

In practice

Integrate foveation as non-linguistic actions within a single decoding pass.
Use RL to fine-tune foveation policies for problem-adaptive behavior.
Penalize large foveated regions to encourage evidence-efficient solutions.
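
The first point above can be sketched as a toy decoding loop in which a foveate action appends high-resolution evidence to the running context instead of restarting generation. Everything here is a hypothetical stand-in: `DecodeState` plays the role of the model's cached state, `encode_region` replaces a real vision encoder, and the scripted `steps` replace a learned policy.

```python
from dataclasses import dataclass, field


@dataclass
class DecodeState:
    """Stand-in for the model's running context (for example, a KV cache)."""
    context: list[str] = field(default_factory=list)
    visual_tokens_used: int = 0


def encode_region(box: tuple[int, int, int, int], side: int = 28) -> list[str]:
    """Hypothetical encoder: one placeholder token per side x side tile in the box."""
    x0, y0, x1, y1 = box
    cols = -(-(x1 - x0) // side)  # ceiling division
    rows = -(-(y1 - y0) // side)
    return [f"<hi:{r},{c}>" for r in range(rows) for c in range(cols)]


def decode(policy_steps, low_res_tokens: list[str], budget: int) -> DecodeState:
    """Run one decoding trajectory; foveate actions append evidence in place."""
    state = DecodeState(context=list(low_res_tokens),
                        visual_tokens_used=len(low_res_tokens))
    for kind, payload in policy_steps:
        if kind == "foveate":
            new_tokens = encode_region(payload)
            if state.visual_tokens_used + len(new_tokens) > budget:
                continue  # skip regions that would exceed the visual-token budget
            state.context.extend(new_tokens)  # inject evidence, keep prior context
            state.visual_tokens_used += len(new_tokens)
        else:
            state.context.append(payload)  # ordinary text token
    return state


# Scripted stand-in for a learned policy: reason, look closer, then answer.
steps = [("text", "The"), ("text", "sign"), ("foveate", (100, 100, 156, 128)),
         ("text", "says"), ("text", "STOP")]
final = decode(steps, low_res_tokens=["<lo>"] * 4, budget=16)
```

The action is a structured tuple rather than coordinates written out as text, so nothing has to be parsed from the generated string, and the tokens produced before the action remain in `final.context` when the evidence arrives.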

Topics

Foveated Reasoning
Vision-Language Models
Visual Focusing
Autoregressive Decoding
Reinforcement Learning

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.