Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models
Summary
The VISE (Visual Invariance Self-Evolution) framework addresses "visual under-conditioning" in self-evolving large multimodal models (LMMs), a persistent failure in which the decoder relies on language priors instead of visual content. Existing unsupervised self-evolution methods, such as those using multi-role self-play, optimize for answer agreement but often neglect attention to visual tokens. VISE, proposed on 2026-06-25, operates within a single model, needs no external reward models or annotations, and trains on raw unlabeled images. It directly regularizes the model's visual conditioning policy through two invariance-based rewards: geometric invariance, which rewards spatial consistency under transformations, and semantic invariance, which penalizes evidence-agnostic generation. In experiments across 18 benchmarks, VISE with Qwen3-VL-2B gained +16.85 CIDEr on COCO and +19.66 CIDEr on TextCaps while reducing object hallucination by 5.0 CHAIR-I points.
Key takeaway
For machine learning engineers developing self-evolving LMMs, VISE's invariance-based rewards are worth evaluating. The approach targets visual under-conditioning directly by enforcing spatial and semantic consistency, with reported gains on tasks such as image captioning and visual question answering. The headline result is +16.85 CIDEr on COCO with Qwen3-VL-2B, achieved without external annotations or specialist roles; gains on other models and datasets may differ.
Key insights
Self-evolving LMMs often under-condition on visual input; VISE uses invariance rewards to enforce spatial and semantic consistency.
Principles
- LMMs can rely on language priors over visual evidence.
- Invariance-based rewards improve visual grounding.
- Spatial and semantic consistency are crucial for LMMs.
Method
VISE regularizes visual conditioning using geometric invariance (spatial consistency under transformations) and semantic invariance (penalizing evidence-agnostic generation when predicted regions are perturbed).
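The geometric invariance idea can be illustrated with a minimal sketch. The summary does not give the paper's reward formula, so everything below is an assumption made for illustration: the transformation is a horizontal flip, the model output is an axis-aligned box, and the reward is the IoU between the box predicted on the original image and the box predicted on the flipped image after mapping it back. The function names `hflip_box`, `iou`, and `geometric_invariance_reward` are hypothetical.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def hflip_box(box: Box, image_width: float) -> Box:
    """Map a box through a horizontal flip of an image with the given width."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def geometric_invariance_reward(
    box_on_original: Box, box_on_flipped: Box, image_width: float
) -> float:
    """Illustrative reward (not the paper's formula): agreement between the
    original prediction and the flipped-image prediction mapped back."""
    # A horizontal flip is its own inverse, so the same mapping undoes it.
    mapped_back = hflip_box(box_on_flipped, image_width)
    return iou(box_on_original, mapped_back)
```

A spatially consistent model should predict the mirrored box on the mirrored image and score close to 1, while a model that ignores the image and repeats the same coordinates scores low. Other transformations (crops, rotations, rescaling) would need their own inverse mapping.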
In practice
- Integrate invariance rewards into LMM training.
- Test spatial consistency with known transformations.
- Evaluate models for evidence-agnostic generation.
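One simple way to probe for evidence-agnostic generation, as suggested in the last item above, is to compare the model's output on a clean image with its output after the region it pointed to has been perturbed (for example, masked). The sketch below is an assumption-laden illustration, not the paper's semantic invariance reward: it takes the two generated strings as inputs and scores how much they differ using token-level Jaccard overlap. The names `token_jaccard` and `evidence_sensitivity` are hypothetical, and a real evaluation would likely use a stronger text similarity measure.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase whitespace-token sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty outputs are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def evidence_sensitivity(answer_clean: str, answer_perturbed: str) -> float:
    """Illustrative score in [0, 1]: 0 means the output did not change when
    the cited evidence was perturbed (evidence-agnostic), 1 means it changed
    completely."""
    return 1.0 - token_jaccard(answer_clean, answer_perturbed)


# Usage: the strings stand in for model outputs before and after masking.
unchanged = evidence_sensitivity("a red car", "a red car")
partly_changed = evidence_sensitivity("a red car", "a blue car")
fully_changed = evidence_sensitivity("a red car", "an empty street")
```

A score near 0 across many images suggests the model is generating from language priors rather than from the region it claims to rely on.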
Topics
- Large Multimodal Models
- Self-Evolving AI
- Visual Under-conditioning
- Invariance Rewards
- Image Captioning
- Visual Question Answering
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.