Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs
Summary
Visual Information Gain In aLignment (VIGIL) is a new reinforcement-learning (RL) post-training framework designed to mitigate "visual laziness" and hallucinations in Multimodal Large Language Models (MLLMs). MLLMs often encode correct visual evidence but over-rely on strong language priors, leading to responses that contradict visual inputs. VIGIL addresses this by shifting from numerical reward fitting to causal visual grounding, introducing a geometric constraint that maximizes mutual information between the visual input and the generated response. It penalizes "blind confidence": cases where the model remains unduly certain even when textual-visual attention is masked to create a counterfactual blind state. Experiments show VIGIL consistently outperforms recent alignment methods across hallucination and reasoning benchmarks, matching state-of-the-art performance with only 25% of the preference data and demonstrating emergent spatial grounding capabilities without explicit bounding box supervision.
Key takeaway
For Machine Learning Engineers developing Multimodal Large Language Models: if you are struggling with visual hallucinations or inefficient data usage, VIGIL is worth exploring. This reinforcement-learning framework improves causal visual grounding by penalizing "blind confidence", that is, responses the model remains certain of even when its textual-visual attention is masked. VIGIL matches state-of-the-art performance with only 25% of the preference data, which can streamline training, and it shows emergent spatial grounding without explicit bounding box supervision.
Key insights
VIGIL mitigates MLLM visual laziness and hallucinations by causally grounding responses to visual input, penalizing "blind confidence" via counterfactual alignment.
Principles
- MLLMs exhibit visual laziness, over-relying on language priors.
- Outcome-level reward optimization can bias MLLMs toward linguistic shortcuts.
- Maximizing mutual information between visual input and response improves grounding.
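The third principle can be made concrete with the textbook definition of conditional mutual information. The notation is an assumption made for illustration (v for the image, x for the text prompt, y for the response); the summary does not state the exact objective used in the paper.

```latex
I(V; Y \mid X) \;=\; \mathbb{E}_{(v,\,x,\,y)}\!\left[ \log \frac{p(y \mid v, x)}{p(y \mid x)} \right]
```

When a response is just as likely with the image as without it, the ratio is 1 and the term contributes nothing, which is what visual laziness looks like in this notation.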
Method
VIGIL is an RL post-training framework that uses a geometric constraint to maximize mutual information between visual input and response. It penalizes "blind confidence" by masking textual-visual attention to create a counterfactual blind state.
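The blind-confidence idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the hinge form of the penalty, and the `margin` parameter are all hypothetical, and the "blind" logits stand in for a second forward pass with textual-visual attention masked.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_log_prob(logits, token_ids):
    """Sum of log-probabilities the model assigns to the response tokens."""
    log_probs = log_softmax(logits)
    return float(log_probs[np.arange(len(token_ids)), token_ids].sum())

def visual_information_gain(sighted_logits, blind_logits, token_ids):
    """How much more likely the response is with the image than without it.

    Assumption: blind_logits come from a forward pass in which the text
    tokens cannot attend to the visual tokens (the counterfactual blind state).
    """
    return (sequence_log_prob(sighted_logits, token_ids)
            - sequence_log_prob(blind_logits, token_ids))

def blind_confidence_penalty(gain, margin=1.0):
    """Hypothetical hinge penalty: positive when the image adds too little."""
    return max(0.0, margin - gain)

if __name__ == "__main__":
    tokens = np.array([0, 1])                     # toy response of two tokens
    sighted = np.array([[2.0, 0.0, 0.0],
                        [0.0, 3.0, 0.0]])         # confident with the image
    lazy_blind = sighted.copy()                   # equally confident when blind
    grounded_blind = np.zeros((2, 3))             # uncertain when blind

    for name, blind in [("lazy", lazy_blind), ("grounded", grounded_blind)]:
        gain = visual_information_gain(sighted, blind, tokens)
        print(name, "gain:", gain, "penalty:", blind_confidence_penalty(gain))
```

In this toy setup, the "lazy" case has zero gain and therefore receives the full penalty, while the "grounded" case has positive gain and a smaller penalty. In an RL setting such a term would presumably be combined with the outcome reward, but how VIGIL does so is not described in this summary.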
In practice
- Match state-of-the-art MLLM performance with only 25% of the preference data.
- Enable emergent spatial grounding without explicit bounding box supervision.
Topics
- Multimodal LLMs
- Visual Hallucinations
- Reinforcement Learning
- Counterfactual Alignment
- Causal Visual Grounding
- Spatial Grounding
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.