MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias
Summary
Multimodal large language models (MLLMs) frequently exhibit a "late-layer textual override" bias: they favor textual input over visual evidence, even when the image clearly contradicts the text. The authors find that these models often produce correct vision-based predictions in their intermediate layers, only to shift toward text-based outputs in later layers. The phenomenon has a directional signature: 85% of prediction failures involve a shift toward text, while 89% of successes involve a shift toward vision. To address this, the authors propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that intervenes at inference time by detecting confident visual predictions that were suppressed and restoring them. Experiments across five MLLMs with diverse architectures show absolute improvements of up to 9.4% on conflict benchmarks while maintaining performance on standard benchmarks.
Key takeaway
If you are a Machine Learning Engineer deploying multimodal large language models (MLLMs) in visually grounded applications, be aware of the "late-layer textual override" bias, which compromises visual accuracy. This research indicates that MLLMs often arrive at the right visual answer internally and then discard it. CALRD, a training-free inference-time method, recovers these suppressed visual predictions, with reported absolute improvements of up to 9.4% on conflict benchmarks, and improves model reliability without costly retraining.
Key insights
MLLMs often reach correct vision-based predictions in their intermediate layers, but a "late-layer textual override" bias then shifts the final prediction toward text.
Principles
- 85% of MLLM prediction failures involve a shift toward text.
- 89% of MLLM prediction successes involve a shift toward vision.
- Intermediate layers hold correct visual predictions.
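As a rough illustration of how such a directional signature could be tallied, the sketch below labels each example by how its prediction moved between an intermediate layer and the final layer. The record fields (`intermediate_pred`, `final_pred`, `vision_answer`, `text_answer`) and both function names are hypothetical and are not taken from the paper; how the authors actually measure the shift is not described in this summary.

```python
from collections import Counter


def shift_direction(intermediate_pred, final_pred, vision_answer, text_answer):
    """Label how a prediction moved from an intermediate layer to the final layer.

    All four arguments are plain answer strings; the labels are illustrative.
    """
    if intermediate_pred == final_pred:
        return "no_shift"
    if final_pred == text_answer:
        return "toward_text"
    if final_pred == vision_answer:
        return "toward_vision"
    return "other"


def tally_shifts(records):
    """Count shift directions over a list of per-example records (dicts)."""
    return Counter(shift_direction(**record) for record in records)


# Toy records: the image shows a cat, the conflicting text says "dog".
records = [
    {"intermediate_pred": "cat", "final_pred": "dog",
     "vision_answer": "cat", "text_answer": "dog"},
    {"intermediate_pred": "dog", "final_pred": "cat",
     "vision_answer": "cat", "text_answer": "dog"},
    {"intermediate_pred": "cat", "final_pred": "cat",
     "vision_answer": "cat", "text_answer": "dog"},
]
counts = tally_shifts(records)
```

Splitting the tally by whether the final answer was correct would give the kind of per-outcome percentages quoted above.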
Method
CALRD (Conflict-Aware Layer Reference Decoding) is a training-free inference method. It detects confident visual predictions suppressed in late layers and restores them, recovering what the model already knew.
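The general idea of detecting and restoring a suppressed prediction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes per-layer logits over the vocabulary are already available (for example, by projecting intermediate hidden states through the output head), and it treats any confident intermediate prediction that disagrees with the final layer as the candidate to restore. The function name, the `reference_layers` argument, and the `confidence_threshold` value are all hypothetical; the summary does not say how CALRD selects layers, detects conflict, or identifies a prediction as visual.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def restore_suppressed_prediction(layer_logits, reference_layers,
                                  confidence_threshold=0.8):
    """Return (token_id, source_layer) for the next token.

    layer_logits: list of per-layer logit vectors; the last entry is the
        final layer.
    reference_layers: indices of intermediate layers to inspect
        (hypothetical choice).
    confidence_threshold: minimum probability an intermediate prediction
        needs before it can replace the final one (hypothetical value).
    """
    final_layer = len(layer_logits) - 1
    final_probs = softmax(layer_logits[final_layer])
    final_token = max(range(len(final_probs)), key=final_probs.__getitem__)

    best_token, best_conf, best_layer = final_token, 0.0, final_layer
    for layer in reference_layers:
        probs = softmax(layer_logits[layer])
        token = max(range(len(probs)), key=probs.__getitem__)
        confident = probs[token] >= confidence_threshold
        # Keep the most confident intermediate prediction that the final
        # layer did not reproduce.
        if token != final_token and confident and probs[token] > best_conf:
            best_token, best_conf, best_layer = token, probs[token], layer
    return best_token, best_layer


# Toy example with a 3-token vocabulary and 3 layers: layer 1 is confident
# in token 0, but the final layer prefers token 1.
toy_logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
restored = restore_suppressed_prediction(toy_logits, reference_layers=[1])
```

When no inspected layer clears the threshold, the function falls back to the final-layer token, so the sketch changes the output only where an intermediate layer held a confident, different answer.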
In practice
- Apply CALRD at inference time to MLLMs whose outputs must stay grounded in the image.
- Use it to mitigate textual bias in MLLM applications where text and image can conflict.
Topics
- Multimodal Large Language Models
- Textual Bias
- Visual Grounding
- CALRD
- Inference Optimization
- Bias Correction
Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.