MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition) · Depth: Expert, quick

Summary

Multimodal large language models (MLLMs) frequently exhibit a "late-layer textual override" bias: when the text input conflicts with the image, they follow the text even if the visual evidence is clear. The researchers found that these models often produce the correct, vision-based prediction in their intermediate layers and then shift to the text-based answer in later layers. The shift has a directional signature: 85% of prediction failures involve a shift towards text, while 89% of successes involve a shift towards vision. To address this, the authors propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that intervenes at inference time by detecting confident visual predictions that were suppressed and restoring them. Across five MLLMs with diverse architectures, CALRD yields absolute improvements of up to 9.4% on conflict benchmarks while preserving performance on standard benchmarks.

Key takeaway

If you are a Machine Learning Engineer deploying multimodal large language models (MLLMs) in visually grounded applications, be aware of the "late-layer textual override" bias, which compromises visual accuracy. This research indicates that MLLMs often reach the correct visual answer internally and then abandon it in later layers. CALRD, a training-free inference-time method, recovers these suppressed visual predictions, achieving up to 9.4% absolute improvement on conflict benchmarks and improving reliability without costly retraining.

Key insights

MLLMs often reach the correct, vision-based prediction in intermediate layers, but a "late-layer textual override" bias then shifts the final prediction towards the conflicting text.
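
One way to picture the directional signature is to compare how strongly a model prefers the vision-supported answer over the text-supported answer at an intermediate layer and at the final layer. The sketch below is an illustration only, not the paper's measurement procedure: the function name shift_direction, the per-layer logits, the choice of reference layer, and the margin-based definition of a "shift" are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shift_direction(layer_logits, vision_idx, text_idx, ref_layer):
    """Label how the vision-vs-text preference moves from ref_layer to the final layer.

    layer_logits: one list of candidate-answer logits per layer (hypothetical,
                  e.g. obtained by projecting each layer's hidden state).
    vision_idx / text_idx: indices of the vision-supported and text-supported answers.
    ref_layer: index of the intermediate layer used as reference (an assumption).
    """
    def margin(logits):
        probs = softmax(logits)
        return probs[vision_idx] - probs[text_idx]

    delta = margin(layer_logits[-1]) - margin(layer_logits[ref_layer])
    if delta < 0:
        return "toward_text"
    if delta > 0:
        return "toward_vision"
    return "no_shift"

# Toy trace: the middle layer favours the vision answer (index 0),
# the final layer favours the text answer (index 1).
trace = [[0.1, 0.0], [2.0, 0.5], [0.5, 2.5]]
print(shift_direction(trace, vision_idx=0, text_idx=1, ref_layer=1))
```

In this toy trace the preference margin drops between the reference layer and the final layer, which is the pattern the summary associates with most failures.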

Method

CALRD (Conflict-Aware Layer Reference Decoding) is a training-free inference method. It detects confident visual predictions suppressed in late layers and restores them, recovering what the model already knew.
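
The detect-and-restore idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published CALRD algorithm: the summary does not specify how conflicts are detected or which layers serve as references, so the function name restore_suppressed_prediction, the candidate_layers argument, and the confidence_threshold value are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def restore_suppressed_prediction(layer_logits, candidate_layers,
                                  confidence_threshold=0.8):
    """Return (token, source_layer); source_layer is None if the final output is kept.

    layer_logits: one list of token logits per layer (hypothetical input).
    candidate_layers: intermediate layers treated as references (an assumption).
    confidence_threshold: minimum probability for a reference prediction to
                          count as "confident" (illustrative value).
    """
    final_token = argmax(softmax(layer_logits[-1]))
    best_token, best_layer, best_conf = final_token, None, 0.0

    for layer in candidate_layers:
        probs = softmax(layer_logits[layer])
        token = argmax(probs)
        conf = probs[token]
        # A confident intermediate prediction that disagrees with the final
        # output is treated here as "suppressed" and becomes a restore candidate.
        if token != final_token and conf >= confidence_threshold and conf > best_conf:
            best_token, best_layer, best_conf = token, layer, conf

    return best_token, best_layer

# Toy example: layer 1 is confident in token 0, the final layer picks token 1.
logits = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [1.0, 2.0, 0.0]]
print(restore_suppressed_prediction(logits, candidate_layers=[1]))
```

Because the rule only reads per-layer logits at inference time, it needs no gradient updates, which matches the training-free framing. A real implementation would also need a way to confirm that the restored prediction is the visually grounded one, which this sketch does not attempt.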

Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.