MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias

2026-06-16 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

Multimodal large language models (MLLMs) frequently exhibit a "late-layer textual override" bias, where they prioritize textual input over conflicting visual evidence, even when images provide clear contradictory information. Researchers discovered that these models often generate correct vision-based predictions in their intermediate layers, only to subsequently shift towards text-based outputs. This phenomenon is characterized by a directional signature: 85% of prediction failures involve a shift towards text, while 89% of successes show a shift towards vision. To address this, a training-free method called CALRD (Conflict-Aware Layer Reference Decoding) is proposed. CALRD intervenes at inference time by detecting and restoring confident visual predictions that were suppressed. Experiments across five MLLMs with diverse architectures demonstrated absolute improvements of up to 9.4% on conflict benchmarks, while maintaining standard performance.

Key takeaway

If you are a machine learning engineer deploying multimodal large language models (MLLMs) in visually grounded applications, be aware of the "late-layer textual override" bias, which compromises visual accuracy. This research indicates that MLLMs often reach the right visual answer internally before overriding it with text. CALRD, a training-free inference-time method, recovers these suppressed visual predictions. The authors report absolute improvements of up to 9.4% on conflict benchmarks, which improves model reliability without costly retraining.

Key insights

MLLMs often produce correct vision-based predictions in their intermediate layers, but a "late-layer textual override" bias then shifts the final prediction towards the text.

Principles

85% of MLLM prediction failures involve a shift toward text.
89% of MLLM prediction successes involve a shift toward vision.
Intermediate layers often hold the correct visual prediction.
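
The directional signature above can be illustrated with a small sketch. The summary does not say how the paper measures a "shift", so this assumes one plausible definition: the change, between an intermediate layer and the final layer, in the probability margin between the vision-consistent answer and the text-consistent answer. The function names, the dictionary layout, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
def shift_direction(mid, final):
    """Classify how the vision-vs-text margin moves from an intermediate
    layer to the final layer. ASSUMPTION: 'shift' is defined as the change
    in P(vision answer) - P(text answer); the paper may define it differently.
    """
    mid_margin = mid["vision"] - mid["text"]
    final_margin = final["vision"] - final["text"]
    if final_margin > mid_margin:
        return "vision"
    if final_margin < mid_margin:
        return "text"
    return "none"


def directional_signature(examples):
    """Return the share of failures that shift toward text and the share of
    successes that shift toward vision. Each example is a dict with keys
    'mid', 'final' (probability dicts) and 'correct' (bool).
    """
    def rate(group, direction):
        if not group:
            return None  # no examples in this group, so no rate to report
        hits = sum(
            1 for e in group
            if shift_direction(e["mid"], e["final"]) == direction
        )
        return hits / len(group)

    failures = [e for e in examples if not e["correct"]]
    successes = [e for e in examples if e["correct"]]
    return {
        "failures_toward_text": rate(failures, "text"),
        "successes_toward_vision": rate(successes, "vision"),
    }


# Toy values for illustration only; these are not results from the paper.
toy_examples = [
    {"mid": {"vision": 0.7, "text": 0.2},
     "final": {"vision": 0.3, "text": 0.6}, "correct": False},
    {"mid": {"vision": 0.4, "text": 0.5},
     "final": {"vision": 0.45, "text": 0.5}, "correct": False},
    {"mid": {"vision": 0.5, "text": 0.4},
     "final": {"vision": 0.8, "text": 0.1}, "correct": True},
]
signature = directional_signature(toy_examples)
```

Run over a real conflict benchmark, a tally of this kind is what would yield figures comparable to the 85% and 89% reported above, provided the shift definition matches the paper's.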

Method

CALRD (Conflict-Aware Layer Reference Decoding) is a training-free inference method. It detects confident visual predictions suppressed in late layers and restores them, recovering what the model already knew.
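
The detect-and-restore idea can be sketched roughly as follows. This is not the authors' CALRD implementation: the summary does not specify how conflicts are detected, which layers serve as references, or how confidence is scored. The sketch assumes that per-layer logits over the vocabulary are already available (for example from projecting intermediate hidden states through the output head), and the function name, the candidate-layer list, and the 0.8 confidence threshold are hypothetical. As a simplification, it treats any confident intermediate prediction that disagrees with the final one as the suppressed candidate, without checking that it is vision-based.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def restore_suppressed_prediction(layer_logits, candidate_layers,
                                  confidence_threshold=0.8):
    """Pick the output token, preferring a confident intermediate-layer
    prediction that the final layer overrode.

    layer_logits: list of per-layer logit vectors; the last entry is the
        final layer.
    candidate_layers: indices of intermediate layers to inspect
        (ASSUMPTION: chosen by the caller; the paper's selection rule
        is not described in this summary).
    Returns (token_id, reference_layer), where reference_layer is None
    when the final prediction is kept.
    """
    final_probs = softmax(layer_logits[-1])
    final_token = max(range(len(final_probs)), key=final_probs.__getitem__)

    best_token, best_layer, best_conf = final_token, None, 0.0
    for idx in candidate_layers:
        probs = softmax(layer_logits[idx])
        token = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[token]
        # A conflict: the intermediate layer is confident in a token that
        # the final layer did not choose.
        if (token != final_token and conf >= confidence_threshold
                and conf > best_conf):
            best_token, best_layer, best_conf = token, idx, conf
    return best_token, best_layer


# Toy three-token vocabulary: layer 1 is confident in token 0,
# but the final layer prefers token 1.
toy_logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [1.0, 3.0, 0.0]]
restored = restore_suppressed_prediction(toy_logits, candidate_layers=[1])
```

Because the check runs only at decoding time and reuses activations the model already computes, a scheme of this shape needs no training, which matches the training-free property described above.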

In practice

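Apply CALRD at inference time to MLLMs in visually grounded applications, where textual input may conflict with the image.
Use it to mitigate textual bias in deployed MLLM applications without retraining.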

Topics

Multimodal Large Language Models
Textual Bias
Visual Grounding
CALRD
Inference Optimization
Bias Correction

Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.