The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

Multimodal language models (MLMs) consistently underperform on visual perception tasks, and the cause is not fully understood. To investigate modal dependence, the researchers introduce "centroid replacement," a method that collapses each token to its nearest K-means centroid. Across seven models from three architecture families, the technique reveals a consistent imbalance: erasing text centroid structure causes a 4x greater accuracy loss than erasing visual centroid structure. Language representations therefore dominate vision, even in tasks that require strong visual reasoning. Exploiting this asymmetry, the authors develop text centroid contrastive decoding, which improves accuracy by up to +16.9% on specific tasks. The size of the gain depends on how the model was trained: +5.6% on average for standard fine-tuned models versus +1.5% for preference-optimized models.

Key takeaway

AI engineers and research scientists who develop or deploy multimodal language models need to understand and mitigate modal competition. Because language representations overshadow vision, consider inference-time interventions such as text centroid contrastive decoding to recover accuracy on visual perception tasks, especially with standard fine-tuned models. The approach improves model performance without costly retraining.

Key insights

Language representations in MLMs systematically overshadow vision, causing underperformance in visual tasks.

Method

Centroid replacement collapses each token to its nearest K-means centroid to probe modal dependence. Text centroid contrastive decoding contrasts the model's output with a text-centroid-erased reference to recover accuracy.
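
The two ideas can be sketched in dependency-free Python. This is an illustration under stated assumptions, not the authors' implementation: the function names, the toy two-dimensional token vectors, the weight alpha, and the contrast formula (1 + alpha) * full - alpha * reference are all hypothetical choices borrowed from common contrastive decoding practice, since this summary does not give the paper's exact formulation.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vector, centroids):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's K-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest_index(v, centroids)].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def centroid_replace(vectors, centroids):
    """Collapse every token vector to its nearest centroid."""
    return [list(centroids[nearest_index(v, centroids)]) for v in vectors]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def contrastive_scores(full_logits, reference_logits, alpha=1.0):
    """Assumed contrast: (1 + alpha) * full - alpha * reference, in log space.

    `reference_logits` stands in for a forward pass whose text tokens were
    centroid-replaced; alpha = 0 reduces to ordinary decoding.
    """
    full = log_softmax(full_logits)
    reference = log_softmax(reference_logits)
    return [(1 + alpha) * f - alpha * r for f, r in zip(full, reference)]


# Toy text-token vectors (hypothetical data, two dimensions for readability).
text_tokens = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
centroids = kmeans(text_tokens, k=2)
erased_tokens = centroid_replace(text_tokens, centroids)

# Toy next-token logits from the full pass and the text-erased reference pass.
full_logits = [2.0, 1.8, 0.0]
reference_logits = [2.0, 0.5, 0.0]
scores = contrastive_scores(full_logits, reference_logits, alpha=1.0)
best_token = scores.index(max(scores))
```

After replacement, each token carries only its cluster identity, so any accuracy drop measures how much the model relied on the within-cluster detail of that modality. In the decoding step, a candidate that the reference pass supports as strongly as the full pass is downweighted, while a candidate that depends on the intact input is promoted.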

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.