The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
Summary
Multimodal language models (MLMs) consistently underperform on visual perception tasks, for reasons that are not fully understood. To investigate modal dependence, the researchers introduced "centroid replacement," a method that collapses each token to its nearest K-means centroid. Applied to seven models from three architecture families, the technique revealed a universal imbalance: erasing text centroid structure caused 4x more accuracy loss than erasing visual centroid structure. This indicates that language representations dominate vision, even in tasks that require strong visual reasoning. Exploiting this asymmetry, the researchers developed text centroid contrastive decoding, which improved accuracy by up to 16.9% on specific tasks. The size of the gain depended on how the model was trained: standard fine-tuned models gained 5.6% on average, while preference-optimized models gained 1.5%.
Key takeaway
For AI engineers and research scientists who develop or deploy multimodal language models, understanding and mitigating modal competition matters. Because language representations overshadow vision, inference-time interventions such as text centroid contrastive decoding are worth considering as a way to recover accuracy on visual perception tasks, especially for standard fine-tuned models. The approach offers a path to better performance without costly retraining.
Key insights
Language representations in MLMs systematically overshadow vision, causing underperformance in visual tasks.
Principles
- Modal competition is structurally localized.
- Asymmetry can be exploited at inference time.
Method
Centroid replacement collapses tokens to K-means centroids to probe modal dependence. Text centroid contrastive decoding uses a text-centroid-erased reference to recover accuracy.
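As a rough illustration of centroid replacement, the sketch below fits K-means on a set of token vectors and then swaps each vector for its nearest centroid. It is a minimal sketch under assumptions, not the authors' implementation: the summary does not say which layer the tokens come from, how many centroids are used, or how K-means is fitted, so the random `tokens` array, `k=8`, and the plain Lloyd's loop are all illustrative choices.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's algorithm (illustrative); returns a (k, d) centroid array."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid: (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def centroid_replace(tokens, centroids):
    """Collapse each token vector to its nearest centroid."""
    d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids[d2.argmin(axis=1)]

# Hypothetical stand-in for one modality's token embeddings: 200 tokens, 16 dims.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(200, 16))

centroids = kmeans(tokens, k=8)
erased = centroid_replace(tokens, centroids)

# Same shape as the input, but at most 8 distinct vectors remain.
print(erased.shape, len(np.unique(erased, axis=0)))
```

After replacement, the sequence keeps its length and shape, but every token carries only its cluster's centroid. Applying this to the text tokens or to the visual tokens separately is what would let the two accuracy drops be compared.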
In practice
- Apply contrastive decoding for visual task improvement.
- Diagnose modal competition in new MLM architectures.
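The contrastive decoding step can be sketched for a single next-token decision. The summary does not give the scoring formula, so this assumes the commonly used contrastive-decoding form, (1 + alpha) * log p_full - alpha * log p_reference, where the reference distribution comes from a forward pass with text centroids erased. The function names, the `alpha` weight, the sign convention, and the toy logits are all assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def contrastive_scores(full_logits, erased_logits, alpha=1.0):
    """Assumed form: (1 + alpha) * log p_full - alpha * log p_erased.

    alpha = 0 reduces to ordinary decoding on the full model.
    """
    full = log_softmax(full_logits)
    ref = log_softmax(erased_logits)
    return [(1 + alpha) * f - alpha * r for f, r in zip(full, ref)]

# Toy logits over a 3-token vocabulary (made-up numbers, not from the paper).
full_logits = [2.0, 1.9, 0.1]    # normal forward pass
erased_logits = [2.0, 0.5, 0.1]  # hypothetical pass with text centroids erased

scores = contrastive_scores(full_logits, erased_logits, alpha=1.0)
greedy = max(range(len(full_logits)), key=full_logits.__getitem__)
choice = max(range(len(scores)), key=scores.__getitem__)

# Greedy decoding picks token 0; the contrastive score picks token 1,
# whose probability changes most between the two passes.
print(greedy, choice)
```

In a real pipeline the two logit vectors would come from two forward passes per decoding step, so this kind of intervention roughly doubles inference cost in exchange for avoiding retraining.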
Topics
- Multimodal Language Models
- Centroid Erasure
- Modal Competition
- Visual Perception Tasks
- Text Centroid Contrastive Decoding
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.