LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
Summary
Local Modality Substitution (LoMo) is a lightweight, architecture-agnostic data curation paradigm designed to address the "carrier sensitivity" issue in Vision-Language Models (VLMs). Carrier sensitivity is the significant performance degradation that occurs when textual questions are replaced by their rendered-image counterparts. It stems from an inherent bias in training corpora, where text and images typically serve distinct roles. LoMo tackles this by reformulating single-modality prompts into interleaved multimodal sequences: it dynamically selects text spans and recasts them as rendered images, preserving semantic equivalence across "text, visual, text" carriers. Experiments across 13 diverse multimodal benchmarks show that LoMo consistently improves multimodal reasoning and cross-modal fusion, delivering gains of 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B over standard SFT.
Key takeaway
For machine learning engineers developing robust Vision-Language Models, LoMo offers a lightweight, architecture-agnostic way to mitigate "carrier sensitivity" at the data curation stage. Consider integrating it into your VLM training pipelines: the reported results show deeper cross-modal fusion and consistent gains over standard SFT on LLaVA-OneVision-1.5-8B and Qwen3.5-9B, along with more robust reasoning under modality substitution.
Key insights
LoMo addresses VLM "carrier sensitivity" by creating semantically equivalent text and image representations through local modality substitution.
Principles
- VLMs exhibit "carrier sensitivity" due to data bias in training corpora.
- Cross-modal representational invariance is crucial for robust VLM reasoning.
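One way to make "carrier sensitivity" concrete is to score the same questions twice, once as plain text and once as rendered images, and compare accuracy. The sketch below is illustrative only: the function names `accuracy` and `carrier_gap`, and the choice of a simple accuracy difference, are assumptions made here and are not taken from the original article.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        raise ValueError("need at least one example")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def carrier_gap(
    text_preds: Sequence[str],
    image_preds: Sequence[str],
    golds: Sequence[str],
) -> float:
    """Accuracy with text-carried questions minus accuracy with
    image-carried questions (hypothetical metric for illustration).

    A positive value means the model does worse when the same question
    is delivered as a rendered image.
    """
    return accuracy(text_preds, golds) - accuracy(image_preds, golds)


if __name__ == "__main__":
    golds = ["A", "B", "C", "D"]
    text_preds = ["A", "B", "C", "A"]   # 3 of 4 correct
    image_preds = ["A", "B", "A", "A"]  # 2 of 4 correct
    print(carrier_gap(text_preds, image_preds, golds))
```

A model with perfect cross-modal invariance would have a gap of zero under this measure, since the carrier of the question would not affect its answers.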
Method
LoMo reformulates single-modality prompts into interleaved multimodal sequences: it dynamically selects text spans and recasts them as rendered images, so that the content stays semantically equivalent across modalities.
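The substitution step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `TextSegment`, `ImageSegment`, `placeholder_render`, `substitute_span`, and `recover_text` are hypothetical, the span is chosen uniformly at random (the article does not say how spans are selected), and the renderer is a stand-in that only records the source text. A real pipeline would rasterize the span into pixels with an imaging library.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    # Stand-in for a rendered image. It keeps the source text so that
    # semantic equivalence with the original prompt can be checked.
    source_text: str


Segment = Union[TextSegment, ImageSegment]


def placeholder_render(text: str) -> ImageSegment:
    """Hypothetical renderer: a real one would draw `text` onto an image."""
    return ImageSegment(source_text=text)


def substitute_span(
    prompt: str,
    rng: random.Random,
    min_words: int = 2,
    max_words: int = 6,
    render: Callable[[str], ImageSegment] = placeholder_render,
) -> List[Segment]:
    """Replace one contiguous word span of `prompt` with a rendered image,
    returning an interleaved text / image / text sequence."""
    words = prompt.split()
    if len(words) < min_words:
        return [TextSegment(prompt)]  # too short to substitute

    # Assumption: uniform random span length and position.
    span_len = rng.randint(min_words, min(max_words, len(words)))
    start = rng.randint(0, len(words) - span_len)
    end = start + span_len

    segments: List[Segment] = []
    if start > 0:
        segments.append(TextSegment(" ".join(words[:start])))
    segments.append(render(" ".join(words[start:end])))
    if end < len(words):
        segments.append(TextSegment(" ".join(words[end:])))
    return segments


def recover_text(segments: List[Segment]) -> str:
    """Rebuild the prompt text from all segments, whatever their carrier."""
    return " ".join(
        s.text if isinstance(s, TextSegment) else s.source_text
        for s in segments
    )


if __name__ == "__main__":
    prompt = "What is the area of a circle with radius three"
    segments = substitute_span(prompt, random.Random(0))
    for seg in segments:
        print(seg)
    assert recover_text(segments) == prompt
```

Keeping the source text alongside each image segment makes it easy to verify that the interleaved sequence still carries the same content as the original prompt, which is the equivalence property the method relies on.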
In practice
- Apply LoMo to improve VLM performance on diverse multimodal benchmarks.
- Use LoMo with foundational models like LLaVA-OneVision-1.5-8B and Qwen3.5-9B.
Topics
- Vision-Language Models
- Modality Substitution
- Cross-modal Fusion
- Data Curation
- Carrier Sensitivity
- LLaVA-OneVision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.