LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Local Modality Substitution (LoMo) is a lightweight, architecture-agnostic data curation paradigm that targets "carrier sensitivity" in Vision-Language Models (VLMs): a significant drop in performance when a textual question is replaced by its rendered-image counterpart. The issue stems from a bias in training corpora, where text and images typically serve distinct roles. LoMo reformulates single-modality prompts into interleaved multimodal sequences by dynamically selecting text spans and recasting them as rendered images, so that meaning is preserved across the "text, visual, text" carriers. Experiments on 13 multimodal benchmarks show that LoMo consistently improves multimodal reasoning and cross-modal fusion, with gains of 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B over standard SFT.
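
To make "carrier sensitivity" concrete, one simple way to quantify it is the accuracy gap between two runs over the same questions, once posed as text and once as rendered images. This is an illustrative assumption, not the paper's evaluation protocol; the function names `accuracy` and `carrier_gap` are hypothetical.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of items answered correctly."""
    if not correct:
        raise ValueError("need at least one item")
    return sum(correct) / len(correct)


def carrier_gap(text_correct: Sequence[bool], image_correct: Sequence[bool]) -> float:
    """Accuracy drop, in points, when questions move from text to rendered images.

    Assumption: both sequences hold per-item correctness flags for the same
    questions in the same order, differing only in the carrier of the question.
    """
    if len(text_correct) != len(image_correct):
        raise ValueError("both runs must cover the same items")
    return 100.0 * (accuracy(text_correct) - accuracy(image_correct))
```

A gap near zero would suggest the model reads both carriers equally well; a large positive gap is the degradation LoMo is meant to reduce.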

Key takeaway

For machine learning engineers building robust Vision-Language Models, LoMo offers a lightweight, architecture-agnostic data curation step that mitigates "carrier sensitivity." Consider adding it to your VLM training pipeline: it works on the training data rather than the model architecture, and it delivered consistent gains over standard SFT on LLaVA-OneVision-1.5-8B and Qwen3.5-9B. The expected benefit is deeper cross-modal fusion and more stable reasoning when a question's text is swapped for a rendered image.

Key insights

LoMo addresses VLM "carrier sensitivity" by creating semantically equivalent text and image representations through local modality substitution.

Principles

Method

LoMo reformulates single-modality prompts into interleaved multimodal sequences, dynamically selecting text spans and recasting them as rendered images to ensure semantic equivalence across modalities.
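
The substitution step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the span is chosen uniformly at random over word boundaries (the source says only that spans are selected "dynamically"), and `RenderedSpan` and `placeholder_render` are hypothetical stand-ins for an actual text-to-image renderer.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class RenderedSpan:
    """Stand-in for an image of rendered text (hypothetical type)."""
    source_text: str  # the text the image would display


Segment = Union[str, RenderedSpan]


def placeholder_render(text: str) -> RenderedSpan:
    # Assumption: a real pipeline would rasterize `text` to pixels here.
    return RenderedSpan(source_text=text)


def local_modality_substitution(
    prompt: str,
    rng: random.Random,
    render: Callable[[str], RenderedSpan] = placeholder_render,
) -> List[Segment]:
    """Split `prompt` into text / rendered-image / text segments."""
    words = prompt.split()
    if len(words) < 3:
        return [prompt]  # too short to leave text on both sides
    # Pick a contiguous span that leaves at least one word on each side.
    start = rng.randint(1, len(words) - 2)
    end = rng.randint(start + 1, len(words) - 1)  # exclusive end index
    before = " ".join(words[:start])
    span = " ".join(words[start:end])
    after = " ".join(words[end:])
    return [before, render(span), after]


def recover_text(segments: List[Segment]) -> str:
    """Read the sequence back as plain text to check equivalence."""
    parts = [s.source_text if isinstance(s, RenderedSpan) else s for s in segments]
    return " ".join(parts)
```

For a whitespace-normalized prompt, `recover_text(local_modality_substitution(prompt, rng))` returns the original prompt, which is the semantic-equivalence property in miniature: only the carrier of the middle span changes, not its content.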

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.