LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Local Modality Substitution (LoMo) is a lightweight, architecture-agnostic data curation paradigm designed to address the "carrier sensitivity" issue in Vision-Language Models (VLMs): performance degrades significantly when a textual question is replaced by its rendered-image counterpart. The issue stems from an inherent bias in training corpora, where text and images typically serve distinct roles. LoMo tackles it by reformulating single-modality prompts into interleaved multimodal sequences, dynamically selecting text spans and recasting them as rendered images. The substitution preserves semantic equivalence across the "text, visual, text" carriers. Experiments across 13 diverse multimodal benchmarks show that LoMo consistently improves multimodal reasoning and cross-modal fusion, with gains over standard SFT of 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.

Key takeaway

For machine learning engineers developing robust Vision-Language Models, LoMo offers a lightweight, architecture-agnostic way to mitigate "carrier sensitivity" through data curation. Consider integrating it into your VLM training pipeline: the reported gains over standard SFT on LLaVA-OneVision-1.5-8B and Qwen3.5-9B suggest deeper cross-modal fusion and stronger reasoning under modality substitution.

Key insights

LoMo addresses VLM "carrier sensitivity" by creating semantically equivalent text and image representations through local modality substitution.

Principles

VLMs exhibit "carrier sensitivity" because text and images typically serve distinct roles in training corpora.
Cross-modal representational invariance is crucial for robust VLM reasoning.
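
One simple way to make "carrier sensitivity" concrete is to score the same questions twice, once as plain text and once as rendered images, and compare accuracy. The sketch below is an illustration only: the function names and the gap definition are assumptions made for this digest, not the paper's evaluation protocol, and the sample values are toy data rather than reported results.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of examples answered correctly."""
    if not correct:
        raise ValueError("need at least one example")
    return sum(correct) / len(correct)


def carrier_gap(text_correct: Sequence[bool], image_correct: Sequence[bool]) -> float:
    """Accuracy on text-carrier questions minus accuracy on the same
    questions rendered as images, in percentage points.

    Hypothetical metric: a larger positive gap means the model is more
    sensitive to the carrier of the question.
    """
    if len(text_correct) != len(image_correct):
        raise ValueError("both runs must cover the same questions")
    return 100.0 * (accuracy(text_correct) - accuracy(image_correct))


# Toy data for illustration only (not results from the paper).
text_run = [True, True, True, False]
image_run = [True, False, True, False]
gap = carrier_gap(text_run, image_run)
```

A gap near zero would indicate that the model treats both carriers of the same content alike, which is the invariance the principle above describes.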

Method

LoMo reformulates single-modality prompts into interleaved multimodal sequences, dynamically selecting text spans and recasting them as rendered images to ensure semantic equivalence across modalities.
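
The substitution step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the uniform random span selection, the word-count bounds, and the names `RenderedSpan`, `placeholder_render`, `substitute_local_span`, and `recover_text` are all hypothetical. The renderer is a stand-in that only records the source text; a real pipeline would rasterize the span into an actual image.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class RenderedSpan:
    """Placeholder for an image that depicts `source_text` (hypothetical type)."""
    source_text: str


Segment = Union[str, RenderedSpan]


def placeholder_render(text: str) -> RenderedSpan:
    # Assumption: a real pipeline would draw `text` into pixels here.
    return RenderedSpan(source_text=text)


def substitute_local_span(
    prompt: str,
    rng: random.Random,
    min_words: int = 2,
    max_words: int = 6,
    render: Callable[[str], RenderedSpan] = placeholder_render,
) -> List[Segment]:
    """Split `prompt` into text / rendered span / text segments."""
    words = prompt.split()
    if len(words) < min_words:
        return [prompt]  # too short to substitute; keep as plain text
    # Pick a contiguous span of words uniformly at random (an assumption).
    span_len = rng.randint(min_words, min(max_words, len(words)))
    start = rng.randint(0, len(words) - span_len)
    end = start + span_len
    segments: List[Segment] = []
    if start > 0:
        segments.append(" ".join(words[:start]))
    segments.append(render(" ".join(words[start:end])))
    if end < len(words):
        segments.append(" ".join(words[end:]))
    return segments


def recover_text(segments: List[Segment]) -> str:
    """Rebuild the prompt, reading each rendered span back as its source text."""
    return " ".join(
        s.source_text if isinstance(s, RenderedSpan) else s for s in segments
    )


rng = random.Random(0)
prompt = "What is the area of the shaded triangle in the figure"
segments = substitute_local_span(prompt, rng)
```

Here `recover_text(segments)` returns the original prompt, which is a cheap check that the interleaved sequence carries the same content as the text-only version. Leading or trailing text segments are omitted when the selected span touches the start or end of the prompt.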

In practice

Apply LoMo when curating SFT data to improve VLM performance across diverse multimodal benchmarks.
Use LoMo with base models such as LLaVA-OneVision-1.5-8B and Qwen3.5-9B.

Topics

Vision-Language Models
Modality Substitution
Cross-modal Fusion
Data Curation
Carrier Sensitivity
LLaVA-OneVision

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.