MAny: Merge Anything for Multimodal Continual Instruction Tuning

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

MAny is a training-free framework for mitigating catastrophic forgetting in Multimodal Large Language Models (MLLMs) during continual instruction tuning. It identifies a "dual-forgetting" phenomenon: perception drift in the cross-modal projection space and reasoning collapse in the low-rank parameter space. To counter both, MAny uses two components, Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). CPM adaptively merges task-specific visual representations using visual-prototype guidance to restore perceptual alignment. LPM recursively consolidates task-specific LoRA weights with a Recursive Least Squares algorithm, which gives a closed-form solution that minimizes interference and keeps reasoning stable. Evaluated on the UCIT and MLLM-DCL benchmarks, MAny achieves state-of-the-art performance, with up to an 8.57% increase in Final Average Accuracy on LLaVA-1.5-7B and a significant reduction in forgetting.

Key takeaway

For research scientists and MLLM engineers building continually adapting models, MAny offers a training-free way to mitigate catastrophic forgetting. Its dual-track merging design addresses perceptual and reasoning degradation separately, which improves model stability and accuracy on sequential tasks. Consider integrating MAny's CPM and LPM modules when your MLLMs must learn new tasks without losing previously acquired knowledge, especially in resource-constrained environments where GPU-based retraining is impractical.

Key insights

Catastrophic forgetting in MLLMs stems from degradation in both perception and reasoning, and both can be addressed through training-free merging.

Principles

Dual-forgetting impacts MLLMs in perception and reasoning.
Model merging can consolidate knowledge without retraining.
Recursive Least Squares gives a closed-form, recursive way to fuse task-specific LoRA weights.
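
To make the Recursive Least Squares principle concrete, here is a minimal sketch of the textbook algorithm in NumPy. It is not MAny's LPM formulation, which this summary does not spell out; the function names, the `ridge` parameter, and the toy regression setup are assumptions for illustration. The point it shows is that processing samples one at a time yields the same closed-form solution as a single batch ridge regression, which is what makes recursive consolidation possible without revisiting old data.

```python
import numpy as np

def rls_fit(X, Y, ridge=1.0):
    """Textbook recursive least squares over the rows of X (inputs) and Y (targets)."""
    d = X.shape[1]
    W = np.zeros((d, Y.shape[1]))   # running estimate of the weights
    P = np.eye(d) / ridge           # running inverse of (ridge * I + sum of x x^T)
    for x, y in zip(X, Y):
        x = x[:, None]                        # column vector, shape (d, 1)
        Px = P @ x
        gain = Px / (1.0 + x.T @ Px)          # Sherman-Morrison gain, shape (d, 1)
        W = W + gain @ (y[None, :] - x.T @ W) # correct W by the prediction error
        P = P - gain @ Px.T                   # rank-one update of the inverse
    return W

def ridge_fit(X, Y, ridge=1.0):
    """Batch closed-form ridge regression, used here only as a reference."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)
```

Because each step only needs the current `W` and `P`, earlier samples never have to be stored, which is the property an exemplar-free merging scheme relies on.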

Method

MAny decouples perception and reasoning, using CPM for adaptive cross-modal visual feature merging via prototypes, and LPM for recursive, conflict-minimizing low-rank parameter consolidation via Recursive Least Squares.
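
The prototype-guided merging described for CPM can be sketched roughly as follows. This is an illustrative guess at the mechanism, not the paper's implementation: the cosine-similarity scoring, the softmax with a `temperature` parameter, the linear projectors, and all function names are assumptions. Each task keeps a visual prototype (for example, a mean visual feature), and an incoming feature is projected by every task-specific projector, then mixed according to how close it is to each prototype.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def prototype_weights(feature, prototypes, temperature=0.1):
    """Cosine similarity of one visual feature to each task prototype, as softmax weights."""
    f = feature / np.linalg.norm(feature)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return softmax((p @ f) / temperature)

def merge_projections(feature, projectors, prototypes, temperature=0.1):
    """Weighted mix of task-specific projector outputs for a single input feature."""
    w = prototype_weights(feature, prototypes, temperature)   # shape (T,)
    outputs = np.stack([feature @ P for P in projectors])     # shape (T, d_out)
    return w @ outputs                                        # shape (d_out,)
```

A lower `temperature` pushes the mix toward the single closest task, while a higher value blends the projectors more evenly; the value used here is arbitrary.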

In practice

Use CPM as a plug-and-play module for MLLM projector stability.
Apply LPM for efficient, exemplar-free LoRA weight merging.
Scale merged task vectors with \(\lambda=3\), the setting reported to perform best.
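
As a rough illustration of the scaling step above, the sketch below follows the common task-arithmetic convention of adding a scaled task vector to frozen base weights. How MAny applies \(\lambda\) internally is not stated in this summary, so the formula \(W = W_0 + \lambda \tau\), the helper names, and the use of a single LoRA pair \((B, A)\) to form the task vector are all assumptions.

```python
import numpy as np

def lora_task_vector(B, A):
    """Dense task vector implied by a LoRA pair: B is (d_out, r), A is (r, d_in)."""
    return B @ A

def apply_scaled_task_vector(W0, tau, lam=3.0):
    """Add a merged task vector to frozen base weights, scaled by lam."""
    return W0 + lam * tau

# Toy usage with random matrices standing in for real weights.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(6, 5))    # hypothetical frozen base weight
B = rng.normal(size=(6, 2))     # hypothetical merged LoRA factor
A = rng.normal(size=(2, 5))     # hypothetical merged LoRA factor
W = apply_scaled_task_vector(W0, lora_task_vector(B, A), lam=3.0)
```

Setting `lam=0.0` recovers the base weights unchanged, which makes the coefficient easy to sweep when checking the reported value on your own tasks.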

Topics

Multimodal Continual Instruction Tuning
Catastrophic Forgetting
Multimodal Large Language Models
Cross-modal Projection Merging
Low-rank Parameter Merging

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.