MAny: Merge Anything for Multimodal Continual Instruction Tuning

2026-04-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

MAny (Merge Anything) is a framework for mitigating catastrophic forgetting in Multimodal Continual Instruction Tuning (MCIT) of Multimodal Large Language Models (MLLMs). It identifies a dual-forgetting phenomenon: perception drift in the Cross-modal Projection Space and reasoning collapse in the Low-rank Parameter Space. MAny addresses both with two components, Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). CPM restores perceptual alignment by adaptively merging cross-modal visual representations under visual-prototype guidance, while LPM recursively merges low-rank weight matrices to reduce interference among task-specific modules. MAny is training-free: knowledge merging relies on efficient CPU-based algebraic operations. On the UCIT benchmark, MAny achieves up to 8.57% and 2.85% higher final average accuracy than existing methods across two distinct MLLMs.

Key takeaway

For AI engineers building MLLMs that must adapt to tasks sequentially, MAny offers a training-free way to combat catastrophic forgetting. Consider integrating its Cross-modal Projection Merging and Low-rank Parameter Merging components to preserve both perceptual accuracy and reasoning stability across tasks; on the UCIT benchmark, the reported gain in final average accuracy over existing methods is up to 8.57%.

Key insights

MAny mitigates dual-forgetting in MLLMs via training-free merging of perceptual and reasoning knowledge.

Principles

Catastrophic forgetting impacts both perception and reasoning in MLLMs.
Algebraic merging can provide optimal fusion for reasoning stability.

Method

MAny merges task-specific knowledge with two components: Cross-modal Projection Merging (CPM) for perceptual alignment and Low-rank Parameter Merging (LPM) for reasoning stability. Both run as training-free, CPU-based algebraic operations.
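
To make the idea of recursively merging low-rank modules concrete, here is a minimal sketch. It assumes each task contributes a pair of low-rank factors (B, A) whose product is that task's weight update, and it folds each new update into a running mean. The running-mean rule and all function names are illustrative assumptions, not the merging rule used by MAny's LPM, which this summary does not spell out.

```python
# Hypothetical sketch: recursive, training-free merging of low-rank updates.
# The running-mean rule is an illustrative stand-in, NOT MAny's actual LPM rule.


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def recursive_merge(merged, delta, step):
    """Fold the step-th task update into the running merge (step counts from 1)."""
    return [
        [m + (d - m) / step for m, d in zip(m_row, d_row)]
        for m_row, d_row in zip(merged, delta)
    ]


def merge_low_rank_updates(factors):
    """factors: list of (B, A) pairs; each task's full update is B @ A."""
    merged = None
    for step, (b, a) in enumerate(factors, start=1):
        delta = matmul(b, a)
        if merged is None:
            merged = delta  # first task: nothing to merge with yet
        else:
            merged = recursive_merge(merged, delta, step)
    return merged


# Two toy rank-1 task updates for a 2x2 weight matrix.
task_1 = ([[1.0], [0.0]], [[2.0, 0.0]])  # B1 @ A1 = [[2, 0], [0, 0]]
task_2 = ([[0.0], [1.0]], [[0.0, 4.0]])  # B2 @ A2 = [[0, 0], [0, 4]]
merged_update = merge_low_rank_updates([task_1, task_2])
```

Because each step needs only the previous merge and the newest update, earlier task modules do not have to be revisited, and the whole procedure is plain arithmetic that runs on a CPU.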

In practice

Apply MAny for sequential task adaptation in MLLMs.
Utilize CPM for visual representation recovery.
Employ LPM to prevent low-rank module interference.
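
As a rough illustration of prototype-guided merging for the visual side, the sketch below assumes each task stores a visual prototype (for example, a mean feature vector) together with its own projector matrix. An incoming feature is compared with every prototype, the similarities become merge weights through a softmax, and the projectors are combined with those weights. The cosine similarity, the softmax temperature, and every name here are assumptions made for illustration; the summary does not specify how CPM computes its weights.

```python
# Hypothetical sketch: prototype-guided merging of per-task projectors.
# Similarity measure, temperature, and names are assumptions, not MAny's CPM.
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def softmax(scores, temperature=0.1):
    """Numerically stable softmax over similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def merge_weights(feature, prototypes, temperature=0.1):
    """Turn feature-to-prototype similarities into merge weights."""
    return softmax([cosine(feature, p) for p in prototypes], temperature)


def merge_projectors(projectors, weights):
    """Weighted sum of per-task projector matrices (lists of rows)."""
    rows, cols = len(projectors[0]), len(projectors[0][0])
    return [
        [sum(w * p[i][j] for w, p in zip(weights, projectors)) for j in range(cols)]
        for i in range(rows)
    ]


# Toy setup: two tasks with orthogonal prototypes and distinct 2x2 projectors.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
projectors = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
weights = merge_weights([1.0, 0.0], prototypes)
merged_projector = merge_projectors(projectors, weights)
```

In this toy setup the input feature matches the first prototype, so the merged projector stays close to the first task's projector; a feature lying between prototypes would yield a more even blend.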

Topics

Multimodal Continual Instruction Tuning
Catastrophic Forgetting
Multimodal Large Language Models
Cross-modal Projection Merging
Low-rank Parameter Merging

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.