MAny: Merge Anything for Multimodal Continual Instruction Tuning
Summary
MAny (Merge Anything) is a framework for mitigating catastrophic forgetting in Multimodal Continual Instruction Tuning (MCIT) of Multimodal Large Language Models (MLLMs). It identifies a dual-forgetting phenomenon: perception drift in the Cross-modal Projection Space and reasoning collapse in the Low-rank Parameter Space. MAny addresses both with two components, Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). CPM restores perceptual alignment by adaptively merging cross-modal visual representations under visual-prototype guidance, while LPM recursively merges low-rank weight matrices to reduce interference among task-specific modules. MAny is training-free: the merging step uses efficient CPU-based algebraic operations and requires no additional training. On the UCIT benchmark, MAny is reported to improve final average accuracy over existing methods by up to 8.57% and 2.85% on two different MLLMs.
Key takeaway
For AI engineers building MLLMs that must adapt to tasks sequentially, MAny offers a training-free way to counter catastrophic forgetting. Consider combining its Cross-modal Projection Merging and Low-rank Parameter Merging components to preserve both perceptual accuracy and reasoning stability across tasks; the reported gain in final average accuracy on the UCIT benchmark is up to 8.57% over existing methods.
Key insights
MAny mitigates dual-forgetting in MLLMs via training-free merging of perceptual and reasoning knowledge.
Principles
- Catastrophic forgetting impacts both perception and reasoning in MLLMs.
- Algebraic merging can provide optimal fusion for reasoning stability.
Method
MAny merges task-specific knowledge with two components: Cross-modal Projection Merging (CPM) for perceptual alignment and Low-rank Parameter Merging (LPM) for reasoning stability. Both run as CPU-based algebraic operations.
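To make the idea behind CPM more concrete, here is a minimal sketch of prototype-guided merging of projected visual features. It is not the paper's algorithm: the softmax over cosine similarities, the `temperature` parameter, the linear per-task projectors, and all function and variable names are assumptions chosen for illustration. Each task is assumed to store one visual prototype (a mean feature vector) and one projector matrix.

```python
import numpy as np

def cosine_similarity(x, prototypes):
    """Cosine similarity between one feature vector and each task prototype."""
    x = x / np.linalg.norm(x)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return p @ x

def merge_coefficients(x, prototypes, temperature=0.1):
    """Softmax over prototype similarities (assumed weighting rule)."""
    logits = cosine_similarity(x, prototypes) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def merged_projection(x, prototypes, projectors, temperature=0.1):
    """Project x with every task's projector, then blend the results."""
    coeffs = merge_coefficients(x, prototypes, temperature)
    per_task = projectors @ x          # shape: (num_tasks, d_llm)
    return coeffs @ per_task, coeffs   # shape: (d_llm,), (num_tasks,)

# Toy data: 3 tasks, 16-dim visual features, 8-dim language-model inputs.
rng = np.random.default_rng(0)
num_tasks, d_vis, d_llm = 3, 16, 8
prototypes = rng.normal(size=(num_tasks, d_vis))
projectors = rng.normal(size=(num_tasks, d_llm, d_vis))

# A query that lies close to the prototype of task index 1.
query = prototypes[1] + 0.05 * rng.normal(size=d_vis)
projected, coeffs = merged_projection(query, prototypes, projectors)
```

In this sketch everything is a small matrix product on the CPU, which is consistent with the training-free framing above; a lower `temperature` pushes the blend toward the single closest task.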
In practice
- Apply MAny for sequential task adaptation in MLLMs.
- Utilize CPM for visual representation recovery.
- Employ LPM to prevent low-rank module interference.
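As a rough illustration of what recursively merging low-rank modules can look like, the sketch below folds each task's low-rank update into a single running matrix, one task at a time. It uses a plain incremental average as a stand-in; the actual LPM merging rule is not specified in this summary, and the names `lora_delta` and `recursive_merge` are hypothetical.

```python
import numpy as np

def lora_delta(B, A):
    """Dense weight update implied by one low-rank module: delta_W = B @ A."""
    return B @ A

def recursive_merge(merged, new_delta, t):
    """Fold the t-th task's update (1-indexed) into the running merge.

    Stand-in rule: an incremental mean, so earlier task updates never
    need to be revisited or stored.
    """
    if merged is None:
        return new_delta.copy()
    return merged + (new_delta - merged) / t

# Toy data: three sequential tasks, each with its own rank-2 module.
rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2

deltas = []
merged = None
for t in range(1, 4):
    B = rng.normal(size=(d_out, rank))
    A = rng.normal(size=(rank, d_in))
    delta = lora_delta(B, A)
    deltas.append(delta)  # kept only so the result can be checked afterwards
    merged = recursive_merge(merged, delta, t)
```

The recursive form matters for continual settings: only the current merged matrix and the newest module are needed at each step, so the cost of adding a task does not grow with the number of earlier tasks.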
Topics
- Multimodal Continual Instruction Tuning
- Catastrophic Forgetting
- Multimodal Large Language Models
- Cross-modal Projection Merging
- Low-rank Parameter Merging
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.