StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs
Summary
StrLoRA is a framework for Streaming Continual Visual Instruction Tuning (StrCVIT) in Multimodal Large Language Models (MLLMs). It targets learning from continuous, interleaved data streams in which tasks evolve dynamically. Traditional Continual Visual Instruction Tuning (CVIT) methods assume discrete, single-task training phases; StrCVIT instead simulates real-world conditions where a model must simultaneously acquire new abilities, reinforce recurring ones, and mitigate forgetting. StrLoRA employs a regularized two-stage expert routing mechanism. First, task-aware expert selection uses the textual instruction to activate a sparse subset of relevant LoRA experts, which reduces cross-task interference. Second, token-wise expert weighting computes each selected expert's contribution weight via cross-modal attention. A routing-stability regularization, based on an exponential moving average (EMA), keeps expert allocations evolving smoothly. In experiments on a new StrCVIT benchmark with InternVL3.5-8B-Pretrained and Gemma3-4B-PT, StrLoRA is reported to significantly outperform existing methods; on InternVL3.5-8B-Pretrained over 25 data chunks it reaches a Mean Average Performance (MAP) of 65.37% and a Mean Average Forgetting (MAF) of 1.91%.
Key takeaway
For research scientists and MLLM developers working on continual learning systems, StrLoRA offers an approach to handling dynamic, interleaved task streams. Consider adopting its two-stage expert routing and routing-stability regularization to improve adaptation and reduce catastrophic forgetting in non-stationary data environments. The reported benchmark gains suggest this may translate to more stable and better-performing MLLMs in production.
Key insights
Textual instructions provide a stable signal for expert routing in streaming multimodal learning.
Principles
- Decouple expert selection from token weighting.
- Regularize routing to maintain stability.
- Textual instructions offer task-discriminative signals.
Method
StrLoRA uses two-stage expert routing: task-aware expert selection from text embeddings of the instruction, followed by token-wise expert weighting via cross-modal attention. An EMA-based routing regularization stabilizes both stages.
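The two routing stages can be illustrated with a minimal sketch. It is not the paper's implementation: the per-expert key vectors, the function names, the top-k rule, and the use of a plain scaled dot-product score in place of the cross-modal attention are all assumptions made for illustration, since this summary does not give the exact formulation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_experts(instruction_emb, expert_keys, k):
    """Stage 1 (assumed form): keep the k experts whose hypothetical key
    vectors score highest against the instruction's text embedding."""
    scores = [dot(instruction_emb, key) for key in expert_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def token_weights(token_embs, expert_keys, selected):
    """Stage 2 (assumed form): for each token, softmax over the selected
    experts only. A scaled dot product stands in for cross-modal attention."""
    scale = math.sqrt(len(expert_keys[0]))
    weights = []
    for tok in token_embs:
        logits = [dot(tok, expert_keys[i]) / scale for i in selected]
        weights.append(softmax(logits))
    return weights

# Toy example: 4 hypothetical experts with 2-dimensional keys.
expert_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
instruction_emb = [1.0, 0.0]
token_embs = [[1.0, 0.0], [0.0, 1.0]]

selected = select_experts(instruction_emb, expert_keys, k=2)
weights = token_weights(token_embs, expert_keys, selected)
```

Because the second stage normalizes only over the experts chosen in the first, experts outside the selected subset receive no weight for any token, which is the sense in which selection is decoupled from weighting.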
In practice
- Use textual instructions for robust task discrimination.
- Implement two-stage expert routing for MLLMs.
- Apply EMA regularization to stabilize expert allocation.
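The EMA regularization in the last item can be sketched as follows. The summary only states that the regularizer is based on an exponential moving average, so the squared-distance penalty, the decay value, and the function names here are assumptions, not the paper's definition.

```python
def stability_penalty(current, ema):
    """Assumed penalty: squared L2 distance between the current routing
    distribution over experts and its running average."""
    return sum((c - e) ** 2 for c, e in zip(current, ema))

def ema_update(ema, current, decay=0.9):
    """Standard exponential moving average update of the routing distribution."""
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]

# Toy example: the router suddenly shifts most of its mass to expert 0.
ema = [0.5, 0.5]
current = [0.9, 0.1]

penalty = stability_penalty(current, ema)  # computed against the previous average
ema = ema_update(ema, current)             # the average then moves slowly toward current
```

In a training loop, a penalty of this kind would be added to the task loss with a small coefficient, so that abrupt changes in expert allocation are discouraged without being forbidden.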
Topics
- Streaming Continual Visual Instruction Tuning
- Multimodal Large Language Models
- StrLoRA Framework
- Expert Routing
- Catastrophic Forgetting
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.