StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

StrLoRA is a framework for Streaming Continual Visual Instruction Tuning (StrCVIT) in Multimodal Large Language Models (MLLMs). It targets learning from continuous, interleaved data streams in which the set of tasks evolves over time. Traditional Continual Visual Instruction Tuning (CVIT) methods assume discrete, single-task training phases; StrCVIT instead simulates real-world conditions in which a model must simultaneously acquire new abilities, reinforce recurring ones, and mitigate forgetting. StrLoRA employs a regularized two-stage expert routing mechanism. First, task-aware expert selection uses the textual instruction to activate a sparse subset of relevant LoRA experts, which reduces cross-task interference. Second, token-wise expert weighting computes each selected expert's contribution weight via cross-modal attention. A routing-stability regularization based on an exponential moving average (EMA) keeps expert allocations evolving smoothly. In experiments on a new StrCVIT benchmark with InternVL3.5-8B-Pretrained and Gemma3-4B-PT, StrLoRA is reported to significantly outperform existing methods. On InternVL3.5-8B-Pretrained over 25 data chunks, it reaches a higher Mean Average Performance (MAP) of 65.37% and a lower Mean Average Forgetting (MAF) of 1.91%.
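The EMA-based routing-stability regularization mentioned above can be illustrated with a minimal Python sketch. The summary does not give the exact loss, so everything here is an assumption for illustration: the squared-distance penalty, the decay value of 0.9, and the function names `ema_update` and `routing_stability_penalty` are hypothetical and not taken from the paper.

```python
def ema_update(ema, current, decay=0.9):
    """Blend the running average of expert allocations with the current one.

    `decay` is an assumed hyperparameter, not a value from the paper.
    """
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]


def routing_stability_penalty(current, ema):
    """Assumed penalty: squared L2 distance between the current routing
    distribution and its exponential moving average. A large value means
    the expert allocation jumped away from its recent history."""
    return sum((c - e) ** 2 for c, e in zip(current, ema))


# Running average over two experts, then a sudden shift toward expert 0.
ema = [0.5, 0.5]
current = [0.9, 0.1]

penalty = routing_stability_penalty(current, ema)  # about 0.32
ema = ema_update(ema, current)                     # about [0.54, 0.46]
```

In a training loop, a term like `penalty` would be added to the task loss with a small coefficient, so that abrupt changes in expert allocation are discouraged while gradual drift remains possible.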

Key takeaway

For research scientists and MLLM developers working on continual learning systems, StrLoRA offers an approach to handling dynamic, interleaved task streams. Consider adopting its two-stage expert routing and routing-stability regularization to improve model adaptation and reduce catastrophic forgetting on real-world, non-stationary data. The reported benchmark results suggest this may yield more stable and better-performing MLLMs in production.

Key insights

Textual instructions provide a stable signal for expert routing in streaming multimodal learning.

Principles

Method

StrLoRA uses two-stage expert routing: task-aware expert selection via text embeddings, followed by token-wise expert weighting via cross-modal attention. An EMA-based routing regularization stabilizes the resulting expert allocations.
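The two routing stages can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-expert key vectors (`expert_keys`), the dot-product scoring, the `top_k` parameter, and both function names are hypothetical, and a scaled dot product stands in for the paper's cross-modal attention.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(instruction_emb, expert_keys, top_k):
    """Stage 1 (assumed form): score every expert against the text
    instruction embedding and keep the indices of the top_k experts."""
    scores = [dot(instruction_emb, key) for key in expert_keys]
    ranked = sorted(range(len(expert_keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def token_expert_weights(token_embs, expert_keys, active):
    """Stage 2 (assumed form): for each token, softmax over the active
    experts only, using a scaled dot product as a stand-in for
    cross-modal attention."""
    weights = []
    for tok in token_embs:
        scale = math.sqrt(len(tok))
        scores = [dot(tok, expert_keys[i]) / scale for i in active]
        weights.append(softmax(scores))
    return weights


# Toy example: four experts, two-dimensional embeddings.
expert_keys = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]
instruction_emb = [1.0, 0.0]
token_embs = [[1.0, 0.0], [0.0, 1.0]]

active = select_experts(instruction_emb, expert_keys, top_k=2)
weights = token_expert_weights(token_embs, expert_keys, active)
```

Here `active` holds the indices of the sparse expert subset, and each row of `weights` is one token's distribution over that subset. Experts outside the subset receive no weight, which is what limits cross-task interference in this sketch.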

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.