Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning
Summary
ViToS is a dual-stream reinforcement learning (RL) framework that addresses the challenge of sparse visual evidence in medical images for vision-language models (VLMs). It proposes a unified RL approach to active visual token pruning (VTP) and medical multimodal reasoning. ViToS trains a single policy model with two distinct task branches: one dedicated to grounding and the other to token-sparse reasoning after VTP. To manage the coupled policy learning, the framework employs cross-feedback sequential optimization, which prevents gradient conflicts and helps the shared policy model converge. Evaluated across seven medical benchmarks, ViToS reduces visual tokens to 77% of their original sequence length. Despite using fewer tokens, it achieves 108.27% relative performance on Lingshu-7B and 104.16% relative performance on HuatuoGPT-Vision-7B, alongside a notable inference speedup.
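The reported figures can be read with a little arithmetic. The sketch below is illustrative only: it assumes "relative performance" means the pruned model's score as a percentage of the same base model's full-token score, which the summary does not define, and the function names are hypothetical.

```python
def tokens_after_pruning(n_tokens: int, retention: float = 0.77) -> int:
    """Visual tokens left when `retention` of the sequence is kept.

    0.77 is the retention reported in the summary; n_tokens is whatever
    the image encoder produces for a given input.
    """
    return round(n_tokens * retention)


def relative_performance(pruned_score: float, full_score: float) -> float:
    """Pruned-model score as a percentage of the full-token score.

    Assumption: this is how "relative performance" is defined. Values
    above 100 mean the pruned model scores higher than the baseline.
    """
    return 100.0 * pruned_score / full_score
```

Under these assumptions, an image encoded as 1,000 visual tokens would keep about 770 of them, a 23% reduction, and any relative performance above 100% means accuracy rose while the token count fell.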
Key takeaway
For AI Scientists and Research Scientists developing vision-language models for medical applications, ViToS offers a practical approach to images with sparse visual evidence. Consider combining active visual token pruning (VTP) with a dual-stream reinforcement learning architecture that trains grounding and token-sparse reasoning in a single policy model. In the reported results, this approach cuts visual token length by 23% while raising relative performance above 100% on base models such as Lingshu-7B and HuatuoGPT-Vision-7B, and it delivers a notable inference speedup for medical reasoning systems.
Key insights
ViToS employs dual-stream reinforcement learning and active visual token pruning to significantly improve medical multimodal reasoning with sparse visual evidence.
Principles
- Pruning visual tokens improves medical reasoning.
- Dual-stream RL handles coupled reasoning tasks.
- Cross-feedback optimizes shared policy learning.
Method
ViToS trains a single policy model with two branches: one for grounding and another for token-sparse reasoning post-VTP. Cross-feedback sequential optimization manages coupled policy learning, preventing gradient conflicts and facilitating convergence.
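To make the idea of pruning after grounding concrete, here is a minimal sketch. It assumes the grounding branch outputs a normalized bounding box and that visual tokens lie on a row-major patch grid. Neither detail is stated in the summary, and `prune_tokens_by_box` is a hypothetical helper, not part of ViToS.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def prune_tokens_by_box(grid_h: int, grid_w: int, box: Box) -> List[int]:
    """Return row-major indices of patch tokens that overlap `box`.

    Assumption: one visual token per patch cell on a grid_h x grid_w grid.
    Tokens outside the grounded region are dropped.
    """
    x0, y0, x1, y1 = box
    kept = []
    for row in range(grid_h):
        for col in range(grid_w):
            # Normalized extent of this patch cell.
            cx0, cx1 = col / grid_w, (col + 1) / grid_w
            cy0, cy1 = row / grid_h, (row + 1) / grid_h
            # Keep the cell only if it has a non-empty intersection with the box.
            if cx0 < x1 and cx1 > x0 and cy0 < y1 and cy1 > y0:
                kept.append(row * grid_w + col)
    return kept


def retention_ratio(kept: List[int], grid_h: int, grid_w: int) -> float:
    """Fraction of the original visual tokens that survive pruning."""
    return len(kept) / (grid_h * grid_w)
```

For example, on a 4x4 grid a box covering the top-left quadrant, `(0.0, 0.0, 0.5, 0.5)`, keeps tokens `[0, 1, 4, 5]`, a retention ratio of 0.25. The reasoning branch would then see only the kept tokens. How ViToS actually selects tokens may differ.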
In practice
- Reduce visual tokens for VLM efficiency.
- Apply VTP to sparse medical images.
- Use dual-stream RL for complex VLM tasks.
Topics
- Medical Imaging
- Vision-Language Models
- Reinforcement Learning
- Token Pruning
- Multimodal Reasoning
- ViToS Framework
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.