Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, quick

Summary

ViToS is a dual-stream reinforcement learning (RL) framework for vision-language models (VLMs) that targets a specific property of medical images: the visual evidence relevant to a question is sparse. It unifies active visual token pruning (VTP) and medical multimodal reasoning under a single RL approach. ViToS trains one policy model with two task branches: one for grounding, and one for token-sparse reasoning after VTP. Because both branches update the same policy, the framework uses cross-feedback sequential optimization to prevent gradient conflicts and help the shared policy converge. Across seven medical benchmarks, ViToS reduces visual tokens to 77% of the original sequence length while reaching 108.27% relative performance on Lingshu-7B and 104.16% relative performance on HuatuoGPT-Vision-7B, along with a notable inference speedup.
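To make the headline numbers concrete, here is a small arithmetic sketch. It assumes that "relative performance" means the pruned model's score as a percentage of an unpruned reference score; the source does not define the term, so treat that reading as an assumption. The function names and the example scores are hypothetical and are not taken from the paper.

```python
def tokens_kept(n_tokens, keep_ratio=0.77):
    """Number of visual tokens left after pruning to keep_ratio of the original."""
    return round(n_tokens * keep_ratio)


def relative_performance(pruned_score, baseline_score):
    """Pruned score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * pruned_score / baseline_score


# Hypothetical example: 1000 visual tokens before pruning.
kept = tokens_kept(1000)      # 770 tokens remain
removed = 1000 - kept         # 230 tokens (23%) are dropped

# Hypothetical scores chosen only to illustrate the ratio.
rel = relative_performance(pruned_score=54.135, baseline_score=50.0)
```

Under this reading, a value above 100% means the pruned model scores higher than the reference, not merely that it keeps up with it.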

Key takeaway

For AI Scientists and Research Scientists building vision-language models for medical applications, ViToS suggests learning visual token pruning jointly with reasoning. Consider pairing active visual token pruning (VTP) with a dual-stream reinforcement learning setup when the relevant visual evidence is sparse. In the reported results, this cuts visual token length by 23% while reaching relative performance above 100% (108.27% on Lingshu-7B, 104.16% on HuatuoGPT-Vision-7B) and speeding up inference.

Key insights

ViToS employs dual-stream reinforcement learning and active visual token pruning to significantly improve medical multimodal reasoning with sparse visual evidence.

Principles

Pruning visual tokens can improve medical reasoning when visual evidence is sparse.
Dual-stream RL handles coupled reasoning tasks.
Cross-feedback optimizes shared policy learning.

Method

ViToS trains a single policy model with two branches: one for grounding and another for token-sparse reasoning post-VTP. Cross-feedback sequential optimization manages coupled policy learning, preventing gradient conflicts and facilitating convergence.
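The control flow of sequential optimization on shared parameters can be illustrated with a toy example. This is not the ViToS update rule, which the source does not specify, and it leaves out the cross-feedback signal entirely. It assumes a single scalar parameter `theta` shared by two hypothetical quadratic losses standing in for the grounding and reasoning branches, and it shows only the idea of applying one branch's update before computing the other's, instead of summing both gradients in one step.

```python
def grad_grounding(theta):
    # Gradient of the toy grounding loss (theta - 1)^2.
    return 2.0 * (theta - 1.0)


def grad_reasoning(theta):
    # Gradient of the toy reasoning loss (theta - 3)^2.
    return 2.0 * (theta - 3.0)


def sequential_update(theta, lr=0.1, steps=200):
    """Alternate branch updates on one shared parameter."""
    for _ in range(steps):
        # Step 1: update the shared parameter on the grounding branch.
        theta -= lr * grad_grounding(theta)
        # Step 2: the reasoning gradient is computed on the already-updated
        # parameter, so the two gradients are never added together.
        theta -= lr * grad_reasoning(theta)
    return theta


theta_final = sequential_update(0.0)
```

In this toy the two losses pull in opposite directions between 1 and 3. The alternating schedule settles at a point between the two optima (about 2.11 with these settings), whereas a summed-gradient update would settle at the midpoint 2.0. How ViToS balances its branches in practice depends on details the summary does not give.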

In practice

Reduce visual tokens for VLM efficiency.
Apply VTP to sparse medical images.
Use dual-stream RL for complex VLM tasks.
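One simple way to try the first two points is score-based top-k pruning. The sketch below is an assumption, not the ViToS method: the source does not say how tokens are scored or selected, so `scores` stands in for whatever relevance signal is available (for example, one derived from a grounding output). `prune_visual_tokens` and `keep_ratio` are hypothetical names, and the default of 0.77 simply mirrors the reported token budget.

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.77):
    """Keep the highest-scoring visual tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return []
    n_keep = max(1, round(keep_ratio * len(tokens)))
    # Rank token indices by score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay spatially consistent.
    kept_indices = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept_indices]


# Hypothetical usage: 100 placeholder tokens with made-up relevance scores.
example_tokens = list(range(100))
example_scores = [float(i) for i in range(100)]
example_kept = prune_visual_tokens(example_tokens, example_scores)
```

Keeping the surviving tokens in their original order is a design choice made here so that positional information remains meaningful to the downstream model; a fixed keep ratio is also a simplification, since an active pruning policy could choose a different budget per image.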

Topics

Medical Imaging
Vision-Language Models
Reinforcement Learning
Token Pruning
Multimodal Reasoning
ViToS Framework

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.