Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
Summary
Microsoft Research has released Phi-4-reasoning-vision-15B, a 15-billion-parameter open-weight multimodal reasoning model, available through Microsoft Foundry, Hugging Face, and GitHub. The model targets a broad range of vision-language tasks, including image captioning, visual question answering, document reading, and user-interface understanding, and it is strongest at math and science reasoning. It offers a favorable trade-off between accuracy and compute cost: it is competitive with models that require significantly more compute time and tokens, and it is more accurate than similarly fast models, especially in math and science. The model was trained on 200 billion tokens of multimodal data, substantially less than other large multimodal models, an efficiency it reaches through careful architecture choices, rigorous data curation, and a mix of reasoning and non-reasoning training data.
Key takeaway
For AI scientists and research scientists developing multimodal systems, Phi-4-reasoning-vision-15B is an efficient open-weight option. It is worth evaluating for applications that need strong math, science, or UI-interaction capabilities, where it offers a better balance of accuracy and compute than larger, slower alternatives. Test it on your own high-resolution visual tasks, and use its mixed reasoning and non-reasoning modes to reduce latency where extended reasoning is unnecessary.
Key insights
Efficient multimodal reasoning models can achieve high performance with less data and compute through careful design and data curation.
Principles
- Mid-fusion architectures balance performance and resource constraints.
- Dynamic resolution vision encoders enhance high-resolution data processing.
- High-quality, balanced data is crucial for multimodal model performance.
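To make the dynamic-resolution idea concrete, here is a minimal sketch of how an encoder might pick a patch grid that follows the image's aspect ratio while staying within a fixed patch budget. It is an illustration only, not the SigLIP-2 or Phi-4-reasoning-vision implementation; the function name `patch_grid` and the default values for `patch_size` and `max_patches` are assumptions.

```python
import math


def patch_grid(width, height, patch_size=16, max_patches=1024):
    """Return (cols, rows) of a patch grid with cols * rows <= max_patches.

    Hypothetical helper: patch_size and max_patches are illustrative
    defaults, not values taken from the model.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if patch_size <= 0 or max_patches < 1:
        raise ValueError("patch_size must be positive and max_patches >= 1")

    # Grid at native resolution, rounded up so every pixel is covered.
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)

    if cols * rows > max_patches:
        # Shrink both sides by the same factor to roughly keep the aspect ratio.
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
        # Clamp for extreme aspect ratios, where the max(1, ...) floor
        # could otherwise push the product over the budget.
        cols = min(cols, max_patches)
        rows = min(rows, max_patches // cols)

    return cols, rows
```

A small image keeps its native grid, while a large one is scaled down to fit the budget, so the token cost per image is bounded without forcing every input into one fixed square resolution.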
Method
The model uses a mid-fusion architecture that pairs a SigLIP-2 NaFlex-variant vision encoder with a Phi-4-Reasoning language backbone. It is trained on a mix of reasoning and non-reasoning data, with 20% reasoning data, which enables task-aware inference: the model can reason step by step or answer directly depending on the task.
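The mixed-data idea can be sketched as a simple sampler that draws a fixed share of reasoning examples. This is a hedged illustration, not the authors' pipeline: the function `build_mixture` is hypothetical, and it assumes the 20% figure is a fraction of examples (it could equally be measured in tokens).

```python
import random


def build_mixture(reasoning, direct, reasoning_fraction=0.2, total=None, seed=0):
    """Return a shuffled list of (mode, example) pairs.

    Hypothetical helper: treats reasoning_fraction as a share of examples,
    which is an assumption about how the 20% figure is measured.
    """
    if not 0.0 <= reasoning_fraction <= 1.0:
        raise ValueError("reasoning_fraction must be between 0 and 1")
    if total is None:
        total = len(reasoning) + len(direct)

    n_reasoning = round(total * reasoning_fraction)
    n_direct = total - n_reasoning
    if (n_reasoning and not reasoning) or (n_direct and not direct):
        raise ValueError("a pool with a non-zero quota must not be empty")

    rng = random.Random(seed)  # seeded for reproducible mixtures
    # Sample with replacement so a pool may be smaller than its quota.
    mixture = [("reasoning", rng.choice(reasoning)) for _ in range(n_reasoning)]
    mixture += [("direct", rng.choice(direct)) for _ in range(n_direct)]
    rng.shuffle(mixture)
    return mixture
```

For example, `build_mixture(reasoning_pool, direct_pool, total=100)` yields 20 reasoning examples and 80 direct ones in shuffled order.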
In practice
- Use Phi-4-reasoning-vision-15B for math and science reasoning tasks.
- Apply the model for computer-using agent (CUA) scenarios.
- Use explicit prompting with the "<reasoning>" or "<direct>" tokens to select the response mode.
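A minimal sketch of explicit mode prompting follows. It uses the token strings exactly as quoted in this summary; the helper `build_prompt`, the `mode` names, and placing the token at the start of the prompt are all assumptions, so confirm the exact tokens and their placement against the model card before relying on them.

```python
# Token strings as quoted in this summary; verify against the model card.
MODE_TOKENS = {"reasoning": "<reasoning>", "direct": "<direct>"}


def build_prompt(question, mode=None):
    """Prefix a question with a mode token, or leave it unchanged.

    Hypothetical helper: with mode=None the prompt is returned as-is,
    leaving the choice of mode to the model.
    """
    if mode is None:
        return question
    if mode not in MODE_TOKENS:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: the token goes at the very start of the user prompt.
    return f"{MODE_TOKENS[mode]}\n{question}"
```

Calling `build_prompt("Solve for x: 3x + 2 = 11", mode="reasoning")` requests step-by-step reasoning, while `mode="direct"` suits low-latency queries such as short captions.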
Topics
- Phi-4-reasoning-vision-15B
- Multimodal Reasoning
- Efficient VLM Training
- Data Curation
- Computer-Using Agents
Best for: AI Scientist, Research Scientist, Computer Vision Engineer, AI Researcher, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Research.