Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model

2026-03-04 · Source: Microsoft Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

Microsoft Research has released Phi-4-reasoning-vision-15B, a 15-billion-parameter open-weight multimodal reasoning model, available through Microsoft Foundry, Hugging Face, and GitHub. The model targets a broad range of vision-language tasks, including image captioning, visual question answering, document reading, and user-interface understanding, and it is strongest at math and science reasoning. Its trade-off between accuracy and compute cost is favorable: it is competitive with models that need significantly more compute time and tokens, and it is more accurate than similarly fast models, especially in math and science. It was trained on 200 billion tokens of multimodal data, substantially less than other large multimodal models, an efficiency achieved through careful architecture choices, rigorous data curation, and a mix of reasoning and non-reasoning training data.

Key takeaway

For AI scientists and research scientists building multimodal systems, Phi-4-reasoning-vision-15B is an efficient open-weight option. Consider it for applications that need strong math, science, or UI-interaction capabilities, where it offers a better balance of accuracy and compute than larger, slower alternatives. Evaluate it on your own high-resolution visual tasks, and use its mixed reasoning approach to keep latency down on tasks that do not need step-by-step reasoning.

Key insights

Efficient multimodal reasoning models can achieve high performance with less data and compute through careful design and data curation.

Principles

Mid-fusion architectures balance performance and resource constraints.
Dynamic-resolution vision encoders improve the handling of high-resolution inputs.
High-quality, balanced data is crucial for multimodal model performance.
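
As a rough illustration of the dynamic-resolution principle, the sketch below turns an image size into a patch grid that roughly keeps the aspect ratio while staying within a visual-token budget. The patch size of 16 pixels, the budget of 1024 patches, and the function name patch_grid are all assumptions made for this example. This is not the SigLIP-2 NaFlex implementation, only the arithmetic such an encoder has to solve.

```python
import math

PATCH = 16          # assumed patch side length in pixels (hypothetical)
MAX_PATCHES = 1024  # assumed visual-token budget per image (hypothetical)


def patch_grid(width, height, patch=PATCH, max_patches=MAX_PATCHES):
    """Return (cols, rows) of a patch grid for an image of the given size.

    Small images keep their native grid. Larger images are scaled down by
    one common factor so the aspect ratio is roughly preserved and the
    number of patches never exceeds max_patches.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    if cols * rows > max_patches:
        # Same factor on both sides keeps the aspect ratio (up to rounding).
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
        # Clamp extreme aspect ratios that the floor/max above cannot fix.
        cols = min(cols, max_patches)
        rows = min(rows, max_patches // cols)
    return cols, rows


# A 224x224 image keeps its native 14x14 grid (196 visual tokens).
# A 1920x1080 screenshot is reduced to a 42x24 grid (1008 visual tokens).
```

A fixed-resolution encoder would instead squash every image to one square grid, which is where small text in documents and UI screenshots tends to get lost.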

Method

The model uses a mid-fusion architecture that pairs a SigLIP-2 NaFlex-variant vision encoder with a Phi-4-Reasoning backbone. Training mixes reasoning and non-reasoning data, with reasoning data making up 20% of the mix, which enables task-aware inference.
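
The 20% figure can be made concrete with a small sampling sketch. It assumes the share is counted per sample (the summary does not say whether it is measured in samples or tokens), and the names build_mixture, reasoning_pool, and direct_pool are hypothetical. It shows one simple way to assemble such a mix, not the pipeline Microsoft Research used.

```python
import random


def build_mixture(reasoning_pool, direct_pool, total,
                  reasoning_fraction=0.2, seed=0):
    """Draw `total` samples so that about `reasoning_fraction` of them
    come from the reasoning pool and the rest from the direct pool.

    Each returned item is a (kind, sample) pair, where kind is
    "reasoning" or "direct".
    """
    rng = random.Random(seed)  # seeded for reproducible mixes
    n_reasoning = round(total * reasoning_fraction)
    n_direct = total - n_reasoning
    if n_reasoning > len(reasoning_pool) or n_direct > len(direct_pool):
        raise ValueError("pools are too small for the requested mixture")

    # Sample without replacement from each pool, then interleave randomly.
    picked = [("reasoning", s) for s in rng.sample(reasoning_pool, n_reasoning)]
    picked += [("direct", s) for s in rng.sample(direct_pool, n_direct)]
    rng.shuffle(picked)
    return picked
```

With total=100 and the default fraction, the result holds 20 reasoning samples and 80 direct ones. Keeping the kind label next to each sample is what would let a training script tag the two groups differently.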

In practice

Use Phi-4-reasoning-vision-15B for math and science reasoning tasks.
Apply the model to computer-using agent (CUA) scenarios.
Prompt explicitly with the "<reasoning>" or "<direct>" token to select step-by-step reasoning or a direct answer.
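
A minimal sketch of the explicit-prompting tip follows. The token strings are copied from the list above and should be checked against the model card. Placing the token before the question, and the helper name build_prompt, are assumptions for illustration; the real chat template may differ.

```python
# Token strings as given in this summary; verify them against the model card.
MODE_TOKENS = {
    "reasoning": "<reasoning>",
    "direct": "<direct>",
}


def build_prompt(question, mode=None):
    """Return the question, optionally prefixed with a mode token.

    mode=None leaves the choice to the model (task-aware inference).
    Prefix placement is an assumption of this sketch.
    """
    if mode is None:
        return question
    if mode not in MODE_TOKENS:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{MODE_TOKENS[mode]} {question}"
```

For example, a geometry problem could be sent with mode="reasoning", while a request to read a label in a screenshot could use mode="direct" to avoid spending tokens on a reasoning trace.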

Topics

Phi-4-reasoning-vision-15B
Multimodal Reasoning
Efficient VLM Training
Data Curation
Computer-Using Agents

Best for: AI Scientist, Research Scientist, Computer Vision Engineer, AI Researcher, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Research.