New MAI models in Microsoft Foundry across text, image, voice, and speech
Summary
At Microsoft Build 2026, Microsoft announced new MAI models now available in Microsoft Foundry, expanding its first-party AI stack across text, image, voice, and speech. MAI-Thinking-1, its first large language model, uses a Mixture-of-Experts (MoE) architecture to deliver strong reasoning and math capabilities, matching Claude Opus 4.6 on SWE-Bench Pro at a substantially lower cost. The MAI-Image-2.5 family, including MAI-Image-2.5 Flash, offers updated image generation with image-to-image editing and "control with preservation" features, and debuted at No. 3 on Arena.ai. For audio, MAI-Voice-2 is a multilingual text-to-speech model supporting 15+ languages with voice cloning and prompting. MAI-Transcribe-1.5, a speech-to-text model, supports 43 languages, adds content biasing, and keeps its No. 1 spot on the FLEURS benchmark with a 3.7% Word Error Rate while running 5x more efficiently than competitors such as Gemini 3.1 Flash. These models already power Microsoft products and are now accessible to developers.
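To make the 3.7% Word Error Rate figure concrete: WER is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words, so 3.7% corresponds to roughly 3.7 errors per 100 reference words. The sketch below is a minimal illustration of that standard definition, not the scoring code used for FLEURS or by Microsoft. The function name `word_error_rate` is hypothetical, and the sketch assumes plain whitespace tokenization with no text normalization, which real benchmark scoring typically applies first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: whitespace tokenization, no casing or punctuation
    normalization (benchmark scorers usually normalize before this step).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sat on a mat" against the reference "the cat sat on the mat" counts one substitution over six reference words.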
Key takeaway
For AI engineers and ML directors building enterprise-scale applications, Microsoft's new MAI models in Foundry offer compelling options. Evaluate MAI-Thinking-1 for cost-effective, complex reasoning workloads, especially given its MoE architecture. Consider MAI-Image-2.5 for creative workflows that require precise image-to-image editing and brand consistency. Use MAI-Voice-2 for multilingual voice experiences and MAI-Transcribe-1.5 for highly accurate, specialized speech-to-text, particularly with its entity biasing feature. Together, these models provide an integrated stack for diverse AI development needs.
Key insights
Microsoft's new MAI models in Foundry offer advanced, cost-efficient AI capabilities across text, image, and audio modalities.
Principles
- MoE architectures scale model capacity without a proportional increase in per-token compute.
- Control with preservation enhances creative workflows.
- Multilingual identity preservation unifies voice experiences.
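The first principle can be illustrated with a toy top-k router. In a Mixture-of-Experts layer, a router scores every expert for each input, but only the few highest-scoring experts are actually evaluated, so adding experts grows total parameters faster than per-input compute. The sketch below is a generic illustration in plain Python. The source does not describe MAI-Thinking-1's expert count, routing scheme, or layer design, so `moe_forward`, `router_weights`, and `top_k` are hypothetical names chosen for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weights, experts, top_k=2):
    """Route input x to the top_k highest-scoring experts and mix their outputs.

    x:              list of floats (one token's features)
    router_weights: one row of floats per expert (hypothetical linear router)
    experts:        list of callables mapping a vector to a same-length vector
    Returns (mixed_output, indices_of_chosen_experts).
    """
    # One router logit per expert: dot product of x with that expert's row.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]

    # Keep only the top_k experts; the remaining experts are never called.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]

    # Renormalize the gate weights over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])

    output = [0.0] * len(x)
    for gate, index in zip(gates, chosen):
        expert_out = experts[index](x)
        output = [acc + gate * val for acc, val in zip(output, expert_out)]
    return output, chosen
```

With eight experts and `top_k=2`, each input triggers two expert calls instead of eight, which is the sense in which capacity and compute are decoupled.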
In practice
- Use MAI-Thinking-1 for complex, high-volume reasoning tasks.
- Apply MAI-Image-2.5 for consistent branded character generation.
- Implement MAI-Transcribe-1.5 with entity biasing for specialized transcription.
Topics
- Microsoft Foundry
- MAI Models
- Large Language Models
- Multimodal AI
- Speech Recognition
- Image Generation
Best for: CTO, VP of Engineering/Data, NLP Engineer, AI Engineer, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.