New MAI models in Microsoft Foundry across text, image, voice, and speech

Source: Microsoft Foundry Blog articles · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Intermediate, medium

Summary

At Microsoft Build 2026, Microsoft announced that new MAI models are available in Microsoft Foundry, expanding its first-party AI stack across text, image, voice, and speech. MAI-Thinking-1, Microsoft's first large language model, uses a Mixture-of-Experts architecture to deliver strong reasoning and math capabilities, and it matches Claude Opus 4.6 on SWE-Bench Pro at a substantially lower cost. The MAI-Image-2.5 family, which includes MAI-Image-2.5 Flash, updates image generation with image-to-image editing and "control with preservation" features, and debuted at No. 3 on Arena.ai. For audio, MAI-Voice-2 is a multilingual text-to-speech model that supports 15+ languages along with voice cloning and prompting. MAI-Transcribe-1.5, a speech-to-text model, supports 43 languages, adds content biasing, and holds the No. 1 spot on the FLEURS benchmark with a 3.7% Word Error Rate; it is reported to be 5x more efficient than competitors such as Gemini 3.1 Flash. These models already power Microsoft products and are now accessible to developers.
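For readers unfamiliar with the Word Error Rate figure quoted above, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model's output, divided by the number of reference words. The function name and the sample sentences are illustrative assumptions, and the sketch does no text normalization; it is not the FLEURS evaluation harness or anything published by Microsoft.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Hypothetical helper for illustration; splits on whitespace only and
    applies no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one substitution ("sat" -> "sit") and one deletion ("the")
    # against a six-word reference.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.1%}")
```

On this scale, a 3.7% WER corresponds to roughly 3.7 word-level errors per 100 reference words, averaged over the benchmark.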

Key takeaway

For AI engineers and ML directors building enterprise-scale applications, the new MAI models in Foundry are worth evaluating. Consider MAI-Thinking-1 for cost-sensitive, complex reasoning workloads, given its Mixture-of-Experts architecture and lower cost. Consider MAI-Image-2.5 for creative workflows that need precise image-to-image editing and brand consistency. Use MAI-Voice-2 for multilingual voice experiences and MAI-Transcribe-1.5 for accurate speech-to-text in specialized domains, where its content biasing feature can help. Together, the models form an integrated first-party stack spanning text, image, voice, and speech.

Key insights

Microsoft's new MAI models in Foundry offer advanced, cost-efficient AI capabilities across text, image, and audio modalities.

Best for: CTO, VP of Engineering/Data, NLP Engineer, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.