New MAI models in Microsoft Foundry across text, image, voice, and speech

2026-06-02 · Source: Microsoft Foundry Blog articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

At Microsoft Build 2026, Microsoft announced new MAI models, now available in Microsoft Foundry, that expand its first-party AI stack across text, image, voice, and speech. MAI-Thinking-1, its first large language model, uses a Mixture-of-Experts (MoE) architecture to deliver strong reasoning and math capabilities, matching Claude Opus 4.6 on SWE-Bench Pro at a substantially lower cost. The MAI-Image-2.5 family, which includes MAI-Image-2.5 Flash, debuted at No. 3 on Arena.ai and offers updated image generation with image-to-image editing and "control with preservation" features. For audio, MAI-Voice-2 is a multilingual text-to-speech model that supports 15+ languages with voice cloning and prompting. MAI-Transcribe-1.5, a speech-to-text model, supports 43 languages, adds content biasing, and keeps its No. 1 spot on the FLEURS benchmark with a 3.7% Word Error Rate; it is also reported to be 5x more efficient than competitors such as Gemini 3.1 Flash. These models already power Microsoft products and are now accessible to developers.
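
To make the Word Error Rate figure concrete: WER is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below implements that standard definition with a word-level edit distance. It is an illustration only; the function name and sample sentences are hypothetical, and benchmark harnesses such as the one behind the FLEURS numbers typically apply text normalization that is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
```

On this definition, a 3.7% WER corresponds to roughly 37 word-level errors per 1,000 reference words.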

Key takeaway

For AI engineers and ML directors building enterprise-scale applications, the new MAI models in Foundry are worth evaluating. Consider MAI-Thinking-1 for cost-sensitive, complex reasoning workloads, given its MoE architecture and reported cost advantage. Consider MAI-Image-2.5 for creative workflows that need precise image-to-image editing and brand consistency. Use MAI-Voice-2 for multilingual voice experiences and MAI-Transcribe-1.5 for accurate, specialized speech-to-text, particularly with its entity biasing feature. Together, the models form an integrated first-party stack covering text, image, and audio.

Key insights

Microsoft's new MAI models in Foundry offer advanced, cost-efficient AI capabilities across text, image, and audio modalities.

Principles

MoE architectures scale model capacity without a proportional increase in per-token compute.
Control with preservation enhances creative workflows.
Multilingual identity preservation unifies voice experiences.
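
The first principle above can be illustrated with a toy top-k router: a Mixture-of-Experts layer holds many experts but evaluates only a few of them per token, so total parameters can grow while per-token compute stays roughly fixed. This is a generic sketch of sparse top-k routing, not a description of MAI-Thinking-1's actual architecture, which the source does not detail. The expert functions, router scores, and the choice of k=2 are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Renormalize over the chosen experts only.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the k selected experts; the others are never called."""
    routes = top_k_route(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]


# Eight toy experts (hypothetical): expert i multiplies its input by i + 1.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
# Hypothetical router scores for one token.
router_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

y, used = moe_forward(1.0, experts, router_logits, k=2)
print(used)  # [4, 1]
```

Here only two of the eight experts run for the token, so adding more experts increases capacity without increasing the work done per token.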

In practice

Use MAI-Thinking-1 for complex, high-volume reasoning tasks.
Use MAI-Image-2.5 for consistent branded character generation.
Use MAI-Transcribe-1.5 with entity biasing for specialized transcription.

Topics

Microsoft Foundry
MAI Models
Large Language Models
Multimodal AI
Speech Recognition
Image Generation

Best for: CTO, VP of Engineering/Data, NLP Engineer, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.