Introducing MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 in Microsoft Foundry
Summary
Microsoft Foundry has announced the public preview of three new Microsoft AI models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. MAI-Transcribe-1 is a speech recognition model offering enterprise-grade accuracy across 25 languages at approximately 50% lower GPU cost than alternatives, ranking first for overall word error rate (WER) on the FLEURS benchmark across 11 core languages. MAI-Voice-1 is a high-fidelity speech generation model capable of producing 60 seconds of expressive audio in under one second on a single GPU. MAI-Image-2 is a text-to-image model that debuted at #3 on the Arena.ai leaderboard, excelling in photorealistic generation, in-image text rendering, and complex layouts. These models currently power Microsoft products such as Copilot, Bing, and PowerPoint, and are now available to developers through Foundry and Azure Speech.
Key takeaway
For engineering leaders evaluating AI model integration, these new Microsoft AI models offer a compelling balance of performance and cost efficiency. MAI-Transcribe-1's lower GPU cost and MAI-Voice-1's rapid generation can significantly reduce operational expenses for voice-driven applications, while MAI-Image-2 provides advanced creative capabilities. Consider piloting these models in your next project to assess their impact on both performance benchmarks and infrastructure costs, especially for high-volume multimedia processing.
Key insights
Microsoft Foundry introduces three new AI models for efficient, high-quality speech recognition, speech generation, and text-to-image capabilities.
Principles
- Efficiency drives scalability and predictable enterprise pricing.
- First-party audio AI stacks enhance developer control.
- Collaboration improves creative workflow integration.
Method
Developers can access MAI-Transcribe-1 and MAI-Voice-1 via Azure Speech, with MAI-Transcribe-1 priced at $0.36 USD/hour and MAI-Voice-1 at $22 USD/1M characters. MAI-Image-2 is available via API, costing $5 USD/1M tokens for text input and $33 USD/1M tokens for image output.
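To make the pricing above easier to reason about, here is a minimal cost-estimation sketch in Python. It uses only the four list prices quoted in this section. The `MonthlyUsage` class, its field names, and the `estimate_monthly_cost` function are hypothetical helpers invented for illustration, not part of any Microsoft SDK. It also assumes that the transcription price applies per hour of audio processed and that usage is billed linearly with no minimums or tiers, which the source does not confirm.

```python
from dataclasses import dataclass

# Preview list prices quoted in this summary (USD). These may change.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_TEXT_INPUT_USD_PER_MILLION_TOKENS = 5.0
IMAGE_OUTPUT_USD_PER_MILLION_TOKENS = 33.0


@dataclass
class MonthlyUsage:
    """Hypothetical workload description; field names are illustrative."""
    audio_hours: float = 0.0        # hours of audio sent to MAI-Transcribe-1
    tts_characters: int = 0         # characters synthesized by MAI-Voice-1
    image_input_tokens: int = 0     # text prompt tokens sent to MAI-Image-2
    image_output_tokens: int = 0    # image output tokens from MAI-Image-2


def estimate_monthly_cost(usage: MonthlyUsage) -> dict:
    """Return a per-model cost breakdown in USD, plus a total.

    Assumes simple linear billing at the list prices above.
    """
    transcribe = usage.audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR
    voice = usage.tts_characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS
    image = (
        usage.image_input_tokens / 1_000_000 * IMAGE_TEXT_INPUT_USD_PER_MILLION_TOKENS
        + usage.image_output_tokens / 1_000_000 * IMAGE_OUTPUT_USD_PER_MILLION_TOKENS
    )
    costs = {
        "MAI-Transcribe-1": round(transcribe, 2),
        "MAI-Voice-1": round(voice, 2),
        "MAI-Image-2": round(image, 2),
    }
    costs["total"] = round(transcribe + voice + image, 2)
    return costs


if __name__ == "__main__":
    # Illustrative workload only; these volumes are made up.
    usage = MonthlyUsage(
        audio_hours=10_000,
        tts_characters=50_000_000,
        image_input_tokens=2_000_000,
        image_output_tokens=100_000_000,
    )
    for name, usd in estimate_monthly_cost(usage).items():
        print(f"{name}: ${usd:,.2f}")
```

For the made-up workload in the example, transcription comes to $3,600, speech generation to $1,100, and image generation to $3,310, for a total of $8,010 per month. Swapping in your own volumes gives a rough first-pass budget before a pilot; actual bills depend on the billing terms in effect at the time.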
In practice
- Use MAI-Transcribe-1 for real-time IVR transcription.
- Generate custom voices with MAI-Voice-1's Personal Voice feature.
- Integrate MAI-Image-2 for enterprise branding visuals.
Topics
- MAI-Transcribe-1
- MAI-Voice-1
- MAI-Image-2
- Microsoft Foundry
- Azure Speech
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published on the Microsoft Foundry Blog.