Introducing MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 in Microsoft Foundry

· Source: Microsoft Foundry Blog articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

Microsoft Foundry has announced the public preview of three new Microsoft AI models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. MAI-Transcribe-1 is a speech recognition model offering enterprise-grade accuracy across 25 languages at approximately 50% lower GPU cost than alternatives; it ranks first in overall word error rate (WER) on the FLEURS benchmark across 11 core languages. MAI-Voice-1 is a high-fidelity speech generation model that can produce 60 seconds of expressive audio in under one second on a single GPU. MAI-Image-2 is a text-to-image model that debuted at #3 on the Arena.ai leaderboard and excels at photorealistic generation, in-image text rendering, and complex layouts. These models already power Microsoft products such as Copilot, Bing, and PowerPoint, and are now available to developers through Foundry and Azure Speech.
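Because the transcription claim rests on WER, it helps to see how the metric is usually defined: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model's output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not Microsoft's evaluation code; the function name is hypothetical, and it skips the text normalization (casing, punctuation) that published benchmarks typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.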

Key takeaway

For engineering leaders evaluating AI model integration, these models aim to balance performance with cost efficiency. MAI-Transcribe-1's lower GPU cost and MAI-Voice-1's fast generation could reduce operating expenses for voice-driven applications, while MAI-Image-2 adds photorealistic image generation with in-image text rendering. Consider piloting them on a representative workload to measure their effect on both quality benchmarks and infrastructure costs, especially for high-volume multimedia processing.

Key insights

Microsoft Foundry introduces three new AI models for efficient, high-quality speech recognition, speech generation, and text-to-image generation.

Method

Developers can access MAI-Transcribe-1 and MAI-Voice-1 via Azure Speech, with MAI-Transcribe-1 priced at $0.36 USD/hour and MAI-Voice-1 at $22 USD/1M characters. MAI-Image-2 is available via API, costing $5 USD/1M tokens for text input and $33 USD/1M tokens for image output.
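To turn the list prices above into a budget figure, a rough estimator can multiply each rate by expected usage. The sketch below uses only the four prices quoted in this summary; the workload fields, the function name, and the sample volumes are hypothetical, and preview pricing may change, so treat the output as a back-of-the-envelope estimate.

```python
from dataclasses import dataclass

# Preview prices quoted in this summary (USD); subject to change.
TRANSCRIBE_USD_PER_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_TEXT_INPUT_USD_PER_MILLION_TOKENS = 5.0
IMAGE_OUTPUT_USD_PER_MILLION_TOKENS = 33.0


@dataclass
class MonthlyWorkload:
    """Hypothetical monthly usage; replace with your own measurements."""
    audio_hours_transcribed: float
    characters_synthesized: int
    image_text_input_tokens: int
    image_output_tokens: int


def estimate_monthly_cost(workload: MonthlyWorkload) -> dict:
    """Return a per-model cost breakdown in USD, plus the total."""
    transcribe = workload.audio_hours_transcribed * TRANSCRIBE_USD_PER_HOUR
    voice = (workload.characters_synthesized / 1_000_000
             * VOICE_USD_PER_MILLION_CHARS)
    image = (
        workload.image_text_input_tokens / 1_000_000
        * IMAGE_TEXT_INPUT_USD_PER_MILLION_TOKENS
        + workload.image_output_tokens / 1_000_000
        * IMAGE_OUTPUT_USD_PER_MILLION_TOKENS
    )
    costs = {
        "MAI-Transcribe-1": round(transcribe, 2),
        "MAI-Voice-1": round(voice, 2),
        "MAI-Image-2": round(image, 2),
    }
    costs["total"] = round(transcribe + voice + image, 2)
    return costs


if __name__ == "__main__":
    # Illustrative volumes only, not figures from the source article.
    sample = MonthlyWorkload(
        audio_hours_transcribed=1_000,
        characters_synthesized=5_000_000,
        image_text_input_tokens=2_000_000,
        image_output_tokens=10_000_000,
    )
    for name, usd in estimate_monthly_cost(sample).items():
        print(f"{name}: ${usd:,.2f}")
```

With these sample volumes, image output tokens and transcription hours account for most of the total, so those are the inputs worth measuring most carefully in a pilot.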

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.