Voxtral transcribes at the speed of sound

· Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, quick

Summary

Mistral has released Voxtral Transcribe 2, a family of two audio-to-text transcription models that follows the original Voxtral from July 2025. The first model, `Voxtral-Mini-4B-Realtime-2602`, is open-weights under an Apache-2.0 license, is available as an 8.87GB download from Hugging Face, and supports real-time transcription. The second, `voxtral-mini-latest`, is closed-weights and accessible only through the Mistral API at $0.003 per minute, or $0.18 per hour. The Mistral API console now includes a speech-to-text playground for testing `voxtral-mini-latest`; it produces diarized transcripts that can be downloaded as text, SRT, or JSON.
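
As a rough illustration of the pricing arithmetic above, the sketch below estimates API cost from audio duration. It assumes billing is strictly proportional to duration with no rounding or minimum charge, which the summary does not state; the function name is hypothetical.

```python
from decimal import Decimal

# Price quoted in the summary for `voxtral-mini-latest`.
PRICE_PER_MINUTE_USD = Decimal("0.003")


def transcription_cost_usd(duration_seconds: int) -> Decimal:
    """Estimate cost for a clip, assuming simple pro-rata billing (an assumption)."""
    minutes = Decimal(duration_seconds) / Decimal(60)
    return minutes * PRICE_PER_MINUTE_USD


if __name__ == "__main__":
    # One hour of audio: 60 minutes at $0.003 per minute.
    print(transcription_cost_usd(3600))
```

Using `Decimal` rather than floats keeps small per-minute prices exact when summed over many files.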

Key takeaway

Engineering teams evaluating speech-to-text solutions should consider Mistral's Voxtral Transcribe 2. Choose the open-weights `Voxtral-Mini-4B-Realtime-2602` if local deployment and cost control matter most, and the API-based `voxtral-mini-latest` if you want diarization and context biasing from a managed service. Use the API playground to check transcription quality on your own audio before committing.

Key insights

Mistral's Voxtral Transcribe 2 pairs an open-weights model for real-time transcription (`Voxtral-Mini-4B-Realtime-2602`) with an API-only model (`voxtral-mini-latest`) that returns diarized transcripts.

Method

The Mistral API exposes transcription through a POST request to `/v1/audio/transcriptions`, with options for diarization, context biasing, and timestamp granularity.
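
The request described above could be sketched as follows, using only the Python standard library. Only the path `/v1/audio/transcriptions` and the model name come from the summary. The host `https://api.mistral.ai`, bearer-token authentication, the form field names (`model`, `file`, `diarize`, `context_bias`, `timestamp_granularities`), the value `segment`, and the comma-joined format for bias terms are all assumptions to verify against Mistral's API reference. The helper names are hypothetical.

```python
import json
import os
import urllib.request
import uuid

# Host is an assumption; only the path appears in the summary.
API_URL = "https://api.mistral.ai/v1/audio/transcriptions"


def build_multipart(fields, file_name, file_bytes, boundary=None):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    parts.append(file_header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, api_key, context_terms=()):
    """POST an audio file for transcription; field names are assumed, not confirmed."""
    fields = {
        "model": "voxtral-mini-latest",
        "diarize": "true",                     # assumed field name
        "timestamp_granularities": "segment",  # assumed field name and value
    }
    if context_terms:
        fields["context_bias"] = ",".join(context_terms)  # assumed name and format
    with open(path, "rb") as audio:
        body, content_type = build_multipart(
            fields, os.path.basename(path), audio.read()
        )
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": content_type,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # "talk.m4a" is a placeholder file name.
    result = transcribe("talk.m4a", os.environ["MISTRAL_API_KEY"], ["Datasette"])
    print(json.dumps(result, indent=2))
```

The response is printed as raw JSON because its schema is not described in the summary; inspect it before relying on any particular key.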

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.