ElevenLabs and Google dominate Artificial Analysis' updated speech-to-text benchmark

· Source: The Decoder · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, quick

Summary

Artificial Analysis has released version 2.0 of its AA-WER speech-to-text benchmark, revealing ElevenLabs' Scribe v2 as the top performer with a word error rate (WER) of 2.3%. Google's Gemini 3 Pro followed closely at 2.9%, with Mistral's Voxtral Small achieving 3.0%. Google's Gemini 3 Flash (3.1%) and ElevenLabs' Scribe v1 (3.2%) also showed strong results. Notably, Google's strong performance stems from Gemini's general multimodal capabilities rather than specific transcription training. OpenAI's Whisper Large v3 recorded a 4.2% WER, while Alibaba's Qwen3 ASR Flash (5.9%), Amazon's Nova 2 Omni (6.0%), and Rev AI (6.1%) ranked lower. In the specialized AA-AgentTalk voice assistant test, Scribe v2 (1.6%) and Gemini 3 Pro (1.7%) again led, with AssemblyAI's Universal-3 Pro at 2.3%.

Key takeaway

For NLP engineers evaluating speech-to-text solutions, ElevenLabs' Scribe v2 and Google's Gemini 3 Pro lead the AA-WER v2.0 benchmark at 2.3% and 2.9% WER. Prioritize these models for applications requiring high transcription accuracy, especially voice assistant interactions, where they scored 1.6% and 1.7% on AA-AgentTalk against 2.3% for AssemblyAI's Universal-3 Pro. Consider Gemini 3 Pro if your project also benefits from broader multimodal capabilities.

Key insights

ElevenLabs' Scribe v2 and Google's Gemini 3 Pro lead the latest speech-to-text benchmarks.

Method

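The AA-WER v2.0 benchmark evaluates speech-to-text models using word error rate (WER): the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript, so lower is better. A separate AA-AgentTalk test assesses performance for voice assistant interactions.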
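
For readers who want to check WER figures on their own audio, here is a minimal Python sketch of the textbook WER calculation, using a word-level edit distance. The function name `word_error_rate` and the simple normalization (lowercasing and whitespace splitting) are assumptions made for illustration. The article does not describe how Artificial Analysis normalizes transcripts, so scores from this sketch are not directly comparable to the published numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: (substitutions + deletions + insertions) / reference words.

    Normalization here (lowercase + whitespace split) is an assumption,
    not the benchmark's documented procedure.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Two of four reference words are wrong, so this prints 50.0%.
    print(f"{word_error_rate('turn on the lights', 'turn off the light'):.1%}")
```

Because the denominator is the reference length, a hypothesis with many inserted words can push WER above 100%.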

Best for: NLP Engineer, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, AI Product Manager

Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.