TAI #198: Real-Time Speech AI Gets Serious: Google and OpenAI Race to Own the Voice Layer

· Source: Towards AI Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Intermediate, long

Summary

Real-time speech AI is advancing rapidly, with major releases from Google, OpenAI, and Cohere. Google's Gemini 3.1 Flash Live, launched March 26, is its highest-quality real-time audio model, supporting 70 languages, function calling, and multimodal input. It scores 90.8% on ComplexFuncBench Audio and 95.9% on BigBenchAudio with extended thinking, though extended thinking adds latency. OpenAI's GPT-Realtime-1.5, released February 23, excels in conversational dynamics, with a 0.82-second time-to-first-audio and improved accuracy when transcribing alphanumeric strings. Cohere Transcribe, an Apache 2.0-licensed ASR model, leads the Hugging Face Open ASR Leaderboard with a 5.42% word error rate (WER). Pricing for real-time audio has fallen sharply: Google's Gemini 3.1 Flash Live Preview costs approximately $0.023 per minute, roughly a quarter of OpenAI's $0.096 per minute. These advances are enabling features like Google Live Translate, which now offers real-time headphone translation in 70+ languages on iOS.
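To make the pricing gap concrete, here is a minimal sketch that projects spend from the two per-minute prices quoted above. The workload of 10,000 audio minutes per day and the function names are hypothetical, chosen only for illustration; real bills depend on token-level pricing and usage patterns the summary does not cover.

```python
# Per-minute prices in USD, as quoted in the summary above.
PRICE_PER_MINUTE = {
    "Gemini 3.1 Flash Live Preview": 0.023,
    "GPT-Realtime-1.5": 0.096,
}


def projected_cost(price_per_minute: float, minutes_per_day: float, days: int = 30) -> float:
    """Return the projected audio cost in USD over `days` days."""
    return price_per_minute * minutes_per_day * days


def price_ratio(cheaper: float, pricier: float) -> float:
    """Return how many times more expensive `pricier` is than `cheaper`."""
    return pricier / cheaper


if __name__ == "__main__":
    minutes_per_day = 10_000  # hypothetical workload, not from the article
    for name, price in PRICE_PER_MINUTE.items():
        cost = projected_cost(price, minutes_per_day)
        print(f"{name}: ${cost:,.2f} per 30 days")
    ratio = price_ratio(*PRICE_PER_MINUTE.values())
    print(f"Price ratio: {ratio:.1f}x")
```

At the quoted prices the ratio is a little over 4x, so the difference compounds quickly at production volumes.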

Key takeaway

For CTOs and VPs of Engineering evaluating real-time voice AI solutions, the rapid cost reduction and performance improvements mean these systems are now production-ready. Prioritize models based on your specific needs: Google's Gemini 3.1 Flash Live for complex reasoning and multimodal input, or OpenAI's GPT-Realtime-1.5 for snappier conversational dynamics. Also consider open-source options like Cohere Transcribe for high-accuracy ASR where data privacy or cost is paramount, and plan for human evaluation to address the remaining challenges in voice naturalness and domain-specific accuracy.
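When checking domain-specific accuracy on your own recordings, word error rate (the metric behind the 5.42% figure) can be computed as the word-level edit distance divided by the number of reference words. The sketch below is a plain implementation under simple assumptions: it lowercases and splits on whitespace only, whereas published leaderboards typically apply heavier text normalization, so its numbers are not directly comparable to leaderboard scores. The function name and sample sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "take two tablets daily"
    hypothesis = "take two tablet daily"
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

A metric like this covers transcription accuracy only; voice naturalness still needs the human evaluation mentioned above.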

Key insights

Real-time speech AI is maturing into deployable infrastructure with significant advancements in quality, speed, and cost.

Method

Google's Gemini 3.1 Flash Live communicates over WebSockets in full duplex: it supports barge-in (the user interrupting the model mid-response), transmits audio, video, and transcripts simultaneously, and is optimized for triggering external tools through function calling.
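Barge-in depends on the input and output paths running at the same time, so that incoming user speech can cut off the outgoing response. The following is a transport-free simulation of that idea using only Python's asyncio; it does not call the Gemini API or open a WebSocket, and every name in it (`speaker`, `microphone`, `run_turn`, `CHUNK_SECONDS`) is hypothetical.

```python
import asyncio

CHUNK_SECONDS = 0.01  # assumed playback time per audio chunk


async def speaker(chunks, played, user_speaking: asyncio.Event) -> None:
    """Output path: play response chunks until the user starts speaking."""
    for chunk in chunks:
        if user_speaking.is_set():
            return  # barge-in: drop the rest of the response
        played.append(chunk)
        await asyncio.sleep(CHUNK_SECONDS)


async def microphone(speech_at: float, user_speaking: asyncio.Event) -> None:
    """Input path: signal that user speech was detected after `speech_at` seconds."""
    await asyncio.sleep(speech_at)
    user_speaking.set()


async def run_turn(chunks, speech_at=None):
    """Run both paths concurrently; return (chunks played, was interrupted)."""
    user_speaking = asyncio.Event()
    played = []
    tasks = [speaker(chunks, played, user_speaking)]
    if speech_at is not None:
        tasks.append(microphone(speech_at, user_speaking))
    await asyncio.gather(*tasks)
    return played, len(played) < len(chunks)


if __name__ == "__main__":
    response = [f"chunk-{i}" for i in range(50)]
    played, interrupted = asyncio.run(run_turn(response, speech_at=0.05))
    print(f"played {len(played)} of {len(response)} chunks, interrupted={interrupted}")
```

In a real deployment the two coroutines would read from and write to the same bidirectional connection, and the interruption signal would come from the service or from local voice activity detection rather than a timer.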

Best for: CTO, VP of Engineering/Data, NLP Engineer, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI Newsletter.