TAI #198: Real-Time Speech AI Gets Serious: Google and OpenAI Race to Own the Voice Layer

2024-09-10 · Source: Towards AI Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Intermediate, long

Summary

The real-time speech AI landscape is rapidly advancing, with major releases from Google, OpenAI, and Cohere. Google's Gemini 3.1 Flash Live, launched March 26, is its highest-quality real-time audio model, supporting 70 languages, function calling, and multimodal input. It achieves 90.8% on ComplexFuncBench Audio and 95.9% on BigBenchAudio with extended thinking, though this adds latency. OpenAI's GPT-Realtime-1.5, released February 23, excels in conversational dynamics with a 0.82-second time-to-first-audio and improved alphanumeric transcription accuracy. Cohere Transcribe, an Apache 2.0-licensed ASR model, leads the Hugging Face Open ASR Leaderboard with a 5.42% WER. Pricing for real-time audio has fallen sharply, with Google's Gemini 3.1 Flash Live Preview costing approximately $0.023 per minute, significantly cheaper than OpenAI's $0.096 per minute. These advancements are enabling features like Google Live Translate, which now offers real-time headphone translation in 70+ languages on iOS.
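
To make the pricing gap concrete, the sketch below turns the per-minute figures quoted above into a monthly estimate. The 100,000-minute monthly volume and the helper name `monthly_cost` are illustrative assumptions, not figures or tooling from the article.

```python
# Per-minute prices in USD, as quoted in the summary above.
PRICE_PER_MINUTE = {
    "Gemini 3.1 Flash Live Preview": 0.023,
    "GPT-Realtime-1.5": 0.096,
}

# Hypothetical workload: 100,000 minutes of conversation per month.
ASSUMED_MINUTES_PER_MONTH = 100_000


def monthly_cost(price_per_minute: float, minutes_per_month: int) -> float:
    """Return the estimated monthly audio cost in USD."""
    return price_per_minute * minutes_per_month


costs = {
    name: monthly_cost(price, ASSUMED_MINUTES_PER_MONTH)
    for name, price in PRICE_PER_MINUTE.items()
}
ratio = PRICE_PER_MINUTE["GPT-Realtime-1.5"] / PRICE_PER_MINUTE["Gemini 3.1 Flash Live Preview"]

for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f} per month")
print(f"Price ratio: {ratio:.1f}x")
```

At the quoted rates the gap is roughly fourfold at any volume, so the absolute saving scales linearly with minutes used.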

Key takeaway

For CTOs and VPs of Engineering evaluating real-time voice AI, the rapid cost reductions and performance gains mean these systems are now production-ready. Prioritize models by need: Google's Gemini 3.1 Flash Live for complex reasoning and multimodal input, or OpenAI's GPT-Realtime-1.5 for snappier conversational dynamics. Consider open-source options such as Cohere Transcribe for high-accuracy ASR where data privacy or cost is paramount, and plan for human evaluation to address the remaining challenges in voice naturalness and domain-specific accuracy.

Key insights

Real-time speech AI is maturing into deployable infrastructure with significant advancements in quality, speed, and cost.

Principles

Reasoning vs. latency is a critical engineering trade-off.
Multilingual voice interaction is becoming a default capability.
Open-source ASR models can outperform proprietary solutions.
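
Leaderboard comparisons like the 5.42% WER cited in the summary rest on word error rate: word-level substitutions, deletions, and insertions divided by the number of reference words. The following is a minimal sketch of that calculation, assuming lowercasing and whitespace tokenization only. Public leaderboards typically apply heavier text normalization, which is omitted here, and the sentence pair is made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative pair (not from any benchmark): one substituted word out of six.
wer = word_error_rate(
    "call the pharmacy at nine tomorrow",
    "call the pharmacy at five tomorrow",
)
print(f"WER: {wer:.2%}")
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.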

Method

Google's Gemini 3.1 Flash Live uses WebSockets for full-duplex communication, supporting barge-in and simultaneous audio/video/transcript transmission, optimized for external tool triggering.
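
The full-duplex pattern described above (audio flowing in both directions at once, with the user able to interrupt) can be illustrated without any vendor SDK. The sketch below is a local simulation only: an `asyncio.Queue` stands in for the WebSocket uplink, strings stand in for audio chunks, and every function name is hypothetical. It does not use or reflect the actual Gemini Live API.

```python
import asyncio


async def play_response(chunks: list[str], played: list[str]) -> None:
    """Stand-in for streaming model audio to the speaker, one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # pretend each chunk takes 50 ms to play


async def listen_for_user(mic: asyncio.Queue, playback: asyncio.Task) -> str:
    """Stand-in for the uplink: cancel playback as soon as the user speaks."""
    utterance = await mic.get()
    playback.cancel()  # barge-in: stop talking over the user
    return utterance


async def main() -> tuple[list[str], str]:
    mic: asyncio.Queue = asyncio.Queue()
    played: list[str] = []
    response = [f"chunk-{n}" for n in range(1, 6)]

    # Downlink and uplink run concurrently, which is the full-duplex part.
    playback = asyncio.create_task(play_response(response, played))
    listener = asyncio.create_task(listen_for_user(mic, playback))

    await asyncio.sleep(0.12)  # the simulated user interrupts mid-response
    await mic.put("actually, cancel that")

    utterance = await listener
    try:
        await playback
    except asyncio.CancelledError:
        pass  # expected: playback was interrupted by the barge-in
    return played, utterance


played, utterance = asyncio.run(main())
print(f"Played {len(played)} of 5 chunks before the user said: {utterance!r}")
```

In a real deployment the cancellation would also need to flush any audio already buffered on the client, which this simulation leaves out.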

In practice

Evaluate real-time models for specific latency-reasoning needs.
Consider open-source ASR for on-premise or cost-sensitive applications.
Utilize WebRTC, WebSocket, or SIP for diverse voice application stacks.
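
One way to compare models on the latency side of the trade-off is to time how long a streamed response takes to deliver its first audio chunk. The sketch below assumes a hypothetical `fake_model_stream` generator in place of a real model connection. The delays are made-up values chosen to show the effect of extra reasoning time, not measurements of any product.

```python
import asyncio
import time
from typing import AsyncGenerator


async def fake_model_stream(thinking_delay: float) -> AsyncGenerator[bytes, None]:
    """Hypothetical stand-in for a streaming voice model response."""
    await asyncio.sleep(thinking_delay)  # simulated reasoning before speaking
    for _ in range(3):
        yield b"\x00" * 320  # placeholder audio chunk
        await asyncio.sleep(0.01)


async def time_to_first_audio(stream: AsyncGenerator[bytes, None]) -> float:
    """Seconds from the start of the request until the first chunk arrives."""
    start = time.perf_counter()
    try:
        async for _chunk in stream:
            return time.perf_counter() - start
        raise RuntimeError("stream ended without producing audio")
    finally:
        await stream.aclose()  # release the generator once timing is done


async def compare() -> dict[str, float]:
    return {
        "fast": await time_to_first_audio(fake_model_stream(0.05)),
        "extended thinking": await time_to_first_audio(fake_model_stream(0.20)),
    }


results = asyncio.run(compare())
for label, seconds in results.items():
    print(f"{label}: {seconds:.3f} s to first audio")
```

Running the same harness against each candidate model, with prompts drawn from the target domain, gives a like-for-like latency figure to weigh against reasoning quality.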

Topics

Real-Time Speech AI
Gemini 3.1 Flash Live
GPT-Realtime-1.5
Automatic Speech Recognition
AI Agent Architectures

Best for: CTO, VP of Engineering/Data, NLP Engineer, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI Newsletter.