TAI #198: Real-Time Speech AI Gets Serious: Google and OpenAI Race to Own the Voice Layer
Summary
Real-time speech AI is advancing quickly, with major releases from Google, OpenAI, and Cohere. Google's Gemini 3.1 Flash Live, launched March 26, is its highest-quality real-time audio model, supporting 70 languages, function calling, and multimodal input. It scores 90.8% on ComplexFuncBench Audio and 95.9% on BigBenchAudio with extended thinking, though extended thinking adds latency. OpenAI's GPT-Realtime-1.5, released February 23, is strongest on conversational dynamics, with a 0.82-second time-to-first-audio and improved accuracy when transcribing alphanumeric strings. Cohere Transcribe, an Apache 2.0-licensed ASR model, leads the Hugging Face Open ASR Leaderboard with a 5.42% word error rate (WER). Pricing for real-time audio has fallen sharply: Google's Gemini 3.1 Flash Live Preview costs approximately $0.023 per minute, roughly a quarter of OpenAI's $0.096 per minute. These advances are enabling features like Google Live Translate, which now offers real-time headphone translation in 70+ languages on iOS.
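To make the pricing gap concrete, here is a minimal sketch that turns the two per-minute prices quoted in the summary into a monthly bill. The dictionary keys are hypothetical labels rather than official model identifiers, and the 100,000-minute workload is an assumed figure chosen only for illustration.

```python
# Approximate per-minute prices in USD, as quoted in the summary.
# The keys are hypothetical labels, not official model identifiers.
PRICE_PER_MINUTE = {
    "gemini-flash-live": 0.023,
    "gpt-realtime": 0.096,
}


def monthly_cost(price_per_minute: float, minutes_per_month: float) -> float:
    """Return the audio cost in USD for a month at a flat per-minute rate."""
    return price_per_minute * minutes_per_month


def cost_ratio(expensive: str, cheap: str) -> float:
    """Return how many times more the first model costs per minute."""
    return PRICE_PER_MINUTE[expensive] / PRICE_PER_MINUTE[cheap]


if __name__ == "__main__":
    minutes = 100_000  # assumed monthly call volume, for illustration only
    for name, price in PRICE_PER_MINUTE.items():
        print(f"{name}: ${monthly_cost(price, minutes):,.2f} per month")
    print(f"ratio: {cost_ratio('gpt-realtime', 'gemini-flash-live'):.1f}x")
```

At the assumed volume this works out to about $2,300 versus $9,600 per month. Real bills depend on token-level pricing and on how much of each call is audio in versus audio out, so treat the flat per-minute rate as a rough planning number.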
Key takeaway
For CTOs and VPs of Engineering evaluating real-time voice AI, the rapid drop in cost and gains in performance mean these systems are now production-ready. Prioritize models based on your specific needs: Google's Gemini 3.1 Flash Live for complex reasoning and multimodal input, or OpenAI's GPT-Realtime-1.5 for snappier conversational dynamics. Consider open-source options like Cohere Transcribe for high-accuracy ASR where data privacy or cost is paramount, and plan for human evaluation to address the remaining challenges in voice naturalness and domain-specific accuracy.
Key insights
Real-time speech AI is maturing into deployable infrastructure with significant advancements in quality, speed, and cost.
Principles
- Reasoning vs. latency is a critical engineering trade-off.
- Multilingual voice interaction is becoming a default capability.
- Open-source ASR models can outperform proprietary solutions.
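ASR comparisons such as the 5.42% WER cited in the summary rest on word error rate: the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of reference words. The function below is a minimal sketch of that metric using a standard edit-distance recurrence. It assumes whitespace tokenization and applies no text normalization, whereas leaderboards typically normalize casing and punctuation before scoring, so its numbers are not directly comparable to published results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

Running the same function over your own domain recordings is one way to check whether a leaderboard ranking holds for your vocabulary.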
Method
Google's Gemini 3.1 Flash Live communicates over WebSockets for full-duplex streaming. It supports barge-in (the user interrupting the model mid-response), can transmit audio, video, and transcripts simultaneously, and is optimized for triggering external tools.
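Barge-in is easiest to see as two concurrent activities on a full-duplex connection: the model's audio is being played back while the client keeps listening for the user. The sketch below simulates that pattern with `asyncio` alone. It does not call the Gemini API or open a real WebSocket; `speak`, `run_turn`, and the `user_speech` event are hypothetical names, and the sleep stands in for sending one audio chunk.

```python
import asyncio


async def speak(chunks, played, delay=0.02):
    """Play chunks one at a time; this task is cancelled on barge-in."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # stands in for playing one audio chunk
        played.append(chunk)


async def run_turn(chunks, user_speech: asyncio.Event, played):
    """Play a model response, stopping early if the user starts talking."""
    playback = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(user_speech.wait())
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    if barge_in in done and not playback.done():
        playback.cancel()  # stop talking as soon as the user interrupts
        try:
            await playback
        except asyncio.CancelledError:
            pass
        return "interrupted"
    barge_in.cancel()  # response finished without interruption
    return "completed"


async def demo(interrupt_after=None):
    """Run one simulated turn, optionally interrupting after a delay."""
    played = []
    user_speech = asyncio.Event()
    if interrupt_after is not None:
        loop = asyncio.get_running_loop()
        loop.call_later(interrupt_after, user_speech.set)
    chunks = [f"chunk-{i}" for i in range(10)]
    outcome = await run_turn(chunks, user_speech, played)
    return outcome, played


if __name__ == "__main__":
    print(asyncio.run(demo(interrupt_after=0.05)))
    print(asyncio.run(demo()))
```

In a real client the event would be set by voice-activity detection or by an interruption message from the server, and cancelling playback would also flush any audio already buffered on the device.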
In practice
- Evaluate real-time models against your specific latency and reasoning requirements.
- Consider open-source ASR for on-premises or cost-sensitive applications.
- Use WebRTC, WebSocket, or SIP transports, depending on your voice application stack.
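One way to ground the latency side of that evaluation is to measure time-to-first-audio yourself instead of relying on published figures. The helper below is a minimal sketch: it times how long an async stream takes to yield its first non-empty chunk. `fake_model_stream` is a hypothetical stand-in for a real model connection, with an artificial delay that plays the role of extended thinking.

```python
import asyncio
import time
from typing import AsyncIterator, Optional


async def time_to_first_chunk(stream: AsyncIterator[bytes]) -> Optional[float]:
    """Return seconds until the first non-empty chunk, or None if none arrives."""
    start = time.perf_counter()
    async for chunk in stream:
        if chunk:
            return time.perf_counter() - start
    return None


async def fake_model_stream(thinking_delay: float) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a model: waits, then yields two audio chunks."""
    await asyncio.sleep(thinking_delay)
    yield b"\x00\x00"
    yield b"\x00\x00"


if __name__ == "__main__":
    fast = asyncio.run(time_to_first_chunk(fake_model_stream(0.01)))
    slow = asyncio.run(time_to_first_chunk(fake_model_stream(0.10)))
    print(f"fast: {fast:.3f}s, slow: {slow:.3f}s")
```

The timer here starts when the caller begins consuming the stream. For a number comparable to a vendor's time-to-first-audio, start it when the user stops speaking and run enough trials to report a median and a tail percentile.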
Topics
- Real-Time Speech AI
- Gemini 3.1 Flash Live
- GPT-Realtime-1.5
- Automatic Speech Recognition
- AI Agent Architectures
Code references
- obra/superpowers
- A-EVO-Lab/a-evolve
- agent-infra/sandbox
- NVIDIA-NeMo/ProRL-Agent-Server
- Tencent/Covo-Audio
Best for: CTO, VP of Engineering/Data, NLP Engineer, AI Engineer, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI Newsletter.