TAI #198: Real-Time Speech AI Gets Serious: Google and OpenAI Race to Own the Voice Layer
Summary
Real-time speech AI is advancing quickly, with notable releases from Google, OpenAI, and Cohere. Google's Gemini 3.1 Flash Live, launched March 26, is its highest-quality real-time audio model, supporting 70 languages, function calling, and multimodal input. It leads on complex function calling (90.8% on ComplexFuncBench Audio) and reasoning (95.9% on BigBenchAudio with high thinking), though extended thinking adds latency. OpenAI's GPT-Realtime-1.5, released February 23, excels in conversational dynamics (95.7% Conversational Dynamics score) and offers faster time-to-first-audio (0.82 seconds). It also improved alphanumeric transcription accuracy by 10.23% and supports WebRTC, WebSocket, and SIP. Cohere Transcribe, an Apache 2.0-licensed ASR model, achieved a 5.42% Word Error Rate on the Hugging Face Open ASR Leaderboard while processing audio at 525x real-time. Pricing for real-time audio has fallen sharply: Google's Gemini 3.1 Flash Live Preview costs approximately $0.023 per minute, about 4.2x cheaper than OpenAI's $0.096 per minute for two-way audio.
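The price gap quoted in the summary can be checked with a few lines of arithmetic. The sketch below uses only the two per-minute figures stated above; the monthly volume of 100,000 minutes and the helper names are hypothetical, chosen purely for illustration.

```python
# Per-minute prices as quoted in the summary (USD, two-way audio).
GEMINI_FLASH_LIVE_PER_MIN = 0.023
GPT_REALTIME_PER_MIN = 0.096


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per minute."""
    return expensive / cheap


def monthly_cost(price_per_min: float, minutes: float) -> float:
    """Total spend for a given number of audio minutes, rounded to cents."""
    return round(price_per_min * minutes, 2)


ratio = cost_ratio(GPT_REALTIME_PER_MIN, GEMINI_FLASH_LIVE_PER_MIN)
print(f"Price ratio: {ratio:.1f}x")

# Hypothetical workload: 100,000 minutes of two-way audio per month.
ASSUMED_MINUTES = 100_000
print("Gemini 3.1 Flash Live:", monthly_cost(GEMINI_FLASH_LIVE_PER_MIN, ASSUMED_MINUTES))
print("GPT-Realtime-1.5:", monthly_cost(GPT_REALTIME_PER_MIN, ASSUMED_MINUTES))
```

At that assumed volume the difference is several thousand dollars per month, which is why the per-minute rate matters more than it first appears for always-on voice agents.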
Key takeaway
For AI architects designing voice-first agents or multilingual interaction systems, the rapid cost reductions and performance gains in real-time speech AI justify re-evaluating existing solutions. Favor Google's Gemini 3.1 Flash Live for complex reasoning and OpenAI's GPT-Realtime-1.5 for conversational fluidity, weighing each against your latency requirements. Consider open-source options such as Cohere Transcribe for high-accuracy, cost-effective ASR, especially in regulated industries, and plan to integrate these voice capabilities into your product roadmap within the next 12-18 months.
Key insights
Real-time speech AI is maturing into deployable infrastructure with significant advancements in quality, speed, and cost efficiency.
Principles
- Reasoning often trades off with latency in real-time AI.
- Multilingual voice interaction is becoming a default capability.
- Open-source ASR models can outperform proprietary solutions.
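ASR comparisons like the one cited in the summary rest on Word Error Rate: the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the reference length. A minimal sketch of that calculation follows. It assumes simple whitespace tokenization and no text normalization, so it will not reproduce leaderboard figures exactly; the function name and sample sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2%}")
```

Running the same function over your own domain recordings is a reasonable way to check whether a leaderboard result carries over to your audio.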
In practice
- Evaluate real-time AI models based on specific latency needs.
- Consider open-source ASR for sensitive enterprise data.
- Split RAG evaluation into retrieval and generation layers.
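To evaluate a model against a latency budget, one option is to time how long a streaming response takes to deliver its first audio chunk. The sketch below is vendor-neutral and makes no assumptions about any real client library: `fake_audio_stream` is a hypothetical stand-in that simulates a delayed stream, and in practice you would pass an iterator produced by your own client code.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds elapsed until the first chunk, the first chunk).

    The stream must be lazy, so that the request work happens when the
    first item is pulled; otherwise the measurement misses that time.
    """
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first


def fake_audio_stream(delay_s: float, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a real-time audio response stream."""
    time.sleep(delay_s)  # simulated model and network latency
    for _ in range(chunks):
        yield b"\x00" * 320  # placeholder audio frame


elapsed, chunk = time_to_first_chunk(fake_audio_stream(0.05))
LATENCY_BUDGET_S = 1.0  # assumed budget for illustration
print(f"first chunk after {elapsed:.3f}s, within budget: {elapsed <= LATENCY_BUDGET_S}")
```

Measuring over many requests and looking at percentiles rather than a single run gives a more reliable picture than one sample.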
Topics
- Real-Time Speech AI
- Gemini 3.1 Flash Live
- GPT-Realtime-1.5
- Cohere Transcribe
- Voice AI Benchmarks
Code references
- obra/superpowers
- A-EVO-Lab/a-evolve
- agent-infra/sandbox
- NVIDIA-NeMo/ProRL-Agent-Server
- Tencent/Covo-Audio
Best for: CTO, VP of Engineering/Data, AI Architect, AI Engineer, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.