TAI #198: Real-Time Speech AI Gets Serious: Google and OpenAI Race to Own the Voice Layer

2026-03-31 · Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Intermediate, long

Summary

The real-time speech AI landscape is advancing rapidly, with Google, OpenAI, and Cohere all releasing significant models.

Google's Gemini 3.1 Flash Live, launched March 26, is the company's highest-quality real-time audio model, supporting 70 languages, function calling, and multimodal input. It leads on complex function calling (90.8% on ComplexFuncBench Audio) and on reasoning (95.9% on BigBenchAudio with high thinking), though extended thinking adds latency.

OpenAI's GPT-Realtime-1.5, released February 23, excels in conversational dynamics (a 95.7% Conversational Dynamics score) and offers a faster time-to-first-audio (0.82 seconds). It also improved alphanumeric transcription accuracy by 10.23% and supports WebRTC, WebSocket, and SIP.

Cohere Transcribe, an Apache 2.0-licensed ASR model, achieved a 5.42% Word Error Rate on the Hugging Face Open ASR Leaderboard while processing audio at 525x real-time.

Pricing for real-time audio has fallen sharply: Google's Gemini 3.1 Flash Live Preview costs approximately $0.023 per minute, about 4.2x cheaper than OpenAI's $0.096 per minute for two-way audio.
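The pricing comparison above is simple arithmetic, and a short sketch makes it easy to rerun against your own call volume. The two per-minute prices are the figures quoted in the summary; the function names and the 100,000-minute monthly volume are illustrative assumptions, not anything published by either vendor.

```python
# Per-minute prices quoted in the summary (USD, two-way audio).
GEMINI_FLASH_LIVE_PER_MIN = 0.023
GPT_REALTIME_PER_MIN = 0.096


def price_ratio(expensive_per_min: float, cheap_per_min: float) -> float:
    """How many times more the expensive option costs per minute."""
    return expensive_per_min / cheap_per_min


def monthly_cost(per_minute: float, minutes_per_month: int) -> float:
    """Projected monthly spend for a given volume of audio minutes."""
    return per_minute * minutes_per_month


ratio = price_ratio(GPT_REALTIME_PER_MIN, GEMINI_FLASH_LIVE_PER_MIN)
print(f"OpenAI / Google price ratio: {ratio:.1f}x")  # prints 4.2x

# Hypothetical workload: 100,000 audio minutes per month (an assumption).
minutes = 100_000
for name, price in [("Gemini 3.1 Flash Live", GEMINI_FLASH_LIVE_PER_MIN),
                    ("GPT-Realtime-1.5", GPT_REALTIME_PER_MIN)]:
    print(f"{name}: ${monthly_cost(price, minutes):,.2f} per month")
```

Per-minute figures like these depend on how much audio flows in each direction, so treat the projection as a first-order estimate rather than a quote.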

Key takeaway

For AI Architects designing voice-first agents or multilingual interaction systems, the rapid cost reductions and performance gains in real-time speech AI are reason to re-evaluate existing solutions. Prioritize Google's Gemini 3.1 Flash Live for complex reasoning or OpenAI's GPT-Realtime-1.5 for conversational fluidity, weighing each against your latency requirements. Explore open-source options like Cohere Transcribe for high-accuracy, cost-effective ASR, especially in regulated industries, and plan to integrate advanced voice capabilities into your product roadmap within the next 12-18 months.

Key insights

Real-time speech AI is maturing into deployable infrastructure with significant advancements in quality, speed, and cost efficiency.

Principles

Reasoning often trades off with latency in real-time AI.
Multilingual voice interaction is becoming a default capability.
Open-source ASR models can outperform proprietary solutions.
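Comparisons between ASR models, including the 5.42% figure cited for Cohere Transcribe, rest on Word Error Rate: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The following is a minimal sketch of that standard definition. The function name is illustrative, and the sketch skips the text normalization (casing, punctuation) that benchmarks typically apply before scoring, so its numbers are not directly comparable to leaderboard results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.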

In practice

Evaluate real-time AI models based on specific latency needs.
Consider open-source ASR for sensitive enterprise data.
Split RAG evaluation into retrieval and generation layers.
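One way to act on the latency advice above is to measure time-to-first-audio over many calls and compare a tail percentile, rather than the average, against your budget. The sketch below assumes you have already collected latency samples in seconds. The sample values, the 1.0-second budget, and the function names are hypothetical; only the 0.82-second figure for GPT-Realtime-1.5 comes from the summary.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_budget(samples: list[float], budget_s: float, pct: float = 95) -> bool:
    """True if the chosen percentile of latency is within the budget."""
    return percentile(samples, pct) <= budget_s


# Hypothetical time-to-first-audio measurements in seconds (not real data).
ttfa_samples = [0.78, 0.82, 0.85, 0.80, 1.40, 0.79, 0.81, 0.83, 0.84, 0.82]

print("median:", percentile(ttfa_samples, 50))
print("p95:", percentile(ttfa_samples, 95))
print("within 1.0 s budget at p95:", meets_budget(ttfa_samples, budget_s=1.0))
```

In this made-up data the median looks comfortable, but a single slow call pushes the 95th percentile over budget, which is the kind of gap a headline latency number can hide.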

Topics

Real-Time Speech AI
Gemini 3.1 Flash Live
GPT-Realtime-1.5
Cohere Transcribe
Voice AI Benchmarks

Best for: CTO, VP of Engineering/Data, AI Architect, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.