[AINews] GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs
Summary
OpenAI has released three new streaming audio models in its Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. GPT-Realtime-2 is positioned as OpenAI's "most intelligent voice model yet," offering "GPT-5-class reasoning" for real-time voice agents with enhanced capabilities like handling interruptions, using tools, and sustaining longer conversations. It features a 128K context window, up from 32K, and adjustable reasoning levels (minimal, low, medium, high, xhigh). GPT-Realtime-Translate supports live speech translation from over 70 input languages into 13 output languages, while GPT-Realtime-Whisper provides low-latency streaming transcription. Benchmarks show significant improvements, with GPT-Realtime-2 achieving 96.6% on Big Bench Audio speech-to-speech reasoning and a 70.8% APR instruction retention on Scale AI's Audio MultiChallenge S2S, up from 36.7% for GPT-Realtime-1.5. Pricing remains $1.15/hour for audio input and $4.61/hour for audio output.
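As a rough illustration of what the quoted hourly rates mean per call, the sketch below converts audio durations into an estimated cost. It assumes billing scales linearly with audio duration and ignores any text-token or tool-related charges; the function name `estimate_session_cost` is hypothetical and not part of any OpenAI SDK.

```python
# Rates quoted in the summary above (USD per hour of audio).
AUDIO_INPUT_USD_PER_HOUR = 1.15
AUDIO_OUTPUT_USD_PER_HOUR = 4.61


def estimate_session_cost(input_seconds: float, output_seconds: float) -> float:
    """Estimate the audio cost in USD for one session.

    Assumption: cost is linear in audio duration. Real billing may differ.
    """
    if input_seconds < 0 or output_seconds < 0:
        raise ValueError("durations must be non-negative")
    input_cost = AUDIO_INPUT_USD_PER_HOUR * input_seconds / 3600
    output_cost = AUDIO_OUTPUT_USD_PER_HOUR * output_seconds / 3600
    return round(input_cost + output_cost, 4)


if __name__ == "__main__":
    # A 10-minute call: 6 minutes of user audio, 4 minutes of agent audio.
    print(estimate_session_cost(input_seconds=360, output_seconds=240))
```

Under these assumptions, a 10-minute call with 6 minutes of user speech and 4 minutes of agent speech comes to roughly $0.42, most of it from audio output.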
Key takeaway
For AI architects and product managers designing conversational AI, OpenAI's new Realtime API models call for a shift from simple prompt-response systems to stateful, real-time architectures. To make full use of GPT-Realtime-2's stronger reasoning and larger context window, invest in robust harness logic for latency, interruption handling, tool-call UX, and conversational memory.
Key insights
New OpenAI models enable full-duplex, tool-using, long-context, reasoning voice agents for real-time applications.
Principles
- Voice agents require stateful real-time system design.
- Agent quality depends on harness design, not just model selection.
Method
Developers can tune reasoning effort, manage preambles, define tool behavior, handle unclear audio, capture entities, and maintain state for long conversational sessions.
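One way to keep these knobs in a single place is a small settings object owned by the harness. The sketch below is a minimal illustration: the class `VoiceAgentSettings` and all of its field names are hypothetical and do not mirror the Realtime API's actual session schema. Only the five reasoning levels come from the summary above.

```python
from dataclasses import dataclass, field

# Reasoning levels listed in the summary above.
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")


@dataclass
class VoiceAgentSettings:
    """Hypothetical harness-side settings; not the Realtime API schema."""

    reasoning_effort: str = "low"
    preamble: str = "One moment while I check that."
    unclear_audio_reply: str = "Sorry, I didn't catch that. Could you repeat it?"
    tools: list[str] = field(default_factory=list)
    captured_entities: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject effort values outside the documented levels early.
        if self.reasoning_effort not in REASONING_LEVELS:
            raise ValueError(
                f"reasoning_effort must be one of {REASONING_LEVELS}, "
                f"got {self.reasoning_effort!r}"
            )

    def capture(self, name: str, value: str) -> None:
        """Store an entity (e.g. an order number) for later turns."""
        self.captured_entities[name] = value


if __name__ == "__main__":
    settings = VoiceAgentSettings(reasoning_effort="high", tools=["lookup_order"])
    settings.capture("order_id", "A-1042")
    print(settings)
```

Validating the effort level at construction time keeps a typo from silently reaching the model call, which matters more in a long-lived voice session than in a one-shot request.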
In practice
- Implement preambles for smoother agent responses.
- Keep tool calls audibly transparent: have the agent say what it is doing while a tool runs.
- Utilize 128K context for longer, more complex interactions.
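The practices above can be sketched in harness code. This is a minimal illustration under stated assumptions: `ConversationMemory` and `run_tool_with_preamble` are hypothetical names, token counts are supplied by the caller rather than computed, and `speak` stands in for whatever function sends audio to the user. Only the 128K figure comes from the summary.

```python
from collections import deque
from typing import Callable

# Context window size quoted in the summary above.
CONTEXT_BUDGET_TOKENS = 128_000


class ConversationMemory:
    """Keeps recent turns within a token budget by dropping the oldest."""

    def __init__(self, budget: int = CONTEXT_BUDGET_TOKENS) -> None:
        self.budget = budget
        self.turns: deque[tuple[str, int]] = deque()
        self.used = 0

    def add(self, text: str, tokens: int) -> None:
        self.turns.append((text, tokens))
        self.used += tokens
        # Evict oldest turns until we fit, always keeping the newest one.
        while self.used > self.budget and len(self.turns) > 1:
            _, dropped = self.turns.popleft()
            self.used -= dropped


def run_tool_with_preamble(
    speak: Callable[[str], None],
    tool: Callable[[], str],
    preamble: str = "Let me look that up.",
) -> str:
    """Say a short preamble, run the tool, then report the result aloud."""
    speak(preamble)  # fills the silence while the tool runs
    result = tool()
    speak(f"Here is what I found: {result}")
    return result


if __name__ == "__main__":
    memory = ConversationMemory(budget=10)
    memory.add("first turn", 6)
    memory.add("second turn", 6)  # exceeds the budget, so the first turn is dropped
    print(list(memory.turns))
    run_tool_with_preamble(print, lambda: "order shipped")
```

Dropping the oldest turns is the simplest eviction policy; a production harness might summarize old turns or pin captured entities instead.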
Topics
- OpenAI Realtime API
- GPT-Realtime-2
- Live Speech Translation
- Streaming Transcription
- Real-time Voice Agents
Best for: CTO, AI Architect, AI Product Manager, AI Engineer, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space (www.latent.space).