GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs
Summary
OpenAI has launched three new real-time voice APIs: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. GPT-Realtime-2 is positioned as OpenAI's most intelligent voice model, offering "GPT-5-class reasoning" for real-time voice agents, with tool use, interruption handling, and support for longer conversations. Its context window grows from 32K to 128K, while audio pricing stays unchanged at $1.15/hour for input and $4.61/hour for output. On benchmarks, GPT-Realtime-2 scores 96.6% on Big Bench Audio and 70.8% APR on Scale AI's Audio MultiChallenge S2S for instruction retention. GPT-Realtime-Translate provides live speech translation from 70+ input languages into 13 output languages, and GPT-Realtime-Whisper provides low-latency streaming transcription. All three models are available in the Realtime API; the corresponding ChatGPT voice upgrades are still pending.
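To make the quoted audio prices concrete, here is a rough cost estimate for a single call. It is a minimal sketch that assumes cost scales linearly with seconds of audio at the two hourly rates above; the function name is hypothetical, and actual billing may be metered differently (for example per token, or with separate charges for text).

```python
# Hourly audio rates quoted in the summary above (USD).
INPUT_USD_PER_HOUR = 1.15
OUTPUT_USD_PER_HOUR = 4.61


def estimate_audio_cost(input_seconds: float, output_seconds: float) -> float:
    """Estimate the audio cost of one session in USD.

    Assumption: cost is linear in audio duration at the hourly rates above.
    """
    if input_seconds < 0 or output_seconds < 0:
        raise ValueError("durations must be non-negative")
    input_cost = (input_seconds / 3600) * INPUT_USD_PER_HOUR
    output_cost = (output_seconds / 3600) * OUTPUT_USD_PER_HOUR
    return input_cost + output_cost


if __name__ == "__main__":
    # A 10-minute call: the caller speaks for 6 minutes, the agent for 4.
    cost = estimate_audio_cost(input_seconds=6 * 60, output_seconds=4 * 60)
    print(f"Estimated audio cost: ${cost:.2f}")
```

Under these assumptions, output audio dominates the bill, so agents that speak concisely cost noticeably less per call.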
Key takeaway
For CTOs and VPs of Engineering evaluating real-time voice agent solutions, OpenAI's new GPT-Realtime-2, -Translate, and -Whisper models are a substantial step up in reasoning, context length, and live translation and transcription. Your teams should explore integrating these APIs to build more intelligent, responsive, and context-aware voice applications, particularly for customer support, live translation, and hands-free workflows. Be prepared to design stateful real-time systems to take full advantage of features like the 128K context window and advanced interruption handling.
Key insights
OpenAI's new Realtime API models pair stronger reasoning and a 128K context window with real-time features such as interruption handling, tool use, live translation, and streaming transcription.
Principles
- Voice agents require stateful, real-time system design.
- Longer context windows improve conversational AI performance.
- Tool transparency enhances user experience in AI interactions.
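The first two principles can be illustrated with a small client-side state object that tracks conversation turns against a token budget. This is only a sketch: `Turn`, `ConversationState`, and the per-turn token counts are hypothetical names and inputs, the 128,000 figure is taken from the summary, and whether or how the service truncates context on its own is not covered here.

```python
from collections import deque
from dataclasses import dataclass, field

# Context window size quoted in the summary (tokens).
CONTEXT_WINDOW_TOKENS = 128_000


@dataclass
class Turn:
    role: str    # e.g. "user" or "assistant"
    text: str    # transcript of the turn
    tokens: int  # token count, supplied by the caller (assumption)


@dataclass
class ConversationState:
    budget: int = CONTEXT_WINDOW_TOKENS
    turns: deque = field(default_factory=deque)
    used: int = 0

    def add(self, turn: Turn) -> list:
        """Append a turn, then evict the oldest turns until within budget.

        Returns the evicted turns so the caller can summarize or log them.
        The newest turn is never evicted.
        """
        self.turns.append(turn)
        self.used += turn.tokens
        evicted = []
        while self.used > self.budget and len(self.turns) > 1:
            oldest = self.turns.popleft()
            self.used -= oldest.tokens
            evicted.append(oldest)
        return evicted
```

A larger budget simply means eviction starts later, which is one way to see why a 128K window helps long conversations retain earlier instructions.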
Method
OpenAI's voice models integrate adjustable reasoning effort, preambles, parallel tool calls, and robust recovery behaviors to manage complex, real-time voice interactions effectively.
In practice
- Implement preambles for smoother agent responses.
- Utilize adjustable reasoning levels for cost/latency optimization.
- Design for audible tool transparency in voice agents.
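The preamble and tool-transparency practices above can be sketched at the application level: say what is about to happen, then run independent tool calls concurrently. This is an illustrative stand-in and not the Realtime API itself. The tool functions, `speak` callback, and `run_tools_with_preamble` are all hypothetical, and in the actual models the preamble may be produced by the model itself.

```python
import asyncio


# Hypothetical tools; real ones would call a backend service.
async def lookup_order(order_id: str) -> dict:
    await asyncio.sleep(0.01)  # simulate network latency
    return {"order_id": order_id, "status": "shipped"}


async def check_inventory(sku: str) -> dict:
    await asyncio.sleep(0.01)
    return {"sku": sku, "in_stock": True}


async def run_tools_with_preamble(speak, calls):
    """Announce the pending work, then run all tool calls in parallel.

    speak: async callable taking the text to say to the user.
    calls: list of (label, async tool function, keyword arguments).
    Returns a dict mapping each label to its result (or raised exception).
    """
    labels = ", ".join(label for label, _, _ in calls)
    await speak(f"One moment while I check {labels}.")
    results = await asyncio.gather(
        *(fn(**kwargs) for _, fn, kwargs in calls),
        return_exceptions=True,  # one failing tool should not sink the rest
    )
    return {label: result for (label, _, _), result in zip(calls, results)}


def demo():
    spoken = []

    async def speak(text: str) -> None:
        spoken.append(text)  # a real agent would synthesize audio here

    calls = [
        ("your order", lookup_order, {"order_id": "A1"}),
        ("stock", check_inventory, {"sku": "X9"}),
    ]
    results = asyncio.run(run_tools_with_preamble(speak, calls))
    return spoken, results
```

Collecting exceptions instead of raising them lets the agent tell the user which lookup failed while still reporting the ones that succeeded.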
Topics
- Real-time Voice AI
- GPT-Realtime-2
- Speech-to-Speech Translation
- LLM Quantization
- Model Interpretability
Best for: CTO, VP of Engineering/Data, NLP Engineer, AI Engineer, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by AINews.