Advancing voice intelligence with new models in the API
Summary
OpenAI has released three new real-time voice models through its API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. GPT-Realtime-2 brings GPT-5-class reasoning to natural, complex conversations. Its features include preambles, parallel tool calls, improved recovery, a 128K context window, stronger domain understanding, and adjustable reasoning effort (minimal, low, medium, high, xhigh). It shows a 15.2% increase in audio intelligence over GPT-Realtime-1.5 on Big Bench Audio. GPT-Realtime-Translate provides live speech translation from more than 70 input languages into 13 output languages, and GPT-Realtime-Whisper offers low-latency streaming speech-to-text. Together, the models target voice-to-action, systems-to-voice, and voice-to-voice applications. Pricing is $32 per 1M input tokens and $64 per 1M output tokens for GPT-Realtime-2, $0.034 per minute for GPT-Realtime-Translate, and $0.017 per minute for GPT-Realtime-Whisper.
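To make the pricing concrete, here is a minimal cost-estimation sketch in Python. It uses only the rates quoted in the summary. The function names are hypothetical, and the sketch assumes you already know your token counts and audio minutes. It ignores any other rate tiers (for example cached or text-only tokens), since the summary does not list them.

```python
from decimal import Decimal

# Rates quoted in the summary, in USD.
REALTIME2_INPUT_PER_M = Decimal("32")    # per 1M input tokens
REALTIME2_OUTPUT_PER_M = Decimal("64")   # per 1M output tokens
TRANSLATE_PER_MIN = Decimal("0.034")     # per minute of audio
WHISPER_PER_MIN = Decimal("0.017")       # per minute of audio


def realtime2_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated GPT-Realtime-2 cost for the given token counts (hypothetical helper)."""
    million = Decimal(1_000_000)
    total = input_tokens * REALTIME2_INPUT_PER_M + output_tokens * REALTIME2_OUTPUT_PER_M
    return total / million


def per_minute_cost(minutes: float, rate: Decimal) -> Decimal:
    """Estimated cost for a per-minute model such as Translate or Whisper."""
    # Convert through str so float inputs like 1.5 do not carry binary rounding noise.
    return Decimal(str(minutes)) * rate
```

For example, `realtime2_cost(1_000_000, 500_000)` combines $32 of input with $32 of output, and `per_minute_cost(60, TRANSLATE_PER_MIN)` gives the cost of one hour of live translation.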
Key takeaway
For machine learning engineers building conversational AI, OpenAI's new Realtime API models are a significant advance in voice intelligence. Explore GPT-Realtime-2 for agents that need complex reasoning and tool use, GPT-Realtime-Translate for live multilingual applications, and GPT-Realtime-Whisper for low-latency transcription. Use GPT-Realtime-2's adjustable reasoning levels to balance performance and cost for each use case, and integrate the provided safety guardrails.
Key insights
OpenAI's new real-time voice models enable intelligent, natural, and actionable voice interfaces for diverse applications.
Principles
- Voice agents require reasoning and context management.
- Real-time processing enhances natural conversation flow.
- Adjustable reasoning levels let you trade latency against reasoning depth.
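The trade-off in the last principle can be sketched as a small selection helper. The five level names come from the summary. Everything else here is an assumption for illustration: the `pick_effort` function, the latency thresholds, and the idea of capping effort by a latency budget are not taken from OpenAI's documentation.

```python
from enum import IntEnum


class Effort(IntEnum):
    """Reasoning effort levels named in the summary, ordered from least to most."""
    MINIMAL = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    XHIGH = 4


def pick_effort(latency_budget_ms: int, multi_step: bool) -> Effort:
    """Pick an effort level from a latency budget (thresholds are hypothetical)."""
    # A tighter latency budget lowers the highest effort level we allow.
    if latency_budget_ms < 500:
        ceiling = Effort.LOW
    elif latency_budget_ms < 1500:
        ceiling = Effort.MEDIUM
    else:
        ceiling = Effort.XHIGH
    # Multi-step tasks ask for more reasoning; simple turns ask for the least.
    wanted = Effort.HIGH if multi_step else Effort.MINIMAL
    return min(wanted, ceiling)
```

Calling `pick_effort(300, multi_step=True).name.lower()` yields a level string you could pass along as configuration. How the effort level is actually set on a request is not described in the summary, so check the API reference for that.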
Method
Together, the three models cover reasoning, translation, and transcription. For dynamic voice interactions, GPT-Realtime-2 supports parallel tool calls, a 128K context window, and adjustable reasoning effort.
In practice
- Use GPT-Realtime-2 for complex conversational agents.
- Implement GPT-Realtime-Translate for live multilingual support.
- Leverage GPT-Realtime-Whisper for low-latency transcription.
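The three recommendations above amount to a simple routing rule, sketched below. The task labels and the `choose_model` function are hypothetical. The model names are written as they appear in this summary, and the identifiers the API actually expects may differ.

```python
# Map a hypothetical task label to the model the summary recommends for it.
MODEL_FOR_TASK = {
    "agent": "GPT-Realtime-2",              # complex conversational agents
    "translate": "GPT-Realtime-Translate",  # live multilingual support
    "transcribe": "GPT-Realtime-Whisper",   # low-latency transcription
}


def choose_model(task: str) -> str:
    """Return the recommended model name for a task label."""
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        expected = ", ".join(sorted(MODEL_FOR_TASK))
        raise ValueError(f"unknown task {task!r}; expected one of: {expected}") from None
```

Keeping the mapping in one table makes it easy to swap in the real model identifiers once you have confirmed them.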
Topics
- Realtime Voice Models
- GPT-Realtime-2
- GPT-Realtime-Translate
- GPT-Realtime-Whisper
- Voice AI API
Best for: Machine Learning Engineer, CTO, VP of Engineering/Data, AI Engineer, NLP Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.