GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs

2026-05-07 · Source: AINews · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cybersecurity & Data Privacy · Depth: Advanced, extended

Summary

OpenAI has launched three new real-time voice APIs: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. GPT-Realtime-2 is positioned as OpenAI's most intelligent voice model, offering "GPT-5-class reasoning" for real-time voice agents, with capabilities like tool use, interruption handling, and longer conversations. Its context window expands to 128K from 32K, while audio pricing stays at $1.15/hour input and $4.61/hour output. On benchmarks, GPT-Realtime-2 scores 96.6% on Big Bench Audio and 70.8% APR on Scale AI's Audio MultiChallenge S2S for instruction retention. GPT-Realtime-Translate supports live speech translation from 70+ input languages into 13 output languages, while GPT-Realtime-Whisper provides low-latency streaming transcription. All three models are available in the Realtime API, with ChatGPT voice upgrades pending.
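
The per-hour audio rates quoted above lend themselves to a quick back-of-the-envelope estimate. The sketch below is illustrative arithmetic only: it assumes cost scales linearly with audio duration and ignores any text-token or tool charges, which the summary does not cover. The function and constant names are made up for this example.

```python
INPUT_USD_PER_HOUR = 1.15   # audio input rate quoted in the summary
OUTPUT_USD_PER_HOUR = 4.61  # audio output rate quoted in the summary


def estimate_audio_cost(input_minutes: float, output_minutes: float) -> float:
    """Estimate audio cost in USD, assuming linear per-hour billing (an assumption)."""
    if input_minutes < 0 or output_minutes < 0:
        raise ValueError("durations must be non-negative")
    cost = (input_minutes / 60) * INPUT_USD_PER_HOUR
    cost += (output_minutes / 60) * OUTPUT_USD_PER_HOUR
    return round(cost, 4)
```

Under these assumptions, a call in which the user speaks for 6 minutes and the agent for 4 minutes comes to roughly $0.42 of audio.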

Key takeaway

For CTOs and VPs of Engineering evaluating real-time voice agent solutions, OpenAI's new GPT-Realtime-2, -Translate, and -Whisper models represent a significant leap in capability. Your teams should explore integrating these APIs to build more intelligent, responsive, and context-aware voice applications, particularly for customer support, live translation, and hands-free workflows. Be prepared to design stateful real-time systems to fully capitalize on features like the 128K context window and advanced interruption handling.

Key insights

OpenAI's new Realtime API models advance voice AI on three fronts: GPT-5-class reasoning, a 128K context window (up from 32K), and real-time features such as interruption handling and tool use.

Principles

Voice agents require stateful, real-time system design.
Longer context windows improve conversational AI performance.
Tool transparency enhances user experience in AI interactions.
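
One way to picture the "stateful, real-time" principle is a small turn-taking state machine that records what the user actually heard when they interrupt. This is a minimal sketch under stated assumptions, not OpenAI's design: the `VoiceSession` class, its phases, and the character-based truncation are all hypothetical and stand in for whatever audio-level bookkeeping a real system would do.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


@dataclass
class VoiceSession:
    """Hypothetical per-call state: current phase plus the heard conversation."""
    phase: Phase = Phase.LISTENING
    history: list = field(default_factory=list)  # (speaker, text) tuples
    pending_reply: str = ""

    def user_finished(self, transcript: str) -> None:
        if self.phase is not Phase.LISTENING:
            raise RuntimeError("user turn ended while not listening")
        self.history.append(("user", transcript))
        self.phase = Phase.THINKING

    def start_reply(self, text: str) -> None:
        if self.phase is not Phase.THINKING:
            raise RuntimeError("reply started before the user turn ended")
        self.pending_reply = text
        self.phase = Phase.SPEAKING

    def finish_reply(self) -> None:
        if self.phase is not Phase.SPEAKING:
            raise RuntimeError("no reply in progress")
        self.history.append(("agent", self.pending_reply))
        self.pending_reply = ""
        self.phase = Phase.LISTENING

    def barge_in(self, chars_heard: int) -> None:
        """User interrupts: keep only the part of the reply that was played."""
        if self.phase is not Phase.SPEAKING:
            return  # nothing to interrupt
        heard = self.pending_reply[:chars_heard]
        self.history.append(("agent", heard + " [interrupted]"))
        self.pending_reply = ""
        self.phase = Phase.LISTENING
```

Storing only the heard portion keeps later turns consistent with the user's experience: the agent does not assume the user knows something it never finished saying.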

Method

OpenAI's voice models integrate adjustable reasoning effort, preambles, parallel tool calls, and robust recovery behaviors to manage complex, real-time voice interactions effectively.
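
To make the interplay of preambles, parallel tool calls, and recovery concrete, the sketch below speaks a short preamble, runs two tools concurrently with `asyncio.gather`, and turns a tool failure into an audible explanation instead of silence. It is an application-side illustration under assumptions: the tool functions, the `say` callback, and the spoken wording are hypothetical, and nothing here calls the Realtime API.

```python
import asyncio


async def lookup_order(order_id: str) -> str:
    """Hypothetical tool that succeeds after a short delay."""
    await asyncio.sleep(0.01)
    return f"order {order_id}: shipped"


async def lookup_weather(city: str) -> str:
    """Hypothetical tool that always fails, to exercise the recovery path."""
    await asyncio.sleep(0.01)
    raise TimeoutError("weather service timed out")


async def answer_with_tools(say) -> list:
    # Preamble: tell the user something is happening before the tools return.
    say("One moment while I check that for you.")

    # Parallel tool calls: both run concurrently; failures come back as values.
    results = await asyncio.gather(
        lookup_order("A17"),
        lookup_weather("Oslo"),
        return_exceptions=True,
    )

    spoken = []
    for result in results:
        if isinstance(result, Exception):
            # Recovery: say what went wrong rather than going silent.
            spoken.append("I couldn't reach one of my tools, so part of this answer is missing.")
        else:
            spoken.append(result)

    for line in spoken:
        say(line)
    return spoken
```

Calling `asyncio.run(answer_with_tools(print))` prints the preamble first, then one line per tool in the order the calls were listed.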

In practice

Implement preambles for smoother agent responses.
Utilize adjustable reasoning levels for cost/latency optimization.
Design for audible tool transparency in voice agents.
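
The cost/latency trade-off behind adjustable reasoning can be expressed as a simple policy: pick the highest effort level that fits the turn's latency budget. The level names and latency figures below are placeholders invented for this sketch; the summary does not list the actual levels or their timings, so measure your own before relying on any such table.

```python
# Hypothetical effort levels with placeholder added-latency estimates in ms.
EFFORT_LATENCY_MS = {"low": 200, "medium": 600, "high": 1500}


def pick_effort(latency_budget_ms: int) -> str:
    """Return the highest-latency effort level that still fits the budget.

    Falls back to the cheapest level when nothing fits.
    """
    fitting = [
        name for name, ms in EFFORT_LATENCY_MS.items() if ms <= latency_budget_ms
    ]
    if not fitting:
        return min(EFFORT_LATENCY_MS, key=EFFORT_LATENCY_MS.get)
    return max(fitting, key=EFFORT_LATENCY_MS.get)
```

A quick conversational turn with a tight budget would get the lowest level, while a turn that follows a spoken preamble can afford a larger budget and therefore more reasoning.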

Topics

Real-time Voice AI
GPT-Realtime-2
Speech-to-Speech Translation
LLM Quantization
Model Interpretability

Code references

ggml-org/llama.cpp

Best for: CTO, VP of Engineering/Data, NLP Engineer, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by AINews.