[AINews] GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs

2026-05-08 · Source: Latent.Space - Www.latent.space · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Advanced, medium

Summary

OpenAI has released three new streaming audio models in its Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. OpenAI positions GPT-Realtime-2 as its "most intelligent voice model yet," offering "GPT-5-class reasoning" for real-time voice agents that can handle interruptions, use tools, and sustain longer conversations. It has a 128K context window, up from 32K, and adjustable reasoning levels (minimal, low, medium, high, xhigh). GPT-Realtime-Translate supports live speech translation from over 70 input languages into 13 output languages, while GPT-Realtime-Whisper provides low-latency streaming transcription. On benchmarks, GPT-Realtime-2 reaches 96.6% on Big Bench Audio speech-to-speech reasoning and a 70.8% APR instruction retention on Scale AI's Audio MultiChallenge S2S, up from 36.7% for GPT-Realtime-1.5. Pricing remains $1.15/hour for audio input and $4.61/hour for audio output.
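As a rough illustration of the pricing quoted above, the sketch below estimates the audio cost of a single session. It assumes cost scales linearly with audio duration at the two hourly rates and ignores anything else that might be billed (text tokens, tool calls, caching); the function name and parameters are hypothetical, not part of any OpenAI SDK.

```python
# Hourly audio rates as quoted in the summary (USD per hour of audio).
AUDIO_INPUT_USD_PER_HOUR = 1.15
AUDIO_OUTPUT_USD_PER_HOUR = 4.61


def estimate_audio_cost(input_minutes: float, output_minutes: float) -> float:
    """Estimate session cost in USD, assuming linear per-hour billing.

    This is an assumption for illustration only; real invoices may
    include other line items.
    """
    if input_minutes < 0 or output_minutes < 0:
        raise ValueError("durations must be non-negative")
    input_cost = (input_minutes / 60) * AUDIO_INPUT_USD_PER_HOUR
    output_cost = (output_minutes / 60) * AUDIO_OUTPUT_USD_PER_HOUR
    return round(input_cost + output_cost, 4)


if __name__ == "__main__":
    # A call with 6 minutes of user speech and 4 minutes of agent speech.
    print(estimate_audio_cost(input_minutes=6, output_minutes=4))
```

Under these assumptions, agent speech costs roughly four times as much per minute as user speech, so keeping responses short is the cheaper lever.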

Key takeaway

For AI architects and product managers designing conversational AI, OpenAI's new Realtime API models call for a shift from simple prompt-response systems to stateful, real-time architectures. To benefit from GPT-Realtime-2's stronger reasoning and larger context, invest in the harness logic around the model: latency, interruption handling, tool-call UX, and conversational memory. That is what makes a voice agent feel more natural to talk to.

Key insights

New OpenAI models enable full-duplex, tool-using, long-context, reasoning voice agents for real-time applications.

Principles

Voice agents require stateful real-time system design.
Agent quality depends on harness design, not just model selection.

Method

Developers can tune reasoning effort, manage preambles, define tool behavior, handle unclear audio, capture entities, and maintain state for long conversational sessions.
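The harness-side knobs listed above could be gathered into one small configuration object, as in the sketch below. Only the five reasoning level names come from the summary; the class, its fields, and the `spoken_filler` helper are hypothetical and do not reflect the Realtime API's actual session schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Reasoning levels as listed in the summary.
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")


@dataclass
class HarnessConfig:
    """Hypothetical harness settings; not an OpenAI API object."""

    reasoning_effort: str = "low"
    preamble: str = "One moment while I check that."
    clarify_prompt: str = "Sorry, I didn't catch that. Could you say it again?"
    entities: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject levels the summary does not list.
        if self.reasoning_effort not in REASONING_LEVELS:
            raise ValueError(f"unknown reasoning effort: {self.reasoning_effort!r}")

    def capture_entity(self, name: str, value: str) -> None:
        # Keep captured details (names, order numbers) as session state.
        self.entities[name] = value.strip()


def spoken_filler(config: HarnessConfig, heard_clearly: bool,
                  needs_tool: bool) -> Optional[str]:
    """Pick what the agent says before its main answer, if anything."""
    if not heard_clearly:
        return config.clarify_prompt  # unclear audio: ask the user to repeat
    if needs_tool:
        return config.preamble  # cover the tool-call latency audibly
    return None  # answer directly
```

For example, `spoken_filler(HarnessConfig(reasoning_effort="high"), heard_clearly=True, needs_tool=True)` returns the preamble string, which the harness would speak while the tool runs.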

In practice

Implement preambles for smoother agent responses.
Have the agent say aloud what it is doing during tool calls.
Use the 128K context for longer, more complex interactions.
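Even with a 128K window, a long session eventually needs a policy for what to keep. The sketch below shows one simple option, dropping the oldest turns once a token budget is exceeded. It assumes the caller supplies a token count per turn and treats 128K as 128,000 tokens; the class and its parameters are hypothetical, and a real harness might summarize old turns instead of discarding them.

```python
from collections import deque
from typing import Deque, Tuple

# Assumption: "128K" is treated as 128,000 tokens for this sketch.
CONTEXT_WINDOW_TOKENS = 128_000


class ConversationMemory:
    """Sliding-window memory that evicts the oldest turns first."""

    def __init__(self, max_tokens: int = CONTEXT_WINDOW_TOKENS,
                 reserve_tokens: int = 8_000) -> None:
        if not 0 <= reserve_tokens < max_tokens:
            raise ValueError("reserve_tokens must be in [0, max_tokens)")
        # Leave headroom for the model's next response.
        self.budget = max_tokens - reserve_tokens
        self.turns: Deque[Tuple[str, str, int]] = deque()
        self.used = 0

    def add_turn(self, role: str, text: str, tokens: int) -> int:
        """Store a turn and return how many old turns were evicted."""
        if tokens < 0 or tokens > self.budget:
            raise ValueError("turn does not fit in the token budget")
        self.turns.append((role, text, tokens))
        self.used += tokens
        dropped = 0
        while self.used > self.budget:
            _, _, old_tokens = self.turns.popleft()
            self.used -= old_tokens
            dropped += 1
        return dropped
```

Oldest-first eviction is the simplest policy; it can lose early instructions, so pinning a system prompt or captured entities outside the window is a common refinement.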

Topics

OpenAI Realtime API
GPT-Realtime-2
Live Speech Translation
Streaming Transcription
Real-time Voice Agents

Best for: CTO, AI Architect, AI Product Manager, AI Engineer, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space - Www.latent.space.