How OpenAI delivers low-latency voice AI at scale

2026-04-27 · Source: OpenAI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, long

Summary

OpenAI has rearchitected its WebRTC stack to deliver low-latency voice AI at scale for more than 900 million weekly active users, addressing challenges such as global reach, fast connection setup, and stable media round-trip time. The new "relay plus transceiver" architecture preserves standard WebRTC client behavior while optimizing internal packet routing. The design tackles constraints such as these: terminating media on one port per session fits poorly with Kubernetes, stateful ICE and DTLS sessions need a stable owner, and global routing needs a low-latency first hop. Instead of an SFU model, OpenAI adopted a transceiver model in which an edge service terminates client connections and converts media into simpler internal protocols for backend inference, which supports continuous audio streaming and conversational AI experiences.

Key takeaway

For AI engineers building real-time voice applications, a split relay-transceiver WebRTC architecture can improve scalability and reduce latency, especially when deploying on Kubernetes. It allows a small, fixed public UDP footprint and deterministic first-packet routing, both of which matter for conversational fluidity and global reach. Consider encoding routing metadata in protocol-native fields such as the ICE ufrag to simplify routing logic.

Key insights

OpenAI's WebRTC rearchitecture uses a split relay-transceiver model for scalable, low-latency voice AI.

Principles

Preserve protocol semantics at the edge.
Keep hard session state in one place.
Route on information already present in setup.

Method

The architecture splits packet routing (relay) from protocol termination (transceiver). The relay uses the ICE ufrag for first-packet routing to the owning transceiver, which handles all WebRTC session state.
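As a rough illustration of first-packet routing, the Go sketch below pulls the server-side ufrag out of a STUN Binding request and looks up the owning transceiver. In ICE connectivity checks the USERNAME attribute has the form `receiverUfrag:senderUfrag`, so the part before the colon is the ufrag the server chose in its SDP answer. Everything else here is an assumption for illustration, not OpenAI's implementation: the 4-character routing prefix, the names `localUfrag`, `routeFor`, and `buildBindingRequest`, and the sample address are all hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"strings"
)

const (
	stunHeaderLen   = 20         // fixed STUN header size
	stunMagicCookie = 0x2112A442 // fixed value in bytes 4..8 of every STUN message
	attrUsername    = 0x0006     // STUN USERNAME attribute type
	routePrefixLen  = 4          // hypothetical: first 4 ufrag chars name the transceiver
)

var errNotRoutable = errors.New("packet carries no usable ICE username")

// localUfrag returns the server-side ufrag from a STUN Binding request.
func localUfrag(pkt []byte) (string, error) {
	if len(pkt) < stunHeaderLen || binary.BigEndian.Uint32(pkt[4:8]) != stunMagicCookie {
		return "", errNotRoutable
	}
	msgLen := int(binary.BigEndian.Uint16(pkt[2:4]))
	if stunHeaderLen+msgLen > len(pkt) {
		return "", errNotRoutable
	}
	attrs := pkt[stunHeaderLen : stunHeaderLen+msgLen]
	for len(attrs) >= 4 {
		typ := binary.BigEndian.Uint16(attrs[0:2])
		n := int(binary.BigEndian.Uint16(attrs[2:4]))
		if 4+n > len(attrs) {
			return "", errNotRoutable
		}
		if typ == attrUsername {
			local, _, ok := strings.Cut(string(attrs[4:4+n]), ":")
			if !ok || local == "" {
				return "", errNotRoutable
			}
			return local, nil
		}
		next := 4 + ((n + 3) &^ 3) // attribute values are padded to 4 bytes
		if next > len(attrs) {
			return "", errNotRoutable
		}
		attrs = attrs[next:]
	}
	return "", errNotRoutable
}

// routeFor maps a ufrag to a transceiver address using the assumed prefix layout.
func routeFor(ufrag string, table map[string]string) (string, bool) {
	if len(ufrag) < routePrefixLen {
		return "", false
	}
	addr, ok := table[ufrag[:routePrefixLen]]
	return addr, ok
}

// buildBindingRequest is a demo helper that builds a minimal Binding request
// carrying only a USERNAME attribute (transaction ID left as zeros).
func buildBindingRequest(username string) []byte {
	padded := (len(username) + 3) &^ 3
	pkt := make([]byte, stunHeaderLen+4+padded)
	binary.BigEndian.PutUint16(pkt[0:2], 0x0001) // Binding request
	binary.BigEndian.PutUint16(pkt[2:4], uint16(4+padded))
	binary.BigEndian.PutUint32(pkt[4:8], stunMagicCookie)
	binary.BigEndian.PutUint16(pkt[20:22], attrUsername)
	binary.BigEndian.PutUint16(pkt[22:24], uint16(len(username)))
	copy(pkt[24:], username)
	return pkt
}

func main() {
	table := map[string]string{"tx07": "10.0.3.7:40000"} // hypothetical transceiver
	pkt := buildBindingRequest("tx07Qm3Zk9pL:clientUfrag")
	ufrag, err := localUfrag(pkt)
	if err != nil {
		fmt.Println("drop:", err)
		return
	}
	addr, ok := routeFor(ufrag, table)
	fmt.Println(ufrag, addr, ok)
}
```

Because the routing key travels in the first packet, a relay built this way needs no shared session store to pick a destination. The prefix uses only letters and digits because ICE restricts ufrag characters to alphanumerics, `+`, and `/`.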

In practice

Use `SO_REUSEPORT` for multiple workers on one UDP port.
Pin UDP-reading goroutines to OS threads for cache locality.
Employ pre-allocated buffers to minimize GC overhead.
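The last two tips can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the article's code: `bufferPool`, `readLoop`, `packet`, the 1500-byte buffer size, and the pool size of 64 are all hypothetical. `runtime.LockOSThread` ties the goroutine to one OS thread; pinning that thread to a CPU core, and opening the socket with `SO_REUSEPORT`, are platform-specific steps not shown here.

```go
package main

import (
	"fmt"
	"net"
	"runtime"
)

const maxDatagram = 1500 // assumed upper bound on datagram size

// bufferPool hands out fixed-size buffers carved from one slab allocated up
// front, so the steady-state read path allocates nothing for the GC to track.
type bufferPool struct{ free chan []byte }

func newBufferPool(n, size int) *bufferPool {
	p := &bufferPool{free: make(chan []byte, n)}
	slab := make([]byte, n*size)
	for i := 0; i < n; i++ {
		// The 3-index slice caps capacity so buffers cannot grow into neighbors.
		p.free <- slab[i*size : (i+1)*size : (i+1)*size]
	}
	return p
}

func (p *bufferPool) get() []byte  { return <-p.free } // blocks if the pool is empty
func (p *bufferPool) put(b []byte) { p.free <- b[:cap(b)] }

type packet struct {
	buf  []byte
	n    int
	from net.Addr
}

// readLoop reads datagrams on a goroutine locked to a single OS thread and
// passes them on; the consumer must return each buffer with pool.put.
func readLoop(conn net.PacketConn, pool *bufferPool, out chan<- packet) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for {
		buf := pool.get()
		n, from, err := conn.ReadFrom(buf)
		if err != nil { // for example, the socket was closed
			pool.put(buf)
			close(out)
			return
		}
		out <- packet{buf: buf, n: n, from: from}
	}
}

func main() {
	conn, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	pool := newBufferPool(64, maxDatagram)
	out := make(chan packet, 64)
	go readLoop(conn, pool, out)

	client, err := net.Dial("udp", conn.LocalAddr().String())
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer client.Close()
	if _, err := client.Write([]byte("ping")); err != nil {
		fmt.Println("write:", err)
		return
	}

	p := <-out
	fmt.Printf("%d bytes from %v: %s\n", p.n, p.from, p.buf[:p.n])
	pool.put(p.buf)
	conn.Close() // unblocks ReadFrom so readLoop exits
}
```

A bounded pool also acts as backpressure: when consumers fall behind and no buffer is free, the reader stops pulling from the socket rather than allocating more memory.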

Topics

WebRTC Architecture
Low-Latency Voice AI
Relay Transceiver Model
Kubernetes Deployment
ICE Ufrag Routing

Code references

l7mp/stunner

Best for: AI Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.