How OpenAI delivers low-latency voice AI at scale
Summary
OpenAI has rearchitected its WebRTC stack to deliver low-latency voice AI at scale for over 900 million weekly active users. The redesign targets three requirements: global reach, fast connection setup, and a stable media round-trip time. The new "relay plus transceiver" architecture preserves standard WebRTC client behavior while optimizing internal packet routing. It addresses three constraints: media termination that needs one port per session fits poorly with Kubernetes, stateful ICE and DTLS sessions need a stable owner, and global routing needs a low-latency first hop. Instead of an SFU model, OpenAI adopted a transceiver model in which an edge service terminates client connections and converts media into simpler internal protocols for backend inference, which supports continuous audio streaming and conversational AI experiences.
Key takeaway
For AI engineers building real-time voice applications, a split relay-transceiver WebRTC architecture can improve scalability and reduce latency, especially on Kubernetes. The approach allows a small, fixed public UDP footprint and deterministic first-packet routing, both of which matter for conversational fluidity and global reach. Consider encoding routing metadata in protocol-native fields such as the ICE ufrag to simplify routing logic and improve performance.
Key insights
OpenAI's WebRTC rearchitecture uses a split relay-transceiver model for scalable, low-latency voice AI.
Principles
- Preserve protocol semantics at the edge.
- Keep hard session state in one place.
- Route on information already present in setup.
Method
The architecture splits packet routing (relay) from protocol termination (transceiver). The relay uses the ICE ufrag for first-packet routing to the owning transceiver, which handles all WebRTC session state.
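The routing step can be illustrated with a minimal Go sketch. It is not OpenAI's implementation. It relies on one fact from the ICE and STUN specifications: a client's first connectivity check is a STUN message whose USERNAME attribute has the form `serverUfrag:clientUfrag`. Everything else is an assumption made for illustration: the ufrag layout (the first six characters name the owning transceiver), the `owners` table, and the function names `serverUfrag` and `routeFirstPacket`.

```go
package main

import (
	"encoding/binary"
	"errors"
	"strings"
)

const (
	stunHeaderLen   = 20         // fixed STUN header size
	stunMagicCookie = 0x2112A442 // constant carried in bytes 4..8 of a STUN message
	attrUsername    = 0x0006     // USERNAME attribute type
	ownerPrefixLen  = 6          // hypothetical: first 6 ufrag chars identify the owner
)

var (
	errNotSTUN    = errors.New("not a well-formed STUN message")
	errNoUsername = errors.New("no usable USERNAME attribute")
	errNoOwner    = errors.New("no transceiver owns this ufrag")
)

// serverUfrag returns the part of the STUN USERNAME attribute before the
// colon, which is the ufrag chosen by the side receiving the request.
func serverUfrag(pkt []byte) (string, error) {
	if len(pkt) < stunHeaderLen || pkt[0]&0xC0 != 0 ||
		binary.BigEndian.Uint32(pkt[4:8]) != stunMagicCookie {
		return "", errNotSTUN
	}
	msgLen := int(binary.BigEndian.Uint16(pkt[2:4]))
	if stunHeaderLen+msgLen > len(pkt) {
		return "", errNotSTUN
	}
	attrs := pkt[stunHeaderLen : stunHeaderLen+msgLen]
	for len(attrs) >= 4 {
		typ := binary.BigEndian.Uint16(attrs[0:2])
		n := int(binary.BigEndian.Uint16(attrs[2:4]))
		if 4+n > len(attrs) {
			return "", errNotSTUN
		}
		if typ == attrUsername {
			ufrag, _, ok := strings.Cut(string(attrs[4:4+n]), ":")
			if !ok {
				return "", errNoUsername
			}
			return ufrag, nil
		}
		// Attribute values are padded to a multiple of 4 bytes.
		padded := (n + 3) &^ 3
		if 4+padded > len(attrs) {
			break
		}
		attrs = attrs[4+padded:]
	}
	return "", errNoUsername
}

// routeFirstPacket maps a client's first packet to the address of the
// transceiver that owns the session, using only data inside the packet.
// The owners table (prefix -> address) is a hypothetical stand-in for
// whatever lookup a real relay would use.
func routeFirstPacket(pkt []byte, owners map[string]string) (string, error) {
	ufrag, err := serverUfrag(pkt)
	if err != nil {
		return "", err
	}
	if len(ufrag) < ownerPrefixLen {
		return "", errNoOwner
	}
	addr, ok := owners[ufrag[:ownerPrefixLen]]
	if !ok {
		return "", errNoOwner
	}
	return addr, nil
}
```

Because the owner is derived from the packet itself, a relay built this way would need no per-session lookup before forwarding the first packet. After that it could cache the mapping from client address to transceiver for the media packets that follow, which carry no ufrag.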
In practice
- Use `SO_REUSEPORT` for multiple workers on one UDP port.
- Pin UDP-reading goroutines to OS threads for cache locality.
- Employ pre-allocated buffers to minimize GC overhead.
Topics
- WebRTC Architecture
- Low-Latency Voice AI
- Relay Transceiver Model
- Kubernetes Deployment
- ICE Ufrag Routing
Best for: AI Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.