How OpenAI delivers low-latency voice AI at scale
TL;DR Highlight
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Who Should Read
Backend/infrastructure developers aiming to add real-time voice/audio features to apps, or developers struggling with port management or routing issues while operating WebRTC in a Kubernetes environment.
Core Mechanics
- OpenAI chose WebRTC because it’s a standardized protocol already implemented in browsers, mobile devices, and servers, eliminating the need to implement low-level processing like ICE (NAT traversal), DTLS/SRTP (encrypted transmission), codec negotiation, RTCP (quality control), and echo cancellation/jitter buffering.
- The most crucial characteristic of voice AI is that audio arrives as a continuous stream, allowing the model to transcribe, infer, call tools, and generate speech while the user is still speaking. That overlap is the difference between a ‘conversational’ and a ‘push-to-talk’ feel.
- The traditional WebRTC server approach, the SFU (Selective Forwarding Unit), opens a separate port for each session. This ‘one port per session’ model was the core problem colliding with Kubernetes at OpenAI’s scale: stateful ICE/DTLS sessions must be pinned to specific nodes, making horizontal scaling difficult.
- To solve this, OpenAI designed a relay + transceiver split architecture, placing relays at the global edge to minimize first-hop latency to clients, while transceivers handle actual media processing and model connections within the internal infrastructure.
- Clients experience standard WebRTC behavior, but the underlying packet routing is completely different: relays use the ICE credentials carried in each packet to route it to the correct transceiver, keeping stateful sessions intact.
- Combining global relays with geo-steering (automatic routing based on user location) ensures that connections reach the nearest relay worldwide, which is critical for maintaining low latency at a scale of 900 million users.
- The implementation leveraged the open-source Go WebRTC library Pion (https://github.com/pion/webrtc), and Pion’s creator, Sean DuBois, has since joined OpenAI.
- Currently, the Realtime API’s voice models are limited to the GPT-4o family, meaning the model’s capabilities aren’t at the level of the latest frontier models despite the architectural improvements.
Evidence
- Pion library developers thanked OpenAI for publicly acknowledging its use and recommended "WebRTC for the Curious" (webrtcforthecurious.com) as an introductory resource.
- A veteran of a WebRTC + Kubernetes game-streaming product strongly disagreed, arguing that the problems OpenAI described were mostly issues with the libwebrtc implementation, and that proper feature-flag configuration could reduce latency without paid network workarounds.
- Users shared experiences where low latency itself created UX problems, with the system incorrectly interpreting pauses as turn endings.
- OpenAI mentioned its open-source voice AI pipeline framework pipecat (https://github.com/pipecat-ai/pipecat), which commenters recommended as a good starting point.
- Questions arose about whether OpenAI replaced LiveKit with a custom WebRTC stack; the architecture explanation itself implied a custom build.
How to Apply
- If you’re running WebRTC servers in Kubernetes and facing scale-out limitations due to the one-port-per-session problem, consider redesigning your architecture with a relay (edge, stateless) and transceiver (internal, stateful) split, routing based on ICE credentials.
- To quickly prototype real-time voice AI services, explore pipecat (https://github.com/pipecat-ai/pipecat) or Pion (https://github.com/pion/webrtc) before implementing a WebRTC stack from scratch; they let you start quickly without low-level protocol work.
- When implementing ‘end-of-turn detection’ logic for Voice AI, avoid relying solely on silence timers, as they can prematurely cut off users pausing to find a word; instead, make the silence threshold user-adjustable or design separate logic to distinguish mid-utterance pauses from turn endings.
- If you’re operating WebRTC based on libwebrtc, consider checking feature flag settings, as latency issues may be solvable through configuration before resorting to paid network solutions or complex infrastructure changes.
Terminology
Related Papers
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.
Show HN: GoModel – an open-source AI gateway in Go
GoModel unifies access to OpenAI, Anthropic, Gemini, and other AI providers through a single, OpenAI-compatible API, offering a compiled-language alternative to LiteLLM.
Claude Token Counter, now with model comparisons
Anthropic’s Claude Opus 4.7 consumes up to 46% more tokens than its predecessor on the same input due to a tokenizer change, effectively raising costs.
Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap
An agent optimization technique that achieves 74% of GPT-4o performance at only 23.9% of the cost by starting with an SLM (small language model) and switching to GPT-4 if failure is predicted.
€54k spike in 13h from unrestricted Firebase browser key accessing Gemini APIs
A real-world case in which an unrestricted API key enabled for Firebase AI Logic (Gemini API) was exploited in automated attacks, resulting in €54,000 in charges within 13 hours, and Google refused a refund. It serves as a warning about the dangers of exposing API keys on the client side.