How OpenAI delivers low-latency voice AI at scale
TL;DR Highlight
OpenAI redesigned its WebRTC stack to serve real-time voice AI to more than 900 million users. The write-up details the design decisions and trade-offs of a relay + transceiver split architecture.
Who Should Read
Backend/infrastructure developers aiming to add real-time voice/audio features to apps, or developers struggling with port management or routing issues while operating WebRTC in a Kubernetes environment.
Core Mechanics
- OpenAI chose WebRTC because it’s a standardized protocol already implemented in browsers, mobile devices, and servers, eliminating the need to implement low-level processing like ICE (NAT traversal), DTLS/SRTP (encrypted transmission), codec negotiation, RTCP (quality control), and echo cancellation/jitter buffering.
- The most crucial characteristic of voice AI is that audio arrives as a continuous stream, allowing the model to simultaneously transcribe, infer, call tools, and generate speech while the user is speaking – creating the difference between a ‘conversational’ and a ‘push-to-talk’ feel.
- The traditional WebRTC server approach, the SFU (Selective Forwarding Unit), opens a separate port for each session. At OpenAI’s scale this ‘one port per session’ model collided with Kubernetes: stateful ICE/DTLS sessions had to be pinned to specific nodes, which made horizontal scaling difficult.
- To solve this, OpenAI designed a relay + transceiver split architecture, placing relays at the global edge to minimize first-hop latency to clients, while transceivers handle actual media processing and model connections within the internal infrastructure.
- Clients see standard WebRTC behavior, but the underlying packet routing is entirely different: ICE credentials are used to route packets to the correct transceiver, so each stateful session stays where its state lives.
- Combining global relays with geo-steering (automatic routing based on user location) sends each connection to the nearest relay worldwide, which is critical for keeping latency low at a scale of 900 million users.
- The implementation leveraged the open-source Go WebRTC library Pion (https://github.com/pion/webrtc), and Pion’s creator, Sean DuBois, has since joined OpenAI.
- Currently, the Realtime API’s voice models are limited to the GPT-4o family, meaning the model’s capabilities aren’t at the level of the latest frontier models despite the architectural improvements.
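The routing-by-ICE-credentials idea above can be sketched roughly as follows. This is a minimal illustration, not OpenAI's implementation: it assumes the relay inspects the STUN Binding Request that ICE sends first, reads the USERNAME attribute (which has the form "<server ufrag>:<client ufrag>"), and looks up a transceiver from the server ufrag. The convention that the first four characters of the server ufrag name a transceiver, and the names build_binding_request, pick_transceiver, and the routes table, are all hypothetical.

```python
import struct

STUN_MAGIC_COOKIE = 0x2112A442
STUN_BINDING_REQUEST = 0x0001
ATTR_USERNAME = 0x0006
PREFIX_LEN = 4  # hypothetical: first 4 ufrag characters identify a transceiver


def extract_stun_username(packet: bytes):
    """Return the USERNAME of a STUN Binding Request, or None if absent."""
    if len(packet) < 20:
        return None
    msg_type, msg_len, cookie = struct.unpack("!HHI", packet[:8])
    if msg_type != STUN_BINDING_REQUEST or cookie != STUN_MAGIC_COOKIE:
        return None
    end = 20 + msg_len
    if msg_len % 4 != 0 or end > len(packet):
        return None
    offset = 20  # attributes start after the 20-byte header
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", packet[offset:offset + 4])
        value_start = offset + 4
        value_end = value_start + attr_len
        if value_end > end:
            return None
        if attr_type == ATTR_USERNAME:
            try:
                return packet[value_start:value_end].decode("utf-8")
            except UnicodeDecodeError:
                return None
        # Attribute values are padded to a 4-byte boundary.
        offset = value_start + ((attr_len + 3) & ~3)
    return None


def pick_transceiver(packet: bytes, routes: dict):
    """Map a packet to a transceiver address using the server-side ufrag."""
    username = extract_stun_username(packet)
    if username is None or ":" not in username:
        return None
    server_ufrag = username.split(":", 1)[0]
    return routes.get(server_ufrag[:PREFIX_LEN])


def build_binding_request(username: str) -> bytes:
    """Build a bare Binding Request carrying only USERNAME (for the demo)."""
    value = username.encode("utf-8")
    padding = b"\x00" * (-len(value) % 4)
    attrs = struct.pack("!HH", ATTR_USERNAME, len(value)) + value + padding
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, len(attrs), STUN_MAGIC_COOKIE)
    return header + b"\x00" * 12 + attrs  # 12-byte transaction ID


# Hypothetical routing table: ufrag prefix -> internal transceiver address.
routes = {"tx07": ("10.0.0.7", 5000)}
request = build_binding_request("tx07AbCd:cliU")
print(pick_transceiver(request, routes))
```

Only the first STUN check is handled here. How later DTLS and SRTP packets, which carry no USERNAME, are matched to the same transceiver is not described in this summary and is left out of the sketch.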
Evidence
- Pion library developers commented to thank OpenAI for publicly acknowledging its use, and recommended ‘WebRTC for the Curious’ (webrtcforthecurious.com) as an introductory WebRTC resource.
- A veteran of a WebRTC + Kubernetes game streaming product strongly disagreed, arguing that the problems OpenAI described were mostly issues with the libwebrtc implementation and that proper feature flag configuration could reduce latency without paid network workarounds.
- Users shared experiences where low latency itself created UX problems, with the system incorrectly interpreting pauses as turn endings.
- The open-source voice AI pipeline framework pipecat (https://github.com/pipecat-ai/pipecat) was mentioned, with comments recommending it as a good starting point.
- Questions arose about whether OpenAI had replaced LiveKit with a custom WebRTC stack; the architecture explanation itself implied a custom build.
How to Apply
- If you’re running WebRTC servers in Kubernetes and facing scale-out limitations due to the one-port-per-session problem, consider redesigning your architecture with a relay (edge, stateless) and transceiver (internal, stateful) split, routing based on ICE credentials.
- To prototype a real-time voice AI service quickly, explore pipecat (https://github.com/pipecat-ai/pipecat) or Pion (https://github.com/pion/webrtc) before implementing a WebRTC stack from scratch; both let you skip the low-level implementation work.
- When implementing ‘end-of-turn detection’ logic for Voice AI, avoid relying solely on silence timers, as they can prematurely cut off users pausing to find a word; instead, make the silence threshold user-adjustable or design separate logic to distinguish mid-utterance pauses from turn endings.
- If you’re operating WebRTC based on libwebrtc, consider checking feature flag settings, as latency issues may be solvable through configuration before resorting to paid network solutions or complex infrastructure changes.
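The end-of-turn advice above can be illustrated with a small sketch. It assumes a speech pipeline that already reports the current silence duration and a partial transcript; the class name TurnEndPolicy, the default thresholds, and the HESITATION_ENDINGS word list are hypothetical choices for illustration, not values from the article. The idea is simply to keep the silence threshold adjustable and to wait longer when the last word suggests the speaker is mid-thought.

```python
from dataclasses import dataclass

# Hypothetical list of trailing words that suggest the speaker is not finished.
HESITATION_ENDINGS = {
    "um", "uh", "er", "and", "but", "so", "because", "the", "a", "to", "like",
}


@dataclass
class TurnEndPolicy:
    # Both thresholds are adjustable, e.g. per user or per application.
    base_silence_ms: int = 700        # normal pause before ending the turn
    hesitant_silence_ms: int = 2000   # longer pause allowed after a hesitation

    def is_turn_over(self, silence_ms: int, partial_transcript: str) -> bool:
        """Decide whether the current silence should end the user's turn."""
        words = partial_transcript.lower().split()
        if not words:
            # Nothing has been said yet, so there is no turn to end.
            return False
        last_word = words[-1].strip(".,!?;:")
        if last_word in HESITATION_ENDINGS:
            threshold = self.hesitant_silence_ms
        else:
            threshold = self.base_silence_ms
        return silence_ms >= threshold


policy = TurnEndPolicy()
print(policy.is_turn_over(800, "What's the weather in Paris?"))
print(policy.is_turn_over(800, "I'd like to book a flight to, um"))
```

A word list is a crude signal; a production system would more likely combine prosody or a learned classifier with the timer. The sketch only shows where such logic plugs in relative to the silence threshold.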
Related Papers
AI Compute Extensions (ACE) Specification
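The x86 Ecosystem Advisory Group has published ACE, a new x86 instruction extension specification that accelerates matrix multiplication and low-precision data formats at the hardware level. Because it is a change at the ISA (instruction set architecture) level aimed at running ML workloads more efficiently on CPUs, it could affect future AI inference environments.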
Show HN: High-Res Neural Cellular Automata
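A technique, jointly developed by EPFL and Google Research, for scaling Neural Cellular Automata (NCA) to high resolution. This SIGGRAPH 2026 paper overcomes the resolution limits of existing NCAs with a lightweight neural network decoder.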
Claude: Elevated errors across many models [resolved]
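An incident report covering roughly two hours on June 16, 2026, during which error rates of around 10% occurred across Claude's Sonnet, Opus, and Haiku models. The event prompts operators of services that depend on the Claude API to rethink incident response and reliability.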
Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?
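A Hacker News thread in which developers who have fully replaced Claude/GPT with local LLMs share their actual setups and performance experiences. Centered on Qwen3.6 35B, it covers specific hardware, speeds, and limitations, making it a realistic reference for developers considering local AI coding.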
Show HN: Script to bulk delete Claude chats from the web UI
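A script that works around the limitation that claude.ai's 'Select all' button selects only the items visible on screen by calling the internal API directly, so that all conversations can be deleted at once.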
DiffusionGemma: 4x Faster Text Generation
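Google has released DiffusionGemma, an open experimental model that achieves up to 4x faster inference by using a diffusion approach that generates 256-token blocks at once instead of generating tokens sequentially as conventional LLMs do. It is distributed under the Apache 2.0 license and can run on consumer GPUs, opening new possibilities for edge devices and real-time interactive workflows.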