Why WebRTC Is the Problem for Voice AI (and What Comes Next)
WebRTC is the de facto standard for real-time voice, but as OpenAI and others scale voice AI, its complexity and transport limitations are becoming a serious bottleneck—here's why and what comes next.
If you've ever tried to build a voice AI app—especially one that needs low-latency, streaming audio—you've almost certainly run into WebRTC. It's the protocol stack that powers video calls, browser-based chat, and increasingly, the real-time APIs from OpenAI and others. But a growing number of developers argue that WebRTC is itself the problem. A recent Hacker News post titled "OpenAI's WebRTC problem" (discussing the article linked below) sparked a heated debate.
Why WebRTC's Transport Layer Hurts Voice AI
The original article argues that WebRTC is fundamentally ill-suited for cloud-based voice AI. The author points out that WebRTC's transport layer—built around peer-to-peer assumptions, SDP negotiation, and complex handshakes—adds latency and overhead that torpedo the user experience in voice AI scenarios. Instead, they advocate for simpler, more modern alternatives like WebTransport and WebCodecs, which offer lower overhead and better integration with cloud architectures.
Why WebRTC's Complexity Frustrates Voice AI Developers
The community response is split. Many developers who have implemented WebRTC share the pain. One commenter wrote:
"There are few protocols I hate implementing more than WebRTC. Getting a simple client going means you need to quickly acclimate to SDP, TURN/STUN, ice-candidates, offers, peer-to-peer protocols, and the complex handshake that is implemented from scratch each time."
Others, however, push back. Another commenter noted:
"I run the gemini live api over a mesh hosted managed webrtc cloud. works fantastic... when you speak with people running voice agents at scale in this space, many of the issues are solved with webRTC and pipecat."
The Latency Tradeoff: WebRTC vs. WebTransport
I've built my share of real-time voice systems, and I have mixed feelings. WebRTC is undeniably a beast. The handshake is a nightmare of SDP munging and ICE trickling. But the alternative—stitching together raw WebCodecs with a custom transport—is even harder. The real issue isn't WebRTC per se; it's that WebRTC was designed for peer-to-peer video calls, not cloud-based AI inference where the "peer" is a data center.
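For contrast, here's roughly what that handshake looks like in a minimal browser client (a sketch; signalingChannel stands in for whatever signaling transport you provide, since WebRTC deliberately leaves signaling unspecified):
// Even a minimal WebRTC client must drive the offer/answer and ICE
// machinery over its own signaling channel before any audio flows.
const pc = new RTCPeerConnection({ iceServers: [{ urls: 'stun:stun.example.com' }] });
const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
mic.getTracks().forEach((track) => pc.addTrack(track, mic));
pc.onicecandidate = ({ candidate }) => {
  if (candidate) signalingChannel.send(JSON.stringify({ candidate })); // trickle ICE
};
await pc.setLocalDescription(await pc.createOffer());
signalingChannel.send(JSON.stringify({ sdp: pc.localDescription }));
// ...then await the answer, call setRemoteDescription, and feed in remote candidates...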
What we're seeing is the classic tension between a mature, battle-tested protocol and the needs of a new application domain. WebRTC's audio DSP pipeline is excellent. Its NAT traversal is unmatched. But its transport assumptions (bidirectional symmetric flows, bundling of media and control) don't match the pattern of a voice AI client sending audio up to a server and receiving responses back.
Consider the scenario: a user speaks, the audio is streamed to a server, the server runs inference (which might have variable latency), and then streams the response back. WebRTC forces you to maintain a real-time channel even when no data is flowing, and its built-in jitter buffers and pacing add unnecessary delay. As one commenter put it:
"This is the opposite of the feedback I get. Users want instant responses... If you have delay in generating responses/interruptions it kills the magic."
The irony is that WebRTC's reliability features—designed to prevent glitches in video calls—actually harm the voice AI UX.
What Voice AI Builders Should Do Now
If you're building a voice AI agent today, you have a choice:
- Stick with WebRTC and leverage libraries like Pipecat that abstract away the complexity. You get battle-tested audio processing and broad device support, but you'll fight the transport layer.
- Go lower-level with WebTransport and WebCodecs. This gives you full control over the transport—you can use WebTransport datagrams for unreliable media delivery and reliable streams for control—but you lose WebRTC's built-in audio DSP (echo cancellation, noise suppression, VAD). You'll need to implement or integrate those separately (e.g., using RNNoise or a cloud service).
A hybrid approach is emerging: use WebRTC for the audio processing but replace its transport with a WebTransport-based channel. That's what some large-scale providers are doing internally, though few talk about it publicly.
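One practical note on that hybrid: much of WebRTC's capture-side DSP is exposed through getUserMedia constraints, so you can keep it even if you drop RTCPeerConnection as the transport (support and quality vary by browser):
// Request the browser's echo cancellation, noise suppression, and AGC at
// capture time; the resulting track can feed any transport you like.
const media = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true, autoGainControl: true },
});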
Here's a minimal sketch of a WebTransport-based audio send on the browser side (the https://example.com/audio endpoint is hypothetical, and MediaStreamTrackProcessor and AudioEncoder are currently Chromium-only):
// Open a WebTransport session and one unidirectional uplink stream.
const transport = new WebTransport('https://example.com/audio');
await transport.ready;
const writer = (await transport.createUnidirectionalStream()).getWriter();
// Capture the microphone and encode to Opus with WebCodecs; each encoded
// chunk is written to the wire as soon as the encoder emits it.
const media = await navigator.mediaDevices.getUserMedia({ audio: true });
const encoder = new AudioEncoder({
  output: (chunk) => { const buf = new Uint8Array(chunk.byteLength); chunk.copyTo(buf); writer.write(buf); },
  error: (e) => console.error(e),
});
encoder.configure({ codec: 'opus', sampleRate: 48000, numberOfChannels: 1 });
const frames = new MediaStreamTrackProcessor({ track: media.getAudioTracks()[0] }).readable.getReader();
for (let f = await frames.read(); !f.done; f = await frames.read()) { encoder.encode(f.value); f.value.close(); }
The server receives each chunk immediately, without waiting for jitter buffers or ICE negotiation. For inference responses, you can use a bidirectional stream with reliable delivery for control frames and unreliable for audio playback.
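Here's a sketch of that downlink, assuming a server that sends JSON control frames on a reliable stream and encoded audio in datagrams (playbackQueue is a hypothetical jitter-tolerant playback buffer):
// Reliable bidirectional stream for control frames (start, stop, interrupt).
const control = await transport.createBidirectionalStream();
const controlReader = control.readable.pipeThrough(new TextDecoderStream()).getReader();
// ...parse JSON control frames from controlReader as they arrive...
// Unreliable datagrams carry the synthesized audio itself.
const datagrams = transport.datagrams.readable.getReader();
while (true) {
  const { value, done } = await datagrams.read();
  if (done) break;
  playbackQueue.enqueue(value); // hypothetical buffer feeding an AudioDecoder
}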
Another promising path is plain HTTP streaming: keep a long-lived HTTP/2 or HTTP/3 connection open and stream audio over it in both directions. One commenter mentioned that for Alexa, the device kept a persistent connection open over a custom HTTP-like protocol:
"For Alexa, the device established a connection back to the server and then kept that open, sending basically HTTP2/SPDY/Something like it over the wire... This allowed the STT start processing before you finish talking."
That pattern—keep a long-lived HTTP connection and use it for bidirectional streaming—is essentially what WebTransport formalizes.
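In today's browsers, the closest equivalent is a fetch with a streaming request body (a sketch; the /stt endpoint is hypothetical, and streaming request bodies with duplex: 'half' are currently Chromium-only):
// The request body is a stream we keep writing into, so server-side STT
// can start processing before the user finishes talking.
const { readable, writable } = new TransformStream();
const responsePromise = fetch('https://example.com/stt', {
  method: 'POST',
  body: readable,
  duplex: 'half', // required when the request body is a stream
  headers: { 'Content-Type': 'application/octet-stream' },
});
const uplink = writable.getWriter();
// ...write encoded audio chunks into uplink as they are produced...
// The response body streams back too, so playback can start immediately.
const downlink = (await responsePromise).body.getReader();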
Is WebTransport the Future for Voice AI?
If you're building voice AI applications that need sub-500ms end-to-end latency, yes. WebRTC's overhead will hurt you. If you're prototyping or targeting browser consumers only, WebRTC is still your safest bet for ubiquity. But if you're operating at scale (thousands of concurrent users) or need to optimize every millisecond of latency, it's time to explore WebTransport and WebCodecs. The ecosystem is still young, but the direction is clear: the future of real-time voice AI will not be built on WebRTC's transport layer.
Links: HN discussion, Original article, WebTransport spec, WebCodecs spec, Pipecat, MDN WebRTC guide, RNNoise.